Apple engineers have developed an AI system that resolves references to on-screen entities and conversational context. The lightweight model could be an excellent solution for on-device virtual assistants.

People are good at resolving references in conversations with each other. When we use terms like “the bottom one” or “he,” we understand what the person is referring to based on the context of the conversation and the things we can see.

This is far more difficult for an AI model. Multimodal LLMs like GPT-4 are good at answering questions about images, but they are expensive to train and require a lot of computational effort to process each query about an image.

Apple engineers took a unique approach with their system called ReALM (Reference Resolution As Language Modeling). The paper is worth a read to learn more about the development and testing process.

ReALM uses an LLM to process the conversational, on-screen, and background entities (such as alarms or background music) that make up a user’s interactions with an AI virtual agent.
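As a loose illustration of those three entity categories, here is a minimal sketch; the class names and fields below are assumptions for illustration, not the paper’s actual schema:

```python
from dataclasses import dataclass
from enum import Enum


class EntityType(Enum):
    CONVERSATIONAL = "conversational"  # mentioned earlier in the dialogue
    ON_SCREEN = "on_screen"            # currently visible on the display
    BACKGROUND = "background"          # active but not visible, e.g. an alarm or music


@dataclass
class Entity:
    entity_type: EntityType
    label: str  # e.g. "pharmacy phone number"
    value: str  # e.g. "(555) 010-2233"


# The set of candidate entities an assistant might track at a given moment
candidates = [
    Entity(EntityType.CONVERSATIONAL, "pharmacy phone number", "(555) 010-2233"),
    Entity(EntityType.ON_SCREEN, "address", "123 Main St"),
    Entity(EntityType.BACKGROUND, "alarm", "7:00 AM"),
]
```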

Here is an example of the sort of interaction a user might have with an AI agent.

Examples of a user’s interactions with a virtual assistant. Source: arXiv

The agent must understand conversational cues, such as the fact that when the user says “that one,” they’re referring to the pharmacy’s phone number.

It also needs to understand the visual context when the user says “the bottom one,” and this is where ReALM’s approach differs from models like GPT-4.

ReALM relies on upstream encoders to first parse the elements on the screen and their positions. ReALM then reconstructs the screen as a purely textual representation, ordered from left to right and top to bottom.

In simple terms, it uses natural language to summarize the user’s screen.

Now, when a user asks a question about something on the screen, the language model processes the textual description of the screen instead of having to use a vision model to process the screen image.
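A rough sketch of that idea might look like the following. This is not Apple’s implementation; the element format, ordering logic, and prompt wording are all assumptions made for illustration.

```python
def screen_to_text(elements):
    """Flatten screen elements into plain text, ordered top-to-bottom, left-to-right.

    Each element is a dict: {"text": "...", "box": (left, top, right, bottom)}.
    This is a simplified stand-in for the upstream encoders described above.
    """
    ordered = sorted(elements, key=lambda e: (e["box"][1], e["box"][0]))
    # Index each element so the model can refer back to it unambiguously.
    return "\n".join(f"[{i}] {e['text']}" for i, e in enumerate(ordered))


# A toy screen listing two businesses, as in the pharmacy example above
screen_elements = [
    {"text": "Main Street Pharmacy: (555) 010-2233", "box": (20, 80, 380, 110)},
    {"text": "Riverside Pharmacy: (555) 010-4455", "box": (20, 140, 380, 170)},
]

# The text-only prompt a language model would see instead of a screenshot
prompt = (
    "Screen contents:\n"
    + screen_to_text(screen_elements)
    + "\n\nUser: Call the bottom one.\n"
    + "Which screen element is the user referring to? Answer with its index."
)
print(prompt)
```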

The researchers created synthetic datasets of conversational, screen, and background entities, then tested ReALM against other models to compare their effectiveness at resolving references in conversational systems.

The smaller version of ReALM (80 million parameters) performed comparably to GPT-4, and its larger version (3 billion parameters) significantly outperformed GPT-4.

ReALM is a tiny model compared with GPT-4. Its strong reference resolution makes it an excellent choice for a virtual assistant that can live on-device without sacrificing performance.

ReALM doesn’t perform as well with more complex images or nuanced user requests, but it could work well as a virtual assistant in the car or on a device. Imagine if Siri could “see” your iPhone screen and respond to references to elements on the screen.

Apple has been a little slow out of the starting blocks, but recent developments like the MM1 model and ReALM show that a lot is happening behind closed doors.

This article was originally published at dailyai.com