When a human-AI conversation involves many rounds of continuous dialogue, the powerful large language machine-learning models that drive chatbots like ChatGPT sometimes start to break down, causing the bots' performance to rapidly deteriorate.

A team of researchers from MIT and elsewhere has identified a surprising cause of this problem and developed a simple solution that enables a chatbot to maintain a nonstop conversation without crashing or slowing down.

Their method involves a tweak to the key-value cache (which acts like a conversation memory) at the core of many large language models. When this cache needs to hold more information than it has capacity for, some methods evict the first pieces of data, which can cause the model to fail.

By ensuring that these first few data points remain in memory, the researchers' method allows a chatbot to keep chatting no matter how long the conversation runs.

The method, called StreamingLLM, enables a model to remain efficient even when a conversation stretches on for more than 4 million words. Compared to another method that avoids crashing by constantly recomputing parts of past conversations, StreamingLLM performed more than 22 times faster.

This could allow a chatbot to conduct long conversations throughout the workday without needing to be continually restarted, enabling efficient AI assistants for tasks such as copywriting, editing, or generating code.

“Now, with this method, we can persistently deploy these large language models. By making a chatbot that we can always chat with, and that can always respond to us based on our recent conversations, we could use these chatbots in some new applications,” says Guangxuan Xiao, a doctoral student in electrical engineering and computer science (EECS) and lead author of a paper on StreamingLLM.

Xiao’s co-authors include his advisor, Song Han, an associate professor of EECS, a member of the MIT-IBM Watson AI Lab, and a distinguished scientist at NVIDIA; as well as Yuandong Tian, a research scientist at Meta AI; Beidi Chen, an assistant professor at Carnegie Mellon University; and senior author Mike Lewis, a research scientist at Meta AI. The work will be presented at the International Conference on Learning Representations.

A mysterious phenomenon

Large language models encode data, such as the words in a user query, into representations called tokens. Many models employ what is known as an attention mechanism, which uses these tokens to generate new text.

Typically, an AI chatbot writes new text based on the text it has just seen, so it stores recent tokens in memory, called a KV cache, to use later. The attention mechanism builds a grid over all the tokens in the cache, an “attention map” that captures how strongly each token, or word, relates to every other token.

Understanding these relationships is one feature that enables large language models to generate human-like text.
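To make the mechanics concrete, here is a minimal NumPy sketch of attention over a small KV cache; the sizes and random values are illustrative and are not drawn from the paper or any particular model.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative sizes: 6 cached tokens, an 8-dimensional attention head.
rng = np.random.default_rng(0)
cached_keys = rng.standard_normal((6, 8))    # K: one row per cached token
cached_values = rng.standard_normal((6, 8))  # V: one row per cached token
query = rng.standard_normal((1, 8))          # the token being generated now

# The "attention map": how strongly the new token relates to each cached token.
scores = query @ cached_keys.T / np.sqrt(cached_keys.shape[1])
attention = softmax(scores)                   # each row sums to 1

# The output mixes the cached values according to those attention weights.
output = attention @ cached_values
print(attention.round(3))
```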

But when the cache becomes very large, the attention map can grow even more massive, which slows down computation.

In addition, if encoding content requires more tokens than the cache can hold, the model’s performance drops. For instance, one popular model can store 4,096 tokens, yet an academic paper contains around 10,000 tokens.

To get around these problems, researchers employ a “sliding cache” that bumps out the oldest tokens to make room for new ones. However, the model’s performance often plummets as soon as that first token is evicted, rapidly reducing the quality of newly generated words.
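As an illustration of that eviction policy (not code from the paper), a naive sliding cache can be modeled as a fixed-capacity window that silently drops the oldest token, the first one included:

```python
from collections import deque

class SlidingKVCache:
    """Fixed-capacity cache that always evicts the oldest token."""

    def __init__(self, capacity):
        self.tokens = deque(maxlen=capacity)  # deque drops the oldest item automatically

    def add(self, token_id):
        self.tokens.append(token_id)

cache = SlidingKVCache(capacity=4)
for t in range(6):
    cache.add(t)

# The earliest tokens (0 and 1) are gone -- exactly the situation in which
# the model's output quality was observed to collapse.
print(list(cache.tokens))  # [2, 3, 4, 5]
```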

In the new paper, the researchers found that if the first token remains in the sliding cache, the model maintains its performance even when the cache’s capacity is exceeded.

But that didn’t seem to make sense. The first word in a novel likely has nothing to do with the last word, so why should the first word be so important for generating the newest one?

In the new paper, the researchers also uncovered the cause of this phenomenon.

Attention sinks

Some models use a softmax operation in their attention mechanism, which assigns each token a score indicating how strongly it relates to every other token. The softmax operation requires all attention scores to sum to 1. Since most tokens are not strongly related to one another, their attention scores are very low, and the model dumps any leftover attention score into the first token.

The researchers call this first token an “attention sink.”

“We need an attention sink, and the model decides to use the first token as the attention sink because it is globally visible – every other token can see it. We found that we must always keep the attention sink in the cache to maintain the model dynamics,” says Han.
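A small numerical example, with made-up scores, shows the mechanics: softmax forces the attention weights to sum to 1 no matter how weakly related the tokens are, and a trained model that routes the surplus toward the globally visible first token reproduces the sink pattern.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Illustrative attention logits for one query over 6 cached tokens.
# The query is only weakly related to the recent tokens (low scores),
# yet softmax still has to produce weights that sum to 1.
weak_scores = np.array([0.1, 0.2, 0.1, 0.0, 0.2, 0.1])
print(softmax(weak_scores).sum().round(3))  # 1.0 -- the mass has to go somewhere

# In trained models, that surplus mass ends up routed to the first token,
# which every later token can attend to. Giving it a larger learned score
# reproduces the "attention sink" pattern.
sink_scores = np.array([3.0, 0.2, 0.1, 0.0, 0.2, 0.1])
print(softmax(sink_scores).round(3))
# ~[0.78, 0.05, 0.04, 0.04, 0.05, 0.04] -- the first token soaks up the attention
```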

In building StreamingLLM, the researchers found that keeping four attention sink tokens at the beginning of the sliding cache leads to optimal performance.

They also found that the positional encoding of each token must stay the same, even as new tokens are added and others are evicted. If token 5 is bumped out, token 6 must remain encoded as 6, even though it is now the fifth token in the cache.

By combining these two ideas, they enabled StreamingLLM to maintain a continuous conversation while outperforming a popular method that relies on recomputation.
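The sketch below is a simplified illustration of that cache policy as described in this article, not the authors' implementation: the first few sink tokens are never evicted, the rest of the cache slides, and each surviving token keeps its original position id.

```python
class SinkedSlidingCache:
    """Simplified sketch of the described policy: keep the first few
    "attention sink" tokens forever, slide a window over the rest, and
    leave each surviving token's position id unchanged."""

    def __init__(self, capacity, num_sinks=4):
        self.capacity = capacity      # total cache slots
        self.num_sinks = num_sinks    # leading tokens that are never evicted
        self.entries = []             # list of (position_id, token_id)
        self.next_position = 0

    def add(self, token_id):
        self.entries.append((self.next_position, token_id))
        self.next_position += 1
        if len(self.entries) > self.capacity:
            # Evict the oldest non-sink entry; the sink tokens stay put.
            del self.entries[self.num_sinks]

cache = SinkedSlidingCache(capacity=8, num_sinks=4)
for token_id in range(12):
    cache.add(token_id)

# The four sink tokens (positions 0-3) are still present, and the
# remaining tokens keep their original position ids (8-11).
print(cache.entries)
```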

For instance, when the cache holds 256 tokens, the recomputation method takes 63 milliseconds to decode a new token, while StreamingLLM takes 31 milliseconds. But if the cache grows to 4,096 tokens, recomputation requires 1,411 milliseconds for a new token, while StreamingLLM needs just 65 milliseconds.

“The innovative approach of StreamingLLM, centered around the attention sink mechanism, ensures stable memory usage and performance even when processing texts up to 4 million tokens in length,” says Yang You, a presidential young professor of computer science at the National University of Singapore, who was not involved with this work. “This capability is not just impressive; it is transformative, enabling StreamingLLM to be applied across a wide array of AI applications. The performance and versatility of StreamingLLM mark it as a highly promising technology, poised to revolutionize how we approach AI-driven generation applications.”

Tianqi Chen, an assistant professor in the machine learning and computer science departments at Carnegie Mellon University who also was not involved with this research, agreed, saying, “StreamingLLM enables the smooth extension of conversation length of large language models. We have been using it to enable the deployment of Mistral models on iPhones with great success.”

The researchers also explored the use of attention sinks during model training by prepending several placeholder tokens to all training samples.

They found that training with attention sinks allowed a model to maintain performance with only one attention sink in its cache, rather than the four usually required to stabilize a pretrained model’s performance.
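As a rough sketch of that training-time idea, one could reserve a placeholder id in the vocabulary and prepend it to every training sequence; the token ids below are made up for illustration.

```python
# Hypothetical placeholder id reserved in the vocabulary as a learnable sink token.
SINK_TOKEN_ID = 0

def prepend_sink(token_ids, num_sinks=1):
    """Prepend dedicated sink placeholder(s) to a training sequence."""
    return [SINK_TOKEN_ID] * num_sinks + list(token_ids)

# Made-up training sequences of token ids.
batch = [[17, 42, 7, 99], [5, 23, 88]]
batch = [prepend_sink(seq) for seq in batch]
print(batch)  # [[0, 17, 42, 7, 99], [0, 5, 23, 88]]
```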

However, while StreamingLLM lets a model carry on a continuous conversation, the model cannot remember words that are no longer stored in the cache. In the future, the researchers plan to address this limitation by investigating methods to retrieve evicted tokens or enable the model to memorize previous conversations.

StreamingLLM has been incorporated into NVIDIA’s large language model optimization library, TensorRT-LLM.

This work is funded, in part, by the MIT-IBM Watson AI Lab, the MIT Science Hub, and the U.S. National Science Foundation.

This article was originally published at news.mit.edu