Google researchers have developed a method called Infini-attention that permits LLMs to process infinitely long texts without increasing computational and storage requirements.

The Transformer architecture of an LLM allows it to attend to all tokens in a prompt. The dot-product and matrix multiplications it performs scale quadratically with sequence length.

This means that doubling the number of tokens in a prompt quadruples the memory and processing power required. This is why it's so difficult to build LLMs with large context windows without skyrocketing memory and compute requirements.
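A tiny NumPy sketch makes the quadratic scaling concrete: the attention score matrix holds one entry per query-key pair, so its size grows with the square of the sequence length. (This is illustrative only; `d_model` and the random inputs are arbitrary.)

```python
import numpy as np

def attention_scores(seq_len: int, d_model: int = 64) -> np.ndarray:
    """Compute the full seq_len x seq_len attention score matrix
    for random queries and keys (illustrative only)."""
    rng = np.random.default_rng(0)
    q = rng.standard_normal((seq_len, d_model))
    k = rng.standard_normal((seq_len, d_model))
    return q @ k.T / np.sqrt(d_model)

# The score matrix grows quadratically with sequence length:
# doubling the tokens quadruples the number of entries.
print(attention_scores(512).size)   # 262144
print(attention_scores(1024).size)  # 1048576 (4x more)
```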

In a "standard" LLM, information at the beginning of the prompt is lost as soon as the prompt grows larger than the context window. Google's research paper explains how Infini-attention can retain data beyond the context window.

How does Infini-attention work?

Infini-attention combines compressive memory techniques with modified attention mechanisms so that relevant older information is not lost.

Once the prompt extends beyond the context length of the model, compressive memory stores information in a compressed format instead of discarding it.

This allows older, less immediately relevant information to be stored without the storage and computing requirements growing indefinitely as the input grows.

Instead of attempting to retain all older input information, Infini-attention's compressive memory weights and summarizes information deemed relevant and worth keeping.

Infini-attention then uses a "vanilla" attention mechanism, but reuses the key-value (KV) states from earlier segments in the model rather than discarding them.
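The mechanism above can be sketched as a fixed-size associative memory in the style of linear attention: each segment's key-value pairs are folded into a constant-size matrix, and later queries read from it. This is a minimal sketch under stated assumptions; the `elu_plus_one` nonlinearity follows the linear-attention literature the paper builds on, and the paper's full method adds details (a delta-rule update option and a learned gate mixing local and memory attention) that are omitted here.

```python
import numpy as np

def elu_plus_one(x: np.ndarray) -> np.ndarray:
    # Positive feature map (assumption: ELU + 1, common in linear attention).
    return np.where(x > 0, x + 1.0, np.exp(x))

class CompressiveMemory:
    """Minimal sketch of a compressive memory: a fixed-size state
    updated with each segment's KV pairs instead of caching them all."""
    def __init__(self, d_key: int, d_value: int):
        self.M = np.zeros((d_key, d_value))  # memory matrix (constant size)
        self.z = np.zeros(d_key)             # normalization term

    def update(self, K: np.ndarray, V: np.ndarray) -> None:
        # Fold one segment's keys/values into the memory.
        sK = elu_plus_one(K)
        self.M += sK.T @ V
        self.z += sK.sum(axis=0)

    def retrieve(self, Q: np.ndarray) -> np.ndarray:
        # Read out memory content relevant to the current queries.
        sQ = elu_plus_one(Q)
        denom = sQ @ self.z + 1e-6
        return (sQ @ self.M) / denom[:, None]
```

Note that `M` keeps the same shape no matter how many segments are folded in, which is what bounds the memory cost for arbitrarily long inputs.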

Here is a diagram showing the difference between Infini-attention and another extended-context model, Transformer-XL.

Infini-Transformer (top) has full context history, while Transformer-XL (bottom) discards old contexts since the KV states are only cached for the last segment. Source: arXiv

The result is an LLM that attends to the current input locally, but also incorporates continually distilled, compressed historical data to which it can apply long-term attention.

The paper states: "This subtle but crucial modification to the attention layer enables LLMs to process infinitely long contexts with bounded memory and computation resources."

How good is it?

Google conducted benchmarking tests on smaller Infini-attention models with 1B and 8B parameters. These were compared with other extended-context models such as Transformer-XL and Memorizing Transformers.

The Infini-Transformer achieved significantly lower perplexity scores than the other models when processing long-context content. A lower perplexity value means the model is more confident in its output predictions.
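For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood the model assigns to the actual next tokens, so higher confidence in the true tokens means a lower score:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-likelihood
    the model assigns to the actual next tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that is more confident about the true tokens scores lower:
print(perplexity([0.5, 0.5, 0.5]))  # 2.0
print(perplexity([0.1, 0.1, 0.1]))  # ~10.0
```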

In the "passkey retrieval" tests, the Infini-attention models consistently found the random number hidden in inputs of up to 1 million tokens.

Other models often manage to retrieve the passkey when it is near the end of the input, but struggle to find it in the middle or at the beginning of a long piece of content. Infini-attention had no problems with this test.
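A passkey test of this kind is easy to construct: bury a random number in a long haystack of filler text at a chosen position, then ask the model to recall it. The sketch below is illustrative; the exact prompt wording and filler used in the paper's evaluation may differ.

```python
import random

def make_passkey_prompt(n_filler: int, position: float, seed: int = 0):
    """Build a long filler haystack with a random passkey hidden at a
    relative position (0.0 = start, 0.5 = middle, 1.0 = end)."""
    rng = random.Random(seed)
    passkey = rng.randint(10000, 99999)
    filler = ["The grass is green. The sky is blue."] * n_filler
    idx = int(position * n_filler)
    filler.insert(idx, f"The pass key is {passkey}. Remember it.")
    prompt = " ".join(filler) + " What is the pass key?"
    return prompt, passkey

# Hide the passkey at the start, middle, and end of the haystack,
# then check whether the model can repeat it back.
for pos in (0.0, 0.5, 1.0):
    prompt, key = make_passkey_prompt(n_filler=1000, position=pos)
    assert str(key) in prompt
```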

The benchmarking tests are very technical, but the short version is that Infini-attention outperformed the baseline models at summarizing and handling long sequences while maintaining context over longer stretches of input.

Significantly, this superior retention capability was achieved while requiring 114 times less memory.

The benchmark results suggest that Infini-attention could be scaled to handle extremely long input sequences while keeping memory and computing resources bounded.

The plug-and-play nature of Infini-attention means it can be used for continual pre-training and fine-tuning of existing Transformer models. This could effectively expand their context windows without requiring a complete retraining of the model.

Context windows will continue to grow, but this approach shows that an efficient memory could be a better solution than a bigger library.
