StreamingLLM keeps AI models running smoothly indefinitely
Text-to-text large language models (LLMs) such as OpenAI’s ChatGPT, Meta’s Llama 2, and Anthropic’s Claude 2 have been at the center of the current AI gold rush in Silicon Valley and the wider enterprise tech world — but by and large, all of them share some of the same issues.
One of these issues is maintaining consistently high-quality performance over time during a single conversation with a user — where the LLM provides responses that are as helpful, fast, and relevant in the middle of the conversation, and at the very end, as it does at the beginning, no matter how long that conversation lasts or how many exchanges of dialog it encompasses. This is difficult because LLMs are pre-trained on blocks of data, or sequences, of fixed lengths — 4,000 tokens in the case of Llama 2 and many other leading LLMs.
Once a user inputs more tokens than this — even if they are doing so across multiple different prompts — the LLM begins to suffer reduced performance, that is, worse quality responses. This is not acceptable for enterprises looking to have LLMs helping customers or employees in an open-ended fashion.
A new paper published recently by researchers at Meta, the Massachusetts Institute of Technology (MIT), and Carnegie Mellon University (CMU), finds that there is a simple way to help LLMs maintain their performance even for indefinitely long conversations, where the user’s prompts collectively add up to be longer than what the LLM was trained to handle at once.
Their work, a new framework for training and deploying LLM inferences dubbed “StreamingLLM,” reveals a number of important findings for other AI researchers and enterprises looking to use LLMs to aid with their business.
The problem StreamingLLM seeks to solve
As anyone who has interacted with a human customer support specialist or even an internal IT tech at your employer knows, it can often take a lengthy conversation and multiple messages exchanged between you and your assigned helper to solve the problem at hand.
But no matter whether you’re a customer or an employee — you want the person assigned to help you to be consistently responsive, informed, and helpful in their communications with you throughout your entire exchange. It can be very frustrating and counterproductive if suddenly, deep into the conversation where you’ve already spent time and energy explaining your issue, your helper begins responding with one-word answers, more slowly, or without giving you the information you need.
Although this can be an issue with some people who are distracted, unmotivated, or exhausted by the conversation, it is endemic for LLMs: their performance suffers once a conversation goes beyond the length of the “context window,” the maximum number of tokens the LLM can process at once, which was fixed when it was pre-trained. This is true even though most LLMs are designed to handle open-ended conversations that may go on for many exchanges.
Even if each of those exchanges fits within the context window of an LLM — and all of them should, as most LLMs cap the amount of text you can enter in a single message — the cumulative sum of multiple messages in a single conversation adds up to a number of tokens larger than those included in the LLM’s initial pre-training context window, which causes the LLM’s performance after this point to suffer.
It would be as though, when talking to a human customer support agent, once you had said a certain number of words to them across a few sentences — some limit unknown to you — they abruptly became less capable and less attentive.
The researchers behind the StreamingLLM framework summarize the problem in their paper as follows: “For example, an ideal ChatBot assistant can stably work over the content of recent day-long conversations. However, it is very challenging for LLM to generalize to longer sequence lengths than they have been pre-trained on.”
While it is possible to expand the length of the token sequences used in pre-training LLMs — and a number of researchers have already done this — it is not possible to anticipate how long a unique conversation with a given user will last.
So, how do you get an LLM with a fixed context-window length used in pre-training — however long that is — to be able to retain its performance once that length has been eclipsed over multiple messages?
The solution the researchers developed
The researchers developed an innovative solution for maintaining LLM performance once the amount of information in a conversation balloons past the number of tokens used in the pre-training sequence.
What the researchers discovered was that LLMs pay closer attention to the tokens they are prompted with early on in a conversation or in training.
“A surprisingly large amount of attention score is allocated to the initial tokens,” they write. Why is this the case?
“Due to the sequential nature of autoregressive language modeling, initial tokens are visible to all subsequent tokens, while later tokens are only visible to a limited set of subsequent tokens,” they write. “As a result, initial tokens are more easily trained to serve as attention sinks, capturing unnecessary attention.”
In other words: whatever you put in front of an LLM first when conversing with it can and will be used by it later on in subsequent exchanges of prompt and output, but whatever you prompt it with later on will not necessarily be what the LLM chooses to focus on or reference in its responses.
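The causal (autoregressive) masking the researchers point to can be made concrete with a toy calculation. The snippet below is an illustration, not code from the paper: it counts, for each key position in a causal attention mask, how many query positions are allowed to attend to it, showing why the very first token is visible everywhere.

```python
# Toy illustration of causal attention visibility (not the paper's code).
# In autoregressive attention, query position q may attend key position k
# only when q >= k, so earlier tokens are visible to more positions.
seq_len = 6
visibility = [
    sum(1 for q in range(seq_len) if q >= k)  # queries that can see key k
    for k in range(seq_len)
]
print(visibility)  # token 0 is visible to all positions; the last token to only itself
```

Because token 0 participates in every attention computation, the model can learn to dump “spare” attention mass onto it, which is exactly the attention-sink behavior described above.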
Yet the researchers discovered that if some of those initial tokens are reintroduced later in the conversation with an LLM, that alone is enough to restore the LLM’s performance in subsequent responses to near its peak.
Remember our human customer support analogy earlier? Imagine if, by saying four of the same magic words you said at the beginning of your conversation with them, you could suddenly get them to deliver high-quality responses with you even much later in the conversation.
The researchers dub these initial tokens that grab most of the LLM’s attention, fittingly, as “attention sinks,” and note that for most LLMs, “the introduction of four initial tokens, as attention sinks, suffices to restore the LLM’s performance…adding just one or two doesn’t achieve full recovery.”
By reintroducing attention sink tokens in every single subsequent prompt from a user, the researchers were able to maintain the performance of leading models including Llama 2 and Falcon 40B across prompts consisting of 4 million tokens (a 1,000-fold increase from the original context window of just 4,000 tokens) “and potentially even more”, and sped up subsequent responses by 22.2 times.
In other words, StreamingLLM “enables LLMs trained with a finite attention window to work on text of infinite length without finetuning.” Importantly, this “infinite” length text would still need to be delivered to the LLM in chunks limited to the size of its context window. However, it means the LLM could have a never-ending conversation with someone and retain its performance throughout (theoretically).
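The cache policy behind this can be sketched in a few lines. The following is a simplified model of the idea described in the paper, with assumed parameter values (`n_sink=4` sink tokens and a hypothetical `window` size): the KV cache always retains the first few “attention sink” tokens plus a rolling window of the most recent tokens, so its size stays bounded no matter how long the stream grows.

```python
def streaming_cache_positions(total_tokens, n_sink=4, window=1020):
    """Sketch of the StreamingLLM eviction policy: which token positions
    stay in the KV cache. Always keep the first n_sink 'attention sink'
    tokens, plus a sliding window over the most recent tokens.
    (n_sink and window values here are illustrative assumptions.)"""
    if total_tokens <= n_sink + window:
        # Everything still fits; evict nothing.
        return list(range(total_tokens))
    recent_start = total_tokens - window
    # Sinks at the front, then only the freshest `window` tokens.
    return list(range(n_sink)) + list(range(recent_start, total_tokens))


positions = streaming_cache_positions(1_000_000)
print(len(positions))   # cache stays at n_sink + window entries
print(positions[:4])    # the four sink tokens are never evicted
```

The key design point is that middle tokens are dropped rather than recomputed, which is where the reported speedup over re-encoding the whole window comes from.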
One token to rule them all (their attention, at least)
Taking their findings another step further, the researchers hypothesized and proved that you could actually get away with adding just a single special token to act as an “attention sink” for an LLM early on, and that, by reintroducing this token later manually or automatically (behind the scenes of a user- or employee-facing LLM), the LLM’s performance could continue to be kept high.
“Introducing a sink token is highly effective in stabilizing the attention mechanism,” the researchers explain. “Simply pairing this sink token with recent tokens sufficiently anchors the model’s performance…Given these findings, we recommend training future LLMs with a sink token in all samples to optimize streaming deployment.”
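In practice, that recommendation amounts to a small preprocessing step. The sketch below is a hypothetical illustration, not the authors’ code: it prepends a dedicated sink token id (the `SINK_TOKEN_ID` value is an assumption) to every tokenized training sample so the model learns one stable place to park attention.

```python
# Hypothetical preprocessing step illustrating the paper's recommendation:
# prepend a dedicated sink token to every training sample.
SINK_TOKEN_ID = 0  # assumed id reserved for the special sink token


def add_sink_token(token_ids, sink_id=SINK_TOKEN_ID):
    """Return token_ids with the sink token prepended (idempotent)."""
    if token_ids and token_ids[0] == sink_id:
        return token_ids  # already has a sink; leave unchanged
    return [sink_id] + token_ids


print(add_sink_token([57, 820, 13]))  # sink id now leads the sequence
```

At inference time the same token would be pinned in the KV cache alongside recent tokens, matching the “sink token plus recent tokens” pairing the researchers describe.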
Asked what specific data should be used for an attention sink, one of the paper’s authors, Guangxuan Xiao of MIT, wrote to VentureBeat in an email that “the ‘attention sinks’ can be any initial tokens; the focus is more on their position than semantics…. These aren’t specific words or concepts; even tokens (e.g., linebreak “\n”) without semantic meanings work effectively.”
As for what the researchers hope StreamingLLM will be used for, Xiao said: “We designed StreamingLLM for continuous applications, like multi-round dialogues. It’s perfect for use cases where a model must function non-stop without relying too heavily on past data. A daily assistant LLM exemplifies this. With our method, the model can persist, drawing from recent interactions, eliminating the need for frequent cache refreshes.”
However, the researchers are also careful to note the limitations of their work, and emphasize that StreamingLLM does not extend the context window of LLMs, contrary to some hype on X (formerly Twitter) about their work. It also does not ensure that the LLM will remember everything said at every point during the conversation.
“In fact, we neither expand the LLMs’ context window nor do we improve their long-term memory,” Xiao told VentureBeat.