Serving large language models (LLMs) efficiently is becoming more critical as they grow more widely used. Speeding up LLM inference is difficult because generating each new token requires reading all of the model’s parameters. Because of this I/O bottleneck, the hardware is underutilized throughout generation. Offloading-based inference and small-batch inference settings worsen the problem because, on current GPUs, producing a single token takes as long as processing a prompt containing hundreds or thousands of tokens.
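A rough back-of-the-envelope calculation shows why generation is bound by memory traffic rather than compute. The sketch below assumes that every parameter is read once per generated token and that memory bandwidth is the only limit; the function name and the example figures (a 7-billion-parameter model in 16-bit precision, 1.5 TB/s of bandwidth) are illustrative assumptions, not numbers from the paper.

```python
def min_decode_latency_s(n_params: float, bytes_per_param: float,
                         bandwidth_bytes_per_s: float) -> float:
    """Lower bound on per-token latency when every weight is read once per token."""
    return n_params * bytes_per_param / bandwidth_bytes_per_s


# Illustrative assumptions: 7e9 parameters, 2 bytes each (fp16), 1.5e12 bytes/s.
latency = min_decode_latency_s(7e9, 2, 1.5e12)
print(f"{latency * 1e3:.1f} ms per token at best")
```

Under these assumptions the bound is the same whether one token or a whole batch of positions is processed in the pass, which is the slack that speculative decoding exploits.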

Recent research has introduced speculative decoding to address this difficulty, speeding up LLM inference while keeping the LLM’s output distribution intact. These approaches use one or more draft models to predict the LLM’s output. The predictions are organized in a token tree, with each node representing a distinct sequence of speculated tokens. A single forward pass of the LLM then verifies the correctness of all these speculated tokens in parallel. Using a token tree instead of a single sequence can increase the number of tokens the LLM accepts, because the tree offers multiple alternatives for every token position.
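The benefit of a tree over a single sequence can be illustrated with a toy verifier. This is a minimal sketch under simplifying assumptions: decoding is greedy, the target model is replaced by a hypothetical `target_next` function that returns a fixed continuation, and the tree is walked one position at a time for readability, whereas a real system scores all nodes in one forward pass. The names `TokenTree` and `verify_greedy` are made up for this example.

```python
from typing import Callable, Dict, List, Tuple

# A token tree as nested dicts: each key is a speculated token and each value
# is the subtree of continuations speculated after that token.
TokenTree = Dict[int, "TokenTree"]


def verify_greedy(prefix: List[int], tree: TokenTree,
                  target_next: Callable[[Tuple[int, ...]], int]) -> List[int]:
    """Return the tokens emitted by one verification step under greedy decoding."""
    out: List[int] = []
    node = tree
    while True:
        token = target_next(tuple(prefix + out))
        out.append(token)          # the target's own token is always emitted
        if token not in node:      # no speculated branch matches: stop here
            return out
        node = node[token]         # a branch matched: keep walking down the tree


# Hypothetical stand-in for the target model: a fixed continuation of the prefix.
PREFIX = [1]
CONTINUATION = [4, 2, 8, 6]


def target_next(context: Tuple[int, ...]) -> int:
    return CONTINUATION[len(context) - len(PREFIX)]


chain = {4: {3: {8: {}}}}                   # one speculated sequence: 4, 3, 8
tree = {4: {3: {}, 2: {8: {}}}, 5: {}}      # alternatives at each position

print(verify_greedy(PREFIX, chain, target_next))
print(verify_greedy(PREFIX, tree, target_next))
```

In this toy setup the single sequence fails at its second token, while the tree contains an alternative at that position and so lets the step continue further.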

Although tree-based speculative decoding has been studied extensively, the researchers show that existing approaches have significant limitations. First, they find that existing algorithms for constructing token trees work well for small trees but fall short for very large ones. Second, existing token tree sampling and verification algorithms do not perform well across different inference hyperparameter configurations. Finally, existing systems cannot effectively optimize the size and structure of their speculated trees for a given hardware setup. Existing models of the speedup from speculative decoding assume that verification time is constant; for large speculated trees this assumption does not hold, so these models cannot identify the optimal tree dimensions.

Researchers from Carnegie Mellon University, Meta AI, Together AI, and Yandex present Sequoia. They frame tree construction as a constrained optimization problem and use a dynamic programming method to find the optimal speculative token tree. With this tree structure, the number of tokens generated per step is shown to be unbounded and to grow roughly logarithmically with tree size, both in theory and in practice. According to the team, it is essential to develop a method for sampling and verifying trees that works well across different inference hyperparameters, does not keep sampling the same incorrect tokens, and preserves the target model’s output distribution. To achieve this, they sample from the draft model without replacement. This prevents the draft model from making the same mistake twice while keeping the output distribution of the target model intact. The approach builds on the SpecInfer algorithm. They provide empirical evidence that this new sampling and verification approach achieves high acceptance rates at both low and high temperatures.
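The dynamic programming idea can be sketched in a few lines. The sketch assumes that the chance of accepting a child depends only on its rank among its siblings (first guess, second guess, and so on), and the acceptance vector `p` is a made-up input rather than a measured one. The function name `best_expected_tokens` is hypothetical, and this is an illustration of the idea, not the authors’ implementation.

```python
from typing import List


def best_expected_tokens(p: List[float], n: int) -> float:
    """Best expected tokens per step for a tree with at most n nodes (n >= 1).

    p[k] is the assumed probability that the child of rank k is accepted,
    given that its parent was accepted. The root counts as one token, since
    every verification step emits at least one token.
    """
    num_ranks = len(p)
    # g[k][m]: best value from spending up to m nodes on children of rank >= k.
    g = [[0.0] * n for _ in range(num_ranks + 1)]
    # best[s]: best value of a subtree with s nodes, including its own root.
    best = [0.0] * (n + 1)
    best[1] = 1.0
    for m in range(1, n):
        for k in range(num_ranks - 1, -1, -1):
            # Give a nodes to the rank-k child subtree, the rest to later ranks.
            g[k][m] = max(p[k] * best[a] + g[k + 1][m - a]
                          for a in range(1, m + 1))
        best[m + 1] = 1.0 + g[0][m]
    return best[n]


# Hypothetical acceptance rates for the first and second guess at a position.
print(best_expected_tokens([0.8], 3))        # only one guess per position: a chain
print(best_expected_tokens([0.5, 0.4], 3))   # a second guess can beat going deeper
```

With the second acceptance vector and three nodes, placing two children under the root scores higher than a chain of depth two, which is the kind of trade-off between width and depth that the optimization resolves.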

The team further presents a hardware-aware tree optimizer that accounts for the hardware-dependent relationship between the number of tokens being verified and the verification time. The optimizer uses this relationship to determine the best tree size and depth, addressing the final limitation. Their findings show that this approach yields greater speedups than hardware-agnostic methods.
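The selection step can be pictured as maximizing estimated tokens per second over candidate tree sizes. The sketch below is only an illustration: `pick_tree_size` is a hypothetical helper, the logarithmic `expected_tokens` curve and both timing functions are invented stand-ins for quantities that would be measured on real hardware, and the drafting cost is treated as constant for simplicity.

```python
import math
from typing import Callable, Iterable


def pick_tree_size(candidates: Iterable[int],
                   expected_tokens: Callable[[int], float],
                   verify_time_s: Callable[[int], float],
                   draft_time_s: Callable[[int], float]) -> int:
    """Return the candidate size with the highest estimated tokens per second."""
    return max(candidates,
               key=lambda n: expected_tokens(n) / (draft_time_s(n) + verify_time_s(n)))


def expected_tokens(n: int) -> float:
    # Assumed growth: roughly logarithmic in tree size.
    return 1.0 + math.log2(n)


def draft_time_s(n: int) -> float:
    # Assumed constant drafting cost (a simplification).
    return 0.005


def measured_verify_time_s(n: int) -> float:
    # Assumed hardware profile: flat up to 64 tokens, then linear in n.
    return 0.02 * max(1.0, n / 64)


def constant_verify_time_s(n: int) -> float:
    # The constant-time assumption that breaks down for large trees.
    return 0.02


sizes = [1, 8, 32, 64, 128, 512]
print(pick_tree_size(sizes, expected_tokens, measured_verify_time_s, draft_time_s))
print(pick_tree_size(sizes, expected_tokens, constant_verify_time_s, draft_time_s))
```

With these invented numbers, the profile that grows with tree size selects a mid-sized tree, while the constant-time model always prefers the largest candidate, even though verifying it would be far slower on the assumed hardware.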

The team demonstrates Sequoia’s effectiveness with comprehensive end-to-end experiments and ablation studies. Their implementation is built on top of Hugging Face (and Accelerate) and uses CUDA Graphs. Sequoia achieves speedups of up to 4.04× for Llama2-7B on a single A100 GPU and up to 10.33× for Llama2-70B in the offloading setting on an L40 GPU. Additionally, the ablation studies show that:

  1. The Sequoia tree structure scales better than k independent sequences (tree size ≤ 512), generating up to 33% more tokens per decoding step.
  2. The Sequoia sampling and verification algorithm is robust to the choice of temperature and top-p, providing speedups of up to 65% and 27% over the SpecInfer and top-k sampling and verification algorithms, respectively.
  3. The Sequoia hardware-aware tree optimizer can automatically determine the optimal tree size and depth for different hardware.

Check out the Paper. All credit for this research goes to the researchers of this project.

This article was originally published at www.marktechpost.com