LongLLM-Extrapolation. Updated daily
LongAgent: Scaling Language Models to 128k Context through Multi-Agent Collaboration
- Author: Jun Zhao, Can Zu, Hao Xu, Yi Lu, Wei He, Yiwen Ding, Tao Gui, Qi Zhang, Xuanjing Huang (Fudan University)
- TL.DR: The paper introduces LongAgent, a novel approach using multi-agent collaboration to extend the context handling capabilities of LLMs like LLaMA to 128k tokens, surpassing traditional models like GPT-4 in long-text processing tasks. By employing a leader to direct team members in gathering information and an inter-member communication mechanism to resolve inconsistencies caused by hallucinations, LongAgent demonstrates significant improvements in long-text retrieval and multi-hop question answering. This method presents a promising solution for efficiently processing extensive text inputs while addressing the challenges of high training costs and inference latency in LLMs.
Data Engineering for Scaling Language Models to 128K Context
- Author: Yao Fu(University of Edinburgh), Rameswar Panda(MIT-IBM Watson AI Lab), Xinyao Niu(University of Melbourne), Xiang Yue(Ohio State University), Hannaneh Hajishirzi(University of Washington), Yoon Kim(MIT), Hao Peng(UIUC)
- TL.DR: The study explores the effectiveness of continual pretraining for extending the context length of language models to 128K, emphasizing the importance of data engineering in terms of both quantity and quality. It finds that training on 500 million to 5 billion tokens, with a focus on domain balance and avoiding naive length upsampling, enables models to effectively utilize information across extended contexts. This approach, which is both effective and affordable, outperforms existing long-context models and narrows the performance gap with state-of-the-art models like GPT-4 128K.
With Greater Text Comes Greater Necessity: Inference-Time Training Helps Long Text Generation
- Author: Y. Wang, D. Ma, D. Cai
- TL.DR: The proposed Temp-Lora method enhances long text generation, like novel writing or extensive translations, by embedding context information into a temporary module within the model's parameters, instead of using traditional memory-intensive methods. This innovative approach allows for high-quality text generation with significantly reduced hardware demands. It demonstrated impressive improvements in benchmarks, including lower perplexity and higher BLEU scores, while also cutting down computational costs and speeding up the process. Temp-Lora stands out by being efficient, compatible with existing methods, and effective in handling long texts without permanent changes to the model's structure.
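- Sketch (not from the paper): a rough, hedged illustration of the Temp-Lora idea using Hugging Face `transformers` and `peft` with a placeholder GPT-2 model; the chunk size, LoRA settings, and training loop are illustrative assumptions, not the authors' implementation.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# attach a temporary LoRA adapter to a small placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = get_peft_model(model, LoraConfig(r=8, target_modules=["c_attn"], task_type="CAUSAL_LM"))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

generated = "Once upon a time"
for step in range(4):  # generate a few chunks; a real run would loop much longer
    inputs = tokenizer(generated, return_tensors="pt")
    chunk_ids = model.generate(**inputs, max_new_tokens=64, do_sample=True)
    generated = tokenizer.decode(chunk_ids[0], skip_special_tokens=True)

    # embed the freshly generated context into the temporary module's parameters
    batch = tokenizer(generated, return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# discard the temporary adapter afterwards; the base model's weights stay unchanged
model = model.unload()
```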
LM-Infinite: Simple On-The-Fly Length Generalization For Large Language Models
- Author: Chi Han(University of Illinois Urbana-Champaign), Qifan Wang(Meta), Wenhan Xiong(Meta), Yu Chen(Meta), Heng Ji(UIUC), Sinong Wang(Meta)
- TL.DR: LM-Infinite introduces a straightforward, efficient approach for enhancing LLMs' ability to generate longer texts without the need for additional training. By implementing a Λ-shaped attention mask and setting a distance limit, this method effectively addresses the issue of length generalization failure in LLMs without requiring parameter updates. It is compatible with models using relative-position encoding, offering significant improvements in fluency and quality of generated text for sequences up to 128k tokens. Moreover, LM-Infinite is computationally light, offering a 2.72x speedup in decoding time. The technique promises a practical solution for extending LLMs' applicability to longer contexts, with code to be shared post-publication.
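- Sketch (not from the paper): a minimal sketch of the Λ-shaped attention mask described above; parameter names and sizes are illustrative, and the distance-limit component is omitted for brevity.
```python
import numpy as np

def lambda_shaped_mask(seq_len: int, n_global: int = 4, n_local: int = 1024) -> np.ndarray:
    """Return a (seq_len, seq_len) boolean mask; True = attention allowed."""
    q = np.arange(seq_len)[:, None]    # query positions
    k = np.arange(seq_len)[None, :]    # key positions
    causal = k <= q                    # no attention to future tokens
    global_branch = k < n_global       # the prefix every query keeps attending to
    local_branch = (q - k) < n_local   # sliding window of recent tokens
    return causal & (global_branch | local_branch)

mask = lambda_shaped_mask(16, n_global=2, n_local=4)
print(mask.astype(int))
```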
The What, Why, and How of Context Length Extension Techniques in Large Language Models: A Detailed Survey
- Author: Saurav Pawar(Technology Innovation Institute, UAE), S.M Towhidul Islam Tonmoy(Islamic University of Technology, Bangladesh), S M Mehedi Zaman(Islamic University of Technology, Bangladesh), Vinija Jain(Stanford University & Amazon GenAI), Aman Chadha(Stanford University & Amazon GenAI), Amitava Das(AI Institute, University of South Carolina)
- TL.DR: The survey paper discusses the importance and challenges of extending context length in LLMs for improving NLP applications. It highlights the need to overcome LLMs' limitations in handling long text sequences for better comprehension and generation capabilities. The paper reviews existing strategies for context length extension, evaluates their effectiveness, and addresses the lack of consensus on evaluation standards among researchers. It aims to guide researchers in understanding context extension techniques and encourages further exploration and standardization efforts in this evolving area.
Flexibly Scaling Large Language Models Contexts Through Extensible Tokenization
- Author:
- TL.DR:
E^2-LLM: Efficient and Extreme Length Extension of Large Language Models
- Author: Jiaheng Liu(Alibaba Group), Zhiqi Bai(Alibaba Group), Yuanxing Zhang(Alibaba Group), Chenchen Zhang(Alibaba Group), Yu Zhang(Alibaba Group), Ge Zhang(University of Waterloo), Jiakai Wang(Alibaba Group), Haoran Que(Alibaba Group), Yukang Chen(The Chinese University of Hong Kong), Wenbo Su(Alibaba Group), Tiezheng Ge(Alibaba Group), Jie Fu(The Hong Kong University of Science and Technology), Wenhu Chen(University of Waterloo), Bo Zheng(Alibaba Group)
- TL.DR: The paper proposes E^2-LLM, an efficient method for extending the context length of LLMs without the need for long-context training data or high computational costs. By using short training sequences (e.g., 4k tokens) and a single training procedure, E^2-LLM enables LLMs to handle various context lengths at inference time efficiently. It incorporates novel augmentation methods based on RoPE (Rotary Position Embeddings) to enhance model robustness to different context lengths. The approach significantly reduces the computational requirements and demonstrates its effectiveness across multiple benchmark datasets for long-context tasks.
Extending LLMs' Context Window with 100 Samples
- Author: Yikai Zhang(Shanghai Jiao Tong University, Shanghai Artificial Intelligence Lab, GAIR), Junlong Li(Shanghai Jiao Tong University, GAIR), Pengfei Liu(Shanghai Jiao Tong University, Shanghai Artificial Intelligence Lab, GAIR)
- TL.DR: The paper examines the distribution of attention scores across LLM layers from an entropy perspective and finds that the attention entropy of the early layers rises steadily as context grows. It proposes entropy-aware ABF (EABF) to maintain stable attention entropy across layers: a scaling factor is applied to the query-key product in every layer except the first two, and only the portion that exceeds the training length is scaled. Fine-tuning with only 100 samples, the method performs best on their downstream tasks at a 32k context size.
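- Sketch (not from the paper): a rough illustration based only on the description above; the scale value, layer indexing, and placement of the scaling are assumptions.
```python
import numpy as np

def eabf_like_logits(q, k, layer_idx: int, train_len: int, scale: float) -> np.ndarray:
    """q, k: (seq_len, head_dim). Returns raw attention logits."""
    logits = q @ k.T / np.sqrt(q.shape[-1])
    if layer_idx >= 2 and logits.shape[1] > train_len:
        logits[:, train_len:] *= scale   # scale only the part beyond the training length
    return logits

logits = eabf_like_logits(np.random.randn(8, 4), np.random.randn(8, 4),
                          layer_idx=3, train_len=4, scale=1.2)
print(logits.shape)
```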
Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache Code
- Author: Bin Lin*(Alibaba Group), Tao Peng*(Alibaba Group), Chen Zhang*(Shanghai Jiao Tong University), Minmin Sun(Alibaba Group), Lanbo Li(Alibaba Group), Hanyu Zhao(Alibaba Group), Wencong Xiao(Alibaba Group), Qi Xu(Alibaba Group), Xiafei Qiu(Alibaba Group), Shen Li(Alibaba Group), Zhigang Ji(Shanghai Jiao Tong University), Yong Li(Alibaba Group), Wei Lin(Alibaba Group)
- TL.DR: The paper addresses the challenges of resource management in cloud-based LLM serving systems by proposing DistAttention, a distributed attention algorithm that decouples KV cache computation from the standard transformer block, and DistKV-LLM, a distributed LLM serving system. DistAttention partitions the KV cache into manageable units, enabling distributed processing and storage of the attention module. DistKV-LLM dynamically manages the KV cache and efficiently orchestrates all accessible GPU and CPU memory across data centers, enabling high-performance LLM services adaptable to a wide range of context lengths. It is a serving-support strategy rather than a model length-extrapolation strategy.
LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning Code
- Author: Hongye Jin(Texas A&M University), Xiaotian Han(Texas A&M University), Jingfeng Yang(Amazon), Zhimeng Jiang(Texas A&M University), Zirui Liu(Rice University), Chia-Yuan Chang(Texas A&M University), Huiyuan Chen(Case Western Reserve University), Xia Hu(Rice University)
- TL.DR: The work introduces SelfExtend, a novel method that leverages existing Large Language Models' (LLMs) inherent capabilities to handle longer contexts than they were trained for, without the need for fine-tuning. SelfExtend utilizes a bi-level attention mechanism—grouped attention for distant token dependencies and neighbor attention for adjacent token relationships—built on the original self-attention mechanism to extend LLMs' context window effortlessly during inference. The approach requires minimal code adjustments and has been proven effective across various benchmarks, enabling LLMs to process longer input sequences efficiently.
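- Sketch (not from the paper): a minimal sketch of SelfExtend's bi-level position mapping, exact relative positions inside a neighbor window and floor-divided (grouped) positions beyond it; group size and window are illustrative, and entries for future tokens would be removed by the causal mask.
```python
import numpy as np

def self_extend_rel_pos(seq_len: int, group_size: int = 8, neighbor_window: int = 512) -> np.ndarray:
    q = np.arange(seq_len)[:, None]
    k = np.arange(seq_len)[None, :]
    rel = q - k                                     # ordinary relative distance
    # grouped distances are compressed and shifted so they line up at the window edge
    grouped = rel // group_size + (neighbor_window - neighbor_window // group_size)
    return np.where(rel < neighbor_window, rel, grouped)

print(self_extend_rel_pos(8, group_size=2, neighbor_window=4))
```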
Fortify the Shortest Stave in Attention: Enhancing Context Awareness of Large Language Models for Effective Tool Use
- Author: Yuhan Chen(Gaoling School of Artificial Intelligence, Renmin University of China), Ang Lv(Gaoling School of Artificial Intelligence, Renmin University of China), Ting-En Lin(Alibaba Group), Changyu Chen(Gaoling School of Artificial Intelligence, Renmin University of China), Yuchuan Wu(Alibaba Group), Fei Huang(Alibaba Group), Yongbin Li(Alibaba Group), Rui Yan(Gaoling School of Artificial Intelligence, Renmin University of China)
- TL.DR: This paper proposes Attention Buckets, a method that runs multiple parallel processes with different RoPE bases and aggregates them. It builds on the observation that prediction logits are relatively higher where length adaptation is better, and that this varies across RoPE bases. The implementation uses a total of six searched RoPE bases and obtains an improvement on tool-use tasks. The maximum task length is only 4096 tokens, but the idea is insightful.
LongQLoRA: Efficient and Effective Method to Extend Context Length of Large Language Models
- Author: Jianxin Yang (Sun Yat-sen University)
- TL.DR: LongQLoRA is a method designed to extend the context lengths of large language models like LLaMA2 efficiently, using fewer training resources. It integrates Position Interpolation, QLoRA, and the Shift Short Attention mechanism from LongLoRA to effectively extend context length from 4,096 up to 12,000 tokens on a single 32GB V100 GPU within just 1000 fine-tuning steps. LongQLoRA shows strong performance on the PG19 and Proofpile datasets, surpassing LongLoRA and closely matching the performance of MPT-7B-8K at a context length of 8192. It also successfully extends the context length of Vicuna-13B models, demonstrating improved generation quality in both long and short contexts. Ablation studies further explore the impact of LoRA rank, fine-tuning steps, and attention patterns during inference, contributing to our understanding of efficient context extension in language models.
LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression
- Author: Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu (Microsoft)
- TL.DR: LongLLMLingua is a strategy aimed at optimizing large language models (LLMs) for long-context scenarios through prompt compression. It addresses the challenges of high computational costs, long latency, and reduced performance by enhancing the models' ability to focus on key information within the prompts. Through evaluations across various applications, including QA, few-shot learning, summarization, and more, LongLLMLingua has been shown to significantly improve performance and reduce both costs and latency. Notably, with GPT-3.5-Turbo it achieved a performance increase of up to 17.1% while using around four times fewer input tokens, alongside notable cost savings and faster processing.
Scaling Laws of RoPE-based Extrapolation
- Author: Xiaoran Liu, Hang Yan, Shuo Zhang, Chenxin An, Xipeng Qiu, Dahua Lin (Fudan University, Shanghai AI lab)
- TL.DR: This work investigates the extrapolation capabilities of Large Language Models (LLMs) using Rotary Position Embedding (RoPE) and proposes a novel framework, the Scaling Laws of RoPE-based Extrapolation, to improve these capabilities. By adjusting the rotary base value and the context length used in fine-tuning, the authors found significant enhancements in the models' ability to handle much longer texts than seen during training, achieving extrapolation up to 1 million tokens with only 16K training length on LLaMA2 models. This study offers a comprehensive understanding of how RoPE's parameters influence LLMs' extrapolation performance and presents a methodological approach to extend their application range significantly.
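- Sketch (not from the paper): a small illustration of the knob the paper studies, the RoPE rotary base; the example base values are illustrative. The paper's scaling laws relate the choice of base and fine-tuning context length to the achievable extrapolation length.
```python
import numpy as np

def rope_inv_freq(head_dim: int, base: float = 10000.0) -> np.ndarray:
    """Per-dimension rotation frequencies used by RoPE."""
    return 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))

for base in (10000.0, 500000.0, 1000000.0):  # larger bases rotate more slowly
    print(base, rope_inv_freq(8, base))
```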
GrowLength: Accelerating LLMs Pretraining by Progressively Growing Training Length
- Author: Hongye Jin(Texas A&M University), Xiaotian Han(Texas A&M University), Jingfeng Yang(Amazon), Zhimeng Jiang(Texas A&M University), Chia-Yuan Chang(Texas A&M University), Xia Hu(Rice University)
- TL.DR: The paper introduces "GrowLength," a method to speed up the pretraining of LLMs by starting with short sequence lengths (128 tokens) and gradually increasing to longer ones (up to 4096 tokens). This strategy reduces computational costs and improves efficiency, allowing models to handle more tokens quickly and perform better. The method is simple, effective, and doesn't need extra engineering, offering a practical solution for faster and more efficient pretraining of LLMs.
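- Sketch (not from the paper): a toy progressive-length schedule in the spirit described above; stage lengths and proportions are assumptions.
```python
def growlength_schedule(total_steps: int, stages=((128, 0.4), (1024, 0.3), (4096, 0.3))):
    """Yield (step, seq_len) pairs; each stage takes a fraction of total_steps."""
    step = 0
    for seq_len, frac in stages:
        for _ in range(int(total_steps * frac)):
            yield step, seq_len
            step += 1

for step, seq_len in growlength_schedule(10):
    print(step, seq_len)
```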
Think-in-Memory: Recalling and Post-thinking Enable LLMs with Long-Term Memory
- Author: Lei Liu, Xiaoyan Yang, Yue Shen, Binbin Hu, Zhiqiang Zhang, Jinjie Gu, Guannan Zhang (Ant Group)
- TL.DR: The TiM (Think-in-Memory) mechanism enhances LLMs for long-term interactions by introducing a novel memory system that stores and updates historical thoughts, avoiding the common problem of biased or inconsistent reasoning with repeated recall. TiM works by recalling relevant past thoughts before response generation and updating memory with new insights after, using principles like insert, forget, and merge for dynamic memory management. Additionally, it incorporates Locality-Sensitive Hashing for efficient long-term conversation retrieval. Tests on real and simulated dialogues show that TiM significantly improves LLMs' response quality in prolonged interactions.
LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models
- Author: Yukang Chen(CUHK), Shengju Qian(CUHK), Haotian Tang(MIT), Xin Lai(CUHK), Zhijian Liu(CUHK), Song Han(MIT, NVIDIA), Jiaya Jia(CUHK)
- TL.DR: LongLoRA is a novel fine-tuning method designed to efficiently extend the context sizes of pre-trained large language models (LLMs) without significantly increasing computational costs. It introduces two main improvements: the shifted sparse attention (S2-Attn) mechanism for efficient fine-tuning with sparse local attention, saving computation during the context extension process, and an optimized version of LoRA (Low-Rank Adaptation) that focuses on trainable embeddings and normalization for parameter-efficient fine-tuning. This approach allows for extending the context size of models like Llama2 from standard lengths to up to 100,000 tokens with minimal computational overhead, demonstrating strong performance on various tasks. LongLoRA maintains the original architecture of the models and is compatible with existing techniques, offering a practical solution for enhancing LLMs' ability to handle longer contexts effectively.
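- Sketch (not from the paper): a simplified illustration of shifted sparse attention (S2-Attn) as described above; causal masking is omitted and shapes are illustrative, so this is not the official implementation.
```python
import torch

def s2_attn(q, k, v, group_size: int):
    """q, k, v: (batch, heads, seq_len, head_dim); seq_len divisible by group_size."""
    b, h, n, d = q.shape
    half = h // 2
    shift = group_size // 2
    # shift the second half of the heads along the sequence dimension
    q = torch.cat([q[:, :half], q[:, half:].roll(-shift, dims=2)], dim=1)
    k = torch.cat([k[:, :half], k[:, half:].roll(-shift, dims=2)], dim=1)
    v = torch.cat([v[:, :half], v[:, half:].roll(-shift, dims=2)], dim=1)
    # group-local attention: fold groups into an extra dimension
    g = n // group_size
    q, k, v = (t.reshape(b, h, g, group_size, d) for t in (q, k, v))
    out = torch.softmax(q @ k.transpose(-1, -2) / d ** 0.5, dim=-1) @ v
    out = out.reshape(b, h, n, d)
    # undo the shift for the second half of the heads
    return torch.cat([out[:, :half], out[:, half:].roll(shift, dims=2)], dim=1)

out = s2_attn(*(torch.randn(1, 4, 16, 8) for _ in range(3)), group_size=4)
print(out.shape)
```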
PoSE: Efficient Context Window Extension of LLMs via Positional Skip-wise Training Code
- Author: Dawei Zhu(Peking University & Microsoft Corporation), Nan Yang(Microsoft Corporation), Liang Wang(Microsoft Corporation), Yifan Song(Peking University), Wenhao Wu(Peking University), Furu Wei(Microsoft Corporation), Sujian Li(Peking University)
- TL.DR: The paper introduces Positional Skip-wisE (PoSE) training, a novel method for extending the context length of Large Language Models (LLMs) without the need for intensive Full-length fine-tuning. PoSE cleverly simulates longer inputs within a fixed context window by dividing it into chunks and applying distinct skipping bias terms to manipulate each chunk's position indices. This technique allows the model to adapt to any position within the target length efficiently, significantly reducing memory and time costs. The authors successfully extended the LLaMA model to handle 128k tokens using only a 2k training context window and demonstrated PoSE's compatibility with RoPE-based LLMs and position interpolation strategies. This method opens the possibility of scaling LLMs to potentially infinite lengths, bounded only by inference memory constraints.
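- Sketch (not from the paper): a minimal illustration of PoSE-style position manipulation; the chunk count and skip sampling are assumptions, not the official implementation.
```python
import numpy as np

def pose_position_ids(train_len=2048, target_len=131072, num_chunks=2, rng=np.random):
    """Split a short window into chunks and add random skipping biases to cover target_len."""
    chunk = train_len // num_chunks
    ids, start = [], 0
    budget = target_len - train_len                 # total room for skips
    for _ in range(num_chunks):
        skip = rng.randint(0, budget + 1)           # skipping bias for this chunk
        start += skip
        ids.append(np.arange(start, start + chunk))
        start += chunk
        budget -= skip
    return np.concatenate(ids)

print(pose_position_ids(train_len=16, target_len=64, num_chunks=2))
```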
YaRN: Efficient Context Window Extension of Large Language Models
- Author: Bowen Peng(Nous Research), Jeffrey Quesnelle(Nous Research), Honglu Fan(EleutherAI, University of Geneva), Enrico Shippole
- TL.DR: YaRN (Yet another RoPE extensioN) is a new, compute-efficient method that significantly extends the context window of transformer-based language models like LLaMA, with far less computational cost than previous approaches. It enables these models to handle and extrapolate to much longer sequences than they were initially trained on, setting a new standard in context window extension. Additionally, YaRN can go beyond the context limitations of fine-tuning datasets.
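- Sketch (not from the paper): a simplified illustration of the per-dimension frequency interpolation that YaRN builds on ("NTK-by-parts" style blending); the ramp constants and scale are assumptions, and YaRN's attention-temperature adjustment is omitted.
```python
import numpy as np

def yarn_like_inv_freq(dim: int, base: float = 10000.0, scale: float = 8.0,
                       orig_len: int = 4096, alpha: float = 1.0, beta: float = 32.0):
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    wavelength = 2 * np.pi / inv_freq
    periods = orig_len / wavelength              # full rotations inside the original window
    ramp = np.clip((periods - alpha) / (beta - alpha), 0.0, 1.0)
    # low-frequency dims (few periods) are fully interpolated (divided by `scale`);
    # high-frequency dims (many periods) are left untouched; dims in between blend
    return (1 - ramp) * inv_freq / scale + ramp * inv_freq

print(yarn_like_inv_freq(16))
```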
Focused Transformer: Contrastive Training for Context Scaling
- Author: Szymon Tworkowski(IDEAS NCBR, University of Warsaw), Konrad Staniszewski(IDEAS NCBR, University of Warsaw), Mikołaj Pacek(IDEAS NCBR, University of Warsaw), Yuhuai Wu(xAI), Henryk Michalewski(University of Warsaw, Google DeepMind), Piotr Miłoś(IDEAS NCBR, Institute of Mathematics, Polish Academy of Sciences, deepsense.ai)
- TL.DR: The Focused Transformer (FoT) addresses the challenge of large language models losing efficiency as external memory scales, by refining the structure of the memory's (key, value) pairs through a contrastive learning-inspired training process. This technique allows for extending the effective context length of models without succumbing to the distraction of irrelevant information. By applying this method to fine-tune existing large-scale models, such as 3B and 7B OpenLLaMA, the enhanced versions, named LongLLaMA, show improved performance on tasks requiring extensive context, managing up to a 256k context length effectively for information retrieval.
LONGNET: Scaling Transformers to 1,000,000,000 Tokens
- Author: Jiayu Ding(Xi’an Jiaotong University), Shuming Ma(Microsoft Research), Li Dong(Microsoft Research), Xingxing Zhang(Microsoft Research), Shaohan Huang(Microsoft Research), Wenhui Wang(Microsoft Research), Nanning Zheng(Xi’an Jiaotong University), Furu Wei(Microsoft Research)
- TL.DR: This paper proposes LongNet, a sparse Transformer that scales to one billion tokens using the proposed dilated attention. Dilated attention consists of a series of attention patterns that model both short-range and long-range dependencies, and the number of patterns can be extended according to the sequence length. Computational efficiency is significantly improved by this method. However, the authors only evaluate perplexity at 32k, and the approach lacks modeling of global details.
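- Sketch (not from the paper): a toy view of the dilated attention pattern, showing which key indices a segment keeps under one (segment length, dilation) pair; real LongNet mixes several such pairs across heads, and the values here are illustrative.
```python
import numpy as np

def dilated_indices(seq_len: int, segment_length: int, dilation: int, offset: int = 0):
    """Return, per segment, the indices kept under a given dilation rate."""
    keep = []
    for start in range(0, seq_len, segment_length):
        seg = np.arange(start, min(start + segment_length, seq_len))
        keep.append(seg[offset::dilation])
    return keep

for seg in dilated_indices(16, segment_length=8, dilation=2):
    print(seg)
```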
Extending Context Window of Large Language Models via Positional Interpolation
- Author: Shouyuan Chen, Sherman Wong, Liangjian Chen, Yuandong Tian(Meta)
- TL.DR: The paper presents Position Interpolation (PI), which extends the context window of RoPE-based pretrained LLMs such as the LLaMA models to up to 32768 tokens with minimal fine-tuning (within 1000 steps), while demonstrating strong empirical results on tasks that require long context, including passkey retrieval, language modeling, and long document summarization, from LLaMA 7B to 65B. Models extended by Position Interpolation also preserve quality relatively well on tasks within their original context window. Rather than extrapolating beyond the trained context length, which can produce catastrophically high attention scores that ruin the self-attention mechanism, Position Interpolation linearly down-scales the input position indices to match the original context window size. A theoretical study shows that the upper bound of interpolation is at least ∼600× smaller than that of extrapolation, further demonstrating its stability. Models extended via Position Interpolation retain their original architecture and can reuse most pre-existing optimization and infrastructure.
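- Sketch (not from the paper): the core operation in a few lines, with position indices linearly down-scaled into the original training window before being fed to RoPE; the window size here is illustrative.
```python
import numpy as np

def interpolated_positions(seq_len: int, original_window: int = 2048) -> np.ndarray:
    positions = np.arange(seq_len, dtype=np.float64)
    scale = min(1.0, original_window / seq_len)   # only shrink, never stretch
    return positions * scale                      # used in place of integer indices

print(interpolated_positions(8192, original_window=2048)[:5])
```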
Randomized Positional Encodings Boost Length Generalization of Transformers Code
- Author: Anian Ruoss, Grégoire Delétang, Tim Genewein, Jordi Grau-Moya, Róbert Csordás, Mehdi Bennani, Shane Legg, Joel Veness (DeepMind; The Swiss AI Lab IDSIA, USI & SUPSI)
- TL.DR: This work identifies the failure of Transformers to generalize to sequences of arbitrary length as a problem rooted in positional encodings being out-of-distribution for longer sequences. To address this, the authors introduce a novel randomized positional encoding scheme designed to simulate longer sequence positions, allowing the model to generalize to unseen sequence lengths more effectively. Their extensive evaluation across 6000 models and 15 tasks shows a significant improvement in generalization capabilities, with an average increase of 12.0% in test accuracy for sequences longer than those seen during training.
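- Sketch (not from the paper): a minimal illustration of the randomized scheme described above, where each training example uses a sorted random subset of a much larger position range; the range size is illustrative.
```python
import numpy as np

def randomized_positions(seq_len: int, max_len: int = 8192, rng=np.random) -> np.ndarray:
    """Sorted random positions drawn from a range much larger than seq_len."""
    return np.sort(rng.choice(max_len, size=seq_len, replace=False))

print(randomized_positions(10, max_len=100))
```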
Landmark Attention: Random-Access Infinite Context Length for Transformers
- Author: Amirkeivan Mohtashami, Martin Jaggi(EPFL)
- TL.DR: The paper introduces a new method to handle longer contexts in Transformers without sacrificing the random-access flexibility of their attention mechanism. By using "landmark tokens" to represent input blocks, the method enables efficient selection of relevant context directly through the attention mechanism, avoiding the need for separate retrieval systems. This approach allows processing of much longer contexts seamlessly, improving performance and extending context length capacity significantly (e.g., up to 32k tokens for LLaMA 7B), making it competitive with state-of-the-art models like Transformer-XL and GPT-4 but with fewer resources.
Exploring Length Generalization in Large Language Models
- Author: Cem Anil, Yuhuai Wu, Anders Andreassen, Aitor Lewkowycz, Vedant Misra, Vinay Ramasesh, Ambrose Slone, Guy Gur-Ari, Ethan Dyer, Behnam Neyshabur (Google)
- TL.DR: This paper explores the ability of transformer-based language models to generalize from short to longer problem instances in reasoning tasks, an important aspect of out-of-distribution generalization. Through empirical studies, the authors find that simply fine-tuning transformers on tasks requiring length generalization leads to significant deficiencies, regardless of the model's size. However, they discover that leveraging the in-context learning capabilities of pretrained large language models, combined with scratchpad prompting (which involves asking the model to outline solution steps before providing the final answer), significantly enhances length generalization capabilities. The study also conducts failure analyses to identify common mistake patterns, suggesting pathways for future improvements in enabling language models to handle longer and more complex problem instances effectively.
Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
- Author: Ofir Press(University of Washington, Facebook AI Research), Noah A. Smith(University of Washington, Allen Institute for AI), Mike Lewis(Facebook AI Research)
- TL.DR: ALiBi (Attention with Linear Biases) introduces a novel and efficient approach for enabling transformer models to handle longer sequence extrapolation than seen during training, without adding positional embeddings to word embeddings. Instead, ALiBi biases attention scores based on the distance between tokens, allowing a model trained on sequences of length 1024 to effectively manage sequences up to 2048. This method not only trains faster and uses less memory compared to traditional sinusoidal position embeddings but also outperforms other position methods on benchmarks like WikiText-103 due to its inductive bias towards more recent information.
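- Sketch (not from the paper): a small illustration of the ALiBi bias added to attention logits, with a distance-proportional penalty and per-head slopes from a geometric sequence; head count and sequence length are illustrative.
```python
import numpy as np

def alibi_bias(seq_len: int, num_heads: int) -> np.ndarray:
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)   # per-head slopes
    dist = np.arange(seq_len)[None, :] - np.arange(seq_len)[:, None]   # key - query
    dist = np.minimum(dist, 0)                                         # future tokens handled by the causal mask
    return slopes[:, None, None] * dist[None, :, :]                    # (heads, q, k), added to logits

print(alibi_bias(4, 2)[0])
```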
RoFormer: Enhanced Transformer with Rotary Position Embedding
- Author: Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, Yunfeng Liu(Zhuiyi Technology Co., Ltd.)
- TL.DR: The paper explores different methods of incorporating positional information into transformer-based models and introduces a new technique called Rotary Position Embedding (RoPE). RoPE uniquely combines rotation matrices to encode absolute positions with the ability to capture explicit relative position dependencies within the self-attention mechanism. This approach offers advantages like adaptability to various sequence lengths, diminishing dependency between tokens as their relative distance increases, and enhancing linear self-attention models with relative positioning capabilities. The improved transformer model, dubbed RoFormer, is tested across several long text classification benchmarks, showing superior performance compared to existing methods. The paper also provides theoretical insights to explain these experimental findings.
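- Sketch (not from the paper): a compact illustration of applying rotary position embedding (rotate-half form) to query/key features, where pairs of dimensions are rotated by an angle that grows with position so dot products depend only on relative positions; shapes are illustrative.
```python
import numpy as np

def apply_rope(x: np.ndarray, positions: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """x: (seq_len, dim) with even dim; positions: (seq_len,)."""
    half = x.shape[-1] // 2
    inv_freq = 1.0 / (base ** (np.arange(half) / half))
    angles = positions[:, None] * inv_freq[None, :]          # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

q = apply_rope(np.random.randn(6, 8), np.arange(6))
print(q.shape)
```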