Data Engineering for Scaling Language Models to 128K Context
💡 Meta Data
Title | Data Engineering for Scaling Language Models to 128K Context |
---|---|
Journal | |
Authors | Yao Fu; Rameswar Panda; Xinyao Niu; Xiang Yue; Hannaneh Hajishirzi; Yoon Kim; Hao Peng |
Pub. date | 2024-02-15 |
Journal Tags | |
DOI | 10.48550/arXiv.2402.10171 |
Attachment | Fu et al_2024_Data Engineering for Scaling Language Models to 128K Context.pdf |
📜 Research Background & Basis & Purpose
The paper studies how to extend a language model's context length to 128K tokens through data engineering. The emphasis is on the data itself: the authors hypothesize that the capability for long-context modeling, in particular the ability to utilize information at arbitrary input positions, is largely acquired during large-scale pretraining, and that this capability can be extended to context lengths far beyond those seen in training (e.g., from 4K to 128K) via lightweight continual pretraining on an appropriate data mixture.
📊 Research Content
“(1) for quantity, we show that 500 million to 5 billion tokens are enough to enable the model to retrieve information anywhere within the 128K context;” (Fu et al., 2024, p. 1) Surprisingly little data is needed: 500M to 5B tokens suffice.
“(2) for quality, our results equally emphasize domain balance and length upsampling. Concretely, we find that naïvely upsampling longer data on certain domains like books, a common practice of existing work, gives suboptimal performance, and that a balanced domain mixture is important.” (Fu et al., 2024, p. 1) For data quality, length upsampling improves capability, but only together with a balanced domain mixture; naively upsampling a single long-document domain like books is suboptimal.
“which asks the model to precisely recite the information in a given sentence where the sentence (the “needle”) is placed in an arbitrary location of a 128K long document (the “haystack”).” (Fu et al., 2024, p. 1) Definition of the needle-in-a-haystack test.
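To make the needle-in-a-haystack setup concrete, below is a minimal sketch of how such a test case could be assembled. The filler text, the needle sentence, the depth parameter, and the character-level truncation are illustrative assumptions, not the paper's actual evaluation harness.

```python
# Minimal needle-in-a-haystack sketch (illustrative; not the paper's actual harness).
# A short "needle" sentence is planted at an arbitrary depth inside a long "haystack"
# document, and the model is asked to recite the planted information.

def build_needle_prompt(haystack: str, needle: str, depth: float, max_chars: int) -> str:
    """Insert `needle` at relative position `depth` (0.0 = start, 1.0 = end)."""
    haystack = haystack[:max_chars]              # crude character budget, not tokens
    cut = int(len(haystack) * depth)
    document = haystack[:cut] + " " + needle + " " + haystack[cut:]
    return document + "\n\nWhat is the magic number mentioned above?\nAnswer:"

# Hypothetical usage: place the needle 40% deep into ~500K characters of filler text.
filler = "Grass grows, birds fly, and the sun shines over the quiet valley. " * 8000
prompt = build_needle_prompt(filler, "The magic number is 42.", depth=0.4, max_chars=500_000)
# Retrieval is scored by checking whether the model's continuation contains "42".
```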
“attention has quadratic complexity” (Fu et al., 2024, p. 1) The original transformer attention has quadratic complexity in sequence length.
“We hypothesize that the capability to utilize information at arbitrary locations within long context length is (mostly) already acquired during pretraining, even for models pretrained on substantially shorter 4K contexts.” (Fu et al., 2024, p. 1) The authors argue that the model already learns to exploit information at arbitrary positions during pretraining.
“because, as we observe, this results in perplexity degradations in other domains (Table 5).” (Fu et al., 2024, p. 2) Upsampling data from a single domain alone degrades performance (perplexity) on the other domains.
“we use 80K compared to Together’s 32K, which does not generalize beyond 32K;” (Fu et al., 2024, p. 3) Continual pretraining is done with an 80K context length.
“data mixture: we use SlimPajama which has balanced domains compared to YaRN, which uses book-only PG19;” (Fu et al., 2024, p. 3) Uses a data mixture with balanced domains rather than a book-only corpus.
“length upsampling: we upsample long sequences compared to LongLoRA, which does not.” (Fu et al., 2024, p. 3) Long sequences are upsampled in a domain-balanced way, unlike LongLoRA.
“Another important related work is the previous LLaMA Long (Xiong et al., 2023) work and the concurrent XVERSE (XVerse, 2024) work, which continue pretraining the model on 32K sequences for about 500 billion tokens. These works are implicitly motivated by the view that long-context modeling is a new capability that must be “injected” through large-scale training. We instead hypothesize that the base model has mostly already acquired this capability through large-scale pretraining, and thus a lightweight continual pretraining on relatively small data (e.g., 5B tokens) is enough to extend these capabilities to much longer context lengths (Fig. 3).” (Fu et al., 2024, p. 3) Contrast with the alternative view that long-context capability must be “injected” via large-scale continual pretraining (~500B tokens); this paper argues the base model already has it and only needs lightweight continual pretraining (~5B tokens).
“We use the SlimPajama (Soboleva et al., 2023) dataset for continual pretraining. This dataset is an open-source reproduction of the LLaMA (Touvron et al., 2023a) pretraining data mixture, consisting of 82% web data (67% from CommonCrawl and 15% from C4), 4.5% code (Github), 4.5% Wikipedia, 4.5% books, 2.5% Arxiv, and 2.0% StackExchange.” (Fu et al., 2024, p. 3) Composition of the SlimPajama dataset used for continual pretraining in this work.
“Since this dataset closely mirrors that used to pretrain the LLaMA models, there is less concern of distribution shift during continual pretraining; it is therefore used by many recent works like Fuzhao Xue & You (2023).” (Fu et al., 2024, p. 3) Because the continual-pretraining data closely matches the LLaMA pretraining distribution, there is little distribution shift and the pretrained weights are not disturbed much.
“Directly upsampling long data changes the domain mixture, e.g., upsampling sequences longer than 100K will increase the portion of the books domain. Likewise, changes in the domain mixture will result in shifts of the length distribution.” (Fu et al., 2024, p. 3) Long data cannot simply be upsampled globally, because doing so changes the mixing ratio across domains (and changing the domain mixture in turn shifts the length distribution).
“Per-source Upsampling: This retains the domain mixture, then upsamples long documents within each domain.” (Fu et al., 2024, p. 3) The core recipe of the paper (see the sketch below).
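A minimal sketch of what per-source upsampling could look like in code: domains are drawn according to their original SlimPajama shares (quoted above), and only the long/short sampling ratio inside each domain is changed. The corpus layout, the 70% target, and the function names are assumptions for illustration, not the authors' implementation.

```python
import random

# Original SlimPajama domain shares, kept unchanged (per-source upsampling).
DOMAIN_MIXTURE = {
    "commoncrawl": 0.67, "c4": 0.15, "github": 0.045, "wikipedia": 0.045,
    "books": 0.045, "arxiv": 0.025, "stackexchange": 0.02,
}

def sample_document(corpora: dict, p_long: float = 0.7) -> str:
    """Pick a domain by its ORIGINAL mixture weight, then draw a long (>4K-token)
    document with probability `p_long` (~70%, vs. ~30% occurring naturally).
    `corpora` is assumed to look like {"books": {"long": [...], "short": [...]}, ...}."""
    domain = random.choices(list(DOMAIN_MIXTURE), weights=DOMAIN_MIXTURE.values(), k=1)[0]
    bucket = "long" if random.random() < p_long else "short"
    return random.choice(corpora[domain][bucket])
```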
“For training, we use a constant learning rate 2e-5. We modify the base of RoPE positional encoding to adjust it to longer context, as in Xiong et al. (2023). We pack all data to 80K chunks regardless of the document boundary, following common practice (Raffel et al., 2020; Touvron et al., 2023a). We set the batch size to be 4M tokens. Note that this batch size is the same as training on 4K context length, as we increase the length of a chunk but decrease the number of chunks in a batch. We train the model on 5B tokens, which translates to 5B (size of data) / 4M (batch size) = 2000 optimization steps.” (Fu et al., 2024, p. 4) Training hyperparameters.
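The quoted hyperparameters can be summarized as a small config. Everything below comes from the quote except `rope_theta`, whose exact value is not stated in the note and is therefore only an assumed placeholder (the paper says the RoPE base is enlarged, following Xiong et al., 2023).

```python
from dataclasses import dataclass

@dataclass
class ContinualPretrainConfig:
    # Values quoted from the paper.
    context_length: int = 80_000          # data packed into 80K-token chunks, ignoring doc boundaries
    learning_rate: float = 2e-5           # constant learning rate
    batch_size_tokens: int = 4_000_000    # same token budget per batch as 4K-context training
    total_tokens: int = 5_000_000_000     # 5B tokens of continual pretraining
    # ASSUMED placeholder: the note only says the RoPE base is increased for longer context.
    rope_theta: float = 500_000.0

cfg = ContinualPretrainConfig()
```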
“In Table 3 we show that our method not only improves precise retrieval, but maintains short context performance, evidenced by strong MMLU (Hendrycks et al., 2020) score” (Fu et al., 2024, p. 5) MMLU is a short-context benchmark, used here to verify that short-context performance is retained.
“Our method outperforms LongLoRA and Yarn Mistral (even though Mistral 7B is a stronger base model than LLaMA 2 7B we use). Our 13B model performance closes the gap to GPT-4 128K, and we anticipate that future scaling and instruction tuning will further improve performance. While there are other long-context benchmarks in InfiniBench (Zhang et al., 2023), in our initial experiments we found that models often had trouble understanding the instruction (because they are not instruction tuned). Hence we focus on the BookQA benchmark where base LLMs performed reasonably without instruction tuning.” (Fu et al., 2024, p. 6) Other long-context benchmarks (the rest of InfiniBench) are instruction-based; since the models here are not instruction-tuned, they struggle to follow those instructions, so the evaluation focuses on BookQA.
“Our hypothesis is that precise retrieval over long-range context is an intrinsic capability obtained by large-scale pretraining, even when the pretraining context length is substantially shorter (4K in many cases). If this hypothesis is true, then lightweight continual pretraining should be enough to extend this capability to much longer context lengths than seen in training. That is, we would not need data-intensive continual pretraining as used by Xiong et al. (2023) and XVerse (2024).” (Fu et al., 2024, p. 6) The paper's central hypothesis.
“At 500M to 1B tokens, the model achieves relatively good performance within its continually pretrained 80K context, but does not generalize to 80K-128K range. After 5B tokens, the model performs well on 0-80K, and can generalize to unseen lengths 80K-128K.” (Fu et al., 2024, p. 6) More continual-pretraining tokens yield stronger long-context retrieval; at about 5B tokens the model also generalizes beyond the 80K training length to 128K, after which returns diminish.
“Our results suggest that for supervised finetuning, since training on long-context is substantially cheaper than previously thought, future work may dive deeper on the solutions for 100K length finetuning and reasoning, which so far has almost no open-source work to our knowledge. For pretraining research, currently there is no definite answer as to whether long-context continual pretraining should be combined with other capabilities, such as math (Azerbayev et al., 2023) and code (Chen et al., 2021), which typically require hundreds of billions of tokens. Our results suggest that long-context continual pretraining could be a separate stage after code and math pretraining.” (Fu et al., 2024, p. 7) Outlook: since long-context continual pretraining is cheap, it can be slotted into the training pipeline as a separate stage, e.g., after code and math pretraining.
“Recall that this strategy keeps the mixture ratio of the data sources the same as the original data, i.e., 67% CommonCrawl (CC), 15% C4, 4.5% Github, 4.5% Wikipedia, 4.5% books, 2.5% Arxiv and 2.0% StackExchange for SlimPajama. Then in each of the domains, we upsample sequences longer than 4K from about 30% to about 70%.” (Fu et al., 2024, p. 7) Note that the upsampling is performed separately within each domain.
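A back-of-the-envelope calculation (my own arithmetic, not from the paper) of the per-domain upsampling factor implied by raising the long-sequence share from ~30% to ~70% while sampling short sequences at their natural rate:

```python
# Solve 0.3*f / (0.3*f + 0.7) = 0.7 for the long-document weight f.
p_long_natural, p_long_target = 0.30, 0.70
f = (p_long_target * (1 - p_long_natural)) / ((1 - p_long_target) * p_long_natural)
print(f"long documents weighted ~{f:.1f}x their natural frequency")  # ~5.4x
```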
“In contrast, globally upsampling long sequences (without considering their domain), or intentionally upsampling code/ book/ Arxiv (since they are long) changes both the domain mixture and the length distribution.” (Fu et al., 2024, p. 7) Per-source upsampling changes only the length distribution of the training data, whereas these other strategies change both the length distribution and the domain mixture.
“Table 5 compares the per-domain loss differences of all the data mixture against the baseline original mixture. We report the differences of the validation loss, where a more than 0.01 loss change is considered significant, following common pretraining practice (Kaplan et al., 2020; Peng et al., 2023; Hoffmann et al., 2022).” (Fu et al., 2024, p. 7) The ablation shows per-source upsampling is the most balanced strategy: it barely raises short-context loss in any domain.
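A trivial sketch of the comparison criterion being described; the 0.01 significance threshold is from the quote, while the function and dictionary layout are illustrative assumptions:

```python
SIGNIFICANT = 0.01  # loss change considered significant, per common pretraining practice

def per_domain_loss_deltas(baseline: dict, candidate: dict) -> dict:
    """Compare per-domain validation losses of a data mixture against the original mixture."""
    deltas = {}
    for domain, base_loss in baseline.items():
        d = candidate[domain] - base_loss
        verdict = "neutral" if abs(d) <= SIGNIFICANT else ("worse" if d > 0 else "better")
        deltas[domain] = (round(d, 4), verdict)
    return deltas
```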
“Note that LongLoRA (Chen et al., 2023b) uses the original data mixture without length upsampling, so our results also explain why we achieve better performance than LongLoRA (Fig. 1). We see that the original data mixture without length upsampling, despite achieving a very close loss, underperforms on precise retrieval. Per-source length upsampling significantly improves precise retrieval. This observation also serves as strong evidence why only using test loss, the evaluation used in most prior work (Chen et al., 2023a; Peng et al., 2023; Chen et al., 2023b; Xiao et al., 2023; Anthropic, 2023), may conceal the underlying model differences.” (Fu et al., 2024, p. 8) Without length upsampling the validation loss is nearly identical, yet needle-in-a-haystack retrieval is much worse; evaluating models by test loss alone is therefore one-sided and can hide real differences between models.
“Long-context language model research at the 100K-level is still a developing research area. This work only studies continual pretraining, and research on instruction finetuning language models on tasks of 100K context length (e.g., repo-level code understanding) is still limited. So far there seem to be no open-source instruction-finetuned 100K context language models. We hope our work serves as a basis for future work on 100K-level long context supervised finetuning.” (Fu et al., 2024, p. 8) Instruction finetuning at the 100K-context level remains an open problem; no open-source instruction-tuned 100K model exists yet.
🚩 Research Conclusions
The paper concludes that lightweight continual pretraining on a carefully engineered data mixture can effectively extend a language model's context length to 128K, and that this lays the groundwork for future research on long-context instruction finetuning.
📌 Thoughts & Questions
This paper proposes a data-engineering approach to long-context modeling: continue pretraining with a much longer context (80K), use a domain-balanced data mixture, and upsample long documents within each domain to improve data quality. Experiments show that the resulting model comes close to GPT-4 128K on the needle-in-a-haystack test while losing little short-context performance.