Transformer Feed-Forward Layers Are Key-Value Memories
💡 Meta Data
Title | Transformer Feed-Forward Layers Are Key-Value Memories |
---|---|
Journal | |
Authors | Mor Geva; Roei Schuster; Jonathan Berant; Omer Levy |
Pub. date | 2021-09-05 |
Journal tags | |
DOI | 10.48550/arXiv.2012.14913 |
Attachment | Geva et al_2021_Transformer Feed-Forward Layers Are Key-Value Memories.pdf |
📜 Background & Motivation & Objectives
Feed-forward layers account for two-thirds of a Transformer model's parameters, yet their role in the network remains under-explored. The authors show that feed-forward layers in Transformer language models operate as key-value memories: each key correlates with textual patterns in the training examples, and each value induces a distribution over the output vocabulary. The output of a feed-forward layer is a composition of its memories, which is then progressively refined through the model's layers via residual connections to produce the final output distribution.
📊 Research Content
Feed-forward layers are almost identical to key-value neural memories; the only difference is that a neural memory uses softmax as the non-linearity, whereas the feed-forward layer in the canonical Transformer uses no normalizing function:

\(\mathrm{FF}(x) = f(x \cdot K^\top) \cdot V\)  (Eq. 1, with \(f\) = ReLU)

\(\mathrm{MN}(x) = \mathrm{softmax}(x \cdot K^\top) \cdot V\)  (Eq. 2)

where \(K, V \in \mathbb{R}^{d_m \times d}\) are the layer's two parameter matrices and \(d_m\) is the number of memory cells.
“We posit that the key vectors K in feed-forward layers act as pattern detectors over the input sequence, where each individual key vector ki corresponds to a specific pattern over the input prefix x1, . . . , xj. To test our claim, we analyze the keys of a trained language model’s feed-forward layers. We first retrieve the training examples (prefixes of a sentence) most associated with a given key, that is, the input texts where the memory coefficient is highest.” (Geva et al., 2021, p. 2)
This indicates that each key vector \(k_i\) (a row of the first parameter matrix \(K\), i.e., a column of \(K^\top\)) acts as a detector for a specific input pattern. When the pattern is present, the corresponding coefficient is high, analogous to an attention score. The matching value vector \(v_i\) (a row of the second parameter matrix \(V\)) encodes a distribution over output tokens for that pattern (obtained by multiplying with the output embedding matrix). Weighting each value vector by its coefficient and summing yields the layer's distribution over the next token; in other words, the output is a mixed (compositional) response.
“Comparing equations 1 and 2 shows that feedforward layers are almost identical to key-value neural memories; the only difference is that neural memory uses softmax as the non-linearity f (·), while the canonical transformer does not use a normalizing function in the feed-forward layer. The hidden dimension dm is essentially the number of memories in the layer, and the activation m = f (x · K>), commonly referred to as the hidden layer, is a vector containing an unnormalized non-negative coefficient for each memory. We refer to each mi as the memory coefficient of the ith memory cell.” (Geva et al., 2021, p. 2) Note: this is how the key-value interpretation of the FFN differs from the keys/values in self-attention.
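To make the comparison between Eq. 1 and Eq. 2 concrete, here is a minimal sketch (my own illustration, not the authors' code) of both computations in PyTorch; the dimensions and random tensors are placeholders.

```python
# Contrast Eq. 1 (transformer FFN) with Eq. 2 (key-value neural memory):
# they differ only in the non-linearity applied to the key scores.
import torch
import torch.nn.functional as F

d, dm = 16, 64                      # model size, number of memory cells (illustrative)
x = torch.randn(d)                  # input representation of one token
K = torch.randn(dm, d)              # each row k_i is a key / pattern detector
V = torch.randn(dm, d)              # each row v_i is the corresponding value

# Eq. 1: FF(x) = f(x·K^T)·V with f = ReLU (unnormalized coefficients)
m_ffn = F.relu(x @ K.T)             # memory coefficients m_i
ffn_out = m_ffn @ V

# Eq. 2: MN(x) = softmax(x·K^T)·V (normalized coefficients)
m_mem = F.softmax(x @ K.T, dim=-1)
mem_out = m_mem @ V

print(ffn_out.shape, mem_out.shape)  # both (d,): a weighted sum of value vectors
```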
“We assume that patterns stored in memory cells originate from examples the model was trained on. Therefore, given a key \(k_i^\ell\) that corresponds to the i-th hidden dimension of the \(\ell\)-th feed-forward layer, we compute the memory coefficient \(\mathrm{ReLU}(x_j^\ell \cdot k_i^\ell)\) for every prefix x1, . . . , xj of every sentence from the WikiText-103’s training set.” (Geva et al., 2021, p. 3)
Multiplying by the key vector yields a single number, the memory coefficient: the representation of the prefix's last token is multiplied by the cell's pattern detector \(k_i\), giving the degree to which this prefix matches that cell's input pattern.
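A small sketch of this coefficient computation, assuming the layer-\(\ell\) hidden states of the prefixes are already available; the names `prefix_reprs` and `k_i` are mine, not from the paper's released code.

```python
import torch
import torch.nn.functional as F

def memory_coefficients(prefix_reprs: torch.Tensor, k_i: torch.Tensor) -> torch.Tensor:
    """ReLU(x_j · k_i) for each prefix representation x_j.
    prefix_reprs: [num_prefixes, d], k_i: [d] -> returns [num_prefixes]."""
    return F.relu(prefix_reprs @ k_i)

# toy example: 5 prefixes, model dimension 16
coeffs = memory_coefficients(torch.randn(5, 16), torch.randn(16))
print(coeffs)  # higher value = the prefix matches cell i's pattern more strongly
```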
“Then, we retrieve the top-t trigger examples, that is, the t prefixes whose representation at layer \(\ell\) yielded the highest inner product with \(k_i^\ell\).” (Geva et al., 2021, p. 3) That is, for each \(k_i\) in each layer, the t sentences (prefixes) with the largest memory coefficients are retrieved.
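Continuing the sketch above, a hedged illustration of the top-t retrieval with `torch.topk`; the helper name and the default value of t are mine and purely illustrative.

```python
import torch

def top_trigger_examples(coeffs: torch.Tensor, prefixes: list, t: int = 25):
    """Return the t (coefficient, prefix) pairs with the largest memory coefficient.
    coeffs: [num_prefixes] memory coefficients for one key k_i."""
    values, indices = torch.topk(coeffs, k=min(t, len(prefixes)))
    return [(float(v), prefixes[int(i)]) for v, i in zip(values, indices)]
```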
“For every layer \(\ell\) and memory dimension i, we compare the top-ranked token according to \(v_i^\ell\) (argmax(\(p_i^\ell\))) to the next token \(w_i^\ell\) in the top-1 trigger example according to \(k_i^\ell\) (the example whose memory coefficient for \(k_i^\ell\) is the highest).” (Geva et al., 2021, p. 4) In other words, the top token of the distribution induced by \(v_i\) is compared against the next token of the sentence that achieved the highest memory coefficient for \(k_i\).
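A sketch of that agreement check, assuming access to a value vector \(v_i\) and the output embedding matrix E; the function names and signatures are my own framing, not the paper's code.

```python
import torch
import torch.nn.functional as F

def value_top_token(v_i: torch.Tensor, E: torch.Tensor) -> int:
    """p_i = softmax(E · v_i); return the id of its top-ranked token.
    v_i: [d], E: [vocab, d]."""
    p_i = F.softmax(E @ v_i, dim=-1)
    return int(torch.argmax(p_i))

def agrees_with_trigger(v_i: torch.Tensor, E: torch.Tensor, w_i_token_id: int) -> bool:
    """Does the value's top prediction equal the next token of k_i's top-1 trigger example?"""
    return value_top_token(v_i, E) == w_i_token_id
```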
“Next, we take the next token of \(k_i\)’s top-1 trigger example (\(w_i^\ell\)), and find where it ranks in the value vector’s distribution \(p_i^\ell\). Figure 5 shows that the rank of the next token of a trigger example increases through the layers, meaning that \(w_i^\ell\) tends to get higher probability in the upper layers.” (Geva et al., 2021, p. 5) That is, for the sentence with the highest score for \(k_i\), we check where its next token (the token to be predicted) ranks in the distribution induced by the corresponding \(v_i\).
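A sketch of the rank statistic behind this observation (Figure 5), under the same assumptions as the previous snippet (rank 0 means \(w_i\) is the most probable token under \(v_i\)).

```python
import torch
import torch.nn.functional as F

def rank_of_next_token(v_i: torch.Tensor, E: torch.Tensor, w_i_token_id: int) -> int:
    """Position of the trigger example's next token w_i in the value distribution p_i."""
    p_i = F.softmax(E @ v_i, dim=-1)
    order = torch.argsort(p_i, descending=True)       # token ids, most probable first
    return int((order == w_i_token_id).nonzero()[0])  # index of w_i in that ordering
```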
“Here, the validation set is used (rather than the training set used to find trigger examples) since we are trying to characterize the model’s behavior at inference time, not find the examples it “memorizes” during training.” (Geva et al., 2021, p. 6) This is why the validation set is used for this analysis.
“While there are cases where a single memory cell dominates the output of a layer, the majority of outputs are clearly compositional. We count the number of instances where the feed-forward layer’s top prediction is different from all of the memories’ top predictions.” (Geva et al., 2021, p. 6) In other words, the check is whether the top token of an individual value \(v_i\)'s distribution matches the top token of the layer output \(y\), the key-weighted combination of the values; in both cases the token in question is the predicted next token of the prefix.
“The fraction of examples in a random sample of 4,000 examples where the layer’s prediction is different from the prediction of all of its memories.” (Geva et al., 2021, p. 7) Figure 8 shows that the composed (mixed-response) output usually differs from the prediction of any individual value vector, and the examples where they do agree are mostly stop words; the compositional output is therefore genuinely meaningful.
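A sketch of the compositionality count described in the two quotes above; treating the layer output as the plain weighted sum of values (ignoring the residual connection) is my simplification.

```python
import torch

def layer_prediction_is_compositional(m: torch.Tensor, V: torch.Tensor, E: torch.Tensor) -> bool:
    """m: [dm] memory coefficients, V: [dm, d] value vectors, E: [vocab, d] output embedding.
    True if the layer's top token differs from every individual memory's top token."""
    y = m @ V                                   # weighted sum of value vectors
    layer_top = int(torch.argmax(E @ y))        # layer's top predicted token
    memory_tops = torch.argmax(V @ E.T, dim=-1) # each value's top token, shape [dm]
    return layer_top not in set(memory_tops.tolist())
```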
“Figure 9 shows that roughly a third of the model’s predictions are determined in the bottom few layers. This number grows rapidly from layer 10 onwards, implying that the majority of “hard” decisions occur before the final layer.” (Geva et al., 2021, p. 7) That is, at some intermediate layer the running prediction already equals the final output of the full model.
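A sketch of how one might locate the layer at which the prediction is "determined", assuming per-layer vocabulary logits (read off the residual stream after each layer) are available; this is my simplified reading of the Figure 9 analysis, not the authors' exact procedure.

```python
import torch

def decision_layer(layer_logits: torch.Tensor) -> int:
    """layer_logits: [num_layers, vocab], vocabulary logits taken after each layer.
    Returns the earliest layer whose top token equals the final layer's top token
    (a simplification of where the model's 'hard decision' happens)."""
    preds = torch.argmax(layer_logits, dim=-1)  # top token per layer
    final = preds[-1]
    return int((preds == final).nonzero()[0])   # first layer that already agrees
```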
🚩 Conclusions
- Role of the feed-forward layers
The authors propose that feed-forward layers emulate key-value memories and present experiments showing that keys correlate with interpretable input patterns, while values, especially in the upper layers, induce distributions over the output vocabulary that correlate with the next-token distribution.
- Significance
These findings provide a new perspective on how Transformer language models work and open a new direction for research on modern NLP models.
📌 Thoughts & Questions
This paper interprets the role of the Transformer's feed-forward layers from a key-value perspective, which is very illuminating. It also concludes that deeper layers learn higher-level, more semantic features of a sentence, while shallow layers learn surface features (e.g., that the sentence ends with a particular word).