flash attention

vllbc, published 2025-07-15, filed under Attention LLM

Safe softmax has no 1-pass algorithm, so is there one for attention? Yes, and that is FlashAttention.

With online attention, computing the attention score from scratch proceeds as follows:

\(\operatorname{NOTATIONS}\)

\(Q[k,:]\): the \(k\)-th row vector of the \(Q\) matrix.
\(O[k,:]\): the \(k\)-th row of the output matrix \(O\).
\(V[i,:]\): the \(i\)-th row of the \(V\) matrix.
\(\{\boldsymbol{o}_i\}\): \(\sum_{j=1}^i a_j V[j,:]\), a row vector storing the partial aggregation result \(A[k,:i]\times V[:i,:]\).

BODY
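The loop body itself is not shown in this excerpt. As a hedged sketch of how a one-pass recurrence over these quantities is commonly written, the pure-Python function below keeps a running maximum, a running denominator, and a running output row for a single query. The function name, the argument layout (`K` stored row-wise), and the omission of the \(1/\sqrt{d}\) scaling and of masking are assumptions made for illustration, not the author's code.

```python
import math

def one_pass_attention_row(q, K, V):
    """Attention output for one query row q, visiting each key/value once.

    q: list of d floats; K: N rows of d floats; V: N rows of dv floats.
    Assumption: no 1/sqrt(d) scaling and no masking, to keep the sketch short.
    """
    m = -math.inf          # running max of the scores, m_0 = -inf
    d = 0.0                # running softmax denominator, d'_0 = 0
    o = [0.0] * len(V[0])  # running normalised output row, o'_0 = 0
    for k_i, v_i in zip(K, V):
        x = sum(a * b for a, b in zip(q, k_i))  # score x_i = q . K[i,:]
        m_new = max(m, x)
        scale = math.exp(m - m_new)             # rescales the old state
        p = math.exp(x - m_new)                 # unnormalised weight of x_i
        d_new = d * scale + p
        # Re-normalise the old output and add the new value row.
        o = [(o_j * d * scale + p * v_j) / d_new for o_j, v_j in zip(o, v_i)]
        m, d = m_new, d_new
    return o
```

After the last iteration `o` should equal the softmax-weighted sum of the rows of `V`, which can be checked against a plain two-step computation on a small input.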

GQA

vllbc, published 2025-07-15, filed under Attention LLM

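As the figure above shows, GQA strikes a balance between MHA and MQA. The query heads are split into groups, the number of groups equals the number of KV heads, and all query heads within a group share the same KV head. GQA reduces both computation and KV cache size while keeping the impact on model quality small.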
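The grouping described above can be sketched as a simple index mapping. This is a minimal illustration, assuming consecutive query heads form a group (a common convention, not something the text specifies); the function name and head counts are hypothetical.

```python
def kv_head_for_query_head(q_head, num_q_heads, num_kv_heads):
    """Return the index of the KV head shared by query head `q_head`.

    Assumption: query heads are grouped consecutively, so heads
    0..group_size-1 use KV head 0, the next group uses KV head 1, and so on.
    """
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be divisible by num_kv_heads")
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size

# 8 query heads sharing 2 KV heads: two groups of four.
mapping = [kv_head_for_query_head(h, 8, 2) for h in range(8)]
print(mapping)  # [0, 0, 0, 0, 1, 1, 1, 1]
```

With `num_kv_heads == num_q_heads` the mapping is the identity (MHA), and with `num_kv_heads == 1` every query head maps to KV head 0 (MQA).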

grpo

vllbc, published 2025-07-15, filed under RLHF LLM

GRPO (trl library)

Key parameters

num_generations: Number of generations to sample. The effective batch size (num_processes * per_device_batch_size * gradient_accumulation_steps) must be evenly divisible by this value.
generation_batch_size: Batch size to use for generation. If None, it defaults to the effective training batch size: per_device_train_batch_size * num_processes * steps_per_generation.
steps_per_generation: Number of optimization steps per generation. If None, it defaults to gradient_accumulation_steps.
num_iterations: Number of iterations per batch (denoted as μ in the algorithm).
per_device_train_batch_size
num_processes (world_size)

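The trl library has relatively few important parameters. According to the official documentation, generation_batch_size = `per_device_train_batch_size * num_processes * steps_per_generation`. gradient_accumulation_steps is usually the same as steps_per_generation (corresponding to mini_batch_size / n_gpus / ppo_micro_batch_size_per_gpu in verl). You can think of per_device_train_batch_size (corresponding to ppo_micro_batch_size_per_gpu in verl) as the per-device batch size used with gradient accumulation; multiplying it by the number of GPUs and then by the number of gradient accumulation steps gives the total batch size (corresponding to train_batch_size * rollout.n in verl). Note that this total batch size (generation_batch_size) is therefore the batch size after rollout sampling; dividing it by num_generations gives the batch size in terms of prompts (train_batch_size in verl). Below is the docstring of the _get_train_sampler method, which is what repeats each prompt num_generations times.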
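The batch-size relationship described above can be worked through with plain arithmetic. The numbers below are hypothetical and chosen only for illustration; the snippet uses the formula quoted in the text and does not call trl.

```python
# Hypothetical settings, chosen only to illustrate the arithmetic.
per_device_train_batch_size = 4
num_processes = 8          # world size (number of GPUs)
steps_per_generation = 2   # assumed equal to gradient_accumulation_steps
num_generations = 8        # completions sampled per prompt

# Total number of completions produced per generation round.
generation_batch_size = (
    per_device_train_batch_size * num_processes * steps_per_generation
)

# The total must split evenly into groups of num_generations completions.
if generation_batch_size % num_generations != 0:
    raise ValueError("generation_batch_size must be divisible by num_generations")

# Number of distinct prompts per generation round.
prompts_per_round = generation_batch_size // num_generations

print(generation_batch_size, prompts_per_round)  # 64 8
```

Here 64 completions correspond to 8 distinct prompts, each repeated 8 times.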

online attention

vllbc, published 2025-07-15, filed under Attention LLM

3-pass

\(\operatorname{NOTATIONS}\)

\(\{m_i\}\): \(\max_{j=1}^i\left\{x_j\right\}\), with initial value \(m_0=-\infty\).
\(\{d_i\}\): \(\sum_{j=1}^i e^{x_j-m_N}\), with initial value \(d_0=0\); \(d_N\) is the denominator of safe softmax.
\(\{a_i\}\): the final softmax value.

BODY \(\textbf{for }i\leftarrow 1, N\textbf{ do}\) \[m_i\leftarrow\max\left(m_{i-1},x_i\right)\] \(\mathbf{end}\)

\(\textbf{for }i\leftarrow 1, N\textbf{ do}\) \[d_i\leftarrow d_{i-1}+e^{x_i-m_N}\] \(\mathbf{end}\)

\(\textbf{for }i\leftarrow 1, N\textbf{ do}\) \[a_i\leftarrow\frac{e^{x_i-m_N}}{d_N}\] \(\mathbf{end}\)
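The three loops above can be sketched in plain Python as follows. This is a minimal illustration on a list of floats; the function name is hypothetical.

```python
import math

def safe_softmax_3pass(xs):
    """Safe softmax computed with three separate passes over xs."""
    # Pass 1: global maximum m_N.
    m = -math.inf
    for x in xs:
        m = max(m, x)
    # Pass 2: denominator d_N, using the final maximum.
    d = 0.0
    for x in xs:
        d += math.exp(x - m)
    # Pass 3: normalised values a_i.
    return [math.exp(x - m) / d for x in xs]
```

Subtracting the maximum before exponentiating keeps every exponent at or below zero, so inputs such as `[1000.0, 1001.0, 1002.0]` do not overflow.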

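This is the 3-pass way of computing attention: each pass needs the result of the previous one before it can continue. Because SRAM does not have enough room to hold the intermediate values, this requires multiple rounds of memory access.

### Online attention

\[\begin{aligned} d_i^{\prime} &= \sum_{j=1}^i e^{x_j-m_i} \\ &= \left(\sum_{j=1}^{i-1} e^{x_j-m_i}\right)+e^{x_i-m_i} \\ &= \left(\sum_{j=1}^{i-1} e^{x_j-m_{i-1}}\right)e^{m_{i-1}-m_i}+e^{x_i-m_i} \\ &= d_{i-1}^{\prime} e^{m_{i-1}-m_i}+e^{x_i-m_i} \end{aligned}\]

With this recurrence in hand, the computation drops from 3 passes to 2. Since \(d_N^{\prime}=\sum_{j=1}^N e^{x_j-m_N}=d_N\), the final pass is unchanged.

\[\begin{aligned}&\mathbf{for~}i\leftarrow1,N\textbf{ do}\\&\qquad m_i\leftarrow\max\left(m_{i-1},x_i\right)\\&\qquad d_i^{\prime}\leftarrow d_{i-1}^{\prime}e^{m_{i-1}-m_i}+e^{x_i-m_i}\\&\mathbf{end}\\&\mathbf{for~}i\leftarrow1,N\textbf{ do}\\&\qquad a_i\leftarrow\frac{e^{x_i-m_N}}{d_N^{\prime}}\\&\mathbf{end}\end{aligned}\]

The FLOP count does not seem to go down, and even rises slightly, because every step now has to compute an extra rescaling factor.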
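The 2-pass version can be sketched in plain Python like this. It is a minimal illustration of the recurrence for \(d_i^{\prime}\); the function name is hypothetical, and inputs are assumed to be finite floats.

```python
import math

def online_softmax_2pass(xs):
    """Safe softmax with the max and the denominator fused into one pass."""
    m = -math.inf  # running maximum, m_0 = -inf
    d = 0.0        # running denominator d'_0 = 0, relative to the running max
    # Pass 1: update the max and rescale the denominator together.
    for x in xs:
        m_new = max(m, x)
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    # Pass 2: normalise with the final max and denominator.
    return [math.exp(x - m) / d for x in xs]
```

The extra `math.exp(m - m_new)` per element is the rescaling cost mentioned above; what is saved is one full read of the input, not arithmetic.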

paged attention

vllbc, published 2025-07-15, filed under Attention LLM

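References

图解大模型计算加速系列之：vLLM核心技术PagedAttention原理 (Illustrated LLM Compute Acceleration Series: The Principles of PagedAttention, vLLM's Core Technique)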

ppo

vllbc, published 2025-07-15, filed under RLHF LLM

PPO (OpenRLHF library)

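This note focuses on how experience is collected; the training itself is quite simple. In RLHF, the actor performs auto-regressive decoding, whereas the critic, reward, and reference models only run prefill and never decode. We therefore call the actor's inference rollout, and refer to the other models' inference simply as inference.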