qwen

------------------------------------------------------------------------------------------------------
# Qwen-1
+ Embedding and output projection. (Untied embedding for input embedding and output projection)
+ ROPE
+ QKV bias required
+ Pre-Norm & RMSNorm
+ SwiGLU

------------------------------------------------------------------------------------------------------
# Qwen-2
- Multi-Head Attention

+ MoE
+ Grouped Query Attention
+ Dual Chunk Attention
+ YARN
+ Expert Granularity
+ Expert Routing
+ Expert Initialization
+ Shared Experts

------------------------------------------------------------------------------------------------------
# Qwen-2.5
+ More control tokens. 3 -> 22

------------------------------------------------------------------------------------------------------
# Qwen-3
- QKV bias
- Shared experts

+ QK-Norm
------------------------------------------------------------------------------------------------------

参考

Qwen 各版本主要结构变化