------------------------------------------------------------------------------------------------------
# Qwen-1
+ Untied input embedding and output projection (separate weight matrices)
+ RoPE
+ QKV bias (bias terms on the attention Q/K/V projections)
+ Pre-Norm & RMSNorm
+ SwiGLU
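
The Pre-Norm, RMSNorm, and SwiGLU items above can be sketched in a few lines of plain Python. This is a minimal illustration, not Qwen's actual implementation: the function names (`rms_norm`, `swiglu_ffn`, `pre_norm_ffn_block`), the `eps` default, and the list-of-lists weight layout are all assumptions made for readability.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide x by its root mean square, then apply a learned per-dimension scale."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: down( silu(gate(x)) * up(x) ), without bias terms."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)

def pre_norm_ffn_block(x, norm_weight, w_gate, w_up, w_down):
    """Pre-Norm: normalize the input first, run the sublayer, then add the residual."""
    y = swiglu_ffn(rms_norm(x, norm_weight), w_gate, w_up, w_down)
    return [a + b for a, b in zip(x, y)]
```

Unlike LayerNorm, RMSNorm subtracts no mean and has no bias, which is why only a scale vector appears in the sketch.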
------------------------------------------------------------------------------------------------------
# Qwen-2
- Multi-Head Attention
+ MoE
+ Grouped Query Attention
+ Dual Chunk Attention
+ YaRN
+ Expert Granularity
+ Expert Routing
+ Expert Initialization
+ Shared Experts
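
Two of the items above, Grouped Query Attention and expert routing with shared experts, can be illustrated with a small sketch. Everything here is an assumption for illustration: the helper names, the choice of `top_k=2`, using the raw softmax probabilities as mixing weights without renormalization, and adding the shared experts ungated. The real models may differ on each of these points.

```python
import math

def kv_head_for_query_head(q_head, num_q_heads, num_kv_heads):
    """GQA: consecutive query heads share one key/value head."""
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, router_logits, routed_experts, shared_experts, top_k=2):
    """Mix the top_k routed experts by router probability, then add every shared expert.

    routed_experts and shared_experts are lists of callables mapping a vector to a vector.
    Returns the output vector and the indices of the selected routed experts.
    """
    probs = softmax(router_logits)
    # Indices of the top_k most probable routed experts, highest first.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    out = [0.0] * len(x)
    for i in top:
        y = routed_experts[i](x)
        out = [o + probs[i] * v for o, v in zip(out, y)]
    # Shared experts see every token regardless of the routing decision.
    for expert in shared_experts:
        y = expert(x)
        out = [o + v for o, v in zip(out, y)]
    return out, top
```

With 8 query heads and 2 key/value heads, `kv_head_for_query_head` maps heads 0-3 to KV head 0 and heads 4-7 to KV head 1, so the KV cache is a quarter of the Multi-Head Attention size.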
------------------------------------------------------------------------------------------------------
# Qwen-2.5
+ More control tokens (3 -> 22)
------------------------------------------------------------------------------------------------------
# Qwen-3
- QKV bias
- Shared experts
+ QK-Norm
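
As a rough sketch of QK-Norm: the query and key vectors of each head are normalized before the attention dot product, which bounds the logit magnitude. The version below assumes RMSNorm over the head dimension with separate learned scales for queries and keys; the function names and the `eps` value are illustrative, and RoPE and the softmax over positions are omitted.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm over one head's vector."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

def qk_norm_score(q, k, q_weight, k_weight):
    """Attention logit for one (query, key) pair with QK-Norm applied to both sides."""
    qn = rms_norm(q, q_weight)
    kn = rms_norm(k, k_weight)
    return sum(a * b for a, b in zip(qn, kn)) / math.sqrt(len(q))

# Usage: rescaling the raw query barely changes the logit, since the norm removes its scale.
q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
ones = [1.0] * 4
base = qk_norm_score(q, k, ones, ones)
scaled = qk_norm_score([100.0 * v for v in q], k, ones, ones)
```

With unit scales, each normalized vector has norm close to sqrt(d), so the logit magnitude cannot exceed roughly sqrt(d) for head dimension d.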
------------------------------------------------------------------------------------------------------
Reference
Qwen
Main architectural changes across versions