Layer Norm
Pre-norm
Pre-norm: \(X_{t+1}=X_{t}+F_{t}(Norm(X_{t}))\)
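To make this recurrence concrete, here is a minimal sketch of one pre-norm residual block (an illustrative PyTorch stand-in, not from the original text; the sublayer \(F_t\) is shown as a small MLP, though in a Transformer it would be self-attention or a feed-forward layer):

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """One pre-norm residual step: X_{t+1} = X_t + F_t(Norm(X_t))."""

    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        # Stand-in for the sublayer F_t; in a Transformer this would be
        # self-attention or the feed-forward network.
        self.f = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize first, apply the sublayer, then add the raw residual.
        return x + self.f(self.norm(x))
```

Note that the residual branch bypasses the normalization entirely, which is what lets \(X_t\) keep growing with depth.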
Let us look at Pre-norm first and unroll the recursion:

\[X_{t+1}=X_t+F_t(Norm(X_t))\]
\[=X_{0}+F_{0}(Norm(X_{0}))+F_{1}(Norm(X_{1}))+\ldots+F_{t-1}(Norm(X_{t-1}))+F_{t}(Norm(X_{t}))\]

Every term in this expansion (\(F_{0}(Norm(X_{0})),\ldots,F_{t-1}(Norm(X_{t-1})),F_{t}(Norm(X_{t}))\)) has the same magnitude, since each sublayer acts on a normalized input. The difference between \(F_0(Norm(X_0))+\ldots+F_{t-1}(Norm(X_{t-1}))+F_t(Norm(X_t))\) and \(F_0(Norm(X_0))+\ldots+F_{t-1}(Norm(X_{t-1}))\) is therefore like the difference between \(t+1\) and \(t\), which we can record as \(X_{t+1}=\mathscr{O}(t+1)\). This property means that once \(t\) is large enough, the difference between \(X_{t+1}\) and \(X_t\) becomes negligible (intuitively), so we have:
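As a rough numerical illustration of this intuition (a hypothetical toy experiment, not part of the original derivation: the dimension `d`, the random linear maps standing in for \(F_t\), and the depths printed are all made up), we can stack random pre-norm updates and watch the relative step size \(\|X_{t+1}-X_t\|/\|X_t\|\) shrink as \(t\) grows:

```python
import torch

# Toy check: iterate X_{t+1} = X_t + F_t(Norm(X_t)) with fresh random
# linear maps as F_t, and track how small each update is relative to X_t.
torch.manual_seed(0)
d = 64
norm = torch.nn.LayerNorm(d)
x = torch.randn(d)

with torch.no_grad():
    for t in range(1, 101):
        w = torch.randn(d, d) / d ** 0.5   # random stand-in for F_t
        update = norm(x) @ w               # F_t(Norm(X_t)): O(1) magnitude
        if t in (1, 10, 100):
            rel = update.norm().item() / x.norm().item()
            print(f"t={t:3d}  ||X_t||={x.norm().item():7.2f}  rel. update={rel:.3f}")
        x = x + update
```

Because \(\|X_t\|\) keeps growing while each update stays \(\mathscr{O}(1)\), the printed relative update steadily decays, matching the claim that \(X_{t+1}\) and \(X_t\) become nearly indistinguishable at large depth.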