Discussion about this post

Brilliant walkthrough on the decoder architecture. The way you broke down pre-LayerNorm vs post-LayerNorm makes much more sense now; I always wondered why modern LLMs switched to Pre-LN. When I was implementing transformers last year I kept getting gradient issues and didn't realize the LayerNorm placement was key. Really appreciate the dimension tracking alongside the code too.
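For anyone else reading along, the placement difference is easy to see in a minimal sketch. This is not the post's code, just a simplified numpy illustration (LayerNorm here omits the learned scale/shift for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the last (feature) dimension; learned gain/bias omitted.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    # Original Transformer: normalize AFTER the residual add, so every
    # block re-normalizes the residual stream itself.
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    # Pre-LN: normalize only the sublayer's input; the residual path is a
    # pure identity, which is what keeps gradients well-behaved at depth.
    return x + sublayer(layer_norm(x))
```

One way to see why Pre-LN trains more stably: if a sublayer contributes nothing (outputs zeros), a Pre-LN block passes `x` through completely unchanged, while a Post-LN block still re-normalizes it, so the identity path is never truly an identity.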

