Brilliant walkthrough on the decoder architecture. The way you broke down pre-LayerNorm vs post-LayerNorm makes way more sense now; I always wondered why modern LLMs switched to Pre-LN. When I was implementing transformers last year I kept getting gradient issues and didn't realize the LayerNorm placement was key. Really appreciate the dimension tracking alongside the code too.
Thank you! I’m glad that you found it helpful!
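For anyone else who ran into the same gradient issues, here's a stripped-down PyTorch sketch (not the exact code from the post; causal mask and dropout omitted for brevity) showing that the only difference between the two variants is where LayerNorm sits relative to the residual connection:

```python
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Original Transformer ordering: sublayer -> add residual -> LayerNorm."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # LayerNorm is applied *after* the residual add, so every layer's
        # output (and its gradient) passes through a normalization.
        x = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])
        x = self.norm2(x + self.ffn(x))
        return x

class PreLNBlock(nn.Module):
    """Modern (Pre-LN) ordering: LayerNorm -> sublayer -> add residual."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # The residual path stays un-normalized, so gradients can flow
        # straight through the identity branch -- this is what keeps
        # deep stacks trainable without careful warmup.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.ffn(self.norm2(x))
        return x
```

Same parameters in both, but with Post-LN every gradient has to pass through the norms, which is why deep Post-LN stacks tend to need learning-rate warmup while Pre-LN trains stably out of the box.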