10 Confusing LLM Concepts, Explained Simply
The role of CPU/GPU/TPU in LLM workflows, pruning, quantization, and more
10. On-policy vs. Off-policy learning
A policy is the strategy by which an AI agent chooses actions in a given state. For an LLM, the model is the policy itself.
With on-policy learning for an LLM, the model learns from its own responses.
A common way to improve LLMs on math and coding tasks goes like this:
Given a query or prompt, the model produces a group of responses.
The responses are scored using a verifier or a reward model.
An algorithm like GRPO (Group Relative Policy Optimization) is used to train the model to make responses that score above the group average more likely and responses that score below it less likely.
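The scoring step above can be sketched in a few lines. This is a minimal illustration of the group-relative idea only, not a full GRPO implementation: the function name `group_relative_advantages`, the `eps` constant, and the reward values are assumptions made for this example, and the use of the population standard deviation is one possible choice.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Score each response relative to the other responses in its own group."""
    mu = mean(rewards)        # group average reward
    sigma = pstdev(rewards)   # group spread (population std, an assumption here)
    # Positive value: better than the group average; negative: worse.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical verifier scores for four responses sampled for one prompt
# (1.0 = correct answer, 0.0 = incorrect answer).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

During training, responses with a positive advantage are made more likely and those with a negative advantage less likely. The actual policy update (clipping, KL penalty, and so on) is omitted from this sketch.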
In contrast, with off-policy learning, an LLM learns from responses that it did not generate itself. These could come from a stronger model (the teacher model), a different version of the same model, or a dataset.
A commonly used off-policy approach for improving LLM performance in specific domains is called distillation. In this approach:
Responses or reasoning traces are generated using a strong (teacher) model.
A weaker (student) model is trained on these responses to imitate the teacher.
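One way to picture "training to imitate" is as minimizing the student's negative log-likelihood on the teacher's tokens. The sketch below is a toy illustration under that assumption: the function name `imitation_loss`, the three-word vocabulary, and the probabilities are all made up for the example, and a real setup would compute this loss over model logits in a deep learning framework.

```python
import math

def imitation_loss(student_probs, teacher_tokens):
    """Average negative log-likelihood of the teacher's tokens under the student."""
    nll = [-math.log(step[token]) for step, token in zip(student_probs, teacher_tokens)]
    return sum(nll) / len(nll)

# Hypothetical student distributions at each of two generation steps.
student_probs = [
    {"the": 0.7, "a": 0.2, "cat": 0.1},
    {"the": 0.1, "a": 0.1, "cat": 0.8},
]
# Tokens the teacher actually produced (the off-policy training data).
teacher_tokens = ["the", "cat"]

loss = imitation_loss(student_probs, teacher_tokens)
print(loss)
```

The loss shrinks as the student assigns higher probability to the teacher's tokens, which is what training on the teacher's responses pushes it to do.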
There’s also a recently introduced approach that is gaining popularity and combines on-policy and off-policy learning. It is called on-policy distillation, in which the student generates its own responses and learns from a teacher's feedback on them. Curious readers are encouraged to learn more about it.
The images used in this lesson come from my book LLMs In 100 Images, which is a collection of 100 easy-to-follow visuals that explain the most important concepts you need to master to understand LLMs today.
9. Pretraining vs. Mid-training vs. Post-training
All three are phases of training LLMs before they are released to end users.
Pretraining is the initial training phase that teaches a model the structure of a language and gives it basic factual knowledge of the world.
Pretraining occurs on massive datasets such as Common Crawl or FineWeb, which contain trillions of tokens from the web. This process is usually highly compute-intensive, and the resulting pretrained model is called the “base model”.
Mid-training is the next phase, which involves further training of the base model using higher-quality, domain-specific data to improve its capabilities. This data might focus on domains like health, law, math, or code, or on improving reasoning, extending context length, or adding a new language.
Post-training is the final phase that teaches a model to become a useful and human-value-aligned assistant, rather than one that just produces the next token. Some common post-training techniques include:
Supervised fine-tuning (SFT) on datasets consisting of instruction-response pairs.
RL training methods like RLHF (Reinforcement Learning from Human Feedback) to teach a model to produce human-value-aligned responses, or RLVR (Reinforcement Learning with Verifiable Rewards) to teach a model to reason through math and coding problems and solve them correctly.
8. Zero-shot vs. One-shot vs. Few-shot prompting
All three are approaches for prompting an LLM to complete a task well.
Zero-shot prompting is when a user describes a task in the prompt without giving any examples, and the model relies on what it learned during training to complete it.
One-shot prompting is when a user provides a single example of how to complete a task in the prompt.
Few-shot prompting is when a user provides several examples of how to complete a task in the prompt.
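The difference between the three comes down to how many worked examples the prompt contains. Here is a minimal sketch that assembles each kind of prompt as plain text. The helper `build_prompt`, the `Input:`/`Output:` layout, and the sentiment task are assumptions chosen for illustration, not a required format.

```python
def build_prompt(instruction, query, examples=()):
    """Assemble a prompt; the number of examples makes it zero-, one-, or few-shot."""
    parts = [instruction]
    for example_input, example_output in examples:
        parts.append(f"Input: {example_input}\nOutput: {example_output}")
    # The final block is left open so the model fills in the answer.
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)

instruction = "Classify the sentiment of the input as Positive or Negative."
examples = [
    ("I loved this movie.", "Positive"),
    ("The food was cold.", "Negative"),
    ("Great service!", "Positive"),
]
query = "The plot was dull."

zero_shot = build_prompt(instruction, query)               # no examples
one_shot = build_prompt(instruction, query, examples[:1])  # a single example
few_shot = build_prompt(instruction, query, examples)      # several examples
print(few_shot)
```

Each resulting string would then be sent to the model as the prompt; only the number of examples changes between the three variants.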
7. CPU vs. GPU vs. TPU in LLM workflows
The role of different semiconductor chips in LLM training and inference is frequently confused.









