Into AI

10 Confusing LLM Concepts, Explained Simply

The role of CPU/GPU/TPU in LLM workflows, pruning, quantization, and more

Dr. Ashish Bamania
May 30, 2026

10. On-policy vs. Off-policy learning

A policy is the strategy by which an AI agent chooses actions in a given state. For an LLM, the model itself is the policy: given the tokens so far (the state), it chooses the next token (the action).

With on-policy learning, an LLM learns from responses that it generated itself.

A common on-policy approach for improving LLMs on math and coding tasks goes like this:

  • Given a query or prompt, a model produces a group of responses

  • The responses are scored using a verifier or a reward model

  • An algorithm like GRPO (Group Relative Policy Optimization) updates the model so that responses scoring above the group average become more likely and responses scoring below it become less likely.
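
To make the "above average vs. below average" idea concrete, here is a minimal sketch of the group-relative scoring step only. It assumes the rewards have already been produced by a verifier or reward model, and it normalizes by the group's standard deviation, as the original GRPO formulation does. The function name and the toy rewards are hypothetical, and the actual policy update is not shown.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Score each response relative to its group: (reward - mean) / std.

    Hypothetical helper for illustration; eps avoids division by zero
    when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Assumed verifier scores for four responses to the same prompt
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Positive advantage: the response is reinforced.
# Negative advantage: the response is pushed down.
for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.1f} advantage={advantage:+.3f}")
```

Note that when every response in a group gets the same reward, all advantages are zero, so that group contributes no learning signal.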

In contrast, with off-policy learning, an LLM learns from responses that it did not generate itself. These could come from a stronger model (the teacher model), a different version of the same model, or a dataset.

A commonly used off-policy approach for improving an LLM's performance in specific domains is called distillation. In this approach:

  • Responses or reasoning traces are generated using a strong (teacher) model

  • A weaker (student) model is trained on those to imitate the teacher's responses
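
The imitation step above can be sketched as a loss that is low when the student assigns high probability to the teacher's tokens. This is a toy illustration under stated assumptions: the student's next-token distributions are hard-coded dictionaries rather than real model outputs, and the function name is hypothetical. Other distillation variants match the teacher's full probability distribution instead of only its sampled tokens.

```python
import math

def imitation_loss(student_probs, teacher_tokens):
    """Average negative log-likelihood of the teacher's tokens under the student.

    student_probs[i] is the student's (assumed) next-token distribution
    at position i; teacher_tokens[i] is the token the teacher produced there.
    """
    nll = 0.0
    for step_probs, token in zip(student_probs, teacher_tokens):
        nll -= math.log(step_probs[token])
    return nll / len(teacher_tokens)

# Toy example: the teacher answered "4" and then ended the sequence
teacher_tokens = ["4", "<eos>"]
student_probs = [
    {"4": 0.5, "5": 0.5},   # the student is unsure about the answer
    {"<eos>": 1.0},         # the student is certain about stopping
]
loss = imitation_loss(student_probs, teacher_tokens)
print(f"imitation loss: {loss:.4f}")
```

Training lowers this loss, which moves the student's probabilities toward the teacher's responses.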

A recently introduced approach that is gaining popularity combines on-policy and off-policy learning. It is called on-policy distillation: the student generates its own responses, and a teacher model provides the training signal on them. Curious readers are encouraged to learn more about it.


The images used in this lesson come from my book LLMs In 100 Images, which is a collection of 100 easy-to-follow visuals that explain the most important concepts you need to master to understand LLMs today.


9. Pretraining vs. Mid-training vs. Post-training

All three are phases of training LLMs before they are released to end users.

Pretraining is the initial training phase that teaches a model the structure of a language and gives it basic factual knowledge of the world.

Pretraining uses massive datasets such as Common Crawl or FineWeb, which consist of trillions of tokens from the web. This phase is usually highly compute-intensive, and the resulting pretrained model is called the "base model".

Mid-training is the next phase, in which the base model is trained further on higher-quality, more targeted data to improve its capabilities. This data might cover domains such as health, law, math, or code, or it might be chosen to improve reasoning, extend context length, or add a new language.

Post-training is the final phase that teaches a model to become a useful and human-value-aligned assistant, rather than one that just produces the next token. Some common post-training techniques include:

  • Supervised fine-tuning (SFT) on datasets consisting of instruction-response pairs

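  • RL training methods such as RLHF (Reinforcement Learning from Human Feedback), which teaches a model to produce responses aligned with human values, and RLVR (Reinforcement Learning with Verifiable Rewards), which teaches a model to reason through math and coding problems whose answers can be checked automatically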
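
As an illustration of what "verifiable" can mean in RLVR, here is a minimal sketch of a rule-based reward for math problems. It assumes the final answer is the last number in the response, which is a simplification. The function name and the matching rule are hypothetical and not taken from any specific training setup.

```python
import re

def verifiable_reward(response: str, reference_answer: str) -> float:
    """Return 1.0 if the last number in the response equals the reference, else 0.0.

    Hypothetical verifier for illustration: real verifiers usually
    require a stricter answer format or run unit tests for code.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    if not numbers:
        return 0.0
    return 1.0 if float(numbers[-1]) == float(reference_answer) else 0.0

print(verifiable_reward("2 + 2 = 4, so the answer is 4", "4"))
print(verifiable_reward("I think the answer is 5", "4"))
```

Because the reward comes from a check rather than from human preference labels, it can be computed automatically for every sampled response.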


8. Zero-shot vs. One-shot vs. Few-shot prompting

All three are approaches for prompting an LLM to complete a task well.

Zero-shot prompting is when a user describes a task in the prompt without giving any examples, and the model relies on what it learned during training to complete it.

One-shot prompting is when a user includes a single example of the completed task in the prompt.

Few-shot prompting is when a user includes several examples of the completed task in the prompt.
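
The three styles differ only in how many worked examples the prompt contains. The sketch below builds all three from the same task description. The `build_prompt` helper, the "Input:/Output:" layout, and the sentiment task are assumptions chosen for illustration, not a required format.

```python
def build_prompt(task, examples, query):
    """Assemble a prompt from a task description, worked examples, and a query.

    Hypothetical helper: zero examples gives a zero-shot prompt,
    one gives one-shot, and several give few-shot.
    """
    parts = [task]
    for example_input, example_output in examples:
        parts.append(f"Input: {example_input}\nOutput: {example_output}")
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)

task = "Classify the sentiment of the input as positive or negative."
examples = [
    ("I loved this film.", "positive"),
    ("The plot was dull.", "negative"),
]
query = "The acting was superb."

zero_shot = build_prompt(task, [], query)
one_shot = build_prompt(task, examples[:1], query)
few_shot = build_prompt(task, examples, query)

print(few_shot)
```

The resulting string is what gets sent to the model; the model is expected to continue after the final "Output:".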


7. CPU vs. GPU vs. TPU in LLM workflows

The role of different semiconductor chips in LLM training and inference is frequently confused.
