Build and train a Diffusion LLM from scratch

An end-to-end guide to training a LLaDA-style Diffusion LLM and using it to generate text.

Dr. Ashish Bamania

Jun 14, 2026

In the previous lessons on ‘Into AI’, we learned how to build and train an LLM from scratch.

Build and train an LLM from scratch

Dr. Ashish Bamania

December 31, 2025

We then deepened our understanding by building and training a Mixture-of-Experts (MoE) LLM from scratch.

Build and Train a Mixture-of-Experts (MoE) LLM from Scratch

Dr. Ashish Bamania

Mar 20

Both of these models were trained with the next-token prediction objective and generate text autoregressively: one token at a time, left to right.

But this is not the only way that a model can be used to generate text.

Diffusion LLMs instead generate tokens in parallel using a process called diffusion. One of the most successful examples of this type of LLM is LLaDA (Large Language Diffusion with mAsking), which we discussed in depth in the following lesson.

Diffusion LLMs, Explained Simply

Dr. Ashish Bamania

Apr 23

Google also released its experimental open-source model, DiffusionGemma, this week, which operates on the same principle.

In this lesson, we will take our understanding to the next level and learn to:

Implement a 13-million-parameter Diffusion LLM from scratch
Train it on a publicly available pre-training dataset using a free GPU
Generate text using it

Let’s begin!

Setting up the environment

We will code in PyTorch, use the Hugging Face datasets library to load the training dataset, and use the transformers library to obtain the tokenizer.

The code is meant to run on Google Colaboratory and uses the free NVIDIA T4 GPU to train our model.

# Install packages
!uv pip install torch datasets transformers tqdm

# PyTorch core imports
import torch
import torch.nn as nn
import torch.nn.functional as F

# For numerical operations
import math

# For data processing
from torch.utils.data import DataLoader, Dataset

# Tokenizer
from transformers import AutoTokenizer

# Optimizer
import torch.optim as optim

# For mixed-precision training 
from torch import amp
from torch.nn.utils import clip_grad_norm_

# To visualise progress bar
from tqdm import tqdm

# Hide warnings (note: this silences every warning category, not only deprecations)
import warnings
warnings.filterwarnings('ignore')

# Set the device (GPU or CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Setting up the tokenizer

Instead of building the tokenizer from scratch, we will use the BPE tokenizer for GPT-2. This gives us a 50,257-token vocabulary made up of subwords.

tokenizer = AutoTokenizer.from_pretrained("gpt2")

print(f"Original vocabulary size: {tokenizer.vocab_size}")
# Original vocabulary size: 50257

The end-of-sequence (EOS) token is at the last index in the vocabulary.

print(tokenizer.eos_token)
# <|endoftext|>

EOS_ID = tokenizer.eos_token_id 

print(EOS_ID)
# 50256

To train a diffusion model, we need a special <MASK> token. This token acts as a placeholder that marks the input positions the model should fill in. The GPT-2 vocabulary has no such token, so we assign it the next free ID after the end of the vocabulary.

MASK_ID = tokenizer.vocab_size

print(MASK_ID)
# 50257

# Increase the vocabulary size by 1 for the newly added <MASK> token
VOCAB_SIZE = tokenizer.vocab_size + 1 

print(VOCAB_SIZE)
# 50258

Next, we create two helper functions to encode and decode text using the tokenizer.

# Convert input text into a list of token IDs using the tokenizer
def encode(text): 
    return tokenizer.encode(text, add_special_tokens=False)

# Remove any MASK_ID tokens from the sequence and convert the list of token IDs back into readable text
def decode(ids):
    ids = [i for i in ids if i != MASK_ID]
    return tokenizer.decode(ids, skip_special_tokens=True)
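To see why `decode` filters out `MASK_ID` first, here is a minimal sketch that uses a tiny made-up vocabulary in place of the GPT-2 tokenizer. The names `toy_vocab`, `TOY_MASK_ID`, and `toy_decode` are hypothetical and exist only for this illustration. The mask ID sits one past the last real entry, so the lookup table has nothing to map it to, and a partially filled sequence can only be rendered once the placeholders are dropped.

```python
# Toy stand-in for the tokenizer: IDs 0..4 are real tokens (hypothetical vocabulary)
toy_vocab = ["Once", " upon", " a", " time", "<|endoftext|>"]

# As in the lesson, the mask ID is appended right after the last real ID
TOY_MASK_ID = len(toy_vocab)  # 5, which has no entry in toy_vocab

def toy_decode(ids):
    # Drop mask placeholders first, then map the remaining IDs back to text
    kept = [i for i in ids if i != TOY_MASK_ID]
    return "".join(toy_vocab[i] for i in kept)

# A partially masked sequence, as it might look midway through generation
partially_masked = [0, TOY_MASK_ID, 2, TOY_MASK_ID]
print(toy_decode(partially_masked))
# Once a
```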

Getting our data ready

We will train our diffusion LLM on the TinyStories dataset. It is a synthetic dataset of short stories, generated by GPT-3.5 and GPT-4, that uses only words a typical 3 to 4-year-old would understand.

We will use a subset of this dataset that is small enough to train our model on a free-tier GPU, yet rich enough for it to learn semantic details. It is downloaded as follows.

from datasets import load_dataset

# Download the first 5,000 stories from the 'train' split of TinyStories
dataset = load_dataset("roneneldan/TinyStories", split="train[:5000]")

Each row in the dataset is one complete story. Here is the first one.

print(f"Example Story: \n\n{dataset[0]['text']}")

"""
Example Story: 

One day, a little girl named Lily found a needle in her room. She knew it was difficult to play with it because it was sharp. Lily wanted to share the needle with her mom, so she could sew a button on her shirt.
Lily went to her mom and said, "Mom, I found this needle. Can you share it with me and sew my shirt?" Her mom smiled and said, "Yes, Lily, we can share the needle and fix your shirt."
Together, they shared the needle and sewed the button on Lily's shirt. It was not difficult for them because they were sharing and helping each other. After they finished, Lily thanked her mom for sharing the needle and fixing her shirt. They both felt happy because they had shared and worked together.
"""

We pre-process this dataset of stories by:

Tokenizing each story (converting its text into sub-word token IDs)
Joining them into a single list of token IDs
Appending an EOS token after each story so the model can learn where one story ends and the next begins

def clean_and_tokenize(dataset):
    token_ids = []

    for text in dataset["text"]:
        # Remove leading/trailing whitespace from each story
        story = text.strip()

        # Skip empty entries
        if not story:
            continue

        # Encode the story into token IDs and append to the sequence
        token_ids.extend(encode(story))

        # Add an EOS token to separate stories
        token_ids.append(EOS_ID)

    return token_ids

print("Preprocessing dataset...")

token_ids = clean_and_tokenize(dataset)

print(f"Total training tokens: {len(token_ids):,}")
# Total training tokens: 1,033,087

If you’ve previously trained an autoregressive LLM from scratch, you will be familiar with the standard approach of shifting the input sequence by one token and using the shifted sequence as the target during training.
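For reference, that autoregressive setup can be sketched in a few lines of plain Python. The token IDs here are made up for illustration, and `make_next_token_pair` is a hypothetical helper, not code from the earlier lessons.

```python
def make_next_token_pair(token_ids):
    # Input is every token except the last; target is the same sequence shifted left by one
    inputs = token_ids[:-1]
    targets = token_ids[1:]
    return inputs, targets

# Made-up token IDs standing in for a tokenized sentence
sequence = [10, 11, 12, 13, 14]
inputs, targets = make_next_token_pair(sequence)

print(inputs)   # [10, 11, 12, 13]
print(targets)  # [11, 12, 13, 14]

# At position i, the model sees inputs[: i + 1] and must predict targets[i]
```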

Diffusion LLMs aren’t trained this way.

Instead of next-token prediction, they take fixed-length sequences, replace a randomly chosen subset of the tokens with <MASK>, and train the model to recover the original tokens from the masked version.
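As a rough sketch of this idea, the snippet below cuts a token stream into fixed-length blocks and masks each position independently with a probability `t` drawn once per sequence, which is the forward process described for LLaDA. It uses plain Python and made-up token IDs so that it runs anywhere. `SEQ_LEN`, `chunk`, and `mask_sequence` are hypothetical names chosen for this illustration, and the training code in this lesson may differ in its details (for example, by operating on batched tensors).

```python
import random

MASK_ID = 50257   # same value as in the lesson: one past the GPT-2 vocabulary
SEQ_LEN = 8       # hypothetical block length, kept tiny for readability

def chunk(token_ids, seq_len):
    # Split the flat token stream into full, non-overlapping blocks; drop the ragged tail
    n_blocks = len(token_ids) // seq_len
    return [token_ids[i * seq_len:(i + 1) * seq_len] for i in range(n_blocks)]

def mask_sequence(clean_ids, rng):
    # Draw one masking ratio t for the whole sequence
    t = rng.random()
    # Mask each position independently with probability t
    is_masked = [rng.random() < t for _ in clean_ids]
    noisy_ids = [MASK_ID if m else tok for tok, m in zip(clean_ids, is_masked)]
    return noisy_ids, is_masked, t

rng = random.Random(0)            # seeded so repeated runs give the same masks
stream = list(range(100, 120))    # made-up token IDs standing in for token_ids
blocks = chunk(stream, SEQ_LEN)   # 20 tokens -> 2 blocks of 8, last 4 tokens dropped

for clean in blocks:
    noisy, is_masked, t = mask_sequence(clean, rng)
    # The model would receive `noisy` and be scored only where is_masked is True,
    # with `clean` supplying the target token at those positions
    print(f"t={t:.2f}", noisy)
```

Because `t` varies from sequence to sequence, the model sees everything from lightly masked to almost fully masked inputs, which is what later lets it start generation from a sequence made up mostly of <MASK> tokens.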
