Cross-entropy loss measured against the actual next token in the text sequence.
Phase 2: Alignment (Fine-Tuning)
Below is a concise, structured outline and content plan you can turn into a detailed PDF report. It covers theory, architecture, data, training, evaluation, deployment, costs, safety, and appendices with code snippets and references—suitable for a technical audience (researchers/engineers). Use this as a template to expand into a full PDF; I’ll provide the first ~12 pages of full text below the outline to get you started.
You are going to implement the architecture described in the 2017 paper "Attention Is All You Need" (specifically its decoder-only variant, popularized by OpenAI's GPT models). You need exactly three components:
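    import math

    import torch
    import torch.nn as nn


    class PositionalEncoding(nn.Module):
        """Fixed sinusoidal positional encoding added to token embeddings."""

        def __init__(self, d_model, max_len=512):
            super().__init__()
            pe = torch.zeros(max_len, d_model)
            position = torch.arange(max_len).unsqueeze(1)  # (max_len, 1)
            # Frequencies decay geometrically across the even dimensions.
            div_term = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))
            pe[:, 0::2] = torch.sin(position * div_term)
            pe[:, 1::2] = torch.cos(position * div_term)  # assumes d_model is even
            self.register_buffer('pe', pe)  # stored with the module, not trained

        def forward(self, x):
            # x: (batch, seq_len, d_model); pe[:seq_len] broadcasts over the batch.
            return x + self.pe[:x.size(1)]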
The heart of the transformer is self-attention, which allows tokens to weigh their relationship with other tokens in the sequence.
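To make the weighting concrete, here is a minimal single-head sketch of causal self-attention, assuming PyTorch. The function name `causal_self_attention`, the random projection matrices, and the toy sizes are illustrative assumptions, not part of any library.

```python
import math

import torch
import torch.nn.functional as F


def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head causal self-attention over x of shape (T, d_model)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v           # each (T, d_head)
    scores = (q @ k.T) / math.sqrt(q.size(-1))    # (T, T) pairwise similarity
    # Causal mask: position i may only attend to positions <= i.
    T = x.size(0)
    allowed = torch.tril(torch.ones(T, T, dtype=torch.bool))
    scores = scores.masked_fill(~allowed, float("-inf"))
    weights = F.softmax(scores, dim=-1)           # each row sums to 1
    return weights @ v, weights


torch.manual_seed(0)
T, d_model, d_head = 4, 8, 8                      # toy sizes (assumed)
x = torch.randn(T, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_head) for _ in range(3))
out, weights = causal_self_attention(x, w_q, w_k, w_v)
```

Each row of `weights` is the distribution one token places over itself and the earlier tokens; the output for that token is the corresponding weighted average of the value vectors.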
After training, generate text:
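One common way to do this is an autoregressive sampling loop with temperature and top-k filtering. The sketch below assumes PyTorch and a model that maps token ids of shape (B, T) to logits of shape (B, T, vocab). The `generate` function, its parameters, and the `TinyLM` stand-in model are hypothetical names used only for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


@torch.no_grad()
def generate(model, idx, max_new_tokens, temperature=1.0, top_k=None, context_len=512):
    """Extend idx (B, T) of token ids by max_new_tokens sampled tokens."""
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_len:]          # crop to the context window
        logits = model(idx_cond)[:, -1, :]        # logits at the last position
        logits = logits / temperature             # <1 sharpens, >1 flattens
        if top_k is not None:
            # Keep only the top_k most likely tokens per row.
            kth = torch.topk(logits, top_k).values[:, [-1]]
            logits = logits.masked_fill(logits < kth, float("-inf"))
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)  # (B, 1)
        idx = torch.cat([idx, next_id], dim=1)
    return idx


class TinyLM(nn.Module):
    """Hypothetical stand-in for a trained model: embedding + linear head."""

    def __init__(self, vocab_size=50, dim=16):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, idx):
        return self.head(self.emb(idx))


torch.manual_seed(0)
model = TinyLM().eval()
prompt = torch.tensor([[1, 2, 3]])
out = generate(model, prompt, max_new_tokens=5, temperature=0.8, top_k=10)
```

With a real tokenizer, the prompt ids would come from encoding a string and `out` would be decoded back to text; that step is omitted here to keep the sketch self-contained.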
that specifically examines the complications of pre-training, tokenization, and transformer architecture for achieving state-of-the-art performance. It is available on ResearchGate.
Technical PDF Guides & Slides
Sebastian Raschka’s LLM Slides: A concise PDF titled "Developing an LLM: Building, Training, Finetuning"
    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class RMSNorm(nn.Module):
        """Root-mean-square normalization with a learned per-feature scale."""

        def __init__(self, dim, eps=1e-6):
            super().__init__()
            self.eps = eps
            self.weight = nn.Parameter(torch.ones(dim))

        def forward(self, x):
            variance = x.pow(2).mean(-1, keepdim=True)
            return x * torch.rsqrt(variance + self.eps) * self.weight


    class FeedForward(nn.Module):
        def __init__(self, dim, hidden_dim):
            super().__init__()
            # SwiGLU variant implementation
            self.w1 = nn.Linear(dim, hidden_dim, bias=False)
            self.w2 = nn.Linear(hidden_dim, dim, bias=False)
            self.w3 = nn.Linear(dim, hidden_dim, bias=False)

        def forward(self, x):
            # SiLU-gated branch (w1) multiplied by the linear branch (w3), then projected back.
            return self.w2(F.silu(self.w1(x)) * self.w3(x))


    class TransformerBlock(nn.Module):
        def __init__(self, dim, num_heads, hidden_dim):
            super().__init__()
            self.attention_norm = RMSNorm(dim)
            self.ffn_norm = RMSNorm(dim)
            # Core layers
            self.attention = nn.MultiheadAttention(embed_dim=dim, num_heads=num_heads, batch_first=True)
            self.feed_forward = FeedForward(dim, hidden_dim)

        def forward(self, x, causal_mask):
            # Pre-norm residual connections: normalize, transform, then add back.
            h = x + self.attention_forward(self.attention_norm(x), causal_mask)
            out = h + self.feed_forward(self.ffn_norm(h))
            return out

        def attention_forward(self, x, mask):
            # Simplified wrapper for causal multi-head attention.
            # With a boolean attn_mask, True marks positions that may NOT be attended to.
            attn_output, _ = self.attention(x, x, x, attn_mask=mask, need_weights=False)
            return attn_output

4. The Two-Stage Training Process
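For the pre-training stage, this guide describes the objective as cross-entropy loss measured against the actual next token. A minimal sketch of that objective, assuming PyTorch: the random `logits` tensor is a stand-in for a real model's output, and the sizes (`vocab_size`, `B`, `T`) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, B, T = 100, 2, 6                         # toy sizes (assumed)
tokens = torch.randint(0, vocab_size, (B, T + 1))    # toy batch of token ids

# Shift by one: the target at position t is the token at position t + 1.
inputs = tokens[:, :-1]                              # (B, T)
targets = tokens[:, 1:]                              # (B, T)

# Stand-in for model(inputs); a real model would produce these logits.
logits = torch.randn(B, T, vocab_size)

# cross_entropy expects (N, vocab) logits and (N,) class indices.
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))

# Equivalent by hand: mean negative log-probability of each true next token.
log_probs = F.log_softmax(logits, dim=-1)
manual = -log_probs.gather(-1, targets.unsqueeze(-1)).mean()
```

The two values agree because cross-entropy with one-hot targets reduces to the negative log-probability the model assigns to the correct next token, averaged over all positions.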
    import tiktoken

    enc = tiktoken.get_encoding("gpt2")
A language model assigns probability to a sequence of tokens:
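The standard autoregressive factorization can be written as follows. The notation is an assumption chosen here for illustration: $x_t$ is the token at position $t$, $T$ is the sequence length, and $\theta$ denotes the model parameters.

```latex
P_\theta(x_1, \dots, x_T) = \prod_{t=1}^{T} P_\theta(x_t \mid x_1, \dots, x_{t-1})

% Taking the negative log and averaging over positions gives the
% next-token cross-entropy loss used in pre-training:
\mathcal{L}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log P_\theta(x_t \mid x_1, \dots, x_{t-1})
```

Each factor is what a decoder-only transformer outputs at one position: a distribution over the vocabulary conditioned only on the tokens before it.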
    import torch
    import torch.nn as nn
    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class GroupedQueryAttention(nn.Module):
        def __init__(self, d_model, n_heads, n_kv_heads, d_k):
            super().__init__()
            # n_heads must be divisible by n_kv_heads so KV heads can be shared evenly.
            self.n_heads = n_heads
            self.n_kv_heads = n_kv_heads
            self.d_k = d_k
            self.q_proj = nn.Linear(d_model, n_heads * d_k, bias=False)
            self.k_proj = nn.Linear(d_model, n_kv_heads * d_k, bias=False)
            self.v_proj = nn.Linear(d_model, n_kv_heads * d_k, bias=False)
            self.out_proj = nn.Linear(n_heads * d_k, d_model, bias=False)

        def forward(self, x):
            B, T, C = x.shape
            # Project and reshape to (B, heads, T, d_k).
            q = self.q_proj(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
            k = self.k_proj(x).view(B, T, self.n_kv_heads, self.d_k).transpose(1, 2)
            v = self.v_proj(x).view(B, T, self.n_kv_heads, self.d_k).transpose(1, 2)

            # Repeat KV heads to match query heads for GQA calculation
            num_queries_per_kv = self.n_heads // self.n_kv_heads
            k = k.repeat_interleave(num_queries_per_kv, dim=1)
            v = v.repeat_interleave(num_queries_per_kv, dim=1)

            # Compute Scaled Dot-Product Attention with Causal Mask
            scores = torch.matmul(q, k.transpose(-2, -1)) / (self.d_k ** 0.5)
            mask = torch.tril(torch.ones(T, T, device=x.device)).view(1, 1, T, T)
            scores = scores.masked_fill(mask == 0, float('-inf'))
            attn = F.softmax(scores, dim=-1)

            # Merge heads back to (B, T, n_heads * d_k) before the output projection.
            context = torch.matmul(attn, v).transpose(1, 2).contiguous().view(B, T, -1)
            return self.out_proj(context)

Activation Functions and Normalization
Remember: Every expert builder started with a single block. Your block is nanoGPT. Your blueprint is the PDF.