2.1 [DL] Landscape

Quick refresher

Columns: Input | Output | Child | Arch | Activation | Loss

Non-sequential data
- Binary classification: multi-layer network; sigmoid; binary cross-entropy
- Multiclass classification: normalization keeps values centered around zero to avoid vanishing gradients; relu/softmax; categorical cross-entropy
- Multiclass classification - image recognition: CNN
- Regression: regularization drops neurons or adds a penalty to avoid overfitting; linear/no activation; MSE

Sequential data
- Regression: RNN/LSTM
- Multiclass classification - text autocompletion (char-by-char)
- Text autocompletion (token-by-token): RNN with embedding
- Translation: attention + RNN/ED -> self-attention/multi-head/Transformer-RNN
- NLP - understand the context: BERT
- NLP - text generation: GPT

Full landscape

Columns: Problem type | Note | Network topology | Activation function | Loss function | Adjust weights

Binary classification - logical OR/AND/NAND functions
- Network topology: single perceptron
- Activation function: sign
- Loss function: y != y_hat
- Adjust weights: x[i] is the magnitude, y is the direction; L_R = 0.1; w[i] += y * L_R * x[i]

Binary classification - logical XOR function
- Note: a perceptron cannot separate XOR with a single straight line
- Network topology: multiple perceptrons in a two-layer network
- Activation function: sign contains a discontinuity, and gradient descent needs continuous functions. Use differentiable activation functions: tanh for the hidden layer, logistic for the output
- Loss function: a plain sum of errors has two problems: 1. multiple errors may cancel each other out; 2. the sum depends on the number of examples. Use mean squared error instead
- Adjust weights: gradient descent. The derivative is the sensitivity: 1. where the slope is steep, the derivative is large and the steps are large; 2. where the slope is positive, the update subtracts and moves to the left. w = w - L_R * de/dw

Binary classification - whether a patient has a specific condition, based on a number of input variables
- Note: (from below, optimized)
- Activation function: logistic output unit (sigmoid)
- Loss function: binary cross-entropy

Multiclass classification - classify MNIST handwritten digits
- Note: needs a multiclass output
- Activation function: sign
- Loss function: MSE
- Adjust weights: SGD
- (DL framework) - higher level: Configuration | Configuration | Automatic

Saturated neurons lead to vanishing gradients. Mitigation:
- (Input) and hidden layers
  - Input layer: input normalization/standardization to control the range of the input
  - Hidden layer: batch normalization
  - Weights in every layer: weight initialization
  - Using tanh as the hidden activation function -> a different activation function: relu
  - Gradient direction diverges -> BATCH_SIZE
  - Fixed lr oscillates when converging -> dynamic L_R of Adam
- Output layer
  - glorot_uniform for sigmoid
  - When using a sigmoid function in the output layer, use the binary cross-entropy loss function
  - Several sigmoid outputs may have the same value -> softmax is mutually exclusive; use the categorical cross-entropy loss function

Multiclass classification - classify CIFAR-10 objects / ImageNet DS 1000 objects
- Images need spatial feature extraction -> add convolutional layers (CNN)
- Heavy computing for large images -> max-pooling by model.add(MaxPooling2D(pool_size=(2,2), strides=2))
- Edge pixels are not learned -> padding by padding='same'
- Inefficient CNN -> depth-wise separable convolutions; EfficientNet
- Conv layers use relu; the output layer is softmax; categorical cross-entropy loss function

(Use a pre-trained network/model, e.g. ResNet-50)
- Arch: AlexNet; VGGNet - building block; GoogLeNet - Inception module; ResNet - skip connection

(Customize a pre-trained network/model)
- Transfer learning: replace the output layers and retrain
- Fine-tuning model: retrain the upper layers directly
- Data augmentation: create more training data from existing data

Regression - predict a numerical value rather than a class, e.g. predict demand and price for an item
- The network needs to output any number without restriction from the activation: the output layer consists of a single neuron with a linear activation function
- Activation function: linear activation/no activation function
- Loss function: MSE

ALL PROBLEMS - training error rises at the end (not good)
- Deeper (adding more layers) and wider (adding more neurons) network

ALL PROBLEMS - overfitting (test error rises at the end, training error does not)
- Regularization: dropout by model.add(Dropout(0.3)); weight decay, which adds a penalty term to the loss function; early stopping

Sequential data - Regression - predict book sales based on historical sales data
- The output depends on previous inputs and requires memory. Fully Connected Networks (FCNs) cannot capture temporal dependencies, and they cannot handle variable-length sequences because they only process fixed-length inputs -> add a recurrent layer (RNN): model.add(SimpleRNN(128, activation='relu'))
- BPTT: unrolled automatically
- Long sequence network: repeated weight multiplication leads to vanishing or exploding gradients -> use an LSTM layer instead of an RNN layer: LSTM(64, input_shape=(10,8))

Sequential data - Multiclass classification - text autocompletion (char-by-char)
- How to generate a sequence step by step -> autoregression
- How to avoid greedy selection during the generation process -> beam size

Text autocompletion (token-by-token) / speech recognition / translation
- RNN neural language models (RNN, but autoregressive token-by-token) with an embedding layer

Translation - RNN/ED + attention
- Translation involves two languages and needs two LMs -> encoder-decoder network
- When processing long texts, it is easy to "forget" earlier information -> attention mechanism
  - ALL: every timestep has all the context
  - FOCUS: dynamically focus on different parts of the context at different time steps
  - Attention network

Self-attention without RNN
- A single attention is serial and slow -> self-attention layer
  - Multiple attentions (forming an attention layer)
  - Replace RNN with FCL
  - Self: Q and KV all come from the input itself, which removes the dependency. With no dependencies between words, processing runs in parallel and is fast
- Captures features in one dimension only -> multi-head self-attention layer: multiple heads can capture different aspects of features for one input

Transformer
- Multi-head attention layer
- Multi-head self-attention layer
- Norm layer
- Multiple encoder-decoder modules

NLP - understand the context
- Better at understanding context; mainly used for comprehension tasks such as classification and labeling
- Weaker generation capabilities; needs to be combined with other models to perform text generation
● Sentiment analysis
● Spam analysis
● Classify second sentence as entailment, contradiction, or neutral
● Identify words that answer a question
- E LLM - BERT
  - Bidirectional: predicts the middle part from the surrounding context (cloze-style tasks). Pre-trained as ● masked language model ● next-sentence prediction
  - Encoder Representations from Transformers: encoder only

NLP - phrase the problem in a way that the probability of a given completion can be interpreted as the solution
- Good at generating coherent and logically structured text; suitable for scenarios such as dialogue generation, article continuation, and creative writing
● Sentiment analysis
● Entailment
● Similarity
● Multiple choice
- D LLM - GPT
  - Generative: predict the next word
  - Pre-trained: as an LM
  - Transformer: decoder only, without cross-attention
● GPT-2: scaling the model makes zero-shot much better
● GPT-3: providing in-context examples (few-shot, learning at inference time) improves accuracy
  - Codex/Copilot: based on GPT-3, supervised fine-tuned (with code/docstrings/unit tests as data) to generate Python code
  - InstructGPT/ChatGPT: based on GPT-3, SFT + RLHF, aligns the model with the user's intention
● GPT-4: fine-tuned with RLHF; aligned model with multi-modal (text, image) input
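The single-perceptron row in the landscape above gives the update rule w[i] += y * L_R * x[i], applied when y != y_hat. A minimal sketch of that rule on the logical OR function follows. The -1/+1 encoding, the bias input x[0] = 1, treating sign(0) as +1, and the helper names `sign`, `predict`, and `train_perceptron` are assumptions made for illustration; they are not taken from the post.

```python
L_R = 0.1  # learning rate, the value given in the notes


def sign(z):
    """Sign activation; mapping 0 to +1 is an assumption of this sketch."""
    return 1 if z >= 0 else -1


def predict(w, x):
    """Weighted sum of the inputs followed by the sign activation."""
    return sign(sum(w_i * x_i for w_i, x_i in zip(w, x)))


def train_perceptron(samples, max_epochs=100):
    """samples: list of (x, y), with x[0] == 1 as bias input and y in {-1, +1}."""
    w = [0.0] * len(samples[0][0])
    for _ in range(max_epochs):
        errors = 0
        for x, y in samples:
            if predict(w, x) != y:
                # x[i] sets the magnitude of the step, y sets its direction.
                for i in range(len(w)):
                    w[i] += y * L_R * x[i]
                errors += 1
        if errors == 0:  # every sample is classified correctly
            break
    return w


# Logical OR, with -1 for False and +1 for True; the first input is the bias.
or_samples = [
    ((1, -1, -1), -1),
    ((1, -1, 1), 1),
    ((1, 1, -1), 1),
    ((1, 1, 1), 1),
]

weights = train_perceptron(or_samples)
predictions = [predict(weights, x) for x, _ in or_samples]
print(weights, predictions)
```

OR is linearly separable, so the loop stops once an epoch produces no errors. The same code cannot converge on XOR, which is the limitation the next row of the landscape addresses with a two-layer network.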

June 30, 2025 · 5 min · Hongyao Tang

3.1 [GPT-2] Implementation

Quick refresher

Overall flow

Architecture

GPT-2

    import torch
    import torch.nn as nn


    # TransformerBlock and LayerNorm are not shown in this excerpt.
    class GPTModel(nn.Module):
        def __init__(self, cfg):
            super().__init__()
            self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
            self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
            self.drop_emb = nn.Dropout(cfg["drop_rate"])

            self.trf_blocks = nn.Sequential(
                *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])

            self.final_norm = LayerNorm(cfg["emb_dim"])
            self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)

        def forward(self, in_idx):
            batch_size, seq_len = in_idx.shape
            tok_embeds = self.tok_emb(in_idx)
            pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
            x = tok_embeds + pos_embeds  # Shape [batch_size, num_tokens, emb_size]
            x = self.drop_emb(x)
            x = self.trf_blocks(x)
            x = self.final_norm(x)
            logits = self.out_head(x)
            return logits

MultiHeadAttention

    class MultiHeadAttention(nn.Module):
        def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
            super().__init__()
            assert (d_out % num_heads == 0), \
                "d_out must be divisible by num_heads"

            self.d_out = d_out
            self.num_heads = num_heads
            self.head_dim = d_out // num_heads  # Reduce the projection dim to match desired output dim

            self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
            self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
            self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
            # Upper-triangular causal mask: 1 marks the future positions to hide
            self.register_buffer(
                "mask",
                torch.triu(torch.ones(context_length, context_length), diagonal=1))
            self.dropout = nn.Dropout(dropout)
            self.out_proj = nn.Linear(d_out, d_out)  # Linear layer to combine head outputs

        def forward(self, x):
            b, num_tokens, d_in = x.shape

            queries = self.W_query(x)
            keys = self.W_key(x)  # Shape: (b, num_tokens, d_out)
            values = self.W_value(x)

            ####################### decompose #################################
            # We implicitly split the matrix by adding a `num_heads` dimension
            # Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
            queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)
            keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
            values = values.view(b, num_tokens, self.num_heads, self.head_dim)

            # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
            queries = queries.transpose(1, 2)
            keys = keys.transpose(1, 2)
            values = values.transpose(1, 2)
            ####################################################################

            attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head
            attn_scores.masked_fill_(self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)
            attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
            attn_weights = self.dropout(attn_weights)

            ########################## compose #################################
            # Shape: (b, num_tokens, num_heads, head_dim)
            context_vec = (attn_weights @ values).transpose(1, 2)

            # Combine heads, where self.d_out = self.num_heads * self.head_dim
            context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
            ####################################################################

            context_vec = self.out_proj(context_vec)  # optional projection
            return context_vec

Step 1. Data preparation

S - Collecting raw data
T - Raw data is a .txt file with natural language ...
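The masking, scaling, and softmax steps in `MultiHeadAttention.forward` can be traced by hand on a tiny example. The sketch below redoes those three steps for a single head in plain Python, without PyTorch, so the effect of the causal mask is visible in the printed rows. The function name `causal_attention_weights` and the toy scores are made up for illustration; dropout is left out.

```python
import math


def causal_attention_weights(scores, head_dim):
    """Mask future positions, scale by sqrt(head_dim), and softmax each row.

    scores: square list of lists, scores[i][j] = query i dotted with key j.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may only attend to positions 0..i; later ones get -inf,
        # which mirrors masked_fill_ with the upper-triangular mask.
        masked = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        scaled = [s / math.sqrt(head_dim) for s in masked]
        row_max = max(scaled)  # subtract the max for numerical stability
        exps = [math.exp(s - row_max) for s in scaled]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Hypothetical attention scores for a 3-token sequence.
toy_scores = [
    [0.5, 0.1, 0.3],
    [0.2, 0.9, 0.4],
    [0.7, 0.6, 0.8],
]
attn = causal_attention_weights(toy_scores, head_dim=4)
for row in attn:
    print([round(w, 3) for w in row])
```

Each row sums to 1 and every entry above the diagonal is 0, so a token never receives weight from tokens that come after it.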

July 1, 2025 · 35 min · Hongyao Tang

4.1 [SFT] Classification SFT

The most common ways to fine-tune a language model

Method | Classification fine-tuning | Instruction fine-tuning (following human instructions)
Instruction | The prompt must not contain an instruction | The model is instructed to, for example, translate an English sentence into German
Scope | A classification-fine-tuned model can only predict the classes it saw during training. For example, it can decide whether a message is "spam" or "not spam" | An instruction-fine-tuned model can usually perform a much broader range of tasks
Cap | Decide whether a given text is spam | See the two points below

- A pretrained LLM can do text completion: given any fragment as input, it can generate a sentence or write a paragraph
- However, a pretrained LLM often performs poorly on specific instructions, such as "fix the grammar of this text" or "rewrite this passage in the passive voice"

Classification finetuning (Email Spam)

Datasets

S - Download raw datasets
T - Save to sms_spam_collection/SMSSpamCollection.tsv
- sms: Short Message Service
- Ham is what you want; literally ham, meaning "not spam"
- Spam is what you do not want; literally canned luncheon meat, meaning "junk mail" or "junk message"
A

    import os
    import urllib.error
    import urllib.request
    import zipfile
    from pathlib import Path


    def download_and_unzip_spam_data(url, zip_path, extracted_path, data_file_path):
        if data_file_path.exists():
            print(f"{data_file_path} already exists. Skipping download and extraction.")
            return

        # Downloading the file
        with urllib.request.urlopen(url) as response:
            with open(zip_path, "wb") as out_file:
                out_file.write(response.read())

        # Unzipping the file
        with zipfile.ZipFile(zip_path, "r") as zip_ref:
            zip_ref.extractall(extracted_path)

        # Add .tsv file extension
        original_file_path = Path(extracted_path) / "SMSSpamCollection"
        os.rename(original_file_path, data_file_path)
        print(f"File downloaded and saved as {data_file_path}")


    url = "https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip"
    zip_path = "sms_spam_collection.zip"
    extracted_path = "sms_spam_collection"
    data_file_path = Path(extracted_path) / "SMSSpamCollection.tsv"

    try:
        download_and_unzip_spam_data(url, zip_path, extracted_path, data_file_path)
    except (urllib.error.HTTPError, urllib.error.URLError, TimeoutError) as e:
        # Fall back to the mirror if the primary download fails
        print(f"Primary URL failed: {e}. Trying backup URL...")
        url = "https://f001.backblazeb2.com/file/LLMs-from-scratch/sms%2Bspam%2Bcollection.zip"
        download_and_unzip_spam_data(url, zip_path, extracted_path, data_file_path)

S - Data preprocessing
- Load the TSV into a pandas DataFrame
- For simplicity, use a smaller dataset (which also makes fine-tuning the LLM faster), with 747 instances per class
- The labels are the strings ham and spam and need to be converted to numbers
- Create the different datasets
- The spam dataset contains text messages of varying lengths. To batch these messages the way text chunks are batched
T ...
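The preprocessing notes above mention reducing the data to 747 instances per class and turning the ham/spam strings into numbers. The sketch below shows one way those two steps could look in plain Python rather than pandas. The function name `balance_and_encode`, the 0/1 label mapping, the random seed, and the toy messages are all assumptions for illustration, not the post's code.

```python
import random


def balance_and_encode(rows, seed=123):
    """rows: list of (label, text) pairs with label in {"ham", "spam"}.

    Undersamples "ham" (assumed to be the majority class) to the number of
    "spam" rows, then maps the string labels to integers.
    """
    label_to_id = {"ham": 0, "spam": 1}  # assumed mapping for this sketch
    spam = [r for r in rows if r[0] == "spam"]
    ham = [r for r in rows if r[0] == "ham"]

    rng = random.Random(seed)
    ham_subset = rng.sample(ham, len(spam))  # same count as the spam class

    balanced = ham_subset + spam
    rng.shuffle(balanced)
    return [(label_to_id[label], text) for label, text in balanced]


# Tiny made-up sample standing in for the real "label<TAB>text" file.
toy_rows = [
    ("ham", "See you at lunch?"),
    ("ham", "Running late, start without me."),
    ("ham", "Thanks for the notes."),
    ("ham", "Call me when you land."),
    ("ham", "Happy birthday!"),
    ("spam", "You won a prize, reply now."),
    ("spam", "Cheap loans, click here."),
]

encoded = balance_and_encode(toy_rows)
print(encoded)
```

On the full dataset the same idea would keep every spam message and an equally sized random subset of ham, which is where the 747-per-class figure comes from.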

July 4, 2025 · 15 min · Hongyao Tang

4.2 [SFT] Instruction SFT

Datasets

S - Download raw datasets
T - The dataset is stored in a relatively small JSON file (only 204 KB). JSON (JavaScript Object Notation) is a data interchange format that is both easy for humans to read and suitable for machine processing, somewhat like a Python dictionary
- The instruction dataset contains 1100 instruction-response pairs
A

    import json
    import os
    import urllib.request


    def download_and_load_file(file_path, url):
        # Download only if the file is not already cached locally
        if not os.path.exists(file_path):
            with urllib.request.urlopen(url) as response:
                text_data = response.read().decode("utf-8")
            with open(file_path, "w", encoding="utf-8") as file:
                file.write(text_data)

        with open(file_path, "r", encoding="utf-8") as file:
            data = json.load(file)

        return data


    file_path = "instruction-data.json"
    url = (
        "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch"
        "/main/ch07/01_main-chapter-code/instruction-data.json"
    )

    data = download_and_load_file(file_path, url)
    print("Number of entries:", len(data))
    print("Example entry:\n", data[50])

    # The data list loaded from the JSON file contains the 1100 samples of the instruction dataset
    >>> type(data)
    <class 'list'>
    >>> len(data)
    1100
    >>>
    >>> from rich.pretty import pprint
    >>> pprint(data[0])
    {
    │   'instruction': 'Evaluate the following phrase by transforming it into the spelling given.',
    │   'input': 'freind --> friend',
    │   'output': 'The spelling of the given phrase "freind" is incorrect, the correct spelling is "friend".'
    }

S - Data preprocessing
- before any encapsulation
- split by portion
TA ...
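The "split by portion" step above can be sketched as slicing the loaded list into consecutive parts. The 85% / 10% / 5% proportions, the function name `split_by_portion`, and the placeholder entries are assumptions for illustration; the summary does not state which portions the post uses.

```python
def split_by_portion(data, train_pct=85, test_pct=10):
    """Split a list into train/test/validation parts by integer percentages.

    Whatever remains after the train and test portions becomes validation.
    Integer arithmetic avoids floating-point rounding at the boundaries.
    """
    train_end = len(data) * train_pct // 100
    test_end = train_end + len(data) * test_pct // 100

    train_data = data[:train_end]
    test_data = data[train_end:test_end]
    val_data = data[test_end:]
    return train_data, test_data, val_data


# Placeholder entries standing in for the 1100 instruction-response pairs.
dummy_data = [{"instruction": f"task {i}", "input": "", "output": ""} for i in range(1100)]

train_data, test_data, val_data = split_by_portion(dummy_data)
print(len(train_data), len(test_data), len(val_data))
```

Slicing keeps the original order, so shuffling beforehand would be a separate decision if the file is sorted in some meaningful way.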

July 4, 2025 · 13 min · Hongyao Tang

5.1 [Align] RLHF

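LLM development flow

Stage | Pretraining | SFT | Align - RLHF (PPO)
Goal | Learn language structure, common sense, and semantic relations | Teach the model "how to answer questions correctly" | Optimize the model with reinforcement learning so that its outputs better match human preferences
Data | Massive amounts of web pages, books, code, etc. | Manually constructed question-answer pairs, dialogues, coding tasks, etc. | Human rankings of multiple model answers, used to train a reward model
Method / steps | Self-supervised language modeling (predict the next word) | Standard supervised learning | 1. Train a reward model 2. Optimize the outputs with RL (e.g. PPO)

S - Solve sequential decision making problems, i.e. balance short-term and long-term returns across multi-step decisions.
T - The core of RL is learning an optimal decision policy by interacting with an environment. The agent receives feedback (reward or penalty) through trial and error, and the final goal is to maximize cumulative reward.
Analogy: it is like training a puppy to perform actions. A correct action earns a treat (positive reward), a wrong one earns no encouragement (negative reward), and eventually the puppy learns to "sit" or "shake hands".

S - The model should not only complete the target task but also understand subjective human intent and values.
T - Core idea of RLHF: replace or correct the environment reward with subjective human feedback, so that the AI better matches human values.
Analogy: a "private tutoring class" for AI. Imagine teaching a child to paint without being able to score each painting directly (a traditional reward function is hard to design). So you hire an art teacher (a human) to comment on the child's paintings (feedback), pointing out which lines are more graceful and which color choices need work.

T - PPO (Proximal Policy Optimization)
Core idea of PPO: define a "safe range" so that the trainee can only adjust the training load a little each time, which ensures steady progress.
Analogy: a fitness coach's "safe training plan". Imagine you are a fitness coach whose trainee builds muscle by continually adjusting training movements (maximizing reward). Letting the trainee increase the load sharply every day (an abrupt policy change) may cause injury (training collapse).

Core principles
- Policy gradient. Basic idea: adjust the policy according to how "good" an action is (the advantage function). For example, if a movement lets the trainee lift more (high reward), encourage that movement more. Problem: if the trainee suddenly attempts a movement that is too heavy (an abrupt policy change), it may strain a muscle (training collapse).
- PPO's improvement, the clip mechanism. "Safety threshold": the training load may not change by more than ±20% each time (analogous to the clip threshold ε=0.2).

A

R

Model | Uses SFT (train on human-written question-answer pairs so the model learns "how to answer questions") | Uses RLHF (optimize model outputs to better match human preferences) | Notes
GPT-1 / GPT-2 / GPT-3 | ❌ | ❌ | Large-scale unsupervised pretraining only (language modeling)
GPT-3.5 | ✅ | ✅ | Alignment training with SFT + reward model + RLHF (PPO)
GPT-4 / GPT-4-turbo | ✅ | ✅ | Also uses SFT + RLHF; the training process is more complex and may include newer techniques such as DPO
ChatGPT (all versions) | ✅ | ✅ | ChatGPT is a dialogue model obtained by fine-tuning GPT-3.5 / GPT-4 with SFT + RLHF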
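The clip mechanism described above corresponds to PPO's clipped surrogate objective: for one action, take the minimum of ratio * advantage and clip(ratio, 1 - ε, 1 + ε) * advantage, where ratio is the new policy's probability of the action divided by the old policy's. A minimal sketch for a single sample follows; the function name `ppo_clip_objective` and the example numbers are hypothetical.

```python
def ppo_clip_objective(ratio, advantage, epsilon=0.2):
    """Clipped surrogate objective for one (state, action) sample.

    ratio: pi_new(a|s) / pi_old(a|s)
    advantage: how much better the action was than expected
    epsilon: clip threshold; 0.2 matches the ±20% "safe range" in the notes
    """
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Taking the minimum keeps the objective pessimistic, so a large policy
    # change cannot earn extra credit beyond the clipped range.
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action, policy moved too far toward it: the gain is capped at 1.2 * A.
capped_gain = ppo_clip_objective(ratio=1.5, advantage=2.0)

# Bad action, policy moved too far away from it: capped at 0.8 * A.
capped_loss = ppo_clip_objective(ratio=0.5, advantage=-1.0)

# Small change inside the safe range: the objective is left unclipped.
unclipped = ppo_clip_objective(ratio=1.1, advantage=2.0)

print(capped_gain, capped_loss, unclipped)
```

In training, this quantity is averaged over a batch and maximized; once the ratio leaves the [1 - ε, 1 + ε] range in the direction the advantage favors, the gradient through the clipped term is zero, which is what keeps each update small.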

July 4, 2025 · 1 min · Hongyao Tang