DL | Hongyao Tang

2.1 [DL] Landscape

Quick refresher

Fields: Input, Output, Child, Arch, Activation, Loss

Non-sequential data
- Binary classification. Arch: multi-layer. Activation: sigmoid. Loss: binary cross-entropy.
- Multiclass classification. Normalization keeps values around zero to avoid vanishing gradients. Activation: ReLU/softmax. Loss: categorical cross-entropy.
- Multiclass classification - image recognition. Arch: CNN.
- Regression. Regularization is about dropping neurons or adding a penalty to avoid overfitting. Activation: linear/no activation. Loss: MSE.

Sequential data
- Regression. Arch: RNN/LSTM.
- Multiclass classification - text autocompletion (char-by-char).
- Text autocompletion (token-by-token). Arch: RNN with embedding.
- Translation. Arch: attention + RNN/ED (encoder-decoder) -> self-attention/multi-head/Transformer without RNN.
- NLP - understand the context. Arch: BERT.
- NLP - text generation. Arch: GPT.

Full landscape

Fields: Problem type, Note, Network topology, Activation function, Loss function, Adjust weights

Binary classification - logical OR/AND/NAND functions
- Network topology: single perceptron
- Activation function: sign
- Loss function: y != y_hat
- Adjust weights: x[i] is the magnitude, y is the direction; L_R = 0.1; w[i] += y * L_R * x[i]

Binary classification - logical XOR function
- Note: a perceptron cannot separate XOR with a single straight line
- Network topology: multiple perceptrons, two-layer network
- Activation function: sign contains a discontinuity, and gradient descent needs continuous functions. Use differentiable activation functions: tanh for hidden, logistic for output
- Loss function: the sum of errors has two problems: 1. multiple errors may cancel each other out; 2. the sum depends on the number of examples. Use mean squared error instead
- Adjust weights: gradient descent. The derivative is the sensitivity: 1. where the slope is steep, the derivative is large and the steps are large; 2. where the slope is positive, the minus sign moves w to the left. w = w - L_R * de/dw

Binary classification - whether a patient has a specific condition based on a number of input variables
- Note: (optimized from below)
- Activation function: logistic output unit, sigmoid
- Loss function: binary cross-entropy

Multiclass classification - classify MNIST handwritten digits
- Network topology: needs a multiclass output
- Activation function: Sign
- Loss function: MSE
- Adjust weights: SGD

(DL framework)
- Network topology: higher level
- Activation function: configuration
- Loss function: configuration
- Adjust weights: automatic

Saturated neurons lead to vanishing gradients - mitigation
- (Input) & hidden layer
  - Input layer: input normalization/standardization to control the range of the input
  - Hidden layer: batch normalization
  - Weights in every layer: weight initialization
- Activation function: instead of tanh as the hidden activation function, use a different activation function: relu
- Adjust weights: gradient direction diverges -> BATCH_SIZE; a fixed learning rate oscillates when converging -> dynamic L_R of Adam
- Output layer: glorot_uniform for sigmoid; activation: sigmoid. When using the sigmoid function in the output layer, use the binary cross-entropy loss function
- Several sigmoid outputs may have the same value, whereas softmax is mutually exclusive. Loss function: categorical cross-entropy

Multiclass classification - classify CIFAR-10 objects / ImageNet dataset with 1000 objects
- Images need spatial feature extraction -> add convolutional layers (CNN)
- Heavy computation for large images -> max pooling by model.add(MaxPooling2D(pool_size=(2,2), strides=2))
- Edge pixels are not learned -> padding by padding='same'
- Inefficient CNN -> depth-wise separable convolutions; EfficientNet
- Activation function: conv layers use relu; output layer uses softmax
- Loss function: categorical cross-entropy

(Use a pre-trained network/model, e.g. ResNet-50)
- Arch: AlexNet; VGGNet - building block; GoogLeNet - Inception module; ResNet - skip connection

(Customize a pre-trained network/model)
- Transfer learning: replace the output layers and retrain
- Fine-tuning the model: retrain the upper layers directly
- Data augmentation: create more training data from existing data

Regression - predict a numerical value rather than a class, e.g. predict demand and price for an item
- Network topology: needs to output any number without restriction from the activation. The output layer consists of a single neuron with a linear activation function
- Activation function: linear activation/no activation function
- Loss function: MSE

ALL PROBLEMS - training error rises at the end (not good)
- Deeper (more layers) and wider (more neurons) network

ALL PROBLEMS - overfitting (test error rises at the end, training error does not)
- Regularization
  - Dropout by model.add(Dropout(0.3))
  - Weight decay: adding a penalty term to the loss function
  - Early stopping

Sequential data - Regression - predict book sales based on historical sales data
- Note: the output depends on previous inputs and requires memory. Fully connected networks (FCNs) cannot capture temporal dependencies, and they can only process fixed-length inputs, so they cannot handle variable-length sequences
- Network topology: add a recurrent layer (RNN): model.add(SimpleRNN(128, activation='relu'))
- Adjust weights: BPTT, unrolled automatically
- Long sequences: repeated weight multiplication leads to vanishing or exploding gradients -> use an LSTM layer instead of an RNN layer: LSTM(64, input_shape=(10,8))

Sequential data - Multiclass classification - text autocompletion (char-by-char)
- How to generate a sequence step by step: autoregression
- How to avoid greedy selection during generation: beam search (beam size)

Text autocompletion (token-by-token) - speech recognition - translation
- RNN neural language models (RNN, but autoregressive token-by-token) with an embedding layer

Translation - RNN/ED + attention
- Translation involves two languages and needs two LMs -> encoder-decoder network
- When processing long texts, it is easy to "forget" earlier information -> attention mechanism
  - ALL: every timestep has all the context
  - FOCUS: dynamically focus on different parts of the context at different time steps
- Attention network

Self-attention without RNN
- One attention is serial and slow -> self-attention layer
  - Multiple attentions (forming an attention layer)
  - Replace the RNN with FCL
  - Self: Q, K and V all come from the input itself, which removes the dependency. There are no dependencies between words, so it runs in parallel and is fast
- Captures features in one dimension only -> multi-head self-attention layer: multiple heads can capture different aspects of the features of one input
- Transformer: multi-head attention layer; multi-head self-attention layer; norm layer; multiple encoder-decoder modules

NLP - understand the context
- Better at understanding context; mainly used for comprehension tasks such as classification and labeling
- Weaker generation capabilities; needs to be combined with other models to perform text generation
● Sentiment analysis
● Spam analysis
● Classify the second sentence as entailment, contradiction, or neutral
● Identify words that answer a question

E LLM - BERT (Bidirectional Encoder Representations from Transformers)
- Bidirectional: predicts the middle part from the surrounding context (cloze-style tasks)
- Pretrained as
  ● masked language model
  ● next-sentence prediction
- Transformer encoder only

NLP - phrase the problem in a way that the probability of a given completion can be interpreted as the solution
- Good at generating coherent and logically structured text; suitable for scenarios such as dialogue generation, article continuation, and creative writing
● Sentiment analysis
● Entailment
● Similarity
● Multiple choice

D LLM - GPT
- Generative: predicts the next word
- Pre-trained: as an LM
- Transformer: decoder only, without cross-attention
● GPT-2: scaling the model makes zero-shot performance much better
● GPT-3: providing in-context examples (few-shot, learning at inference time) improves accuracy
  - Codex/Copilot: based on GPT-3, supervised fine-tuned (with code/docstrings/unit tests as data) to generate Python code
  - InstructGPT/ChatGPT: based on GPT-3, SFT + RLHF, aligns the model with the user's intention
● GPT-4: fine-tuned with RLHF; accepts multi-modal (text, image) input
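
The single-perceptron row of the full landscape (sign activation, update only when y != y_hat, w[i] += y * L_R * x[i]) can be made concrete with a small sketch. This is only an illustration in plain Python: the function names, the encoding of inputs and targets as -1/+1, and the constant bias input x[0] = 1 are assumptions, not something the notes specify.

```python
# Illustrative sketch of the perceptron learning rule; names are hypothetical.

def sign(z):
    """Sign activation: +1 for z >= 0, otherwise -1."""
    return 1 if z >= 0 else -1


def predict(w, x):
    """Weighted sum of the inputs followed by the sign activation."""
    return sign(sum(wi * xi for wi, xi in zip(w, x)))


def train_perceptron(samples, lr=0.1, max_epochs=100):
    """Apply w[i] += y * lr * x[i] whenever the prediction is wrong."""
    w = [0.0] * len(samples[0][0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in samples:
            if predict(w, x) != y:
                for i in range(len(w)):
                    w[i] += y * lr * x[i]  # x[i]: magnitude, y: direction
                mistakes += 1
        if mistakes == 0:  # every sample classified correctly
            break
    return w


# NAND with inputs and targets in {-1, +1}; x[0] = 1 is an assumed bias input.
NAND = [((1, -1, -1), 1), ((1, -1, 1), 1), ((1, 1, -1), 1), ((1, 1, 1), -1)]
weights = train_perceptron(NAND)
predictions = [predict(weights, x) for x, _ in NAND]
print(predictions)
```

Because NAND is linearly separable, the loop stops after a few epochs. Feeding XOR samples to the same function would keep making mistakes until max_epochs runs out, which is the limitation the XOR row describes.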

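The gradient descent rule from the XOR row, w = w - L_R * de/dw with mean squared error as the loss, can be sketched on a one-weight model. The toy data (y = 2x), the learning rate, and the function names below are assumptions chosen only to keep the example small.

```python
# Illustrative sketch: gradient descent on MSE for the model y_hat = w * x.

def mse(w, xs, ys):
    """Mean squared error, so the loss does not grow with the number of examples."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def mse_grad(w, xs, ys):
    """Derivative de/dw of the mean squared error."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # assumed toy data generated by y = 2x
L_R = 0.05            # assumed learning rate
w = 0.0
losses = []
for _ in range(200):
    losses.append(mse(w, xs, ys))
    # Steep slope -> large step; positive slope -> the minus sign moves w left.
    w = w - L_R * mse_grad(w, xs, ys)

print(round(w, 4), losses[0], losses[-1])
```

The steps shrink on their own as w approaches 2, because the derivative gets smaller near the minimum.
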
3.1 [GPT-2] Implementation

Quick refresher

Overall flow

Architecture

GPT-2

    import torch
    import torch.nn as nn


    # TransformerBlock and LayerNorm are custom modules that are not shown in this excerpt.
    class GPTModel(nn.Module):
        def __init__(self, cfg):
            super().__init__()
            self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
            self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
            self.drop_emb = nn.Dropout(cfg["drop_rate"])
            self.trf_blocks = nn.Sequential(
                *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])
            self.final_norm = LayerNorm(cfg["emb_dim"])
            self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)

        def forward(self, in_idx):
            batch_size, seq_len = in_idx.shape
            tok_embeds = self.tok_emb(in_idx)
            pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
            x = tok_embeds + pos_embeds  # Shape [batch_size, num_tokens, emb_size]
            x = self.drop_emb(x)
            x = self.trf_blocks(x)
            x = self.final_norm(x)
            logits = self.out_head(x)
            return logits

MultiHeadAttention

    class MultiHeadAttention(nn.Module):
        def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
            super().__init__()
            assert (d_out % num_heads == 0), \
                "d_out must be divisible by num_heads"

            self.d_out = d_out
            self.num_heads = num_heads
            self.head_dim = d_out // num_heads  # Reduce the projection dim to match desired output dim

            self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
            self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
            self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
            # Causal mask: 1 above the diagonal marks future tokens
            self.register_buffer(
                "mask",
                torch.triu(torch.ones(context_length, context_length), diagonal=1))
            self.dropout = nn.Dropout(dropout)
            self.out_proj = nn.Linear(d_out, d_out)  # Linear layer to combine head outputs

        def forward(self, x):
            b, num_tokens, d_in = x.shape

            queries = self.W_query(x)
            keys = self.W_key(x)  # Shape: (b, num_tokens, d_out)
            values = self.W_value(x)

            ####################### decompose #################################
            # We implicitly split the matrix by adding a `num_heads` dimension
            # Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
            queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)
            keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
            values = values.view(b, num_tokens, self.num_heads, self.head_dim)

            # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
            queries = queries.transpose(1, 2)
            keys = keys.transpose(1, 2)
            values = values.transpose(1, 2)
            ####################################################################

            attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head
            attn_scores.masked_fill_(self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)

            attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
            attn_weights = self.dropout(attn_weights)

            ########################## compose #################################
            # Shape: (b, num_tokens, num_heads, head_dim)
            context_vec = (attn_weights @ values).transpose(1, 2)

            # Combine heads, where self.d_out = self.num_heads * self.head_dim
            context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
            ####################################################################

            context_vec = self.out_proj(context_vec)  # optional projection
            return context_vec

Step 1. Data preparation
S - Collecting raw data
T - Raw data is a .txt file with natural language ...
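
Looking back at the `mask` buffer in MultiHeadAttention: the upper-triangular mask plus `masked_fill_` with `-torch.inf` is what stops each token from attending to later tokens. The sketch below replays that step in plain Python (no PyTorch) on a made-up 3x3 score matrix. The helper names and the scores are assumptions for illustration, not part of the original class.

```python
import math

# Illustrative re-creation of the causal masking step; helper names are hypothetical.

def causal_mask(n):
    """1 above the diagonal (future positions), 0 elsewhere, like torch.triu(..., diagonal=1)."""
    return [[1 if j > i else 0 for j in range(n)] for i in range(n)]


def softmax(row):
    """Numerically stable softmax; exp(-inf) evaluates to 0.0."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention_weights(scores, head_dim):
    """Mask future positions with -inf, scale by sqrt(head_dim), then apply softmax per row."""
    n = len(scores)
    mask = causal_mask(n)
    weights = []
    for i in range(n):
        row = [-math.inf if mask[i][j] else scores[i][j] / head_dim ** 0.5
               for j in range(n)]
        weights.append(softmax(row))
    return weights


scores = [[1.0, 2.0, 3.0],
          [1.0, 2.0, 3.0],
          [1.0, 2.0, 3.0]]  # made-up attention scores for 3 tokens
weights = masked_attention_weights(scores, head_dim=4)
for row in weights:
    print([round(v, 3) for v in row])
```

Every row still sums to 1, but all weight above the diagonal is exactly zero, so the first token can only attend to itself.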

4.1 [SFT] Classification SFT

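The most common ways to fine-tune a language model

Method: classification fine-tuning vs. instruction fine-tuning (following human instructions)
- Instruction
  - Classification fine-tuning: the prompt cannot contain instructions
  - Instruction fine-tuning: the model is instructed to do something, e.g. translate an English sentence into German
- Scope
  - Classification fine-tuning: the model can only predict the classes it encountered during training. For example, it can decide whether a message is "spam" or "not spam"
  - Instruction fine-tuning: the model can usually perform a broader range of tasks
- Cap
  - Classification fine-tuning: decide whether a given text is spam
  - A pretrained LLM can do text completion: given any fragment as input, it can generate a sentence or write a paragraph
  - However, a pretrained LLM often performs poorly on specific instructions, e.g. it cannot complete instructions such as "fix the grammar in this text" or "rewrite this passage in the passive voice"

Classification finetuning (Email Spam)

Datasets

S - Download raw datasets
T - Save to sms_spam_collection/SMSSpamCollection.tsv
- SMS: Short Message Service (text messages)
- Ham is what you want (literally ham, the meat): "not spam"
- Spam is what you do not want (literally canned luncheon meat): "junk mail" or "junk messages"
A

    import urllib.request
    import urllib.error
    import zipfile
    import os
    from pathlib import Path


    def download_and_unzip_spam_data(url, zip_path, extracted_path, data_file_path):
        if data_file_path.exists():
            print(f"{data_file_path} already exists. Skipping download and extraction.")
            return

        # Downloading the file
        with urllib.request.urlopen(url) as response:
            with open(zip_path, "wb") as out_file:
                out_file.write(response.read())

        # Unzipping the file
        with zipfile.ZipFile(zip_path, "r") as zip_ref:
            zip_ref.extractall(extracted_path)

        # Add .tsv file extension
        original_file_path = Path(extracted_path) / "SMSSpamCollection"
        os.rename(original_file_path, data_file_path)
        print(f"File downloaded and saved as {data_file_path}")


    url = "https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip"
    zip_path = "sms_spam_collection.zip"
    extracted_path = "sms_spam_collection"
    data_file_path = Path(extracted_path) / "SMSSpamCollection.tsv"

    try:
        download_and_unzip_spam_data(url, zip_path, extracted_path, data_file_path)
    except (urllib.error.HTTPError, urllib.error.URLError, TimeoutError) as e:
        print(f"Primary URL failed: {e}. Trying backup URL...")
        url = "https://f001.backblazeb2.com/file/LLMs-from-scratch/sms%2Bspam%2Bcollection.zip"
        download_and_unzip_spam_data(url, zip_path, extracted_path, data_file_path)

***S - Data preprocessing
- Load the TSV into a pandas DataFrame
- For simplicity, use a smaller dataset (this helps fine-tune the LLM faster) with 747 instances per class
- The labels are the strings ham and spam and need to be converted to numbers
- Create the different datasets
- We are now dealing with a spam dataset that contains text messages of different lengths. To batch these messages the way text chunks were batched
T ...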
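
Two of the preprocessing steps listed above, shrinking the dataset so both classes have the same number of instances and turning the ham/spam strings into numbers, can be sketched as follows. This version uses plain Python lists instead of a pandas DataFrame to stay self-contained. The function name, the ham -> 0 / spam -> 1 mapping, the random seed, and the tiny made-up rows are all assumptions.

```python
import random

# Illustrative sketch; the mapping and helper name are assumptions.
LABEL_TO_ID = {"ham": 0, "spam": 1}


def create_balanced_dataset(rows, seed=123):
    """Undersample 'ham' so that both classes have the same number of rows.

    Assumes there are at least as many 'ham' rows as 'spam' rows.
    """
    spam = [row for row in rows if row[0] == "spam"]
    ham = [row for row in rows if row[0] == "ham"]
    rng = random.Random(seed)
    ham_subset = rng.sample(ham, len(spam))
    return spam + ham_subset


# Made-up (label, text) rows standing in for the SMS data.
rows = ([("ham", f"see you at {i}") for i in range(10)]
        + [("spam", f"you won prize {i}") for i in range(3)])

balanced = create_balanced_dataset(rows)
encoded = [(LABEL_TO_ID[label], text) for label, text in balanced]
print(len(encoded), sum(label for label, _ in encoded))
```

With the real data the same idea would leave 747 rows per class, matching the number quoted in the notes.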

4.2 [SFT] Instruction SFT

Datasets

S - Download raw datasets
T - The dataset is stored in a relatively small JSON file (only 204 KB). JSON (JavaScript Object Notation) is a data-interchange format that is easy for humans to read and for machines to process, somewhat like a Python dictionary. The instruction dataset contains 1100 instruction-response pairs.
A

    import json
    import os
    import urllib.request


    def download_and_load_file(file_path, url):
        # Download only if the file is not already on disk
        if not os.path.exists(file_path):
            with urllib.request.urlopen(url) as response:
                text_data = response.read().decode("utf-8")
            with open(file_path, "w", encoding="utf-8") as file:
                file.write(text_data)

        with open(file_path, "r", encoding="utf-8") as file:
            data = json.load(file)

        return data


    file_path = "instruction-data.json"
    url = (
        "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch"
        "/main/ch07/01_main-chapter-code/instruction-data.json"
    )

    data = download_and_load_file(file_path, url)
    print("Number of entries:", len(data))
    print("Example entry:\n", data[50])

    # The data list loaded from the JSON file contains the 1100 samples of the instruction dataset
    >>> type(data)
    <class 'list'>
    >>> len(data)
    1100
    >>>
    >>> from rich.pretty import pprint
    >>> pprint(data[0])
    {
        'instruction': 'Evaluate the following phrase by transforming it into the spelling given.',
        'input': 'freind --> friend',
        'output': 'The spelling of the given phrase "freind" is incorrect, the correct spelling is "friend".'
    }

***S - Data preprocessing
- Before any encapsulation, split by portion
TA ...
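
"Split by portion" can be illustrated with a short sketch. The notes do not state the portions, so the 85% / 10% / 5% train/test/validation split below is an assumption, as are the function name and the placeholder entries that stand in for the 1100 downloaded samples.

```python
# Illustrative sketch; the 85/10/5 portions are an assumption, not taken from the notes.

def split_by_portion(data, train_pct=85, test_pct=10):
    """Slice the list into train/test/validation parts; validation gets the remainder."""
    n = len(data)
    train_end = n * train_pct // 100          # integer arithmetic avoids float rounding
    test_end = train_end + n * test_pct // 100
    return data[:train_end], data[train_end:test_end], data[test_end:]


# Placeholder entries with the same keys as the instruction dataset.
data = [{"instruction": f"task {i}", "input": "", "output": ""} for i in range(1100)]

train_data, test_data, val_data = split_by_portion(data)
print(len(train_data), len(test_data), len(val_data))
```

Splitting the raw list first, before wrapping anything in dataset or loader classes, keeps the three parts disjoint regardless of how they are encoded later.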

5.1 [Align] RLHF

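LLM development flow

Learning stages: Pretraining, SFT, Align - RLHF (PPO)
- Goal
  - Pretraining: learn language structure, common sense, and semantic relations
  - SFT: teach the model "how to answer questions correctly"
  - Align - RLHF (PPO): optimize the model with reinforcement learning so that its output better matches human preferences
- Data
  - Pretraining: massive amounts of web pages, books, code, etc.
  - SFT: human-written Q&A pairs, dialogues, coding tasks, etc.
  - Align - RLHF (PPO): humans rank multiple model answers; the rankings are used to train a reward model
- Method / steps
  - Pretraining: self-supervised language modeling (predict the next word)
  - SFT: standard supervised learning
  - Align - RLHF (PPO): 1. train a reward model; 2. optimize the output with RL (e.g. PPO)

S - Solve sequential decision making problems, i.e. balance short-term and long-term returns across multi-step decisions.
T - The core of RL is to learn an optimal decision policy by interacting with the environment. Feedback (reward or penalty) is obtained by trial and error, and the final goal is to maximize the cumulative reward. Analogy: training a puppy to perform an action. It gets a treat when it does it right (positive reward) and no encouragement when it does it wrong (negative reward), and eventually it learns to "sit" or "shake hands".

S - Not only complete the target task, but also understand subjective human intent and values.
T - Core idea of RLHF: replace or correct the environment reward with subjective human feedback, so that the AI better matches human values. Analogy: a "tutoring class" for the AI. Imagine teaching a child to paint when you cannot directly score each painting (a traditional reward function is hard to design). So you hire an art teacher (the human) to comment on the child's paintings (feedback) and say which lines are more graceful and which color choices need improvement.

T - PPO (Proximal Policy Optimization)
- Core idea of PPO: define a "safe range" so that the trainee can only adjust the training load by a small amount each time, which ensures steady progress
- Analogy: a fitness coach's "safe training plan". Imagine you are a fitness coach and your trainee keeps adjusting their exercises to build muscle (maximize reward). Letting the trainee increase the training load sharply every day (an abrupt policy change) may lead to injury (training collapse)
- Core principles
  - Policy gradient, the basic idea: adjust the policy according to how good or bad an action is (the advantage function). For example, if an action lets the trainee lift more weight (high reward), encourage that action more
  - Problem: if the trainee suddenly tries a movement that is too heavy (an abrupt policy change), it may cause a muscle strain (training collapse)
  - PPO's improvement, the clip mechanism: a "safety threshold" that limits each change in training load to at most ±20% (analogous to the clip threshold ε = 0.2)

A
R

Model: uses SFT (train on human-written Q&A pairs so the model learns "how to answer questions")? Uses RLHF (optimize the model output to better match human preferences)? Notes
- GPT-1 / GPT-2 / GPT-3: SFT ❌, RLHF ❌. Only large-scale unsupervised pretraining (language modeling)
- GPT-3.5: SFT ✅, RLHF ✅. Alignment training with SFT + reward model + RLHF (PPO)
- GPT-4 / GPT-4-turbo: SFT ✅, RLHF ✅. Also uses SFT + RLHF; the training process is more complex and may include newer techniques such as DPO
- ChatGPT (all versions): SFT ✅, RLHF ✅. ChatGPT is a dialogue model obtained by fine-tuning GPT-3.5 / GPT-4 with SFT + RLHF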
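
The clip mechanism described under PPO (changes limited to ±20%, ε = 0.2) corresponds to the clipped surrogate term min(r * A, clip(r, 1 - ε, 1 + ε) * A), where r is the probability ratio between the new and old policy and A is the advantage. The sketch below evaluates that term for single numbers. The function names and the sample values are made up, and a real implementation would average this over batches of sampled tokens.

```python
# Illustrative sketch of PPO's clipped surrogate term for one (ratio, advantage) pair.

def clip(value, low, high):
    """Clamp value into the interval [low, high]."""
    return max(low, min(value, high))


def ppo_clip_term(ratio, advantage, eps=0.2):
    """min(r * A, clip(r, 1 - eps, 1 + eps) * A): the pessimistic of the two estimates."""
    unclipped = ratio * advantage
    clipped = clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return min(unclipped, clipped)


# Good action, policy moved too far (ratio 1.5): the gain is capped at 1.2 * A.
print(ppo_clip_term(1.5, 1.0))
# Bad action, probability already cut in half: the term is held at 0.8 * A.
print(ppo_clip_term(0.5, -1.0))
# Small change inside the safe range: the term is simply r * A.
print(ppo_clip_term(1.1, 2.0))
```

Once the ratio leaves the [0.8, 1.2] range in the direction the advantage favors, the term stops growing, so there is no incentive for a single update to push the policy further. This is the "safe training plan" from the analogy.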