DL & LLM

1.1 [PyTorch] Intro

S - To leverage the power of NVIDIA GPUs for DL, we need a high-level interface.
T - Use a Python framework built on CUDA.

Interface | Type | Contributor | Intro | Used in
PyTorch | DL framework | Meta | Dynamic computation graph, easy to debug | Research
TensorFlow | DL framework | Google | Supports deployment | Production
Keras | High-level API | Independent first, then merged into TF | Easy to use for prototyping; runs on TF by default | Education

R - These frameworks ultimately call highly optimized CUDA routines. They abstract away the complexity of direct CUDA programming, allowing you to write intuitive Python code.

Close look into PyTorch
Core components ...

1.2 [PyTorch] MNIST

Problem statement

FashionMNIST problem statement: Classify grayscale images of fashion items (e.g., shirts, shoes, bags) into one of 10 categories. Like MNIST, each image is 28×28 pixels, but the content is clothing rather than digits.

MNIST problem statement: Classify grayscale images of handwritten digits (0–9) into their correct numeric labels. Each image is 28×28 pixels, and the task is a 10-class classification problem.

The main track uses PyTorch to solve the FashionMNIST problem. The secondary track uses TensorFlow to demonstrate how to solve the MNIST problem. ...

2.1 [DL] Landscape

Quick refresher

Columns: Input | Output | Child | Arch | Activation | Loss

Non-sequential data
- Binary classification | Arch: Multi-layer | Activation: Sigmoid | Loss: Binary cross-entropy
- Multiclass classification | Activation: ReLU/softmax | Loss: Categorical cross-entropy
  Note: Normalization keeps values centered around zero to avoid vanishing gradients.
- Multiclass classification - Image recognition | Arch: CNN
- Regression | Activation: Linear/no activation | Loss: MSE
  Note: Regularization drops units or adds a penalty to avoid overfitting.

Sequential data
- Regression | Arch: RNN/LSTM
- Multiclass classification - Text autocompletion (char-by-char)
- Text autocompletion (token-by-token) | Arch: RNN with embedding
- Translation | Arch: attention+RNN/ED -> self-attention/multi-head/Transformer-RNN
- NLP - understand the context | Arch: BERT
- NLP - text generation | Arch: GPT

Full landscape

Columns: Problem type | Note | Network topology | Activation function | Loss function | Adjust weights

Binary classification - logical OR/AND/NAND functions
- Network topology: Single perceptron
- Activation function: Sign
- Loss function: y != y_hat
- Adjust weights: x[i] is the magnitude, y is the direction; L_R = 0.1; w[i] += y * L_R * x[i]

Binary classification - logical XOR function
- Note: A perceptron cannot separate XOR with a single straight line.
- Network topology: Multiple perceptrons - two-layer network
- Activation function: Sign contains a discontinuity; gradient descent needs continuous functions. Differentiable activation functions: tanh for hidden, logistic for output.
- Loss function: A plain sum of errors has two problems: 1. multiple errors may cancel each other out; 2. the sum depends on the number of examples. Use mean squared error.
- Adjust weights: Gradient: the derivative is the sensitivity. 1. If the slope is steep, the derivative is large and the steps are large. 2. If the slope is positive, subtracting it moves w to the left. w = w - L_R * de/dw

Binary classification - whether a patient has a specific condition, based on a number of input variables (optimized as described below)
- Network topology: logistic output unit
- Activation function: Sigmoid
- Loss function: Binary cross-entropy

Multiclass classification - classify MNIST handwritten digits
- Network topology: Need a multiclass output
- Activation function: Sign
- Loss function: MSE
- Adjust weights: SGD

(DL framework)
- Network topology: Higher level
- Activation function: Configuration
- Loss function: Configuration
- Adjust weights: Automatic

Saturated neurons lead to vanishing gradients. Mitigations:
- Input layer: input normalization/standardization to control the range of the input
- Hidden layer: batch normalization
- Weights in every layer: weight initialization when using the tanh hidden activation function
- Different activation function: ReLU
- Gradient direction diverges -> BATCH_SIZE
- Fixed L_R oscillates when converging -> dynamic L_R of Adam
- Output layer: glorot_uniform initialization for sigmoid; when using a sigmoid function in the output layer, use the binary cross-entropy loss function
- Several sigmoid outputs may have the same value -> softmax is mutually exclusive; use the categorical cross-entropy loss function

Multiclass classification - classify CIFAR-10 objects / the ImageNet dataset with 1000 object classes
- Images need spatial feature extraction -> add convolutional layers (CNN)
- Heavy computation for large images -> max-pooling by model.add(MaxPooling2D(pool_size=(2,2), strides=2))
- Edge pixels are not learned -> padding by padding='same'
- Inefficient CNN -> depth-wise separable convolutions; EfficientNet
- Activation function: convolutional layers use ReLU; the output layer uses softmax
- Loss function: Categorical cross-entropy

(Use a pre-trained network/model, e.g. ResNet-50)
- Architectures: AlexNet; VGGNet - building block; GoogLeNet - Inception module; ResNet - skip connection

(Customize a pre-trained network/model)
- Transfer learning: replace the output layers and retrain
- Fine-tuning the model: retrain the upper layers directly
- Data augmentation: create more training data from existing data

Regression - predict a numerical value rather than a class, e.g. predict demand and price for an item
- Need to output any number without restriction from the activation -> the output layer consists of a single neuron with a linear activation function
- Activation function: Linear/no activation function
- Loss function: MSE

ALL PROBLEMS - training error rises at the end (not good)
- Deeper (adding more layers) and wider (adding more neurons) network

ALL PROBLEMS - overfitting (test error rises at the end, training error does not)
- Regularization: drop out neurons by model.add(Dropout(0.3))
- Weight decay: add a penalty term to the loss function
- Early stopping

Sequential data - Regression - predict book sales based on historical sales data
- The output depends on previous inputs and requires memory. Fully Connected Networks (FCNs) cannot capture temporal dependencies, and they can only process fixed-length inputs, so they cannot handle variable-length sequences.
- Add a recurrent layer (RNN): model.add(SimpleRNN(128, activation='relu'))
- BPTT: unrolled automatically
- In a long-sequence network, repeated weight multiplication leads to vanishing or exploding gradients -> use an LSTM layer instead of an RNN layer: LSTM(64,input_shape=(10,8))

Sequential data - Multiclass classification - text autocompletion (char-by-char)
- How to generate a sequence step by step -> autoregression
- How to avoid greedy selection during the generation process -> beam search (beam size)

Text autocompletion (token-by-token) - speech recognition - translation with RNN
- Neural language models (RNN, but autoregressive token-by-token) with an embedding layer

Translation - RNN/ED + attention
- Translation involves two languages and needs two LMs -> encoder-decoder network
- When processing long texts, it is easy to "forget" earlier information -> attention mechanism
  - ALL: every timestep has all the context
  - FOCUS: dynamically focus on different parts of the context at different time steps
  - Attention network

Self-attention without RNN
- A single attention is serial and slow -> self-attention layer
  - Multiple attentions form an attention layer
  - Replace the RNN with an FCL
  - Self: Q and KV all come from the input itself, which removes the dependency
  - No dependencies between words, so it runs in parallel and is fast
- Captures features in one dimension only -> multi-head self-attention layer: multiple heads can capture different aspects of the features of one input
- Transformer: multi-head attention layer, multi-head self-attention layer, norm layer, multiple encoder-decoder modules

NLP - understand the context
- Better at understanding context; mainly used for comprehension tasks such as classification and labeling.
- Weaker generation capabilities; needs to be combined with other models to perform text generation.
● Sentiment analysis
● Spam analysis
● Classify second sentence as entailment, contradiction, or neutral
● Identify words that answer a question
- Network topology: Encoder LLM - BERT
  - Bidirectional: predicts the middle part from the surrounding context (cloze-style tasks)
  - Pretrained as: ● masked language model ● next-sentence prediction
  - Encoder Representations from Transformers: encoder only

NLP - phrase the problem so that the probability of a given completion can be interpreted as the solution
- Good at generating coherent and logically structured text; suitable for scenarios such as dialogue generation, article continuation, and creative writing.
● Sentiment analysis
● Entailment
● Similarity
● Multiple choice
- Network topology: Decoder LLM - GPT
  - Generative: predicts the next word
  - Pre-trained as an LM
  - Transformer: decoder only, without cross-attention
● GPT-2: Scaling the model makes zero-shot performance much better
● GPT-3: Providing in-context examples (few-shot, learning at inference time) improves accuracy
  - Codex/Copilot: Based on GPT-3, supervised fine-tuned (with code/docstrings/unit tests as data) to generate Python code
  - InstructGPT/ChatGPT: Based on GPT-3, SFT+RLHF, aligns the model with the user's intention
● GPT-4: Fine-tuned with RLHF, aligns the model with multi-modal (text, image) input
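The single-perceptron row of the full landscape (sign activation, update on y != y_hat, w[i] += y * L_R * x[i]) can be sketched in plain Python. This is a minimal illustration, not a definitive implementation: the helper names (`sign`, `train_perceptron`, `predict`), the ±1 encoding of the logical inputs, the extra bias input fixed at 1.0, and the `max_epochs` cap are all assumptions made for the example.

```python
def sign(z):
    # Sign activation: +1 for non-negative input, -1 otherwise (assumed convention).
    return 1 if z >= 0 else -1

def predict(w, x):
    # Prepend a constant 1.0 so that w[0] acts as the bias weight.
    xb = [1.0] + list(x)
    return sign(sum(wi * xi for wi, xi in zip(w, xb)))

def train_perceptron(samples, l_r=0.1, max_epochs=100):
    # samples: list of (inputs, target) pairs with targets in {-1, +1}.
    w = [0.0] * (len(samples[0][0]) + 1)
    for _ in range(max_epochs):
        errors = 0
        for x, y in samples:
            if predict(w, x) != y:
                xb = [1.0] + list(x)
                # x[i] gives the magnitude of the step, y gives its direction.
                for i in range(len(w)):
                    w[i] += y * l_r * xb[i]
                errors += 1
        if errors == 0:  # every sample classified correctly
            break
    return w

# NAND is linearly separable, so a single perceptron can learn it.
nand = [((-1, -1), 1), ((-1, 1), 1), ((1, -1), 1), ((1, 1), -1)]
w_nand = train_perceptron(nand)
nand_ok = all(predict(w_nand, x) == y for x, y in nand)

# XOR is not linearly separable: at least one sample stays misclassified.
xor = [((-1, -1), -1), ((-1, 1), 1), ((1, -1), 1), ((1, 1), -1)]
w_xor = train_perceptron(xor)
xor_ok = all(predict(w_xor, x) == y for x, y in xor)
```

Here `nand_ok` ends up true while `xor_ok` stays false, which is the motivation in the notes for moving to a two-layer network with differentiable activations.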

3.1 [GPT-2] Implementation

QA
Quick refresher
Overall flow
Architecture

GPT-2

    import torch
    import torch.nn as nn

    # TransformerBlock and LayerNorm are defined elsewhere in the implementation.
    class GPTModel(nn.Module):
        def __init__(self, cfg):
            super().__init__()
            self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
            self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
            self.drop_emb = nn.Dropout(cfg["drop_rate"])
            self.trf_blocks = nn.Sequential(
                *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])
            self.final_norm = LayerNorm(cfg["emb_dim"])
            self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)

        def forward(self, in_idx):
            batch_size, seq_len = in_idx.shape
            tok_embeds = self.tok_emb(in_idx)
            pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
            x = tok_embeds + pos_embeds  # Shape [batch_size, num_tokens, emb_size]
            x = self.drop_emb(x)
            x = self.trf_blocks(x)
            x = self.final_norm(x)
            logits = self.out_head(x)
            return logits

MultiHeadAttention

    class MultiHeadAttention(nn.Module):
        def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
            super().__init__()
            assert (d_out % num_heads == 0), \
                "d_out must be divisible by num_heads"

            self.d_out = d_out
            self.num_heads = num_heads
            self.head_dim = d_out // num_heads  # Reduce the projection dim to match desired output dim
            self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
            self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
            self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
            # Upper-triangular causal mask: 1 marks a future position to hide
            self.register_buffer(
                "mask",
                torch.triu(torch.ones(context_length, context_length), diagonal=1))
            self.dropout = nn.Dropout(dropout)
            self.out_proj = nn.Linear(d_out, d_out)  # Linear layer to combine head outputs

        def forward(self, x):
            b, num_tokens, d_in = x.shape
            queries = self.W_query(x)
            keys = self.W_key(x)  # Shape: (b, num_tokens, d_out)
            values = self.W_value(x)

            ####################### decompose #################################
            # We implicitly split the matrix by adding a `num_heads` dimension
            # Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
            queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)
            keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
            values = values.view(b, num_tokens, self.num_heads, self.head_dim)
            # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
            queries = queries.transpose(1, 2)
            keys = keys.transpose(1, 2)
            values = values.transpose(1, 2)
            ####################################################################

            attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head
            attn_scores.masked_fill_(self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)
            attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
            attn_weights = self.dropout(attn_weights)

            ########################## compose #################################
            # Shape: (b, num_tokens, num_heads, head_dim)
            context_vec = (attn_weights @ values).transpose(1, 2)
            # Combine heads, where self.d_out = self.num_heads * self.head_dim
            context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
            ####################################################################

            context_vec = self.out_proj(context_vec)  # optional projection
            return context_vec

Step 1. Data preparation
S - Collecting raw data
T - Raw data is a .txt file with natural language ...
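To make the masking step in `MultiHeadAttention.forward` above easier to follow, here is a dependency-free sketch of what happens for a single head: future positions are set to -inf, the scores are scaled by the square root of the head dimension, and a row-wise softmax turns them into weights. The function name `causal_attention_weights` and the toy vectors are hypothetical, and plain Python lists stand in for tensors; dropout and batching are left out.

```python
import math

def causal_attention_weights(queries, keys):
    # queries, keys: lists of equal-length vectors, one per token (single head).
    head_dim = len(keys[0])
    weights = []
    for i, q in enumerate(queries):
        row = []
        for j, k in enumerate(keys):
            score = sum(a * b for a, b in zip(q, k))  # dot product q_i . k_j
            if j > i:
                score = -math.inf  # token i must not attend to later token j
            row.append(score / math.sqrt(head_dim))  # scale by sqrt(head_dim)
        # Numerically stable softmax; exp(-inf) evaluates to 0.0.
        row_max = max(row)
        exps = [math.exp(s - row_max) for s in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Toy input: 3 tokens, head_dim = 2 (values chosen only for illustration).
toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attn = causal_attention_weights(toy, toy)
```

Each row of `attn` sums to 1, and every entry above the diagonal is exactly 0, which is the same effect the `torch.triu` mask and `masked_fill_` achieve in the class.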

4.1 [SFT] Classification SFT

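The most common ways to fine-tune a language model

Method | Classification fine-tuning | Instruction fine-tuning (following human instructions)
Instruction | The prompt cannot contain instructions | The model is instructed to perform a task, e.g. translate an English sentence into German
Scope | A classification-fine-tuned model can only predict the classes it saw during training. For example, it can decide whether a piece of content is "spam" or "not spam" | An instruction-fine-tuned model can usually perform a much broader range of tasks
Cap | Decide whether a given text is spam |

- A pretrained LLM can do text completion: given an arbitrary fragment as input, it can generate a sentence or write a paragraph.
- However, a pretrained LLM often performs poorly on specific instructions; for example, it cannot follow instructions such as "fix the grammar of this text" or "rewrite this passage in the passive voice".

Classification finetuning (Email Spam)

Datasets

S - Download raw datasets
T - Save to sms_spam_collection/SMSSpamCollection.tsv
SMS: Short Message Service (text messages)
Ham is what you want (literally "ham"): "not spam"
Spam is what you do not want (literally canned luncheon meat): "junk mail" or "junk messages"
A

    import os
    import urllib.error
    import urllib.request
    import zipfile
    from pathlib import Path

    def download_and_unzip_spam_data(url, zip_path, extracted_path, data_file_path):
        if data_file_path.exists():
            print(f"{data_file_path} already exists. Skipping download and extraction.")
            return

        # Downloading the file
        with urllib.request.urlopen(url) as response:
            with open(zip_path, "wb") as out_file:
                out_file.write(response.read())

        # Unzipping the file
        with zipfile.ZipFile(zip_path, "r") as zip_ref:
            zip_ref.extractall(extracted_path)

        # Add .tsv file extension
        original_file_path = Path(extracted_path) / "SMSSpamCollection"
        os.rename(original_file_path, data_file_path)
        print(f"File downloaded and saved as {data_file_path}")

    url = "https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip"
    zip_path = "sms_spam_collection.zip"
    extracted_path = "sms_spam_collection"
    data_file_path = Path(extracted_path) / "SMSSpamCollection.tsv"

    try:
        download_and_unzip_spam_data(url, zip_path, extracted_path, data_file_path)
    except (urllib.error.HTTPError, urllib.error.URLError, TimeoutError) as e:
        # Fall back to the backup mirror if the primary download fails
        print(f"Primary URL failed: {e}. Trying backup URL...")
        url = "https://f001.backblazeb2.com/file/LLMs-from-scratch/sms%2Bspam%2Bcollection.zip"
        download_and_unzip_spam_data(url, zip_path, extracted_path, data_file_path)

S - Data preprocessing
- Load the TSV into a pandas DataFrame.
- For simplicity, we use a smaller dataset (which helps fine-tune the LLM faster) with 747 instances per class.
- The labels are the strings "ham" and "spam" and need to be converted to numbers.
- Create the different datasets.
- We are now dealing with a spam dataset that contains text messages of different lengths. To batch these messages the way we batched text chunks
T ...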
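The preprocessing notes above (balance the classes, turn the string labels into numbers, bring messages of different lengths to a common length for batching) can be sketched without any third-party library. This is only an illustration under stated assumptions: plain lists of `(label, text)` tuples stand in for the pandas DataFrame, the mapping ham -> 0 and spam -> 1 is an assumed convention, and the helper names, the toy rows, the toy token IDs, and `pad_id=0` are all hypothetical.

```python
import random

LABEL_TO_ID = {"ham": 0, "spam": 1}  # assumed numeric encoding

def balance_dataset(rows, seed=123):
    # Undersample the larger class so both classes have the same size.
    ham = [r for r in rows if r[0] == "ham"]
    spam = [r for r in rows if r[0] == "spam"]
    n = min(len(ham), len(spam))
    rng = random.Random(seed)
    balanced = rng.sample(ham, n) + rng.sample(spam, n)
    rng.shuffle(balanced)
    return balanced

def encode_labels(rows):
    # Replace the string label with its integer ID.
    return [(LABEL_TO_ID[label], text) for label, text in rows]

def pad_sequences(token_id_seqs, pad_id=0):
    # Right-pad every sequence to the length of the longest one.
    max_len = max(len(seq) for seq in token_id_seqs)
    return [list(seq) + [pad_id] * (max_len - len(seq)) for seq in token_id_seqs]

# Toy rows (made up for the example): 5 ham messages, 2 spam messages.
rows = [("ham", f"see you at {h}") for h in range(5)] + [
    ("spam", "win a prize now"),
    ("spam", "free entry, text back"),
]
balanced = encode_labels(balance_dataset(rows))
num_ham = sum(1 for label, _ in balanced if label == 0)
num_spam = sum(1 for label, _ in balanced if label == 1)

# Toy token IDs standing in for tokenized messages of different lengths.
padded = pad_sequences([[5, 6, 7], [8], [9, 10]])
```

After balancing, `num_ham` and `num_spam` are equal (2 each for the toy rows), and every row of `padded` has the same length, so the messages can be stacked into one batch.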