Quick refresher

| Input | Output | Note | Arch | Activation | Loss |
|---|---|---|---|---|---|
| Non-sequential data | Binary classification | | Multi-layer perceptron | Sigmoid | Binary cross-entropy |
| | Multiclass classification | Normalization keeps values centered around zero to avoid vanishing gradients | | ReLU / softmax | Categorical cross-entropy |
| | Multiclass classification - image recognition | | CNN | | |
| | Regression | Regularization (dropout or a weight penalty) avoids overfitting | | Linear / no activation | MSE |
| Sequential data | Regression | | RNN/LSTM | | |
| | Multiclass classification - text autocompletion (char-by-char) | | RNN/LSTM | | |
| | Text autocompletion (token-by-token) | | RNN with embedding | | |
| | Translation | | Attention + RNN encoder-decoder → self-attention / multi-head / Transformer (without RNN) | | |
| | NLP - understand the context | | BERT | | |
| | NLP - text generation | | GPT | | |

Full landscape

Each entry below lists: problem type, note, network topology, activation function, loss function, and how the weights are adjusted.
Binary classification - logical OR/AND/NAND functions
- Network topology: a single perceptron
- Activation function: sign
- Loss: update only when y != y_hat (a misclassification)
- Weight adjustment: x[i] gives the magnitude and y gives the direction; with L_R = 0.1, w[i] += y * L_R * x[i]
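
A minimal sketch of this perceptron learning rule in plain NumPy; the dataset (logical AND with ±1 labels) and the number of epochs are illustrative, not from the notes:

```python
# Perceptron rule from above: w[i] += y * L_R * x[i], applied only on misclassification.
import numpy as np

X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])  # bias input + two inputs
y = np.array([-1, -1, -1, 1])                               # logical AND with +/-1 labels
w = np.zeros(3)
L_R = 0.1

for _ in range(20):                        # a few passes over the data
    for x_i, y_i in zip(X, y):
        y_hat = np.sign(w @ x_i) or -1     # sign activation (treat 0 as -1)
        if y_hat != y_i:                   # adjust weights only when the prediction is wrong
            w += y_i * L_R * x_i           # y gives the direction, x the magnitude

print(w, [int(np.sign(w @ x_i) or -1) for x_i in X])
```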
Binary classification - logical XOR function
- A single perceptron cannot separate XOR with one straight line → multi-layer perceptron (a two-layer network).
- The sign function contains a discontinuity; gradient descent needs continuous, differentiable activation functions: tanh for the hidden layer and the logistic function for the output.
- A plain sum of errors is a poor loss: (1) multiple errors may cancel each other out, and (2) the sum depends on the number of examples → use the mean squared error instead.
- Gradient: the derivative measures sensitivity. (1) Where the slope is steep, the derivative is large and the step is large; (2) where the slope is positive, subtract (move to the left).
- Weight adjustment: w = w - L_R * de/dw
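
A minimal sketch (assuming the tf.keras API) of a two-layer network that learns XOR with a tanh hidden layer, a logistic output, MSE loss, and plain gradient descent; convergence can depend on the random initialization:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float32)
y = np.array([[0], [1], [1], [0]], dtype=np.float32)   # XOR targets

model = Sequential([
    Dense(2, activation='tanh', input_shape=(2,)),     # hidden layer: differentiable tanh
    Dense(1, activation='sigmoid'),                    # output layer: logistic
])
model.compile(optimizer=SGD(learning_rate=0.1), loss='mse')  # w = w - L_R * de/dw
model.fit(X, y, epochs=2000, verbose=0)
print(model.predict(X).round())                        # expected: [[0], [1], [1], [0]]
```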
Binary classification - predict whether a patient has a specific condition from a number of input variables (with the optimizations described below)
- Network topology: logistic output unit
- Activation function: sigmoid
- Loss function: binary cross-entropy
Multiclass classification - classify MNIST handwritten digits
- Need a multiclass output (one output neuron per class).
- Baseline setup: sign activation, MSE loss, SGD weight updates.
- With a DL framework, topology, activation, and loss become higher-level configuration; weight adjustment is automatic.
- Saturated neurons lead to vanishing gradients. Mitigations:
  - Input layer: input normalization/standardization to control the range of the inputs.
  - Hidden layers: batch normalization.
  - Weights in every layer: careful weight initialization when using the tanh hidden activation function.
  - Different activation function: ReLU.
- The gradient directions of individual examples diverge → average gradients over a mini-batch (BATCH_SIZE).
- A fixed learning rate oscillates when converging → the dynamic learning rate of Adam.
- Output layer:
  - glorot_uniform initialization with a sigmoid output; when using sigmoid in the output layer, use the binary cross-entropy loss function.
  - Several sigmoid outputs may end up with the same value → softmax makes the classes mutually exclusive; use the categorical cross-entropy loss function.
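
A minimal sketch (assuming the tf.keras API) tying these mitigations together for MNIST: normalized inputs, ReLU hidden units, Glorot initialization, batch normalization, a softmax output, cross-entropy loss, Adam, and mini-batches; the layer sizes are illustrative:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype('float32') / 255.0   # input normalization
x_test = x_test.reshape(-1, 784).astype('float32') / 255.0

model = models.Sequential([
    layers.Dense(128, activation='relu',                 # ReLU avoids saturated tanh/sigmoid units
                 kernel_initializer='glorot_uniform',
                 input_shape=(784,)),
    layers.BatchNormalization(),                          # keep hidden activations centered
    layers.Dense(10, activation='softmax'),               # mutually exclusive classes
])
model.compile(optimizer='adam',                           # dynamic learning rate
              loss='sparse_categorical_crossentropy',     # integer labels
              metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=64, epochs=5,      # mini-batches (BATCH_SIZE)
          validation_data=(x_test, y_test))
```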
Multiclass classification - classify CIFAR-10 objects / the 1000 ImageNet classes
- Images need spatial feature extraction → add convolutional layers (CNN).
- Heavy computation for large images → max-pooling: model.add(MaxPooling2D(pool_size=(2, 2), strides=2))
- Edge pixels are not learned → padding: padding='same'
- Inefficient CNN → depth-wise separable convolutions; EfficientNet.
- Activation function: ReLU in the convolutional layers, softmax in the output layer.
- Loss function: categorical cross-entropy.
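
A minimal sketch (assuming the tf.keras API) of a small CNN for CIFAR-10 with 'same' padding, max-pooling, ReLU convolutional layers, and a softmax output; the filter counts are illustrative and the labels are assumed one-hot:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), padding='same', activation='relu',
                  input_shape=(32, 32, 3)),               # 32x32 RGB CIFAR-10 images
    layers.MaxPooling2D(pool_size=(2, 2), strides=2),     # halves the spatial size
    layers.Conv2D(64, (3, 3), padding='same', activation='relu'),
    layers.MaxPooling2D(pool_size=(2, 2), strides=2),
    layers.Flatten(),
    layers.Dense(10, activation='softmax'),               # 10 CIFAR-10 classes
])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',            # one-hot labels
              metrics=['accuracy'])
```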
(Use a pre-trained network/model, e.g. ResNet-50) - notable architectures:
- AlexNet
- VGGNet - building block
- GoogLeNet - Inception module
- ResNet - skip connection
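
A minimal sketch (assuming the tf.keras applications API) of using a pre-trained ResNet-50 directly for inference; 'cat.jpg' is a hypothetical file name:

```python
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input, decode_predictions
from tensorflow.keras.preprocessing import image

model = ResNet50(weights='imagenet')                       # downloads the ImageNet weights

img = image.load_img('cat.jpg', target_size=(224, 224))    # hypothetical input image
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
preds = model.predict(x)
print(decode_predictions(preds, top=3)[0])                  # top-3 (class id, name, probability)
```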
(Customize a pre-trained network/model)
- Transfer learning - replace the output layers and retrain them.
- Fine-tuning - retrain the upper layers directly.
- Data augmentation - create more training data from existing data.
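
A minimal sketch (assuming the tf.keras API) of transfer learning: freeze a pre-trained ResNet-50 base and replace the output layers with a new head; the 5-class target task is hypothetical:

```python
from tensorflow.keras.applications import ResNet50
from tensorflow.keras import layers, models

base = ResNet50(weights='imagenet', include_top=False, pooling='avg',
                input_shape=(224, 224, 3))
base.trainable = False                           # freeze the pre-trained layers

model = models.Sequential([
    base,
    layers.Dense(5, activation='softmax'),       # hypothetical: 5 new target classes
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
# Fine-tuning: afterwards unfreeze the upper layers of `base` and retrain
# with a small learning rate.
```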
Regression - predict a numeric value rather than a class, e.g. demand and price for an item
- Need to output any number, without restriction from the activation function.
- The output layer consists of a single neuron with a linear activation function.
- Activation function: linear / no activation.
- Loss function: MSE.
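
A minimal sketch (assuming the tf.keras API) of such a regression head; the 8 input features and hidden size are illustrative:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Dense(64, activation='relu', input_shape=(8,)),  # hypothetical: 8 input features
    layers.Dense(1),                                        # single neuron, linear (no) activation
])
model.compile(optimizer='adam', loss='mse', metrics=['mae'])
```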
All problems - the training error stays high (or even rises) toward the end of training, which is not good → make the network deeper (add more layers) and wider (add more neurons).
All problems - overfitting (the test error rises at the end while the training error does not) → regularization:
- Dropout: model.add(Dropout(0.3))
- Weight decay: add a penalty term to the loss function.
- Early stopping.
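
A minimal sketch (assuming the tf.keras API) combining the three regularizers above: dropout, weight decay as an L2 penalty, and early stopping; the layer sizes and penalty strength are illustrative:

```python
from tensorflow.keras import layers, models, regularizers, callbacks

model = models.Sequential([
    layers.Dense(128, activation='relu', input_shape=(784,),
                 kernel_regularizer=regularizers.l2(1e-4)),   # weight decay (penalty term)
    layers.Dropout(0.3),                                      # drop 30% of activations
    layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

early_stop = callbacks.EarlyStopping(monitor='val_loss', patience=3,
                                     restore_best_weights=True)
# model.fit(x_train, y_train, validation_split=0.1, epochs=50,
#           callbacks=[early_stop])
```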
Sequential data - Regression - predict book sales based on historical sales data
- The output depends on previous inputs and requires memory; fully connected networks (FCNs) cannot capture temporal dependencies, and because they only accept fixed-length inputs they cannot handle variable-length sequences.
- Add a recurrent layer (RNN): model.add(SimpleRNN(128, activation='relu'))
- Weight adjustment: backpropagation through time (BPTT); the framework unrolls the network automatically.
- For long sequences, repeated weight multiplication makes gradients vanish or explode → use an LSTM layer instead of a plain RNN layer: LSTM(64, input_shape=(10, 8))
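
A minimal sketch (assuming the tf.keras API) of an LSTM regressor over sliding windows of historical sales; the window of 10 timesteps with 8 features per step is taken from the LSTM(64, input_shape=(10, 8)) note above and is otherwise hypothetical:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.LSTM(64, input_shape=(10, 8)),   # 10 timesteps, 8 features per step
    layers.Dense(1),                        # linear output: next-step sales
])
model.compile(optimizer='adam', loss='mse')
# model.fit(x_windows, y_next, epochs=20)   # x_windows: (n, 10, 8), y_next: (n, 1)
```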
Sequential data - Multiclass classification - Text autocompletion (char-by-char)
- How to generate a sequence step by step → autoregression: feed each prediction back in as the next input (see the sketch below).
- How to avoid greedy selection during generation → beam search: keep the top beam-size candidate sequences at each step.
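
A minimal sketch of greedy autoregressive generation; the `model` and `generate` helper are hypothetical, assuming a character-level model that maps an id sequence to a distribution over the next character:

```python
import numpy as np

def generate(model, seed_ids, steps):
    """seed_ids: list of int character ids; model maps a (1, t) id sequence
    to a (1, vocab_size) distribution over the next character (hypothetical)."""
    ids = list(seed_ids)
    for _ in range(steps):
        probs = model.predict(np.array([ids]), verbose=0)[0]
        ids.append(int(np.argmax(probs)))   # greedy pick; beam search would instead
                                            # keep the beam-size best partial sequences
    return ids
```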
Text autocompletion (token-by-token), speech recognition, translation
- RNN neural language models (the same autoregression, but token-by-token) with an embedding layer.
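
A minimal sketch (assuming the tf.keras API) of a token-by-token RNN language model with an embedding layer; the vocabulary size and layer widths are hypothetical:

```python
from tensorflow.keras import layers, models

VOCAB_SIZE = 10000                                    # hypothetical vocabulary size

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, 128),                # token id -> dense vector
    layers.LSTM(256),                                 # recurrent context over the prefix
    layers.Dense(VOCAB_SIZE, activation='softmax'),   # distribution over the next token
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
```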
Translation - RNN encoder-decoder + attention
- Translation involves two languages, so it needs two language models → encoder-decoder network.
- When processing long texts, it is easy to 'forget' earlier information → attention mechanism:
  - ALL: every decoder timestep has access to the whole context.
  - FOCUS: dynamically focus on different parts of the context at different timesteps (attention network).
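
A minimal sketch of scaled dot-product attention in plain NumPy: each decoder step (a query) weights all encoder states (keys/values) and mixes them into a context vector; shapes and names are illustrative:

```python
import numpy as np

def attention(Q, K, V):
    """Q: (t_q, d) queries, K/V: (t_k, d) keys and values.
    Returns (t_q, d) context vectors, one per query step."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # similarity of each query to every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the source positions
    return weights @ V                                 # weighted mix of the values
```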
Self-attention without RNN
- A single attention on top of an RNN is serial and slow → self-attention layer:
  - multiple attentions form one attention layer
  - replace the RNN with fully connected layers
  - "self": Q, K, and V all come from the same input, which removes the sequential dependency
- With no dependencies between words, the computation runs in parallel and is fast.
- A single attention captures features along one dimension only → multi-head self-attention layer: multiple heads capture different aspects of the features of one input.

Transformer
- multi-head attention layers
- multi-head self-attention layers
- normalization layers
- multiple stacked encoder-decoder modules
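
A minimal sketch (assuming the tf.keras API) of a multi-head self-attention layer, where query, key, and value all come from the same input sequence; the batch, sequence, and model dimensions are illustrative:

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((2, 10, 64))                  # (batch, sequence length, model dim)
mha = layers.MultiHeadAttention(num_heads=8, key_dim=8)
out = mha(query=x, value=x, key=x)                 # self-attention: Q, K, V from the same x
print(out.shape)                                   # (2, 10, 64)
```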
NLP - understand the context → encoder-only LLM: BERT (Bidirectional Encoder Representations from Transformers)
- Better at understanding context; mainly used for comprehension tasks such as classification and labeling.
- Weaker at generation; needs to be combined with other models to perform text generation.
- Typical tasks:
  ● Sentiment analysis
  ● Spam detection
  ● Classify a second sentence as entailment, contradiction, or neutral
  ● Identify the words that answer a question
- Bidirectional: predicts the middle part from the surrounding context (cloze-style tasks).
- Pre-trained as:
  ● a masked language model
  ● next-sentence prediction
- Uses the Transformer encoder only.
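
A minimal sketch (assuming the Hugging Face transformers pipeline API) of BERT's masked-language-model objective in action: predicting the masked middle token from both sides; the prompt is illustrative:

```python
from transformers import pipeline

fill_mask = pipeline('fill-mask', model='bert-base-uncased')
for pred in fill_mask('The doctor said the patient has a heart [MASK].')[:3]:
    print(pred['token_str'], round(pred['score'], 3))   # top-3 candidate fillers
```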
NLP - text generation → decoder-only LLM: GPT (Generative Pre-trained Transformer)
- Phrase the problem so that the probability of a given completion can be interpreted as the solution.
- Good at generating coherent and logically structured text; suitable for dialogue generation, article continuation, and creative writing.
- Typical tasks:
  ● Sentiment analysis
  ● Entailment
  ● Similarity
  ● Multiple choice
- Generative: predicts the next word; pre-trained as a language model.
- Uses the Transformer decoder only, without cross-attention.
- Model line:
  ● GPT-2: scaling the model makes zero-shot performance much better.
  ● GPT-3: providing in-context examples (few-shot, learning at inference time) improves accuracy.
    - Codex/Copilot: based on GPT-3, supervised fine-tuned (with code/docstring/unit-test data) to generate Python code.
    - InstructGPT/ChatGPT: based on GPT-3, SFT + RLHF, to align the model with the user's intention.
  ● GPT-4: fine-tuned with RLHF; aligns the model with multi-modal (text, image) input.
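
A minimal sketch (assuming the Hugging Face transformers pipeline API) of the decoder-only, next-word-prediction setup above, using a small GPT-2 for autoregressive generation; the prompt and length are illustrative:

```python
from transformers import pipeline

generator = pipeline('text-generation', model='gpt2')
out = generator('Deep learning is', max_new_tokens=30, num_return_sequences=1)
print(out[0]['generated_text'])   # the prompt plus the generated continuation
```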