
MLG 033 Transformers

Machine Learning Guide

OCDevel


🗓️ 9 February 2025

⏱️ 42 minutes


Summary

Try a walking desk while studying ML or working on your projects! https://ocdevel.com/walk

Show notes: https://ocdevel.com/mlg/33

3Blue1Brown videos: https://3blue1brown.com/


  • Background & Motivation:

    • RNN Limitations: Sequential processing prevents full parallelization—even with attention tweaks—making them inefficient on modern hardware.
    • Breakthrough: “Attention Is All You Need” replaced recurrence with self-attention, unlocking massive parallelism and scalability.
  • Core Architecture:

    • Layer Stack: Consists of alternating self-attention and feed-forward (MLP) layers, each wrapped in residual connections and layer normalization.
    • Positional Encodings: Since self-attention is permutation-invariant, sinusoidal or learned positional embeddings are added to inject sequence order (see the first sketch after this list).
  • Self-Attention Mechanism:

    • Q, K, V Explained:
      • Query (Q): The representation of the token seeking contextual info.
      • Key (K): The representation of tokens being compared against.
      • Value (V): The information to be aggregated based on the attention scores.
    • Multi-Head Attention: Splits Q, K, V into multiple “heads” to capture diverse relationships and nuances across different subspaces.
    • Dot-Product & Scaling: Computes similarity between Q and K, scaled by √d_k so large dot products don't saturate the softmax, then applies softmax to weight V accordingly (see the attention sketch after this list).
  • Masking:

    • Causal Masking: In autoregressive models, prevents a token from “seeing” future tokens, ensuring proper generation.
    • Padding Masks: Ignore padded (non-informative) parts of sequences to maintain meaningful attention distributions.
  • Feed-Forward Networks (MLPs):

    • Transformation & Storage: Post-attention MLPs apply non-linear transformations; many argue they’re where the “facts” or learned knowledge really get stored.
    • Depth & Expressivity: Their layered nature deepens the model’s capacity to represent complex patterns.
  • Residual Connections & Normalization:

    • Residual Links: Crucial for gradient flow in deep architectures, preventing vanishing/exploding gradients.
    • Layer Normalization: Stabilizes training by normalizing across the feature dimension, improving convergence (see the transformer-block sketch after this list).
  • Scalability & Efficiency Considerations:

    • Parallelization Advantage: Entire architecture is designed to exploit modern parallel hardware, a huge win over RNNs.
    • Complexity Trade-offs: Self-attention’s quadratic complexity with sequence length remains a challenge; spurred innovations like sparse or linearized attention.
  • Training Paradigms & Emergent Properties:

    • Pretraining & Fine-Tuning: Massive self-supervised pretraining on diverse data, followed by task-specific fine-tuning, is the norm.
    • Emergent Behavior: With scale comes abilities like in-context learning and few-shot adaptation, aspects that are still being unpacked.
  • Interpretability & Knowledge Distribution:

    • Distributed Representation: “Facts” aren’t stored in a single layer but are embedded throughout both attention heads and MLP layers.
    • Debate on Attention: While some see attention weights as interpretable, a growing view is that real “knowledge” is diffused across the network’s parameters.
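
To make the positional-encoding bullet concrete, here is a minimal NumPy sketch of the sinusoidal scheme from "Attention Is All You Need". The function name and shapes are illustrative, not something from the episode.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sinusoidal positional encodings."""
    positions = np.arange(seq_len)[:, None]      # (seq_len, 1)
    dims = np.arange(d_model)[None, :]           # (1, d_model)
    # Each pair of dimensions shares a frequency of 1 / 10000^(2i / d_model).
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates             # (seq_len, d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])  # even dimensions get sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])  # odd dimensions get cosine
    return encoding

# Usage: add to the token embeddings before the first attention layer.
# x = token_embeddings + sinusoidal_positional_encoding(seq_len=128, d_model=512)
```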
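
The attention sketch referenced above: scaled dot-product self-attention, softmax(QKᵀ / √d_k)·V, with an optional causal mask as described under Masking. This is a single-head toy with random matrices standing in for learned Q/K/V projections; a real multi-head layer splits Q, K, V into several heads and concatenates their outputs.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)      # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
    Q, K, V: (seq_len, d_k); mask: boolean (seq_len, seq_len), True = may attend."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)   # similarity of each query to each key
    if mask is not None:
        scores = np.where(mask, scores, -1e9)        # masked positions get ~zero weight
    weights = softmax(scores, axis=-1)               # attention distribution over keys
    return weights @ V                               # weighted sum of the values

def causal_mask(seq_len: int) -> np.ndarray:
    """Lower-triangular mask: position i may only attend to positions <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Toy single-head usage: 6 tokens, model width 16, random stand-in projections.
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 16))
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
out = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv, mask=causal_mask(6))
print(out.shape)  # (6, 16)
```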
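
Finally, a sketch of how one transformer block ties the pieces together: the position-wise feed-forward MLP, with each sub-layer wrapped in a residual connection and layer normalization (post-norm, as in the original paper). The attention step is passed in as a callable, e.g. the attention function above with learned projections; names and shapes are again illustrative.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token's feature vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise MLP: expand the hidden size, apply ReLU, project back down."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def transformer_block(x, attention_fn, W1, b1, W2, b2):
    """One post-norm transformer block: each sub-layer (attention, then MLP)
    is wrapped in a residual connection followed by layer normalization."""
    x = layer_norm(x + attention_fn(x))                   # residual around self-attention
    x = layer_norm(x + feed_forward(x, W1, b1, W2, b2))   # residual around the MLP
    return x

# Toy usage with identity "attention" just to exercise the shapes;
# a real block would use masked multi-head self-attention here.
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 16))
W1, b1 = rng.normal(size=(16, 64)), np.zeros(64)   # expand 16 -> 64
W2, b2 = rng.normal(size=(64, 16)), np.zeros(16)   # project 64 -> 16
print(transformer_block(x, lambda t: t, W1, b1, W2, b2).shape)  # (6, 16)
```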

Transcript


0:00.0

Welcome back to Machine Learning Guide. I'm your host, Tyler Renelle. MLG teaches the fundamentals of machine learning and artificial intelligence.

0:09.0

It covers intuition, models, math, languages, frameworks, and more.

0:13.0

Where your other machine learning resources provide the trees, I provide the forest.

0:18.0

Visual is the best primary learning modality, but audio is a great supplement during exercise, commutes, and chores.

0:25.7

Consider MLG your syllabus with highly curated resources for each episode's details at OCdevel.com forward slash MLG.

0:35.6

Speaking of curation, I'm a curator of life hacks, my favorite hack being treadmill desks.

0:40.9

While you study machine learning or work on your machine learning projects, walk.

0:44.8

This helps improve focus by increasing blood flow and endorphins.

0:48.0

This maintains consistency in energy, alertness, focus, and mood.

0:52.6

Get your CDC recommended 10,000 steps while studying or working.

0:56.6

I get about 20,000 steps per day, walking just two miles per hour, which is sustainable without

1:01.1

instability at the mouse or keyboard. Save time and money on your fitness goals. See a link to my

1:06.3

favorite walking desk setup in the show notes. Today we're going to talk about Transformers,

1:10.8

the revolutionary

1:11.9

technology behind large language models, the technology put out by the "Attention Is All You Need"

1:18.0

white paper. And transformers are not as hairy of a concept as you might think they are. If you

1:23.7

tried reading the "Attention Is All You Need" white paper and it was just an earful, then stay tuned to this episode.

1:30.2

I'll try to break it down.

1:31.1

There's also a video by 3Blue1Brown that I'll reference at the end when we talk about the resources.

1:36.5

It's really not as complex as I thought it was.

1:38.7

In fact, it's sort of a step back in technology, given how we'd been getting more and more compounded

1:45.5

and complex in neural network architectures.

...

