Machine Learning
This pretty much sums up the current AI hype cycle. Beyond that, if you really want to understand some of the concepts and mechanics, the notes and links below are a good place to start.
Word Vectors & Embeddings
- The Illustrated Word2vec (Jay Alammar) - https://jalammar.github.io/illustrated-word2vec/
- https://vickiboykis.com/what_are_embeddings/ - download pdf
- https://simonwillison.net/2023/Oct/23/embeddings/
- http://vectors.nlpl.eu/explore/embeddings/en/MOD_enwiki_upos_skipgram_300_2_2021/cat_NOUN/
2013 - Google's word2vec paper: Efficient Estimation of Word Representations in Vector Space
Google’s word vectors had another intriguing property: you could “reason” about words using vector arithmetic. This is where the classic king - man + woman ≈ queen example comes from.
But these associations are also where problems creep in: because these vectors are built from the way humans use words, they end up reflecting many of the biases present in human language.
Words can often have multiple meanings, so meaning depends on context: “John just left” vs. “John is left handed,” or “bank” (financial institution vs. river bank).
Word vectors are a way for LLMs to capture word meaning.
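A minimal sketch of that vector arithmetic, assuming gensim is installed and its downloadable "word2vec-google-news-300" vectors (a large download) are available:

```python
# Sketch of the king - man + woman ≈ queen arithmetic with pretrained word2vec vectors.
import gensim.downloader as api

# Load 300-dimensional word2vec vectors trained on Google News (big download).
wv = api.load("word2vec-google-news-300")

# Add "king" and "woman", subtract "man", then find the nearest words by cosine similarity.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# "queen" should appear at or near the top of the list.

# The same vectors also expose the biases mentioned above, since they are learned
# purely from how humans use words in the training corpus.
```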
Transformers
Current LLMs available online are divided into many layers (transformer blocks) that process word vectors to derive meaning.
- research suggests the first few layers focus on understanding sentence syntax, while later layers work to develop a high-level understanding of a passage
Attention Mechanism: Attention Is All You Need - 2017 paper
Transformer Feed-Forward Layers Are Key-Value Memories - 2020 paper
- feed-forward layers work by pattern matching: each neuron in the hidden layer matches a specific pattern in the input text (see the sketch below)
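A rough toy illustration of that key-value-memory view (my own sketch, not code from the paper): the rows of the first weight matrix act like keys that pattern-match the incoming token vector, and the rows of the second matrix act like values mixed according to those match scores. Dimensions are illustrative, not from any real model.

```python
import numpy as np

d_model, d_hidden = 8, 32                     # illustrative sizes
rng = np.random.default_rng(0)

x = rng.normal(size=d_model)                  # one token's vector entering the feed-forward layer
W_in = rng.normal(size=(d_hidden, d_model))   # "keys": one pattern per hidden neuron
W_out = rng.normal(size=(d_hidden, d_model))  # "values": one output vector per hidden neuron

scores = np.maximum(W_in @ x, 0.0)            # ReLU match score of x against each key pattern
ffn_out = scores @ W_out                      # weighted sum of the value vectors

print(ffn_out.shape)                          # (8,) - same size as the input; in a real model
                                              # this gets added back into the residual stream
```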
Andrej Karpathy (https://github.com/karpathy)
- Neural Network Course: https://github.com/karpathy/nn-zero-to-hero
Grant Sanderson (3Blue1Brown)
- Neural Networks Video Series: https://www.3blue1brown.com/topics/neural-networks
Brendan Bycroft
- LLM Visualization: https://bbycroft.net/llm
- Repository: https://github.com/bbycroft/llm-viz
Important/Useful Papers:
- Efficient Estimation of Word Representations in Vector Space
- A Survey on Contextual Embeddings - prerequisite reading for the 2017 attention paper
- Language Models are Few-Shot Learners (the GPT-3 paper) - an important follow-up to the 2017 attention paper
Paper Collections
- https://github.com/dair-ai/ML-Papers-of-the-Week - check this periodically
- https://github.com/aimerou/awesome-ai-papers
- https://github.com/daturkel/learning-papers
- https://paperswithcode.com/sota - Papers with Code's state-of-the-art leaderboards, tracking the best published results (with code) per benchmark
- https://github.com/dmarx/anthology-of-modern-ml
Embedding
- inputs: the tokens and each token's position
- the token embedding and the position embedding are combined to create the input embedding (one vector per token); see the sketch below
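A minimal sketch of that step, assuming a GPT-style setup with learned token and position embedding tables (sizes here are GPT-2-ish but purely illustrative):

```python
import torch
import torch.nn as nn

vocab_size, n_ctx, d_model = 50257, 1024, 768

tok_emb = nn.Embedding(vocab_size, d_model)   # token embedding table
pos_emb = nn.Embedding(n_ctx, d_model)        # position embedding table

token_ids = torch.tensor([[464, 3290, 3332]])              # (batch=1, T=3) example token IDs
positions = torch.arange(token_ids.shape[1]).unsqueeze(0)  # (1, T): positions 0, 1, 2

# Each token ID picks a row from the token table, each position picks a row from
# the position table; summing them gives the input embedding.
input_embed = tok_emb(token_ids) + pos_emb(positions)      # (1, T, d_model)
print(input_embed.shape)                                   # torch.Size([1, 3, 768])
```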
When the embedding is complete:
- The embedding is then passed through the model, going through a series of layers, called transformer blocks, before reaching the bottom.
- Think of each layer as receiving an input and generating an output; the output is the next iteration of the input, updated in some way. Perhaps think of each layer as a frame of a movie storyboard that is initially penciled in, with each successive layer adding detail, color, and context to the frame (a tiny sketch of this follows below).
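A tiny sketch of that storyboard idea, assuming each layer's update is added onto a running representation (the residual stream, as in GPT-style models); `blocks` here is just an illustrative stand-in for the model's layers:

```python
def forward(input_embed, blocks):
    # input_embed: the (T, d_model) embedding from the previous step
    # blocks: any sequence of layers mapping (T, d_model) -> (T, d_model)
    x = input_embed
    for block in blocks:
        x = x + block(x)   # each layer refines the picture rather than replacing it
    return x
```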
Self Attention Layer - this is basically the phase where the columns in the input embedding “talk to each other.” The purpose is to contextualize each token. If we assume these tokens are words, then each word is associated with other words so that they contain contextual meaning. The word “bank” may be updated (in vector space) to indicate it refers to a financial institution rather than a river bank. If the name of the bank is relevant to the token set, then that too will be associated with the word “bank,” such that if you were to later ask “which bank?” the model could come back and say “Wells Fargo.” The first step is to produce three vectors for each of the T columns from the normalized input embedding matrix. These vectors are the Q, K, and V vectors:
- Q: query vector
- K: key vector
- V: value vector
The Q vector is a little like the bank token broadcasting out to all the other tokens: “I’m a bank token, is anyone out there related to me or talking about me?” The K vectors are a little like the other tokens answering the query token’s question: “Yep, hi, I’m your bank’s name.” The V vectors are what each token contributes once a query and key match; mixing the V vectors according to those matches produces the updated token. This is done for each and every token from the original token embedding (now on iteration x) and is then pushed forward for processing in iteration y.
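A minimal single-head self-attention sketch of that Q/K/V exchange, with illustrative random weights and no causal mask (a real GPT also masks out future tokens so each position only attends to earlier ones):

```python
import numpy as np

T, d_model, d_head = 4, 8, 8                # 4 tokens, illustrative sizes
rng = np.random.default_rng(0)

X = rng.normal(size=(T, d_model))           # normalized input embedding (T columns)
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))
W_v = rng.normal(size=(d_model, d_head))

Q, K, V = X @ W_q, X @ W_k, X @ W_v         # query, key, value vectors for every token

scores = Q @ K.T / np.sqrt(d_head)          # how strongly each token's query matches each key
scores -= scores.max(axis=-1, keepdims=True)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax per row

updated = weights @ V                       # each token's new, contextualized vector
print(updated.shape)                        # (4, 8): same shape as before, now context-aware
```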
<more stuff/steps go here>
And that’s a complete transformer block!
These form the bulk of any GPT model and are repeated a number of times, with the output of one block feeding into the next, continuing the residual pathway.
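A compact sketch of one such block and the stacking, assuming the common pre-norm GPT-style layout (LayerNorm, attention, residual add; LayerNorm, MLP, residual add). The sizes, the block count, and the omitted causal mask are simplifications, not any specific model:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model=768, n_head=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(                 # the feed-forward "key-value memory"
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        a = self.ln1(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]  # self-attention + residual add
        x = x + self.mlp(self.ln2(x))                       # feed-forward + residual add
        return x

blocks = nn.ModuleList([Block() for _ in range(12)])  # e.g. 12 blocks, GPT-2-small-like
x = torch.randn(1, 5, 768)                            # (batch, T, d_model) input embedding
for block in blocks:                                  # output of one block feeds the next
    x = block(x)
print(x.shape)                                        # torch.Size([1, 5, 768])
```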
As is common in deep learning, it’s hard to say exactly what each of these layers is doing, but we have some general ideas: the earlier layers tend to focus on learning lower-level features and patterns, while the later layers learn to recognize and understand higher-level abstractions and relationships. In the context of natural language processing, the lower layers might learn grammar, syntax, and simple word associations, while the higher layers might capture more complex semantic relationships, discourse structures, and context-dependent meaning.