AI

LLM

Basically a "next token predictor".

  • predicts the following tokens in a sequence, one at a time.
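
A minimal sketch of what that loop looks like, assuming a hypothetical `model` that returns a probability for every token in the vocabulary and a hypothetical `tokenizer` (neither is a real API):

```python
# Hypothetical sketch of autoregressive generation: predict one token,
# append it, and predict again. `model` and `tokenizer` are placeholders.
def generate(model, tokenizer, prompt, max_new_tokens=20):
    tokens = tokenizer.encode(prompt)
    for _ in range(max_new_tokens):
        probs = model(tokens)  # one probability per vocabulary token
        next_token = max(range(len(probs)), key=lambda i: probs[i])  # greedy pick
        tokens.append(next_token)
    return tokenizer.decode(tokens)
```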

Training

The main objective of ML training is to minimize loss over time (and hence make more accurate predictions).
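
To make "minimize loss over time" concrete, here is a toy gradient-descent loop fitting a one-parameter model; everything in it is illustrative, not tied to any framework:

```python
# Toy gradient descent: fit y = w * x by minimizing mean squared-error loss.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (x, y) pairs; the true w is 2
w = 0.0    # model parameter, starts wrong
lr = 0.05  # learning rate

for step in range(100):
    # loss = mean((w*x - y)^2); its gradient w.r.t. w is mean(2*x*(w*x - y))
    grad = sum(2 * x * (w * x - y) for x, y in data) / len(data)
    w -= lr * grad  # step downhill: each step makes predictions less wrong

print(w)  # converges toward 2.0
```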

Value/Loss

Value = the data's contribution to the model's output. Loss = a measure of how wrong the model's outputs are.

Loss Functions (Objective Function)

Loss quantifies the difference between a model’s predictions and actual outcomes.

  • measures how wrong the model’s predictions are.

The loss function directly influences the effectiveness of the model's predictions.

  • accuracy of the next token
  • guided learning
    • the loss is fed back as a signal the model uses to adjust its internal weights and biases
  • performance optimization
    • efficient loss minimization enhances model performance, reduces overfitting, and improves generalization to unseen data.

Standard Loss Functions
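
For next-token prediction the standard choice is cross-entropy, which is large when the model puts low probability on the token that actually came next. A minimal plain-Python sketch with made-up probabilities:

```python
import math

def cross_entropy(probs, target_index):
    """Cross-entropy for one prediction: -log(probability of the true token)."""
    return -math.log(probs[target_index])

# A 4-token vocabulary; token 2 is the actual next token.
print(cross_entropy([0.1, 0.2, 0.6, 0.1], target_index=2))  # ~0.51, confident and right
print(cross_entropy([0.7, 0.1, 0.1, 0.1], target_index=2))  # ~2.30, confidently wrong
```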

Tokens, Vectors/Embeddings

Tokens

Words become tokens.

  • the basic unit that can be encoded.
  • tokens are usually a fraction of a word.
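
One way to see tokens as fractions of words is OpenAI's tiktoken library (assuming it is installed; the exact splits depend on the encoding):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4-era models
tokens = enc.encode("Tokenization splits words into subword pieces.")
print(tokens)                             # integer token IDs
print([enc.decode([t]) for t in tokens])  # each ID mapped back to its text fragment
```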

LLMs extract the meaning of a word by observing its context across massive amounts of data.

  • during training, the LLM tracks which words appear near the main word and which do not.

1M tokens are about 750,000 words.

  • how many words do people speak in a day?
    • avg woman: ~5,000
    • avg man: ~2,000

Vector/Embedding

As a result of training, we get a vector (a list of values) that is adjusted based on each word's proximity to the main word in the training data.

  • This vector is known as a word embedding.

A word embedding can have hundreds of values, each representing a different aspect of the main word.

  • the values in an embedding quantify a word’s linguistic features.

Although we do not know what each individual value represents, we know that similar words often have similar embeddings.

  • e.g. "I" and "we" have similar embeddings.
  • Embeddings quantify the closeness.
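
That closeness is commonly measured with cosine similarity; a sketch using made-up 4-dimensional vectors (real embeddings have hundreds of dimensions, and these values are illustrative only):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: closer to 1.0 = more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings (made-up values).
i_vec   = [0.8, 0.1, 0.6, 0.2]
we_vec  = [0.7, 0.2, 0.5, 0.3]
cat_vec = [0.1, 0.9, 0.1, 0.8]

print(cosine_similarity(i_vec, we_vec))   # high: "I" and "we" are similar
print(cosine_similarity(i_vec, cat_vec))  # lower: unrelated words
```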

When we reduce the hundreds of values in each embedding down to just two (x and y), we can visualize the embeddings in 2D space.

  • This is called dimensionality reduction.

When we reduce the dimensions, we see clustering of similar words.
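
A sketch of that reduction using PCA from scikit-learn (assuming it is installed; visualizations often use t-SNE or UMAP instead, and the embeddings here are random stand-ins):

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in "embeddings": 6 words x 300 dimensions of random values.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(6, 300))

# Project each 300-dim embedding down to just two values (x, y).
xy = PCA(n_components=2).fit_transform(embeddings)
for word, (x, y) in zip(["I", "we", "cat", "dog", "run", "walk"], xy):
    print(f"{word}: ({x:.2f}, {y:.2f})")  # these pairs can be plotted in 2D
```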

Grounding

LLMs hallucinate because they are statistical next-token predictors.

Grounding is a process that constrains an LLM so it answers with fewer hallucinations, e.g. by:

  • cross-checking an LLM’s outputs against web search results.
  • providing citations to users so they can verify.
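
A sketch of that grounding flow; the search, LLM, and verification functions below are stubs standing in for real services, just so the pipeline itself runs:

```python
def web_search(question):
    # Stub: a real system would query a search engine here.
    return [{"url": "https://example.com/doc", "text": "Paris is the capital of France."}]

def llm_answer(question, context):
    # Stub: a real system would prompt the LLM with the retrieved context.
    return "Paris is the capital of France."

def supported_by(draft, results):
    # Naive cross-check: is the draft stated verbatim in any retrieved source?
    return any(draft in r["text"] for r in results)

def grounded_answer(question):
    results = web_search(question)                 # 1. fetch external sources
    draft = llm_answer(question, context=results)  # 2. answer from those sources
    if not supported_by(draft, results):           # 3. cross-check the draft
        return "Could not verify an answer against the sources found.", []
    return draft, [r["url"] for r in results]      # 4. citations users can verify

print(grounded_answer("What is the capital of France?"))
```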

RLHF (Reinforcement Learning from Human Feedback)

Human beings also contribute by filling in the gaps and providing feedback.
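
One concrete place human feedback enters is reward-model training: a human picks the better of two model responses, and the reward model is trained so the chosen one scores higher. A sketch of the standard pairwise preference loss (the reward scores here are made up):

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise RLHF reward-model loss: -log(sigmoid(r_chosen - r_rejected)).
    Small when the human-preferred response already scores higher."""
    diff = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

print(preference_loss(2.0, -1.0))  # ~0.05: model agrees with the human
print(preference_loss(-1.0, 2.0))  # ~3.05: model disagrees, strong learning signal
```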