AI

LLM

Basically a "next token predictor".

  • predicts the following tokens in a sequence, one at a time.
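
A minimal sketch of what that loop looks like, assuming a hypothetical `model` that returns a probability for every token in the vocabulary and a hypothetical `tokenizer` (neither is a real API):

```python
# Hypothetical sketch of autoregressive generation: predict one token,
# append it, and predict again. `model` and `tokenizer` are placeholders.
def generate(model, tokenizer, prompt, max_new_tokens=20):
    tokens = tokenizer.encode(prompt)
    for _ in range(max_new_tokens):
        probs = model(tokens)  # one probability per vocabulary token
        next_token = max(range(len(probs)), key=lambda i: probs[i])  # greedy pick
        tokens.append(next_token)
    return tokenizer.decode(tokens)
```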

Training

The main objective of ML training is to minimize loss over time (and hence make more accurate predictions).
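
To make "minimize loss over time" concrete, here is a toy gradient-descent loop fitting a one-parameter model; everything in it is illustrative, not tied to any framework:

```python
# Toy gradient descent: fit y = w * x by minimizing mean squared-error loss.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (x, y) pairs; the true w is 2
w = 0.0    # model parameter, starts wrong
lr = 0.05  # learning rate

for step in range(100):
    # loss = mean((w*x - y)^2); its gradient w.r.t. w is mean(2*x*(w*x - y))
    grad = sum(2 * x * (w * x - y) for x, y in data) / len(data)
    w -= lr * grad  # step downhill: each step makes predictions less wrong

print(w)  # converges toward 2.0
```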

Value/Loss

Value = the data's contribution to the model's output. Loss = a measure of how wrong the model's outputs are.

Loss Functions (Objective Function)

Loss quantifies the difference between a model’s predictions and actual outcomes.

  • measures how wrong the model’s predictions are.

The loss function directly influences the effectiveness of the model's predictions.

  • accuracy of the next token
  • guided learning
    • the loss is fed back as a signal the model uses to adjust its internal weights and biases
  • performance optimization
    • efficient loss minimization enhances model performance, reduces overfitting, and improves generalization to unseen data.

Standard Loss Functions
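
For next-token prediction the standard choice is cross-entropy, which is large when the model puts low probability on the token that actually came next. A minimal plain-Python sketch with made-up probabilities:

```python
import math

def cross_entropy(probs, target_index):
    """Cross-entropy for one prediction: -log(probability of the true token)."""
    return -math.log(probs[target_index])

# A 4-token vocabulary; token 2 is the actual next token.
print(cross_entropy([0.1, 0.2, 0.6, 0.1], target_index=2))  # ~0.51, confident and right
print(cross_entropy([0.7, 0.1, 0.1, 0.1], target_index=2))  # ~2.30, confidently wrong
```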

Tokens, Vectors/Embeddings

Tokens

Words become tokens.

  • the basic unit that can be encoded.
  • tokens are usually a fraction of a word.
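
One way to see tokens as fractions of words is OpenAI's tiktoken library (assuming it is installed; the exact splits depend on the encoding):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4-era models
tokens = enc.encode("Tokenization splits words into subword pieces.")
print(tokens)                             # integer token IDs
print([enc.decode([t]) for t in tokens])  # each ID mapped back to its text fragment
```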

LLMs extract the meaning of a word by observing its context across massive amounts of data.

  • during training, the LLM tracks which words appear near the main word and which do not.

1M tokens are about 750,000 words.

  • how many words do people speak in a day?
    • avg woman: ~5,000
    • avg man: ~2,000

Vector/Embedding

As a result of training, we get a vector (a list of values) that is adjusted based on each word's proximity to the main word in the training data.

  • This vector is known as a word embedding.

A word embedding can have hundreds of values, each representing a different aspect of the main word.

  • the values in an embedding quantify a word’s linguistic features.

Although we do not know what each individual value represents, we know that similar words often have similar embeddings.

  • e.g. "I" and "we" have similar embeddings.
  • Embeddings quantify the closeness.
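
That closeness is commonly measured with cosine similarity; a sketch using made-up 4-dimensional vectors (real embeddings have hundreds of dimensions, and these values are illustrative only):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: closer to 1.0 = more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings (made-up values).
i_vec   = [0.8, 0.1, 0.6, 0.2]
we_vec  = [0.7, 0.2, 0.5, 0.3]
cat_vec = [0.1, 0.9, 0.1, 0.8]

print(cosine_similarity(i_vec, we_vec))   # high: "I" and "we" are similar
print(cosine_similarity(i_vec, cat_vec))  # lower: unrelated words
```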

When we reduce the hundreds of values in each embedding down to just two (x and y), we can visualize the embeddings in 2D space.

  • This is called dimensionality reduction.

When we reduce the dimensions, we see clustering of similar words.
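
A sketch of that reduction using PCA from scikit-learn (assuming it is installed; visualizations often use t-SNE or UMAP instead, and the embeddings here are random stand-ins):

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in "embeddings": 6 words x 300 dimensions of random values.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(6, 300))

# Project each 300-dim embedding down to just two values (x, y).
xy = PCA(n_components=2).fit_transform(embeddings)
for word, (x, y) in zip(["I", "we", "cat", "dog", "run", "walk"], xy):
    print(f"{word}: ({x:.2f}, {y:.2f})")  # these pairs can be plotted in 2D
```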

Grounding

LLMs hallucinate because they are statistical next-token predictors.

Grounding is a process that constrains an LLM so it answers with fewer hallucinations, e.g. by:

  • cross-checking an LLM’s outputs against web search results.
  • providing citations to users so they can verify.
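
A sketch of that grounding flow; the search, LLM, and verification functions below are stubs standing in for real services, just so the pipeline itself runs:

```python
def web_search(question):
    # Stub: a real system would query a search engine here.
    return [{"url": "https://example.com/doc", "text": "Paris is the capital of France."}]

def llm_answer(question, context):
    # Stub: a real system would prompt the LLM with the retrieved context.
    return "Paris is the capital of France."

def supported_by(draft, results):
    # Naive cross-check: is the draft stated verbatim in any retrieved source?
    return any(draft in r["text"] for r in results)

def grounded_answer(question):
    results = web_search(question)                 # 1. fetch external sources
    draft = llm_answer(question, context=results)  # 2. answer from those sources
    if not supported_by(draft, results):           # 3. cross-check the draft
        return "Could not verify an answer against the sources found.", []
    return draft, [r["url"] for r in results]      # 4. citations users can verify

print(grounded_answer("What is the capital of France?"))
```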

RLHF (Reinforcement Learning from Human Feedback)

Human beings also contribute by filling in the gaps and providing feedback.
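
One concrete place human feedback enters is reward-model training: a human picks the better of two model responses, and the reward model is trained so the chosen one scores higher. A sketch of the standard pairwise preference loss (the reward scores here are made up):

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise RLHF reward-model loss: -log(sigmoid(r_chosen - r_rejected)).
    Small when the human-preferred response already scores higher."""
    diff = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

print(preference_loss(2.0, -1.0))  # ~0.05: model agrees with the human
print(preference_loss(-1.0, 2.0))  # ~3.05: model disagrees, strong learning signal
```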