
RAG

What is RAG

LLM versions are frozen in time.

When a model is released, it only contains information up to the point in time its training data was collected.

When new information or context is needed, using the model alone without RAG can cause it to hallucinate or respond that it doesn't know.

  • Hallucination is when a model returns false information because it lacks the relevant training data.

RAG allows us to pass new context and data to an existing model.

  • Creates new data in vector space that becomes available for semantic search.

RAG and Semantic Search

RAG works by injecting external context into the model's prompt at runtime, instead of relying only on the model's pretrained knowledge.

RAG is about retrieval from connected data sources.

  • The LLM queries the data source and receives relevant context back in response.

The purpose is to generate a more accurate, context-aware response.

Semantic search is the method used to find relevant information across uploaded files.

Keyword search looks for exact word matches.

Semantic search finds conceptually similar content - even if the exact terms don't match.
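To make the difference concrete, here is a minimal sketch comparing the two approaches. The documents, the query, and the `text-embedding-3-small` model choice are illustrative assumptions, and the OpenAI SDK call requires an API key:

```python
import math
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

docs = [
    "Refund policy: customers can get their money back within 30 days.",
    "Standard shipping takes 3-5 business days.",
]
query = "How do I return a purchase?"

# Keyword search: only exact word matches count.
print([d for d in docs if "return" in d.lower()])  # [] -- the relevant doc is missed

# Semantic search: compare embedding vectors instead of literal words.
def embed(text):
    return client.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

q = embed(query)
ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
print(ranked[0])  # the refund-policy document ranks first despite sharing no keywords
```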

Vector DB

Semantic search is accomplished using a vector database.

  • Text is stored as embeddings (numerical representations of meaning).

The LLM converts the user's question into a vector and compares it to the stored vectors.

  • Retrieves the most relevant text chunks, as sketched below.
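Conceptually, a vector database is a table of (text chunk, embedding vector) rows plus a nearest-neighbour lookup. A toy in-memory sketch with made-up 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions):

```python
import numpy as np

# Toy "vector store": each entry pairs a text chunk with its embedding vector.
store = [
    ("RAG injects retrieved context into the prompt.", np.array([0.9, 0.1, 0.0])),
    ("Keyword search matches exact words only.", np.array([0.1, 0.8, 0.2])),
    ("Embeddings are numerical representations of meaning.", np.array([0.2, 0.1, 0.9])),
]

def top_k(query_vec, k=2):
    """Rank stored chunks by cosine similarity to the query vector."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    ranked = sorted(store, key=lambda row: cos(query_vec, row[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# The user's question, embedded with the same model as the chunks (faked here).
print(top_k(np.array([0.85, 0.15, 0.05])))
```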

How does retrieval work?

Chunking

  • Files are automatically broken into smaller sections (e.g., paragraphs or logical blocks).
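A minimal sketch of chunking: split on blank lines into paragraphs, then merge short paragraphs up to a size cap. The character cap is an assumption; production systems also overlap chunks and respect token limits:

```python
def chunk_text(text, max_chars=200):
    """Split text into paragraph-sized chunks, merging short paragraphs up to max_chars."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for p in paragraphs:
        if len(current) + len(p) + 2 <= max_chars:
            current = (current + "\n\n" + p).strip()
        else:
            if current:
                chunks.append(current)
            current = p
    if current:
        chunks.append(current)
    return chunks

doc = (
    "RAG adds retrieval.\n\n"
    "Embeddings capture meaning.\n\n"
    "Vector databases store embeddings and support nearest-neighbour search over them for retrieval."
)
print(chunk_text(doc, max_chars=120))  # two chunks: the short paragraphs merged, the long one alone
```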

Embedding

  • Each chunk is converted into an embedding using OpenAI’s embedding models.
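A sketch of the embedding step with the OpenAI Python SDK. `text-embedding-3-small` is one of OpenAI's embedding models; whether the GPT file-search feature uses this exact model internally is an assumption, not something stated above:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

chunks = [
    "RAG injects retrieved context into the prompt at runtime.",
    "Semantic search finds conceptually similar content.",
]

# A single API call can embed a whole batch of chunks.
response = client.embeddings.create(model="text-embedding-3-small", input=chunks)
embeddings = [item.embedding for item in response.data]

print(len(embeddings), len(embeddings[0]))  # 2 chunks, 1536 dimensions each
```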

Storage

  • The embeddings are stored in OpenAI’s internal vector store.
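In the hosted GPT case OpenAI manages this store internally, so there is nothing to implement; as a local stand-in, storage can be as simple as persisting each chunk next to its vector. The file name and JSON layout below are illustrative assumptions:

```python
import json
from openai import OpenAI

client = OpenAI()

chunks = [
    "RAG injects retrieved context into the prompt at runtime.",
    "Semantic search finds conceptually similar content.",
]
embeddings = [
    item.embedding
    for item in client.embeddings.create(model="text-embedding-3-small", input=chunks).data
]

# Persist chunk text alongside its vector; a real system would use a dedicated vector database.
records = [{"id": i, "text": t, "embedding": e} for i, (t, e) in enumerate(zip(chunks, embeddings))]
with open("vector_store.json", "w") as f:
    json.dump(records, f)
```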

Querying

  • When a user asks a question, the GPT creates a vector for the prompt and retrieves semantically similar chunks.
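The query step under the same assumptions: embed the question with the same model used for the chunks, then rank the stored records by cosine similarity:

```python
import json
import numpy as np
from openai import OpenAI

client = OpenAI()

def retrieve(question, k=3, store_path="vector_store.json"):
    """Embed the question and return the k most similar stored chunks."""
    with open(store_path) as f:
        records = json.load(f)
    q = np.array(
        client.embeddings.create(model="text-embedding-3-small", input=question).data[0].embedding
    )
    def score(rec):
        v = np.array(rec["embedding"])
        return float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
    return [r["text"] for r in sorted(records, key=score, reverse=True)[:k]]

print(retrieve("What does RAG do at runtime?"))
```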

Response generation

  • The retrieved chunks are included as context in the GPT's prompt to generate a more informed answer.
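Finally, the retrieved chunks are stitched into the prompt before the chat call. The chunks are hard-coded here so the example stands alone, and `gpt-4o-mini` is an illustrative model choice:

```python
from openai import OpenAI

client = OpenAI()

question = "What does RAG do at runtime?"
# In practice these come from the retrieval step above.
retrieved_chunks = [
    "RAG injects retrieved context into the prompt at runtime.",
    "Semantic search finds conceptually similar content.",
]

prompt = (
    "Answer the question using only the context below.\n\n"
    "Context:\n" + "\n---\n".join(retrieved_chunks) + "\n\n"
    "Question: " + question
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```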

Limitations (as of June 2025)

RAG currently comes with fairly restrictive limits.

  • Only allowing