RAG (Retrieval-Augmented Generation)
Retrieving relevant documents at query time so a language model can answer with grounded, up-to-date information.
RAG is the standard pattern for getting a language model to answer questions about your own data. Instead of fine-tuning the model, you store your documents in a vector database, retrieve the most relevant chunks for each query, and stuff them into the model's context along with the user's question.
A typical RAG pipeline has four stages: chunking (splitting documents into manageable pieces), embedding (turning each chunk into a vector), retrieval (finding the chunks most semantically similar to the query), and generation (asking the LLM to answer using only the retrieved context).
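The four stages above can be sketched end to end using only the standard library. The hashing-based "embedding" and the hard-coded document below are toy stand-ins for a real embedding model and corpus, so this illustrates the pipeline's shape rather than production retrieval quality:

```python
import hashlib
import math

def embed(text, dim=64):
    """Toy embedding: hash each word into a bucket of a fixed-size vector.
    A real system would call a learned embedding model instead."""
    vec = [0.0] * dim
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    # Vectors are already unit-normalized, so the dot product is cosine similarity.
    return sum(x * y for x, y in zip(a, b))

# Stage 1: chunking -- split the document into fixed-size word windows.
def chunk(text, size=12):
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

document = (
    "The refund policy allows returns within 30 days of purchase. "
    "Shipping costs are covered by the customer unless the item is defective. "
    "Gift cards are non-refundable and cannot be exchanged for cash."
)

# Stage 2: embedding -- index every chunk as a vector.
chunks = chunk(document)
index = [(c, embed(c)) for c in chunks]

# Stage 3: retrieval -- rank chunks by similarity to the query vector.
query = "Can I get my money back for a gift card?"
q_vec = embed(query)
ranked = sorted(index, key=lambda pair: cosine(q_vec, pair[1]), reverse=True)
top_chunks = [c for c, _ in ranked[:2]]

# Stage 4: generation -- assemble the grounded prompt (the actual LLM call
# is omitted; any chat-completion API would take this prompt as input).
prompt = ("Answer using only this context:\n"
          + "\n".join(top_chunks)
          + "\n\nQuestion: " + query)
print(prompt)
```

In practice the index lives in a vector database rather than a Python list, and the retrieval step is an approximate nearest-neighbor search, but the data flow is the same.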
RAG is cheaper and faster to update than fine-tuning, but it is harder to get right than it looks: chunk size, embedding model choice, retrieval strategy, and re-ranking all affect answer quality. Most production RAG systems also add hybrid search (combining keyword and vector retrieval) and citation-based generation so users can verify answers against the source documents.
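One common way to combine keyword and vector results into a single hybrid ranking is reciprocal rank fusion (RRF), which needs only each retriever's rank order, not its raw scores. The document ids and rankings below are hypothetical placeholders:

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion: merge several ranked lists of doc ids.
    Each document scores the sum of 1/(k + rank) over the lists it
    appears in; k=60 is the value commonly used in the literature."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings for the same query from two retrievers:
keyword_ranking = ["doc3", "doc1", "doc7"]   # e.g. from BM25 keyword search
vector_ranking  = ["doc1", "doc5", "doc3"]   # e.g. from cosine similarity

fused = rrf([keyword_ranking, vector_ranking])
print(fused)  # doc1 and doc3 appear in both lists, so they rise to the top
```

Because RRF ignores score magnitudes, it sidesteps the problem that BM25 scores and cosine similarities live on incompatible scales.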