The problem: an LLM alone is a liar
A large language model (LLM) is basically a very sophisticated next-word predictor. On its own, it doesn't query databases, access documents, or validate information. If you ask it something outside its training data, it makes stuff up. Researchers call this hallucination. You call it frustration.
I got burned building a chatbot with pure GPT-3.5 and no context. A user asked about internal company processes and got a confident, completely wrong answer. The solution has existed for a while: RAG, or Retrieval-Augmented Generation.
RAG is simple in essence: before asking the model to generate an answer, you fetch relevant documents from your own database and pass them along with the question. This way the model works from your own documents instead of guessing. And the best part? You can run everything locally with Llama 2, without paying for any external API.

Why local Llama 2 changes the game
Llama 2 is Meta's openly available model (the weights are free to download under Meta's license). Run it on your machine: full control, no call limits, no queue, no surprise bill. An RTX 3060 (that mining card) handles the quantized 7B model well. An RTX 4090 runs it beautifully.
The cost difference is brutal. OpenAI charges per token, and at millions of tokens that adds up. Llama 2 runs locally: the cost is just electricity and the GPU you already paid for.
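To put rough numbers on that comparison, here is a back-of-the-envelope sketch. Every figure in it (API price per million tokens, GPU power draw, electricity price, generation speed, monthly volume) is a hypothetical placeholder, not a quoted price, so substitute your own before drawing conclusions.

```python
def api_cost(tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of sending `tokens` through a pay-per-token API."""
    return tokens / 1_000_000 * usd_per_million_tokens


def local_cost(tokens: int, tokens_per_second: float,
               gpu_watts: float, usd_per_kwh: float) -> float:
    """Electricity cost of generating `tokens` on a local GPU."""
    hours = tokens / tokens_per_second / 3600
    return hours * (gpu_watts / 1000) * usd_per_kwh


# All values below are hypothetical placeholders.
monthly_tokens = 50_000_000
api = api_cost(monthly_tokens, usd_per_million_tokens=2.0)
local = local_cost(monthly_tokens, tokens_per_second=30.0,
                   gpu_watts=200.0, usd_per_kwh=0.15)
print(f"API: ${api:.2f}  local electricity: ${local:.2f}")
```

The sketch ignores the purchase price of the GPU, which matters if you don't already own one.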
There's a downside: quality isn't the same. Llama 2 is good, but GPT-4 is better. For RAG, though, that gap shrinks significantly. The model doesn't need to be perfect at reasoning, because it has the facts in hand.
Practical RAG architecture
RAG has three main components:
- Indexing: you take your documents, split them into chunks, convert each into numbers (embeddings), store them in a database that supports similarity search.
- Retrieval: when a question comes in, you convert it to an embedding too and search for the K most similar documents.
- Generation: you pass the question plus relevant documents to the LLM to generate an answer based on these facts.
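The three components above can be sketched without any external library. This is a toy illustration only: the `embed` function here is a bag-of-words counter standing in for a real embedding model, and the names `embed`, `cosine`, and `retrieve` are made up for this example.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(question: str, index: list[tuple[str, Counter]], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Indexing: embed every chunk once and keep the vectors.
chunks = [
    "vacation is 30 business days",
    "each bug is 5 points",
    "meal allowance covers up to $50",
]
index = [(c, embed(c)) for c in chunks]

# Retrieval: embed the question and rank chunks by similarity.
top = retrieve("how many vacation days", index, k=1)

# Generation: the retrieved text goes into the prompt sent to the LLM.
prompt = f"Context:\n{top[0]}\n\nQuestion: how many vacation days\nAnswer:"
print(prompt)
```

A real embedding model replaces word counts with dense vectors that capture meaning, so "time off" can match "vacation" even with no words in common.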
In practice I use it like this:
User question
↓
Convert to embedding (using sentence-transformers)
↓
Search vector database (Chroma or Weaviate)
↓
Retrieve top-5 documents
↓
Build prompt: "Answer based on this: [documents]\n\nQuestion: [question]"
↓
Pass to local Llama 2
↓
Return answer with source
All running on your machine, offline (except downloading the model the first time).
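The "build prompt" and "return answer with source" steps in the flow can be sketched as a small helper. This is one possible shape, not a fixed recipe: the function name `build_prompt`, the `(source_id, text)` pairs, and the source ids are assumptions made for illustration.

```python
def build_prompt(question: str, docs: list[tuple[str, str]]) -> str:
    """Build a RAG prompt from (source_id, text) pairs.

    Each document is tagged with its source id so the model can cite it.
    """
    context = "\n".join(f"[{source_id}] {text}" for source_id, text in docs)
    return (
        "Answer based on this, and cite the source id you used:\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical retrieved documents with made-up source ids.
docs = [("hr-policy", "Vacation is 30 business days"),
        ("benefits", "Meal allowance covers up to $50")]
print(build_prompt("What is the vacation policy?", docs))
```

Tagging each snippet with an id gives you something concrete to show the user next to the answer, and something to check when the answer looks wrong.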
Step-by-step implementation
1. Install dependencies
You need Python 3.9+. Start like this:
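pip install llama-cpp-python sentence-transformers chromadb huggingface_hub
llama-cpp-python runs Llama 2 on CPU/GPU. sentence-transformers generates embeddings. chromadb is a simple, fast vector database. huggingface_hub downloads the model file in the next step. For GPU offloading, llama-cpp-python needs to be installed with GPU support enabled; check its install instructions for your hardware.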
2. Download the model
Llama 2 comes in quantized versions to fit on modest GPUs. I use the Q4 version (4-bit quantization). Recent llama-cpp-python releases load the GGUF format rather than the older GGML files, so get the GGUF build from huggingface.co:
from huggingface_hub import hf_hub_download
model_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGUF",
    filename="llama-2-7b-chat.Q4_K_M.gguf"
)
The first run downloads about 4 GB. After that the file stays in the local cache.
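To see why 4-bit quantization matters for a modest GPU, here is a rough weight-only size estimate. It is a simplification: real quantized files come out somewhat larger than this, because some tensors are stored at higher precision and the file carries metadata. The function name is made up for this sketch.

```python
def approx_model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough weight-only size in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"7B at {label}: ~{approx_model_size_gb(7e9, bits):.1f} GB")
```

For 7 billion parameters this gives about 14 GB at 16-bit and about 3.5 GB at 4-bit, which is the difference between needing a high-end card and fitting on a mid-range one.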
3. Load documents and create index
from sentence_transformers import SentenceTransformer
import chromadb
# Your document bank (text file, PDF, whatever)
docs = [
"HR Policy: vacation is 30 business days",
"Scoring system: each bug is 5 points",
"Benefit: meal allowance covers up to $50"
]
# Embedding model
embed_model = SentenceTransformer('distiluse-base-multilingual-cased-v2')
# Chroma: vector database
client = chromadb.Client()
collection = client.create_collection(name="docs")
# Index
for i, doc in enumerate(docs):
    embedding = embed_model.encode(doc).tolist()
    collection.add(
        ids=[f"doc_{i}"],
        embeddings=[embedding],
        documents=[doc]
    )
Done. Your documents are indexed.
4. Retrieval + Generation
from llama_cpp import Llama
# Load model
llm = Llama(
    model_path=model_path,
    n_gpu_layers=32,  # more is better (if your GPU can handle it)
    n_ctx=2048
)
# User question
query = "What is the vacation policy?"
# Search for relevant docs
query_embedding = embed_model.encode(query).tolist()
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=3
)
# Build context
context = "\n".join(results["documents"][0])
# Final prompt
prompt = f"""Use the context below to answer.
Context:
{context}
Question: {query}
Answer:"""
# Generate response
response = llm(prompt, max_tokens=256, temperature=0.1)
print(response["choices"][0]["text"])
Run this and you'll get an answer based on your documents, not hallucination.
Production tips
Smart chunking: don't throw entire documents into the index. Split them into pieces of 200-500 tokens, so a large PDF becomes 50+ chunks. That way search brings back the relevant snippet instead of information buried in noise.
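A minimal chunking sketch, under two assumptions that are mine rather than the article's: words are used as a rough stand-in for tokens, and consecutive chunks overlap a little so a sentence cut at a boundary still appears whole in one of them. The function name and default sizes are illustrative.

```python
def chunk_words(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks of at most `chunk_size` words.

    Words are used as a rough stand-in for tokens.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Synthetic 1000-word document for demonstration.
sample = " ".join(f"w{i}" for i in range(1000))
pieces = chunk_words(sample, chunk_size=300, overlap=50)
print(len(pieces), "chunks")
```

For real documents, splitting on paragraph or heading boundaries first and only then enforcing a size limit usually gives cleaner chunks than a fixed window.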
Low temperature: during generation, use temperature=0.1 or even 0. You want near-deterministic answers based on facts, not creative ones.
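To see what temperature does, here is a small sketch of temperature-scaled softmax, the standard way sampling temperature is applied to a model's raw scores. The logits are made-up numbers for three candidate tokens, not output from a real model.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities, scaled by temperature (> 0)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (1.0, 0.1):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At temperature 1.0 the top token gets a bit under two thirds of the probability mass; at 0.1 it gets almost all of it. Temperature 0 can't go through this formula (division by zero), so samplers typically treat it as "always pick the top token".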
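Relevance validation: search doesn't always bring back good docs. Set a threshold: if the similarity score is below 0.6 (a starting point to tune on your data), tell the user you don't have documents on that. Note that Chroma returns distances, where lower means closer, so convert them to a similarity or flip the comparison. Better to be honest than make something up.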
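A sketch of that threshold check. It assumes the results dictionary has the shape a Chroma query returns for a single question (lists nested one level under "documents" and "distances"), and that the collection was configured for cosine distance so that similarity = 1 - distance. Neither the helper name nor the cosine setup comes from the code above, so treat both as assumptions.

```python
def filter_relevant(results: dict, min_similarity: float = 0.6) -> list[tuple[str, float]]:
    """Keep (document, similarity) pairs at or above the threshold.

    Assumes `results` has the shape returned by a Chroma query for a single
    question, and that distances are cosine distances (similarity = 1 - distance).
    """
    docs = results["documents"][0]
    distances = results["distances"][0]
    kept = []
    for doc, dist in zip(docs, distances):
        similarity = 1.0 - dist
        if similarity >= min_similarity:
            kept.append((doc, similarity))
    return kept


# Hand-written stand-in for a query result, so the sketch runs without Chroma.
fake_results = {
    "documents": [["HR Policy: vacation is 30 business days",
                   "Scoring system: each bug is 5 points"]],
    "distances": [[0.25, 0.75]],
}
relevant = filter_relevant(fake_results)
if relevant:
    print("Use as context:", [doc for doc, _ in relevant])
else:
    print("I don't have documents on that.")
```

If nothing passes the threshold, skip the model call entirely and return the "no documents" message.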
Retrieval logs: save which documents were used to generate each answer. When a user says an answer is wrong, you can see exactly which document caused the problem.
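One simple way to keep such logs is an append-only JSON Lines file. This is a minimal sketch; the file name, the record fields, and both helper names are assumptions chosen for illustration.

```python
import json
import tempfile
import time
from pathlib import Path


def log_retrieval(log_path: Path, question: str, doc_ids: list[str], answer: str) -> None:
    """Append one JSON line recording which documents backed an answer."""
    record = {
        "timestamp": time.time(),
        "question": question,
        "doc_ids": doc_ids,
        "answer": answer,
    }
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def find_by_doc(log_path: Path, doc_id: str) -> list[dict]:
    """Return every logged answer that used the given document."""
    with log_path.open(encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    return [r for r in records if doc_id in r["doc_ids"]]


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "retrieval_log.jsonl"
    log_retrieval(path, "What is the vacation policy?", ["doc_0"], "30 business days")
    log_retrieval(path, "How are bugs scored?", ["doc_1"], "5 points each")
    hits = find_by_doc(path, "doc_0")
    print(hits[0]["question"])
```

When a document turns out to be outdated, `find_by_doc` tells you which past answers it affected.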
Real problems you'll face
Slow performance? Llama 2 7B on GPU handles about 5-10 responses per minute. If you need more, either upgrade to a better GPU or switch to a smaller model in the 2B-3B range (Llama 2 itself starts at 7B).
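Before buying hardware, it helps to measure what you actually get. Here is a small timing sketch; `fake_generate` is a stand-in that just sleeps, so swap in your real model call to get meaningful numbers. Both function names are made up for this example.

```python
import time
from typing import Callable


def responses_per_minute(generate: Callable[[str], str], prompts: list[str]) -> float:
    """Time `generate` over the prompts and return responses per minute."""
    start = time.perf_counter()
    for p in prompts:
        generate(p)
    elapsed = time.perf_counter() - start
    return len(prompts) / elapsed * 60 if elapsed > 0 else float("inf")


def fake_generate(prompt: str) -> str:
    """Stand-in for the real model call; sleeps to simulate work."""
    time.sleep(0.01)
    return "answer"


rate = responses_per_minute(fake_generate, ["q1", "q2", "q3"])
print(f"{rate:.0f} responses/minute")
```

Measure with prompts and answer lengths that look like your real traffic, since generation time grows with the number of tokens produced.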
Very technical documents? General-purpose sentence-transformers models often struggle with domain-specific jargon. Try fine-tuning an embedding model on your own documents. Embedding fine-tuning is faster than LLM fine-tuning.
Lots of documents? In my experience, Chroma with local files gets slow above 100k chunks. Migrate to Weaviate or Pinecone (Pinecone is a hosted service, so that part is no longer local). The logic is the same, just better performance.
Still getting hallucinations? It could be a bad chunk in retrieval or a weak model. Try increasing n_results (search for more documents) and add a prefix to your prompt: "Answer ONLY based on the provided context. If you don't know, say so."
Next steps
Start small: 5 documents, test on your machine, see if it works. Then scale — more docs, better embedding model, production vector database. RAG is simple, but the details make a difference.
If you want even more control, study fine-tuning Llama 2. But for most cases, local RAG already solves it: less hallucination, private data, no per-token API bill.