RAG Document Assistant: Answer Questions from Your Own Docs with Ollama, ChromaDB and Docker
26 February, 2026 AI
Every company with an internal wiki has the same pattern. HR policies live on Confluence or Notion. Every week, someone asks in Slack: "When can I start using my private medical?", "How much does the company put into the pension?", "Can I carry over unused holiday?" The answers are always in the docs. Nobody searches the docs. Instead, they message HR directly — and every answer to an already-documented question is time HR cannot spend on the backlog work that would actually move things forward.
The obvious fix is a chatbot that reads the documentation and answers questions from it. The non-obvious part is how to build one that actually works — and does not send your internal docs to a third-party API.
This article builds exactly that: a chatbot that indexes .txt files locally, answers questions from them using a local LLM, and runs entirely in Docker. No API keys, no data leaving your machine.
What RAG Is
A raw LLM only knows what it was trained on. It cannot answer questions about your product documentation, your internal policies, or any text it has never seen. Fine-tuning is one solution, but it is expensive, time-consuming, and requires retraining every time the docs change.
RAG — Retrieval-Augmented Generation — is the simpler alternative:
- Index — split your documents into chunks, embed each chunk as a vector, store vectors in a database
- Retrieve — when a question arrives, embed it with the same model and find the most similar chunks
- Generate — pass the retrieved chunks as context to the LLM, ask it to answer from that context only
The LLM never needs to know the docs exist in advance. The relevant text is handed to it at query time. Updating the docs means re-running the indexing step — no model retraining.
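The index and retrieve steps can be sketched in plain Python. The embed_toy helper below is a deliberately crude stand-in (a bag-of-words count vector, not a real embedding model), but the mechanics are the same ones the project uses with nomic-embed-text: embed the chunks once, embed the question with the same function, rank by similarity.

```python
import math
from collections import Counter

def embed_toy(text: str) -> Counter:
    # Toy "embedding": a bag-of-words count vector.
    # A real model (e.g. nomic-embed-text) returns a dense float vector instead.
    return Counter(w.strip(".,?!%") for w in text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Index: embed each chunk once and store the vectors.
chunks = [
    "The company contributes 6% of qualifying earnings to the pension.",
    "Employees may carry over up to 5 unused holiday days.",
]
index = [(chunk, embed_toy(chunk)) for chunk in chunks]

# Retrieve: embed the question with the same model, rank chunks by similarity.
question = "How much does the company put into the pension?"
q_vec = embed_toy(question)
best_chunk, _ = max(index, key=lambda item: cosine(q_vec, item[1]))
print(best_chunk)  # the pension chunk ranks highest
```

The generate step is then a single LLM call with best_chunk pasted into the prompt, which is exactly what the app does later.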
Choosing a Vector Store
The vector store is the database that holds the embeddings. There are quite a few options, and the right one depends on how far you want to take the project.
Embedded libraries — no separate service required:
- ChromaDB — the one we use. PersistentClient writes to a directory, which maps cleanly to a Docker volume. Simple Python API, stores both vectors and the original document text, supports metadata filtering. Gets you running in 10 minutes.
- FAISS (Meta) — extremely fast, CPU and GPU, widely used in research. But it only stores vectors — you manage the document text separately. No server to run, but persistence requires saving and loading the index file manually.
- LanceDB — newer, also file-based, supports SQL-style filtering and Lance columnar storage. A good alternative to ChromaDB if you want richer querying on metadata.
- DuckDB + vss extension — if you prefer SQL all the way down. The vector similarity search extension adds an array_cosine_similarity function to DuckDB. Works well if you are already querying metadata with SQL.
Separate services — another Docker Compose entry:
- Qdrant — the strongest production choice of the group. Official Docker image, REST and gRPC APIs, rich payload filtering, good documentation. Adds one service to docker-compose.yml but gives you a proper production-grade store if the project outgrows a single container.
- Weaviate — feature-rich, GraphQL API, optional built-in vectorisers. Heavy (~500 MB image) and over-engineered for most small projects.
- Milvus — distributed, scales to billions of vectors. Meaningful only at a scale where you also have a dedicated ops team.
- pgvector — a PostgreSQL extension that adds a vector column type and HNSW/IVFFlat indexes. The ideal choice if PostgreSQL is already in your stack: one CREATE EXTENSION and you are done. Adds no new service if Postgres is already there.
- Redis with the Search module — vectors stored alongside your other application data. Useful if Redis is already your session or cache layer.
- Elasticsearch — the search veterans added dense vector support years ago. Makes sense if you need hybrid search (BM25 keyword + vector) and already run an Elastic cluster.
For this project, ChromaDB is the right call. There is no extra service to run, no HTTP client to configure, and the data lives in a Docker volume like everything else. If you later need payload filtering at scale or a multi-node setup, migrating to Qdrant is straightforward — the embedding and retrieval logic stays the same, only the client calls change.
The Stack
- Docker + Docker Compose — Ollama and the app run as services
- ollama/ollama — local LLM server, no install required
- gemma3:4b — Google's 4B model for generation, 3 GB RAM
- nomic-embed-text — embedding model that converts text to vectors, runs inside Ollama
- ChromaDB — lightweight vector store; runs as an embedded Python library and persists embeddings to disk
- Python 3.12 — glues it all together
- FastAPI — serves the API and the chat UI
nomic-embed-text runs inside Ollama alongside the generation model — one server, two models. ChromaDB is an embedded Python library with no separate service required. The indexed data persists in a Docker volume.
How It Works
.txt files → chunk → embed → ChromaDB
↑
question → embed ───────────────┘
→ retrieve top-3 chunks
→ LLM (gemma3:4b) + context
→ streamed answer
The embedding model converts both the documents and the query into the same vector space. Similarity search finds the chunks most likely to contain the answer. The LLM synthesises a response from those chunks and streams it back to the browser token by token.
Project Structure
document-assistant/
├── docker-compose.yml
├── Dockerfile
├── .env
├── ingest.py # indexes docs into ChromaDB (run once)
├── app.py # FastAPI chatbot
├── requirements.txt
├── docs/
│ ├── benefits.txt
│ └── pension.txt
└── static/
└── index.html
Sample Docs
Create two text files in docs/. These represent typical HR policy documents — exactly the kind of content employees rarely read but constantly ask questions about.
docs/benefits.txt:
Employee Benefits Policy
Effective date: 1 January 2026
1. Health & Wellbeing
----------------------
All permanent employees are eligible for private medical insurance through
Bupa from their first day of employment. Cover includes GP consultations,
specialist referrals, diagnostic tests, and inpatient treatment. Dental and
optical cover is included at the Essential tier; employees may upgrade to
the Comprehensive tier and pay the difference via salary sacrifice.
Mental health support is available through the Employee Assistance Programme
(EAP). Employees have access to up to 8 free counselling sessions per year.
The EAP helpline is available 24/7 and is fully confidential.
A gym subsidy of £50 per month is available to all permanent employees after
passing the three-month probationary period. Claims are submitted monthly via
the HR portal and are reimbursed with the following month's payroll.
2. Annual Leave
----------------
The standard annual leave entitlement is 25 days per year plus UK bank
holidays. Employees may carry over up to 5 unused days into the following
leave year. Carried-over days must be used by 31 March or they are forfeited.
Employees may purchase up to 5 additional days of annual leave per year
through the flexible benefits scheme. Elections are made during the annual
benefits window in November and the cost is deducted via salary sacrifice
across 12 months.
3. Life Assurance
------------------
All permanent employees are covered by a group life assurance policy paying
4x annual basic salary to the employee's nominated beneficiary in the event
of death in service. Employees should nominate a beneficiary via the HR
portal. Nominations can be updated at any time.
4. Income Protection
---------------------
Long-term income protection pays 75% of basic salary if an employee is
unable to work due to illness or injury for more than 13 consecutive weeks.
Payments continue until the employee returns to work, reaches state pension
age, or the policy expires, whichever comes first.
5. Flexible Benefits Window
----------------------------
The annual benefits window opens each November for elections effective from
1 January. Outside the annual window, changes can only be made within 30 days
of a qualifying life event (marriage, birth of a child, change in partner's
employment status).
docs/pension.txt:
Company Pension Scheme
Effective date: 1 January 2026
1. Overview
------------
The company operates a defined contribution pension scheme through Nest.
All eligible employees are automatically enrolled on their start date.
Eligible employees are those aged between 22 and state pension age earning
above the auto-enrolment threshold (£10,000 per year as of 2026).
2. Contribution Rates
----------------------
The company contributes 6% of qualifying earnings. The employee minimum
contribution is 3%, giving a combined minimum of 9%.
Employees may increase their personal contribution at any time via the
Nest online portal or by contacting the People team. Contributions above
3% attract no additional employer match.
Qualifying earnings are calculated on the band between the lower earnings
limit (£6,240) and the upper earnings limit (£50,270) for the 2024/25
tax year.
3. Salary Sacrifice
--------------------
Pension contributions are made via salary sacrifice by default. This means
both the employee and employer save on National Insurance contributions.
The employee's take-home pay reduction is lower than the nominal pension
contribution because the deduction is made before NI is calculated.
Employees who prefer to opt out of salary sacrifice and make contributions
from net pay may do so by contacting payroll before the 15th of the month.
4. Vesting
-----------
Employer contributions vest immediately. There is no minimum service period
to retain employer contributions already paid into the scheme.
5. Investment Choices
----------------------
Nest offers a range of investment funds. New members are defaulted into the
Nest Retirement Date Fund aligned to their expected retirement year. Members
can change their fund selection at any time via the Nest portal with no charge.
6. Opting Out
--------------
Employees may opt out of the pension scheme within one month of enrolment
by contacting Nest directly. Any contributions made before opting out will
be refunded. Employees who opt out are automatically re-enrolled every three
years in line with legal requirements.
7. Additional Voluntary Contributions
---------------------------------------
Employees may make one-off additional voluntary contributions (AVCs) directly
via the Nest portal. AVCs benefit from tax relief at the employee's marginal
rate but do not attract employer match.
8. On Leaving the Company
---------------------------
When an employee leaves, their pension pot remains in Nest and continues
to be invested. Employees may choose to transfer the pot to a new employer's
scheme or to a personal pension. Nest does not charge for transfers out.
Docker Compose
services:
ollama:
image: ollama/ollama
ports:
- "11434:11434"
volumes:
- ollama-data:/root/.ollama
healthcheck:
test: ["CMD", "ollama", "list"]
interval: 10s
timeout: 5s
retries: 10
model-pull:
image: ollama/ollama
depends_on:
ollama:
condition: service_healthy
environment:
OLLAMA_HOST: http://ollama:11434
entrypoint: >
sh -c "ollama pull ${OLLAMA_MODEL:-gemma3:4b} &&
ollama pull ${EMBED_MODEL:-nomic-embed-text}"
restart: "no"
ingest:
build: .
profiles: ["setup"]
depends_on:
model-pull:
condition: service_completed_successfully
environment:
OLLAMA_HOST: http://ollama:11434
EMBED_MODEL: ${EMBED_MODEL:-nomic-embed-text}
CHROMA_DIR: /chroma
DOCS_DIR: /docs
volumes:
- ./docs:/docs:ro
- chroma-data:/chroma
entrypoint: ["python", "ingest.py"]
app:
build: .
depends_on:
model-pull:
condition: service_completed_successfully
environment:
OLLAMA_HOST: http://ollama:11434
OLLAMA_MODEL: ${OLLAMA_MODEL:-gemma3:4b}
EMBED_MODEL: ${EMBED_MODEL:-nomic-embed-text}
CHROMA_DIR: /chroma
volumes:
- chroma-data:/chroma
ports:
- "8080:8080"
volumes:
ollama-data:
chroma-data:
Four services:
- ollama — the model server. Stores models in a named volume. A healthcheck ensures the API is ready before anything else proceeds.
- model-pull — pulls both models in sequence with a single sh -c call. Runs once and exits. On subsequent starts it checks the cache and exits in under a second — the models are already stored in the shared volume.
- ingest — reads docs/, chunks the text, embeds each chunk, and writes to ChromaDB. Assigned to the setup profile so it does not run on every docker compose up.
- app — the FastAPI chatbot. Reads from the ChromaDB volume, answers queries against Ollama.
# .env
OLLAMA_MODEL=gemma3:4b
EMBED_MODEL=nomic-embed-text
Dockerfile and Dependencies
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY ingest.py app.py ./
COPY static/ static/
ENTRYPOINT ["python", "app.py"]
# requirements.txt
ollama>=0.6.1
chromadb>=1.5.1
fastapi>=0.113
uvicorn>=0.41
Python version note. The Dockerfile uses Python 3.12 rather than the latest 3.14. ChromaDB currently relies on Pydantic V1 internals, which are incompatible with Python 3.14+ — you will see UserWarning: Core Pydantic V1 functionality isn't compatible with Python 3.14 or greater at startup and the app will not run correctly. Python 3.12 avoids this. Once ChromaDB drops its Pydantic V1 dependency the version constraint lifts. (Checked February 2026.)
Indexing: ingest.py
The ingest script runs once. It reads .txt files, splits them into overlapping word-based chunks, generates an embedding for each chunk, and stores everything in ChromaDB.
# ingest.py
import os
from pathlib import Path
import chromadb
import ollama
DOCS_DIR = os.getenv("DOCS_DIR", "docs")
CHROMA_DIR = os.getenv("CHROMA_DIR", "chroma")
EMBED_MODEL = os.getenv("EMBED_MODEL", "nomic-embed-text")
CHUNK_WORDS = 120
OVERLAP_WORDS = 20
ollama_client = ollama.Client(host=os.getenv("OLLAMA_HOST", "http://localhost:11434"))
chroma_client = chromadb.PersistentClient(path=CHROMA_DIR)
def chunk_text(text: str) -> list[str]:
words = text.split()
chunks, start = [], 0
while start < len(words):
chunk = " ".join(words[start : start + CHUNK_WORDS])
if chunk:
chunks.append(chunk)
start += CHUNK_WORDS - OVERLAP_WORDS
return chunks
def embed(texts: list[str]) -> list[list[float]]:
return [
ollama_client.embeddings(model=EMBED_MODEL, prompt=t).embedding
for t in texts
]
def main() -> None:
collection = chroma_client.get_or_create_collection("docs")
doc_files = sorted(Path(DOCS_DIR).glob("*.txt"))
if not doc_files:
print(f"No .txt files found in {DOCS_DIR}")
return
for doc_file in doc_files:
text = doc_file.read_text(encoding="utf-8")
chunks = chunk_text(text)
embeddings = embed(chunks)
ids = [f"{doc_file.stem}_{i}" for i in range(len(chunks))]
collection.upsert(
ids=ids,
embeddings=embeddings,
documents=chunks,
metadatas=[{"source": doc_file.name}] * len(chunks),
)
print(f"Indexed {len(chunks)} chunks from {doc_file.name}")
print(f"Done. Collection has {collection.count()} chunks total.")
if __name__ == "__main__":
main()
Chunking strategy. 120 words per chunk with a 20-word overlap. The overlap ensures that sentences split across chunk boundaries do not lose context. Too small and a single fact gets fragmented; too large and the retrieved context is noisy. 100-150 words works well for prose documentation.
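As a quick sanity check on the arithmetic: the window advances by CHUNK_WORDS - OVERLAP_WORDS = 100 words per step, so a 300-word document yields three chunks, each sharing its first 20 words with the previous chunk's tail. The snippet reuses the same chunk_text logic as ingest.py against a synthetic document:

```python
CHUNK_WORDS = 120
OVERLAP_WORDS = 20

def chunk_text(text: str) -> list[str]:
    # Same sliding-window logic as ingest.py.
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunk = " ".join(words[start : start + CHUNK_WORDS])
        if chunk:
            chunks.append(chunk)
        start += CHUNK_WORDS - OVERLAP_WORDS
    return chunks

doc = " ".join(f"w{i}" for i in range(300))  # a 300-word stand-in document
chunks = chunk_text(doc)
print(len(chunks))  # 3 chunks: words 0-119, 100-219, 200-299
print(chunks[0].split()[-20:] == chunks[1].split()[:20])  # True: 20-word overlap
```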
Re-indexing. upsert rather than add makes the script idempotent — existing chunk IDs are updated in place if their content changes. Add a document, re-run ingest, restart nothing.
The Assistant: app.py
The app has two responsibilities: retrieve relevant chunks for a question, then stream the LLM's answer back to the browser.
# app.py
import os
from pathlib import Path
import chromadb
import ollama
from fastapi import FastAPI
from fastapi.responses import HTMLResponse, StreamingResponse
from pydantic import BaseModel
CHROMA_DIR = os.getenv("CHROMA_DIR", "chroma")
EMBED_MODEL = os.getenv("EMBED_MODEL", "nomic-embed-text")
LLM_MODEL = os.getenv("OLLAMA_MODEL", "gemma3:4b")
TOP_K = int(os.getenv("TOP_K", "3"))
ollama_client = ollama.Client(host=os.getenv("OLLAMA_HOST", "http://localhost:11434"))
chroma_client = chromadb.PersistentClient(path=CHROMA_DIR)
collection = chroma_client.get_collection("docs")
SYSTEM_PROMPT = """You are a helpful support assistant.
Answer questions using only the provided context.
If the answer is not in the context, say clearly that you do not know.
Do not invent information."""
app = FastAPI()
class ChatRequest(BaseModel):
question: str
def retrieve(question: str) -> list[str]:
embedding = ollama_client.embeddings(model=EMBED_MODEL, prompt=question).embedding
results = collection.query(query_embeddings=[embedding], n_results=TOP_K)
return results["documents"][0]
def stream_answer(question: str, context: list[str]):
context_text = "\n\n---\n\n".join(context)
user_message = f"Context:\n{context_text}\n\nQuestion: {question}"
for part in ollama_client.chat(
model=LLM_MODEL,
messages=[
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": user_message},
],
stream=True,
):
yield part.message.content
@app.post("/chat")
def chat(req: ChatRequest) -> StreamingResponse:
context = retrieve(req.question)
return StreamingResponse(stream_answer(req.question, context), media_type="text/plain")
@app.get("/", response_class=HTMLResponse)
def index() -> str:
return Path("static/index.html").read_text(encoding="utf-8")
if __name__ == "__main__":
import uvicorn
uvicorn.run(app, host="0.0.0.0", port=8080)
The /chat endpoint:
- Embeds the question with nomic-embed-text
- Queries ChromaDB for the 3 most similar chunks
- Passes those chunks and the question to gemma3:4b via a system prompt that restricts the answer to the provided context
- Streams the response token by token
Streaming matters here. A typical response takes 5-15 seconds to generate on CPU. Without streaming, the browser hangs on a blank page for the full duration. With streaming, the first tokens arrive in under a second and the user can read while the model is still writing.
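The mechanics are easy to see with a plain generator, which is all StreamingResponse needs: it iterates stream_answer and flushes each yielded piece to the client. In this stand-in, slow_tokens fakes the Ollama call with a small sleep per token; the point is that the first token is available long before the full answer is:

```python
import time

def slow_tokens():
    # Stands in for the Ollama streaming call: each token costs generation time.
    for token in ["The ", "company ", "contributes ", "6% ", "of ", "qualifying ", "earnings."]:
        time.sleep(0.01)
        yield token

start = time.perf_counter()
stream = slow_tokens()
first = next(stream)                 # first token arrives after one token's latency
first_at = time.perf_counter() - start
full = first + "".join(stream)       # draining the rest takes the remaining time
total = time.perf_counter() - start

print(full)
print(first_at < total)  # True: the user starts reading before generation finishes
```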
The Chat UI
<!-- static/index.html -->
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Document Assistant</title>
<style>
*, *::before, *::after { box-sizing: border-box; }
body { margin: 0; font-family: system-ui, sans-serif; background: #f4f4f5; display: flex; flex-direction: column; height: 100vh; }
#messages { flex: 1; overflow-y: auto; padding: 1.5rem; display: flex; flex-direction: column; gap: 1rem; }
.bubble { max-width: 720px; padding: .75rem 1rem; border-radius: 12px; line-height: 1.5; white-space: pre-wrap; word-break: break-word; }
.user { background: #3b82f6; color: #fff; align-self: flex-end; }
.bot { background: #fff; border: 1px solid #e4e4e7; align-self: flex-start; }
#form { display: flex; gap: .5rem; padding: 1rem; background: #fff; border-top: 1px solid #e4e4e7; }
#input { flex: 1; padding: .6rem .9rem; border: 1px solid #d4d4d8; border-radius: 8px; font-size: 1rem; }
button { padding: .6rem 1.2rem; background: #3b82f6; color: #fff; border: none; border-radius: 8px; cursor: pointer; font-size: 1rem; }
button:disabled { opacity: .5; cursor: default; }
</style>
</head>
<body>
<div id="messages"></div>
<form id="form">
<input id="input" type="text" placeholder="Ask a question about the docs…" autocomplete="off">
<button type="submit" id="btn">Send</button>
</form>
<script>
const messages = document.getElementById('messages');
const form = document.getElementById('form');
const input = document.getElementById('input');
const btn = document.getElementById('btn');
function addBubble(className, text = '') {
const div = document.createElement('div');
div.className = `bubble ${className}`;
div.textContent = text;
messages.appendChild(div);
messages.scrollTop = messages.scrollHeight;
return div;
}
form.addEventListener('submit', async (e) => {
e.preventDefault();
const question = input.value.trim();
if (!question) {
return;
}
input.value = '';
btn.disabled = true;
addBubble('user', question);
const bot = addBubble('bot', '...');
const response = await fetch('/chat', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ question }),
});
const reader = response.body.getReader();
const decoder = new TextDecoder();
let text = '';
while (true) {
const { done, value } = await reader.read();
if (done) break;
text += decoder.decode(value, { stream: true });
bot.textContent = text;
messages.scrollTop = messages.scrollHeight;
}
btn.disabled = false;
input.focus();
});
</script>
</body>
</html>
The UI is 60 lines of self-contained HTML. No framework, no build step. The key part is the streaming fetch: the browser reads chunks via ReadableStream and appends each token to the bubble as it arrives. The model's output becomes visible in real time.
Running It
First start: pull models and index docs.
# Start Ollama and pull both models (~2.5 GB download on first run)
docker compose up -d
# Index your documents (run once, or when docs change)
docker compose --profile setup run --rm ingest
Output from the ingest step:
Indexed 4 chunks from benefits.txt
Indexed 4 chunks from pension.txt
Done. Collection has 8 chunks total.
Start the chatbot:
docker compose up app
Open http://localhost:8080. Ask "How much does the company contribute to the pension?" and the answer streams in, sourced directly from pension.txt.
To update your knowledge base: drop new .txt files in docs/, re-run ingest, and the chatbot knows the new content immediately. No restart required.
Taking It Further
Source attribution. The collection.query call already returns results["metadatas"] alongside the documents. Expose those in the /chat response and display "Source: faq.txt" below each answer so users can verify.
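A sketch of what that could look like. The dummy results dict below mirrors the shape collection.query returns (parallel lists under documents and metadatas, one inner list per query); the format_sources helper is an illustration, not part of the project:

```python
def format_sources(results: dict) -> str:
    # results mimics a ChromaDB query result: parallel lists per query.
    sources = {meta["source"] for meta in results["metadatas"][0]}
    return "Sources: " + ", ".join(sorted(sources))

# Dummy result in the shape collection.query(...) returns for a single query.
results = {
    "documents": [["chunk about contributions", "chunk about vesting"]],
    "metadatas": [[{"source": "pension.txt"}, {"source": "pension.txt"}]],
}
print(format_sources(results))  # Sources: pension.txt
```

In the app itself you would return this string alongside (or appended to) the streamed answer.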
Multiple file formats. The ingest script currently reads .txt. Adding PDF support means adding pypdf and a small adapter that extracts text before chunking. The same logic applies to Markdown, HTML, or DOCX — swap the reader, keep everything else.
Larger context window. gemma3:4b has a 128K context window. For long documents, increase TOP_K from 3 to 10 and pass more context per query. Tune to your corpus — more chunks means more context but also more noise for the model to reason through.
GPU acceleration. Add the deploy.resources block to the ollama service as shown in the Local LLM feedback analyser article. Streaming latency drops from seconds to milliseconds per token.
Persistent conversation. Pass previous messages in the messages array to the Ollama chat call. The retrieval step can be conditioned on the full conversation history rather than just the last question, which improves context continuity across follow-up questions.
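A minimal sketch of that messages array. The build_messages helper and the history list are illustrative, but the list-of-dicts shape is exactly what the Ollama chat call accepts:

```python
SYSTEM_PROMPT = "You are a helpful support assistant."

def build_messages(history: list[dict], context: str, question: str) -> list[dict]:
    # Prior turns sit between the system prompt and the new question,
    # so the model can resolve follow-ups like "and what about opting out?".
    return (
        [{"role": "system", "content": SYSTEM_PROMPT}]
        + history
        + [{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}]
    )

history = [
    {"role": "user", "content": "How much does the company contribute?"},
    {"role": "assistant", "content": "The company contributes 6% of qualifying earnings."},
]
messages = build_messages(history, "retrieved chunks here", "And the employee minimum?")
print([m["role"] for m in messages])  # ['system', 'user', 'assistant', 'user']
```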
Metadata filtering. Once you have multiple document categories — product docs, legal, HR — add a category field to the metadata and filter at query time: collection.query(where={"category": "legal"}, ...). This prevents irrelevant sections from polluting the context.
Production deployment. Remove the ports mapping from ollama, put nginx in front of FastAPI, and mount docs/ from a network share or object storage. The rest stays the same. When you need a proper production vector store — multi-node, high availability, rich filtering — migrate to Qdrant: the embedding and retrieval logic stays identical, only the client calls change.
When This Pattern Reaches Its Limits
RAG works well when answers exist in your documents and a user asks a direct question. It struggles with:
- Synthesis questions — "summarise everything about billing" requires reading many chunks, and the top-3 retrieval window may miss relevant material
- Implicit knowledge — things your docs assume but do not state explicitly
- Very large corpora — beyond a few hundred documents, retrieval quality degrades without metadata filtering or a re-ranking step between retrieval and generation
For a product FAQ or internal wiki under a few hundred pages, RAG is the right tool. For a legal corpus or a codebase with thousands of files, you will need hybrid search (keyword + vector), metadata filtering, or a re-ranker before the generation step.
The full project — including docker-compose.yml, Dockerfile, ingest script, app, and sample docs — is on GitHub.