RAG
Documents flow through a RAG pipeline in eight stages. Six are implemented (ingest, chunk, embed, retrieve, generate, evaluate); test coverage is minimal (unit tests only); monitor is scaffolded but broken on main. This note documents the nadir implementation as of main — config defaults, exact mechanics, and where the code has gaps.
Before reading:
nadir is a separate Go repo (github.com/Chandra179/nadir); this vault note documents it. Retrieval-quality concepts (recall@k, MRR, NDCG, faithfulness, LLM-as-judge) live in ai/evaluation.md.
Ingestion
IngestService.Run lists markdown files across one or more roots (knowledge_base.path plus knowledge_base.paths, merged and deduped via AllPaths()), filters out paths matching pkb.ignore_patterns (glob, with dir/** prefix matching), diffs each file’s SHA-256 against the store in a single paginated scroll (GetAllFileSHAs, page size 1000), and dispatches changed files to 8 concurrent workers (const ingestWorkers = 8). Unchanged files are skipped; changed files are upserted in place by deterministic point ID.
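The diff-then-dispatch flow above can be sketched as follows. This is a minimal illustration, not nadir's code: `fileSHA`, `changedFiles`, and `processAll` are hypothetical names, the stored hashes are a plain map standing in for the result of GetAllFileSHAs, and only the worker count (8) comes from the note.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"sync"
)

// ingestWorkers mirrors the constant named in the note.
const ingestWorkers = 8

// fileSHA returns the hex-encoded SHA-256 of a file's contents.
func fileSHA(content []byte) string {
	sum := sha256.Sum256(content)
	return hex.EncodeToString(sum[:])
}

// changedFiles returns, sorted, the paths whose current hash differs from
// the stored hash or that have no stored hash at all.
func changedFiles(current map[string][]byte, stored map[string]string) []string {
	var changed []string
	for path, content := range current {
		if stored[path] != fileSHA(content) {
			changed = append(changed, path)
		}
	}
	sort.Strings(changed)
	return changed
}

// processAll fans paths out to a fixed pool of workers and returns how many
// paths were handled.
func processAll(paths []string, handle func(string)) int {
	jobs := make(chan string)
	var wg sync.WaitGroup
	var mu sync.Mutex
	done := 0
	for i := 0; i < ingestWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range jobs {
				handle(p)
				mu.Lock()
				done++
				mu.Unlock()
			}
		}()
	}
	for _, p := range paths {
		jobs <- p
	}
	close(jobs)
	wg.Wait()
	return done
}

func main() {
	current := map[string][]byte{"a.md": []byte("alpha"), "b.md": []byte("beta v2")}
	stored := map[string]string{"a.md": fileSHA([]byte("alpha")), "b.md": fileSHA([]byte("beta"))}
	changed := changedFiles(current, stored)
	n := processAll(changed, func(p string) { fmt.Println("upsert", p) })
	fmt.Println("processed", n)
}
```

Files missing from the stored map hash-compare against the empty string, so new files are treated as changed without a separate branch.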
Docling converts PDF to Markdown (pdfs/raw → pdfs/converted); the recursive chunker separately strips Docling’s HTML-comment artifacts.
- Deterministic IDs: chunkID(filePath, lineStart, chunkIndex) = UUIDv5 (uuid.NewSHA1 over a private namespace). The same input always maps to the same point, so upserts replace rather than duplicate.
- Contextual embedding: before embedding, each chunk is prefixed with filePath > header\n (or just filePath\n when the chunk has no heading) (Anthropic 2024). This anchors chunk semantics to document structure without altering stored text. Embedding is batched: OllamaEmbedder.EmbedBatch sends every chunk in a file to Ollama /api/embed in one round-trip (Pipeline uses the BatchEmbedder fast path).
- Qdrant collections: auto-configured on startup with dense vectors (Cosine distance), a named sparse vector with the IDF modifier, a full-text index on text, and a keyword index on file_path.
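A stdlib-only sketch of the deterministic ID and the contextual prefix described above. Assumptions: the namespace below is the public RFC 4122 DNS namespace used as a placeholder (nadir's private namespace is not shown in this note), the name string uses the filePath:lineStart:chunkIndex layout mentioned under Known Gaps, and `uuidV5` and `embedInput` are hypothetical helpers (the repo itself uses uuid.NewSHA1).

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// namespace is a placeholder: the RFC 4122 DNS namespace UUID.
var namespace = [16]byte{0x6b, 0xa7, 0xb8, 0x10, 0x9d, 0xad, 0x11, 0xd1,
	0x80, 0xb4, 0x00, 0xc0, 0x4f, 0xd4, 0x30, 0xc8}

// uuidV5 hashes namespace+name with SHA-1, keeps the first 16 bytes, and
// stamps the version (5) and RFC 4122 variant bits.
func uuidV5(ns [16]byte, name string) string {
	h := sha1.New()
	h.Write(ns[:])
	h.Write([]byte(name))
	b := h.Sum(nil)[:16]
	b[6] = (b[6] & 0x0f) | 0x50
	b[8] = (b[8] & 0x3f) | 0x80
	s := hex.EncodeToString(b)
	return s[0:8] + "-" + s[8:12] + "-" + s[12:16] + "-" + s[16:20] + "-" + s[20:32]
}

// chunkID derives a stable point ID from the chunk's position in its file.
func chunkID(filePath string, lineStart, chunkIndex int) string {
	return uuidV5(namespace, fmt.Sprintf("%s:%d:%d", filePath, lineStart, chunkIndex))
}

// embedInput builds the text sent to the embedder; the stored chunk text
// itself is left unchanged.
func embedInput(filePath, header, text string) string {
	if header == "" {
		return filePath + "\n" + text
	}
	return filePath + " > " + header + "\n" + text
}

func main() {
	fmt.Println(chunkID("notes/search.md", 122, 2))
	fmt.Printf("%q\n", embedInput("notes/search.md", "1.3 Weighted A* Search", "chunk body"))
}
```

Because the ID depends on lineStart, any edit that shifts a chunk's starting line yields a different ID for otherwise identical text.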
Chunking
Two chunkers, selected by chunker.provider:
- Recursive (recursive, default): extracts sections by heading, then splits oversized sections by paragraph, then by sentence, then by word. Separators in order: "\n\n", "\n", ".", " ". A TOC heuristic drops chunks whose lines are mostly bare page numbers (threshold 0.6).
- Sentence Window (sentence-window): indexes at sentence granularity but stores a surrounding window (default 3 sentences before and after) as retrieval context.
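The sentence-window idea can be sketched in a few lines. This assumes sentences are already split; `windowChunk` and `sentenceWindows` are hypothetical names, and only the default of 3 sentences on each side comes from the note.

```go
package main

import (
	"fmt"
	"strings"
)

// windowSize is the note's default: 3 sentences before and after.
const windowSize = 3

// windowChunk pairs the sentence that is indexed with the wider context
// that is stored alongside it.
type windowChunk struct {
	Text       string // single sentence, used for indexing
	WindowText string // sentence plus up to w neighbours on each side
}

// sentenceWindows builds one chunk per sentence, clamping the window at the
// start and end of the document.
func sentenceWindows(sentences []string, w int) []windowChunk {
	out := make([]windowChunk, 0, len(sentences))
	for i, s := range sentences {
		lo := i - w
		if lo < 0 {
			lo = 0
		}
		hi := i + w + 1
		if hi > len(sentences) {
			hi = len(sentences)
		}
		out = append(out, windowChunk{
			Text:       s,
			WindowText: strings.Join(sentences[lo:hi], " "),
		})
	}
	return out
}

func main() {
	doc := []string{"A* is complete.", "It needs a heuristic.", "Weighted A* trades optimality for speed."}
	for _, c := range sentenceWindows(doc, windowSize) {
		fmt.Printf("%q -> %q\n", c.Text, c.WindowText)
	}
}
```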
Chunk size is measured in UTF-8 runes, not tokens. Default chunk_size: 512, chunk_overlap: 64 (both any positive integer, unconstrained). The 4-chars/token estimate is applied only later, at generator prompt truncation.
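A minimal sketch of recursive splitting with rune-based sizing. It is not nadir's chunker: heading extraction, chunk_overlap, and the TOC heuristic are left out, the separator list is an assumption reconstructed from the paragraph/sentence/word order described above, and `splitRecursive` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// separators are tried from coarsest to finest (assumed list).
var separators = []string{"\n\n", "\n", ".", " "}

// runeLen measures size in runes, not bytes or tokens.
func runeLen(s string) int { return utf8.RuneCountInString(s) }

// splitRecursive returns trimmed chunks of at most size runes. Pieces that
// are still too large after the last separator are cut on rune boundaries.
func splitRecursive(text string, size int, seps []string) []string {
	if size <= 0 {
		return nil
	}
	var out []string
	emit := func(s string) {
		if s = strings.TrimSpace(s); s != "" {
			out = append(out, s)
		}
	}
	if runeLen(text) <= size {
		emit(text)
		return out
	}
	if len(seps) == 0 {
		r := []rune(text)
		for len(r) > size {
			emit(string(r[:size]))
			r = r[size:]
		}
		emit(string(r))
		return out
	}
	buf := ""
	for _, part := range strings.SplitAfter(text, seps[0]) {
		if runeLen(buf)+runeLen(part) <= size {
			buf += part // greedily pack parts into the current chunk
			continue
		}
		emit(buf)
		buf = ""
		if runeLen(part) <= size {
			buf = part
		} else {
			out = append(out, splitRecursive(part, size, seps[1:])...)
		}
	}
	emit(buf)
	return out
}

func main() {
	text := "First paragraph with a few words.\n\nSecond paragraph, slightly longer than the first one."
	for i, c := range splitRecursive(text, 40, separators) {
		fmt.Printf("%d (%d runes): %q\n", i, runeLen(c), c)
	}
}
```

Measuring in runes keeps multi-byte characters from being cut mid-character, at the cost of chunk sizes that only loosely track token counts.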
Embedding & Storage
OllamaEmbedder calls Ollama /api/embed with nomic-embed-text (768 dimensions). Sparse scoring has two providers: tf (zero-dependency fallback) and splade (calls the SPLADE sidecar, model prithivida/Splade_PP_en_v1).
embedder:
  provider: "ollama"
  model: "nomic-embed-text"
  ollama_addr: "http://localhost:11434"
  dimensions: 768
sparse_scorer:
  provider: "splade" # "tf" (zero deps) | "splade" (requires sidecar)
  addr: "http://localhost:5001"
Qdrant stores dense (and, when a sparse embedder is wired, sparse) vectors alongside payload metadata:
{
  "header": "1.3 Weighted A* Search",
  "window_text": "",
  "file_path": "week 3 informed search and heuristic function.md",
  "line_start": 122,
  "chunk_index": 2,
  "source_sha": "b2f71659eee1eb2a3a377ecc1327bd9ead16552ec6c8cc101f040d187e8b8e6d",
  "text": "finds a solution in [ C ∗ , WC ∗ ], but usually closer to C ∗ .\nTo modify A* algorithm to Weighted A*, just change line 14 in Algorithm 2 to Equation 3."
}
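For reference, the payload above maps onto a Go struct like the one below. The struct and function names are hypothetical; only the JSON keys are taken from the sample.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ChunkPayload mirrors the JSON keys of the sample payload.
type ChunkPayload struct {
	Header     string `json:"header"`
	WindowText string `json:"window_text"`
	FilePath   string `json:"file_path"`
	LineStart  int    `json:"line_start"`
	ChunkIndex int    `json:"chunk_index"`
	SourceSHA  string `json:"source_sha"`
	Text       string `json:"text"`
}

// decodePayload parses one payload document.
func decodePayload(raw []byte) (ChunkPayload, error) {
	var p ChunkPayload
	err := json.Unmarshal(raw, &p)
	return p, err
}

func main() {
	raw := []byte(`{"header":"1.3 Weighted A* Search","window_text":"","line_start":122,"chunk_index":2}`)
	p, err := decodePayload(raw)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Printf("%s (line %d, chunk %d)\n", p.Header, p.LineStart, p.ChunkIndex)
}
```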
Distance metric: Cosine, exclusively (dense collection and semantic cache).
Server-side vs client-side hybrid. The store supports both, selected at query time:
- Client-side (active): dense Search + BM25 Scroll + client SPLADE rescore + manual RRF. This is the only path wired in server.go today.
- Server-side (exists, not wired): QueryPoints with dense+sparse prefetch legs and Qdrant-native Fusion_RRF in a single round-trip. Gated on store.WithSparseEmbedder(...), which server.go never calls, so sparse vectors are never stored at ingest and this branch is unreachable in the current build.
Retrieval
Query
│
▼
Semantic Cache ──── hit (score ≥ threshold, generate=false) ──► return cached result
│ miss
▼
Query Transformation (HyDE) [Optional]
└── LLM generates hypothetical doc → embed → avg vector
│
▼
Hybrid Search (client-side, active)
├── Dense: Qdrant ANN (nomic-embed-text vec)
└── Sparse: BM25 Scroll → SPLADE rescore
└── RRF fusion (k=60): score = 1/(60+denseRnk) + 1/(60+bm25Rnk)
│
▼
Reranker
└── cross-encoder/ms-marco-MiniLM-L-6-v2 (sidecar :5002)
│
▼
Results: []{ file_path, header, line_start, score, text }
Semantic Cache
A dedicated Qdrant collection (pkb_cache) caches results keyed by query-embedding similarity. On a hit (top-1 cosine ≥ threshold) the cached result returns immediately, skipping the retrieval pipeline — store search, reranker, chunk filter, and generator (the cache still embeds the query to perform the lookup). The cache hit path runs only when generate=false and query is non-empty — generation and keyword-only requests always run the full pipeline. On a miss, the pipeline runs and the result is written to cache asynchronously (go semanticCache.Set(...)); generate requests also populate the cache even though they never read from it.
{
  "cached_at": "2026-04-26T15:23:38Z",
  "results_json": "{Variants\",\"LineStart\":327,\"ChunkIndex\":0,\"Vector\":null,\"SparseIndices\":null}",
  "query": "In Monte Carlo Tree Search, how do we calculate UCB?"
}
- TTL: default 24h; 0 disables expiry.
- Threshold (cosine):
  - 0.85–0.90: high recall, allows paraphrased queries
  - 0.90–0.95: balanced (default 0.90)
  - >0.95: near-identical only
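The hit rules above (eligibility, cosine threshold, TTL) reduce to three small predicates. This is a sketch under assumptions: all function names are hypothetical, and where nadir actually enforces the TTL (at lookup time or elsewhere) is not stated in this note.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// day is a convenience constant for the default 24h TTL.
const day = 24 * time.Hour

// cosine returns the cosine similarity of two equal-length vectors, or 0
// when either vector is empty, zero, or the lengths differ.
func cosine(a, b []float64) float64 {
	if len(a) != len(b) || len(a) == 0 {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// cacheEligible: only non-generate requests with a non-empty query may read
// from the cache.
func cacheEligible(query string, generate bool) bool {
	return !generate && query != ""
}

// cacheHit applies the similarity threshold and the TTL; ttl == 0 disables
// expiry.
func cacheHit(score, threshold float64, age, ttl time.Duration) bool {
	if score < threshold {
		return false
	}
	return ttl == 0 || age <= ttl
}

func main() {
	score := cosine([]float64{0.6, 0.8}, []float64{0.8, 0.6})
	fmt.Printf("score=%.2f eligible=%v hit=%v\n",
		score, cacheEligible("how do we calculate UCB?", false), cacheHit(score, 0.90, time.Hour, day))
}
```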
Query Transformation (HyDE)
Given a query like “How do I install Python?”, HyDE asks an LLM to write a hypothetical document answering it, embeds that document, and searches with the embedding — closer to the target than the raw query. Three variants exist (see §HyDE Variants). Ref: Gao et al., ACL 2023.
Hybrid Search
Hybrid search fuses a dense leg and a sparse (BM25) leg. The client-side path fetches topK × prefetch_mul (default ×5) candidates per leg, rescores the sparse leg with the configured sparse scorer, then fuses via RRF (k=60). The server-side path (see Embedding & Storage) does the same in one Qdrant round-trip but is not currently wired.
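Reciprocal rank fusion over the two legs can be sketched as below. Assumptions: ranks are 1-based, each leg is already sorted best-first, and ties are broken by ID for determinism; `fuseRRF` and `fused` are hypothetical names. Only k=60 comes from the note.

```go
package main

import (
	"fmt"
	"sort"
)

// rrfK is the RRF constant from the note.
const rrfK = 60

// fused is one document with its combined RRF score.
type fused struct {
	ID    string
	Score float64
}

// fuseRRF sums 1/(k+rank) over every leg in which a document appears and
// returns documents sorted by that sum, highest first.
func fuseRRF(dense, sparse []string) []fused {
	scores := map[string]float64{}
	for _, leg := range [][]string{dense, sparse} {
		for i, id := range leg {
			scores[id] += 1.0 / float64(rrfK+i+1) // i is 0-based, rank is i+1
		}
	}
	out := make([]fused, 0, len(scores))
	for id, s := range scores {
		out = append(out, fused{ID: id, Score: s})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Score != out[j].Score {
			return out[i].Score > out[j].Score
		}
		return out[i].ID < out[j].ID
	})
	return out
}

func main() {
	dense := []string{"a", "b", "c"}
	sparse := []string{"b", "d"}
	for _, f := range fuseRRF(dense, sparse) {
		fmt.Printf("%s %.5f\n", f.ID, f.Score)
	}
}
```

A document that appears in both legs (here "b") outranks one that tops a single leg, which is the behaviour that makes RRF useful without score normalization.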
Multi-fragment queries (non-HyDE path). When HyDE is off, SearchService.multiSearch splits the query on [.?;]+\s*, runs HybridSearch per fragment, dedups by chunk key (keeping the best score), and re-sorts. With HyDE on, the averaged hypothetical-doc vector is searched in a single HybridSearch call (no fragment splitting). The topK passed into the store is already topK × candidate_mul when a reranker is wired, so total candidates per leg scale as topK × candidate_mul × prefetch_mul. With the defaults (candidate_mul 2, prefetch_mul 5), topK = 5 means 50 candidates per leg.
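A sketch of the fragment split and best-score merge. The regular expression is the one quoted above; everything else is an assumption: `splitFragments`, `mergeBest`, and `candidatesPerLeg` are hypothetical names, and dropping empty fragments is a guess at how trailing punctuation is handled.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// fragmentSep is the split pattern quoted in the note.
var fragmentSep = regexp.MustCompile(`[.?;]+\s*`)

// splitFragments splits a query into non-empty, trimmed fragments.
func splitFragments(query string) []string {
	var out []string
	for _, f := range fragmentSep.Split(query, -1) {
		if f = strings.TrimSpace(f); f != "" {
			out = append(out, f)
		}
	}
	return out
}

// hit is one search result identified by its chunk key.
type hit struct {
	Key   string
	Score float64
}

// mergeBest keeps the best score per chunk key across all fragment result
// lists and returns the hits sorted by score, highest first.
func mergeBest(perFragment [][]hit) []hit {
	best := map[string]float64{}
	for _, hits := range perFragment {
		for _, h := range hits {
			if s, ok := best[h.Key]; !ok || h.Score > s {
				best[h.Key] = h.Score
			}
		}
	}
	out := make([]hit, 0, len(best))
	for k, s := range best {
		out = append(out, hit{Key: k, Score: s})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Score != out[j].Score {
			return out[i].Score > out[j].Score
		}
		return out[i].Key < out[j].Key
	})
	return out
}

// candidatesPerLeg is the scaling rule from the paragraph above.
func candidatesPerLeg(topK, candidateMul, prefetchMul int) int {
	return topK * candidateMul * prefetchMul
}

func main() {
	fmt.Println(splitFragments("What is UCB? How does MCTS use it; thanks."))
	fmt.Println(candidatesPerLeg(5, 2, 5))
}
```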
Payload Filtering
HybridSearch and KeywordSearch accept a *SearchFilter whose non-empty fields are ANDed:
- file_path: restrict to a specific file
- header: restrict to a specific section
- source_sha: restrict to a specific document version
Standalone dense Search(ctx, vector, topK) takes no filter — only the hybrid and keyword paths pre-filter. (The dense leg inside hybrid does apply the filter.)
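The AND semantics of the filter can be illustrated with a plain predicate. In nadir the filtering happens inside Qdrant, so this is only a model of the behaviour: the field names and the `matches` method are hypothetical, and a nil filter is assumed to match everything.

```go
package main

import "fmt"

// SearchFilter holds optional constraints; empty fields are ignored.
type SearchFilter struct {
	FilePath  string
	Header    string
	SourceSHA string
}

// chunkMeta is the subset of payload fields a filter can constrain.
type chunkMeta struct {
	FilePath  string
	Header    string
	SourceSHA string
}

// matches reports whether c satisfies every non-empty field of f (logical
// AND). A nil filter matches everything.
func (f *SearchFilter) matches(c chunkMeta) bool {
	if f == nil {
		return true
	}
	if f.FilePath != "" && f.FilePath != c.FilePath {
		return false
	}
	if f.Header != "" && f.Header != c.Header {
		return false
	}
	if f.SourceSHA != "" && f.SourceSHA != c.SourceSHA {
		return false
	}
	return true
}

func main() {
	c := chunkMeta{FilePath: "a.md", Header: "Intro", SourceSHA: "abc"}
	fmt.Println((&SearchFilter{FilePath: "a.md"}).matches(c))
	fmt.Println((&SearchFilter{FilePath: "a.md", Header: "Other"}).matches(c))
}
```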
Reranking
A cross-encoder re-ranks candidates from vector search using the chunk’s Window Text.
- Oversampling: the retrieval stage fetches topK × candidate_mul candidates (default candidate_mul: 2; code fallback 3) so the reranker has more candidates to choose from. The store-level hybrid prefetch is separate: ×5 per leg.
- Contextual scoring: the Window Text (chunk plus surrounding context) is passed to the cross-encoder.
- Final sorting: candidates are re-scored by the cross-encoder and sorted by that score, so the best matches reach the LLM first.
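The three steps above can be sketched as follows. Assumptions: the cross-encoder is abstracted as a `scorer` function (the real one is an HTTP sidecar), the fallback of 3 is assumed to apply when candidate_mul is unset or non-positive, and falling back from Window Text to Text when the window is empty is a guess. All names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// candidate is one retrieved chunk awaiting reranking.
type candidate struct {
	Text       string
	WindowText string
	Score      float64
}

// scorer stands in for the cross-encoder: higher means more relevant.
type scorer func(query, passage string) float64

// fetchCount is the oversampled candidate count handed to retrieval.
func fetchCount(topK, candidateMul int) int {
	if candidateMul <= 0 {
		candidateMul = 3 // assumed trigger for the code fallback
	}
	return topK * candidateMul
}

// rerank scores each candidate on its window text (or text when the window
// is empty), sorts by that score, and keeps the top topK.
func rerank(query string, cands []candidate, topK int, score scorer) []candidate {
	out := make([]candidate, len(cands))
	copy(out, cands)
	for i := range out {
		passage := out[i].WindowText
		if passage == "" {
			passage = out[i].Text
		}
		out[i].Score = score(query, passage)
	}
	sort.SliceStable(out, func(i, j int) bool { return out[i].Score > out[j].Score })
	if topK < len(out) {
		out = out[:topK]
	}
	return out
}

func main() {
	// Toy scorer: longer passages score higher. A real scorer calls the model.
	toy := func(_, passage string) float64 { return float64(len(passage)) }
	cands := []candidate{{Text: "short"}, {Text: "s", WindowText: "much longer window"}}
	fmt.Println(fetchCount(5, 2), rerank("q", cands, 1, toy)[0].Text)
}
```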
Post-Retrieval Filtering
After reranking, an optional LLM chunk filter drops irrelevant results before generation. Ref: arxiv 2410.19572 (+10pp PopQA accuracy).
- Calls the OpenAI-compatible /v1/chat/completions endpoint (constructed as ollama_addr + "/v1", i.e. Ollama’s compatibility shim, distinct from the generator, which uses native /api/chat).
- Batches all retrieved chunks (Window Text, falling back to chunk Text) into one prompt; the model returns a JSON array of scores 0–1, one per passage.
- Drops chunks below the configurable threshold (default 0.5); order of survivors is preserved.
- Never returns zero chunks. SearchService.postProcess only swaps in the filtered list when err == nil && len(filtered) > 0, so an LLM error, a malformed or score-count-mismatch response, or an all-dropped result all fall through to the original (reranked) chunks.
chunk_filter:
  enabled: false
  model: "gemma3:1b"
  threshold: 0.5
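The fall-through behaviour of the chunk filter is easiest to see in code. This sketch takes the model's raw response as a string instead of calling an LLM; `parseScores`, `filterChunks`, and this simplified `postProcess` signature are hypothetical, while the `err == nil && len(filtered) > 0` guard is the condition quoted above.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// parseScores decodes the model response, expected to be a JSON array of
// numbers, one per passage.
func parseScores(raw string, want int) ([]float64, error) {
	var scores []float64
	if err := json.Unmarshal([]byte(raw), &scores); err != nil {
		return nil, err
	}
	if len(scores) != want {
		return nil, errors.New("score count does not match chunk count")
	}
	return scores, nil
}

// filterChunks keeps chunks scoring at or above threshold, in input order.
func filterChunks(chunks []string, raw string, threshold float64) ([]string, error) {
	scores, err := parseScores(raw, len(chunks))
	if err != nil {
		return nil, err
	}
	var kept []string
	for i, c := range chunks {
		if scores[i] >= threshold {
			kept = append(kept, c)
		}
	}
	return kept, nil
}

// postProcess swaps in the filtered list only when filtering succeeded and
// left at least one chunk; otherwise the original chunks pass through.
func postProcess(chunks []string, raw string, threshold float64) []string {
	filtered, err := filterChunks(chunks, raw, threshold)
	if err == nil && len(filtered) > 0 {
		return filtered
	}
	return chunks
}

func main() {
	chunks := []string{"a", "b", "c"}
	fmt.Println(postProcess(chunks, "[0.9, 0.2, 0.7]", 0.5))
	fmt.Println(postProcess(chunks, "not json", 0.5))
}
```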
Generator
OllamaGenerator streams an answer grounded in retrieved chunks via Ollama /api/chat.
Prompt construction:
- Lost in the Middle ordering (Liu et al. 2023): highest-scored chunk at position [1], lowest in the middle, second-highest at the end. Reduces LLM degradation on long context.
- Token budget: chunks truncated at roughly 1 token ≈ 4 chars; default max_context_tokens: 2800 (~70% of a 4k context window).
- Citation: the prompt instructs the model to cite inline as [1], [2], etc. This instruction lives in a single user-role message; no separate system message is sent.
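The ordering and budget rules above can be sketched as two small functions. Assumptions: the input to `lostInMiddleOrder` is already sorted best-first, the budget is applied to the running total with the last surviving chunk cut to fit, and characters are counted as runes. Both function names are hypothetical; the note does not say how nadir distributes the truncation.

```go
package main

import "fmt"

// charsPerToken is the rough estimate from the note.
const charsPerToken = 4

// lostInMiddleOrder places ranked chunks alternately at the front and the
// back, so the best sit at the edges and the weakest end up in the middle.
func lostInMiddleOrder(ranked []string) []string {
	out := make([]string, len(ranked))
	lo, hi := 0, len(ranked)-1
	for i, c := range ranked {
		if i%2 == 0 {
			out[lo] = c
			lo++
		} else {
			out[hi] = c
			hi--
		}
	}
	return out
}

// fitBudget keeps chunks in order until maxContextTokens*charsPerToken
// characters are used, truncating the last chunk that still fits partially.
func fitBudget(chunks []string, maxContextTokens int) []string {
	budget := maxContextTokens * charsPerToken
	var out []string
	for _, c := range chunks {
		if budget <= 0 {
			break
		}
		r := []rune(c)
		if len(r) > budget {
			r = r[:budget]
		}
		out = append(out, string(r))
		budget -= len(r)
	}
	return out
}

func main() {
	ranked := []string{"best", "second", "third", "fourth", "worst"}
	fmt.Println(lostInMiddleOrder(ranked))
	fmt.Println(2800 * charsPerToken) // character budget at the default setting
}
```

At the default max_context_tokens of 2800, the character budget works out to 11,200.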
Usage: POST /search with "generate": true. Response is text/plain with chunked transfer encoding (streaming).
generator:
  enabled: true
  model: "gemma3:1b"
  max_context_tokens: 2800
HyDE Variants
Three variants, all off by default (hyde.enabled: false):
Standard HyDE — generates N hypothetical documents in parallel, averages their L2-normalized embeddings, runs hybrid search with the averaged vector.
Adaptive HyDE — runs vanilla hybrid search first; fires HyDE only when top-1 cosine score < threshold (default 0.50). Skips LLM cost when dense retrieval is already confident. Ref: arxiv 2507.16754.
Multi-HyDE — cycles through 5 diverse prompt templates (factual passage, key facts, expert explanation, contextual definition, example-driven) round-robin per document. Maximizes embedding diversity. Ref: arxiv 2509.16369. Use with num_docs >= 3.
hyde:
  enabled: false
  adaptive: true
  adaptive_thresh: 0.50
  multi_hyde: false
  model: "gemma3:1b"
  num_docs: 1
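The vector math and gating behind the three variants can be sketched as below. The names `l2Normalize` and `averageVectors` appear in the test list later in this note, but their signatures here are assumptions, as are `shouldFireHyDE` and `templateFor`. Whether nadir re-normalizes the averaged vector is not stated, so this sketch does not.

```go
package main

import (
	"fmt"
	"math"
)

// templates lists the five Multi-HyDE prompt styles named in the note.
var templates = []string{
	"factual passage", "key facts", "expert explanation",
	"contextual definition", "example-driven",
}

// l2Normalize returns v scaled to unit length (zero vectors are returned
// as a zero vector of the same length).
func l2Normalize(v []float64) []float64 {
	var sum float64
	for _, x := range v {
		sum += x * x
	}
	out := make([]float64, len(v))
	if sum == 0 {
		return out
	}
	norm := math.Sqrt(sum)
	for i, x := range v {
		out[i] = x / norm
	}
	return out
}

// averageVectors L2-normalizes each vector, then averages them
// element-wise. All vectors are assumed to share one dimension.
func averageVectors(vs [][]float64) []float64 {
	if len(vs) == 0 {
		return nil
	}
	avg := make([]float64, len(vs[0]))
	for _, v := range vs {
		for i, x := range l2Normalize(v) {
			avg[i] += x
		}
	}
	for i := range avg {
		avg[i] /= float64(len(vs))
	}
	return avg
}

// shouldFireHyDE is the adaptive gate: run HyDE only when the plain search
// looks unconfident.
func shouldFireHyDE(top1Score, threshold float64) bool {
	return top1Score < threshold
}

// templateFor picks a prompt template round-robin by document index.
func templateFor(docIndex int) string {
	return templates[docIndex%len(templates)]
}

func main() {
	fmt.Println(averageVectors([][]float64{{2, 0}, {0, 5}}))
	fmt.Println(shouldFireHyDE(0.42, 0.50), templateFor(6))
}
```

Normalizing before averaging keeps one long hypothetical document from dominating the mean.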
Evaluate
Status: implemented (internal/eval/ + cmd/eval/).
Two eval modes, both driven by a golden set (eval/golden.yaml) and run through the cmd/eval CLI:
Retrieval eval (-mode retrieval) — eval.Runner runs each golden query through a SearchService (rebuilt with the same HyDE/reranker/chunk-filter wiring as the server, minus semantic cache and generator) and eval.Aggregate scores the ranked list:
| Metric | Notes |
|---|---|
| Recall@5, Recall@10 | unique-relevant / total-relevant (dedup at chunk granularity) |
| Precision@5 | hits / k (denominator is k) |
| MRR (ReciprocalRank) | 1/rank of first relevant |
| Success@5 | 1 if any relevant in top-5 |
| MAP (AveragePrecision) | area under P-R curve |
| NDCG@10 | linear-gain (Järvelin & Kekäläinen 2002) |
| NDCG@10 (exp) | exponential-gain (2^rel − 1)/log2(i+1) — BEIR-style |
Graded relevance: relevance: {file: grade} with 0=irrelevant, 1=marginal, 2=relevant, 3=highly; expected_files is the binary special case (grade=1). Path matching is suffix-based (MatchFile), so math/trigonometry.md matches gitbook/math/trigonometry.md. Bootstrap 95% CIs are printed for Recall@5, Recall@10, NDCG@10, and MAP (1000 resamples, fixed seed). -granularity chunk scores at passage level (paper-comparable); default is file (deduped).
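Three of the metrics in the table can be sketched directly from their definitions. This is an illustration, not the code in internal/eval: function names are hypothetical, the ranked list is assumed to be deduplicated already, and grades follow the 0 to 3 scale above.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// recallAtK is unique relevant items in the top k over all relevant items.
func recallAtK(ranked []string, grades map[string]int, k int) float64 {
	total := 0
	for _, g := range grades {
		if g > 0 {
			total++
		}
	}
	if total == 0 {
		return 0
	}
	found := map[string]bool{}
	for i, id := range ranked {
		if i >= k {
			break
		}
		if grades[id] > 0 {
			found[id] = true
		}
	}
	return float64(len(found)) / float64(total)
}

// reciprocalRank is 1/rank of the first relevant item, or 0 if none.
func reciprocalRank(ranked []string, grades map[string]int) float64 {
	for i, id := range ranked {
		if grades[id] > 0 {
			return 1 / float64(i+1)
		}
	}
	return 0
}

// ndcgAtK computes DCG@k over IDCG@k with linear gain (rel) or exponential
// gain (2^rel - 1). Position i is 0-based, so the discount is log2(i+2).
func ndcgAtK(ranked []string, grades map[string]int, k int, exponential bool) float64 {
	gain := func(g int) float64 {
		if exponential {
			return math.Pow(2, float64(g)) - 1
		}
		return float64(g)
	}
	dcg := 0.0
	for i, id := range ranked {
		if i >= k {
			break
		}
		dcg += gain(grades[id]) / math.Log2(float64(i+2))
	}
	ideal := make([]int, 0, len(grades))
	for _, g := range grades {
		ideal = append(ideal, g)
	}
	sort.Sort(sort.Reverse(sort.IntSlice(ideal)))
	idcg := 0.0
	for i, g := range ideal {
		if i >= k {
			break
		}
		idcg += gain(g) / math.Log2(float64(i+2))
	}
	if idcg == 0 {
		return 0
	}
	return dcg / idcg
}

func main() {
	grades := map[string]int{"b.md": 3, "c.md": 1, "d.md": 2}
	ranked := []string{"a.md", "b.md", "c.md"}
	fmt.Printf("recall@2=%.3f rr=%.3f ndcg@3=%.3f ndcg@3(exp)=%.3f\n",
		recallAtK(ranked, grades, 2), reciprocalRank(ranked, grades),
		ndcgAtK(ranked, grades, 3, false), ndcgAtK(ranked, grades, 3, true))
}
```

Exponential gain widens the gap between grade 3 and grade 1, so it penalizes a ranking that buries highly relevant items more than linear gain does.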
RAG eval (-mode rag) — eval.RAGASEvaluator runs the full RAG loop per query (retrieve → GeneratorAdapter generates → OllamaJudge scores) and reports four RAGAS metrics (Es et al. 2023, arxiv 2309.15217):
| Metric | Method |
|---|---|
| Faithfulness | decompose answer → statements; verify each against context; ratio supported |
| Answer Relevance | LLM rates answer-vs-query 0–1 |
| Context Precision | LLM rates each chunk 0–1, weighted by 1/log2(k+1) |
| Context Recall | decompose expected_answer → statements; check attributability to context (requires expected_answer; otherwise N/A) |
The judge calls Ollama’s OpenAI-compatible /v1/chat/completions; the judge model defaults to generator.model (override with -judge-model). -mode both runs retrieval then RAGAS in one pass.
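Once the judge has produced per-statement verdicts and per-chunk scores, two of the metrics are simple aggregations. This sketch covers only that arithmetic: the function names are hypothetical, and normalizing the weighted context precision by the sum of weights is an assumption, since the table gives only the weight 1/log2(k+1).

```go
package main

import (
	"fmt"
	"math"
)

// faithfulness is the share of answer statements the judge marked as
// supported by the retrieved context.
func faithfulness(supported []bool) float64 {
	if len(supported) == 0 {
		return 0
	}
	n := 0
	for _, ok := range supported {
		if ok {
			n++
		}
	}
	return float64(n) / float64(len(supported))
}

// contextPrecision is a rank-weighted mean of per-chunk relevance scores.
// Rank k is 1-based, so position i uses the weight 1/log2(i+2).
func contextPrecision(scores []float64) float64 {
	var num, den float64
	for i, s := range scores {
		w := 1 / math.Log2(float64(i+2))
		num += s * w
		den += w
	}
	if den == 0 {
		return 0
	}
	return num / den
}

func main() {
	fmt.Printf("faithfulness=%.2f\n", faithfulness([]bool{true, true, false, true}))
	fmt.Printf("precision=%.3f\n", contextPrecision([]float64{1.0, 0.0, 0.5}))
}
```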
CLI: go run ./cmd/eval -golden eval/golden.yaml -fetch-k 10 -mode retrieval (Make targets: eval, eval-rag, eval-both, eval-chunk). Flags: -config (default config/config.yaml), -golden, -fetch-k (default 10), -mode (retrieval|rag|both), -granularity (file|chunk), -judge-model. A warning is printed when n < 50 (the golden set ships with 5 queries — directional only; BEIR min ~1k).
Tests: internal/eval/{metrics,ragas,runner}_test.go — metric math, RAGAS scoring with a stub judge, and runner aggregation. No integration tests against live Qdrant/Ollama.
The EVAL_LLM_BASE_URL / EVAL_LLM_MODEL / EVAL_HISTORY_PATH entries in .env.example are not read by anything: cmd/eval takes CLI flags and reads config.yaml (judge/generator borrow generator.ollama_addr + generator.model). They are aspirational/dead.
For the conceptual basis — perplexity, BLEU/ROUGE/METEOR/BERTScore, LLM-as-judge, benchmarks — see ai/evaluation.md.
Test
Status: minimal (unit tests only, no Docker/Qdrant required).
- make test: go test -short -count=1 ./...
- make test-all: go test -count=1 ./... (no testcontainers dependency exists; AGENTS.md’s “requires Qdrant via testcontainers” line is stale.)
Five test files cover five units:
- internal/pkb/file_lister_local_test.go: glob ignore-pattern matching.
- internal/pkb/hyde_test.go: HyDE vector ops (averageVectors, l2Normalize).
- internal/eval/metrics_test.go: retrieval metric math (Recall/Precision/MRR/MAP/NDCG/Bootstrap CI).
- internal/eval/ragas_test.go: RAGAS scoring with a stub LLMJudge.
- internal/eval/runner_test.go: runner aggregation + granularity.
No chunker tests, no integration tests against live Qdrant/Ollama, no k6 load scripts (the k6 repo topic is aspirational).
Monitor
Status: scaffolded, broken on main.
- docker-compose.yml defines prometheus and node-exporter but not grafana or k6.
- scripts/dev-local.sh runs docker compose up ... grafana, referencing a service the compose file does not define, so make dev errors on that line.
- The compose file mounts ./config/prometheus.yml and ./config/recording_rules.yml, but neither file exists in config/ (only config.go and config.yaml); Prometheus would fail to start as committed.
- The Go app exposes no /metrics endpoint, only POST /search, POST /ingest, GET /healthz. OpenTelemetry metric SDK packages are indirect dependencies (via the logger) and unused for app metrics.
Known Gaps & Drift
- Server-side hybrid search is not wired. server.go calls store.WithSparseScorer(...) (client-side SPLADE rescore) but never store.WithSparseEmbedder(...). Sparse vectors are not stored at ingest, so hybridSearchServer (the QueryPoints + native RRF path) is unreachable in the current build. Only client-side hybrid is active. (cmd/eval’s buildSearcher mirrors this, with the same gap.)
- Modified files leave orphan chunks. IngestService.Run upserts changed files by deterministic ID but never calls DeleteByFile (the method exists on the store and Pipeline.Delete wraps it, but neither is invoked on the change path). Because chunkID is derived from filePath:lineStart:chunkIndex, an edit that shifts a chunk’s line_start produces a new point while the old point remains, so stale chunks are not garbage-collected.
- nadir’s own agent docs are stale. AGENTS.md and CLAUDE.md claim chunk IDs are “FNV hash of filePath:lineStart” (they are UUIDv5 over filePath:lineStart:chunkIndex); CLAUDE.md even lists “fix: UUIDv5” as a TODO for something already done. CLAUDE.md says prefetch is topK*3 (it is ×5). AGENTS.md says make test-all “requires Qdrant via testcontainers”, but there is no testcontainers dependency; make test-all is plain go test -count=1 ./....
- Monitor infra is half-built. See §Monitor: missing Prometheus config files, undefined Grafana service, no app metrics endpoint.
- embedder.api_key / EMBEDDER_API_KEY is a dead config field. It is parsed into config.Embedder.APIKey and surfaced in .env.example, but OllamaEmbedder is constructed with only (addr, model, dimensions) and sends no Authorization header, so the key is never used. (The RAGAS judge and chunk filter accept an API key, but both are constructed with "".)
- Dead env entries. EVAL_* and GRAFANA_* in .env.example are read by nothing; see §Evaluate and the Config Reference footnote.
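The orphan-chunk gap listed above can be reproduced on paper by comparing the ID inputs before and after an edit. This sketch works on the filePath:lineStart:chunkIndex key strings rather than on real point IDs; `chunkKey` and `orphans` are hypothetical helpers, not functions in nadir.

```go
package main

import (
	"fmt"
	"sort"
)

// chunkKey is the string the deterministic ID is derived from.
func chunkKey(filePath string, lineStart, chunkIndex int) string {
	return fmt.Sprintf("%s:%d:%d", filePath, lineStart, chunkIndex)
}

// orphans returns the keys present before an edit but absent after it.
// Upserting only the "after" set leaves these points behind in the store.
func orphans(before, after []string) []string {
	live := map[string]bool{}
	for _, k := range after {
		live[k] = true
	}
	var out []string
	for _, k := range before {
		if !live[k] {
			out = append(out, k)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	before := []string{chunkKey("a.md", 1, 0), chunkKey("a.md", 40, 1)}
	// Three lines inserted above the second chunk shift its line_start.
	after := []string{chunkKey("a.md", 1, 0), chunkKey("a.md", 43, 1)}
	fmt.Println(orphans(before, after))
}
```

The second chunk's text may be unchanged, yet its key differs, so the old point is never overwritten.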
References
- Gao et al., “Precise Zero-Shot Dense Retrieval without Relevance Labels,” ACL 2023: HyDE.
- arxiv 2507.16754: Adaptive HyDE.
- arxiv 2509.16369: Multi-HyDE.
- arxiv 2410.19572: LLM chunk filter (+10pp PopQA).
- Liu et al., 2023: “Lost in the Middle: How Language Models Use Long Contexts.”
- Anthropic, 2024: contextual embedding (filePath > header prefix).
- Järvelin & Kekäläinen, 2002: NDCG (ACM TOIS 20(4):422–446).
- Manning, Raghavan & Schütze, 2008: Introduction to Information Retrieval (MAP).
- Es et al., 2023: RAGAS (arxiv 2309.15217).
- Thakur et al., 2021: BEIR (NeurIPS, arxiv 2104.08663).
- Formal et al., 2021: SPLADE (arxiv 2107.05720).