This task is part of our internal bootcamp.

We do bootcamps in our team to keep up with the fast changing environment in software development and AI.

A topic is taken over by a team member who has not yet worked with the technology in a client project. This spreads knowledge and uncovers new perspectives for those who have already implemented these technologies for our clients. With our bootcamp exercises, we try to focus on the simplest approach to a topic in order to learn the fundamentals of the technology.

Only requirement: write your own code, and if tooling is needed, use open source.

TL;DR

So here we go…

RAG pairs a fast vector‑based retrieval layer with a generative LLM, letting you ask natural‑language questions and receive answers.

We used our company principles as a data set.

By chunking each principle, creating a canonical question + paraphrases, embedding them and storing the vectors, the system can deliver "accurate" :), principle-aligned responses. This pipeline embodies our (DDU's) company principles as an easy example for rapid prototyping in one of our (AI) bootcamps.

Stack:

  • Ruby on Rails (API)
  • MySQL + MyVector Plugin
  • all-MiniLM (embeddings)
  • Gemma 3 (LLM)

Our goal. Build a small, standalone MVP using only open-source tooling: no external vector service, no UI, just a clean JSON API to demonstrate end-to-end RAG.

What we built. An API that accepts JSON principles, chunks text, generates a canonical question + 5 paraphrases per chunk, computes one weighted embedding per segment, stores it in MySQL (MyVector), runs kNN and composes a cited answer with Gemma 3.

What we evaluated. How ef_search, paraphrase style (questions vs declaratives vs mixed) and component weights affect hit rank and latency.

Visit the GitHub page

What Is a RAG System?

Retrieval‑Augmented Generation (RAG) couples two stages:

  • Retrieval: pulls relevant passages from a knowledge base (documents, policies, logs or similar)
  • Generation: a language model (LLM) writes a response using the retrieved context

It can help where keyword search breaks down on synonyms, acronyms or badly structured texts.

Why use RAG?

  • improves factual accuracy
  • enables domain-specific assistants
  • offers real-time Q&A over your data
  • cuts hallucinations by grounding output in actual content

Data & pipeline (end-to-end)

  1. Input → JSON array of [{ title, description }, …]
  2. Chunking → sentence-aware split with a cap of ~1,000 chars → segments
  3. Augmentation → per segment: canonical_q + 5 paraphrases (fixed to five for stable coverage)
  4. Embeddings → one vector per segment via a weighted mean of L2-normalized parts: title = 0.10, canonical_q = 0.35, answer = 0.45, paraphrases (total) = 0.10 → each of the 5 paraphrases = 0.02. Each component is L2-normalized; weighted mean-pooling then yields the final normalized segment vector
  5. Storage → segments.embedding VECTOR (cosine) in MySQL with MyVector
  6. Retrieval → embed the user query (all-MiniLM) → kNN using an ANN hint; select the top-k segments
  7. Answering → Gemma 3 composes a concise answer only from the retrieved passages, with citations
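The weighted pooling step can be sketched in plain Ruby. Vectors here are bare arrays (in the real pipeline they come from all-MiniLM, 384 dims), and the method names are illustrative, not from our codebase:

```ruby
# L2-normalize a vector; zero vectors pass through unchanged.
def l2_normalize(vec)
  norm = Math.sqrt(vec.sum { |x| x * x })
  return vec if norm.zero?
  vec.map { |x| x / norm }
end

# parts: array of [vector, weight] pairs, e.g.
# [[title_vec, 0.10], [canonical_q_vec, 0.35], [answer_vec, 0.45],
#  *paraphrase_vecs.map { |v| [v, 0.02] }]
def weighted_segment_vector(parts)
  dims = parts.first.first.size
  pooled = Array.new(dims, 0.0)
  parts.each do |vec, weight|
    l2_normalize(vec).each_with_index { |x, i| pooled[i] += weight * x }
  end
  l2_normalize(pooled) # final segment vector is unit-length again
end
```

Normalizing each component first keeps long answers from dominating short titles; the final normalization makes cosine distance in MyVector behave consistently.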

JSON

Request: GET /questions/answer?q=How many hours do we work?

Response:

{
  "query": "How many hours do we work?",
  "result": "We work 8 hours a day, no more. Your time outside work is your own. We value sustainable productivity over burnout.",
  "segment_ids": [2, 9, 3, 1, 4]
}

Here segment_ids are the top-k from kNN (ranked best → worst).
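For illustration, the cosine ordering behind that id list can be reproduced in memory. The real kNN runs inside MySQL/MyVector; top_k_segment_ids and the segment hashes below are made-up names for this sketch:

```ruby
# Cosine similarity between two vectors of equal length.
def cosine(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  dot / (Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x }))
end

# Rank segments by similarity to the query vector, best first,
# and return their ids (the shape of segment_ids in the response).
def top_k_segment_ids(query_vec, segments, k: 5)
  segments
    .sort_by { |seg| -cosine(query_vec, seg[:embedding]) }
    .first(k)
    .map { |seg| seg[:id] }
end
```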

Small dataset & indexing footprint

  • ~120 principles → after chunking ~260 segments (~700–1,000 chars per segment)
  • Indexing workload → 8 embeddings/segment (title + canonical_q + answer + 5 paraphrases)
  • 384-dim vectors occupy roughly 3–5 MB for this size (plus metadata)
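A quick back-of-the-envelope check of that footprint, assuming float32 (4 bytes) per dimension: the 3–5 MB figure is plausible if all 8 component embeddings per segment are counted, while the single pooled vector per segment alone stays well under 1 MB.

```ruby
SEGMENTS = 260
DIMS     = 384
BYTES    = 4   # float32
PER_SEG  = 8   # title + canonical_q + answer + 5 paraphrases

pooled_mb    = SEGMENTS * DIMS * BYTES / (1024.0 * 1024)
component_mb = SEGMENTS * PER_SEG * DIMS * BYTES / (1024.0 * 1024)

puts format("pooled: %.2f MB, all components: %.1f MB", pooled_mb, component_mb)
# → pooled: 0.38 MB, all components: 3.0 MB
```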

Queries → expected hits

  • “How many hours do we work?” → “We work 8 hours a day, no more.”
  • “Are we a remote-first company?” → the remote-first policy statement.
  • “Do we ship quickly or perfectly?” → “good enough / rapid product development…”

What the numbers mean

Values below are the rank of the first relevant segment among the top-k (lower is better). “1” = found at rank-1; “-” = not in top-k.

Query | Baseline (para=0.10, style=Q, ef≈96) | No paraphrases (para=0.00) | ef_search=64 | ef_search=128 | Declarative paraphrases (5) | Mixed (3Q+2D)
How many hours do we work? | 1 | 2 | 2 | 1 | 2 | 1
Are we a remote-first company? | 1 | 2 | 2 | 1 | 1 | 1
It is important to do the job quickly or perfectly? | 1 | 3 | 3 | 1 | 1 | 1

Takeaways

  • Paraphrasing boosts retrieval on ambiguous/verbose queries (third query 3 → 1)
  • Too low an ef_search can degrade rank (1 → 2); ~128 tends to stabilize rank-1
  • Declarative paraphrases perform about as well as question-style for policy statements, but slightly worse for clearly interrogative queries (“hours”)
  • Mixed (3Q+2D) is the most robust across these cases

Performance and cost

  • with all-MiniLM, CPU delivers low cost and solid latency; GPU becomes attractive at very high throughput (bulk re-indexing) or when adding heavier models/rerankers. In MySQL+MyVector, k and ef_search set the recall ↔ kNN-time trade-off
  • prompt-injection and poisoned snippets are mitigated by a strict context perimeter (“answer only from provided passages”) and input sanitation. ACL/tenant filters apply before ranking; search_logs (query, top-k, distances/scores, latency, config) support auditability and tuning
  • MySQL + MyVector keeps one familiar datastore and simple migrations, but offers fewer vector-native features than Qdrant/Milvus/Weaviate. all-MiniLM is fast and inexpensive on CPU; when recall is fine but ordering within top-k lags, a lightweight reranker (e.g., bge-reranker) typically lifts nDCG/MRR

Problems to improve

Chunking & text quality

  • Problem: overly long segments blur the signal; overly short ones add noise and inflate token cost. Mitigation: sentence-aware split with a moderate cap (~1,000 chars) and small overlap.
  • Problem: duplicates and near-duplicates crowd top-k. Mitigation: dedupe by raw-text checksum and filter high-similarity chunks pre-index.
  • Problem: mixed languages and acronyms reduce recall. Mitigation: tag lang and generate paraphrases that expand domain terms.
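A minimal version of the sentence-aware chunker with cap and overlap could look like this. The regex is a naive sentence splitter for illustration only; production code should use a proper sentence tokenizer:

```ruby
# Split text on sentence boundaries, pack sentences into chunks of at
# most `cap` characters, and carry the last sentence of each chunk
# over into the next one as a small overlap.
def chunk_text(text, cap: 1_000)
  sentences = text.split(/(?<=[.!?])\s+/)
  chunks = []
  current = ""
  sentences.each do |sentence|
    if current.empty?
      current = sentence
    elsif current.length + sentence.length + 1 <= cap
      current += " " + sentence
    else
      chunks << current
      # overlap: start the next chunk with the previous sentence
      overlap = current.split(/(?<=[.!?])\s+/).last
      current = "#{overlap} #{sentence}"
    end
  end
  chunks << current unless current.empty?
  chunks
end
```

The one-sentence overlap keeps context that straddles a boundary retrievable from both neighboring segments.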

Embeddings & weights

  • Problem: imbalanced weights (canonical_q vs answer) hurt either interrogative or declarative queries. Mitigation: light grid-tuning; the balance 0.10 / 0.35 / 0.45 / 0.10 performed best on our mixed set.
  • Problem: one paraphrase style doesn’t cover user phrasing diversity. Mitigation: 3Q+2D (3 questions + 2 declaratives) proved most robust.

Retrieval & ANN

  • Problem: low ef_search → misses (rank drifts 1→2). Mitigation: an operating band around 96–128 improved recall at moderate latency.
  • Problem: “clumping” of very similar chunks. Mitigation: aggregate into one weighted vector per segment with moderate paraphrase weight.

Answer generation

  • Problem: context leakage (hallucinations). Mitigation: grounded-only prompt with mandatory citations; when evidence is weak, return passages without conclusions.
  • Problem: over-normalization breaks acronyms/lists. Mitigation: gentle unicode/whitespace normalization while preserving meaningful symbols.
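One way to implement that "gentle" normalization in Ruby (a sketch, not our production code): NFKC unicode folding plus whitespace collapsing, leaving acronyms, punctuation and list markers intact.

```ruby
# Fold unicode variants (e.g. non-breaking spaces, fullwidth forms)
# and collapse runs of spaces/tabs, but keep line breaks so lists and
# structure survive.
def gentle_normalize(text)
  text.unicode_normalize(:nfkc)
      .gsub(/[ \t]+/, " ")   # collapse horizontal whitespace
      .gsub(/\r\n?/, "\n")   # normalize line endings
      .strip
end
```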

Conclusion

We built a standalone RAG MVP using only open-source pieces: Rails as a clean JSON API, MySQL + MyVector for storage and ANN search, all-MiniLM for embeddings and Gemma 3 for question/paraphrase generation and cited answers. Content is chunked sentence-wise, each segment gets a canonical question + 5 paraphrases, and we compute one weighted, L2-normalized embedding per segment. Queries run through kNN with tunable ef_search, and answers are grounded with citations.

Along the way we learned: paraphrasing lifts ambiguous/verbose queries, the ef_search knob trades recall vs latency, a 3Q+2D paraphrase mix is robust across query styles and simple weighted aggregation keeps retrieval stable on CPU.

Resources