Retrieval Augmented Generation · USA

RAG Systems Engineer

I build retrieval-augmented generation pipelines that work in production - persistent memory, hybrid vector search, reranking, and context management that keeps answers accurate as knowledge bases grow.

M.S. CS @ ASU · Open to Work · Creator of LUNA · Open Source

LLM Backends

Pipeline Stages

∞ Context Sessions

Data to Cloud

The Real Challenges

Why Good RAG Is Harder Than It Looks

A basic RAG system is easy to build. Embed some documents, store vectors, retrieve the top-k at query time, stuff them in the context window, and generate. The problem is that basic RAG fails in predictable ways: it retrieves documents that are semantically similar to the query but not actually relevant, it loses coherence when the retrieved chunks are out of context, and it degrades as the knowledge base grows because nearest-neighbor search stops being discriminative at scale.
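For reference, the basic loop described above fits in a few lines. This is a minimal sketch, not production code: `embed` is a toy bag-of-words stand-in for a real embedding model, and the function names and prompt layout are illustrative assumptions.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank every document by similarity to the query and keep the best k."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, docs: list[str], k: int = 2) -> str:
    """Stuff the retrieved documents into the prompt ahead of the question."""
    context = "\n".join(retrieve_top_k(query, docs, k))
    return f"Context:\n{context}\n\nQuestion: {query}"
```

Nothing here checks whether a retrieved document is actually relevant, only that it is similar to the query, which is the first failure mode described above.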

Production RAG requires solving each of these failure modes explicitly. Hybrid retrieval - combining dense vector search with sparse BM25 keyword search - often outperforms either approach alone, especially for domain-specific terminology that embeddings do not always capture well. Cross-encoder reranking adds a second pass that scores each query-document pair jointly, catching cases where first-pass retrieval ranked results by surface similarity rather than relevance. Context window management ensures that retrieved chunks are assembled in an order the model can read coherently, not as a bag of relevant paragraphs.
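One common way to fuse a dense ranking with a BM25 ranking is reciprocal rank fusion (RRF). The sketch below assumes each retriever has already returned an ordered list of document IDs; RRF is an illustrative choice, not necessarily the fusion method used in any system described on this page, and `k = 60` is simply the conventional default constant.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so documents ranked well by both retrievers rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from a vector index and a BM25 index.
dense_hits = ["a", "b", "c"]
sparse_hits = ["c", "a", "d"]
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Because RRF works on ranks rather than raw scores, it avoids having to normalize cosine similarities against BM25 scores, which live on different scales.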

The most important metric for a RAG system is not retrieval recall in isolation - it is end-to-end answer quality on the actual questions users ask. Building evaluation pipelines that measure this directly, rather than proxying with embedding similarity scores, is what separates a RAG system that works in production from one that works in a demo.
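A minimal sketch of what measuring this directly can look like: run the whole pipeline on real user questions and check the answers, instead of checking similarity scores. The `answer_fn` callable, the case format, and the substring check are all assumptions for illustration; a substring match is a crude grader, and a real harness would more likely use human review or a model-based judge.

```python
from typing import Callable


def evaluate(answer_fn: Callable[[str], str], cases: list[dict]) -> float:
    """Return the fraction of cases whose answer contains every required fact.

    answer_fn is the full pipeline under test (retrieve, rerank, generate).
    Each case is {"question": str, "must_contain": list[str]}.
    """
    if not cases:
        return 0.0
    passed = 0
    for case in cases:
        answer = answer_fn(case["question"]).lower()
        if all(fact.lower() in answer for fact in case["must_contain"]):
            passed += 1
    return passed / len(cases)


def fake_pipeline(question: str) -> str:
    """Hypothetical stand-in for a real RAG pipeline."""
    canned = {"What is the refund window?": "Refunds are accepted within 30 days."}
    return canned.get(question, "I do not know.")


cases = [
    {"question": "What is the refund window?", "must_contain": ["30 days"]},
    {"question": "Who approves refunds?", "must_contain": ["finance team"]},
]
score = evaluate(fake_pipeline, cases)
```

The useful property is that the score moves when any stage regresses, whether that stage is chunking, retrieval, reranking, or the prompt itself.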

Pipeline Architecture

6-Stage RAG Pipeline

1. Ingestion: Document loading, format normalization, metadata extraction

2. Chunking: Semantic chunking with overlap, preserving context boundaries

3. Embedding: Dense vector generation, model selection for domain

4. Retrieval: Hybrid search - vector similarity + keyword BM25 fusion

5. Reranking: Cross-encoder reranking, MMR for diversity

6. Generation: Context window management, citation tracking, streaming
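The reranking stage lists MMR (maximal marginal relevance) for diversity. Here is a small sketch of the idea, assuming documents are already embedded as plain float lists; the `lambda_` trade-off parameter and the toy vectors are illustrative assumptions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr(query_vec: list[float], doc_vecs: dict[str, list[float]],
        k: int, lambda_: float = 0.5) -> list[str]:
    """Greedily pick k documents, trading relevance against redundancy.

    lambda_ = 1.0 ranks purely by relevance; lower values penalize
    documents that resemble ones already selected.
    """
    selected: list[str] = []
    candidates = set(doc_vecs)
    while candidates and len(selected) < k:
        def score(doc_id: str) -> float:
            relevance = cosine(query_vec, doc_vecs[doc_id])
            redundancy = max(
                (cosine(doc_vecs[doc_id], doc_vecs[s]) for s in selected),
                default=0.0,
            )
            return lambda_ * relevance - (1 - lambda_) * redundancy
        best = max(sorted(candidates), key=score)  # sorted() keeps ties stable
        selected.append(best)
        candidates.remove(best)
    return selected


# "a" and "b" are near-duplicates; "c" is less relevant but different.
vecs = {"a": [1.0, 0.1], "b": [1.0, 0.12], "c": [0.6, -0.8]}
diverse = mmr([1.0, 0.0], vecs, k=2, lambda_=0.5)
relevance_only = mmr([1.0, 0.0], vecs, k=2, lambda_=1.0)
```

With the diversity penalty on, the near-duplicate "b" loses its slot to "c"; with `lambda_ = 1.0` the selection falls back to plain relevance order.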

Production Example

LUNA Memory: Model-Agnostic Retrieval

LUNA's memory system is a practical example of production RAG design. The core requirement was simple: conversational context should persist across sessions and survive model swaps. Switching from a local Ollama model to Claude's API mid-project should not lose the context of everything discussed so far.

This required a retrieval layer that is completely decoupled from the inference layer. Memory is stored in a structured format that any model can read - not tied to any specific model's context window format or token counting method. At query time, relevant memories are retrieved based on the current conversation context, ranked by relevance and recency, and injected into the prompt before the LLM generates a response.
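A rough sketch of ranking by relevance and recency, with memories held in a plain structure that any model can consume as text. This is not LUNA's actual implementation: the `Memory` fields, the one-week half-life, and the 0.3 recency weight are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

WEEK_SECONDS = 7 * 24 * 3600


@dataclass
class Memory:
    text: str
    created_at: float  # Unix timestamp in seconds
    relevance: float   # similarity to the current context, in [0, 1]


def recency(memory: Memory, now: float, half_life_s: float = WEEK_SECONDS) -> float:
    """Exponential decay: 1.0 for a brand-new memory, 0.5 after one half-life."""
    age = max(0.0, now - memory.created_at)
    return 0.5 ** (age / half_life_s)


def rank_memories(memories: list[Memory], now: float,
                  recency_weight: float = 0.3) -> list[Memory]:
    """Order memories by a weighted blend of relevance and recency."""
    def score(m: Memory) -> float:
        return (1 - recency_weight) * m.relevance + recency_weight * recency(m, now)
    return sorted(memories, key=score, reverse=True)


def render_for_prompt(memories: list[Memory]) -> str:
    """Plain-text rendering, so the result does not depend on any one model."""
    return "\n".join(f"- {m.text}" for m in memories)
```

Because the ranking and rendering never touch a tokenizer or a model-specific message format, the same memory store can sit in front of a local model or a hosted API.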

The design also required managing what not to retrieve. Injecting every memory into every prompt would quickly exhaust context window limits and degrade response quality as the model tries to attend to too much irrelevant history. The retrieval system filters aggressively, bringing in only memories that are likely to be relevant to the current turn. This is the same challenge that every production RAG system faces - retrieval precision matters as much as retrieval recall.
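Aggressive filtering can be sketched as a score threshold plus a token budget. The thresholds below are illustrative assumptions, and the whitespace token counter is a deliberate stand-in: in a model-agnostic setup the real counter would be injected per backend.

```python
from typing import Callable


def select_for_prompt(
    scored: list[tuple[str, float]],
    min_score: float = 0.5,
    max_tokens: int = 50,
    count_tokens: Callable[[str], int] = lambda t: len(t.split()),
) -> list[str]:
    """Keep the highest-scoring memories that clear the threshold and fit the budget."""
    chosen: list[str] = []
    used = 0
    for text, score in sorted(scored, key=lambda p: p[1], reverse=True):
        if score < min_score:
            break  # everything after this point scores lower still
        cost = count_tokens(text)
        if used + cost > max_tokens:
            continue  # too large for the remaining budget; try smaller ones
        chosen.append(text)
        used += cost
    return chosen


candidates = [
    ("user prefers postgres over mysql", 0.9),
    (" ".join(["filler"] * 10), 0.8),
    ("project deadline is friday", 0.7),
    ("likes coffee", 0.1),
]
kept = select_for_prompt(candidates, min_score=0.5, max_tokens=10)
```

The threshold enforces precision (low-scoring memories never enter the prompt), while the budget caps how much history the model has to attend to on any single turn.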

Domain-Specific RAG

Debi: Schema Retrieval for SQL

Debi is an AI database assistant that uses RAG to understand schema context before generating SQL. The challenge is domain-specific: a language model given a raw table schema can generate SQL that is syntactically correct but semantically wrong - using the wrong join key, ignoring a foreign key constraint, or selecting from a table that has been deprecated in favor of a view.

The retrieval layer in Debi indexes schema metadata - table definitions, column descriptions, relationships, indexes, and usage notes - and retrieves the relevant subset based on the user's query before generation. This means the model generates SQL with accurate knowledge of the specific database it is querying, not with generic knowledge of SQL syntax.
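To make the idea concrete, here is a simplified sketch of schema retrieval before SQL generation. It is not Debi's actual code: the `TableDoc` structure, the keyword-overlap scoring (a stand-in for embedding or hybrid search), and the example tables are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class TableDoc:
    name: str
    description: str
    columns: dict[str, str]  # column name -> human-readable description
    notes: str = ""          # usage notes, e.g. deprecation warnings

    def text(self) -> str:
        """Flatten all metadata into one searchable string."""
        cols = " ".join(f"{c} {d}" for c, d in self.columns.items())
        return f"{self.name} {self.description} {cols} {self.notes}"


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens; underscores split identifiers."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_schema(question: str, tables: list[TableDoc], k: int = 2) -> list[TableDoc]:
    """Return the k tables whose metadata shares the most terms with the question."""
    q = tokens(question)
    scored = [(len(q & tokens(t.text())), t) for t in tables]
    scored = [(s, t) for s, t in scored if s > 0]
    scored.sort(key=lambda p: (-p[0], p[1].name))
    return [t for _, t in scored[:k]]


def schema_context(tables: list[TableDoc]) -> str:
    """Render the retrieved subset as prompt context for the SQL generator."""
    lines = []
    for t in tables:
        cols = ", ".join(f"{c} ({d})" for c, d in t.columns.items())
        note = f" Note: {t.notes}" if t.notes else ""
        lines.append(f"Table {t.name}: {t.description}. Columns: {cols}.{note}")
    return "\n".join(lines)


tables = [
    TableDoc("orders", "customer orders",
             {"order_id": "primary key", "customer_id": "foreign key to customers",
              "total": "order amount"}),
    TableDoc("customers", "registered customers",
             {"customer_id": "primary key", "email": "contact email"}),
    TableDoc("audit_log", "internal change history",
             {"event_id": "primary key", "payload": "json blob"}),
]
hits = retrieve_schema("total order amount per customer email", tables)
context = schema_context(hits)
```

Only the two relevant tables reach the prompt, along with the join key relationship stated in the column descriptions, which is the information the model needs to avoid a wrong join.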

The result is a system that can explain query plans, suggest appropriate indexes for slow queries, and identify when a proposed query would produce incorrect results due to a misunderstood join condition. This is RAG applied to a narrow, well-defined domain - and narrow domains are where RAG works best, because the retrieval problem is tractable and the evaluation is clear.

Tech Stack

Tools and Frameworks

LangChain · LlamaIndex · ChromaDB · Pinecone · pgvector · OpenAI Embeddings · HuggingFace Embeddings · BM25 · Faiss · Python · FastAPI · PyTorch · asyncio

Open to RAG and LLM systems roles.

Google · Meta · OpenAI · Anthropic · Cohere · Pinecone · Weaviate and AI-native companies.