LLM Systems Engineering · USA
LLM Engineer
I build the layer between raw language models and working products - inference routers, retrieval pipelines, provider orchestration, and serving infrastructure that runs in production.
Overview
What LLM Engineering Actually Involves
Large language model engineering sits at the intersection of systems engineering, machine learning research, and product thinking. An LLM engineer needs to understand transformer models well enough to debug unexpected outputs, be comfortable enough with distributed systems to design serving infrastructure, and be pragmatic enough to ship something that handles real load.
Most of my work lives in this space. I have designed multi-provider inference routers that fail over gracefully, built RAG pipelines that maintain coherence across long sessions, fine-tuned models using adapter methods when a general model was insufficient, and instrumented serving pipelines so silent failures get caught before users notice. At Omdena, I applied the same engineering discipline to a distributed ML pipeline processing over one million records, where reproducibility and observability were as critical as raw model accuracy.
The through-line in all of this is treating LLM infrastructure as software infrastructure. That means clean interfaces, observable systems, graceful degradation, and reproducible experiments - not as aspirations but as baseline requirements before anything ships.
Open Source
LUNA: 8-Provider Inference Router
The clearest example of my LLM engineering work is LUNA - a local-first AI engine I designed and shipped solo. At its core is a provider-agnostic inference router supporting eight distinct LLM backends: Ollama for fully local inference, Claude, Gemini, Groq, NVIDIA NIM, Mistral, OpenAI, and Cohere for cloud-hosted models.
Each backend has a different API shape, a different streaming protocol, a different token counting method, and different failure modes. The router normalizes all of this so every downstream skill sees the same interface regardless of which model is actually running. Provider selection, fallback chains, and latency budgets are all configurable without changing application code.
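A minimal sketch of the fallback-chain idea described above, not LUNA's actual code. The names `Provider`, `Completion`, `Router`, and `AllProvidersFailed` are hypothetical, and each adapter is assumed to accept a prompt plus a timeout and to raise an exception when it fails or exceeds that timeout.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Completion:
    """Normalized result that every downstream caller sees."""
    provider: str
    text: str
    latency_s: float


@dataclass
class Provider:
    name: str
    # Hypothetical adapter signature: (prompt, timeout_s) -> text.
    # The adapter is assumed to raise if the call fails or times out.
    complete: Callable[[str, float], str]
    timeout_s: float


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain has errored."""


class Router:
    def __init__(self, chain: List[Provider]):
        # Order defines the fallback priority: first entry is tried first.
        self.chain = chain

    def complete(self, prompt: str) -> Completion:
        errors = []
        for provider in self.chain:
            start = time.monotonic()
            try:
                text = provider.complete(prompt, provider.timeout_s)
            except Exception as exc:
                # Record the failure and fall through to the next provider.
                errors.append(f"{provider.name}: {exc!r}")
                continue
            return Completion(provider.name, text, time.monotonic() - start)
        raise AllProvidersFailed("; ".join(errors))
```

Because the chain is plain data, reordering providers or changing a timeout is a configuration change rather than an application-code change, which is the property the router is meant to guarantee.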
On top of the router sits a persistent memory system. Conversational context survives model swaps - if you switch from a local Ollama model to Claude mid-session, the conversation continues without interruption. This required a memory layer that is model-agnostic by construction, not retrofitted as an afterthought. The skills plugin system extends the engine further: drop a folder with a skill.json and SKILL.md and the new capability is available on the next request with no code changes and no server restarts required.
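One way the drop-in skill discovery could work, sketched under assumptions: the function name `discover_skills`, the optional `name` key inside skill.json, and the returned dictionary shape are all hypothetical. Only the two file names come from the description above.

```python
import json
from pathlib import Path


def discover_skills(skills_dir):
    """Return {skill_name: {"meta": ..., "instructions": ...}} for valid skill folders.

    A folder counts as a skill only if it contains both skill.json and SKILL.md.
    """
    skills = {}
    root = Path(skills_dir)
    if not root.is_dir():
        return skills
    for folder in sorted(root.iterdir()):
        manifest = folder / "skill.json"
        docs = folder / "SKILL.md"
        if not (folder.is_dir() and manifest.is_file() and docs.is_file()):
            continue
        try:
            meta = json.loads(manifest.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            continue  # a malformed manifest should not take the engine down
        if not isinstance(meta, dict):
            continue
        # Assumption: the manifest may carry a "name"; fall back to the folder name.
        name = meta.get("name", folder.name)
        skills[name] = {
            "meta": meta,
            "instructions": docs.read_text(encoding="utf-8"),
        }
    return skills
```

Calling a function like this at the start of each request (or behind a cheap modification-time check) is what would make a newly dropped folder visible without a restart.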
On-Device Inference
Voice Pipeline: Latency-First Design
LUNA includes a fully on-device voice pipeline: wake-word detection, Whisper STT, LLM inference, Edge TTS, and speaker output. No audio leaves the device at any stage. The engineering challenge is the end-to-end latency budget - wake-word detection needs to respond in under 200ms, Whisper needs to transcribe accurately without a remote round-trip, and TTS needs to start streaming output before the LLM has finished generating.
Achieving acceptable latency required careful choices at each stage: the smallest Whisper variant that meets accuracy thresholds, TTS configured to stream rather than wait for the full response, and LLM generation settings tuned so the first complete sentence arrives quickly. Streaming TTS output while the LLM is still generating was the key architectural decision: the user hears the first sentence while the rest is still being produced, so perceived latency tracks time-to-first-sentence rather than total generation time.
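The overlap between generation and speech can be illustrated with a small generator that turns a token stream into sentence-sized chunks a TTS engine could start speaking immediately. This is a sketch, not the LUNA implementation: `sentences_from_tokens` is a hypothetical name, and the sentence boundary rule (terminal punctuation followed by whitespace) is a simplifying assumption.

```python
import re
from typing import Iterable, Iterator

# Terminal punctuation followed by whitespace. Requiring the whitespace
# avoids splitting inside decimals such as "3.5".
_SENTENCE_END = re.compile(r"[.!?]+\s")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each complete sentence as soon as the token stream finishes it."""
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            match = _SENTENCE_END.search(buffer)
            if match is None:
                break
            sentence = buffer[:match.end()].strip()
            buffer = buffer[match.end():]
            if sentence:
                yield sentence
    # Flush whatever remains once the model stops generating.
    tail = buffer.strip()
    if tail:
        yield tail
```

Each yielded sentence would be handed to the TTS stage while the LLM keeps producing tokens, so synthesis of sentence one overlaps with generation of sentence two.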
Industry Experience
Production ML at Scale
At Omdena from July to October 2024, I served as senior ML engineer on a platform processing over one million health and longevity records. The work was infrastructure-heavy: designing distributed preprocessing pipelines, building experiment tracking infrastructure, and establishing quality gates before any model moved to evaluation.
The outcome: model accuracy improved from 72% to 89% - a 17-percentage-point gain - through systematic feature engineering and pipeline redesign. Preprocessing latency dropped 40% through architectural improvements to the data loading pipeline. I led the 15-person cross-functional team across the full engagement, responsible for ML architecture decisions, code review, and structured handoffs.
The engineering discipline from this work transfers directly to LLM systems: reproducible pipelines, structured evaluation before deployment, observability that catches silent failures, and clean interfaces across a distributed team. Whether you are tuning a gradient-boosted classifier or fine-tuning a Llama variant for a specific task, the rigor is identical.
Fine-Tuning and Alignment
Adapter Methods and Evaluation
I have implemented instruction fine-tuning and adapter-based fine-tuning using LoRA and QLoRA via HuggingFace PEFT. The workflow is: prepare a dataset, define the instruction format, run training with quantization to keep VRAM requirements manageable, and run structured evaluation against a held-out set before the fine-tuned model touches any downstream system.
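A back-of-envelope sketch of why adapters plus quantization keep VRAM manageable. The helper names are hypothetical, and the numbers are rough estimates: the weight-memory figure ignores quantization metadata, activations, and optimizer state.

```python
def lora_trainable_params(d_in, d_out, rank):
    """LoRA adds two low-rank factors per adapted matrix:
    A with shape (rank, d_in) and B with shape (d_out, rank)."""
    return rank * (d_in + d_out)


def full_params(d_in, d_out):
    """Parameters in the frozen dense weight matrix being adapted."""
    return d_in * d_out


def weight_bytes(n_params, bits_per_param):
    """Approximate memory for the weights alone at a given precision."""
    return n_params * bits_per_param // 8


# Illustrative sizes (assumed, not taken from a specific model):
# one 4096 x 4096 projection adapted with rank 8.
adapter = lora_trainable_params(4096, 4096, 8)
dense = full_params(4096, 4096)
trainable_fraction = adapter / dense
```

For that single projection the adapter trains 65,536 parameters against 16,777,216 frozen ones, roughly 0.4%. By the same arithmetic, storing 7 billion weights at 4 bits takes about 3.5 GB versus about 14 GB at 16 bits, which is where QLoRA's savings come from.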
The most common failure modes in fine-tuning are evaluating on a test set that overlaps with or closely resembles the training data, and mistaking lower training loss for genuine task improvement. I build evaluation pipelines before training starts, defining clear metrics for the specific task so there is an unambiguous signal of whether fine-tuning helped - lower perplexity alone is not evidence of improvement on what actually matters.
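A simple guard against the first failure mode is to check the held-out set for leakage before training. The sketch below uses exact matching after light normalization, which is an assumption chosen for brevity; near-duplicate detection would need something stronger. `normalize` and `overlap_report` are hypothetical names.

```python
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())


def overlap_report(train_examples, test_examples):
    """Count test examples whose normalized text also appears in training data."""
    train_set = {normalize(t) for t in train_examples}
    leaked = [t for t in test_examples if normalize(t) in train_set]
    total = len(test_examples)
    return {
        "test_size": total,
        "leaked": len(leaked),
        "leak_rate": len(leaked) / total if total else 0.0,
    }
```

A non-zero leak rate means the reported test score partly measures memorization, so the held-out set should be rebuilt before any comparison between the base and fine-tuned model.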
For alignment, I have studied and implemented preference datasets, reward modeling, and Direct Preference Optimization (DPO), a more stable alternative to PPO-based RLHF that eliminates the separate reward model training step. Understanding these alignment techniques is increasingly important as production LLM systems need to behave predictably under adversarial inputs.
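To make the "no separate reward model" point concrete, here is the per-example DPO objective written out in plain Python. It takes sequence log-probabilities as inputs and is a sketch for illustration only; the function names and the default `beta` are assumptions, and a real training loop would compute this over batches of tensors.

```python
import math


def _softplus(x):
    """Numerically stable log(1 + exp(x))."""
    if x > 0:
        return x + math.log1p(math.exp(-x))
    return math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair.

    Inputs are log-probabilities of the chosen and rejected responses under
    the policy being trained and under the frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) is the same as softplus(-logits).
    return _softplus(-logits)
```

The implicit reward is the log-probability ratio between policy and reference, so the preference data trains the policy directly. When the policy equals the reference the loss is log 2, and it falls as the policy shifts probability toward the chosen response.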
Tech Stack
Tools and Frameworks
Open to LLM engineering roles.
Google DeepMind · Meta FAIR · OpenAI · Anthropic · Mistral · Cohere and top AI-native companies.