May 1, 2026 · 10 min read

Building LUNA: Lessons from Shipping a Local AI Engine

I built LUNA because I was tired of API costs and logging. Six months later it routes across 8 LLM providers with a plugin system, persistent memory, and on-device voice. Here is what I learned.

Open Source · LLM · Architecture
Sehastrajit Selvachandran
AI/ML Engineer · M.S. CS @ ASU

I built LUNA because I was tired of paying API costs for every tool I wanted to automate, and tired of my conversations being logged somewhere I did not control. The initial version was a wrapper around Ollama with a voice interface. Six months later, it is a local-first AI engine routing across eight providers, with a plugin-based skills system, persistent memory, health data integrations, and support for both personal and team deployments.

Here is what I learned.

The Core Architecture Decision: Unified Interface First

The first decision was the most important: build a provider-agnostic interface before implementing any provider-specific code. Every LLM provider has a different API shape. Ollama uses one streaming protocol. Claude uses another. Groq returns different metadata. OpenAI has function calling formatted one way; Anthropic has tool use formatted differently.

If I had started with Ollama and added providers later, every skill in the system would have accumulated provider-specific branching over time. Instead, I defined the interface the system expected - a streaming response generator, a uniform token counting method, a standard tool call format - and implemented adapters for each provider behind that interface.

The payoff: adding a new provider takes a few hours, not a refactor. A skill written for Ollama works on Claude without modification. The skills system does not know which LLM is running, and it does not need to.
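A minimal sketch of what such a provider-agnostic interface might look like. This is not LUNA's actual code: the names `Provider`, `ToolCall`, `EchoProvider`, and `collect` are hypothetical, and the whitespace "tokenizer" is only a placeholder.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, AsyncIterator, Protocol


@dataclass
class ToolCall:
    """Provider-neutral tool call (hypothetical shape)."""
    name: str
    arguments: dict[str, Any] = field(default_factory=dict)


class Provider(Protocol):
    """What the core expects from every adapter (illustrative names)."""

    def stream(self, prompt: str) -> AsyncIterator[str]: ...

    def count_tokens(self, text: str) -> int: ...


class EchoProvider:
    """Stand-in adapter; a real one would translate a vendor API here."""

    async def stream(self, prompt: str) -> AsyncIterator[str]:
        for word in prompt.split():
            yield word

    def count_tokens(self, text: str) -> int:
        # Whitespace split is a placeholder, not a real tokenizer.
        return len(text.split())


async def collect(provider: Provider, prompt: str) -> list[str]:
    """Core code depends only on the Provider protocol, never on a vendor."""
    return [token async for token in provider.stream(prompt)]
```

Because `collect` only sees the protocol, swapping `EchoProvider` for any other adapter with the same two methods requires no change to the calling code.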

The Hardest Part: Streaming Without Buffering

The naive way to deliver an LLM response is to buffer the whole thing and emit it at once. This works. It also feels terrible - the UI sits frozen for 15 seconds and then everything appears at once.

True streaming means passing the token stream from the LLM through the application layer to the client without buffering anywhere in the chain. This sounds simple. It is not simple.

The challenge is that the streaming protocol differs per provider (SSE for some, chunked JSON for others), error handling must work mid-stream (what happens when the model stops generating unexpectedly?), and the voice pipeline needs to start TTS on partial output before the full response is complete.

I solved this with Python async generators. Each provider adapter is an async generator that yields tokens as they arrive. The core router iterates the adapter generator. The WebSocket handler forwards tokens to the client in real time. The voice pipeline listens to the same token stream and starts TTS as soon as enough tokens have accumulated for a coherent utterance - without waiting for the LLM to finish. Nothing in the chain waits for the full response.
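The generator chain described above can be sketched roughly as follows. This is an illustration under assumptions, not the real implementation: `fake_adapter`, `route`, `forward`, and the two sinks are hypothetical stand-ins for a provider adapter, the router, the WebSocket handler, and the voice pipeline.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable

Sink = Callable[[str], Awaitable[None]]


async def fake_adapter(prompt: str) -> AsyncIterator[str]:
    """Hypothetical provider adapter: yields tokens as they 'arrive'."""
    for token in ["Hel", "lo", " wor", "ld"]:
        await asyncio.sleep(0)  # stands in for waiting on the network
        yield token


async def route(prompt: str) -> AsyncIterator[str]:
    """Router: re-yields each token immediately, holding nothing back."""
    async for token in fake_adapter(prompt):
        yield token


async def forward(prompt: str, sinks: list[Sink]) -> None:
    """Hand every token to every sink as soon as it is produced."""
    async for token in route(prompt):
        for sink in sinks:
            await sink(token)


async def demo() -> tuple[list[str], list[str]]:
    ui: list[str] = []
    voice: list[str] = []

    async def send_to_client(token: str) -> None:  # stand-in for a WebSocket send
        ui.append(token)

    async def feed_tts(token: str) -> None:  # stand-in for the voice pipeline
        voice.append(token)

    await forward("hi", [send_to_client, feed_tts])
    return ui, voice
```

Each stage pulls one token and passes it on before asking for the next, so the client and the voice branch both see tokens in arrival order.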

This design required careful thought about backpressure: what happens when the client is slower than the model? And about error recovery: if a provider drops the connection mid-stream, can LUNA retry on a fallback provider from the beginning of the response, or is the partial response already committed to the UI?
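One common way to handle the backpressure question is a small bounded queue between the model and the client: when the queue is full, the producer suspends until the client catches up. The sketch below shows that general technique with `asyncio.Queue`; it is not necessarily how LUNA does it, and `model_tokens` and `run_bounded` are hypothetical names.

```python
import asyncio
from typing import AsyncIterator

_DONE = object()  # sentinel marking the end of the stream


async def model_tokens(n: int) -> AsyncIterator[str]:
    """Hypothetical fast producer standing in for the LLM."""
    for i in range(n):
        yield f"t{i}"


async def run_bounded(n: int = 20, maxsize: int = 4) -> tuple[list[str], int]:
    """Stream n tokens through a bounded queue; return tokens and peak queue size."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
    received: list[str] = []
    peak = 0

    async def producer() -> None:
        async for token in model_tokens(n):
            await queue.put(token)  # suspends while the queue is full
        await queue.put(_DONE)

    async def slow_client() -> None:
        nonlocal peak
        while True:
            peak = max(peak, queue.qsize())
            item = await queue.get()
            if item is _DONE:
                return
            await asyncio.sleep(0.001)  # simulate a slow network write
            received.append(item)

    await asyncio.gather(producer(), slow_client())
    return received, peak
```

The `maxsize` value bounds how far the model can run ahead of the client, which is the tradeoff to tune: a larger value smooths over jitter, a smaller one keeps memory and staleness low.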

The Skills System: Zero-Code Extensibility

The second major architecture decision was the plugin system. The goal was that someone should be able to add a new capability to LUNA without touching the core codebase, without understanding how the LLM routing works, and without restarting anything.

Each skill is a directory containing two files:

  • skill.json - declares the skill name, description, parameters, and any environment variables it needs
  • SKILL.md - a markdown file telling the LLM what the skill does and how to call it

At startup, LUNA scans the skills directory, reads every skill.json, and registers each skill as a tool the LLM can call. The LLM decides when to invoke a skill based on the description in the SKILL.md. The engine routes the tool call to the right skill's handler, runs it, and returns the result to the LLM for incorporation into the response.
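The startup scan might look something like the sketch below. The manifest keys (`name`, `description`, `parameters`, `env`) and the function names are assumptions for illustration; the real skill.json schema may differ.

```python
import json
import tempfile
from pathlib import Path


def load_skills(skills_dir: Path) -> dict[str, dict]:
    """Scan skills_dir/*/skill.json and build a registry keyed by skill name."""
    registry: dict[str, dict] = {}
    for manifest_path in sorted(skills_dir.glob("*/skill.json")):
        manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
        instructions = manifest_path.parent / "SKILL.md"
        registry[manifest["name"]] = {
            "description": manifest.get("description", ""),
            "parameters": manifest.get("parameters", {}),
            "env": manifest.get("env", []),
            # SKILL.md is kept as raw text to be shown to the LLM.
            "instructions": (
                instructions.read_text(encoding="utf-8")
                if instructions.exists()
                else ""
            ),
        }
    return registry


def demo() -> dict[str, dict]:
    """Create one throwaway skill folder and load it."""
    with tempfile.TemporaryDirectory() as tmp:
        skill = Path(tmp) / "weather"
        skill.mkdir()
        manifest = {
            "name": "weather",
            "description": "Look up the current weather for a city.",
            "parameters": {"city": {"type": "string"}},
            "env": ["WEATHER_API_KEY"],
        }
        (skill / "skill.json").write_text(json.dumps(manifest), encoding="utf-8")
        (skill / "SKILL.md").write_text(
            "Call `weather` with a city name.", encoding="utf-8"
        )
        return load_skills(Path(tmp))
```

Calling `demo()` returns a one-entry registry for the hypothetical `weather` skill, which is the shape the engine would then expose to the LLM as a tool list.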

Nine built-in skills ship with LUNA: research agent, coding agent, desktop agent, document drafter, file builder, dataset builder, resume analyzer, workspace suite (Gmail, Calendar, Drive), and job assistant. Custom skills are a dropped folder - no code changes, no restarts.

The skills are also composable. The research agent can call the coding agent to execute Python on data it retrieved. The desktop agent can use the workspace suite to email results. Multi-step workflows emerge from composition rather than from complex orchestration logic in the core engine.

The Voice Pipeline: Latency Budget Engineering

LUNA's voice pipeline runs fully on-device: wake-word detection, Whisper STT, LLM inference, Edge TTS, speaker output. No audio leaves the machine.

The engineering challenge is the end-to-end latency budget. Wake-word detection must fire in under 200 ms to feel immediate. Whisper must transcribe accurately without a remote round-trip. TTS must start streaming before the LLM has finished generating - otherwise the user waits for the full response before hearing anything.

Achieving acceptable latency required model selection at each stage: the smallest Whisper variant that meets accuracy requirements on the target accent and vocabulary, TTS configured for streaming rather than batch output, and LLM sampling parameters tuned to favor shorter initial tokens. The streaming-TTS-while-LLM-generates design is the key architectural decision - without it, the pipeline would feel unresponsive on any hardware.
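To make the streaming-TTS idea concrete, here is a rough sketch of a chunker that turns a token stream into speakable utterances at sentence boundaries. The boundary rule and the `min_chars` threshold are assumptions chosen for illustration, and `utterances`, `from_list`, and `chunk` are hypothetical names.

```python
import asyncio
from typing import AsyncIterator

SENTENCE_END = (".", "!", "?")


async def utterances(
    tokens: AsyncIterator[str], min_chars: int = 12
) -> AsyncIterator[str]:
    """Yield text as soon as it ends a sentence and is long enough to speak."""
    pending = ""
    async for token in tokens:
        pending += token
        if len(pending) >= min_chars and pending.rstrip().endswith(SENTENCE_END):
            yield pending.strip()
            pending = ""
    if pending.strip():
        # Flush whatever is left when the model stops generating.
        yield pending.strip()


async def from_list(parts: list[str]) -> AsyncIterator[str]:
    """Stand-in for the LLM token stream."""
    for part in parts:
        yield part


async def chunk(parts: list[str]) -> list[str]:
    return [u async for u in utterances(from_list(parts))]
```

With the tokens `["Sure", ".", " The", " weather", " is", " mild", ".", " Bring", " a", " hat"]`, the short "Sure." is held back until the next sentence completes, so TTS receives "Sure. The weather is mild." first and "Bring a hat" when the stream ends.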

The latency budget is also hardware-dependent in ways that are hard to abstract. A Whisper model that takes 1.5 seconds on a modern laptop takes 6 seconds on older hardware. I parameterized the model choices for each stage so users can tune the tradeoff for their machine.
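Per-stage parameterization could be expressed as a small profile object, sketched below. The profile names, fields, and LLM tags are all hypothetical; only the Whisper size names ("tiny", "small") correspond to real model sizes.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class VoiceProfile:
    whisper_model: str          # Whisper size name, e.g. "tiny", "base", "small"
    llm_model: str              # hypothetical local model tag
    tts_streaming: bool = True  # stream audio instead of batch synthesis


# Hypothetical presets trading accuracy for latency.
PROFILES = {
    "older-hardware": VoiceProfile(whisper_model="tiny", llm_model="small-llm"),
    "modern-laptop": VoiceProfile(whisper_model="small", llm_model="medium-llm"),
}


def pick_profile(name: str, **overrides) -> VoiceProfile:
    """Start from a preset and override individual stages as needed."""
    return replace(PROFILES[name], **overrides)
```

A user on a mid-range machine could then call `pick_profile("modern-laptop", whisper_model="base")` to trade a little transcription accuracy for speed without touching the other stages.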

What I Would Do Differently

Context window management is harder than it looks. The naive approach - include the full conversation history in every prompt - works until sessions get long. Then it breaks in two ways: context window limits are hit, and the model starts attending to irrelevant history. I retrofitted a memory and retrieval system late in development, and it required rethinking how context is assembled for every type of request. This should have been a first-class concern from the beginning.
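The simplest step beyond "include everything" is a token budget that keeps the system prompt plus as many of the most recent messages as fit. The sketch below shows only that baseline, not the memory and retrieval system described above; `assemble_context` is a hypothetical name and the default token counter is a whitespace placeholder.

```python
from typing import Callable


def assemble_context(
    system: str,
    history: list[str],
    budget: int,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> list[str]:
    """Keep the system prompt plus the newest messages that fit the budget."""
    remaining = budget - count_tokens(system)
    kept: list[str] = []
    for message in reversed(history):  # walk from newest to oldest
        cost = count_tokens(message)
        if cost > remaining:
            break
        kept.append(message)
        remaining -= cost
    return [system] + kept[::-1]  # restore chronological order
```

Recency-only trimming addresses the hard limit but not the relevance problem, which is why a retrieval step over older history ends up being necessary.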

Provider fallback logic is subtler than it appears. If Ollama is slow, fall back to Groq. If Groq rate-limits, fall back to Claude. Simple in principle. In practice: what counts as "slow"? How long do you wait before declaring a timeout? Should you retry the same provider or fall back immediately? If the first few tokens of a response were already streamed to the client, should the fallback start over or continue mid-sentence? I made these decisions ad hoc. They should be a configurable policy with sensible defaults.
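A configurable policy of the kind described might start out like this sketch. It makes one specific choice among the open questions above: fall back only before the first token has been emitted, and let mid-stream errors propagate. `FallbackPolicy`, `stream_with_fallback`, and the two demo providers are hypothetical.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncGenerator, AsyncIterator, Callable, Optional

ProviderFn = Callable[[str], AsyncGenerator[str, None]]


@dataclass(frozen=True)
class FallbackPolicy:
    first_token_timeout: float = 2.0  # what counts as "slow", in seconds
    attempts_per_provider: int = 1    # tries on one provider before moving on


async def stream_with_fallback(
    prompt: str, providers: list[ProviderFn], policy: FallbackPolicy
) -> AsyncIterator[str]:
    """Try providers in order; fall back only before the first token."""
    last_error: Optional[BaseException] = None
    for provider in providers:
        for _ in range(policy.attempts_per_provider):
            stream = provider(prompt)
            try:
                first = await asyncio.wait_for(
                    stream.__anext__(), timeout=policy.first_token_timeout
                )
            except (asyncio.TimeoutError, ConnectionError, StopAsyncIteration) as exc:
                last_error = exc
                await stream.aclose()
                continue
            # The response is now committed to the client, so later
            # errors propagate instead of triggering a restart.
            yield first
            async for token in stream:
                yield token
            return
    raise RuntimeError("all providers failed") from last_error


async def slow_provider(prompt: str) -> AsyncGenerator[str, None]:
    await asyncio.sleep(1.0)  # stand-in for an overloaded local model
    yield "too late"


async def fast_provider(prompt: str) -> AsyncGenerator[str, None]:
    for token in ("fallback", " reply"):
        yield token


async def run_demo() -> list[str]:
    policy = FallbackPolicy(first_token_timeout=0.05)
    providers = [slow_provider, fast_provider]
    return [t async for t in stream_with_fallback("hi", providers, policy)]
```

In `run_demo`, the slow provider misses the 50 ms first-token deadline, so the reply comes entirely from the second provider. Restart-versus-continue behaviour after a partial response would be a further field on the policy.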

The plugin interface needs versioning. As LUNA has evolved, the interface that skills depend on has changed. Skills written for earlier versions break on newer versions in non-obvious ways. A versioned skill API with explicit compatibility guarantees would have made this manageable. I am retrofitting this now.
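One lightweight form of such a guarantee is a major.minor API version declared by each skill and checked at load time. The sketch below assumes a hypothetical `api_version` string in the manifest and a hypothetical engine version; the rule shown (same major, skill minor not newer than the engine's) is one common convention, not LUNA's confirmed scheme.

```python
ENGINE_API = (2, 1)  # hypothetical (major, minor) skill API version of the engine


def parse_version(text: str) -> tuple[int, int]:
    """Parse a 'major.minor' string into a tuple of ints."""
    major, minor = text.split(".")
    return int(major), int(minor)


def is_compatible(skill_api: str, engine_api: tuple[int, int] = ENGINE_API) -> bool:
    """A skill loads if it targets the same major and a minor the engine supports."""
    major, minor = parse_version(skill_api)
    return major == engine_api[0] and minor <= engine_api[1]
```

Rejecting an incompatible skill at load time with a clear message replaces the non-obvious runtime breakage with an explicit, early failure.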

The Open Source Part

LUNA is on GitHub. The personal variant covers voice, vision, Spotify, health data, and desktop automation. The business variant adds multi-user JWT auth, rate limiting, and messaging platform integrations (Telegram, Discord, Slack). The skills system is the same in both - everything that works in personal mode works in business mode.

If you are building a local AI assistant, a multi-provider LLM router, or an agent framework, the architecture decisions above are the ones that will either pay off or cost you later. Build the provider interface first. Design the streaming layer before you need voice. Make the plugin system zero-coupling from day one.

Sehastrajit Selvachandran
AI/ML Engineer. M.S. CS at Arizona State University (GPA 3.94). Creator of LUNA. 4 peer-reviewed publications. Building LLM systems, CV pipelines, and production ML infrastructure.