The Voice AI Bottleneck I Encountered
A little over six months ago, I set out to build a new class of AI system code-named Deliberium. The goal was ambitious: create voice-native experiences that feel instantaneous, interruption-aware, and genuinely useful in real-world settings. As the project progressed, I kept hitting the same wall. Every promising speech-to-speech model required me to reinvent the same plumbing: session management, audio frame handling, transport boundaries, tool context injection, interruption logic, and fallback policies. The models themselves were impressive, but the runtime glue that turned them into reliable products was brittle, provider-specific, and impossible to test deterministically.
That realisation led to the second major subsystem I have now decided to open-source: Vona. While my broader AI project remains under active development, Vona was ready for the community sooner than the rest. I believe teams building voice products, whether for mixed-reality headsets, customer service platforms, or ambient assistants, will find it immediately useful for both production deployments and research experiments.
What Vona Actually Is
Vona is a lightweight, provider-neutral Rust runtime layer for real-time speech-to-speech (STS) systems. It does not try to be a full voice assistant. Instead, it owns the hard boundary between your application’s product surface and the ever-changing world of speech models, transports, and deployment topologies.
Think of it as the operating system kernel for voice AI: stable contracts, deterministic behaviour, and pluggable adapters so that your host application stays blissfully unaware of whether it is talking to OpenAI’s Realtime API, Google’s Gemini Live, a local Moshi model, or a custom cascade of STT–LLM–TTS services.
How Vona Works
At its heart, Vona is built around four clean surfaces:
- AudioTransport - receives microphone frames and delivers playback frames, handling barge-in and buffer clearing automatically.
- SpeechToSpeechBackend - the provider-specific implementation (step-oriented or event-stream realtime).
- VonaRuntime - orchestrates the session, applies your policy, and tracks every metric that matters.
- SkillExecutor - injects external context and tool results back into the backend without tight coupling.
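To make the division of responsibilities concrete, here is a minimal sketch of how the four surfaces could be expressed as Rust traits. The trait names come from the list above, but every method signature, the AudioFrame and BackendOutput types, the RuntimeSketch struct, and the in-memory stand-ins (VecTransport, EchoBackend, FixedSkill) are assumptions made for illustration. They are not Vona's actual API, and barge-in handling is left out to keep the sketch short.

```rust
// Hypothetical sketch: the trait names come from the list above, but every
// method, type, and field here is an assumption, not Vona's actual API.
use std::collections::VecDeque;

/// One frame of mono 16-bit PCM audio.
#[derive(Debug, Clone, PartialEq)]
pub struct AudioFrame(pub Vec<i16>);

/// What a backend may produce in response to one input frame.
pub enum BackendOutput {
    Silence,
    Audio(AudioFrame),
    ToolCall { name: String, args: String },
}

/// Moves audio between the host application and the runtime.
pub trait AudioTransport {
    fn recv_mic(&mut self) -> Option<AudioFrame>;
    fn send_playback(&mut self, frame: AudioFrame);
}

/// Provider-specific adapter (step-oriented flavour only).
pub trait SpeechToSpeechBackend {
    fn step(&mut self, input: &AudioFrame) -> BackendOutput;
    fn inject_context(&mut self, context: String);
}

/// Runs a tool on behalf of the model and returns its result as text.
pub trait SkillExecutor {
    fn execute(&mut self, name: &str, args: &str) -> String;
}

/// Orchestrates one session and counts frames in both directions.
pub struct RuntimeSketch<T, B, S> {
    pub transport: T,
    pub backend: B,
    pub skills: S,
    pub frames_in: usize,
    pub frames_out: usize,
}

impl<T: AudioTransport, B: SpeechToSpeechBackend, S: SkillExecutor> RuntimeSketch<T, B, S> {
    /// Pump frames until the transport has no more microphone input.
    pub fn run(&mut self) {
        while let Some(frame) = self.transport.recv_mic() {
            self.frames_in += 1;
            match self.backend.step(&frame) {
                BackendOutput::Silence => {}
                BackendOutput::Audio(out) => {
                    self.frames_out += 1;
                    self.transport.send_playback(out);
                }
                BackendOutput::ToolCall { name, args } => {
                    // Tool results flow back to the model as injected context.
                    let result = self.skills.execute(&name, &args);
                    self.backend.inject_context(result);
                }
            }
        }
    }
}

// In-memory stand-ins so the sketch can run without a real model or device.
pub struct VecTransport {
    pub mic: VecDeque<AudioFrame>,
    pub played: Vec<AudioFrame>,
}

impl AudioTransport for VecTransport {
    fn recv_mic(&mut self) -> Option<AudioFrame> { self.mic.pop_front() }
    fn send_playback(&mut self, frame: AudioFrame) { self.played.push(frame); }
}

/// Requests a tool on the first frame, then echoes every later frame.
pub struct EchoBackend {
    pub seen: usize,
    pub context: Vec<String>,
}

impl SpeechToSpeechBackend for EchoBackend {
    fn step(&mut self, input: &AudioFrame) -> BackendOutput {
        self.seen += 1;
        if self.seen == 1 {
            BackendOutput::ToolCall { name: "clock".into(), args: "{}".into() }
        } else {
            BackendOutput::Audio(input.clone())
        }
    }
    fn inject_context(&mut self, context: String) { self.context.push(context); }
}

pub struct FixedSkill;

impl SkillExecutor for FixedSkill {
    fn execute(&mut self, name: &str, _args: &str) -> String { format!("{}: ok", name) }
}
```

Because the runtime in this sketch only sees traits, swapping EchoBackend for a cloud or local adapter would not change the host code that builds and runs the session, which is the portability property the design is aiming for.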
Audio flows in 20 ms frames. The runtime measures time-to-first-audio, detects interruptions, executes tool calls, injects context, and decides on fallbacks, all while remaining fully deterministic. The included test harness lets you script exact sequences of events and assert that interruptions clean up correctly, that fallback paths fire when expected, and that latency stays within your chosen budget.
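The idea of scripting events against a deterministic clock can be illustrated with a small, self-contained sketch. It is not the Vona test harness: the Event and Report types and the replay function are hypothetical, and two simplifications are assumed. First, a virtual clock advances by exactly one 20 ms frame per scripted event. Second, time-to-first-audio is measured from the first silent frame after the user stops speaking to the first frame of model audio.

```rust
// Hypothetical sketch: none of these names are Vona's API. A virtual clock
// that advances 20 ms per scripted event keeps every run reproducible.

pub const FRAME_MS: u64 = 20;

/// Samples in one 20 ms mono frame at the given sample rate.
pub fn samples_per_frame(sample_rate_hz: u64) -> u64 {
    sample_rate_hz * FRAME_MS / 1000
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Event {
    /// A frame containing user speech.
    UserAudio,
    /// A frame in which nobody speaks.
    Silence,
    /// A frame of model speech queued for playback.
    ModelAudio,
}

#[derive(Debug, Default, PartialEq)]
pub struct Report {
    pub time_to_first_audio_ms: Option<u64>,
    pub interruptions: u32,
    pub queued_playback_frames: usize,
}

/// Replay a scripted session and collect the metrics it produces.
pub fn replay(script: &[Event]) -> Report {
    let mut report = Report::default();
    let mut now_ms = 0u64;
    let mut user_speaking = false;
    let mut user_stopped_at: Option<u64> = None;

    for &event in script {
        match event {
            Event::UserAudio => {
                // Barge-in: user speech while playback is queued clears the queue.
                if report.queued_playback_frames > 0 {
                    report.interruptions += 1;
                    report.queued_playback_frames = 0;
                }
                user_speaking = true;
                user_stopped_at = None;
            }
            Event::Silence => {
                // Record only the transition from speech to silence.
                if user_speaking {
                    user_speaking = false;
                    user_stopped_at = Some(now_ms);
                }
            }
            Event::ModelAudio => {
                // Only the first model frame after the user stops sets the metric.
                if let (None, Some(stopped)) = (report.time_to_first_audio_ms, user_stopped_at) {
                    report.time_to_first_audio_ms = Some(now_ms - stopped);
                }
                report.queued_playback_frames += 1;
            }
        }
        now_ms += FRAME_MS;
    }
    report
}
```

With these definitions a 16 kHz stream carries 320 samples per frame. A script of two user frames, two silent frames, two model frames, and a final user frame yields a time-to-first-audio of 40 ms, one interruption, and an empty playback queue, and it yields exactly that on every run because no wall-clock time is involved.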
The architecture is deliberately modular. A single vona facade crate lets you opt in to only the features you need, keeping binary size and compile times low. Provider adapters live in separate crates so that adding support for a new model never touches your core logic.
Key Capabilities That Matter in Production
- Backend portability - swap between local models (Seamless, Moshi) and cloud services (OpenAI Realtime, Gemini Live, Azure, ElevenLabs, Deepgram) with zero changes to your application code.
- Interruption-aware sessions - proper handling of barge-in, partial outputs, and cleanup.
- Tool and context injection - schema-validated skill registry and external events that keep product logic outside the model boundary.
- Deterministic testing - the vona-test-harness crate gives you release-gate confidence that your voice stack behaves exactly as designed.
- Transport flexibility - in-process, local HTTP, or length-prefixed IPC via the sidecar binary; ideal for edge, desktop, or containerised deployments.
- Observability built in - session metrics, audit events, and structured traces out of the box.
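For readers unfamiliar with length-prefixed IPC, the sketch below shows the general technique using only the Rust standard library. The 4-byte big-endian length header, the 1 MiB cap, and the function names write_frame and read_frame are assumptions chosen for illustration; the actual wire format of the Vona sidecar may differ.

```rust
use std::io::{self, Read, Write};

// Hypothetical sketch: a 4-byte big-endian length prefix is an assumption
// made for illustration; Vona's sidecar wire format may differ.

/// Upper bound on one message, so a corrupt header cannot force a huge allocation.
pub const MAX_FRAME_LEN: usize = 1 << 20; // 1 MiB

/// Write one message as `[length: u32, big-endian][payload bytes]`.
pub fn write_frame<W: Write>(writer: &mut W, payload: &[u8]) -> io::Result<()> {
    if payload.len() > MAX_FRAME_LEN {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "payload too large"));
    }
    writer.write_all(&(payload.len() as u32).to_be_bytes())?;
    writer.write_all(payload)
}

/// Read one message; a stream that ends early surfaces as `UnexpectedEof`.
pub fn read_frame<R: Read>(reader: &mut R) -> io::Result<Vec<u8>> {
    let mut header = [0u8; 4];
    reader.read_exact(&mut header)?;
    let len = u32::from_be_bytes(header) as usize;
    if len > MAX_FRAME_LEN {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "frame too large"));
    }
    let mut payload = vec![0u8; len];
    reader.read_exact(&mut payload)?;
    Ok(payload)
}
```

The prefix lets the receiver recover message boundaries from a plain byte stream such as a pipe or a Unix socket without scanning for delimiters, which is why this style of framing suits a host process talking to a sidecar binary.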
All of this sits in a 13-crate Rust workspace that compiles cleanly with cargo check --locked and passes a comprehensive release gate script before every change.
Getting Started Is Straightforward
Installation guidance can be found on the project website: https://vona.deliberium.ai.
For most teams the quickest path is to add the facade crate to your Cargo.toml with the features you need:
[dependencies]
vona = { version = "0.1.0", features = [
    "seamless",
    "moshi",
    "transport-local",
    "openai-realtime",
    "model-provisioning",
] }
Full documentation, architecture overview, and example sessions are in the repository: https://github.com/deliberium/vona. The mock_session example in the test harness gives you a working end-to-end trace in under a minute.
Why I Open-Sourced Vona Now
Voice AI is moving faster than any single team can track. New models and protocols appear monthly. Rather than keep my runtime private while I finish the larger Deliberium system, I chose to release Vona early. Researchers, indie hackers, and enterprise teams alike can now avoid repeating the same painful plumbing work I did. In return, I hope to see contributions that expand adapter coverage, improve transport options, or push the test harness even further.
Vona is MIT-licensed and actively maintained. Contributions are genuinely welcome; the CONTRIBUTING.md file sets out clear expectations around keeping the core contracts provider-neutral and ensuring the release gate stays green.
Looking Ahead
Alan Kay once observed that “the best way to predict the future is to invent it.” Voice-native products are that future for human-computer interaction, and they will only succeed if the foundational runtime is as reliable as the models themselves. Vona is my contribution to that foundation.
If you are building the next wave of voice experiences, I invite you to try Vona today. Clone the repository, run the release gate, and see how quickly your voice stack becomes something you can trust rather than something you have to babysit.
I look forward to seeing what you build.