Voice-agent pipeline engine in C++17 — on-device speech for Linux, Windows, and Android (plus Apple via a Swift sibling).
On-device voice activity detection, speech-to-text (batch and real-time streaming), speaker diarization, and text-to-speech. Runs locally on CPU — no cloud, no Python at inference, no data leaves the machine.
📖 Docs · 🤗 Models · 🍎 Apple (Swift) · 💬 Discord
Voice cloning with VoxCPM2 — watch the speech-studio demo on YouTube
speech-core is a small orchestration core (state machine, turn detection, interruption handling, audio utilities — zero ML deps) plus a set of abstract interfaces. Model inference is opt-in through two interchangeable backends you can enable independently:
- ONNX Runtime (SPEECH_CORE_WITH_ONNX): Silero VAD, Parakeet STT, Nemotron-3.5 multilingual streaming STT, Kokoro TTS, DeepFilterNet3.
- LiteRT (SPEECH_CORE_WITH_LITERT): Silero VAD, Parakeet STT, Nemotron streaming STT, Nemotron-3.5 multilingual streaming STT, Omnilingual STT, Pyannote diarization, WeSpeaker embeddings, VoxCPM2 TTS. Backed by Google's ai-edge-litert (libLiteRt).
Consumers can enable either, both, or neither — or bring their own implementations of the interfaces (CPU, GPU, CoreML/MLX, a remote API).
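To illustrate the bring-your-own option, here is a minimal sketch of the pattern: orchestration code that sees only an abstract interface, and a backend that implements it. `SttBackend`, `Transcript`, `FixedTextStt`, and `run_turn` are hypothetical stand-ins invented for this example, not the library's types; the real signatures are in docs/interfaces.md.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical result type, loosely modelled on the r.text / r.confidence
// fields shown in this README. Not the library's actual struct.
struct Transcript {
    std::string text;
    float confidence = 0.0f;
};

// Hypothetical abstract interface: the orchestration side only sees this.
class SttBackend {
public:
    virtual ~SttBackend() = default;
    virtual Transcript transcribe(const float* audio, std::size_t n_samples,
                                  int sample_rate) = 0;
};

// A stand-in backend that returns a fixed string. A real one would wrap a
// CPU/GPU runtime or a remote API behind the same virtual call.
class FixedTextStt final : public SttBackend {
public:
    explicit FixedTextStt(std::string text) : text_(std::move(text)) {}

    Transcript transcribe(const float* audio, std::size_t n_samples,
                          int sample_rate) override {
        assert(audio != nullptr && sample_rate > 0);
        // Empty buffer: empty text with zero confidence.
        if (n_samples == 0) return Transcript{};
        return Transcript{text_, 1.0f};
    }

private:
    std::string text_;
};

// Orchestration code depends on the interface, never on a concrete backend.
std::string run_turn(SttBackend& stt, const float* audio, std::size_t n) {
    return stt.transcribe(audio, n, 16000).text;
}
```

Swapping `FixedTextStt` for another subclass leaves `run_turn` untouched, which is the property the interface layer is there to provide.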
| Model | Task | ONNX | LiteRT |
|---|---|---|---|
| Silero VAD v5 | Voice activity detection | ✓ | ✓ |
| Parakeet TDT v3 (0.6B) | Speech-to-text | ✓ | ✓ |
| Nemotron Speech Streaming (0.6B) | Streaming speech-to-text | ✓ | ✓ |
| Nemotron-3.5 ASR Streaming Multilingual (0.6B) | Streaming speech-to-text (multilingual, prompt-conditioned) | ✓ | ✓ |
| Omnilingual ASR CTC (300M) | Speech-to-text (multilingual) | — | ✓ |
| Pyannote Segmentation 3.0 | Diarization (segmentation) | — | ✓ |
| WeSpeaker ResNet34-LM | Speaker embedding | — | ✓ |
| VoxCPM2 (2B) | Text-to-speech (48 kHz, voice cloning) | — | ✓ |
| Kokoro 82M | Text-to-speech | ✓ | — |
| DeepFilterNet3 | Speech enhancement | ✓ | — |
Diarization (DiarizationPipeline) is pure C++ and composes a segmenter + embedder into speaker-labelled segments — no ML-runtime dependency of its own.
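How segment embeddings become speaker labels is not spelled out here. One common approach, shown purely as a sketch under that assumption, is greedy clustering by cosine similarity: each embedding joins the closest existing speaker if it is similar enough, otherwise it starts a new speaker. `assign_speakers` and the 0.7 threshold are illustrative and do not describe DiarizationPipeline's actual algorithm or defaults.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Embedding = std::vector<float>;

// Cosine similarity between two equal-length, non-zero vectors.
float cosine(const Embedding& a, const Embedding& b) {
    assert(a.size() == b.size() && !a.empty());
    float dot = 0.0f, na = 0.0f, nb = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    return dot / (std::sqrt(na) * std::sqrt(nb));
}

// Greedy online clustering: returns one speaker index per input embedding.
std::vector<int> assign_speakers(const std::vector<Embedding>& embs,
                                 float threshold = 0.7f) {
    // Running sum per speaker; cosine ignores scale, so a sum acts as a centroid.
    std::vector<Embedding> centroids;
    std::vector<int> labels;
    for (const auto& e : embs) {
        int best = -1;
        float best_sim = threshold;
        for (std::size_t c = 0; c < centroids.size(); ++c) {
            const float s = cosine(e, centroids[c]);
            if (s >= best_sim) { best_sim = s; best = static_cast<int>(c); }
        }
        if (best < 0) {
            // Nothing close enough: open a new speaker.
            centroids.push_back(e);
            best = static_cast<int>(centroids.size()) - 1;
        } else {
            // Fold the embedding into the matched speaker's running sum.
            for (std::size_t i = 0; i < e.size(); ++i) centroids[best][i] += e[i];
        }
        labels.push_back(best);
    }
    return labels;
}
```

Because the function takes plain vectors, it has no ML-runtime dependency, in the same spirit as the pure C++ composition described above.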
| Backend | Static lib | Runtime dep | Platforms | Setup |
|---|---|---|---|---|
| ONNX | speech_core_models | onnxruntime | Linux, macOS, Windows, Android | ORT_DIR from an ONNX Runtime release |
| LiteRT | speech_core_models_litert | libLiteRt | Linux x86_64, Windows x86_64, Android, macOS arm64 | scripts/fetch_litert.sh (extracts from the ai-edge-litert PyPI wheel) |
Hardware acceleration. ONNX: NNAPI on Android, QNN on Qualcomm Linux (drop libQnnHtp.so on the lib path), optional NVIDIA CUDA / TensorRT via -DSPEECH_CORE_WITH_CUDA=ON — runtime-gated by SPEECH_CORE_ORT_PROVIDER with silent CPU fallback. LiteRT runs CPU only today; Hexagon / GPU delegates exist in libLiteRt but aren't wired through the C API yet.
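The "runtime-gated with silent CPU fallback" behaviour can be pictured roughly as follows. This is a sketch, not the library's code: the accepted values of SPEECH_CORE_ORT_PROVIDER are not listed here, so the short names "cuda" and "tensorrt" are assumed spellings, and `resolve_provider` / `provider_from_env` are hypothetical helpers. The provider strings are the names ONNX Runtime itself reports (for example from `Ort::GetAvailableProviders()`).

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstdlib>
#include <string>
#include <vector>

// Resolve which execution provider to use.
// `requested` is the raw environment value (may be null).
// `available` is the list the runtime reports; this sketch expects it to
// contain the CPU provider.
std::string resolve_provider(const char* requested,
                             const std::vector<std::string>& available) {
    const std::string cpu = "CPUExecutionProvider";
    assert(std::find(available.begin(), available.end(), cpu) != available.end());
    if (requested == nullptr) return cpu;

    std::string want(requested);
    std::transform(want.begin(), want.end(), want.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });

    std::string full;
    if (want == "cuda") full = "CUDAExecutionProvider";            // assumed spelling
    else if (want == "tensorrt") full = "TensorrtExecutionProvider"; // assumed spelling
    else return cpu;  // unknown value: stay on CPU

    // Silent fallback: pick the accelerator only if the runtime actually has it.
    const bool present =
        std::find(available.begin(), available.end(), full) != available.end();
    return present ? full : cpu;
}

// Convenience wrapper that reads the environment variable named above.
std::string provider_from_env(const std::vector<std::string>& available) {
    return resolve_provider(std::getenv("SPEECH_CORE_ORT_PROVIDER"), available);
}
```

The upshot of a silent fallback is that a binary built with CUDA support still starts on a machine without a GPU runtime; it is worth logging the resolved provider in your own application if you need to know which path was taken.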
Build the core + the LiteRT backend (the runtime library is extracted from the ai-edge-litert wheel — no TensorFlow build):
git clone https://github.com/soniqo/speech-core && cd speech-core
scripts/fetch_litert.sh build/litert # PYTHON=python3.11 if 'python3' is older
cmake -B build -DCMAKE_BUILD_TYPE=Release \
-DSPEECH_CORE_WITH_LITERT=ON -DLITERT_DIR=$PWD/build/litert
cmake --build build

Link the targets you need:
target_link_libraries(my_app PRIVATE speech_core) # orchestration only
target_link_libraries(my_app PRIVATE speech_core speech_core_models) # + ONNX models
target_link_libraries(my_app PRIVATE speech_core speech_core_models_litert) # + LiteRT models

Transcribe an audio buffer:
#include <speech_core/models/litert_parakeet_stt.h>
speech_core::LiteRTParakeetStt stt(
"parakeet-encoder.tflite", "parakeet-decoder-joint.tflite", "vocab.json");
auto r = stt.transcribe(audio, n_samples, 16000); // r.text / r.language / r.confidence

Real-time streaming with partials (CPU, ~RTF 1.0):
#include <speech_core/models/litert_nemotron_streaming_stt.h>
speech_core::LiteRTNemotronStreamingStt stt(
"nemotron-streaming-encoder.tflite",
"nemotron-streaming-decoder.tflite",
"nemotron-streaming-joint.tflite", "vocab.json");
stt.begin_stream(16000);
for (const auto& chunk : mic_chunks) { // feed ~80 ms windows as they arrive
auto partial = stt.push_chunk(chunk.data(), chunk.size());
if (!partial.text.empty()) std::cout << partial.text << std::flush;
}
auto final = stt.end_stream();

Full voice-agent pipeline (VAD → STT → LLM → TTS):
#include <speech_core/pipeline/voice_pipeline.h>
speech_core::AgentConfig cfg;
cfg.mode = speech_core::AgentConfig::Mode::Pipeline; // or ::TranscribeOnly / ::Echo
speech_core::VoicePipeline pipeline(
stt, tts, &llm, vad, cfg,
[](const speech_core::PipelineEvent& ev) { /* transcripts, audio out, errors */ });
pipeline.start();
pipeline.push_audio(mic_samples, count); // call from your audio thread

VoicePipeline is the real-time voice-agent state machine: VAD-driven turn detection, interruption handling, eager STT, conversation tracking, and tool calling. It owns no audio I/O or network: the platform feeds audio in and receives events via the callback. Pass Mode::TranscribeOnly (and llm = nullptr) for a pure transcription pipeline.
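Since the pipeline owns no audio I/O, the platform side is responsible for getting device buffers into whatever shape a consumer expects. As one illustration, the sketch below re-chunks variable-size capture buffers into fixed frames, such as the 512-sample windows used in the Silero VAD example in this README. `FrameChunker` is a hypothetical helper, not part of speech-core, and nothing here implies that push_audio itself requires fixed-size input.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Collects arbitrarily sized capture buffers and emits fixed-size frames.
class FrameChunker {
public:
    using FrameFn = std::function<void(const float* frame, std::size_t n)>;

    FrameChunker(std::size_t frame_size, FrameFn on_frame)
        : frame_size_(frame_size), on_frame_(std::move(on_frame)) {
        assert(frame_size_ > 0 && on_frame_);
        pending_.reserve(frame_size_);
    }

    // Call from the capture callback with whatever the device delivered.
    void push(const float* samples, std::size_t count) {
        for (std::size_t i = 0; i < count; ++i) {
            pending_.push_back(samples[i]);
            if (pending_.size() == frame_size_) {
                on_frame_(pending_.data(), frame_size_);
                pending_.clear();  // keeps capacity, so no reallocation
            }
        }
    }

    // Samples waiting for the next full frame.
    std::size_t buffered() const { return pending_.size(); }

private:
    std::size_t frame_size_;
    FrameFn on_frame_;
    std::vector<float> pending_;
};
```

For example, five 480-sample device buffers (2400 samples) yield four 512-sample frames with 352 samples left buffered for the next callback.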
#include <speech_core/models/litert_silero_vad.h>
speech_core::LiteRTSileroVad vad("silero-vad.tflite");
float p = vad.process_chunk(samples_512, 512); // speech probability in [0, 1]

Feed the probability stream to StreamingVAD (speech_core/vad/streaming_vad.h) for hysteresis-gated SpeechStarted / SpeechEnded events.
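For readers unfamiliar with hysteresis gating, the idea can be sketched as a two-threshold state machine with minimum run lengths: speech starts only after several consecutive chunks at or above a high threshold, and ends only after several consecutive chunks below a lower one. `HysteresisGate`, its parameters, and the values in the usage note are assumptions for illustration; StreamingVAD's real configuration is in its header.

```cpp
#include <cassert>
#include <optional>

enum class VadEvent { SpeechStarted, SpeechEnded };

// Two-threshold gate with minimum run lengths, fed one probability per chunk.
class HysteresisGate {
public:
    HysteresisGate(float on_thresh, float off_thresh, int min_on, int min_off)
        : on_(on_thresh), off_(off_thresh), min_on_(min_on), min_off_(min_off) {
        assert(off_ <= on_ && min_on_ > 0 && min_off_ > 0);
    }

    // Returns an event only on a state transition.
    std::optional<VadEvent> push(float p) {
        if (!speaking_) {
            // Count consecutive chunks at or above the "on" threshold.
            run_ = (p >= on_) ? run_ + 1 : 0;
            if (run_ >= min_on_) {
                speaking_ = true;
                run_ = 0;
                return VadEvent::SpeechStarted;
            }
        } else {
            // Count consecutive chunks below the lower "off" threshold.
            run_ = (p < off_) ? run_ + 1 : 0;
            if (run_ >= min_off_) {
                speaking_ = false;
                run_ = 0;
                return VadEvent::SpeechEnded;
            }
        }
        return std::nullopt;
    }

    bool speaking() const { return speaking_; }

private:
    float on_, off_;
    int min_on_, min_off_;
    int run_ = 0;
    bool speaking_ = false;
};
```

With `HysteresisGate gate(0.5f, 0.35f, 2, 3)`, a probability of 0.4 during speech does not end the turn, because it sits between the two thresholds; that gap is what keeps brief dips from chopping an utterance into pieces.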
#include <speech_core/models/litert_pyannote_segmentation.h>
#include <speech_core/models/litert_wespeaker_embedding.h>
#include <speech_core/diarization/diarization_pipeline.h>
speech_core::LiteRTPyannoteSegmentation seg("pyannote-segmentation.tflite");
speech_core::LiteRTWeSpeakerEmbedding emb("wespeaker-resnet34.tflite");
speech_core::DiarizationPipeline diar(seg, emb);
auto segments = diar.diarize(audio, n_samples, 16000, speech_core::DiarizerConfig{});
for (const auto& s : segments)
    printf("speaker %d: %.2fs - %.2fs\n", s.speaker, s.start, s.end);

#include <speech_core/models/litert_voxcpm2_tts.h>
speech_core::LiteRTVoxCPM2Tts tts(
"voxcpm2-text-prefill.tflite", "voxcpm2-token-step.tflite",
"voxcpm2-audio-encoder.tflite", "voxcpm2-audio-decoder.tflite", "tokenizer.json");
tts.synthesize("Hello world", "en", [](const float* samples, size_t len, bool is_final) {
    // 48 kHz Float32 PCM, streamed in chunks
});

Each interface and model is documented in docs/interfaces.md and docs/models.md (download URLs, sizes, preprocessing).
┌────────────────────────────────────────────────────────┐
│  speech_core (always built)                            │
│                                                        │
│  VoicePipeline / TurnDetector / SpeechQueue            │  orchestration
│  StreamingVAD / AudioBuffer / Resampler                │
│  DiarizationPipeline                                   │
│                                                        │
│  STT / TTS / VAD / Enhancer / AEC / LLM                │  abstract interfaces
│  Segmentation / Embedding / Diarizer                   │
└────────────────────────────────────────────────────────┘
             ▲                             ▲
             │    implements (optional)    │
┌────────────┴────────────┐  ┌─────────────┴─────────────┐
│ speech_core_models      │  │ speech_core_models_litert │
│ (SPEECH_CORE_WITH_ONNX) │  │ (SPEECH_CORE_WITH_LITERT) │
│ ONNX Runtime            │  │ libLiteRt                 │
└─────────────────────────┘  └───────────────────────────┘
The orchestration core depends only on the interfaces — never on a concrete model — so a backend swap is a link-time choice, not a rewrite. Design principles: pure C++17 core, no platform APIs in the core, no network I/O, no audio I/O (operates on float buffers), callback-driven.
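Because the core operates on float buffers and leaves audio I/O to the platform, capture code that receives 16-bit PCM has to convert before handing samples over. A minimal sketch of that conversion follows; `pcm16_to_mono_float` is a hypothetical helper, and the normalization to [-1, 1) plus channel averaging are assumptions about what a consumer would want, not documented requirements.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Convert interleaved 16-bit PCM to mono float in [-1, 1) by averaging channels.
std::vector<float> pcm16_to_mono_float(const std::int16_t* pcm,
                                       std::size_t frames, int channels) {
    assert(pcm != nullptr && channels > 0);
    std::vector<float> out(frames);
    for (std::size_t f = 0; f < frames; ++f) {
        float sum = 0.0f;
        for (int c = 0; c < channels; ++c) {
            // Dividing by 32768 maps the int16 range onto [-1, 1).
            sum += static_cast<float>(pcm[f * channels + c]) / 32768.0f;
        }
        out[f] = sum / static_cast<float>(channels);
    }
    return out;
}
```

The returned vector can be passed on as `data()` / `size()` to any API that takes a float pointer and a sample count.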
Reference: interfaces · models · pipeline / state machine · C API (FFI) · tool calling
# Orchestration only (no ML deps)
cmake -B build -DCMAKE_BUILD_TYPE=Release && cmake --build build
# + ONNX backend
cmake -B build -DSPEECH_CORE_WITH_ONNX=ON -DORT_DIR=/path/to/onnxruntime && cmake --build build
# + ONNX with NVIDIA CUDA / TensorRT (ORT_DIR must be a GPU-enabled ONNX Runtime)
cmake -B build -DSPEECH_CORE_WITH_ONNX=ON -DSPEECH_CORE_WITH_CUDA=ON -DORT_DIR=/path/to/onnxruntime-gpu && cmake --build build
# + LiteRT backend
scripts/fetch_litert.sh build/litert
cmake -B build -DSPEECH_CORE_WITH_LITERT=ON -DLITERT_DIR=$PWD/build/litert && cmake --build build

LiteRT headers are vendored in third_party/litert/ (no setup). LITERT_DIR points at the directory holding libLiteRt.{so,dylib,dll} (Windows also needs LiteRt.lib). Add -DSPEECH_CORE_BUILD_EXAMPLES=ON for the Linux CLI demos (speech_transcribe, speech_synthesize, …); see examples/linux. A voice-cloning CLI (speech_voxcpm2_clone) is built automatically whenever SPEECH_CORE_WITH_LITERT=ON; see examples/litert.
On-device model download (optional). Add -DSPEECH_CORE_WITH_HF_DOWNLOAD=ON to fetch model bundles from Hugging Face on first use instead of provisioning them by hand. It links libcurl (find_package(CURL) — system libcurl on Linux/macOS, vcpkg on Windows) and adds sc_voxcpm2_create_from_pretrained("soniqo/VoxCPM2-LiteRT", …) to the VoxCPM2 C ABI: a resumable, retrying download (HTTP Range, atomic rename) that tolerates network interruptions and caches under the OS cache dir (SPEECH_CORE_CACHE_DIR to override; HF_ENDPOINT for a mirror). Off by default so embedded/offline builds carry no HTTP/TLS dependency. The hf_fetch debug CLI exercises it directly.
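The resume-and-rename pattern mentioned above can be sketched without any network code. This is an illustration of the general technique, not the library's downloader: `fetch_resumable` and the `RangeFetch` callback are hypothetical, and the transport is abstracted so that a real implementation would issue an HTTP request with a `Range: bytes=<offset>-` header where the callback is invoked.

```cpp
#include <cassert>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <functional>

namespace fs = std::filesystem;

// Hypothetical transport: append bytes starting at `offset` to `out`.
// Returns false on a retryable failure (bytes already written are kept).
using RangeFetch = std::function<bool(std::uint64_t offset, std::ofstream& out)>;

// Download into "<dest>.part", resuming from its current size, then rename
// into place so readers never observe a half-written file.
bool fetch_resumable(const fs::path& dest, std::uint64_t total_size,
                     const RangeFetch& fetch, int max_attempts) {
    assert(max_attempts > 0 && fetch);
    if (fs::exists(dest)) return true;  // already cached

    fs::path part = dest;
    part += ".part";

    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        std::uint64_t have = fs::exists(part) ? fs::file_size(part) : 0;
        if (have < total_size) {
            std::ofstream out(part, std::ios::binary | std::ios::app);
            // On failure, `out` is flushed on scope exit and the next attempt
            // resumes from the new file size.
            if (!out || !fetch(have, out)) continue;
            out.close();
            have = fs::file_size(part);
        }
        if (have == total_size) {
            // On POSIX, rename is atomic when both paths share a filesystem.
            fs::rename(part, dest);
            return true;
        }
    }
    return false;
}
```

Keeping the partial file under a separate name means an interrupted run costs only the bytes not yet received, and the final path either holds the complete file or does not exist.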
cd build && ctest --output-on-failure # core unit tests (no models needed)

The orchestration + diarization unit tests need no model files. Integration tests load real .tflite / .onnx artifacts and skip cleanly when SPEECH_LITERT_MODEL_DIR / SPEECH_MODEL_DIR are unset:
scripts/fetch_litert.sh build/litert
scripts/download_models_litert.sh # public soniqo/* models, no token
cmake -B build -DSPEECH_CORE_WITH_LITERT=ON -DLITERT_DIR=$PWD/build/litert && cmake --build build
SPEECH_LITERT_MODEL_DIR=scripts/models-litert ctest --test-dir build --output-on-failure

CI builds and tests across Linux, Windows, and macOS (LiteRT on Linux + Windows, ONNX on Linux), plus an aarch64 cross-compile; a nightly lane runs the model integration tests against the public model files.
PRs welcome — model integrations, backends, docs, fixes. Branch off main, build + ctest, open a PR. No marketing copy in commits or PRs.
Apache 2.0 — see LICENSE.