Speech Core

Voice-agent pipeline engine in C++17 — on-device speech for Linux, Windows, and Android (plus Apple via a Swift sibling).

On-device voice activity detection, speech-to-text (batch and real-time streaming), speaker diarization, and text-to-speech. Runs locally on CPU — no cloud, no Python at inference, no data leaves the machine.

📖 Docs · 🤗 Models · 🍎 Apple (Swift) · 💬 Discord

Demo

Voice cloning with VoxCPM2 — watch the speech-studio demo on YouTube

speech-core is a small orchestration core (state machine, turn detection, interruption handling, audio utilities — zero ML deps) plus a set of abstract interfaces. Model inference is opt-in through two interchangeable backends you can enable independently:

ONNX Runtime (SPEECH_CORE_WITH_ONNX) — Silero VAD, Parakeet STT, Nemotron-3.5 multilingual streaming STT, Kokoro TTS, DeepFilterNet3.
LiteRT (SPEECH_CORE_WITH_LITERT) — Silero VAD, Parakeet STT, Nemotron streaming STT, Nemotron-3.5 multilingual streaming STT, Omnilingual STT, Pyannote diarization, WeSpeaker embeddings, VoxCPM2 TTS. Backed by Google's ai-edge-litert (libLiteRt).

Consumers can enable either, both, or neither — or bring their own implementations of the interfaces (CPU, GPU, CoreML/MLX, a remote API).
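
The bring-your-own-backend idea can be sketched roughly as below. This is an illustration only: `SttInterface`, `SttResult`, `FixedTextStt`, and `run_transcription` are hypothetical names placed in a `sketch` namespace, and the real interface definitions (see docs/interfaces.md) may look different.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

namespace sketch {

// Hypothetical result type. The field names echo r.text / r.language /
// r.confidence from the Quick start, but the real struct may differ.
struct SttResult {
    std::string text;
    std::string language;
    float confidence = 0.0f;
};

// Hypothetical abstract interface: orchestration code depends only on this.
class SttInterface {
public:
    virtual ~SttInterface() = default;
    virtual SttResult transcribe(const float* audio, std::size_t n_samples,
                                 int sample_rate) = 0;
};

// A custom backend. A real one might wrap a GPU runtime or a remote API;
// this one returns a fixed string so the example stays self-contained.
class FixedTextStt final : public SttInterface {
public:
    explicit FixedTextStt(std::string text) : text_(std::move(text)) {}

    SttResult transcribe(const float* audio, std::size_t n_samples,
                         int sample_rate) override {
        assert(sample_rate > 0);
        (void)audio;  // unused in this stub
        if (n_samples == 0) return {"", "en", 0.0f};  // nothing to transcribe
        return {text_, "en", 1.0f};
    }

private:
    std::string text_;
};

// Code written against the interface works with any backend.
inline std::string run_transcription(SttInterface& stt, const float* audio,
                                     std::size_t n_samples, int sample_rate) {
    return stt.transcribe(audio, n_samples, sample_rate).text;
}

}  // namespace sketch
```

Because `run_transcription` only sees the abstract type, swapping `FixedTextStt` for another implementation needs no change at the call site.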

Supported models

Model	Task	ONNX	LiteRT
Silero VAD v5	Voice activity detection	✓	✓
Parakeet TDT v3 (0.6B)	Speech-to-text	✓	✓
Nemotron Speech Streaming (0.6B)	Streaming speech-to-text	✓	✓
Nemotron-3.5 ASR Streaming Multilingual (0.6B)	Streaming speech-to-text (multilingual, prompt-conditioned)	✓	✓
Omnilingual ASR CTC (300M)	Speech-to-text (multilingual)	—	✓
Pyannote Segmentation 3.0	Diarization (segmentation)	—	✓
WeSpeaker ResNet34-LM	Speaker embedding	—	✓
VoxCPM2 (2B)	Text-to-speech (48 kHz, voice cloning)	—	✓
Kokoro 82M	Text-to-speech	✓	—
DeepFilterNet3	Speech enhancement	✓	—

Diarization (DiarizationPipeline) is pure C++ and composes a segmenter + embedder into speaker-labelled segments — no ML-runtime dependency of its own.

Platforms & backends

Backend	Static lib	Runtime dep	Platforms	Setup
ONNX	`speech_core_models`	`onnxruntime`	Linux, macOS, Windows, Android	`ORT_DIR` from an ONNX Runtime release
LiteRT	`speech_core_models_litert`	`libLiteRt`	Linux x86_64, Windows x86_64, Android, macOS arm64	`scripts/fetch_litert.sh` (extracts from the `ai-edge-litert` PyPI wheel)

Hardware acceleration. ONNX: NNAPI on Android, QNN on Qualcomm Linux (drop libQnnHtp.so on the lib path), and optional NVIDIA CUDA / TensorRT via -DSPEECH_CORE_WITH_CUDA=ON. Acceleration is gated at runtime by SPEECH_CORE_ORT_PROVIDER, with a silent fallback to CPU. LiteRT runs on CPU only today; Hexagon / GPU delegates exist in libLiteRt but aren't wired through the C API yet.

Quick start

Build the core + the LiteRT backend (the runtime library is extracted from the ai-edge-litert wheel — no TensorFlow build):

git clone https://github.com/soniqo/speech-core && cd speech-core
scripts/fetch_litert.sh build/litert          # PYTHON=python3.11 if 'python3' is older
cmake -B build -DCMAKE_BUILD_TYPE=Release \
    -DSPEECH_CORE_WITH_LITERT=ON -DLITERT_DIR=$PWD/build/litert
cmake --build build

Link the targets you need:

target_link_libraries(my_app PRIVATE speech_core)                          # orchestration only
target_link_libraries(my_app PRIVATE speech_core speech_core_models)        # + ONNX models
target_link_libraries(my_app PRIVATE speech_core speech_core_models_litert) # + LiteRT models

Transcribe an audio buffer:

#include <speech_core/models/litert_parakeet_stt.h>

speech_core::LiteRTParakeetStt stt(
    "parakeet-encoder.tflite", "parakeet-decoder-joint.tflite", "vocab.json");

auto r = stt.transcribe(audio, n_samples, 16000);   // r.text / r.language / r.confidence

Real-time streaming with partials (CPU, real-time factor around 1.0):

#include <iostream>
#include <speech_core/models/litert_nemotron_streaming_stt.h>

speech_core::LiteRTNemotronStreamingStt stt(
    "nemotron-streaming-encoder.tflite",
    "nemotron-streaming-decoder.tflite",
    "nemotron-streaming-joint.tflite", "vocab.json");

stt.begin_stream(16000);
for (const auto& chunk : mic_chunks) {              // feed ~80 ms windows as they arrive
    auto partial = stt.push_chunk(chunk.data(), chunk.size());
    if (!partial.text.empty()) std::cout << partial.text << std::flush;
}
auto final_result = stt.end_stream();               // result after the last chunk
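
The streaming example leaves `mic_chunks` undefined. One way to produce such windows from a recorded buffer is sketched below. The helper names are hypothetical, and the 1280-sample figure assumes 16 kHz input with 80 ms windows.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch {

// Number of samples in a window of `window_ms` milliseconds.
// At 16 kHz, 80 ms is 16000 * 80 / 1000 = 1280 samples.
inline std::size_t samples_per_window(int sample_rate, int window_ms) {
    assert(sample_rate > 0 && window_ms > 0);
    return static_cast<std::size_t>(sample_rate) *
           static_cast<std::size_t>(window_ms) / 1000;
}

// Splits a buffer into consecutive windows; the last one may be shorter.
inline std::vector<std::vector<float>> split_into_windows(
        const std::vector<float>& audio, std::size_t window) {
    assert(window > 0);
    std::vector<std::vector<float>> chunks;
    for (std::size_t pos = 0; pos < audio.size(); pos += window) {
        const std::size_t end = std::min(pos + window, audio.size());
        chunks.emplace_back(audio.begin() + static_cast<std::ptrdiff_t>(pos),
                            audio.begin() + static_cast<std::ptrdiff_t>(end));
    }
    return chunks;
}

}  // namespace sketch
```

With a live microphone the windows would arrive from the audio callback instead. Splitting a file this way is mainly useful for replaying recordings through the streaming path.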

Full voice-agent pipeline (VAD → STT → LLM → TTS):

#include <speech_core/pipeline/voice_pipeline.h>

speech_core::AgentConfig cfg;
cfg.mode = speech_core::AgentConfig::Mode::Pipeline;   // or ::TranscribeOnly / ::Echo

speech_core::VoicePipeline pipeline(
    stt, tts, &llm, vad, cfg,
    [](const speech_core::PipelineEvent& ev) { /* transcripts, audio out, errors */ });

pipeline.start();
pipeline.push_audio(mic_samples, count);               // call from your audio thread

VoicePipeline is the real-time voice-agent state machine — VAD-driven turn detection, interruption handling, eager STT, conversation tracking, tool calling. It owns no audio I/O or network: the platform feeds audio in and receives events via the callback. Pass Mode::TranscribeOnly (and llm = nullptr) for a pure transcription pipeline.
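
To make the state-machine idea concrete, here is a minimal sketch of turn handling with interruption (barge-in). The states, events, and transitions are assumptions chosen for illustration. They are not taken from VoicePipeline, whose actual machine is described in the pipeline / state machine reference.

```cpp
#include <cassert>
#include <cstddef>

namespace sketch {

// Hypothetical states and events for a voice agent turn loop.
enum class State { Idle, Listening, Thinking, Speaking };
enum class Event { SpeechStarted, SpeechEnded, ResponseReady, PlaybackFinished };

// Pure transition function: returns the next state for (state, event).
// Combinations not listed leave the state unchanged.
inline State next_state(State s, Event e) {
    switch (s) {
        case State::Idle:
            if (e == Event::SpeechStarted) return State::Listening;
            break;
        case State::Listening:
            if (e == Event::SpeechEnded) return State::Thinking;
            break;
        case State::Thinking:
            if (e == Event::ResponseReady) return State::Speaking;
            if (e == Event::SpeechStarted) return State::Listening;  // user resumed
            break;
        case State::Speaking:
            if (e == Event::SpeechStarted) return State::Listening;  // barge-in
            if (e == Event::PlaybackFinished) return State::Idle;
            break;
    }
    return s;
}

// Applies a sequence of events in order and returns the final state.
inline State apply_all(State s, const Event* events, std::size_t n) {
    assert(events != nullptr || n == 0);
    for (std::size_t i = 0; i < n; ++i) s = next_state(s, events[i]);
    return s;
}

}  // namespace sketch
```

Keeping the transition function free of I/O mirrors the design stated above: the platform owns audio and networking, and the machine only reacts to events.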

Code examples

Voice activity detection

#include <speech_core/models/litert_silero_vad.h>
speech_core::LiteRTSileroVad vad("silero-vad.tflite");
float p = vad.process_chunk(samples_512, 512);   // speech probability in [0, 1]

Feed the probability stream to StreamingVAD (speech_core/vad/streaming_vad.h) for hysteresis-gated SpeechStarted / SpeechEnded events.
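
For readers unfamiliar with hysteresis gating, the sketch below shows the general technique on a stream of speech probabilities. It is not the StreamingVAD implementation: the class name, thresholds, and hangover count are all assumptions picked for the example.

```cpp
#include <cassert>
#include <optional>

namespace sketch {

enum class VadEvent { SpeechStarted, SpeechEnded };

// Hysteresis gate: speech starts when the probability reaches `on_threshold`
// and ends only after it stays below `off_threshold` for `hangover_chunks`
// consecutive chunks. All default values here are illustrative.
class HysteresisGate {
public:
    explicit HysteresisGate(float on_threshold = 0.5f, float off_threshold = 0.35f,
                            int hangover_chunks = 3)
        : on_(on_threshold), off_(off_threshold), hangover_(hangover_chunks) {
        assert(off_ <= on_ && hangover_ > 0);
    }

    // Feeds one probability; returns an event only when the state changes.
    std::optional<VadEvent> push(float p) {
        if (!speaking_) {
            if (p >= on_) {
                speaking_ = true;
                quiet_run_ = 0;
                return VadEvent::SpeechStarted;
            }
            return std::nullopt;
        }
        if (p < off_) {
            if (++quiet_run_ >= hangover_) {
                speaking_ = false;
                quiet_run_ = 0;
                return VadEvent::SpeechEnded;
            }
        } else {
            quiet_run_ = 0;  // at or above the off threshold: reset the quiet run
        }
        return std::nullopt;
    }

    bool speaking() const { return speaking_; }

private:
    float on_;
    float off_;
    int hangover_;
    bool speaking_ = false;
    int quiet_run_ = 0;
};

}  // namespace sketch
```

The gap between the two thresholds keeps a probability hovering near one value from toggling the state on every chunk, and the hangover avoids cutting a turn during short pauses.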

Speaker diarization

#include <cstdio>
#include <speech_core/models/litert_pyannote_segmentation.h>
#include <speech_core/models/litert_wespeaker_embedding.h>
#include <speech_core/diarization/diarization_pipeline.h>

speech_core::LiteRTPyannoteSegmentation seg("pyannote-segmentation.tflite");
speech_core::LiteRTWeSpeakerEmbedding   emb("wespeaker-resnet34.tflite");
speech_core::DiarizationPipeline        diar(seg, emb);

auto segments = diar.diarize(audio, n_samples, 16000, speech_core::DiarizerConfig{});
for (const auto& s : segments)
    printf("speaker %d: %.2fs - %.2fs\n", s.speaker, s.start, s.end);
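
How a segmenter and an embedder can combine into speaker labels is sketched below with a simple greedy cosine-similarity assignment. This is a stand-in technique for illustration, not the clustering DiarizationPipeline uses. The function names and the 0.7 threshold are assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

namespace sketch {

using Embedding = std::vector<float>;

// Cosine similarity of two equal-length vectors; 0 if either has zero norm.
inline float cosine_similarity(const Embedding& a, const Embedding& b) {
    assert(a.size() == b.size() && !a.empty());
    float dot = 0.0f, norm_a = 0.0f, norm_b = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        norm_a += a[i] * a[i];
        norm_b += b[i] * b[i];
    }
    if (norm_a == 0.0f || norm_b == 0.0f) return 0.0f;
    return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
}

// Greedy assignment: each segment embedding joins the first known speaker
// whose reference embedding is similar enough, otherwise it opens a new one.
inline std::vector<int> assign_speakers(const std::vector<Embedding>& segments,
                                        float threshold = 0.7f) {
    std::vector<Embedding> references;  // first embedding seen per speaker
    std::vector<int> labels;
    labels.reserve(segments.size());
    for (const auto& e : segments) {
        int label = -1;
        for (std::size_t k = 0; k < references.size(); ++k) {
            if (cosine_similarity(e, references[k]) >= threshold) {
                label = static_cast<int>(k);
                break;
            }
        }
        if (label < 0) {
            references.push_back(e);
            label = static_cast<int>(references.size()) - 1;
        }
        labels.push_back(label);
    }
    return labels;
}

}  // namespace sketch
```

In this picture the segmenter supplies the time spans, the embedder supplies one vector per span, and the label from `assign_speakers` plays the role of `s.speaker` in the loop above.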

Text-to-speech

#include <speech_core/models/litert_voxcpm2_tts.h>
speech_core::LiteRTVoxCPM2Tts tts(
    "voxcpm2-text-prefill.tflite", "voxcpm2-token-step.tflite",
    "voxcpm2-audio-encoder.tflite", "voxcpm2-audio-decoder.tflite", "tokenizer.json");

tts.synthesize("Hello world", "en", [](const float* samples, size_t len, bool is_final) {
    // 48 kHz Float32 PCM, streamed in chunks
});
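
The callback body above is left empty. A minimal sketch of one thing it might do is to accumulate the streamed chunks and convert them to 16-bit PCM for playback or file output. `PcmCollector` is a hypothetical helper, and the callback shape is copied from the example rather than from the library headers.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

namespace sketch {

// Accumulates streamed float PCM chunks and stores them as 16-bit PCM.
class PcmCollector {
public:
    explicit PcmCollector(int sample_rate) : sample_rate_(sample_rate) {
        assert(sample_rate_ > 0);
    }

    // Same parameter shape as the synthesize callback in the example.
    void on_chunk(const float* samples, std::size_t len, bool is_final) {
        assert(samples != nullptr || len == 0);
        for (std::size_t i = 0; i < len; ++i) {
            // Clamp to [-1, 1] before scaling so out-of-range samples do not wrap.
            const float clamped = std::clamp(samples[i], -1.0f, 1.0f);
            pcm16_.push_back(
                static_cast<std::int16_t>(std::lround(clamped * 32767.0f)));
        }
        if (is_final) finished_ = true;
    }

    bool finished() const { return finished_; }
    const std::vector<std::int16_t>& pcm16() const { return pcm16_; }
    double duration_seconds() const {
        return static_cast<double>(pcm16_.size()) / sample_rate_;
    }

private:
    int sample_rate_;
    bool finished_ = false;
    std::vector<std::int16_t> pcm16_;
};

}  // namespace sketch
```

Under these assumptions, a `sketch::PcmCollector collector(48000);` could be captured by reference in the lambda, which would then forward its three arguments to `collector.on_chunk`.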

Each interface and model is documented in docs/interfaces.md and docs/models.md (download URLs, sizes, preprocessing).

Architecture

┌──────────────────────────────────────────────┐
│          speech_core (always built)          │
│                                              │
│  VoicePipeline / TurnDetector / SpeechQueue  │  orchestration
│  StreamingVAD / AudioBuffer / Resampler      │
│  DiarizationPipeline                         │
│                                              │
│  STT / TTS / VAD / Enhancer / AEC / LLM      │  abstract interfaces
│  Segmentation / Embedding / Diarizer         │
└──────────────────────────────────────────────┘
              ▲                       ▲
              │ implements (optional) │
┌─────────────┴──────────┐  ┌─────────┴─────────────────┐
│ speech_core_models     │  │ speech_core_models_litert │
│ (SPEECH_CORE_WITH_ONNX)│  │ (SPEECH_CORE_WITH_LITERT) │
│  ONNX Runtime          │  │  libLiteRt                │
└────────────────────────┘  └───────────────────────────┘

The orchestration core depends only on the interfaces — never on a concrete model — so a backend swap is a link-time choice, not a rewrite. Design principles: pure C++17 core, no platform APIs in the core, no network I/O, no audio I/O (operates on float buffers), callback-driven.

Reference: interfaces · models · pipeline / state machine · C API (FFI) · tool calling

Build

# Orchestration only (no ML deps)
cmake -B build -DCMAKE_BUILD_TYPE=Release && cmake --build build

# + ONNX backend
cmake -B build -DSPEECH_CORE_WITH_ONNX=ON -DORT_DIR=/path/to/onnxruntime && cmake --build build

# + ONNX with NVIDIA CUDA / TensorRT (ORT_DIR must be a GPU-enabled ONNX Runtime)
cmake -B build -DSPEECH_CORE_WITH_ONNX=ON -DSPEECH_CORE_WITH_CUDA=ON -DORT_DIR=/path/to/onnxruntime-gpu && cmake --build build

# + LiteRT backend
scripts/fetch_litert.sh build/litert
cmake -B build -DSPEECH_CORE_WITH_LITERT=ON -DLITERT_DIR=$PWD/build/litert && cmake --build build

LiteRT headers are vendored in third_party/litert/ (no setup). LITERT_DIR points at the directory holding libLiteRt.{so,dylib,dll} (Windows also needs LiteRt.lib). Add -DSPEECH_CORE_BUILD_EXAMPLES=ON for the Linux CLI demos (speech_transcribe, speech_synthesize, …) — see examples/linux. A voice-cloning CLI (speech_voxcpm2_clone) is built automatically whenever SPEECH_CORE_WITH_LITERT=ON — see examples/litert.

On-device model download (optional). Add -DSPEECH_CORE_WITH_HF_DOWNLOAD=ON to fetch model bundles from Hugging Face on first use instead of provisioning them by hand. It links libcurl (find_package(CURL) — system libcurl on Linux/macOS, vcpkg on Windows) and adds sc_voxcpm2_create_from_pretrained("soniqo/VoxCPM2-LiteRT", …) to the VoxCPM2 C ABI: a resumable, retrying download (HTTP Range, atomic rename) that tolerates network interruptions and caches under the OS cache dir (SPEECH_CORE_CACHE_DIR to override; HF_ENDPOINT for a mirror). Off by default so embedded/offline builds carry no HTTP/TLS dependency. The hf_fetch debug CLI exercises it directly.
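
The resume-and-rename pattern mentioned above can be sketched without any HTTP code. The helpers below are hypothetical and do not reflect how the library implements its downloader: they only show how a partial file's size maps to a Range header and how a rename publishes the finished file. The `.part` suffix is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>

namespace sketch {
namespace fs = std::filesystem;

// Path of the temporary file that holds a partial download.
inline fs::path partial_path(const fs::path& target) {
    assert(!target.empty());
    fs::path p = target;
    p += ".part";
    return p;
}

// Appends received bytes to the partial file.
inline void append_bytes(const fs::path& target, const std::string& bytes) {
    std::ofstream out(partial_path(target), std::ios::binary | std::ios::app);
    out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
}

// Builds the HTTP Range header that resumes after the bytes already on disk.
// Returns an empty string when there is nothing to resume.
inline std::string resume_range_header(const fs::path& target) {
    std::error_code ec;
    const std::uintmax_t have = fs::file_size(partial_path(target), ec);
    if (ec || have == 0) return "";
    return "Range: bytes=" + std::to_string(have) + "-";
}

// Publishes a finished download by renaming the partial file onto the target.
// On POSIX filesystems the rename replaces the target in a single step.
inline bool commit_download(const fs::path& target) {
    std::error_code ec;
    fs::rename(partial_path(target), target, ec);
    return !ec;
}

}  // namespace sketch
```

Writing to a side file and renaming at the end means a reader of the cache directory sees either the complete model file or no file at all, which is why an interrupted download cannot leave a truncated model in place.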

Testing & CI

cd build && ctest --output-on-failure        # core unit tests (no models needed)

The orchestration + diarization unit tests need no model files. Integration tests load real .tflite / .onnx artifacts and skip cleanly when SPEECH_LITERT_MODEL_DIR / SPEECH_MODEL_DIR are unset:

scripts/fetch_litert.sh build/litert
scripts/download_models_litert.sh            # public soniqo/* models, no token
cmake -B build -DSPEECH_CORE_WITH_LITERT=ON -DLITERT_DIR=$PWD/build/litert && cmake --build build
SPEECH_LITERT_MODEL_DIR=scripts/models-litert ctest --test-dir build --output-on-failure

CI builds + tests across Linux, Windows, and macOS (LiteRT on Linux + Windows, ONNX on Linux), plus an aarch64 cross-compile; a nightly lane runs the model integration tests against the public model files.

Contributing

PRs welcome — model integrations, backends, docs, fixes. Branch off main, build + ctest, open a PR. No marketing copy in commits or PRs.

License

Apache 2.0 — see LICENSE.
