A lightweight voice assistant that acts as a channel for OpenClaw. No local LLM - just:
- Wake Word → VAD → ASR → sends transcription to OpenClaw
- TTS ← receives speech commands from OpenClaw (any channel)
```
┌──────────────────────────────────────────────────────────────────────────┐
│                        OpenClaw Hybrid Assistant                         │
│                                                                          │
│  ┌────────────────────────────────────────────────────────────────────┐  │
│  │                           INPUT PIPELINE                           │  │
│  │                                                                    │  │
│  │  Microphone → Wake Word → VAD → ASR/STT → WebSocket → OpenClaw     │  │
│  │   (ALSA)     (openWW)  (Silero) (Parakeet) (Channel)               │  │
│  └────────────────────────────────────────────────────────────────────┘  │
│                                                                          │
│  ┌────────────────────────────────────────────────────────────────────┐  │
│  │                          OUTPUT PIPELINE                           │  │
│  │                                                                    │  │
│  │  OpenClaw → WebSocket → TTS/Piper → Speaker                        │  │
│  │  (any channel)          (22050Hz)   (ALSA)                         │  │
│  └────────────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────────────┘
```
```
openclaw-hybrid-assistant/
├── src/
│   ├── main.cpp                  # Entry point, CLI parsing, event loop
│   ├── audio/                    # Audio I/O (ALSA)
│   │   ├── audio_capture.h/cpp   # Microphone input (16kHz, 16-bit PCM, mono)
│   │   ├── audio_playback.h/cpp  # Speaker output (cancellable, multi-rate)
│   │   └── waiting_chime.h/cpp   # Earcon feedback while waiting for response
│   ├── pipeline/                 # Voice processing chain
│   │   ├── voice_pipeline.h/cpp  # Wake Word → VAD → STT → TTS orchestrator
│   │   └── tts_queue.h/cpp       # Producer/consumer streaming TTS playback
│   ├── network/                  # Network communication
│   │   └── openclaw_client.h/cpp # Raw WebSocket client (RFC 6455)
│   └── config/                   # Configuration
│       └── model_config.h        # Model paths, IDs, availability checks
├── tests/
│   ├── test_components.cpp       # Component tests (wake word, VAD, STT)
│   ├── test_integration.cpp      # E2E tests (fake WS server, sanitization, TTS)
│   ├── audio/                    # Generated test WAV files
│   └── scripts/                  # Test audio generation scripts
├── scripts/
│   ├── download-models.sh        # Model download (VAD, ASR, TTS, wake word)
│   ├── openclaw-voice.service    # systemd service unit
│   └── test-on-mac.sh            # Mac testing via Docker/Lima
├── CMakeLists.txt                # Build configuration (3 targets)
├── Dockerfile                    # Docker build + test environment
├── build.sh                      # End-to-end build script
└── README.md
```
| Feature | linux-voice-assistant | openclaw-hybrid-assistant |
|---|---|---|
| Wake Word | ✅ | ✅ |
| VAD | ✅ | ✅ |
| ASR/STT | ✅ Local Whisper | ✅ Parakeet TDT-CTC 110M (NeMo CTC, int8) |
| LLM | ✅ Local or Moltbot | ❌ None - uses OpenClaw |
| TTS | ✅ Local Piper (22kHz) | ✅ Piper Lessac Medium (22050Hz) |
| Integration | HTTP Voice Bridge | WebSocket to OpenClaw |
- Model: openWakeWord "Hey Jarvis"
- Threshold: 0.5 (configurable)
- Frame size: 80ms (1280 samples at 16kHz)
- Model: Silero VAD (ONNX neural network, via sherpa-onnx)
- Much more accurate than energy-based VAD at distinguishing speech from noise
- Silence threshold: 1.5 seconds
- Minimum speech: 0.5 seconds
- Fallback: energy-based VAD if Silero model fails to load
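The fallback path can be sketched as a simple RMS-energy gate with a silence countdown. This is an illustrative sketch, not the project's actual code; the class name, threshold, and frame size are assumptions chosen to match the parameters listed above (16 kHz audio, 1.5 s silence).

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical sketch of an energy-based fallback VAD: a frame counts as
// speech when its RMS energy exceeds a threshold, and the utterance ends
// after ~1.5 s of consecutive silent frames. Names/values are illustrative.
class EnergyVad {
public:
    static constexpr int kFrameSize = 480;   // 30 ms at 16 kHz

    EnergyVad(double rmsThreshold = 500.0,
              int sampleRate = 16000,
              double silenceSec = 1.5)
        : threshold_(rmsThreshold),
          silenceFrames_(0),
          framesToEnd_(static_cast<int>(silenceSec * sampleRate / kFrameSize)) {}

    // Feed one frame of 16-bit PCM; returns true while it looks like speech.
    bool isSpeech(const std::vector<int16_t>& frame) {
        double sumSq = 0.0;
        for (int16_t s : frame) sumSq += static_cast<double>(s) * s;
        double rms = std::sqrt(sumSq / frame.size());
        if (rms >= threshold_) { silenceFrames_ = 0; return true; }
        ++silenceFrames_;
        return false;
    }

    // True once enough consecutive silent frames have accumulated.
    bool utteranceEnded() const { return silenceFrames_ >= framesToEnd_; }

private:
    double threshold_;
    int silenceFrames_;
    int framesToEnd_;
};
```

As the list above notes, this kind of gate is far worse than Silero at separating speech from noise, which is why it is only a fallback.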
- Model: Parakeet TDT-CTC 110M EN (NeMo CTC, int8 quantized)
- Architecture: FastConformer 110M params
- Features: Automatic punctuation + capitalization
- Sample rate: 16kHz mono
- Size: ~126MB (int8 quantized)
- Alternative: Whisper Tiny EN, available via the `--whisper` download flag
- Model: Piper Lessac Medium (VITS)
- Output rate: 22050 Hz
- Voice: Natural American male (Lessac dataset)
- Size: ~61MB (model + espeak-ng-data)
- Alternative: Kokoro TTS v0.19, available via the `--kokoro` download flag (11 speakers, 24kHz, ~330MB)
- Text sanitization: automatically removes emojis, markdown, and special characters before synthesis
- ALSA-based microphone input
- Format: 16kHz, 16-bit PCM, mono (optimal for STT)
- Callback-driven: delivers audio chunks to the voice pipeline
- Device listing and selection support
- ALSA-based speaker output
- Cancellable playback: writes period-sized chunks (~46ms at 22kHz), checks cancel flag between each
- Instant silence on cancel via `snd_pcm_drop()`
- Dynamic sample-rate reinitialization (supports 22050Hz Piper and 24kHz Kokoro)
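The cancellation pattern described above can be sketched in a hardware-independent form. This is an illustrative sketch under assumed names (the real code calls ALSA directly via `snd_pcm_writei`/`snd_pcm_drop`); the key idea is that the cancel flag is checked between period-sized writes, so latency is bounded by one period.

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <functional>
#include <vector>

// Illustrative sketch of cancellable playback: audio is written one
// ALSA-period-sized chunk at a time, and a shared atomic flag is checked
// between chunks, so cancel latency is at most one period
// (~46 ms at 22050 Hz with a 1024-frame period).
using WriteChunkFn = std::function<void(const int16_t*, size_t)>;

// Returns the number of frames actually written before cancellation (if any).
size_t playCancellable(const std::vector<int16_t>& samples,
                       size_t periodFrames,
                       const std::atomic<bool>& cancel,
                       const WriteChunkFn& writeChunk) {
    size_t written = 0;
    while (written < samples.size()) {
        if (cancel.load(std::memory_order_relaxed)) {
            // Real code would call snd_pcm_drop() here to flush the ALSA
            // buffer for instant silence instead of letting it drain.
            break;
        }
        size_t n = std::min(periodFrames, samples.size() - written);
        writeChunk(samples.data() + written, n);
        written += n;
    }
    return written;
}
```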
- Producer/consumer pattern for gapless streaming TTS
- Producer thread synthesizes sentences and pushes audio chunks
- Consumer thread plays chunks via ALSA as they arrive
- Sentence N+1 synthesizes while sentence N plays
- Thread-safe cancellation support
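The pattern above can be sketched as a small thread-safe chunk queue. This is an illustrative sketch with assumed names, not the actual `tts_queue.h` API: the producer (synthesis) pushes chunks, the consumer (playback) pops them as they arrive, and `close()` wakes a blocked consumer so shutdown or cancellation never hangs.

```cpp
#include <condition_variable>
#include <cstdint>
#include <deque>
#include <mutex>
#include <vector>

// Minimal producer/consumer queue sketch (names are illustrative).
// Sentence N+1 can be synthesized and pushed while sentence N's chunks
// are still being popped and played.
class ChunkQueue {
public:
    void push(std::vector<int16_t> chunk) {
        { std::lock_guard<std::mutex> lk(m_); q_.push_back(std::move(chunk)); }
        cv_.notify_one();
    }

    // Blocks until a chunk is available or the queue is closed.
    // Returns false when closed and drained.
    bool pop(std::vector<int16_t>& out) {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty() || closed_; });
        if (q_.empty()) return false;   // closed and fully drained
        out = std::move(q_.front());
        q_.pop_front();
        return true;
    }

    void close() {
        { std::lock_guard<std::mutex> lk(m_); closed_ = true; }
        cv_.notify_all();
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::deque<std::vector<int16_t>> q_;
    bool closed_ = false;
};
```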
- Raw WebSocket implementation (RFC 6455 compliant, no external WS library)
- TCP connect, WebSocket upgrade handshake with random key
- Masked frame sending and extended payload support
- Ping/pong handling for connection keepalive
- Background receive thread with thread-safe speak message queue
- Auto-reconnect with configurable delay and max attempts
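The masked-frame part of RFC 6455 can be sketched as follows. This is a generic illustration of the wire format (FIN + opcode byte, MASK bit, 7/16/64-bit payload length, 4-byte masking key, payload XORed with the key), not the project's `openclaw_client.cpp`; in real client code the mask key must be freshly randomized per frame.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Sketch of client-side text-frame building per RFC 6455.
std::vector<uint8_t> buildTextFrame(const std::string& payload,
                                    const uint8_t maskKey[4]) {
    std::vector<uint8_t> f;
    f.push_back(0x81);                          // FIN = 1, opcode 0x1 (text)
    size_t len = payload.size();
    if (len <= 125) {
        f.push_back(0x80 | static_cast<uint8_t>(len));   // MASK bit + length
    } else if (len <= 0xFFFF) {
        f.push_back(0x80 | 126);                         // 16-bit extended length
        f.push_back(static_cast<uint8_t>(len >> 8));
        f.push_back(static_cast<uint8_t>(len & 0xFF));
    } else {
        f.push_back(0x80 | 127);                         // 64-bit extended length
        for (int i = 7; i >= 0; --i)
            f.push_back(static_cast<uint8_t>(
                (static_cast<uint64_t>(len) >> (8 * i)) & 0xFF));
    }
    f.insert(f.end(), maskKey, maskKey + 4);    // 4-byte masking key
    for (size_t i = 0; i < len; ++i)            // XOR-mask the payload
        f.push_back(static_cast<uint8_t>(payload[i]) ^ maskKey[i % 4]);
    return f;
}
```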
Plays a brief, pleasant earcon sound while waiting for OpenClaw to process the user's request:
- Professional earcon: Generated via `sox` pluck synthesis (sounds like a real glockenspiel chime)
- Immediate acknowledgment: Plays once right after the transcription is sent
- Periodic reminder: Repeats every 5 seconds so the user knows the agent is still working
- Instant stop: Earcon stops within ~50ms when the response arrives
- Graceful fallback: If the earcon WAV is missing, waiting is silent (no crash)
Generated automatically by ./scripts/download-models.sh (requires sox).
- Wake word detection continues during TTS playback
- When detected: cancels current speech, clears pending responses, re-enters listening
- Deferred mutex handling for ARM safety (avoids deadlock between audio and pipeline threads)
- Text sanitization engine for TTS: removes emojis, markdown, HTML, special chars
- Abbreviation-aware sentence splitter (handles "Mr.", "Dr.", "e.g.", "U.S.", etc.)
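The abbreviation-aware splitting can be sketched as: a period only ends a sentence when the word it terminates is not a known abbreviation and the next character is a word boundary. This is a simplified illustration, not the project's actual splitter, and the abbreviation list is a small sample.

```cpp
#include <set>
#include <string>
#include <vector>

// Illustrative sentence splitter: checks the trailing word against an
// abbreviation set, which also handles tokens with internal periods
// ("e.g.", "U.S.") because mid-token periods fail the boundary check.
std::vector<std::string> splitSentences(const std::string& text) {
    static const std::set<std::string> kAbbrev = {
        "Mr.", "Mrs.", "Dr.", "e.g.", "i.e.", "U.S.", "etc."};
    std::vector<std::string> out;
    std::string cur;
    for (size_t i = 0; i < text.size(); ++i) {
        cur += text[i];
        char c = text[i];
        if (c == '.' || c == '!' || c == '?') {
            // The word that ends at this character (including its periods).
            size_t ws = cur.find_last_of(' ');
            std::string lastWord =
                (ws == std::string::npos) ? cur : cur.substr(ws + 1);
            bool nextIsBoundary = (i + 1 >= text.size()) || text[i + 1] == ' ';
            if (nextIsBoundary && kAbbrev.count(lastWord) == 0) {
                size_t b = cur.find_first_not_of(' ');
                out.push_back(cur.substr(b));   // emit trimmed sentence
                cur.clear();
            }
        }
    }
    if (!cur.empty()) {
        size_t b = cur.find_first_not_of(' ');
        if (b != std::string::npos) out.push_back(cur.substr(b));
    }
    return out;
}
```

For streaming TTS, splitting matters because each sentence is synthesized and queued independently; a false split at "Dr." would produce an awkward pause mid-name.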
Endpoint: `ws://openclaw-host:8082`
Connect:

```json
{
  "type": "connect",
  "deviceId": "pi-living-room",
  "accountId": "default",
  "capabilities": {
    "stt": true,
    "tts": true,
    "wakeWord": true
  }
}
```

Transcription (after ASR):

```json
{
  "type": "transcription",
  "text": "What's the weather like?",
  "sessionId": "main",
  "isFinal": true
}
```

Speak (for TTS):

```json
{
  "type": "speak",
  "text": "The weather is sunny.",
  "sourceChannel": "telegram",
  "priority": 1,
  "interrupt": false
}
```

- Raspberry Pi 5 (or Linux x86_64/ARM64)
- ALSA development libraries
- OpenClaw running with voice-assistant channel enabled
```bash
./build.sh
```

```bash
# Basic (connects to localhost:8082)
./build/openclaw-assistant

# With wake word enabled
./build/openclaw-assistant --wakeword

# Connect to remote OpenClaw
./build/openclaw-assistant --wakeword --openclaw-url ws://192.168.1.100:8082
```

```bash
# Run all tests
./build/test-components --run-all

# Test wake word detection with audio file
./build/test-components --test-wakeword tests/audio/hey-jarvis.wav

# Test that audio does NOT trigger wake word
./build/test-components --test-no-wakeword tests/audio/noise.wav

# Test VAD and STT
./build/test-components --test-vad tests/audio/speech.wav
./build/test-components --test-stt tests/audio/speech.wav

# Test full pipeline
./build/test-components --test-pipeline tests/audio/wakeword-plus-speech.wav
```

| Option | Description | Default |
|---|---|---|
| `--wakeword` | Enable wake word detection | Off |
| `--wakeword-threshold` | Detection threshold (0.0-1.0) | 0.5 |
| `--openclaw-url` | OpenClaw WebSocket URL | `ws://localhost:8082` |
| `--device-id` | Device identifier | hostname |
| `--input` | ALSA input device | "default" |
| `--output` | ALSA output device | "default" |
| `--list-devices` | List audio devices | - |
| `--help` | Show help | - |
| Model | Size | Location |
|---|---|---|
| Silero VAD | ~2 MB | ~/.local/share/runanywhere/Models/ONNX/silero-vad/ |
| Parakeet TDT-CTC 110M EN (int8) | ~126 MB | ~/.local/share/runanywhere/Models/ONNX/parakeet-tdt-ctc-110m-en-int8/ |
| Piper Lessac Medium TTS | ~61 MB | ~/.local/share/runanywhere/Models/ONNX/vits-piper-en_US-lessac-medium/ |
| Hey Jarvis | ~1.3 MB | ~/.local/share/runanywhere/Models/ONNX/hey-jarvis/ |
| openWakeWord Embedding | ~1.3 MB | ~/.local/share/runanywhere/Models/ONNX/openwakeword-embedding/ |
| openWakeWord Melspectrogram | ~1.1 MB | ~/.local/share/runanywhere/Models/ONNX/openwakeword-embedding/ |
Alternative models (via download flags):

| Model | Size | Location |
|---|---|---|
| Whisper Tiny EN (--whisper) | ~150 MB | ~/.local/share/runanywhere/Models/ONNX/whisper-tiny-en/ |
| Kokoro TTS v0.19 (--kokoro) | ~330 MB | ~/.local/share/runanywhere/Models/ONNX/kokoro-en-v0_19/ |
The openWakeWord .onnx model files are stored with Git LFS in the upstream repository.
Downloading them via raw.githubusercontent.com URLs will give you an HTML page instead
of the actual model binary, which causes ONNX runtime errors at load time.
Always download wake word models from GitHub Releases:
- https://github.com/dscripka/openWakeWord/releases/download/v0.5.1/embedding_model.onnx
- https://github.com/dscripka/openWakeWord/releases/download/v0.5.1/melspectrogram.onnx
- https://github.com/dscripka/openWakeWord/releases/download/v0.5.1/hey_jarvis_v0.1.onnx
The `scripts/download-models.sh --wakeword` script already uses the correct URLs.
To verify your downloaded models are valid ONNX files (not HTML):
```bash
file ~/.local/share/runanywhere/Models/ONNX/openwakeword-embedding/embedding_model.onnx
# Expected: "data" (binary ONNX file)
# Bad: "HTML document" (Git LFS redirect page)
```

```bash
cd /path/to/runanywhere-sdks/sdk/runanywhere-commons
./scripts/build-linux.sh --shared
```

This builds librac_backend_onnx.so and other shared libraries that the hybrid assistant links against. You must rebuild this whenever the SDK's C++ backends change (e.g., wake word fixes).
```bash
cd /path/to/runanywhere-sdks/Playground/openclaw-hybrid-assistant

# Download all models (Parakeet ASR + Piper TTS + VAD + wake word)
./scripts/download-models.sh --wakeword

# Or use alternative models:
./scripts/download-models.sh --wakeword --whisper   # Use Whisper for ASR instead of Parakeet
./scripts/download-models.sh --wakeword --kokoro    # Use Kokoro TTS instead of Piper
```

```bash
./build.sh
```

The OpenClaw gateway must be running with the voice-assistant channel enabled on port 8082. Verify with:

```bash
ss -tlnp | grep 8082
```

By default, voice input routes to the same agent as Telegram/WhatsApp, which may produce responses with emojis and markdown that aren't suitable for TTS. To get clean, conversational voice responses:
Add a `list` array under `agents` and a new `bindings` array:
```json
{
  "agents": {
    "defaults": { ... },
    "list": [
      {
        "id": "main",
        "default": true
      },
      {
        "id": "voice-agent",
        "workspace": "/home/runanywhere/.openclaw/voice-workspace"
      }
    ]
  },
  "bindings": [
    {
      "agentId": "voice-agent",
      "match": {
        "channel": "voice-assistant",
        "accountId": "*"
      }
    }
  ],
  ...
}
```

Create the voice workspace directory and SOUL.md:

```bash
mkdir -p ~/.openclaw/voice-workspace
```

Create `~/.openclaw/voice-workspace/SOUL.md`:
```markdown
# SOUL.md - OpenClawPi Voice Assistant

You are OpenClawPi, a voice assistant running on a Raspberry Pi. Everything you say will be spoken aloud through text-to-speech.

## Voice Output Rules (CRITICAL)

Since your responses are spoken, not read:

1. **NO emojis** - TTS cannot pronounce them
2. **NO special Unicode characters** - no arrows, bullets, checkmarks, etc.
3. **NO markdown formatting** - no asterisks, underscores, backticks, or headers
4. **NO URLs** - say "check the website" not the actual URL
5. **Spell out symbols** - say "55 degrees Fahrenheit" not "55 degrees F"
6. **Use natural punctuation** - periods and commas create natural pauses

## Conversation Style

- Be concise - TTS playback takes time
- Use conversational language, as if speaking to someone in person
- Avoid lists when possible - use flowing sentences instead
- For multiple items, use "first... second... and finally..." patterns
- Round numbers for easier listening ("about fifty" not "49.7")

## Personality

You're helpful, warm, and efficient. Skip filler phrases like "Great question!" - just answer directly.

## Example Response Transformation

Bad (text-style): "San Francisco Weather: - Right now: Rain, 55°F 🌧️"
Good (voice-style): "Right now in San Francisco it's raining at 55 degrees."
```

| Input Source | Routes To | SOUL.md Used | Output Style |
|---|---|---|---|
| Voice microphone | `voice-agent` | ~/.openclaw/voice-workspace/SOUL.md | Conversational, no emojis |
| Telegram | `main` (default) | ~/.openclaw/workspace/SOUL.md | Rich text, emojis OK |
| Telegram → Speaker | `main` → `sanitizeForTTS()` | N/A (safety net) | Stripped markdown/emojis |
The binding ensures voice input gets voice-optimized responses. The sanitizeForTTS() function in OpenClaw provides a safety net for cross-channel broadcasts.
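For illustration, a safety-net pass like that can be sketched as follows. To be clear, this is NOT OpenClaw's actual `sanitizeForTTS()` implementation; it is a deliberately crude sketch that strips markdown punctuation and drops non-ASCII bytes (which removes emoji bytes wholesale). A production version would decode UTF-8 properly and preserve legitimate accented characters.

```cpp
#include <string>

// Crude sanitization sketch (illustrative only): remove common markdown
// punctuation and all non-ASCII bytes before handing text to TTS.
std::string sanitizeForTtsSketch(const std::string& in) {
    const std::string kMarkdown = "*_`#>~";
    std::string out;
    out.reserve(in.size());
    for (unsigned char c : in) {
        if (c > 0x7F) continue;    // non-ASCII byte (emoji bytes, arrows, etc.)
        if (kMarkdown.find(static_cast<char>(c)) != std::string::npos) continue;
        out += static_cast<char>(c);
    }
    return out;
}
```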
```bash
# With wake word ("Hey Jarvis")
./build/openclaw-assistant --wakeword

# Without wake word (continuous listening)
./build/openclaw-assistant
```

To run the assistant as a background service that starts on boot, create a systemd user service and enable it. See Viewing Logs below for how to monitor it.
If running in the foreground, logs print to stdout. If running as a background process or systemd service:
```bash
# If started via systemd
journalctl --user -u openclaw-assistant -f

# If started as a background process with output redirected
tail -f /path/to/openclaw-assistant.log
```

The OpenClaw gateway runs as a systemd user service:
```bash
# Follow logs in real time
journalctl --user -u openclaw-gateway -f

# View last 100 lines
journalctl --user -u openclaw-gateway -n 100

# View logs since last boot
journalctl --user -u openclaw-gateway -b
```

Open two terminals (or tmux panes):
```bash
# Terminal 1: OpenClaw Gateway
journalctl --user -u openclaw-gateway -f

# Terminal 2: Hybrid Assistant
journalctl --user -u openclaw-assistant -f
# (or tail -f on the output file if not using systemd)
```

Since this is a Linux application using ALSA, you can test on a Mac using:
```bash
# Build Docker image (from sdks root directory)
cd /path/to/sdks
docker build -t openclaw-assistant -f Playground/openclaw-hybrid-assistant/Dockerfile .

# Run all tests
docker run --rm openclaw-assistant ./build/test-components --run-all

# Run extensive test suite
docker run --rm openclaw-assistant ./tests/scripts/extensive-test.sh
```

```bash
# Install Lima
brew install lima

# Start Ubuntu VM
limactl start --name=ubuntu template://ubuntu

# SSH and build
limactl shell ubuntu
cd /path/to/openclaw-hybrid-assistant
./build.sh
```

- Lower the threshold: `--wakeword-threshold 0.3`
- Check audio levels with `arecord -l`
- Ensure the microphone is working: `arecord -d 5 test.wav && aplay test.wav`
- Adjust silence duration in code (default: 1.5s)
- Check ambient noise levels
- Verify OpenClaw is running: `curl http://localhost:18789/health`
- Check that the voice-assistant channel is enabled in the OpenClaw config
- Verify port 8082 is accessible
MIT