OpenClaw Hybrid Assistant

A lightweight voice assistant that acts as a channel for OpenClaw. No local LLM - just:

  • Wake Word → VAD → ASR → sends transcription to OpenClaw
  • TTS ← receives speech commands from OpenClaw (any channel)

Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                        OpenClaw Hybrid Assistant                             │
│                                                                              │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │                         INPUT PIPELINE                                │   │
│  │                                                                       │   │
│  │   Microphone → Wake Word → VAD → ASR/STT  → WebSocket → OpenClaw    │   │
│  │    (ALSA)     (openWW)  (Silero) (Parakeet)            (Channel)    │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │                        OUTPUT PIPELINE                                │   │
│  │                                                                       │   │
│  │   OpenClaw → WebSocket → TTS/Piper → Speaker                        │   │
│  │  (any channel)          (22050Hz)    (ALSA)                          │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────────┘

Project Structure

openclaw-hybrid-assistant/
├── src/
│   ├── main.cpp                          # Entry point, CLI parsing, event loop
│   ├── audio/                            # Audio I/O (ALSA)
│   │   ├── audio_capture.h/cpp           #   Microphone input (16kHz, 16-bit PCM, mono)
│   │   ├── audio_playback.h/cpp          #   Speaker output (cancellable, multi-rate)
│   │   └── waiting_chime.h/cpp           #   Earcon feedback while waiting for response
│   ├── pipeline/                         # Voice processing chain
│   │   ├── voice_pipeline.h/cpp          #   Wake Word → VAD → STT → TTS orchestrator
│   │   └── tts_queue.h/cpp               #   Producer/consumer streaming TTS playback
│   ├── network/                          # Network communication
│   │   └── openclaw_client.h/cpp         #   Raw WebSocket client (RFC 6455)
│   └── config/                           # Configuration
│       └── model_config.h                #   Model paths, IDs, availability checks
├── tests/
│   ├── test_components.cpp               # Component tests (wake word, VAD, STT)
│   ├── test_integration.cpp              # E2E tests (fake WS server, sanitization, TTS)
│   ├── audio/                            # Generated test WAV files
│   └── scripts/                          # Test audio generation scripts
├── scripts/
│   ├── download-models.sh                # Model download (VAD, ASR, TTS, wake word)
│   ├── openclaw-voice.service            # systemd service unit
│   └── test-on-mac.sh                    # Mac testing via Docker/Lima
├── CMakeLists.txt                        # Build configuration (3 targets)
├── Dockerfile                            # Docker build + test environment
├── build.sh                              # End-to-end build script
└── README.md

Key Differences from linux-voice-assistant

| Feature     | linux-voice-assistant  | openclaw-hybrid-assistant                 |
|-------------|------------------------|-------------------------------------------|
| Wake Word   | ✅                     | ✅                                        |
| VAD         | ✅                     | ✅                                        |
| ASR/STT     | ✅ Local Whisper       | ✅ Parakeet TDT-CTC 110M (NeMo CTC, int8) |
| LLM         | ✅ Local or Moltbot    | ❌ None - uses OpenClaw                   |
| TTS         | ✅ Local Piper (22kHz) | ✅ Piper Lessac Medium (22050Hz)          |
| Integration | HTTP Voice Bridge      | WebSocket to OpenClaw                     |

Components

1. Wake Word Detector

  • Model: openWakeWord "Hey Jarvis"
  • Threshold: 0.5 (configurable)
  • Frame size: 80ms (1280 samples at 16kHz)
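
A minimal sketch of the per-frame detection step. The score function here is a hypothetical stand-in for the openWakeWord ONNX inference; the real detector class in this repo may be named and structured differently:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

constexpr int kSampleRate   = 16000;
constexpr int kFrameSamples = 1280;   // 80 ms at 16 kHz

// Stand-in for the model inference call: maps one 80 ms frame of 16-bit
// PCM to a confidence in [0.0, 1.0].
using ScoreFn = std::function<float(const int16_t*, size_t)>;

bool detectedInFrame(const ScoreFn& score, const std::vector<int16_t>& frame,
                     float threshold = 0.5f) {
    // Fire when the per-frame confidence crosses the (configurable) threshold.
    return score(frame.data(), frame.size()) >= threshold;
}
```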

2. Voice Activity Detection (VAD)

  • Model: Silero VAD (ONNX neural network, via sherpa-onnx)
  • Much more accurate than energy-based VAD at distinguishing speech from noise
  • Silence threshold: 1.5 seconds
  • Minimum speech: 0.5 seconds
  • Fallback: energy-based VAD if Silero model fails to load
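
The fallback path is plain energy-based detection; a minimal sketch (the threshold value is illustrative, not the project's actual constant):

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

// Returns true if a 16-bit PCM frame's RMS energy exceeds a fixed threshold.
// The pipeline applies its timing rules (1.5 s silence to end an utterance,
// 0.5 s minimum speech) on top of these per-frame decisions.
bool isSpeechByEnergy(const int16_t* samples, size_t n, double threshold = 500.0) {
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        sum += static_cast<double>(samples[i]) * samples[i];
    }
    return std::sqrt(sum / n) > threshold;   // RMS over the frame
}
```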

3. Speech-to-Text (ASR)

  • Model: Parakeet TDT-CTC 110M EN (NeMo CTC, int8 quantized)
  • Architecture: FastConformer 110M params
  • Features: Automatic punctuation + capitalization
  • Sample rate: 16kHz mono
  • Size: ~126MB (int8 quantized)
  • Alternative: Whisper Tiny EN available with --whisper download flag

4. Text-to-Speech (TTS)

  • Model: Piper Lessac Medium (VITS)
  • Output rate: 22050 Hz
  • Voice: Natural American male (Lessac dataset)
  • Size: ~61MB (model + espeak-ng-data)
  • Alternative: Kokoro TTS v0.19 available with --kokoro download flag (11 speakers, 24kHz, ~330MB)
  • Text Sanitization: Automatically removes emojis, markdown, and special characters before synthesis
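
A rough sketch of what that sanitization can look like; this is a simplified stand-in, not the project's actual implementation:

```cpp
#include <regex>
#include <string>

std::string sanitizeForTTS(std::string text) {
    // Strip common markdown markers: *bold*, _italic_, `code`, # headers.
    text = std::regex_replace(text, std::regex(R"([*_`#]+)"), "");
    // Drop markdown links, keeping the link text: [label](url) -> label.
    text = std::regex_replace(text, std::regex(R"(\[([^\]]*)\]\([^)]*\))"), "$1");
    // Remove every non-ASCII byte; this also removes emojis (and, as a side
    // effect of this crude sketch, any accented characters).
    std::string out;
    for (unsigned char c : text) {
        if (c < 0x80) out += static_cast<char>(c);
    }
    return out;
}
```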

5. Audio Capture (src/audio/audio_capture)

  • ALSA-based microphone input
  • Format: 16kHz, 16-bit PCM, mono (optimal for STT)
  • Callback-driven: delivers audio chunks to the voice pipeline
  • Device listing and selection support
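
For reference, opening the default ALSA capture device in this format takes only a few calls; a trimmed sketch (error handling mostly omitted, link with -lasound):

```cpp
#include <alsa/asoundlib.h>
#include <cstdint>
#include <vector>

int main() {
    snd_pcm_t* pcm = nullptr;
    if (snd_pcm_open(&pcm, "default", SND_PCM_STREAM_CAPTURE, 0) < 0) return 1;
    snd_pcm_set_params(pcm, SND_PCM_FORMAT_S16_LE, SND_PCM_ACCESS_RW_INTERLEAVED,
                       /*channels=*/1, /*rate=*/16000,
                       /*soft_resample=*/1, /*latency_us=*/500000);
    std::vector<int16_t> chunk(1280);        // one 80 ms frame at 16 kHz
    snd_pcm_sframes_t n = snd_pcm_readi(pcm, chunk.data(), chunk.size());
    // ...hand the first n frames of `chunk` to the pipeline callback...
    snd_pcm_close(pcm);
    return n < 0 ? 1 : 0;
}
```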

6. Audio Playback (src/audio/audio_playback)

  • ALSA-based speaker output
  • Cancellable playback: writes period-sized chunks (~46ms at 22kHz) and checks a cancel flag between writes
  • Instant silence on cancel via snd_pcm_drop()
  • Dynamic sample-rate reinitialization (supports 22050Hz Piper and 24kHz Kokoro)
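
A sketch of that cancellable write loop using the standard ALSA API (the chunk size and underrun-recovery policy are illustrative):

```cpp
#include <alsa/asoundlib.h>
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <vector>

std::atomic<bool> g_cancel{false};

// Write period-sized chunks and poll the cancel flag between writes, so
// playback can be cut off mid-utterance; snd_pcm_drop() discards whatever
// is still queued in the hardware buffer for near-instant silence.
void playCancellable(snd_pcm_t* pcm, const std::vector<int16_t>& audio,
                     snd_pcm_uframes_t period = 1024) {   // ~46 ms at 22050 Hz
    size_t pos = 0;
    while (pos < audio.size()) {
        if (g_cancel.load()) {
            snd_pcm_drop(pcm);               // stop immediately, drop queued frames
            return;
        }
        snd_pcm_uframes_t n = std::min<size_t>(period, audio.size() - pos);
        if (snd_pcm_writei(pcm, audio.data() + pos, n) < 0) {
            snd_pcm_prepare(pcm);            // recover from underrun and retry
            continue;
        }
        pos += n;
    }
    snd_pcm_drain(pcm);                      // let the tail finish normally
}
```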

7. TTS Queue (src/pipeline/tts_queue)

  • Producer/consumer pattern for gapless streaming TTS
  • Producer thread synthesizes sentences and pushes audio chunks
  • Consumer thread plays chunks via ALSA as they arrive
  • Sentence N+1 synthesizes while sentence N plays
  • Thread-safe cancellation support
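
The pattern boils down to a mutex, a condition variable, and a closed flag; a minimal sketch (the real tts_queue adds cancellation and sentence bookkeeping):

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <optional>
#include <queue>
#include <vector>

class TtsQueue {
public:
    // Producer thread: push one synthesized audio chunk.
    void push(std::vector<int16_t> chunk) {
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push(std::move(chunk));
        }
        cv_.notify_one();
    }
    // Consumer thread: blocks until a chunk arrives or the queue is closed.
    std::optional<std::vector<int16_t>> pop() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty() || closed_; });
        if (q_.empty()) return std::nullopt;   // closed and drained
        auto chunk = std::move(q_.front());
        q_.pop();
        return chunk;
    }
    void close() {
        { std::lock_guard<std::mutex> lock(m_); closed_ = true; }
        cv_.notify_all();
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::vector<int16_t>> q_;
    bool closed_ = false;
};
```

Because push() and pop() run on different threads, sentence N+1 can be synthesized while sentence N is still playing, which is what makes the playback gapless.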

8. OpenClaw Client (src/network/openclaw_client)

  • Raw WebSocket implementation (RFC 6455 compliant, no external WS library)
  • TCP connect, WebSocket upgrade handshake with random key
  • Masked frame sending and extended payload support
  • Ping/pong handling for connection keepalive
  • Background receive thread with thread-safe speak message queue
  • Auto-reconnect with configurable delay and max attempts
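
For illustration, this is what RFC 6455 client-side masking amounts to; a sketch of encoding a single text frame (the real client also handles 64-bit lengths, fragmentation, and control frames):

```cpp
#include <cstdint>
#include <random>
#include <string>
#include <vector>

// Encode one masked client->server text frame per RFC 6455.
// Sketch only: assumes payload < 64 KiB (no 64-bit extended length).
std::vector<uint8_t> encodeTextFrame(const std::string& payload) {
    std::vector<uint8_t> frame;
    frame.push_back(0x81);                       // FIN=1, opcode=0x1 (text)
    size_t len = payload.size();
    if (len < 126) {
        frame.push_back(0x80 | static_cast<uint8_t>(len));   // MASK bit set
    } else {                                     // 16-bit extended length
        frame.push_back(0x80 | 126);
        frame.push_back(static_cast<uint8_t>(len >> 8));
        frame.push_back(static_cast<uint8_t>(len & 0xFF));
    }
    std::random_device rd;                       // 4-byte masking key
    uint8_t key[4];
    for (auto& k : key) k = static_cast<uint8_t>(rd());
    frame.insert(frame.end(), key, key + 4);
    for (size_t i = 0; i < len; ++i)             // XOR-mask the payload
        frame.push_back(static_cast<uint8_t>(payload[i]) ^ key[i % 4]);
    return frame;
}
```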

9. Waiting Feedback (src/audio/waiting_chime)

Plays a brief, pleasant earcon sound while waiting for OpenClaw to process the user's request:

  • Professional earcon: Generated via sox pluck synthesis (sounds like a real glockenspiel chime)
  • Immediate acknowledgment: Plays once right after the transcription is sent
  • Periodic reminder: Repeats every 5 seconds so the user knows the agent is still working
  • Instant stop: Earcon stops within ~50ms when the response arrives
  • Graceful fallback: If the earcon WAV is missing, waiting is silent (no crash)

Generated automatically by ./scripts/download-models.sh (requires sox).
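
One way to get the "repeat every 5 seconds, stop almost instantly" behavior is a condition-variable wait; a minimal sketch, where playChime() is a hypothetical stand-in for the actual WAV playback:

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

void playChime();   // hypothetical: plays the earcon WAV once

void waitingLoop(std::mutex& m, std::condition_variable& cv, bool& responded) {
    std::unique_lock<std::mutex> lock(m);
    while (!responded) {
        lock.unlock();
        playChime();                             // immediate acknowledgment
        lock.lock();
        // Wakes within milliseconds of notify_one() when the response
        // arrives, or after 5 s to play the periodic reminder.
        cv.wait_for(lock, std::chrono::seconds(5), [&] { return responded; });
    }
}
```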

10. Barge-in Support

  • Wake word detection continues during TTS playback
  • When detected: cancels current speech, clears pending responses, re-enters listening
  • Deferred mutex handling for ARM safety (avoids deadlock between audio and pipeline threads)
  • Text sanitization engine for TTS: removes emojis, markdown, HTML, special chars
  • Abbreviation-aware sentence splitter (handles "Mr.", "Dr.", "e.g.", "U.S.", etc.)
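
A minimal sketch of abbreviation-aware splitting; the abbreviation list and edge-case handling here are illustrative (for instance, it will not split a sentence that genuinely ends in "U.S."):

```cpp
#include <set>
#include <sstream>
#include <string>
#include <vector>

std::vector<std::string> splitSentences(const std::string& text) {
    static const std::set<std::string> kAbbrev = {
        "Mr.", "Mrs.", "Dr.", "e.g.", "i.e.", "U.S.", "etc."};
    std::vector<std::string> out;
    std::string sentence, word;
    std::istringstream in(text);
    while (in >> word) {
        sentence += (sentence.empty() ? "" : " ") + word;
        char last = word.back();
        bool terminal = (last == '.' || last == '!' || last == '?');
        // End the sentence on terminal punctuation, unless the token is a
        // known abbreviation like "Dr." or "e.g.".
        if (terminal && kAbbrev.count(word) == 0) {
            out.push_back(sentence);
            sentence.clear();
        }
    }
    if (!sentence.empty()) out.push_back(sentence);
    return out;
}
```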

OpenClaw WebSocket Protocol

Connection

ws://openclaw-host:8082

Messages: Assistant → OpenClaw

Connect:

{
  "type": "connect",
  "deviceId": "pi-living-room",
  "accountId": "default",
  "capabilities": {
    "stt": true,
    "tts": true,
    "wakeWord": true
  }
}

Transcription (after ASR):

{
  "type": "transcription",
  "text": "What's the weather like?",
  "sessionId": "main",
  "isFinal": true
}

Messages: OpenClaw → Assistant

Speak (for TTS):

{
  "type": "speak",
  "text": "The weather is sunny.",
  "sourceChannel": "telegram",
  "priority": 1,
  "interrupt": false
}

Quick Start

Prerequisites

  • Raspberry Pi 5 (or Linux x86_64/ARM64)
  • ALSA development libraries
  • OpenClaw running with voice-assistant channel enabled

Build

./build.sh

Run

# Basic (connects to localhost:8082)
./build/openclaw-assistant

# With wake word enabled
./build/openclaw-assistant --wakeword

# Connect to remote OpenClaw
./build/openclaw-assistant --wakeword --openclaw-url ws://192.168.1.100:8082

Test Components

# Run all tests
./build/test-components --run-all

# Test wake word detection with audio file
./build/test-components --test-wakeword tests/audio/hey-jarvis.wav

# Test that audio does NOT trigger wake word
./build/test-components --test-no-wakeword tests/audio/noise.wav

# Test VAD and STT
./build/test-components --test-vad tests/audio/speech.wav
./build/test-components --test-stt tests/audio/speech.wav

# Test full pipeline
./build/test-components --test-pipeline tests/audio/wakeword-plus-speech.wav

Configuration

Command Line Options

| Option               | Description                   | Default             |
|----------------------|-------------------------------|---------------------|
| --wakeword           | Enable wake word detection    | Off                 |
| --wakeword-threshold | Detection threshold (0.0-1.0) | 0.5                 |
| --openclaw-url       | OpenClaw WebSocket URL        | ws://localhost:8082 |
| --device-id          | Device identifier             | hostname            |
| --input              | ALSA input device             | "default"           |
| --output             | ALSA output device            | "default"           |
| --list-devices       | List audio devices            | -                   |
| --help               | Show help                     | -                   |

Models Required

| Model                           | Size    | Location                                                               |
|---------------------------------|---------|------------------------------------------------------------------------|
| Silero VAD                      | ~2 MB   | ~/.local/share/runanywhere/Models/ONNX/silero-vad/                     |
| Parakeet TDT-CTC 110M EN (int8) | ~126 MB | ~/.local/share/runanywhere/Models/ONNX/parakeet-tdt-ctc-110m-en-int8/  |
| Piper Lessac Medium TTS         | ~61 MB  | ~/.local/share/runanywhere/Models/ONNX/vits-piper-en_US-lessac-medium/ |
| Hey Jarvis                      | ~1.3 MB | ~/.local/share/runanywhere/Models/ONNX/hey-jarvis/                     |
| openWakeWord Embedding          | ~1.3 MB | ~/.local/share/runanywhere/Models/ONNX/openwakeword-embedding/         |
| openWakeWord Melspectrogram     | ~1.1 MB | ~/.local/share/runanywhere/Models/ONNX/openwakeword-embedding/         |

Alternative models (via download flags):

| Model                       | Size    | Location                                                |
|-----------------------------|---------|---------------------------------------------------------|
| Whisper Tiny EN (--whisper) | ~150 MB | ~/.local/share/runanywhere/Models/ONNX/whisper-tiny-en/ |
| Kokoro TTS v0.19 (--kokoro) | ~330 MB | ~/.local/share/runanywhere/Models/ONNX/kokoro-en-v0_19/ |

Wake Word Model Download Note

The openWakeWord .onnx model files are stored with Git LFS in the upstream repository. Downloading them via raw.githubusercontent.com URLs will give you an HTML page instead of the actual model binary, which causes ONNX runtime errors at load time.

Always download wake word models from GitHub Releases:

  • https://github.com/dscripka/openWakeWord/releases/download/v0.5.1/embedding_model.onnx
  • https://github.com/dscripka/openWakeWord/releases/download/v0.5.1/melspectrogram.onnx
  • https://github.com/dscripka/openWakeWord/releases/download/v0.5.1/hey_jarvis_v0.1.onnx

The scripts/download-models.sh --wakeword script already uses the correct URLs.

To verify your downloaded models are valid ONNX files (not HTML):

file ~/.local/share/runanywhere/Models/ONNX/openwakeword-embedding/embedding_model.onnx
# Expected: "data" (binary ONNX file)
# Bad:      "HTML document" (Git LFS redirect page)

Raspberry Pi First-Time Setup

1. Build runanywhere-commons (shared libraries)

cd /path/to/runanywhere-sdks/sdk/runanywhere-commons
./scripts/build-linux.sh --shared

This builds librac_backend_onnx.so and other shared libraries that the hybrid assistant links against. You must rebuild this whenever the SDK's C++ backends change (e.g., wake word fixes).

2. Download models

cd /path/to/runanywhere-sdks/Playground/openclaw-hybrid-assistant

# Download all models (Parakeet ASR + Piper TTS + VAD + wake word)
./scripts/download-models.sh --wakeword

# Or use alternative models:
./scripts/download-models.sh --wakeword --whisper   # Use Whisper for ASR instead of Parakeet
./scripts/download-models.sh --wakeword --kokoro     # Use Kokoro TTS instead of Piper

3. Build the hybrid assistant

./build.sh

4. Ensure OpenClaw is running

The OpenClaw gateway must be running with the voice-assistant channel enabled on port 8082. Verify with:

ss -tlnp | grep 8082

5. Configure OpenClaw for Voice-Specific Behavior (Recommended)

By default, voice input routes to the same agent as Telegram/WhatsApp, which may produce responses with emojis and markdown that aren't suitable for TTS. To get clean, conversational voice responses:

5a. Add voice-agent binding to ~/.openclaw/openclaw.json

Add a list array under agents and a new top-level bindings array:

{
  "agents": {
    "defaults": { ... },
    "list": [
      {
        "id": "main",
        "default": true
      },
      {
        "id": "voice-agent",
        "workspace": "/home/runanywhere/.openclaw/voice-workspace"
      }
    ]
  },
  "bindings": [
    {
      "agentId": "voice-agent",
      "match": {
        "channel": "voice-assistant",
        "accountId": "*"
      }
    }
  ],
  ...
}

5b. Create voice-specific SOUL.md

Create the voice workspace directory and SOUL.md:

mkdir -p ~/.openclaw/voice-workspace

Create ~/.openclaw/voice-workspace/SOUL.md:

# SOUL.md - OpenClawPi Voice Assistant

You are OpenClawPi, a voice assistant running on a Raspberry Pi. Everything you say will be spoken aloud through text-to-speech.

## Voice Output Rules (CRITICAL)

Since your responses are spoken, not read:

1. **NO emojis** - TTS cannot pronounce them
2. **NO special Unicode characters** - no arrows, bullets, checkmarks, etc.
3. **NO markdown formatting** - no asterisks, underscores, backticks, or headers
4. **NO URLs** - say "check the website" not the actual URL
5. **Spell out symbols** - say "55 degrees Fahrenheit" not "55°F"
6. **Use natural punctuation** - periods and commas create natural pauses

## Conversation Style

- Be concise - TTS playback takes time
- Use conversational language, as if speaking to someone in person
- Avoid lists when possible - use flowing sentences instead
- For multiple items, use "first... second... and finally..." patterns
- Round numbers for easier listening ("about fifty" not "49.7")

## Personality

You're helpful, warm, and efficient. Skip filler phrases like "Great question!" - just answer directly.

## Example Response Transformation

Bad (text-style): "San Francisco Weather: - Right now: Rain, 55°F 🌧️"

Good (voice-style): "Right now in San Francisco it's raining at 55 degrees."

How It Works

| Input Source       | Routes To               | SOUL.md Used                        | Output Style              |
|--------------------|-------------------------|-------------------------------------|---------------------------|
| Voice microphone   | voice-agent             | ~/.openclaw/voice-workspace/SOUL.md | Conversational, no emojis |
| Telegram           | main (default)          | ~/.openclaw/workspace/SOUL.md       | Rich text, emojis OK      |
| Telegram → Speaker | main + sanitizeForTTS() | N/A (safety net)                    | Stripped markdown/emojis  |

The binding ensures voice input gets voice-optimized responses. The sanitizeForTTS() function in OpenClaw provides a safety net for cross-channel broadcasts.

6. Run the assistant

# With wake word ("Hey Jarvis")
./build/openclaw-assistant --wakeword

# Without wake word (continuous listening)
./build/openclaw-assistant

7. Run as a systemd service (optional)

To run the assistant as a background service that starts on boot, create a systemd user service and enable it. See Viewing Logs below for how to monitor it.
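
A minimal user unit as a starting point. The unit name matches the journalctl commands below, but the ExecStart path is a placeholder you must adjust; the repository also ships scripts/openclaw-voice.service, which you can adapt instead:

```ini
# ~/.config/systemd/user/openclaw-assistant.service
[Unit]
Description=OpenClaw Hybrid Assistant
After=network-online.target

[Service]
# Adjust to wherever you built the binary.
ExecStart=%h/openclaw-hybrid-assistant/build/openclaw-assistant --wakeword
Restart=on-failure

[Install]
WantedBy=default.target
```

Then reload and enable it:

```sh
systemctl --user daemon-reload
systemctl --user enable --now openclaw-assistant
```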

Viewing Logs

Hybrid Assistant logs

If running in the foreground, logs print to stdout. If running as a background process or systemd service:

# If started via systemd
journalctl --user -u openclaw-assistant -f

# If started as a background process with output redirected
tail -f /path/to/openclaw-assistant.log

OpenClaw Gateway logs

The OpenClaw gateway runs as a systemd user service:

# Follow logs in real time
journalctl --user -u openclaw-gateway -f

# View last 100 lines
journalctl --user -u openclaw-gateway -n 100

# View logs since last boot
journalctl --user -u openclaw-gateway -b

Watching both side-by-side

Open two terminals (or tmux panes):

# Terminal 1: OpenClaw Gateway
journalctl --user -u openclaw-gateway -f

# Terminal 2: Hybrid Assistant
journalctl --user -u openclaw-assistant -f
# (or tail -f on the output file if not using systemd)

Testing on Mac

Since this is a Linux application using ALSA, you can test on Mac using:

Option 1: Docker with WAV Files (Recommended)

# Build Docker image (from sdks root directory)
cd /path/to/sdks
docker build -t openclaw-assistant -f Playground/openclaw-hybrid-assistant/Dockerfile .

# Run all tests
docker run --rm openclaw-assistant ./build/test-components --run-all

# Run extensive test suite
docker run --rm openclaw-assistant ./tests/scripts/extensive-test.sh

Option 2: Lima VM

# Install Lima
brew install lima

# Start Ubuntu VM
limactl start --name=ubuntu template://ubuntu

# SSH and build
limactl shell ubuntu
cd /path/to/openclaw-hybrid-assistant
./build.sh

Troubleshooting

Wake word not detecting

  • Lower the threshold: --wakeword-threshold 0.3
  • Check capture levels with alsamixer; confirm the microphone is detected with arecord -l
  • Ensure microphone is working: arecord -d 5 test.wav && aplay test.wav

VAD too sensitive / not sensitive enough

  • Adjust silence duration in code (default: 1.5s)
  • Check ambient noise levels

WebSocket connection failing

  • Verify OpenClaw is running: curl http://localhost:18789/health
  • Check voice-assistant channel is enabled in OpenClaw config
  • Verify port 8082 is accessible

License

MIT