How to Add Voice to Your OpenClaw Agent: TTS, STT, and Talk Mode
Turn your OpenClaw agent into a hands-free voice assistant with ElevenLabs, Deepgram, and Whisper. Complete setup guide for TTS, STT, and phone integration.
Why Would an Executive Want to Talk to Their AI Agent?
Because typing is slow and your hands are usually busy. Executives spend an average of 4.3 hours per day in meetings according to Microsoft’s 2025 Work Trend Index — that’s time where a keyboard isn’t an option but a quick voice command is. Voice turns your OpenClaw agent from something you sit down to use into something that follows you through your day.

Think about the moments that matter. You’re driving to the office and want a briefing on today’s board prep materials. You’re between back-to-back calls and need a quick cash flow summary. You’re on a plane and want to dictate a response to an investor update.
These aren’t hypotheticals. They’re the exact use cases our clients describe within the first week of deployment. Voice input and output make your AI agent feel less like a tool and more like a chief of staff — always listening, always ready.
How Does Voice Work in OpenClaw’s Architecture?
Voice in OpenClaw flows through three distinct layers: speech-to-text (STT) converts your spoken words into text input, the LLM processes the request and generates a response, and text-to-speech (TTS) converts that response back into audio. Each layer is independently configurable with your choice of provider.
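The three layers compose like a simple pipeline. As a rough sketch (the `SpeechToText`/`TextToSpeech` interfaces and `respond` callable are illustrative assumptions, not OpenClaw's actual internal API):

```python
from typing import Callable, Protocol

class SpeechToText(Protocol):
    def transcribe(self, audio: bytes) -> str: ...

class TextToSpeech(Protocol):
    def synthesize(self, text: str) -> bytes: ...

def voice_round_trip(audio: bytes, stt: SpeechToText,
                     respond: Callable[[str], str],
                     tts: TextToSpeech) -> bytes:
    """STT -> LLM -> TTS: each layer is independently swappable."""
    prompt = stt.transcribe(audio)   # layer 1: speech-to-text
    reply = respond(prompt)          # layer 2: LLM reasoning
    return tts.synthesize(reply)     # layer 3: text-to-speech
```

Because each layer is just an interface, swapping Whisper for Deepgram (or ElevenLabs for Edge TTS) changes one provider block, not the pipeline.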
This modular design is what sets OpenClaw apart from closed voice assistants like Alexa or Siri. You pick which STT engine handles your voice data. You pick which TTS voice represents your agent. And critically, you decide whether any of that data touches the cloud at all.
According to the OpenClaw TTS documentation, the platform supports four TTS providers and three STT providers out of the box, with more being added through community contributions.
What Are the Best STT Options for Executive Use?
OpenAI Whisper running locally is the best choice for privacy-conscious executives. It transcribes speech on your hardware — your Mac Mini or MacBook Air — with zero network calls. Audio data never leaves your machine, which means it’s fully compliant with even the strictest data governance policies.
Here’s how each STT provider compares:
| Provider | Accuracy | Latency | Privacy | Cost |
|---|---|---|---|---|
| Whisper (local) | ~95% | 1-3s | Full — on-device | Free |
| Deepgram Nova-2 | ~97% | <300ms | Cloud | ~$0.0043/min |
| OpenAI STT API | ~96% | 500ms-1s | Cloud | ~$0.006/min |
Deepgram’s Nova-2 model achieves an 8.4% word error rate on business English, the lowest in the industry according to Deepgram’s accuracy benchmarks. That accuracy matters when you’re dictating financial figures or names during a portfolio review.
For most of our deployments, we configure Whisper locally as the default and offer Deepgram Nova-2 as a toggle when clients need faster turnaround and are comfortable with cloud STT. The choice depends entirely on your security requirements.
How Do I Configure STT in OpenClaw?
You configure speech-to-text by adding an STT provider block to your openclaw.yaml configuration file. Here’s the Whisper local configuration:
```yaml
# openclaw.yaml — Local Whisper STT (fully private)
stt:
  provider: whisper
  model: base      # Options: tiny, base, small, medium, large-v3
  language: en
  device: auto     # Uses GPU if available, falls back to CPU
```
For Deepgram Nova-2, the configuration points to their cloud API:
```yaml
# openclaw.yaml — Deepgram Nova-2 STT (cloud, low latency)
stt:
  provider: deepgram
  api_key: "${DEEPGRAM_API_KEY}"
  model: nova-2
  language: en
  smart_format: true   # Adds punctuation, numerals, formatting
  diarize: false
```
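Under the hood, this config maps onto Deepgram's pre-recorded transcription endpoint, which takes the model and formatting options as query parameters and raw audio as the request body. The helper below only assembles the request; the endpoint shape follows Deepgram's public API, but verify field names against their current docs before relying on it:

```python
import os
from urllib.parse import urlencode

DEEPGRAM_LISTEN_URL = "https://api.deepgram.com/v1/listen"

def build_deepgram_request(model: str = "nova-2",
                           smart_format: bool = True,
                           language: str = "en") -> tuple:
    """Assemble the transcription URL and headers mirroring the YAML above."""
    params = urlencode({
        "model": model,
        "smart_format": str(smart_format).lower(),
        "language": language,
    })
    headers = {
        "Authorization": f"Token {os.environ.get('DEEPGRAM_API_KEY', '')}",
        "Content-Type": "audio/wav",
    }
    return f"{DEEPGRAM_LISTEN_URL}?{params}", headers

# Actual call (requires the `requests` package and a valid key):
# url, headers = build_deepgram_request()
# resp = requests.post(url, headers=headers, data=open("note.wav", "rb").read())
```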
The model parameter for Whisper directly trades accuracy against speed. The tiny model transcribes in under a second but drops accuracy to ~88%. The large-v3 model hits ~97% accuracy but takes 3-5 seconds on a Mac Mini’s M-series chip. We’ve found base or small hits the sweet spot for conversational dictation — fast enough to feel natural, accurate enough to avoid frustrating misinterpretations.
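To make that trade-off concrete, here is a small helper that picks the most accurate Whisper model within a latency budget. The accuracy and latency numbers are the rough figures quoted above, not benchmarks; treat the table as illustrative:

```python
# Illustrative (accuracy, typical seconds per utterance) per Whisper model,
# taken from the approximate figures discussed above.
WHISPER_MODELS = {
    "tiny":     (0.88, 0.8),
    "base":     (0.95, 1.5),
    "small":    (0.96, 2.0),
    "large-v3": (0.97, 4.0),
}

def pick_whisper_model(max_latency_s: float, min_accuracy: float = 0.0) -> str:
    """Return the most accurate model that fits the latency budget."""
    candidates = [
        (acc, name) for name, (acc, lat) in WHISPER_MODELS.items()
        if lat <= max_latency_s and acc >= min_accuracy
    ]
    if not candidates:
        raise ValueError("no Whisper model fits the constraints")
    return max(candidates)[1]
```

For example, a 2-second budget lands on `small`, which matches the "sweet spot" recommendation above.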
After configuring, you can test STT immediately with Telegram voice notes or OpenClaw’s built-in Talk Mode on desktop. Send a voice note through Telegram or WhatsApp and confirm the transcription matches your speech.
What Are the Best TTS Options for Agent Output?
ElevenLabs produces the most natural-sounding voice output available today. Their Turbo v2.5 model generates speech that’s nearly indistinguishable from human conversation — a Stanford study from late 2025 found listeners correctly identified ElevenLabs audio as synthetic only 52% of the time, barely above chance.
Here’s the TTS provider comparison:
| Provider | Quality | Latency | Cost | Notable Feature |
|---|---|---|---|---|
| ElevenLabs | Best | ~500ms | ~$0.18/1K chars | Voice cloning, 30+ languages |
| Deepgram Aura-2 | Good | <250ms | ~$0.015/1K chars | Lowest latency |
| OpenAI TTS | Good | ~400ms | ~$0.015/1K chars | Balanced option |
| Edge TTS | Decent | ~300ms | Free | Microsoft’s free tier |
If your agent is reading back a morning briefing — market moves, calendar summary, overnight emails — quality matters more than latency. ElevenLabs is the right call. If you’re in a fast back-and-forth conversation where delays break the flow, Deepgram Aura-2’s sub-250ms response time keeps things feeling real-time.
Edge TTS is free and worth considering if you’re running a hosted deployment and want voice without the per-character cost. The quality is noticeably more robotic, but it’s perfectly usable for straightforward read-backs.
How Do I Configure TTS in OpenClaw?
TTS configuration follows the same pattern as STT — a provider block in openclaw.yaml with API credentials and your preferred voice. Here’s ElevenLabs:
```yaml
# openclaw.yaml — ElevenLabs TTS
tts:
  provider: elevenlabs
  api_key: "${ELEVENLABS_API_KEY}"
  voice_id: "pNInz6obpgDQGcFmaJgB"   # "Adam" — deep, professional male
  model: eleven_turbo_v2_5
  stability: 0.5
  similarity_boost: 0.75
  output_format: mp3_44100_128
```
And Deepgram Aura-2:
```yaml
# openclaw.yaml — Deepgram Aura-2 TTS
tts:
  provider: deepgram
  api_key: "${DEEPGRAM_API_KEY}"
  model: aura-2
  voice: asteria-en   # Professional female voice
  encoding: mp3
  sample_rate: 24000
```
The voice_id for ElevenLabs determines your agent’s personality. ElevenLabs offers a library of pre-built voices, but you can also clone a custom voice from a 30-second sample. Some of our clients use a neutral professional voice; others prefer something warmer. It’s personal preference — the key is testing with real responses your agent would actually deliver.
The stability parameter controls how consistent the voice sounds across responses. Higher values (0.7+) give predictable, steady output. Lower values (0.3-0.5) add variation that sounds more human but can occasionally produce odd inflections. We default to 0.5 for a good balance.
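For testing voices outside OpenClaw, the same settings map onto ElevenLabs' public text-to-speech endpoint. The helper below only assembles the request so you can inspect it; the URL shape, `xi-api-key` header, and `voice_settings` fields follow ElevenLabs' documented API, but confirm them against current docs before production use:

```python
import os

ELEVENLABS_TTS_URL = "https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"

def build_tts_request(text: str, voice_id: str,
                      stability: float = 0.5,
                      similarity_boost: float = 0.75) -> tuple:
    """Assemble URL, headers, and JSON body mirroring the YAML settings above."""
    url = ELEVENLABS_TTS_URL.format(voice_id=voice_id)
    headers = {
        "xi-api-key": os.environ.get("ELEVENLABS_API_KEY", ""),
        "Content-Type": "application/json",
    }
    body = {
        "text": text,
        "model_id": "eleven_turbo_v2_5",
        "voice_settings": {
            "stability": stability,
            "similarity_boost": similarity_boost,
        },
    }
    return url, headers, body

# Actual synthesis (requires the `requests` package and a valid key):
# url, headers, body = build_tts_request("Good morning.", "pNInz6obpgDQGcFmaJgB")
# audio_bytes = requests.post(url, headers=headers, json=body).content
```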
Can I Actually Call My OpenClaw Agent on the Phone?
Yes. DeepClaw, an open-source project by Deepgram, bridges your OpenClaw agent to a real phone number through the Deepgram Voice Agent API. You call the number, speak your request, and hear your agent respond — just like calling an assistant.
DeepClaw handles the hard parts of phone integration: WebSocket communication between the phone network and OpenClaw, turn-taking so the agent knows when you’ve finished speaking, and barge-in support so you can interrupt a long response without waiting.
According to Deepgram’s engineering blog, voice is now a first-class citizen in OpenClaw. DeepClaw was purpose-built for OpenClaw’s agent protocol and supports the full tool-use pipeline — meaning your agent can still pull calendar data, check email, or run financial queries mid-call.
Here’s the basic setup:
```bash
# Clone DeepClaw
git clone https://github.com/deepgram/deepclaw.git
cd deepclaw

# Configure environment
cp .env.example .env
# Add your DEEPGRAM_API_KEY and OPENCLAW_URL to .env

# Start the bridge
docker compose up -d
```
Once running, DeepClaw provisions a phone number through Deepgram’s Voice Agent API. Call that number, and you’re talking to your OpenClaw agent. The round-trip latency — your voice to the agent’s spoken response — typically runs 1.5-3 seconds depending on the complexity of the request and your TTS provider.
This is particularly powerful for executives who travel frequently. An OpenClaw agent running on a MacBook Air with DeepClaw configured gives you a private AI assistant reachable by phone from anywhere in the world.
What Does a Voice Conversation Actually Sound Like?
Here’s a real interaction pattern from a client using Talk Mode on desktop:
```text
Executive (voice): "What's our burn rate looking like this month
compared to last quarter's average?"

Agent (voice): "Your March burn rate is $387,000, which is 6% below
Q4's average of $412,000. The drop is mainly from the engineering
headcount freeze — you saved $31,000 in contractor costs. Want me
to pull the full variance breakdown?"

Executive (voice): "Yeah, and flag anything over 10% deviation."

Agent (voice): "Done. Three line items flagged: cloud infrastructure
is up 14% from the AWS migration, legal is up 22% from the Series B
docs, and travel is down 31% from the conference policy change.
I've sent the full report to your email."
```
That entire exchange takes under 20 seconds. The same workflow through a traditional BI dashboard — logging in, navigating to the right report, filtering by date range, mentally comparing numbers — takes 5-10 minutes minimum.
How Private Is Voice Interaction with OpenClaw?
Privacy depends entirely on your configuration choices. Running Whisper locally for STT means your spoken words never leave your machine. The audio file is transcribed on-device and immediately discarded. No cloud provider ever hears your voice.
TTS is trickier. Generating speech requires sending your agent’s text response to ElevenLabs, Deepgram, or OpenAI’s servers. The content of the response — which might include financial data, deal terms, or personnel information — travels over an encrypted connection but does exist momentarily on the provider’s infrastructure.
For executives handling highly sensitive information, we recommend a hybrid approach:
- STT: Whisper local (voice input stays on-device)
- TTS: Edge TTS for routine responses (free, Microsoft infrastructure)
- TTS upgrade: ElevenLabs for high-quality read-backs when content isn’t sensitive
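That routing decision can be automated. The sketch below is one hypothetical way to implement the hybrid policy: a naive keyword check sends sensitive responses to free, Microsoft-hosted Edge TTS and everything else to ElevenLabs. Real deployments would use a classifier or per-request metadata rather than substring matching, so treat this purely as an illustration:

```python
def pick_tts_provider(text: str, sensitive_terms: set) -> str:
    """Route sensitive responses to Edge TTS, everything else to ElevenLabs.

    Illustrative keyword check only -- substring matching is a crude
    stand-in for a proper sensitivity classifier.
    """
    lowered = text.lower()
    if any(term in lowered for term in sensitive_terms):
        return "edge-tts"
    return "elevenlabs"
```

A call like `pick_tts_provider("Series B deal terms attached", {"deal terms"})` would route to Edge TTS, while a routine calendar read-back goes to ElevenLabs.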
If you’ve added the private on-device LLM option to your deployment, you can achieve a fully air-gapped voice pipeline: Whisper for STT, a local model like Nemotron for reasoning, and Edge TTS for output. Nothing leaves your hardware. That level of sovereignty is rare in AI today — a point Jensen Huang made when he compared OpenClaw to foundational infrastructure like Linux and Kubernetes.
For a complete picture of how we handle data isolation, see our security hardening checklist.
What Hardware Do I Need for Voice?
Any beeeowl hardware deployment handles voice without issue. The Mac Mini’s M-series chip runs Whisper’s base model in real time with CPU to spare. The MacBook Air handles it identically — Apple Silicon is exceptionally efficient for on-device inference.
For hosted deployments, STT and TTS processing happens through cloud APIs, so server specs don’t matter for voice specifically. The LLM reasoning layer is the bottleneck, not the audio processing.
Here’s what we’ve benchmarked on actual client hardware:
| Hardware | Whisper Model | Transcription Speed | CPU Usage |
|---|---|---|---|
| Mac Mini M4 | base | 0.8s per 10s audio | ~15% |
| Mac Mini M4 | small | 1.4s per 10s audio | ~25% |
| MacBook Air M3 | base | 0.9s per 10s audio | ~18% |
| MacBook Air M3 | large-v3 | 4.2s per 10s audio | ~60% |
The large-v3 model on a MacBook Air is usable but pushes the fan. For portable use — airports, hotel rooms, car — we stick with base or small to keep the machine quiet and cool.
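A useful way to read the table is the real-time factor (RTF): processing time divided by audio duration. Anything under 1.0 keeps pace with live speech. Running the table's numbers through that formula:

```python
# Benchmark figures from the table above: seconds of processing per 10 s of audio.
BENCHMARKS = {
    ("Mac Mini M4", "base"): 0.8,
    ("Mac Mini M4", "small"): 1.4,
    ("MacBook Air M3", "base"): 0.9,
    ("MacBook Air M3", "large-v3"): 4.2,
}

def real_time_factor(processing_s: float, audio_s: float = 10.0) -> float:
    """RTF < 1.0 means transcription keeps pace with live audio."""
    return processing_s / audio_s

for (hw, model), secs in BENCHMARKS.items():
    rtf = real_time_factor(secs)
    status = "real-time" if rtf < 1.0 else "lagging"
    print(f"{hw} / {model}: RTF {rtf:.2f} ({status})")
```

Even large-v3 on the MacBook Air comes in at an RTF of 0.42 — comfortably real-time on throughput, though the sustained load is what spins the fan.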
How Do I Test Voice After Setup?
Start with Telegram voice notes. Send a 5-10 second voice message to your OpenClaw agent’s Telegram bot and verify three things: the transcription is accurate (check the agent’s log), the response is correct, and the TTS audio sounds natural when played back.
Next, try Talk Mode if you’re using OpenClaw’s web interface. Click the microphone icon, speak your request, and listen to the response. This tests the full round-trip without any messaging platform in the middle.
Finally, if you’ve configured DeepClaw, call the provisioned phone number. Have a real conversation — ask for your calendar, request a financial summary, dictate an email. Phone adds network latency that doesn’t exist in Talk Mode, so this is the true stress test.
Common issues we see during setup:
- Whisper transcription errors: usually means the model is too small. Move from `tiny` to `base` or `small`.
- TTS audio cuts off: output format mismatch. Make sure your `output_format` matches what your playback channel expects.
- DeepClaw connection drops: check that your OpenClaw instance is reachable from DeepClaw’s Docker container. Firewall rules and Docker sandboxing configurations are the usual culprits.
What’s Next for Voice in OpenClaw?
The voice ecosystem around OpenClaw is moving fast. Deepgram shipped Aura-2 in early 2026 with 40% lower latency than the original. ElevenLabs launched real-time conversational voice with under 300ms end-to-end. OpenAI’s Advanced Voice mode demonstrated that multi-modal agents — seeing, hearing, and speaking simultaneously — are technically feasible today.
We expect full duplex voice (both sides talking simultaneously, like a real phone call) to land in OpenClaw’s core within the next two quarters. DeepClaw already supports basic barge-in, but true duplex will make the experience feel indistinguishable from talking to a human assistant.
For now, the stack described in this guide — Whisper for private STT, ElevenLabs or Deepgram for TTS, DeepClaw for phone access — gives you a voice-enabled AI agent that actually works in the real world. Not a demo. Not a prototype. A production system you can call from your car at 7 AM and trust with real numbers.
Voice configuration is included in every beeeowl deployment package. We set up your preferred STT and TTS providers, configure voice IDs, optimize latency for your hardware, and integrate DeepClaw for phone access if needed. One-time investment starts at $2,000 for hosted or $5,000 with hardware included. Request Your Deployment and we’ll have your agent talking back within a week.


