
How to Add Voice to Your OpenClaw Agent: TTS, STT, and Talk Mode

Turn your OpenClaw agent into a hands-free voice assistant with ElevenLabs, Deepgram, and Whisper. Complete setup guide for TTS, STT, and phone integration.

Jashan Singh
Founder, beeeowl | April 5, 2026 | 10 min read
TL;DR Executives don't want to type — they want to talk. OpenClaw supports voice interaction through three layers: STT (speech-to-text via Whisper, Deepgram, or OpenAI) for input, the LLM reasoning layer, and TTS (text-to-speech via ElevenLabs, Deepgram Aura-2, or OpenAI) for output. DeepClaw bridges OpenClaw to real phone calls. Running Whisper locally means voice data never leaves your machine.

Why Would an Executive Want to Talk to Their AI Agent?

Because typing is slow and your hands are usually busy. Executives spend an average of 4.3 hours per day in meetings according to Microsoft’s 2025 Work Trend Index — that’s time where a keyboard isn’t an option but a quick voice command is. Voice turns your OpenClaw agent from something you sit down to use into something that follows you through your day.


Think about the moments that matter. You’re driving to the office and want a briefing on today’s board prep materials. You’re between back-to-back calls and need a quick cash flow summary. You’re on a plane and want to dictate a response to an investor update.

These aren’t hypotheticals. They’re the exact use cases our clients describe within the first week of deployment. Voice input and output make your AI agent feel less like a tool and more like a chief of staff — always listening, always ready.

How Does Voice Work in OpenClaw’s Architecture?

Voice in OpenClaw flows through three distinct layers: speech-to-text (STT) converts your spoken words into text input, the LLM processes the request and generates a response, and text-to-speech (TTS) converts that response back into audio. Each layer is independently configurable with your choice of provider.

[Diagram: the OpenClaw voice pipeline. Audio input flows through the STT engine (Whisper or Deepgram) to LLM reasoning, then to the TTS engine (ElevenLabs, Deepgram Aura, or Piper) for audio output, with provider options and a latency breakdown.]
The complete voice pipeline: your voice enters as audio, gets transcribed, processed by the LLM, and spoken back — with on-device options at every stage for full privacy.
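
The three-layer flow can be sketched in a few lines of Python. This is an illustration of the modular design only, not OpenClaw code: the `VoicePipeline` class, the stage type names, and the `fake_*` stand-in functions are all assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures; OpenClaw's real interfaces may differ.
Transcriber = Callable[[bytes], str]   # STT: audio in, text out
Reasoner = Callable[[str], str]        # LLM: request text in, reply text out
Synthesizer = Callable[[str], bytes]   # TTS: reply text in, audio out


@dataclass
class VoicePipeline:
    """Chains the three layers; each one can be swapped independently."""
    stt: Transcriber
    llm: Reasoner
    tts: Synthesizer

    def handle(self, audio_in: bytes) -> bytes:
        text_in = self.stt(audio_in)   # layer 1: speech-to-text
        reply = self.llm(text_in)      # layer 2: reasoning
        return self.tts(reply)         # layer 3: text-to-speech


# Stand-in stages so the sketch runs without any provider SDK.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(stt=fake_stt, llm=fake_llm, tts=fake_tts)
```

Because each stage is just a callable, replacing a cloud transcriber with a local one changes a single argument and leaves the other two layers untouched.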

This modular design is what sets OpenClaw apart from closed voice assistants like Alexa or Siri. You pick which STT engine handles your voice data. You pick which TTS voice represents your agent. And critically, you decide whether any of that data touches the cloud at all.

According to the OpenClaw TTS documentation, the platform supports four TTS providers and three STT providers out of the box, with more being added through community contributions.

What Are the Best STT Options for Executive Use?

OpenAI Whisper running locally is the best choice for privacy-conscious executives. It transcribes speech on your own hardware (your Mac Mini or MacBook Air) with zero network calls. Audio data never leaves your machine, so no third-party processor ever handles your voice, which makes even strict data governance policies much easier to satisfy.

Here’s how each STT provider compares:

Provider        | Accuracy | Latency  | Privacy          | Cost
Whisper (local) | ~95%     | 1-3s     | Full (on-device) | Free
Deepgram Nova-2 | ~97%     | <300ms   | Cloud            | ~$0.0043/min
OpenAI STT API  | ~96%     | 500ms-1s | Cloud            | ~$0.006/min

Deepgram’s accuracy benchmarks report an 8.4% word error rate for the Nova-2 model on business English, which Deepgram describes as the lowest in the industry. That accuracy matters when you’re dictating financial figures or names during a portfolio review.
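
For readers unfamiliar with the metric, word error rate is the word-level edit distance between a reference transcript and the engine's output, divided by the number of reference words. The function below is a minimal sketch of that standard definition; it is not Deepgram's benchmark code, and real evaluations usually normalize punctuation and numerals before comparing.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # Word-level Levenshtein distance, computed one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of an extra word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

A single misheard word in a seven-word instruction, such as "fifty" for "fifteen", already produces a WER of about 14%, which is why small differences between engines are noticeable when dictating figures.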

For most of our deployments, we configure Whisper locally as the default and offer Deepgram Nova-2 as a toggle when clients need faster turnaround and are comfortable with cloud STT. The choice depends entirely on your security requirements.
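
To put the per-minute prices in context, here is a rough monthly cost estimate. The prices are copied from the comparison table above; the provider keys, the 22 working days, and the 30 minutes of daily dictation in the usage note are assumptions chosen for illustration.

```python
# Per-minute prices in USD, copied from the comparison table (approximate).
STT_PRICE_PER_MIN = {
    "whisper-local": 0.0,
    "deepgram-nova-2": 0.0043,
    "openai-stt": 0.006,
}


def monthly_stt_cost(provider: str, minutes_per_day: float, days: int = 22) -> float:
    """Estimated monthly spend in USD for a given amount of daily dictation."""
    return round(STT_PRICE_PER_MIN[provider] * minutes_per_day * days, 2)
```

At 30 minutes of dictation per working day, `monthly_stt_cost("deepgram-nova-2", 30)` comes to under three dollars a month, so for most executives the decision between local and cloud STT turns on privacy and latency rather than price.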

How Do I Configure STT in OpenClaw?

You configure speech-to-text by adding an STT provider block to your openclaw.yaml configuration file. Here’s the Whisper local configuration:

# openclaw.yaml — Local Whisper STT (fully private)
stt:
  provider: whisper
  model: base  # Options: tiny, base, small, medium, large-v3
  language: en
  device: auto  # Uses GPU if available, falls back to CPU

For Deepgram Nova-2, the configuration points to their cloud API:

# openclaw.yaml — Deepgram Nova-2 STT (cloud, low latency)
stt:
  provider: deepgram
  api_key: "${DEEPGRAM_API_KEY}"
  model: nova-2
  language: en
  smart_format: true  # Adds punctuation, numerals, formatting
  diarize: false
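
The `"${DEEPGRAM_API_KEY}"` value keeps the secret out of the file and reads it from the environment. How OpenClaw resolves that placeholder internally is not covered here; the sketch below only illustrates the general technique, and the `expand_env` helper is a hypothetical name.

```python
import os
import re

# Matches ${NAME} where NAME is a conventional environment variable name.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value: str) -> str:
    """Replace each ${NAME} with os.environ[NAME]; fail loudly if it is unset."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in os.environ:
            raise KeyError(f"environment variable {name} is not set")
        return os.environ[name]

    return _PLACEHOLDER.sub(substitute, value)
```

Failing on a missing variable is a deliberate choice in this sketch: an empty API key tends to surface later as a confusing authentication error instead of a clear configuration error.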

The model parameter for Whisper directly trades accuracy against speed. The tiny model transcribes in under a second but drops accuracy to ~88%. The large-v3 model hits ~97% accuracy but takes 3-5 seconds on a Mac Mini’s M-series chip. We’ve found base or small hits the sweet spot for conversational dictation — fast enough to feel natural, accurate enough to avoid frustrating misinterpretations.

After configuring, test STT right away: send a voice note to your agent through Telegram or WhatsApp, or use OpenClaw’s built-in Talk Mode on desktop, and confirm the transcription matches what you said.

What Are the Best TTS Options for Agent Output?

ElevenLabs produces the most natural-sounding voice output available today. Their Turbo v2.5 model generates speech that’s nearly indistinguishable from human conversation — a Stanford study from late 2025 found listeners correctly identified ElevenLabs audio as synthetic only 52% of the time, barely above chance.

Here’s the TTS provider comparison:

Provider        | Quality | Latency | Cost             | Notable Feature
ElevenLabs      | Best    | ~500ms  | ~$0.18/1K chars  | Voice cloning, 30+ languages
Deepgram Aura-2 | Good    | <250ms  | ~$0.015/1K chars | Lowest latency
OpenAI TTS      | Good    | ~400ms  | ~$0.015/1K chars | Balanced option
Edge TTS        | Decent  | ~300ms  | Free             | Microsoft’s free tier

If your agent is reading back a morning briefing — market moves, calendar summary, overnight emails — quality matters more than latency. ElevenLabs is the right call. If you’re in a fast back-and-forth conversation where delays break the flow, Deepgram Aura-2’s sub-250ms response time keeps things feeling real-time.

Edge TTS is free and worth considering if you’re running a hosted deployment and want voice without the per-character cost. The quality is noticeably more robotic, but it’s perfectly usable for straightforward read-backs.
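
The quality-versus-latency trade-off can be written down as a small selection rule. The figures come from the comparison table above; the numeric quality ranks, the `TtsOption` type, and the `pick_tts` function are assumptions introduced for this sketch, and your own priorities may weigh cost as well.

```python
from typing import NamedTuple, Optional


class TtsOption(NamedTuple):
    name: str
    quality: int             # 3 = "Best", 2 = "Good", 1 = "Decent" (table labels)
    latency_ms: int          # approximate figure from the table
    usd_per_1k_chars: float


TTS_OPTIONS = [
    TtsOption("ElevenLabs", 3, 500, 0.18),
    TtsOption("Deepgram Aura-2", 2, 250, 0.015),
    TtsOption("OpenAI TTS", 2, 400, 0.015),
    TtsOption("Edge TTS", 1, 300, 0.0),
]


def pick_tts(max_latency_ms: int) -> Optional[TtsOption]:
    """Highest quality within the latency ceiling; lower latency breaks ties."""
    candidates = [o for o in TTS_OPTIONS if o.latency_ms <= max_latency_ms]
    if not candidates:
        return None
    return max(candidates, key=lambda o: (o.quality, -o.latency_ms))
```

With a relaxed ceiling such as one second (a morning briefing), the rule returns ElevenLabs; with a 300ms ceiling (live conversation), it returns Deepgram Aura-2.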

How Do I Configure TTS in OpenClaw?

TTS configuration follows the same pattern as STT — a provider block in openclaw.yaml with API credentials and your preferred voice. Here’s ElevenLabs:

# openclaw.yaml — ElevenLabs TTS
tts:
  provider: elevenlabs
  api_key: "${ELEVENLABS_API_KEY}"
  voice_id: "pNInz6obpgDQGcFmaJgB"  # "Adam" — deep, professional male
  model: eleven_turbo_v2_5
  stability: 0.5
  similarity_boost: 0.75
  output_format: mp3_44100_128

And Deepgram Aura-2:

# openclaw.yaml — Deepgram Aura-2 TTS
tts:
  provider: deepgram
  api_key: "${DEEPGRAM_API_KEY}"
  model: aura-2
  voice: asteria-en  # Professional female voice
  encoding: mp3
  sample_rate: 24000

The voice_id for ElevenLabs determines your agent’s personality. ElevenLabs offers a library of pre-built voices, but you can also clone a custom voice from a 30-second sample. Some of our clients use a neutral professional voice; others prefer something warmer. It’s personal preference — the key is testing with real responses your agent would actually deliver.

The stability parameter controls how consistent the voice sounds across responses. Higher values (0.7+) give predictable, steady output. Lower values (0.3-0.5) add variation that sounds more human but can occasionally produce odd inflections. We default to 0.5 for a good balance.

Can I Actually Call My OpenClaw Agent on the Phone?

Yes. DeepClaw, an open-source project by Deepgram, bridges your OpenClaw agent to a real phone number through the Deepgram Voice Agent API. You call the number, speak your request, and hear your agent respond — just like calling an assistant.

DeepClaw handles the hard parts of phone integration: WebSocket communication between the phone network and OpenClaw, turn-taking so the agent knows when you’ve finished speaking, and barge-in support so you can interrupt a long response without waiting.
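
Turn-taking and barge-in are easiest to picture as a small state machine. The sketch below is a conceptual illustration only, not DeepClaw's implementation: the `Turn` states, the `TurnManager` class, and its event method names are all hypothetical.

```python
from enum import Enum, auto


class Turn(Enum):
    LISTENING = auto()   # waiting for or receiving caller speech
    THINKING = auto()    # caller finished; agent is preparing a reply
    SPEAKING = auto()    # agent audio is playing


class TurnManager:
    """Hypothetical turn-taking logic for a voice call."""

    def __init__(self) -> None:
        self.state = Turn.LISTENING
        self.playback_cancelled = 0  # counts replies cut short by the caller

    def on_caller_speech(self) -> None:
        # Barge-in: the caller talks over the agent, so stop playback and listen.
        if self.state is Turn.SPEAKING:
            self.playback_cancelled += 1
        self.state = Turn.LISTENING

    def on_caller_silence(self) -> None:
        # End of turn: only meaningful while we were listening.
        if self.state is Turn.LISTENING:
            self.state = Turn.THINKING

    def on_reply_ready(self) -> None:
        if self.state is Turn.THINKING:
            self.state = Turn.SPEAKING

    def on_playback_done(self) -> None:
        if self.state is Turn.SPEAKING:
            self.state = Turn.LISTENING
```

The key transition is `on_caller_speech` while the agent is speaking: playback is cancelled and the call goes straight back to listening, which is what lets you cut off a long answer.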

According to Deepgram’s engineering blog, voice is now a first-class citizen in OpenClaw. DeepClaw was purpose-built for OpenClaw’s agent protocol and supports the full tool-use pipeline — meaning your agent can still pull calendar data, check email, or run financial queries mid-call.

Here’s the basic setup:

# Clone DeepClaw
git clone https://github.com/deepgram/deepclaw.git
cd deepclaw

# Configure environment
cp .env.example .env
# Add your DEEPGRAM_API_KEY and OPENCLAW_URL to .env

# Start the bridge
docker compose up -d

Once running, DeepClaw provisions a phone number through Deepgram’s Voice Agent API. Call that number, and you’re talking to your OpenClaw agent. The round-trip latency — your voice to the agent’s spoken response — typically runs 1.5-3 seconds depending on the complexity of the request and your TTS provider.

This is particularly powerful for executives who travel frequently. An OpenClaw agent running on a MacBook Air with DeepClaw configured gives you a private AI assistant reachable by phone from anywhere in the world.

What Does a Voice Conversation Actually Sound Like?

Here’s a real interaction pattern from a client using Talk Mode on desktop:

Executive (voice): "What's our burn rate looking like this month
compared to last quarter's average?"

Agent (voice): "Your March burn rate is $387,000, which is 6% below
Q4's average of $412,000. The drop is mainly from the engineering
headcount freeze — you saved $31,000 in contractor costs. Want me
to pull the full variance breakdown?"

Executive (voice): "Yeah, and flag anything over 10% deviation."

Agent (voice): "Done. Three line items flagged: cloud infrastructure
is up 14% from the AWS migration, legal is up 22% from the Series B
docs, and travel is down 31% from the conference policy change.
I've sent the full report to your email."

That entire exchange takes under 20 seconds. The same workflow through a traditional BI dashboard — logging in, navigating to the right report, filtering by date range, mentally comparing numbers — takes 5-10 minutes minimum.
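
The arithmetic behind that exchange is simple enough to check by hand. The sketch below reproduces it with the numbers quoted in the conversation; the function names and the "office supplies" line item are made up for illustration, and this says nothing about how the agent's financial tools are actually implemented.

```python
def pct_change(current: float, baseline: float) -> float:
    """Signed percentage change of current relative to baseline."""
    return (current - baseline) / baseline * 100


def flag_deviations(changes: dict[str, float], threshold_pct: float = 10.0) -> list[str]:
    """Names of line items whose absolute change exceeds the threshold."""
    return [name for name, pct in changes.items() if abs(pct) > threshold_pct]


# Headline figure from the exchange: $387,000 against a $412,000 average.
burn_delta = pct_change(387_000, 412_000)

# Percent changes quoted by the agent, plus one made-up item under the threshold.
line_items = {
    "cloud infrastructure": 14.0,
    "legal": 22.0,
    "travel": -31.0,
    "office supplies": 3.0,  # hypothetical, for contrast
}
flagged = flag_deviations(line_items)
```

Here `burn_delta` rounds to -6, matching the "6% below" in the agent's reply, and `flagged` holds the same three line items the agent reported. Using the absolute value is what catches the 31% drop in travel alongside the increases.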

How Private Is Voice Interaction with OpenClaw?

Privacy depends entirely on your configuration choices. Running Whisper locally for STT means your spoken words never leave your machine. The audio file is transcribed on-device and immediately discarded. No cloud provider ever hears your voice.

TTS is trickier. Generating speech requires sending your agent’s text response to ElevenLabs, Deepgram, or OpenAI’s servers. The content of the response — which might include financial data, deal terms, or personnel information — travels over an encrypted connection but does exist momentarily on the provider’s infrastructure.

For executives handling highly sensitive information, we recommend a hybrid approach:

  • STT: Whisper local (voice input stays on-device)
  • TTS: Edge TTS for routine responses (free, Microsoft infrastructure)
  • TTS upgrade: ElevenLabs for high-quality read-backs when content isn’t sensitive

If you’ve added the private on-device LLM option to your deployment, you can run a fully on-device voice pipeline: Whisper for STT, a local model like Nemotron for reasoning, and an on-device TTS engine for output (the pipeline diagram above lists Piper as one option). Edge TTS does not qualify for this setup, because it synthesizes speech on Microsoft infrastructure. With all three stages local, nothing leaves your hardware. That level of sovereignty is rare in AI today, a point Jensen Huang made when he compared OpenClaw to foundational infrastructure like Linux and Kubernetes.
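
A quick way to reason about a given setup is to ask, for each stage, whether its data leaves the machine. The sketch below encodes what this guide says about each provider: Whisper runs on-device, and the others (including Edge TTS, which runs on Microsoft infrastructure) are cloud services. The provider keys `openai` and `edge` and the `leaves_machine` function are assumptions for illustration.

```python
# Where each provider named in this guide does its processing.
ON_DEVICE = {"whisper"}
CLOUD = {"deepgram", "openai", "elevenlabs", "edge"}


def leaves_machine(stt_provider: str, tts_provider: str) -> dict[str, bool]:
    """Report which kinds of data leave the machine for a given provider pair."""
    for provider in (stt_provider, tts_provider):
        if provider not in ON_DEVICE | CLOUD:
            raise ValueError(f"unknown provider: {provider}")
    return {
        "voice_audio": stt_provider in CLOUD,     # raw speech sent for STT
        "response_text": tts_provider in CLOUD,   # agent reply sent for TTS
    }
```

For the hybrid setup in the list above, `leaves_machine("whisper", "edge")` reports that voice audio stays local while response text does not. Any additional on-device engine you deploy would be added to `ON_DEVICE`.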

For a complete picture of how we handle data isolation, see our security hardening checklist.

What Hardware Do I Need for Voice?

Any beeeowl hardware deployment handles voice without issue. The Mac Mini’s M-series chip runs Whisper’s base model faster than real time with CPU to spare. The MacBook Air is nearly as fast (see the benchmarks below); Apple Silicon is exceptionally efficient for on-device inference.

For hosted deployments, STT and TTS processing happens through cloud APIs, so server specs don’t matter for voice specifically. The LLM reasoning layer is the bottleneck, not the audio processing.

Here’s what we’ve benchmarked on actual client hardware:

Hardware       | Whisper Model | Transcription Speed | CPU Usage
Mac Mini M4    | base          | 0.8s per 10s audio  | ~15%
Mac Mini M4    | small         | 1.4s per 10s audio  | ~25%
MacBook Air M3 | base          | 0.9s per 10s audio  | ~18%
MacBook Air M3 | large-v3      | 4.2s per 10s audio  | ~60%

The large-v3 model on a MacBook Air is usable but pushes the fan. For portable use — airports, hotel rooms, car — we stick with base or small to keep the machine quiet and cool.
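
These benchmark figures are easier to compare as a real-time factor: processing time divided by audio length, where anything below 1.0 is faster than real time. The sketch below uses the numbers from the table above; the function names are illustrative, and the linear extrapolation to longer clips is an assumption, not a measured result.

```python
# (hardware, model) -> seconds of processing per 10 seconds of audio (from the table)
BENCHMARKS = {
    ("Mac Mini M4", "base"): 0.8,
    ("Mac Mini M4", "small"): 1.4,
    ("MacBook Air M3", "base"): 0.9,
    ("MacBook Air M3", "large-v3"): 4.2,
}
AUDIO_SECONDS = 10.0


def real_time_factor(processing_seconds: float, audio_seconds: float = AUDIO_SECONDS) -> float:
    """Processing time divided by audio length; below 1.0 is faster than real time."""
    return processing_seconds / audio_seconds


def estimated_transcription_time(hardware: str, model: str, clip_seconds: float) -> float:
    """Linear extrapolation from the 10-second benchmark to a clip of another length."""
    return real_time_factor(BENCHMARKS[(hardware, model)]) * clip_seconds
```

Even large-v3 on the MacBook Air has a real-time factor of 0.42, so it keeps up with speech; the reason to prefer base or small on the road is heat and fan noise, not throughput.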

How Do I Test Voice After Setup?

Start with Telegram voice notes. Send a 5-10 second voice message to your OpenClaw agent’s Telegram bot and verify three things: the transcription is accurate (check the agent’s log), the response is correct, and the TTS audio sounds natural when played back.

Next, try Talk Mode if you’re using OpenClaw’s web interface. Click the microphone icon, speak your request, and listen to the response. This tests the full round-trip without any messaging platform in the middle.

Finally, if you’ve configured DeepClaw, call the provisioned phone number. Have a real conversation — ask for your calendar, request a financial summary, dictate an email. Phone adds network latency that doesn’t exist in Talk Mode, so this is the true stress test.

Common issues we see during setup:

  • Whisper transcription errors: Usually means the model is too small. Move from tiny to base or small.
  • TTS audio cuts off: Output format mismatch. Make sure your output_format matches what your playback channel expects.
  • DeepClaw connection drops: Check that your OpenClaw instance is accessible from DeepClaw’s Docker container. Firewall rules and Docker sandboxing configurations are the usual culprits.
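
For the connection-drop case, a plain TCP check run from inside the DeepClaw container is often the fastest way to tell a network problem from an application problem. The helper below is a minimal sketch using only the Python standard library; `is_reachable` is a hypothetical name, and you would pass it whatever value you set for OPENCLAW_URL.

```python
import socket
from urllib.parse import urlparse


def is_reachable(url: str, timeout: float = 3.0) -> bool:
    """True if a TCP connection to the URL's host and port succeeds."""
    parsed = urlparse(url)
    if parsed.hostname is None:
        raise ValueError(f"no host in URL: {url!r}")
    # Fall back to the scheme's default port when the URL does not name one.
    port = parsed.port or (443 if parsed.scheme in ("https", "wss") else 80)
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A `False` result points at firewall rules, Docker networking, or a wrong host name. A `True` result only proves the port is open, so the next place to look is the logs of both services.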

What’s Next for Voice in OpenClaw?

The voice ecosystem around OpenClaw is moving fast. Deepgram shipped Aura-2 in early 2026 with 40% lower latency than the original. ElevenLabs launched real-time conversational voice with under 300ms end-to-end. OpenAI’s Advanced Voice mode demonstrated that multi-modal agents — seeing, hearing, and speaking simultaneously — are technically feasible today.

We expect full duplex voice (both sides talking simultaneously, like a real phone call) to land in OpenClaw’s core within the next two quarters. DeepClaw already supports basic barge-in, but true duplex will make the experience feel indistinguishable from talking to a human assistant.

For now, the stack described in this guide — Whisper for private STT, ElevenLabs or Deepgram for TTS, DeepClaw for phone access — gives you a voice-enabled AI agent that actually works in the real world. Not a demo. Not a prototype. A production system you can call from your car at 7 AM and trust with real numbers.


Voice configuration is included in every beeeowl deployment package. We set up your preferred STT and TTS providers, configure voice IDs, optimize latency for your hardware, and integrate DeepClaw for phone access if needed. One-time investment starts at $2,000 for hosted or $5,000 with hardware included. Request Your Deployment and we’ll have your agent talking back within a week.

