Voice & TTS

Pawz supports text-to-speech so agents can speak their responses aloud.

Setup

Go to Settings → Voice to configure TTS.

Providers

Google Cloud TTS

No API key needed; uses the free web endpoint.

Chirp 3 HD voices: Puck, Charon, Kore, Fenrir, Leda, Orus, Zephyr, Aoede, Callirhoe, Autonoe

Neural2 voices: en-US-Neural2-A through F

Journey voices: en-US-Journey-D, en-US-Journey-F, en-US-Journey-O

OpenAI TTS

Requires an OpenAI API key.

Voices: alloy, ash, coral, echo, fable, nova, onyx, sage, shimmer

ElevenLabs

Requires an `ELEVENLABS_API_KEY`.

Voices: Sarah, Charlie, George, Callum, Liam, Charlotte, Alice, Matilda, Will, Jessica, Eric, Chris, Brian, Daniel, Lily, Bill

Models:

| Model | Best for |
| --- | --- |
| `eleven_multilingual_v2` | Multi-language, highest quality |
| `eleven_turbo_v2_5` | Low latency, English-focused |
| `eleven_monolingual_v1` | English only, legacy |

Extra settings:

Stability (0–1, default 0.5) — higher = more consistent
Similarity boost (0–1, default 0.75) — higher = closer to reference voice
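
To make the two extra settings concrete, here is a minimal sketch of how a request body carrying them might be assembled. The helper names (`buildElevenLabsBody`, `clamp01`) are hypothetical, the default model chosen here is only illustrative, and the field names follow ElevenLabs' public text-to-speech API as commonly documented; how Pawz builds its own request is not described on this page.

```typescript
// Hypothetical shape of an ElevenLabs text-to-speech request body.
interface ElevenLabsBody {
  text: string;
  model_id: string;
  voice_settings: { stability: number; similarity_boost: number };
}

// Keep a value inside the documented 0–1 range.
function clamp01(value: number): number {
  return Math.min(1, Math.max(0, value));
}

// Builds the body, falling back to the defaults listed above.
// The default model is an assumption made for this example only.
function buildElevenLabsBody(
  text: string,
  modelId: string = "eleven_multilingual_v2",
  stability: number = 0.5,
  similarityBoost: number = 0.75,
): ElevenLabsBody {
  return {
    text,
    model_id: modelId,
    voice_settings: {
      stability: clamp01(stability),
      similarity_boost: clamp01(similarityBoost),
    },
  };
}
```

Calling `buildElevenLabsBody("Hello")` yields a body with stability 0.5 and similarity boost 0.75; out-of-range values are clamped rather than rejected, which is a design choice of this sketch.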

Settings

| Setting | Default | Description |
| --- | --- | --- |
| Provider | None | Google / OpenAI / ElevenLabs |
| Voice | None | Voice name from the selected provider |
| Speed | 1.0 | Playback speed multiplier |
| Language | None | Language code (13 supported) |
| Auto-speak | Off | Automatically speak every response |
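
The settings above could be modeled as a typed object with the two documented defaults filled in. This is only an illustrative sketch: the type and field names (`VoiceSettings`, `autoSpeak`, and so on) and the lowercase provider identifiers are assumptions, not Pawz's actual configuration schema.

```typescript
// Hypothetical identifiers for the three providers.
type TtsProvider = "google" | "openai" | "elevenlabs";

interface VoiceSettings {
  provider?: TtsProvider; // no documented default
  voice?: string;         // voice name from the selected provider
  speed: number;          // playback speed multiplier
  language?: string;      // language code
  autoSpeak: boolean;     // speak every response automatically
}

// Only Speed (1.0) and Auto-speak (Off) have documented defaults.
const DEFAULT_VOICE_SETTINGS: VoiceSettings = { speed: 1.0, autoSpeak: false };

// Overlay user choices on top of the defaults.
function resolveVoiceSettings(overrides: Partial<VoiceSettings>): VoiceSettings {
  return { ...DEFAULT_VOICE_SETTINGS, ...overrides };
}
```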

Speech-to-text (STT)

Pawz uses OpenAI Whisper for speech-to-text transcription:

| Backend | Setup | Latency | Cost |
| --- | --- | --- | --- |
| Whisper API | OpenAI API key (from Models settings) | ~1–2s | $0.006/min |
| Whisper Local | Install `whisper` binary | ~3–5s | Free |

STT is used in Talk Mode and any voice input feature. Audio is captured as WebM/Opus (or OGG fallback), base64-encoded, and sent to the Whisper endpoint for transcription.
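
As a rough illustration of the base64 step, the sketch below encodes raw audio bytes and wraps them in a payload. The payload shape (`mimeType`, `audioBase64`) and the function names are hypothetical. The encoder is written by hand only to keep the example free of platform APIs; in a browser, `FileReader.readAsDataURL` or `btoa` would be the more usual route.

```typescript
const BASE64_ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Standard base64 encoding: three input bytes become four output characters,
// with "=" padding when the input length is not a multiple of three.
function toBase64(bytes: Uint8Array): string {
  let out = "";
  for (let i = 0; i < bytes.length; i += 3) {
    const hasSecond = i + 1 < bytes.length;
    const hasThird = i + 2 < bytes.length;
    const b0 = bytes[i];
    const b1 = hasSecond ? bytes[i + 1] : 0;
    const b2 = hasThird ? bytes[i + 2] : 0;
    out += BASE64_ALPHABET[b0 >> 2];
    out += BASE64_ALPHABET[((b0 & 0x03) << 4) | (b1 >> 4)];
    out += hasSecond ? BASE64_ALPHABET[((b1 & 0x0f) << 2) | (b2 >> 6)] : "=";
    out += hasThird ? BASE64_ALPHABET[b2 & 0x3f] : "=";
  }
  return out;
}

// Hypothetical payload shape; the real field names are not documented here.
interface TranscriptionPayload {
  mimeType: string;
  audioBase64: string;
}

function buildTranscriptionPayload(
  audio: Uint8Array,
  mimeType: string,
): TranscriptionPayload {
  return { mimeType, audioBase64: toBase64(audio) };
}
```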

Audio capture settings

The microphone input uses these Web Audio constraints:

| Setting | Value |
| --- | --- |
| Echo cancellation | Enabled |
| Noise suppression | Enabled |
| Sample rate | 16 kHz |
| Format | `audio/webm;codecs=opus` (preferred) |
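
For readers who want to reproduce these capture settings, the sketch below expresses them as a constraints object in the shape accepted by `navigator.mediaDevices.getUserMedia()`, plus a small helper that picks the recording format. The constant and function names are hypothetical, and the exact OGG MIME string is an assumption, since this page only mentions an "OGG fallback".

```typescript
// Constraints mirroring the table above.
const MIC_CONSTRAINTS = {
  audio: {
    echoCancellation: true,
    noiseSuppression: true,
    sampleRate: 16000, // 16 kHz
  },
};

// Preferred format first. The OGG entry is an assumed fallback string.
const CANDIDATE_MIME_TYPES = ["audio/webm;codecs=opus", "audio/ogg;codecs=opus"];

// `isSupported` is injected so the helper can also run outside a browser.
// In a browser, pass (t) => MediaRecorder.isTypeSupported(t).
function pickMimeType(
  isSupported: (mimeType: string) => boolean,
): string | undefined {
  return CANDIDATE_MIME_TYPES.find((t) => isSupported(t));
}
```

If no candidate is supported, `pickMimeType` returns `undefined`, leaving the caller to decide whether to fall back to the browser's default format or to report an error.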

Voice activity detection (VAD)

Talk Mode includes built-in voice activity detection to avoid sending silence to the transcription API:

| Parameter | Value | Description |
| --- | --- | --- |
| Recording window | 8 seconds | Records in 8-second chunks |
| Minimum audio size | 8 KB | Chunks under 8 KB are treated as silence and skipped |
| Inter-cycle delay | 500 ms | Brief pause between recording cycles after errors |
| Empty transcript | Skipped | If Whisper returns blank text, the cycle restarts |

:::info
VAD works by checking the size of each recording chunk. Very short or silent recordings produce small files that are automatically discarded before being sent to Whisper.
:::
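
The size-based check described above amounts to a simple threshold comparison. The sketch below is an assumption about how it could be written, not Pawz's actual code: the function names are hypothetical, and it assumes 1 KB means 1024 bytes.

```typescript
// Minimum chunk size before audio is worth transcribing.
// Assumes 1 KB = 1024 bytes; the page does not say which definition is used.
const MIN_AUDIO_BYTES = 8 * 1024;

// Chunks strictly under the threshold are treated as silence.
function isLikelySilence(chunkSizeBytes: number): boolean {
  return chunkSizeBytes < MIN_AUDIO_BYTES;
}

// A transcript made only of whitespace counts as empty.
function isEmptyTranscript(text: string): boolean {
  return text.trim() === "";
}
```

A recording loop would skip the Whisper call when `isLikelySilence` is true, and skip the agent call when `isEmptyTranscript` is true for the returned text.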

Talk mode

Click the microphone icon in the chat header to enter talk mode. Your speech is transcribed and sent to the agent, and the response is spoken back. Talk mode requires one of the following:

Whisper API skill (OpenAI API key)
Whisper Local skill (install whisper binary)

How talk mode works

Listen — microphone captures audio in 8-second windows
Transcribe — audio is sent to Whisper STT → text
Process — transcribed text is sent to your agent as a chat message
Speak — agent’s response is synthesized via your configured TTS provider
Repeat — next recording cycle starts after playback finishes
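
The five steps above can be sketched as one asynchronous cycle. Everything in this example is hypothetical: the `TalkDeps` interface and its method names stand in for whatever Pawz uses internally, and silence detection and error handling are left out to keep the flow readable.

```typescript
// Hypothetical dependencies; none of these names come from Pawz itself.
interface TalkDeps {
  record(windowMs: number): Promise<Uint8Array>;  // capture one audio window
  transcribe(audio: Uint8Array): Promise<string>; // Whisper STT
  sendToAgent(text: string): Promise<string>;     // resolves with the agent's reply
  speak(text: string): Promise<void>;             // resolves when playback finishes
}

const RECORDING_WINDOW_MS = 8000; // 8-second windows

// Runs one Listen -> Transcribe -> Process -> Speak cycle and returns the reply.
async function runTalkCycle(deps: TalkDeps): Promise<string> {
  const audio = await deps.record(RECORDING_WINDOW_MS); // 1. Listen
  const text = await deps.transcribe(audio);            // 2. Transcribe
  const reply = await deps.sendToAgent(text);           // 3. Process
  await deps.speak(reply);                              // 4. Speak
  return reply;                                         // 5. Caller repeats
}
```

Because `speak` is awaited before the function returns, a caller that invokes `runTalkCycle` in a loop starts the next recording only after playback has finished, matching the last step.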

Voice command mode vs dictation mode

| Mode | Behavior | Use case |
| --- | --- | --- |
| Voice command (default) | Each utterance is sent as a standalone chat message | Giving instructions, asking questions |
| Dictation | Utterances are accumulated into a text buffer | Composing long-form content, emails |

In voice command mode, every 8-second recording window is independently transcribed and sent to the agent. The agent responds and the reply is spoken aloud before the next cycle begins.

To use dictation mode, start your utterance with “dictate” or “type”; the agent will accumulate your speech into a document rather than responding conversationally.
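
According to this page, the agent itself decides when an utterance is dictation. Purely to illustrate the trigger-word rule, the sketch below classifies a transcript on the client side. The names are hypothetical, and the matching rules (case-insensitive, whole word, optional punctuation after the trigger) are assumptions rather than documented behavior.

```typescript
type VoiceMode = "command" | "dictation";

interface ClassifiedUtterance {
  mode: VoiceMode;
  content: string; // transcript with any trigger word removed
}

// Matches a leading "dictate" or "type" as a whole word, ignoring case.
const DICTATION_TRIGGER = /^\s*(?:dictate|type)\b[\s,:]*([\s\S]*)$/i;

function classifyUtterance(transcript: string): ClassifiedUtterance {
  const match = DICTATION_TRIGGER.exec(transcript);
  if (match) {
    return { mode: "dictation", content: match[1].trim() };
  }
  return { mode: "command", content: transcript.trim() };
}
```

Whole-word matching keeps an utterance such as "typescript is great" in voice command mode, because "type" is only a prefix of the first word there.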
