Voice & TTS
Pawz supports text-to-speech so agents can speak their responses aloud.
Setup
Go to Settings → Voice to configure TTS.
Providers
Google Cloud TTS
No API key needed — uses the free web endpoint.
Chirp 3 HD voices:
Puck, Charon, Kore, Fenrir, Leda, Orus, Zephyr, Aoede, Callirhoe, Autonoe
Neural2 voices:
en-US-Neural2-A through F
Journey voices:
en-US-Journey-D, en-US-Journey-F, en-US-Journey-O
OpenAI TTS
Requires an OpenAI API key.
Voices: alloy, ash, coral, echo, fable, nova, onyx, sage, shimmer
ElevenLabs
Requires an ELEVENLABS_API_KEY.
Voices: Sarah, Charlie, George, Callum, Liam, Charlotte, Alice, Matilda, Will, Jessica, Eric, Chris, Brian, Daniel, Lily, Bill
Models:
| Model | Best for |
|---|---|
| eleven_multilingual_v2 | Multi-language, highest quality |
| eleven_turbo_v2_5 | Low latency, English-focused |
| eleven_monolingual_v1 | English only, legacy |
Extra settings:
- Stability (0–1, default 0.5) — higher = more consistent
- Similarity boost (0–1, default 0.75) — higher = closer to reference voice
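As a minimal sketch, the two extra settings could be folded into a request body like this. The field names `model_id`, `voice_settings`, `stability`, and `similarity_boost` follow the public ElevenLabs text-to-speech API; `clamp01` and `buildElevenLabsBody` are hypothetical helpers, and the choice of `eleven_multilingual_v2` as the fallback model is an assumption, not something Pawz documents.

```typescript
// Hypothetical helper (not part of Pawz): keeps a value inside the 0 to 1 range.
function clamp01(value: number): number {
  return Math.min(1, Math.max(0, value));
}

interface ElevenLabsBody {
  text: string;
  model_id: string;
  voice_settings: { stability: number; similarity_boost: number };
}

// Builds a JSON-serializable body using the defaults listed above
// (stability 0.5, similarity boost 0.75). The default model is an assumption.
function buildElevenLabsBody(
  text: string,
  modelId: string = "eleven_multilingual_v2",
  stability: number = 0.5,
  similarityBoost: number = 0.75,
): ElevenLabsBody {
  return {
    text,
    model_id: modelId,
    voice_settings: {
      stability: clamp01(stability),
      similarity_boost: clamp01(similarityBoost),
    },
  };
}
```

Out-of-range values are clamped rather than rejected, so a slider that overshoots still produces a usable request.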
Settings
| Setting | Default | Description |
|---|---|---|
| Provider | — | Google / OpenAI / ElevenLabs |
| Voice | — | Voice name from the selected provider |
| Speed | 1.0 | Playback speed multiplier |
| Language | — | Language code (13 supported) |
| Auto-speak | Off | Automatically speak every response |
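The settings table can be pictured as a typed object. This is only an illustrative shape: `TtsSettings`, `DEFAULT_TTS_SETTINGS`, and `withTtsDefaults` are hypothetical names, and how Pawz actually stores these values is not stated here. Settings without a documented default are modeled as `null`.

```typescript
type TtsProvider = "google" | "openai" | "elevenlabs";

// Hypothetical shape mirroring the settings table; not Pawz's real schema.
interface TtsSettings {
  provider: TtsProvider | null; // no default in the table
  voice: string | null;         // voice name from the selected provider
  speed: number;                // playback speed multiplier
  language: string | null;      // language code
  autoSpeak: boolean;           // speak every response automatically
}

const DEFAULT_TTS_SETTINGS: TtsSettings = {
  provider: null,
  voice: null,
  speed: 1.0,
  language: null,
  autoSpeak: false, // "Off" in the table
};

// Overlays user choices on the documented defaults.
function withTtsDefaults(overrides: Partial<TtsSettings>): TtsSettings {
  return { ...DEFAULT_TTS_SETTINGS, ...overrides };
}
```

For example, `withTtsDefaults({ provider: "openai", voice: "nova" })` keeps speed at 1.0 and auto-speak off.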
Speech-to-text (STT)
Pawz uses OpenAI Whisper for speech-to-text transcription:
| Backend | Setup | Latency | Cost |
|---|---|---|---|
| Whisper API | OpenAI API key (from Models settings) | ~1–2s | $0.006/min |
| Whisper Local | Install whisper binary | ~3–5s | Free |
STT is used in Talk Mode and any voice input feature. Audio is captured as WebM/Opus (or OGG fallback), base64-encoded, and sent to the Whisper endpoint for transcription.
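For a rough sense of the Whisper API price in the table above, the per-minute rate can be scaled by audio duration. The sketch assumes billing is strictly proportional to duration, with no rounding or minimum charge, which this page does not confirm; `estimateWhisperApiCostUsd` is a hypothetical helper.

```typescript
// Rate taken from the table above.
const WHISPER_API_USD_PER_MINUTE = 0.006;

// Estimates the Whisper API cost in USD for a given amount of audio.
// Assumption: cost scales linearly with duration.
function estimateWhisperApiCostUsd(audioSeconds: number): number {
  if (audioSeconds < 0) {
    throw new RangeError("audioSeconds must not be negative");
  }
  return (audioSeconds / 60) * WHISPER_API_USD_PER_MINUTE;
}
```

Under that assumption, one hour of transcribed audio comes to roughly $0.36.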
Audio capture settings
The microphone input uses these Web Audio constraints:
| Setting | Value |
|---|---|
| Echo cancellation | Enabled |
| Noise suppression | Enabled |
| Sample rate | 16 kHz |
| Format | audio/webm;codecs=opus (preferred) |
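The table maps onto standard Web Audio constraint names roughly as follows. This is a sketch, not Pawz's actual code: the OGG MIME string is an assumption (only the WebM one is given above), and `pickRecordingMimeType` is a hypothetical helper that takes the support check as a parameter so it can run outside a browser.

```typescript
// Constraints mirroring the table; in a browser this object would be passed
// as the `audio` member of navigator.mediaDevices.getUserMedia().
const audioConstraints = {
  echoCancellation: true,
  noiseSuppression: true,
  sampleRate: 16000, // 16 kHz
};

// Candidate formats in order of preference. The OGG entry is an assumption.
const PREFERRED_MIME_TYPES = ["audio/webm;codecs=opus", "audio/ogg;codecs=opus"];

// Returns the first supported format, or null if none is supported.
// In a browser, pass (t) => MediaRecorder.isTypeSupported(t).
function pickRecordingMimeType(
  isSupported: (mimeType: string) => boolean,
): string | null {
  for (const candidate of PREFERRED_MIME_TYPES) {
    if (isSupported(candidate)) {
      return candidate;
    }
  }
  return null;
}
```

Browsers treat these constraints as requests, so the sample rate actually delivered may differ from 16 kHz on some devices.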
Voice activity detection (VAD)
Talk Mode includes built-in voice activity detection to avoid sending silence to the transcription API:
| Parameter | Value | Description |
|---|---|---|
| Recording window | 8 seconds | Records in 8-second chunks |
| Minimum audio size | 8 KB | Chunks under 8 KB are treated as silence and skipped |
| Inter-cycle delay | 500 ms | Brief pause between recording cycles after errors |
| Empty transcript | Skipped | If Whisper returns blank text, the cycle restarts |
:::info
VAD works by checking the size of each recording chunk. Very short or silent recordings produce small files that are automatically discarded before being sent to Whisper.
:::
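The size check described above can be sketched in a few lines. The function names are hypothetical, and treating 8 KB as 8 × 1024 bytes is an assumption; the real threshold may be 8,000 bytes.

```typescript
// Assumption: 1 KB = 1024 bytes.
const MIN_AUDIO_BYTES = 8 * 1024;

// Chunks under the minimum size are treated as silence and skipped.
function isLikelySilence(chunkSizeBytes: number): boolean {
  return chunkSizeBytes < MIN_AUDIO_BYTES;
}

// A blank transcript restarts the cycle instead of reaching the agent.
function shouldForwardTranscript(transcript: string): boolean {
  return transcript.trim().length > 0;
}
```

This is a size heuristic rather than signal analysis: it relies on the codec producing small files for quiet input, so steady background noise can still pass the check.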
Talk mode
Click the microphone icon in the chat header to enter talk mode. Your speech is transcribed and sent to the agent, and the response is spoken back.
Requires either:
- Whisper API skill (OpenAI API key)
- Whisper Local skill (install whisper binary)
How talk mode works
1. Listen: the microphone captures audio in 8-second windows
2. Transcribe: the audio is sent to Whisper STT and converted to text
3. Process: the transcribed text is sent to your agent as a chat message
4. Speak: the agent's response is synthesized via your configured TTS provider
5. Repeat: the next recording cycle starts after playback finishes
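The cycle above can be sketched as an async loop. Everything here is illustrative: `TalkModeDeps`, `runTalkCycle`, and `runTalkMode` are hypothetical names, and the recorder, Whisper client, agent, and TTS player are passed in as functions so the sketch makes no claims about Pawz internals.

```typescript
// Hypothetical dependencies, one per step of the cycle.
interface TalkModeDeps {
  record: () => Promise<Uint8Array>;                  // Listen
  transcribe: (audio: Uint8Array) => Promise<string>; // Transcribe
  sendToAgent: (text: string) => Promise<string>;     // Process
  speak: (text: string) => Promise<void>;             // Speak
}

// Runs one cycle. Returns the agent's reply, or null when the
// transcript is blank and the cycle is skipped.
async function runTalkCycle(deps: TalkModeDeps): Promise<string | null> {
  const audio = await deps.record();
  const text = (await deps.transcribe(audio)).trim();
  if (text.length === 0) {
    return null;
  }
  const reply = await deps.sendToAgent(text);
  await deps.speak(reply); // expected to resolve once playback has finished
  return reply;
}

// Repeats cycles until `shouldContinue` returns false.
async function runTalkMode(
  deps: TalkModeDeps,
  shouldContinue: () => boolean,
): Promise<void> {
  while (shouldContinue()) {
    await runTalkCycle(deps);
  }
}
```

Awaiting `speak` before looping is what keeps the next recording window from starting while the reply is still playing.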
Voice command mode vs dictation mode
| Mode | Behavior | Use case |
|---|---|---|
| Voice command (default) | Each utterance is sent as a standalone chat message | Giving instructions, asking questions |
| Dictation | Utterances are accumulated into a text buffer | Composing long-form content, emails |
In voice command mode, every 8-second recording window is independently transcribed and sent to the agent. The agent responds and the reply is spoken aloud before the next cycle begins.
To use dictation mode, start your utterance with “dictate” or “type” — the agent will accumulate your speech into a document rather than responding conversationally.
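As an illustration of the trigger-word rule, an utterance could be classified by its first word. This is a hypothetical client-side sketch; in Pawz the agent itself decides how to treat the utterance, and its exact matching rules are not documented here. `detectVoiceMode` and `DICTATION_TRIGGERS` are made-up names.

```typescript
type VoiceMode = "command" | "dictation";

// Trigger words taken from the paragraph above.
const DICTATION_TRIGGERS = ["dictate", "type"];

// Classifies an utterance by its first word, ignoring case and any
// punctuation a transcriber might attach (for example "Dictate,").
function detectVoiceMode(utterance: string): VoiceMode {
  const rawFirstWord = utterance.trim().toLowerCase().split(/\s+/)[0] ?? "";
  const firstWord = rawFirstWord.replace(/[^a-z]/g, "");
  return DICTATION_TRIGGERS.indexOf(firstWord) !== -1 ? "dictation" : "command";
}
```

Matching on the whole first word means "typewriter ribbons are hard to find" stays a voice command, while "Type: meeting notes for Monday" switches to dictation.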