Documentation Index
Fetch the complete documentation index at: https://docs.openpawz.ai/llms.txt
Use this file to discover all available pages before exploring further.
Model Pricing & Budget
Pawz tracks token usage and costs across all AI providers, enforces daily budgets, and can automatically route tasks to cheaper models when appropriate.
Per-model pricing
Pawz uses built-in pricing data to estimate costs in real time. Prices are per 1 million tokens.
| Model | Input ($/1M tokens) | Output ($/1M tokens) | Provider |
|---|
| Claude 3 Haiku | $0.25 | $1.25 | Anthropic |
| Claude Haiku 4 | $1.00 | $5.00 | Anthropic |
| Claude Sonnet 4.x | $3.00 | $15.00 | Anthropic |
| Claude Opus 4.x | $15.00 | $75.00 | Anthropic |
| Gemini Flash | $0.15 | $0.60 | Google |
| Gemini Pro | $1.25 | $10.00 | Google |
| GPT-4.1 | $2.50 | $10.00 | OpenAI |
| GPT-4.1-mini / nano | $0.15 | $0.60 | OpenAI |
| GPT-4o | $2.50 | $10.00 | OpenAI |
| GPT-4o-mini | $0.15 | $0.60 | OpenAI |
| o3 / o1 | $10.00 | $40.00 | OpenAI |
| o3-mini / o4-mini | $1.10 | $4.40 | OpenAI |
| DeepSeek Chat | $0.27 | $1.10 | DeepSeek |
| DeepSeek Reasoner | $0.55 | $2.19 | DeepSeek |
:::info
Prices are based on each provider’s published rates. Local models (Ollama) have zero cost. For providers not listed (custom, OpenRouter, etc.), costs are estimated based on the closest matching model name.
:::
Cost comparison
To put these numbers in perspective, here’s the approximate cost for a 10,000-token conversation (5K input + 5K output):
| Model | Estimated cost |
|---|
| Gemini Flash | $0.004 |
| GPT-4o-mini | $0.004 |
| GPT-4.1-mini | $0.004 |
| DeepSeek Chat | $0.007 |
| Claude 3 Haiku | $0.008 |
| o3-mini / o4-mini | $0.028 |
| Claude Haiku 4 | $0.030 |
| Gemini Pro | $0.056 |
| GPT-4o | $0.063 |
| GPT-4.1 | $0.063 |
| Claude Sonnet 4.x | $0.090 |
| o3 / o1 | $0.250 |
| Claude Opus 4.x | $0.450 |
Cache token accounting
When providers support prompt caching (e.g., Anthropic), Pawz applies reduced rates for cached tokens:
| Token type | Cost multiplier |
|---|
| Normal tokens | 100% (full price) |
| Cache reads | 10% of normal cost |
| Cache creation | 25% of normal cost |
:::tip
Anthropic’s prompt caching can dramatically reduce costs for repetitive tasks. Long system prompts and skill instructions benefit the most from caching since they remain constant across turns.
:::
Daily budget
Pawz can enforce a daily spending limit to prevent runaway costs.
Configuration
| Setting | Default | Description |
|---|
daily_budget_usd | $10.00 | Maximum daily spend across all models |
Set to 0 to disable budget enforcement entirely.
Configure the budget in Settings → Agent Defaults or directly in the engine config:
{
"daily_budget_usd": 10.0
}
Budget enforcement
The budget is checked before each API call. Pawz provides progressive warnings as spending increases:
| Threshold | Action |
|---|
| 50% of budget | Warning notification |
| 75% of budget | Elevated warning |
| 90% of budget | Urgent warning |
| 100% of budget | Hard block — API calls are rejected |
:::warning
When the daily budget is reached, all AI API calls are blocked until the next day (midnight UTC). Ongoing conversations will stop receiving responses. Set a budget that accommodates your expected daily usage.
:::
DailyTokenTracker
Pawz maintains per-model cost tracking through the DailyTokenTracker:
- Tracks input and output tokens separately per model
- Calculates costs using the pricing table above
- Applies cache token discounts automatically
- Resets daily at midnight
The tracker enables the budget enforcement system and powers the cost display in the chat interface (shown in complete events with usage stats).
Auto-tier model selection
The auto_tier feature automatically routes tasks to cheaper or more expensive models based on complexity.
Task complexity classification
Pawz analyzes each message to determine if it’s simple or complex:
| Classification | Routing | Example tasks |
|---|
| Simple | Routes to cheap_model | ”What time is it?”, “Convert 5kg to lbs”, simple Q&A |
| Complex | Routes to default_model | Multi-step reasoning, code generation, research |
The classification uses keyword signals and heuristics to decide. When auto_tier is enabled in your model routing config, simple messages skip the expensive default model entirely.
Model routing configuration
{
"model_routing": {
"boss_model": "claude-opus-4-6",
"worker_model": "claude-sonnet-4-6",
"cheap_model": "claude-3-haiku",
"auto_tier": true,
"specialty_models": {
"coder": "gemini-2.5-pro"
},
"agent_models": {
"agent-uuid-here": "gpt-4o"
}
}
}
| Field | Description |
|---|
boss_model | Powerful model for the orchestrator/boss agent |
worker_model | Default model for sub-agents (cheaper/faster) |
cheap_model | Budget model for simple tasks when auto_tier is on |
auto_tier | Enable automatic model selection by task complexity |
specialty_models | Per-specialty overrides (e.g., coder, researcher) |
agent_models | Per-agent overrides (highest priority) |
Resolution order (highest priority first):
- Per-agent override (
agent_models)
- Per-specialty override (
specialty_models)
cheap_model (if auto_tier enabled and task is simple)
default_model
Tips for cost optimization
:::tip Cost optimization strategies
-
Use
auto_tier: Enable automatic model selection so simple queries use cheap models. This alone can cut costs 50%+.
-
Set a daily budget: Even a generous budget prevents accidental runaway costs from automated tasks or cron jobs.
-
Use Ollama for development: Local models are free. Use them for testing agent configurations before switching to paid providers.
-
Match model to task: Don’t use Claude Opus for simple questions. Create chat modes (see Foundry) for different tiers.
-
Enable session compaction: Long sessions consume more tokens per turn. Use
/compact or let auto-compaction manage context size.
-
Leverage prompt caching: Anthropic’s cache reads cost only 10% of normal input tokens. Consistent system prompts and skill instructions benefit most.
-
Use
worker_model for orchestration: In multi-agent projects, the boss agent should use a capable model, but workers can use cheaper alternatives.
-
Monitor the
complete events: Each response includes token usage stats (input/output/total tokens and model name) so you can track spending in real time.
:::