Phone calling agentHuman-in-the-loopAnti-prompt injection

CaulBot

Any phone call, delegated. Describe the task, approve the plan, let the AI handle it.

Research, then a plan

Two agents handle each call, and they never overlap. The Research Agent runs first. It parses your natural language request, searches for the right phone number if you didn't provide one, pulls relevant context from the web, and assembles a structured call plan: target, goal, scope, expected approach. You see all of this before anything is dialed.

Approval is enforced at the execution layer, not as a UI courtesy. Plans expire after 15 minutes. If you don't confirm within that window, the call is dropped. The agent only acts on current intent, never on a stale plan.

An AI that makes phone calls for you. It plans first, you approve, then it executes.

Designing for trust under delegation

Delegating a phone call to an AI is a higher-trust act than delegating an email or a calendar invite. The user has to believe the agent will represent them accurately, stop when the situation calls for it, and not place a call that wasn't actually authorized. The product surface is built around making each of those beliefs verifiable rather than asserted.

WhatsApp is the interface for a reason. A new app would have asked the user to learn a new trust surface for an act they're already nervous about. WhatsApp is one they already use to coordinate with humans, including humans they don't fully trust, so the mental model transfers. Plans arrive as messages they can read, edit, or reject. Outcomes come back as messages they can scroll back through later. There's no separate dashboard to check; the conversation history is the audit log.

The approval gate is the product. The 15-minute plan expiration exists because intent goes stale; a user who walked away from their phone for an hour didn't authorize a call now. The agent will not operate on a plan that has aged out, even if the user comes back and assumes it's still valid.

Listen-in mode is for verification without intrusion. The user can have Twilio call their own number while the agent runs the conversation, hearing both sides live without joining the call. They can verify the agent's representation of them in real time, without the callee knowing a third line is open. The control isn't 'monitor the agent' as a dashboard view. It's 'hear it for yourself' as a phone call.

Live call over WebSocket

Once approved, the Phone Agent takes over entirely. Twilio dials the number and opens a media stream over WebSocket; Deepgram transcribes the callee's speech in real time; the agent generates a response; Cartesia synthesizes it and sends it back through the stream. The full loop (speech in, decision, speech out) runs in under a second.

The agent handles IVR menus, hold music, and natural conversation turns without intervention. If you want to follow along, a listen-in mode has Twilio call your own number so you can hear both sides live. If you need to steer mid-call, send a WhatsApp message. It gets synthesized and injected into the active conversation. When the call ends, the outcome summary lands in WhatsApp: what was said, what was agreed, what the agent learned.

Compliance by default

AI phone calling has real legal exposure. CaulBot ships with a five-pillar compliance engine that runs before any call is placed and cannot be disabled.

Every call opens with a mandatory AI disclosure: the agent identifies itself as an AI calling on behalf of the user by name. Opt-out detection runs throughout the call. If the callee says anything resembling "don't call again" or "I'd prefer to speak to a human," the agent acknowledges it, ends the call, and adds the number to a permanent suppression list. Use-case restrictions block entire categories before dialing: marketing, sales, debt collection, political calls, and harassment are rejected at the request stage, not the call stage. Rate limits cap calls per number and per user to prevent abuse patterns.

The goal was to ship compliance before the first user, not retrofit it after. The five pillars (disclosure, opt-out, recording clarity, abuse prevention, and recipient control) are enforced at the execution layer, not surfaced as settings.

When the agent is wrong

The harder question for an AI agent isn't what it does when it succeeds. It's what it does when it doesn't, and how the user finds out. CaulBot's outcome summary surfaces what was said, what was agreed, and what the agent learned, in plain text in WhatsApp after every call. If the agent ended up agreeing to something the user didn't intend, it shows up there. If the call hit a dead end (wrong number, automated rejection, an opt-out request), it shows up there. The summary is not optimized for sounding successful; it's optimized for being read accurately.

Mid-call steering exists for the case where the user notices the call going somewhere wrong. A WhatsApp message during an active call gets synthesized and injected into the agent's context on the next turn. The user can clarify, redirect, or cancel without leaving the messaging surface they started in. The recovery path doesn't require switching apps or interfaces under stress.

For categories the agent shouldn't touch at all (sales, marketing, debt collection, harassment, political), the rejection happens at the request stage, before any plan is generated. The user gets a clear message about why, not a generic refusal. The boundaries are visible, not silent.

Provider abstraction and open integration

Every component is swappable via config: LLM provider, speech-to-text, text-to-speech, telephony, and web search. The default stack is Groq + Deepgram + Cartesia + Twilio + Tavily, but any of those can be replaced without touching the agent logic.

An HTTP REST API runs alongside the WhatsApp interface. POST a call request, poll for status, approve or reject via endpoint. The full flow is accessible programmatically. This makes CaulBot composable: wire it into a workflow, embed it as an MCP tool, or call it from any system that can make an HTTP request. The WhatsApp interface is one consumer of the same underlying API.

Highlights

▶Two-agent design: Research Agent plans and researches; Phone Agent executes, separated by an approval gate with 15-minute expiry
▶Live call loop: Deepgram STT + Cartesia TTS over Twilio WebSocket media streams, with mid-call WhatsApp injection and listen-in mode
▶Compliance engine: mandatory AI disclosure, opt-out detection, suppression list, use-case restrictions, and rate limiting; ships before the first user
▶Anti-prompt-injection: callee speech is treated as untrusted observation data, never as instruction
▶Open integration: HTTP REST API makes CaulBot composable. Embed as an MCP tool or call from any system

Under the hood

Two-agent architectureWebSocket media streamsApproval gates5-pillar compliance engineProvider abstractionMCP-compatible REST API