AI receptionist: a 24/7 virtual front desk that never misses a call

An AI receptionist answers every inbound call, greets the caller by voice, understands what they want in natural language, answers the common questions, and hands off to a human only when it needs to. No hold music, no voicemail, no missed calls after hours. The caller talks; your AI listens, thinks, and talks back in real time.

Sautikit makes this a media problem, not a telephony problem. You attach a webhook to a number, return a <Stream> voice action, and Sautikit forks the live call audio to your WebSocket server as raw PCM. You relay that audio to any LLM voice model (Gemini Live, OpenAI Realtime, or a self-hosted stack) and send the synthesised reply back on the same socket. When the AI decides a human is needed, your flow returns a <Dial> to warm-transfer the caller.

Who it's for:

SMEs and professional-services offices (a law firm, property agency, clinic, or accountancy) that cannot afford to miss an inbound call.
Teams expanding across Africa that need a front desk in every timezone without hiring one per office.
Developers building an AI voice agent who want the media transport handled and their own model in control of the conversation.
Product managers replacing legacy voicemail and after-hours forwarding with a measurable, always-on agent.

How it works:

1. A caller dials your Sautikit number. The number's routing_url points at your voice webhook.
2. Sautikit POSTs the call details to that webhook. Your server responds with an XML <Response> containing a <Stream> action.
3. Sautikit opens a WebSocket to the url in your <Stream>. Your server must advertise the audio.drachtio.org subprotocol on the handshake, or the connection is rejected.
4. Sautikit forks the live caller audio down that socket as binary PCM frames (16-bit little-endian).
5. You relay those frames to your LLM (Gemini Live, OpenAI Realtime, or self-hosted). The model transcribes, reasons, and generates a spoken reply.
6. You send the reply back as PCM frames on the same socket. Sautikit plays them into the call. This is full-duplex: audio flows both ways at once, so the caller can interrupt.
7. When the AI decides to escalate, your webhook flow returns a <Dial> to a human's number and the caller is warm-transferred.
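
The first two steps above can be sketched as a plain Node request handler. This is a minimal sketch under stated assumptions, not Sautikit's reference implementation: handleVoiceWebhook, streamResponse, and escapeXmlAttr are illustrative names, the wss:// URL is the placeholder used in this guide, and the handler ignores the posted call details because their format is not covered here.

```javascript
// Escape the characters that are unsafe inside an XML attribute value.
function escapeXmlAttr(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build the <Response> that asks Sautikit to open the media stream.
function streamResponse(socketUrl, sampleRate = 16000) {
  return [
    "<Response>",
    `  <Stream name="receptionist" url="${escapeXmlAttr(socketUrl)}"`,
    `    track="both_tracks" outputSamplingRate="${sampleRate}" />`,
    "</Response>",
  ].join("\n");
}

// Request handler with the (req, res) shape used by node:http.
// Assumption: any non-POST request is rejected with 405.
function handleVoiceWebhook(req, res) {
  if (req.method !== "POST") {
    res.writeHead(405, { Allow: "POST" });
    res.end();
    return;
  }
  req.resume(); // drain the body; this sketch does not parse the call details
  res.writeHead(200, { "Content-Type": "application/xml" });
  res.end(streamResponse("wss://your-app.example.com/audio"));
}
```

Because the handler takes the standard (req, res) pair, it can be passed directly to http.createServer from node:http or mounted in a framework of your choice.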

Endpoints you call:

POST /v1/numbers: claim a phone number for the front desk.
PATCH /v1/numbers/{number_id}: set the routing_url to your voice webhook.
GET /v1/calls/{call_sid}: fetch the call detail record after the call ends.
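
As a rough illustration of the calls above, the sketch below builds the PATCH and GET requests and sends them with the global fetch (Node 18+). Several details are assumptions rather than documented behaviour: the API_BASE host is a placeholder, the Bearer authorization header is a guess at the auth scheme, the routing_url is assumed to travel as a JSON field, and responses are assumed to be JSON. Check your Sautikit account documentation for the real values.

```javascript
// Placeholder host, not a real endpoint: replace with your actual API base URL.
const API_BASE = "https://api.sautikit.example";

// Describe the PATCH that points a number at your voice webhook.
function setRoutingRequest(numberId, webhookUrl, apiKey) {
  return {
    method: "PATCH",
    url: `${API_BASE}/v1/numbers/${encodeURIComponent(numberId)}`,
    headers: {
      Authorization: `Bearer ${apiKey}`, // assumed auth scheme
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ routing_url: webhookUrl }), // assumed JSON body shape
  };
}

// Describe the GET that fetches the call detail record after the call ends.
function callRecordRequest(callSid, apiKey) {
  return {
    method: "GET",
    url: `${API_BASE}/v1/calls/${encodeURIComponent(callSid)}`,
    headers: { Authorization: `Bearer ${apiKey}` },
  };
}

// Send a request description and parse the (assumed) JSON response.
async function send({ url, ...init }) {
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`${init.method} ${url} failed: ${res.status}`);
  return res.json();
}
```

Keeping the request builders separate from send makes them easy to unit-test without touching the network.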

Voice actions used:

Stream: fork live call audio to your WebSocket for real-time AI.
Dial: warm-transfer the caller to a human when the AI escalates.
Say: a TTS greeting fallback if the media socket is unavailable.

When the number is dialled, your webhook replies with an application/xml body opening the media stream:

<Response>
  <Stream
    name="receptionist"
    url="wss://your-app.example.com/audio"
    track="both_tracks"
    outputSamplingRate="16000"
    statusCallback="https://your-app.example.com/stream-status"
    statusEvents="stream-started stream-stopped stream-error" />
</Response>

track="both_tracks" forks both call legs so your model hears the caller and its own playback. outputSamplingRate is the PCM sample rate in Hz (16000 here, i.e. 16 kHz) that Sautikit sends and expects back. Audio on the wire is 16-bit little-endian PCM.
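
To make the format concrete: 16-bit samples at 16000 Hz come to 32000 bytes per second per channel, so 20 ms of single-channel audio is 640 bytes. The helpers below are a small sketch of that arithmetic plus an explicit little-endian decode. The function names are illustrative, and the default of one channel is an assumption: how the two legs of both_tracks are laid out on the wire is not specified here, so pass the channel count that matches what you actually receive.

```javascript
const BYTES_PER_SAMPLE = 2; // 16-bit PCM

// Duration in milliseconds of a PCM payload of the given size.
function pcmDurationMs(byteLength, sampleRate = 16000, channels = 1) {
  return (byteLength * 1000) / (BYTES_PER_SAMPLE * channels * sampleRate);
}

// Number of bytes needed to hold a frame of the given duration.
function pcmBytesFor(durationMs, sampleRate = 16000, channels = 1) {
  const samplesPerChannel = Math.round((durationMs / 1000) * sampleRate);
  return samplesPerChannel * channels * BYTES_PER_SAMPLE;
}

// Decode a binary frame (a Node Buffer) into signed 16-bit samples,
// reading little-endian explicitly so the result does not depend on the host CPU.
function decodeSamples(frame) {
  const samples = new Int16Array(Math.floor(frame.length / BYTES_PER_SAMPLE));
  for (let i = 0; i < samples.length; i++) {
    samples[i] = frame.readInt16LE(i * BYTES_PER_SAMPLE);
  }
  return samples;
}
```

These helpers are handy when your LLM provider expects fixed-size chunks or a different sample rate, since you can check how much audio a frame carries before resampling or re-chunking it.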

Your server terminates the socket, relays PCM to your LLM, and pipes the model's PCM back. The one hard requirement: advertise the audio.drachtio.org subprotocol.

import { WebSocketServer } from "ws";
import { connectToLLM } from "./llm.js"; // Gemini Live / OpenAI Realtime / self-hosted
 
const wss = new WebSocketServer({
  port: 8080,
  handleProtocols: () => "audio.drachtio.org", // required by Sautikit
});
 
wss.on("connection", async (call) => {
  const llm = await connectToLLM({ sampleRate: 16000 });
 
  // Caller audio (binary PCM) -> LLM
  call.on("message", (frame, isBinary) => {
    if (isBinary) llm.sendAudio(frame);
  });
 
  // LLM audio (binary PCM) -> back into the call on the same socket
  llm.on("audio", (pcm) => call.send(pcm, { binary: true }));
 
  // When the model decides to escalate, close the stream so your
  // voice webhook flow can return the <Dial> below.
  llm.on("handoff", () => call.close());
});

When the AI hands off, end the stream and let your flow return a <Dial> to the human's number, warm-transferring the caller:

<Response>
  <Say>Connecting you to the front desk now.</Say>
  <Dial>+254700000001</Dial>
</Response>
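
If the human's number or the announcement varies per call, you may prefer to build this response in code. The following is a minimal sketch: dialResponse and escapeXmlText are illustrative names, and the E.164 check is a defensive assumption based on the number format shown above, not a documented Sautikit requirement.

```javascript
// E.164 shape: a leading +, then 2 to 15 digits, the first of which is not 0.
const E164 = /^\+[1-9]\d{1,14}$/;

// Escape the characters that are unsafe inside XML element text.
function escapeXmlText(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Build the hand-off response: a short spoken announcement, then the transfer.
function dialResponse(humanNumber, announcement) {
  if (!E164.test(humanNumber)) {
    throw new Error(`Not an E.164 number: ${humanNumber}`);
  }
  return [
    "<Response>",
    `  <Say>${escapeXmlText(announcement)}</Say>`,
    `  <Dial>${humanNumber}</Dial>`,
    "</Response>",
  ].join("\n");
}
```

Validating the number before building the XML means a misconfigured extension fails loudly in your server logs rather than silently during a live call.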

The inbound call leg is billed per second in KES for as long as the call is live on the Sautikit platform — the same rate whether the AI is handling the caller or the call has been transferred. Once <Dial> connects a human, the outbound leg is billed per second too, for the duration of the connected call.

There is no separate Sautikit charge for opening the media stream or for the WebSocket round-trips. Your LLM and voice-model costs (Gemini Live, OpenAI, or self-hosted compute) are billed by that provider on their own metering — Sautikit only moves the audio.
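
The billing rule above reduces to simple arithmetic: the inbound leg is charged for the whole call, and the outbound leg is added only for the seconds a human is connected. The sketch below illustrates that calculation. The function name and parameter names are hypothetical, and the rates are inputs you supply from your own price list; no actual Sautikit rates are assumed here.

```javascript
// Estimate the Sautikit charge for one call, in KES.
// totalSeconds: how long the call was live on the platform (inbound leg).
// connectedSeconds: how long the <Dial> leg was connected to a human (0 if never transferred).
function estimateCallCostKes({
  totalSeconds,
  connectedSeconds = 0,
  inboundRatePerSecond,
  outboundRatePerSecond,
}) {
  if (connectedSeconds > totalSeconds) {
    throw new RangeError("connectedSeconds cannot exceed totalSeconds");
  }
  // The inbound leg is billed for the full call, before and after the transfer.
  const inbound = totalSeconds * inboundRatePerSecond;
  // The outbound leg is billed only while the human is connected.
  const outbound = connectedSeconds * outboundRatePerSecond;
  return { inbound, outbound, total: inbound + outbound };
}
```

LLM and voice-model costs are deliberately left out, since they are metered by the model provider rather than by Sautikit.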

Related guides:

Voice actions concept: the full <Stream> attribute table and the action-response loop.
Build an AI voice engine with Gemini: wiring Stream to Gemini Live end to end.
How to build an AI voice agent: design patterns for real-time voice agents.
Dial voice action: warm transfer, caller ID, and connected-leg options.
AI support agent use case: the same Stream loop applied to inbound support.
