An AI receptionist answers every inbound call, greets the caller by voice, understands what they want in natural language, answers the common questions, and hands off to a human only when it needs to. No hold music, no voicemail, no missed calls after hours. The caller talks; your AI listens, thinks, and talks back in real time.
Sautikit makes this a media problem, not a telephony problem. You attach a webhook to a number, return a <Stream> voice action, and Sautikit forks the live call audio to your WebSocket server as raw PCM. You relay that audio to any LLM voice model (Gemini Live, OpenAI Realtime, or a self-hosted stack) and send the synthesised reply back on the same socket. When the AI decides a human is needed, your flow returns a <Dial> to warm-transfer the caller.
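Attaching the webhook to a number can be sketched as a single PATCH request. This is a hedged sketch, not Sautikit's client code: the PATCH /v1/numbers/{number_id} path and the routing_url field come from this guide, but the base URL, the Bearer authorization header, the JSON request body, and the JSON response are assumptions to check against the API reference. The setRoutingUrl name and the injectable fetchImpl parameter are hypothetical.

```javascript
// Point a number's routing_url at your voice webhook (hypothetical helper).
async function setRoutingUrl({ baseUrl, apiKey, numberId, routingUrl, fetchImpl = fetch }) {
  const endpoint = `${baseUrl}/v1/numbers/${encodeURIComponent(numberId)}`;
  const res = await fetchImpl(endpoint, {
    method: "PATCH",
    headers: {
      Authorization: `Bearer ${apiKey}`, // assumed auth scheme
      "Content-Type": "application/json", // assumed body format
    },
    body: JSON.stringify({ routing_url: routingUrl }),
  });
  if (!res.ok) {
    throw new Error(`PATCH ${endpoint} failed with HTTP ${res.status}`);
  }
  return res.json(); // assumes the API answers with JSON
}

// Usage (all values are placeholders):
// await setRoutingUrl({
//   baseUrl: "https://api.example.invalid",
//   apiKey: process.env.SAUTIKIT_API_KEY,
//   numberId: "your-number-id",
//   routingUrl: "https://your-app.example.com/voice",
// });
```

Passing fetchImpl in makes the helper easy to exercise with a stub before pointing it at a live account.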
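1. A call arrives on your number, whose routing_url points at your voice webhook.
2. Your webhook replies with a <Response> containing a <Stream> action.
3. Sautikit connects to the url in your <Stream>. Your server must advertise the audio.drachtio.org subprotocol on the handshake, or the connection is rejected.
4. When the AI escalates, your flow returns a <Dial> to a human's number and the caller is warm-transferred.

Endpoints you call: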
- POST /v1/numbers: claim a phone number for the front desk.
- PATCH /v1/numbers/{number_id}: set the routing_url to your voice webhook.
- GET /v1/calls/{call_sid}: fetch the call detail record after the call ends.

Voice actions used:
- Stream: fork live call audio to your WebSocket for real-time AI.
- Dial: warm-transfer the caller to a human when the AI escalates.
- Say: a TTS greeting fallback if the media socket is unavailable.

When the number is dialled, your webhook replies with an application/xml body opening the media stream:

    <Response>
      <Stream
        name="receptionist"
        url="wss://your-app.example.com/audio"
        track="both_tracks"
        outputSamplingRate="16000"
        statusCallback="https://your-app.example.com/stream-status"
        statusEvents="stream-started stream-stopped stream-error" />
    </Response>

track="both_tracks" forks both call legs so your model hears the caller and its own playback. outputSamplingRate is the PCM rate Sautikit sends and expects back. Audio on the wire is 16-bit little-endian PCM.
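A webhook that serves the response above could be sketched as follows. This is a minimal sketch under stated assumptions: buildStreamResponse and handleVoiceWebhook are hypothetical helper names, the example URLs are the placeholders used in this guide, and the handler has the (req, res) shape of Node's built-in http module. Only the attribute names come from the XML shown above.

```javascript
// Escape characters that are unsafe inside an XML attribute value.
function escapeAttr(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build the <Response><Stream ... /></Response> body (hypothetical helper).
function buildStreamResponse({ name, url, statusCallback, sampleRate = 16000 }) {
  const attrs = {
    name,
    url,
    track: "both_tracks",
    outputSamplingRate: sampleRate,
    statusCallback,
    statusEvents: "stream-started stream-stopped stream-error",
  };
  const rendered = Object.entries(attrs)
    .filter(([, value]) => value !== undefined) // skip optional attributes
    .map(([key, value]) => `${key}="${escapeAttr(value)}"`)
    .join(" ");
  return `<Response><Stream ${rendered} /></Response>`;
}

// Request handler with the (req, res) signature used by Node's http module.
function handleVoiceWebhook(_req, res) {
  const body = buildStreamResponse({
    name: "receptionist",
    url: "wss://your-app.example.com/audio",
    statusCallback: "https://your-app.example.com/stream-status",
  });
  res.writeHead(200, { "Content-Type": "application/xml" });
  res.end(body);
}
```

To run it, pass handleVoiceWebhook to createServer from node:http and listen on the port your routing_url points at. Building the XML from an object keeps attribute escaping in one place, which matters if any URL carries a query string.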
Your server terminates the socket, relays PCM to your LLM, and pipes the model's PCM back. The one hard requirement: advertise the audio.drachtio.org subprotocol.

    import { WebSocketServer } from "ws";
    import { connectToLLM } from "./llm.js"; // Gemini Live / OpenAI Realtime / self-hosted

    const wss = new WebSocketServer({
      port: 8080,
      handleProtocols: () => "audio.drachtio.org", // required by Sautikit
    });

    wss.on("connection", async (call) => {
      const llm = await connectToLLM({ sampleRate: 16000 });

      // Caller audio (binary PCM) -> LLM
      call.on("message", (frame, isBinary) => {
        if (isBinary) llm.sendAudio(frame);
      });

      // LLM audio (binary PCM) -> back into the call on the same socket
      llm.on("audio", (pcm) => call.send(pcm, { binary: true }));

      // When the model decides to escalate, close the stream so your
      // voice webhook flow can return the <Dial> below.
      llm.on("handoff", () => call.close());
    });

When the AI hands off, end the stream and let your flow return a <Dial> to the human's number, warm-transferring the caller:

    <Response>
      <Say>Connecting you to the front desk now.</Say>
      <Dial>+254700000001</Dial>
    </Response>

The inbound call leg is billed per second in KES for as long as the call is live on the Sautikit platform, at the same rate whether the AI is handling the caller or the call has been transferred. Once <Dial> connects a human, the outbound leg is billed per second too, for the duration of the connected call.
There is no separate Sautikit charge for opening the media stream or for the WebSocket round-trips. Your LLM and voice-model costs (Gemini Live, OpenAI, or self-hosted compute) are billed by that provider on their own metering; Sautikit only moves the audio.
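The billing rules above reduce to simple arithmetic, sketched below. This is illustrative only: estimateCallCostKes is a hypothetical helper, and both per-second rates are made-up placeholders rather than Sautikit's published prices. LLM and voice-model costs are deliberately left out because they are metered by the model provider.

```javascript
// Estimate the Sautikit telephony charge for one call, in KES.
// Rates are placeholders; substitute the real per-second rates for your account.
function estimateCallCostKes({ callSeconds, dialSeconds = 0, inboundRatePerSec, outboundRatePerSec }) {
  // Inbound leg: billed for the whole call, whether the AI or a human is on it.
  const inbound = callSeconds * inboundRatePerSec;
  // Outbound leg: billed only while the dialled human is connected.
  const outbound = dialSeconds * outboundRatePerSec;
  return { inbound, outbound, total: inbound + outbound };
}

// Example: a 3-minute call where the last minute was spent with a human.
const estimate = estimateCallCostKes({
  callSeconds: 180,
  dialSeconds: 60,
  inboundRatePerSec: 0.05, // placeholder rate
  outboundRatePerSec: 0.1, // placeholder rate
});
console.log(estimate);
```

A call the AI handles end to end has dialSeconds of 0, so only the inbound leg contributes.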
See also: the <Stream> attribute table and the action-response loop.