AI appointment-booking voicebot: book, reschedule, cancel by voice

An AI appointment-booking voicebot answers an inbound call and lets the caller book, reschedule, or cancel in ordinary spoken language. No "press 1 for bookings" menu, no rigid script. The caller says "I'd like to move my cleaning to next Thursday afternoon," and the agent confirms the slot, updates the calendar, and reads back the new time.

With Sautikit you get the real-time audio pipe and the phone number; you bring the intelligence. The <Stream> verb forks live caller audio to your WebSocket server, you relay it to the LLM of your choice for speech-to-text, reasoning, and text-to-speech, and you stream audio back into the call. You own the model, the prompt, and the booking logic. Sautikit bills the call by the second.

Who it's for:

Clinics, salons, and service businesses that field a high volume of booking calls, such as a busy Nairobi dental clinic or a Lagos hair studio drowning in "are you free Saturday?" calls.
Developers building conversational voice agents who want the carrier layer handled and full control of the AI stack.
Product teams that need 24/7 booking without staffing a phone line, and want the agent to speak local languages their existing IVR can't.
Anyone migrating an AI agent off a per-minute "voice AI platform" and onto raw per-second telephony with no AI markup.

The loop is real-time and full-duplex:

A caller dials a Sautikit number whose routing_url points to your voice webhook.
Your webhook returns an XML <Response> containing a <Stream> verb.
Sautikit opens a WebSocket to the url in that verb. Your server must advertise the audio.drachtio.org subprotocol during the handshake, or the connection is rejected.
Sautikit forks the live caller audio down that socket as binary PCM frames (16-bit little-endian signed PCM).
Your server relays those frames to your LLM (Gemini Live, OpenAI Realtime, or a self-hosted model) for STT, reasoning, and TTS.
Your server writes the LLM's synthesised PCM back onto the same socket, and Sautikit plays it into the call.

Because the stream is bidirectional on one socket, the caller and the agent can talk over each other: the caller can barge in mid-sentence, and the agent can pause or yield, which is what natural conversation needs.
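To act on an interruption, your server first has to notice that the caller is speaking. The sketch below decodes one binary frame and computes its loudness. It assumes each frame is mono 16-bit little-endian signed PCM delivered as a Node.js Buffer; the function names and the threshold value are illustrative and not part of Sautikit's API.

```javascript
// Sketch only: assumes each binary frame is MONO 16-bit little-endian signed PCM.
// If your stream interleaves two tracks, split the channels before calling this.
function pcmFrameStats(frame, sampleRate = 16000) {
  const sampleCount = Math.floor(frame.length / 2); // 2 bytes per sample
  let sumSquares = 0;
  for (let i = 0; i < sampleCount; i++) {
    const sample = frame.readInt16LE(i * 2) / 32768; // normalise to [-1, 1)
    sumSquares += sample * sample;
  }
  return {
    sampleCount,
    durationMs: (sampleCount * 1000) / sampleRate,
    rms: sampleCount === 0 ? 0 : Math.sqrt(sumSquares / sampleCount),
  };
}

// Hypothetical threshold; tune it against real call audio.
const SPEECH_RMS_THRESHOLD = 0.02;

function callerIsSpeaking(frame) {
  return pcmFrameStats(frame).rms > SPEECH_RMS_THRESHOLD;
}
```

A simple energy threshold like this will misfire on line noise, so treat it as a starting point. Many real-time model APIs ship their own voice-activity detection, which is usually the better choice when available.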

Your server owns the conversation state and the calendar. Use the call SID (sent in the stream metadata) to key a session, run tool-calls from the LLM against your booking database or calendar API, and confirm each change back to the caller in speech before you commit it.
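One way to keep "confirm before you commit" honest is to stage each change per call and write it only after the caller agrees. The following is a minimal in-memory sketch, assuming a single server process. The function names and the shape of the change object are hypothetical, and `commit` stands in for your own calendar write.

```javascript
// In-memory sketch; a real deployment would likely persist sessions elsewhere.
const sessions = new Map(); // callSid -> { pending: change | null, committed: [] }

function getSession(callSid) {
  if (!sessions.has(callSid)) {
    sessions.set(callSid, { pending: null, committed: [] });
  }
  return sessions.get(callSid);
}

// Stage a change the model proposed; nothing is written to the calendar yet.
// Returns the sentence the agent should speak back to the caller.
function proposeChange(callSid, change) {
  getSession(callSid).pending = change;
  return `Just to confirm: ${change.action} for ${change.slot}. Is that right?`;
}

// Commit only after the caller says yes. `commit` is your own calendar write.
async function resolveChange(callSid, callerConfirmed, commit) {
  const session = getSession(callSid);
  const change = session.pending;
  session.pending = null; // a staged change is resolved exactly once
  if (!change || !callerConfirmed) return null;
  await commit(change);
  session.committed.push(change);
  return change;
}

// Drop the session when the stream closes.
function endSession(callSid) {
  sessions.delete(callSid);
}
```

Clearing the pending change before the write means a repeated "yes" cannot commit the same booking twice.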

Endpoints you call:

POST /v1/numbers: claim a phone number for the voicebot.
PATCH /v1/numbers/{number_id}: set or update the routing_url (your voice webhook).
GET /v1/calls/{call_sid}: fetch the call detail record after the call ends, for logging and billing reconciliation.
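As an illustration of the last endpoint, here is a hedged sketch of fetching a call record from Node.js 18 or later. The base URL and the Bearer header are taken from the PATCH example elsewhere in this guide. The function name and the injectable `fetchImpl` parameter are assumptions added for testability, and the response fields are not documented here, so the sketch returns the parsed JSON untouched.

```javascript
// Base URL as used in this guide's curl example.
const API_BASE = "https://api.sautikit.com";

// Fetch the call detail record once the call has ended.
// `fetchImpl` defaults to the global fetch (Node.js 18+); it is injectable for tests.
async function fetchCallRecord(callSid, apiKey, fetchImpl = fetch) {
  const url = `${API_BASE}/v1/calls/${encodeURIComponent(callSid)}`;
  const res = await fetchImpl(url, {
    headers: { Authorization: `Bearer ${apiKey}` },
  });
  if (!res.ok) {
    throw new Error(`GET ${url} failed with HTTP ${res.status}`);
  }
  // Response fields are not specified in this guide; return the raw object.
  return res.json();
}
```

Calling `fetchCallRecord(callSid, process.env.SAUTIKIT_API_KEY)` from your stream-stopped handler is one place to hook it in for logging and reconciliation.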

Voice actions used:

Stream: fork live caller audio to your WebSocket and play audio back. This is the core of the real-time loop.
Say: optional TTS greeting before the stream opens.
Dial: optional human handoff — connect the caller to a receptionist when the AI hits something it can't resolve.

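When Sautikit POSTs to your routing_url, reply with the XML below. It opens a bidirectional stream at 16 kHz, a common input rate for speech models. Some model APIs use a different rate for input or output, so check your provider's audio format and resample if needed.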

<Response>
  <Stream
    name="booking-agent"
    url="wss://your-app.example.com/audio"
    track="both_tracks"
    outputSamplingRate="16000"
    statusCallback="https://your-app.example.com/stream-status"
    statusEvents="stream-started stream-stopped stream-error" />
</Response>

Attribute notes:

url (required): your wss:// WebSocket endpoint.
track (required): inbound_track, outbound_track, or both_tracks. Use both_tracks so the agent hears the caller and its own output.
outputSamplingRate (required): 8000 or 16000. Use 16000 for AI models.
name (optional): a label echoed back in stream events.
statusCallback / statusEvents (optional): where and which lifecycle events (stream-started stream-stopped stream-error) are POSTed.

You can also pass headerMetadata (a JSON blob sent as HTTP handshake headers, handy for auth) and openMetadata (opaque UTF-8 delivered in the first text frame).
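If your webhook builds this reply in code, the attribute rules above can be checked before the XML leaves your server. The sketch below is one way to do that. The function and option names are illustrative, and it covers only the attributes listed in the notes above, not headerMetadata or openMetadata.

```javascript
const TRACKS = new Set(["inbound_track", "outbound_track", "both_tracks"]);
const RATES = new Set([8000, 16000]);

// Escape characters that would break a double-quoted XML attribute.
function escapeXmlAttr(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build the <Response><Stream .../></Response> reply, validating required attributes.
function buildStreamResponse({
  url,
  track = "both_tracks",
  outputSamplingRate = 16000,
  name,
  statusCallback,
  statusEvents, // array of event names, joined with spaces
}) {
  if (!/^wss:\/\//.test(url ?? "")) throw new Error("url must be a wss:// endpoint");
  if (!TRACKS.has(track)) throw new Error(`unsupported track: ${track}`);
  if (!RATES.has(outputSamplingRate)) {
    throw new Error(`unsupported outputSamplingRate: ${outputSamplingRate}`);
  }
  const attrs = {
    name,
    url,
    track,
    outputSamplingRate,
    statusCallback,
    statusEvents: statusEvents?.join(" "),
  };
  const rendered = Object.entries(attrs)
    .filter(([, value]) => value !== undefined) // omit optional attributes not supplied
    .map(([key, value]) => `${key}="${escapeXmlAttr(value)}"`)
    .join(" ");
  return `<Response><Stream ${rendered} /></Response>`;
}
```

Your webhook handler would send the returned string as the response body, typically with an XML content type.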

The WebSocket server below is the glue between the Sautikit socket and your LLM. Advertising the audio.drachtio.org subprotocol is mandatory.

import WebSocket, { WebSocketServer } from "ws";
import { connectLLM } from "./llm.js"; // your Gemini Live / OpenAI wrapper

const wss = new WebSocketServer({
  port: 8080,
  // MUST advertise this subprotocol or Sautikit rejects the handshake.
  // ws v8+ passes the subprotocols offered by the client as a Set.
  handleProtocols: (protocols) =>
    protocols.has("audio.drachtio.org") ? "audio.drachtio.org" : false,
});

wss.on("connection", (sautikit) => {
  // Open your model session (STT + reasoning + TTS)
  const llm = connectLLM({
    systemPrompt:
      "You are the booking agent for Whitedent Clinic, Nairobi. " +
      "Book, reschedule, or cancel appointments. Confirm the date and " +
      "time back to the caller before committing. Speak the caller's language.",
    onAudio: (pcm) => {
      // Model audio -> back into the call; drop frames once the socket has closed
      if (sautikit.readyState === WebSocket.OPEN) sautikit.send(pcm);
    },
  });

  sautikit.on("message", (data, isBinary) => {
    if (isBinary) {
      // Live caller audio: 16-bit LE signed PCM -> feed the model
      llm.pushAudio(data);
      return;
    }
    // First text frame carries openMetadata (call SID, etc.).
    // openMetadata is opaque UTF-8, so guard against non-JSON payloads.
    try {
      const meta = JSON.parse(data.toString());
      if (meta.call_sid) llm.setContext({ callSid: meta.call_sid });
    } catch {
      console.warn("Ignoring text frame that is not valid JSON");
    }
  });

  // ws emits "close" after "error", so the model session is closed in one place
  sautikit.on("error", (err) => console.error("Stream socket error:", err));
  sautikit.on("close", () => llm.close());
});

Your connectLLM wrapper is where booking tool-calls live: when the model decides to write a slot, call your calendar API, then have the model confirm the result to the caller.
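What the tool-call side of that wrapper might look like is sketched below, with an in-memory calendar standing in for your real calendar API. The tool names, argument shapes, and result objects are illustrative assumptions, not any model provider's schema.

```javascript
// Hypothetical in-memory calendar: slot (ISO string) -> caller name.
function createCalendar() {
  const slots = new Map();
  return {
    book(slot, caller) {
      if (slots.has(slot)) return { ok: false, reason: "slot_taken" };
      slots.set(slot, caller);
      return { ok: true, slot };
    },
    cancel(slot) {
      return slots.delete(slot)
        ? { ok: true, slot }
        : { ok: false, reason: "not_found" };
    },
    reschedule(fromSlot, toSlot) {
      if (!slots.has(fromSlot)) return { ok: false, reason: "not_found" };
      if (slots.has(toSlot)) return { ok: false, reason: "slot_taken" };
      slots.set(toSlot, slots.get(fromSlot));
      slots.delete(fromSlot);
      return { ok: true, slot: toSlot };
    },
  };
}

// Route a tool call from the model to the calendar.
// Tool names and argument shapes are illustrative assumptions.
function handleToolCall(calendar, { name, args }) {
  switch (name) {
    case "book_appointment":
      return calendar.book(args.slot, args.caller);
    case "cancel_appointment":
      return calendar.cancel(args.slot);
    case "reschedule_appointment":
      return calendar.reschedule(args.from, args.to);
    default:
      return { ok: false, reason: "unknown_tool" };
  }
}
```

Returning a structured result with a reason code, rather than throwing, gives the model something it can turn into speech, for example offering another time when the slot is taken.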

Point the number's routing_url at your voice webhook:

curl -X PATCH "https://api.sautikit.com/v1/numbers/{number_id}" \
  -H "Authorization: Bearer $SAUTIKIT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"routing_url": "https://your-app.example.com/voice"}'

Sautikit bills the inbound call per second in KES for the time the call is live on the platform — nothing more. There is no per-minute "AI voice" surcharge and no fee for the number of audio frames or WebSocket bytes.

The AI cost — STT, the LLM, and TTS — is billed by your own provider (Gemini, OpenAI, or your self-hosted GPU bill). That separation is the point: you pay the telephony leg to Sautikit at raw per-second rates and the intelligence leg to whoever you chose, with no platform tax stacked between them. A three-minute booking call costs you three minutes of per-second inbound telephony plus whatever your model provider charges for three minutes of audio.
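The two-leg arithmetic can be written down as a small estimator. This is a sketch with placeholder inputs: the rates are hypothetical values you supply in whatever minor currency unit you use, not Sautikit's or any provider's actual prices, and it assumes your model provider charges in proportion to audio duration.

```javascript
// Estimate the two independent cost legs of one call.
// All rates are caller-supplied placeholders in the same minor currency unit.
function estimateCallCost({ seconds, telephonyRatePerSecond, aiRatePerMinute }) {
  const telephony = seconds * telephonyRatePerSecond; // billed per second
  const ai = (seconds / 60) * aiRatePerMinute; // assumes duration-proportional AI pricing
  return { telephony, ai, total: telephony + ai };
}

// A three-minute call with made-up rates of 2 units/second and 30 units/minute.
const estimate = estimateCallCost({
  seconds: 180,
  telephonyRatePerSecond: 2,
  aiRatePerMinute: 30,
});
console.log(estimate);
```

Keeping the legs as separate fields mirrors how they appear on two separate invoices, which makes reconciliation against the call detail record simpler.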

Stream voice action: full attribute list and the frame format.
Build an AI voice engine with Gemini: end-to-end wiring of the stream to Gemini Live.
How to build an AI voice agent: prompt design, tool-calls, and handling interruptions.
AI receptionist use case: the same stream pattern applied to call answering and routing.
Voice actions concept: how the action-response loop works end to end.
