SautiKit

AI appointment-booking voicebot: book, reschedule, cancel by voice

Build an AI voicebot that books, reschedules, and cancels appointments in natural language over the phone using the Stream verb and your own LLM.

use-case, ai-voice-agent, stream, appointment-booking, llm


© 2026 Sautikit. All rights reserved • Powered by Helloduty


Sautikit provides voice API services for application developers. Numbers provisioned on this platform are not configured for emergency calling (e.g. 999 / 112). Do not use Sautikit numbers as a replacement for a primary phone line.

Summary

An AI appointment-booking voicebot answers an inbound call and lets the caller book, reschedule, or cancel in ordinary spoken language. No "press 1 for bookings" menu, no rigid script. The caller says "I'd like to move my cleaning to next Thursday afternoon," and the agent confirms the slot, updates the calendar, and reads back the new time.

With Sautikit you get the real-time audio pipe and the phone number; you bring the intelligence. The <Stream> verb forks live caller audio to your WebSocket server. You relay that audio to the model of your choice (a real-time speech model, or a separate STT, LLM, and TTS chain) and stream the synthesised audio back into the call. You own the model, the prompt, and the booking logic. Sautikit bills the call by the second.

Who this is for

  • Clinics, salons, and service businesses that field a high volume of booking calls — think a busy Nairobi dental clinic or a Lagos hair studio drowning in "are you free Saturday?" calls.
  • Developers building conversational voice agents who want the carrier layer handled and full control of the AI stack.
  • Product teams that need 24/7 booking without staffing a phone line, and want the agent to speak local languages their existing IVR can't.
  • Anyone migrating an AI agent off a per-minute "voice AI platform" and onto raw per-second telephony with no AI markup.

How it works

The loop is real-time and full-duplex:

  1. A caller dials a Sautikit number whose routing_url points to your voice webhook.
  2. Your webhook returns an XML <Response> containing a <Stream> verb.
  3. Sautikit opens a WebSocket to the url in that verb and offers the audio.drachtio.org subprotocol. Your server must select that subprotocol in its handshake response, or the connection is rejected.
  4. Sautikit forks the live caller audio down that socket as binary frames of 16-bit little-endian signed PCM.
  5. Your server relays those frames to your LLM (Gemini Live, OpenAI Realtime, or a self-hosted model) for STT, reasoning, and TTS.
  6. Your server writes the LLM's synthesised PCM back onto the same socket, and Sautikit plays it into the call.

Because the stream is bidirectional on one socket, the caller and the agent can talk over each other: the caller can barge in mid-sentence, and the agent can stop speaking and listen. That is exactly what natural conversation needs.
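Barge-in handling can be sketched as follows. This is a minimal illustration, not Sautikit's API: frameRms, callerIsSpeaking, and PlaybackQueue are hypothetical helpers, the 0.02 threshold is an arbitrary starting point, and the code assumes each binary frame is single-channel 16-bit little-endian PCM delivered as a Node.js Buffer. Many real-time models detect interruptions themselves, in which case you would not need this.

```javascript
// Sketch: detect caller speech in a frame of 16-bit little-endian signed PCM
// and drop queued agent audio when the caller interrupts.
// ASSUMPTIONS: mono audio, Buffer input, illustrative threshold.

/** Root-mean-square level of a PCM frame, normalised to the range 0..1. */
function frameRms(frame) {
  const sampleCount = Math.floor(frame.length / 2); // 2 bytes per sample
  if (sampleCount === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < sampleCount; i++) {
    const sample = frame.readInt16LE(i * 2) / 32768;
    sumSquares += sample * sample;
  }
  return Math.sqrt(sumSquares / sampleCount);
}

/** True when the frame is loud enough to count as the caller speaking. */
function callerIsSpeaking(frame, threshold = 0.02) {
  return frameRms(frame) >= threshold;
}

/** Holds agent audio waiting to be sent; flushed when the caller barges in. */
class PlaybackQueue {
  constructor() {
    this.pending = [];
  }
  enqueue(pcm) {
    this.pending.push(pcm);
  }
  next() {
    return this.pending.shift();
  }
  /** Returns true if queued agent audio was discarded because of this frame. */
  onCallerFrame(frame) {
    if (callerIsSpeaking(frame) && this.pending.length > 0) {
      this.pending.length = 0; // caller interrupted: discard queued agent audio
      return true;
    }
    return false;
  }
}
```

In a bridge you would call onCallerFrame for every inbound binary frame and send next() to the socket on a timer, so a loud caller frame stops the agent within one frame interval.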

State and booking logic

Your server owns the conversation state and the calendar. Use the call SID (sent in the stream metadata) to key a session, run tool-calls from the LLM against your booking database or calendar API, and confirm each change back to the caller in speech before you commit it.
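The confirm-before-commit rule can be sketched as a small session store keyed by call SID. Everything here is illustrative: the function names, the session shape, and the commit callback (which stands in for your calendar write) are assumptions, and a production build would likely persist sessions outside process memory.

```javascript
// Sketch: per-call session state with a confirm-before-commit gate.
// ASSUMPTION: callSid is whatever identifier you receive for the call.
const sessions = new Map(); // callSid -> session object

function getSession(callSid) {
  if (!sessions.has(callSid)) {
    sessions.set(callSid, { callSid, pendingChange: null, committed: [] });
  }
  return sessions.get(callSid);
}

/** Stage a change and return the sentence the agent should speak. */
function proposeChange(callSid, change) {
  const session = getSession(callSid);
  session.pendingChange = change;
  return `Just to confirm: ${change.action} for ${change.slot}. Is that right?`;
}

/** Commit the staged change only after the caller has said yes. */
function confirmChange(callSid, callerSaidYes, commit) {
  const session = getSession(callSid);
  const change = session.pendingChange;
  if (!change) return { committed: false, reason: "nothing-pending" };
  session.pendingChange = null;
  if (!callerSaidYes) return { committed: false, reason: "declined" };
  commit(change); // hypothetical hook: write to your booking database here
  session.committed.push(change);
  return { committed: true, change };
}

/** Drop the session when the stream closes. */
function endSession(callSid) {
  sessions.delete(callSid);
}
```

Staging the change separately from the write means a misheard date costs one extra question rather than a wrong booking.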

API surface

Endpoints you call:

  • POST /v1/numbers: claim a phone number for the voicebot.
  • PATCH /v1/numbers/{number_id}: set or update the routing_url (your voice webhook).
  • GET /v1/calls/{call_sid}: fetch the call detail record after the call ends, for logging and billing reconciliation.
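As a rough sketch, fetching the call detail record after hang-up might look like this. The host and Bearer header are borrowed from the curl example in this guide; that GET /v1/calls/{call_sid} accepts the same auth and returns JSON is an assumption, and the helper names are hypothetical. The response fields are not documented here, so inspect the payload before relying on any of them.

```javascript
// Sketch: build and send the call-detail request.
// ASSUMPTIONS: same host and Bearer auth as the PATCH example; JSON response.
const API_BASE = "https://api.sautikit.com";

/** Pure helper: returns the URL and fetch options for a call-detail lookup. */
function callDetailRequest(callSid, apiKey) {
  return {
    url: `${API_BASE}/v1/calls/${encodeURIComponent(callSid)}`,
    options: {
      method: "GET",
      headers: { Authorization: `Bearer ${apiKey}` },
    },
  };
}

/** Performs the request using the global fetch available in Node.js 18+. */
async function fetchCallDetail(callSid, apiKey) {
  const { url, options } = callDetailRequest(callSid, apiKey);
  const res = await fetch(url, options);
  if (!res.ok) {
    throw new Error(`GET ${url} failed with HTTP ${res.status}`);
  }
  return res.json(); // shape not documented in this guide
}
```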

Voice actions used:

  • Stream: fork live caller audio to your WebSocket and play audio back. This is the core of the real-time loop.
  • Say: optional TTS greeting before the stream opens.
  • Dial: optional human handoff — connect the caller to a receptionist when the AI hits something it can't resolve.

<Stream> is XML-only today. A native JSON form of the stream action is on the roadmap, but until it ships, return the XML <Response> shown below. The other voice actions are available in JSON as usual.

Example

1. The XML your webhook returns

When Sautikit POSTs to your routing_url, reply with this. It opens a bidirectional stream at 16 kHz, which gives speech models more audio detail to work with than 8 kHz telephony audio.

<Response>
  <Stream
    name="booking-agent"
    url="wss://your-app.example.com/audio"
    track="both_tracks"
    outputSamplingRate="16000"
    statusCallback="https://your-app.example.com/stream-status"
    statusEvents="stream-started stream-stopped stream-error" />
</Response>

Attribute notes:

  • url (required): your wss:// WebSocket endpoint.
  • track (required): inbound_track, outbound_track, or both_tracks. Use both_tracks so the agent hears the caller and its own output.
  • outputSamplingRate (required): 8000 or 16000. Use 16000 for AI models.
  • name (optional): a label echoed back in stream events.
  • statusCallback / statusEvents (optional): the URL that lifecycle events are POSTed to, and the space-separated list of events to send (stream-started, stream-stopped, stream-error).

You can also pass headerMetadata (a JSON blob sent as HTTP handshake headers, handy for auth) and openMetadata (opaque UTF-8 delivered in the first text frame).
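If your webhook builds the XML in code, the attribute rules above can be checked before you reply. A minimal sketch: buildStreamResponse and escapeXmlAttr are hypothetical helpers, not part of any Sautikit SDK, and the validation covers only the attributes and values listed in this guide (headerMetadata and openMetadata are left out because their encoding is not specified here).

```javascript
// Sketch: validate <Stream> attributes and render the XML <Response>.
const TRACKS = new Set(["inbound_track", "outbound_track", "both_tracks"]);
const RATES = new Set([8000, 16000]);

/** Escape a value for use inside a double-quoted XML attribute. */
function escapeXmlAttr(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

function buildStreamResponse(opts) {
  const { url, track, outputSamplingRate, name, statusCallback, statusEvents } = opts;
  if (!/^wss:\/\//.test(url ?? "")) {
    throw new Error("url must be a wss:// endpoint");
  }
  if (!TRACKS.has(track)) {
    throw new Error(`invalid track: ${track}`);
  }
  if (!RATES.has(outputSamplingRate)) {
    throw new Error(`invalid outputSamplingRate: ${outputSamplingRate}`);
  }
  const attrs = {
    name,
    url,
    track,
    outputSamplingRate,
    statusCallback,
    statusEvents: statusEvents?.join(" "), // array -> space-separated list
  };
  const rendered = Object.entries(attrs)
    .filter(([, value]) => value !== undefined && value !== "")
    .map(([key, value]) => `    ${key}="${escapeXmlAttr(value)}"`)
    .join("\n");
  return `<Response>\n  <Stream\n${rendered} />\n</Response>`;
}
```

Return the string from your webhook with an XML content type. Failing fast on a bad track or sampling rate is easier to debug than a stream that never opens.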

2. A WebSocket bridge sketch (Node.js)

This is the glue between the Sautikit socket and your LLM. Advertising the audio.drachtio.org subprotocol is mandatory.

import { WebSocketServer } from "ws";
import { connectLLM } from "./llm.js"; // your Gemini Live / OpenAI wrapper

const wss = new WebSocketServer({
  port: 8080,
  // MUST select this subprotocol or Sautikit rejects the handshake
  handleProtocols: (protocols) =>
    protocols.has("audio.drachtio.org") ? "audio.drachtio.org" : false,
});

wss.on("connection", (sautikit) => {
  // Open your model session (STT + reasoning + TTS)
  const llm = connectLLM({
    systemPrompt:
      "You are the booking agent for Whitedent Clinic, Nairobi. " +
      "Book, reschedule, or cancel appointments. Confirm the date and " +
      "time back to the caller before committing. Speak the caller's language.",
    onAudio: (pcm) => {
      // Model audio -> back into the call; skip if the socket has already closed
      if (sautikit.readyState === sautikit.OPEN) sautikit.send(pcm);
    },
  });

  sautikit.on("message", (data, isBinary) => {
    if (isBinary) {
      // Live caller audio: 16-bit LE signed PCM -> feed the model
      llm.pushAudio(data);
      return;
    }
    // First text frame carries openMetadata (call SID, etc.).
    // openMetadata is opaque UTF-8, so guard against non-JSON payloads.
    try {
      const meta = JSON.parse(data.toString());
      if (meta.call_sid) llm.setContext({ callSid: meta.call_sid });
    } catch (err) {
      console.warn("ignoring non-JSON text frame:", err.message);
    }
  });

  // Without an error listener, a socket error would crash the process
  sautikit.on("error", (err) => console.error("stream socket error:", err.message));
  sautikit.on("close", () => llm.close());
});

Your connectLLM wrapper is where booking tool-calls live: when the model decides to write a slot, call your calendar API, then have the model confirm the result to the caller.
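One way to structure those tool-calls is a small dispatcher that maps the model's tool name to a calendar operation. This is a sketch under assumptions: the tool names (book_appointment, reschedule_appointment, cancel_appointment), the argument shapes, and the in-memory calendar are all made up for illustration. Swap createCalendar for calls to your real booking system, and use whatever tool-call format your model provider defines.

```javascript
// Sketch: dispatch LLM tool-calls to booking operations.
// ASSUMPTION: the in-memory Map stands in for a real calendar API.
function createCalendar() {
  const bookings = new Map(); // slot (e.g. ISO date-time string) -> caller name
  return {
    book(slot, who) {
      if (bookings.has(slot)) return { ok: false, error: "slot-taken" };
      bookings.set(slot, who);
      return { ok: true, slot };
    },
    cancel(slot) {
      return bookings.delete(slot)
        ? { ok: true, slot }
        : { ok: false, error: "not-found" };
    },
    reschedule(fromSlot, toSlot) {
      if (!bookings.has(fromSlot)) return { ok: false, error: "not-found" };
      if (bookings.has(toSlot)) return { ok: false, error: "slot-taken" };
      bookings.set(toSlot, bookings.get(fromSlot));
      bookings.delete(fromSlot);
      return { ok: true, slot: toSlot };
    },
  };
}

/** Route one tool-call to the calendar; the result goes back to the model. */
function handleToolCall(calendar, { name, args }) {
  switch (name) {
    case "book_appointment":
      return calendar.book(args.slot, args.caller);
    case "reschedule_appointment":
      return calendar.reschedule(args.from, args.to);
    case "cancel_appointment":
      return calendar.cancel(args.slot);
    default:
      return { ok: false, error: `unknown-tool:${name}` };
  }
}
```

Returning a structured result, including failures such as slot-taken, lets the model tell the caller what happened and offer another time instead of confirming a booking that was never written.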

3. Attach the webhook to a number

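# Set NUMBER_ID to the ID of your number. A literal {number_id} in the URL
# would be treated by curl as a glob pattern rather than a placeholder.
curl -X PATCH "https://api.sautikit.com/v1/numbers/$NUMBER_ID" \
  -H "Authorization: Bearer $SAUTIKIT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"routing_url": "https://your-app.example.com/voice"}'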

Pricing notes

Sautikit bills the inbound call per second in KES for the time the call is live on the platform — nothing more. There is no per-minute "AI voice" surcharge and no fee for the number of audio frames or WebSocket bytes.

The AI cost — STT, the LLM, and TTS — is billed by your own provider (Gemini, OpenAI, or your self-hosted GPU bill). That separation is the point: you pay the telephony leg to Sautikit at raw per-second rates and the intelligence leg to whoever you chose, with no platform tax stacked between them. A three-minute booking call costs you three minutes of per-second inbound telephony plus whatever your model provider charges for three minutes of audio.
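The two-bill arithmetic can be made concrete with a small estimator. The rates below are placeholders chosen for easy arithmetic, not Sautikit's or any model provider's actual prices, and the sketch assumes both bills are expressed in the same currency's minor units (integer cents avoid floating-point rounding). estimateCallCost is a hypothetical helper.

```javascript
// Sketch: estimate the cost of one call as telephony leg + model leg.
// ASSUMPTION: both rates are per second, in integer minor units (cents).
function estimateCallCost({ durationSeconds, telephonyCentsPerSecond, modelCentsPerSecond }) {
  const telephony = durationSeconds * telephonyCentsPerSecond;
  const model = durationSeconds * modelCentsPerSecond;
  return { telephony, model, total: telephony + model };
}

// A three-minute booking call at placeholder rates.
const cost = estimateCallCost({
  durationSeconds: 180,
  telephonyCentsPerSecond: 2, // placeholder, not a published rate
  modelCentsPerSecond: 1, // placeholder, not a published rate
});
console.log(cost); // { telephony: 360, model: 180, total: 540 }
```

Because both terms scale with duration, shaving 30 seconds off the average call reduces both bills by the same proportion.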

Keep calls tight. Because both telephony and model usage are time-based, a snappy agent that confirms and hangs up is cheaper on both bills than one that pads with filler speech.

Next steps

  • Stream voice action: full attribute list and the frame format.
  • Build an AI voice engine with Gemini: end-to-end wiring of the stream to Gemini Live.
  • How to build an AI voice agent: prompt design, tool-calls, and handling interruptions.
  • AI receptionist use case: the same stream pattern applied to call answering and routing.
  • Voice actions concept: how the action-response loop works end to end.