Answer real phone calls with OpenAI: bridge the Realtime API to Sautikit

A caller dials your number. Instead of an IVR tree, a natural voice answers, listens, thinks, and replies in real time, on any phone, with no app to install. That is what you get when you bridge the OpenAI Realtime API to Sautikit's live audio stream, and this tutorial wires it end to end.

TL;DR

Sautikit's <Stream> voice action forks live call audio to your WebSocket as raw PCM frames; you relay them to the OpenAI Realtime API and write OpenAI's synthesized PCM back on the same socket to speak into the call.

The one real catch: OpenAI Realtime audio is 24 kHz pcm16, while Sautikit streams at your telephony rate (here 16 kHz). You must resample 16 kHz ⇄ 24 kHz in both directions.

<Stream> is returned as application/xml today (JSON stream support is on the roadmap); your WS server must accept the audio.drachtio.org subprotocol.

Chat and app-based assistants assume a smartphone, a data plan, and a download. A phone number assumes none of that. Anyone with a handset (a feature phone on a rural network, a landline, a roaming SIM) can reach an AI voice agent by dialing. For support lines, appointment booking, order status, or after-hours triage, that reach is the whole point: you meet callers where they already are.

The hard part has always been the audio pipe: getting live call audio out to an LLM and synthesized audio back in fast enough to feel like a conversation. Sautikit's Stream verb is that pipe.

Inbound call
  → Sautikit voice_callback returns RAW XML <Stream .../>
  → Sautikit opens a WebSocket to your Node 'ws' server (binary PCM in/out)
  → your server resamples + relays PCM ⇄ OpenAI Realtime API (WebSocket)
  → OpenAI's synthesized PCM is resampled back to 16 kHz and written to the Sautikit socket
  → audio plays into the live call

Two WebSockets, one bridge process. Sautikit is the telephony leg; OpenAI is the intelligence leg. Your server is the translator that keeps sample rates and framing aligned.

When a call connects, Sautikit fetches your voice_callback_url. For realtime AI you return raw XML (not the JSON actions form) with a <Stream> element. Set the Content-Type to application/xml.

import express from "express";
 
const app = express();
app.use(express.urlencoded({ extended: false }));
 
app.post("/voice", (req, res) => {
  // req.body includes From, Digits, etc. for JSON flows; here we go raw XML.
  const xml = `<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Stream
    name="openai"
    url="wss://your-server.example.com/openai"
    track="both_tracks"
    outputSamplingRate="16000"
    statusCallback="https://your-server.example.com/stream-status"
    statusEvents="stream-started stream-stopped stream-error" />
</Response>`;
 
  res.set("Content-Type", "application/xml");
  res.send(xml);
});
 
app.listen(3000);

track="both_tracks" forwards both the caller and any outbound audio; use inbound_track if you only want the caller's voice into OpenAI. outputSamplingRate="16000" tells Sautikit to deliver 16 kHz PCM. OpenAI Realtime runs at 24 kHz, so unlike a same-rate bridge you will resample between the two legs (covered below).

Sautikit connects to your url and requires the audio.drachtio.org WebSocket subprotocol. Reject the handshake if it is absent. Incoming messages are binary PCM frames.

import { WebSocketServer } from "ws";
import { openRealtimeSession } from "./openai.js";
import { resample } from "./resample.js"; // 16 kHz ⇄ 24 kHz, see below
 
const wss = new WebSocketServer({
  port: 8080,
  handleProtocols: (protocols) =>
    protocols.has("audio.drachtio.org") ? "audio.drachtio.org" : false,
});
 
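wss.on("connection", (sautiSocket) => {
  let openai = null; // set once the Realtime session is ready

  // Attach socket handlers before any async work, so an early error or
  // close on the call leg is never left unhandled.
  // Call → OpenAI: upsample each inbound 16 kHz frame to 24 kHz.
  sautiSocket.on("message", (data, isBinary) => {
    // Frames that arrive before the session is ready are dropped.
    if (isBinary && openai) openai.sendAudio(resample(data, 16000, 24000));
  });

  sautiSocket.on("close", () => openai?.close());
  sautiSocket.on("error", () => openai?.close());

  // One OpenAI Realtime session per call.
  openRealtimeSession({
    // OpenAI → call: downsample 24 kHz → 16 kHz, then write on the SAME socket.
    onAudio: (pcm24) => {
      if (sautiSocket.readyState === sautiSocket.OPEN) {
        sautiSocket.send(resample(pcm24, 24000, 16000)); // binary frame plays into the call
      }
    },
    // Barge-in: caller started talking, stop the current reply.
    onInterrupt: () => {
      // Optionally flush any buffered playback here.
    },
  })
    .then((session) => {
      // The call may have ended while the session was still opening.
      if (sautiSocket.readyState !== sautiSocket.OPEN) return session.close();
      openai = session;
    })
    .catch((err) => {
      console.error("OpenAI Realtime session failed:", err);
      sautiSocket.close();
    });
});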

The bridge stays thin: bytes in from the call are upsampled and sent to OpenAI, bytes out from OpenAI are downsampled and written back to the call. All the conversation logic lives inside the Realtime session.

The OpenAI Realtime API is itself a WebSocket: you connect with your model in the query string, send a session.update event to configure audio and behavior, then stream audio up as input_audio_buffer.append events and receive synthesized audio down as response.output_audio.delta events. Model IDs, the exact event/field names, and whether a beta header is required all move fast; check the current platform.openai.com/docs Realtime docs before shipping. The pattern below is stable.

import WebSocket from "ws";
 
// NOTE: model id and event/field names change between the beta and GA
// interfaces; verify against the current platform.openai.com Realtime docs.
// The GA interface drops the old `OpenAI-Beta: realtime=v1` header; the
// beta interface required it. Pick the model that is current when you build.
const MODEL = "gpt-realtime"; // ← from platform.openai.com/docs
const OPENAI_URL = `wss://api.openai.com/v1/realtime?model=${MODEL}`;
 
export async function openRealtimeSession({ onAudio, onInterrupt }) {
  const ws = new WebSocket(OPENAI_URL, {
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      // If you target the beta interface, also send:
      // "OpenAI-Beta": "realtime=v1"
    },
  });
 
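  // Wait for the socket to open; reject on a failed connection (bad key,
  // network error) instead of waiting forever.
  await new Promise((resolve, reject) => {
    ws.once("open", resolve);
    ws.once("error", reject);
  });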
 
  // 1) Configure the session: pcm16 audio, a voice, instructions, server VAD.
  ws.send(
    JSON.stringify({
      type: "session.update",
      session: {
        // Field layout differs between beta and GA — confirm against docs.
        instructions: "You are a concise phone support agent.",
        voice: "alloy",
        input_audio_format: "pcm16",
        output_audio_format: "pcm16",
        // Server-side turn detection so OpenAI decides when the caller
        // stopped talking and replies on its own.
        turn_detection: { type: "server_vad" },
      },
    })
  );
 
  ws.on("message", (raw) => {
    const msg = JSON.parse(raw.toString());
 
    // 2) Synthesized audio out → resample + play into the call.
    // GA emits response.output_audio.delta; the beta name was response.audio.delta.
    if (
      msg.type === "response.output_audio.delta" ||
      msg.type === "response.audio.delta"
    ) {
      onAudio(Buffer.from(msg.delta, "base64")); // base64 pcm16 @ 24 kHz
    }
 
    // 3) Barge-in: OpenAI detected the caller talking over the agent.
    if (msg.type === "input_audio_buffer.speech_started") {
      onInterrupt();
      // Optionally stop the in-progress reply explicitly:
      ws.send(JSON.stringify({ type: "response.cancel" }));
    }
  });
 
  return {
    // Send inbound call PCM up as a base64 pcm16 append (already 24 kHz here).
    sendAudio(pcm24) {
      ws.send(
        JSON.stringify({
          type: "input_audio_buffer.append",
          audio: pcm24.toString("base64"),
        })
      );
    },
    close() {
      if (ws.readyState === ws.OPEN) ws.close();
    },
  };
}

The load-bearing details: configure pcm16 in and out via session.update, append caller audio as base64 with input_audio_buffer.append, and decode the base64 audio from response.output_audio.delta (the GA event; the beta name was response.audio.delta) before forwarding it. The stream ends on response.output_audio.done / response.done. Everything else is prompt and policy.

This is the one place the OpenAI bridge is not a plain byte pump. Mismatched sample rates are the number-one cause of chipmunk or slow-motion audio, and OpenAI and Sautikit do not agree by default:

Frames arriving from Sautikit are 16 kHz mono pcm16 (because you set outputSamplingRate="16000"). OpenAI expects 24 kHz, so you must upsample 16 kHz → 24 kHz before input_audio_buffer.append.
Frames from response.output_audio.delta are 24 kHz pcm16. Sautikit plays back at 16 kHz, so you must downsample 24 kHz → 16 kHz before writing to the Sautikit socket.

Both legs are mono 16-bit little-endian PCM, so only the rate changes; the sample format does not. Do the conversion with a real resampler rather than naive sample dropping/duplication, which introduces aliasing and audible artifacts. A small worker that wraps a resampling library (or an ffmpeg/sox subprocess in the pipeline) is enough.

The snippet below is illustrative — treat resample() as the seam where you plug in your chosen resampler, and verify its exact API against that library's docs.

// resample.js — ILLUSTRATIVE. Wire in a real resampling library here.
// Input/output are Buffers of 16-bit LE mono PCM. `from`/`to` are Hz.
export function resample(pcmBuffer, from, to) {
  if (from === to) return pcmBuffer;
  // Convert Buffer → Int16 samples, run a proper polyphase/interpolating
  // resampler, then convert back to a 16-bit LE Buffer. Do NOT just drop or
  // duplicate samples — use a real DSP resampler to avoid aliasing.
  // runResampler is a placeholder name: replace it with your library's call.
  return runResampler(pcmBuffer, from, to);
}

If you ever set outputSamplingRate="8000" on the <Stream>, the conversion becomes 8 kHz ⇄ 24 kHz instead — the same requirement, different ratio.
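While you evaluate resampling libraries, the runResampler placeholder could be filled with plain linear interpolation. The sketch below is an assumption for illustration, not a method recommended by Sautikit or OpenAI: it has no low-pass filter, so the 24 kHz → 16 kHz direction can alias, and it treats each frame independently. It is enough to prove the pipe works end to end; swap in a proper polyphase resampler before production.

```javascript
// Hypothetical stand-in for runResampler(): linear interpolation over
// 16-bit little-endian mono PCM. No anti-aliasing filter; smoke tests only.
function runResampler(pcmBuffer, from, to) {
  const inCount = Math.floor(pcmBuffer.length / 2); // 2 bytes per sample
  if (inCount === 0) return Buffer.alloc(0);

  const outCount = Math.floor((inCount * to) / from);
  const out = Buffer.alloc(outCount * 2);
  const step = from / to; // input samples advanced per output sample

  for (let i = 0; i < outCount; i++) {
    const pos = i * step;
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, inCount - 1); // clamp at the last sample
    const frac = pos - i0;
    const s0 = pcmBuffer.readInt16LE(i0 * 2);
    const s1 = pcmBuffer.readInt16LE(i1 * 2);
    // Interpolating between two int16 values always stays in int16 range.
    out.writeInt16LE(Math.round(s0 + (s1 - s0) * frac), i * 2);
  }
  return out;
}
```

As a sanity check on sizes: a 20 ms frame at 16 kHz is 320 samples (640 bytes), and upsampled to 24 kHz it becomes 480 samples (960 bytes). That 2:3 ratio should hold for every frame you relay.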

Natural conversation means the caller can talk over the agent. OpenAI's Realtime API detects this with server-side VAD and emits input_audio_buffer.speech_started the moment the caller starts speaking. On the GA interface it will automatically cancel the in-progress response; you can also send response.cancel to be explicit. When you see the signal, stop feeding queued outbound chunks into the call so the caller is not talking over stale audio. Because playback flows through the Sautikit socket you control, dropping buffered outbound PCM on interrupt is enough to make the agent feel responsive.

This is the OpenAI analogue of Gemini's serverContent.interrupted: same intent, different event name.
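Dropping buffered outbound PCM only works if your bridge holds a buffer to drop. One way to get one is a small paced queue between the resampler and the Sautikit socket, sketched below. PlaybackQueue, its send callback, and the 20 ms frame size are all hypothetical names and values chosen for illustration; the source does not state what frame size Sautikit expects or how much audio it buffers on its side, so treat both as things to verify.

```javascript
// Hypothetical outbound queue: paces 16 kHz PCM into the call in fixed-size
// frames and can be emptied instantly when the caller barges in.
class PlaybackQueue {
  constructor(send, { sampleRate = 16000, frameMs = 20 } = {}) {
    this.send = send; // e.g. (frame) => sautiSocket.send(frame)
    this.frameMs = frameMs;
    this.frameBytes = ((sampleRate * frameMs) / 1000) * 2; // 16-bit mono
    this.pending = Buffer.alloc(0);
    this.timer = null;
  }

  // Append audio and start the pacing timer if it is not already running.
  enqueue(pcm) {
    this.pending = Buffer.concat([this.pending, pcm]);
    if (!this.timer) this.timer = setInterval(() => this.tick(), this.frameMs);
  }

  // Send one frame (or the final partial frame); stop when nothing is left.
  tick() {
    if (this.pending.length === 0) return this.stop();
    const frame = this.pending.subarray(0, this.frameBytes);
    this.pending = this.pending.subarray(frame.length);
    this.send(frame);
  }

  // Barge-in: discard everything not yet sent.
  flush() {
    this.pending = Buffer.alloc(0);
    this.stop();
  }

  stop() {
    clearInterval(this.timer);
    this.timer = null;
  }
}
```

In the bridge, onAudio would call queue.enqueue() with the downsampled buffer instead of writing to the socket directly, and onInterrupt would call queue.flush(). Audio already handed to the socket cannot be recalled, which is why the queue sends small frames rather than whole replies.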

The statusEvents="stream-started stream-stopped stream-error" on your <Stream> element tells Sautikit to POST lifecycle events to your statusCallback. Each carries a callSessionState of StreamStarted, StreamStopped, or StreamError plus a streamSid.

app.post("/stream-status", express.json(), (req, res) => {
  const { callSessionState, streamSid } = req.body;
  console.log(`[stream] ${streamSid} → ${callSessionState}`);
  // StreamError → alert; StreamStopped → tear down the OpenAI session.
  res.sendStatus(200);
});

Use StreamStarted to confirm the pipe is up, StreamError to page yourself, and StreamStopped to close the matching OpenAI Realtime session and free resources.
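Closing the matching session implies a lookup from streamSid to session. A minimal sketch of that lookup follows, with one loud assumption: it presumes your bridge can learn the streamSid for each WebSocket connection, which this article does not show and which you should confirm in the Sautikit docs. The names registerSession and handleStreamStatus are hypothetical.

```javascript
// Hypothetical registry: maps a streamSid to its open Realtime session.
const sessions = new Map();

// Call this once you know which streamSid a bridge connection belongs to.
function registerSession(streamSid, session) {
  sessions.set(streamSid, session);
}

// Handle one status callback body. Returns true if a session was closed.
function handleStreamStatus({ callSessionState, streamSid }) {
  if (callSessionState === "StreamError") {
    console.error(`[stream] ${streamSid} reported an error`); // hook alerting here
    return false;
  }
  if (callSessionState !== "StreamStopped") return false;

  const session = sessions.get(streamSid);
  if (!session) return false; // already closed, or never registered
  session.close();
  sessions.delete(streamSid);
  return true;
}
```

The /stream-status route would pass req.body to handleStreamStatus. Because the WebSocket close handler also closes the session, this acts as a second line of defense and is safe to call more than once for the same streamSid.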

Do I need a special endpoint for realtime streaming?

No. You reuse the same voice_callback_url as any Sautikit call. The difference is you return raw <Stream> XML with Content-Type: application/xml instead of the JSON actions array.

Why must the WebSocket accept the audio.drachtio.org subprotocol?

Sautikit negotiates that subprotocol when it opens the socket. If your ws server does not offer it back during the handshake, the connection is refused and no audio flows. Confirm it in your handleProtocols callback.

Do I really have to resample?

Yes. OpenAI Realtime pcm16 is 24 kHz, and Sautikit's <Stream> runs at your outputSamplingRate (8 kHz or 16 kHz). Because the two rates differ, you must resample in both directions or the audio plays back at the wrong speed. This is the main way the OpenAI bridge differs from a same-rate engine.

Can the AI voice agent both listen and speak on one connection?

Yes. <Stream> is bidirectional: audio Sautikit sends you is the caller; binary PCM you send back on the same socket is played into the call. You never open a second connection to Sautikit.

What does an AI voice call cost?

Standard voice pricing applies: inbound calls are free (KES 0) and outbound bills at KES 3.00/min, billed per second from the moment the call connects. There is no per-minute AI tax on top from Sautikit — the OpenAI Realtime usage is billed separately, on your own OpenAI bill, so the model cost stays fully in your control. See /pricing for the source of truth.

Will there be a JSON version of <Stream>?

Yes, it is on the roadmap. Today <Stream> is returned as application/xml; a JSON form embeddable in the { actions: [...] } response is coming.

Create a Sautikit workspace and claim a number (from KES 116, instant).
Top up over M-Pesa; no card required.
Deploy your ws bridge, point a number's voice_callback_url at the webhook above, and dial in to talk to OpenAI.

Voice actions reference: every verb, including <Stream> attributes.
Bridge Gemini Live to Sautikit: the same pattern with Google's Gemini Live API, where both legs stay at 16 kHz.
