Voice Actions DSL

The Sautikit VoiceAction DSL: a JSON verb set for controlling call flow returned from your webhook handler.

2026-06-27

Voice Actions are the JSON DSL Sautikit uses to control call flow. When a call arrives or a step completes, Sautikit POSTs the call state to your voice_callback_url. Your server returns a JSON object with an actions array of verbs (Say, Play, GetDigits, Dial, Conference, Record, Redirect, Reject, Hangup, or Stream), which the platform executes in order.

Voice actions can be expressed in two forms:

JSON form: the native VoiceAction DSL described in this document. Sautikit parses and validates it before translating it to PBX XML, so you get clear, Sautikit-level errors when a flow is malformed. This is the format to build on.
XML form: a raw XML body Sautikit forwards to the PBX byte-for-byte, without parsing or validating it. If you are coming from Twilio (TwiML) or Africa's Talking, this is the form you already write: your existing voice XML works as-is, and it also covers the handful of cases the JSON DSL does not yet express. The trade-off: you get PBX-level errors, not Sautikit validation errors, when it is wrong.

Both forms are shown as tabs on each verb's reference page, with JSON as the default tab.

Runtime note: JSON is the format Sautikit parses and validates. Raw XML you return from your voice URL is forwarded to the PBX unchanged; validation, if any, happens at the PBX, not at Sautikit. Return JSON in your webhook responses unless you have a specific reason to hand the PBX raw XML.

The JSON DSL exists so your server can construct call flows with standard JSON libraries, receive type-checked errors from the Sautikit validator, and stay decoupled from the underlying FreeSWITCH XML dialect. If you are moving from an XML-based voice API, the verb names map closely. The main differences are that Sautikit names DTMF collection getDigits (TwiML calls it <Gather>) and that conference is a top-level verb rather than something nested inside dial (see the migration table near the end of this page).

{
  "actions": [
    { "say":  { "text": "Habari, karibu Sautikit." } },
    { "getDigits": { "timeout": 5, "numDigits": 1,
        "nested": [{ "say": { "text": "Bonyeza 1 kwa Kiswahili, 2 kwa English." } }]
    }},
    { "hangup": {} }
  ]
}

actions is ordered: verbs execute top to bottom.
Maximum 50 verbs per response (MaxActionSteps).
Exactly one verb key per object; all other keys must be absent.
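
The three rules above can be checked locally before a response leaves your server. The following is a minimal Python sketch, not Sautikit's own validator: check_actions and VERB_KEYS are names made up for this example, and the function only checks the rules listed on this page (array shape, the 50-verb limit, one known verb key per object).

```python
MAX_ACTION_STEPS = 50  # limit stated above (MaxActionSteps)

# Verb keys taken from the verb table on this page.
VERB_KEYS = {"say", "play", "getDigits", "dial", "conference",
             "record", "redirect", "reject", "hangup", "stream"}


def check_actions(response: dict) -> list[str]:
    """Return a list of problems; an empty list means the basic rules hold."""
    actions = response.get("actions")
    if not isinstance(actions, list):
        return ["'actions' must be an array"]

    problems = []
    if len(actions) > MAX_ACTION_STEPS:
        problems.append(
            f"{len(actions)} verbs exceeds the limit of {MAX_ACTION_STEPS}")

    for i, step in enumerate(actions):
        # Each step must be an object with exactly one key: the verb.
        if not isinstance(step, dict) or len(step) != 1:
            problems.append(f"step {i}: expected exactly one verb key")
            continue
        (key,) = step  # unpack the single key
        if key not in VERB_KEYS:
            problems.append(f"step {i}: unknown verb {key!r}")
    return problems
```

Running a check like this in your own tests catches structural mistakes early; the platform's validator remains the authority on what is accepted.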

Verb	Key	Purpose
Say	`say`	Synthesise text to the caller via TTS
Play	`play`	Stream an audio file URL to the caller
GetDigits	`getDigits`	Collect DTMF keypad input with an optional prompt
Dial	`dial`	Connect the caller to a phone number or SIP URI
Conference	`conference`	Place the caller in a named conference room
Record	`record`	Record caller audio and POST the file URL to your action endpoint
Redirect	`redirect`	Transfer call flow to another URL
Reject	`reject`	Reject an inbound call with a "rejected" or "busy" signal
Hangup	`hangup`	End the call immediately
Stream	`stream`	Fork live call audio to a WebSocket for real-time AI (transcription, LLM voice agents such as Google Gemini)

{
  "say": {
    "text": "Your OTP is 4 8 2 1.",
    "voice": "alice",
    "language": "en-US",
    "loop": 1
  }
}

voice and language default to PBX defaults when omitted. loop: 0 means play once.
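
The sample text above writes the code as "4 8 2 1", presumably so the TTS engine reads each digit separately. A small Python helper can produce that spacing; say_otp is a hypothetical name for this sketch, and it leaves voice unset so the PBX default applies.

```python
def say_otp(code: str, language: str = "en-US") -> dict:
    """Build a say verb that spells out a numeric code digit by digit."""
    if not (code.isascii() and code.isdigit()):
        raise ValueError("code must contain ASCII digits only")
    spaced = " ".join(code)  # "4821" -> "4 8 2 1"
    return {"say": {"text": f"Your OTP is {spaced}.", "language": language}}
```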

{
  "play": {
    "url": "https://cdn.example.com/hold-music.mp3",
    "loop": 0
  }
}

The URL must resolve to a host on your workspace's CDN allow-list.
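
Because the platform enforces the allow-list, it can help to mirror the check on your side and fail before the response is sent. This Python sketch assumes you keep a copy of the allowed hosts in your own configuration; ALLOWED_HOSTS and play_verb are illustrative names, and the single host listed is only the one from the example above.

```python
from urllib.parse import urlparse

# Hypothetical local copy of the workspace allow-list.
ALLOWED_HOSTS = {"cdn.example.com"}


def play_verb(url: str, loop: int = 0) -> dict:
    """Build a play verb, refusing URLs whose host is not allow-listed."""
    host = urlparse(url).hostname  # lower-cased by urlparse; None if absent
    if host not in ALLOWED_HOSTS:
        raise ValueError(f"host {host!r} is not on the allow-list")
    return {"play": {"url": url, "loop": loop}}
```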

{
  "getDigits": {
    "timeout": 5,
    "numDigits": 1,
    "finishOnKey": "#",
    "nested": [
      { "say": { "text": "Press 1 for sales, 2 for support." } }
    ]
  }
}

nested verbs (Say or Play) play while waiting for input. finishOnKey defaults to #. When digits are collected the platform POSTs back to your voice_callback_url with the Digits field populated.
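
A handler for the follow-up callback can branch on the Digits field. The Python sketch below takes the already-decoded callback fields as a dict, since this page does not specify how the callback body is encoded or what arrives when the caller presses nothing; it treats a missing or unrecognised value as a reason to repeat the prompt. The support number +254722000002 and the function name next_actions are invented for illustration.

```python
MENU = {"getDigits": {
    "timeout": 5,
    "numDigits": 1,
    "nested": [{"say": {"text": "Press 1 for sales, 2 for support."}}],
}}

# Hypothetical destinations for each menu choice.
DESTINATIONS = {
    "1": ("sales", "+254722000001"),
    "2": ("support", "+254722000002"),
}


def next_actions(fields: dict) -> dict:
    """Map the posted Digits value to the next set of verbs."""
    choice = DESTINATIONS.get(fields.get("Digits", ""))
    if choice is None:
        # No input or an unknown key: play the menu again.
        return {"actions": [MENU]}
    team, number = choice
    return {"actions": [
        {"say": {"text": f"Connecting you to {team}."}},
        {"dial": {"number": number}},
    ]}
```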

{
  "dial": {
    "number": "+254722000001",
    "callerId": "+254700000001",
    "timeout": 30,
    "record": "record-from-answer"
  }
}

Use sip instead of number to dial a SIP URI ("sip": "sip:alice@pbx.example.com"). number and sip are mutually exclusive.
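
One way to keep number and sip from appearing together is to build the verb through a helper that accepts exactly one of them. This is a Python sketch with a hypothetical dial_verb function; rejecting the case where neither is given is an assumption of the sketch, since the text above only states that the two are mutually exclusive.

```python
from typing import Optional


def dial_verb(*, number: Optional[str] = None, sip: Optional[str] = None,
              caller_id: Optional[str] = None, timeout: int = 30) -> dict:
    """Build a dial verb targeting either a phone number or a SIP URI."""
    if (number is None) == (sip is None):
        raise ValueError("provide exactly one of number or sip")
    body = {"number": number} if number is not None else {"sip": sip}
    if caller_id is not None:
        body["callerId"] = caller_id
    body["timeout"] = timeout
    return {"dial": body}
```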

{
  "conference": {
    "name": "weekly-team-call",
    "maxParticipants": 10,
    "record": true,
    "beep": true,
    "waitUrl": "https://cdn.example.com/hold.mp3",
    "statusEventsCallbackUrl": "https://ivr.example.com/conference-events",
    "statusEvents": "start end join leave"
  }
}

{
  "record": {
    "action": "https://ivr.example.com/recording-done",
    "method": "POST",
    "timeout": 5,
    "maxLength": 120,
    "finishOnKey": "#",
    "transcribe": false
  }
}

{
  "redirect": {
    "url": "https://ivr.example.com/after-hours",
    "method": "POST"
  }
}

Transfers call flow to a new URL that returns its own VoiceAction response. Use this to implement menus without a single monolithic handler.
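
As an illustration of splitting a flow across handlers, the entry point can do nothing but choose which URL takes over. The Python sketch below assumes business hours of 08:00 to 17:00 and a main-menu URL, both invented for the example; only the after-hours URL comes from the sample above.

```python
from datetime import time

OPEN, CLOSE = time(8, 0), time(17, 0)  # hypothetical business hours

MAIN_MENU_URL = "https://ivr.example.com/main-menu"      # hypothetical
AFTER_HOURS_URL = "https://ivr.example.com/after-hours"  # from the example


def entry_actions(now: time) -> dict:
    """Hand the call to the handler that matches the time of day."""
    url = MAIN_MENU_URL if OPEN <= now < CLOSE else AFTER_HOURS_URL
    return {"actions": [{"redirect": {"url": url, "method": "POST"}}]}
```

Each target URL then returns its own actions array, so no single handler needs to know the whole menu tree.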

{
  "reject": { "reason": "busy" }
}

reason is "rejected" (default) or "busy". Use busy to simulate a busy signal rather than a hard rejection.

{ "hangup": {} }

Ends the call. No parameters required.

Stream forks the live call's audio to a WebSocket endpoint as raw binary PCM frames, so an external service (a transcriber, or an LLM voice agent such as Google Gemini Live) can process the audio in real time. In bidirectional mode, your WebSocket server plays audio back into the call by sending binary PCM frames on the same socket, which makes full-duplex AI voice agents possible.

XML form (forward this via the raw-XML response today; see the note below):

<Response>
  <Stream
    name="gemini-agent"
    url="wss://your-service.example.com/audio"
    track="both_tracks"
    outputSamplingRate="16000"
    statusEvents="stream-started stream-stopped stream-error" />
</Response>

JSON form:

{
  "stream": {
    "name": "gemini-agent",
    "url": "wss://your-service.example.com/audio",
    "track": "both_tracks",
    "outputSamplingRate": 16000,
    "statusEvents": "stream-started stream-stopped stream-error"
  }
}

Attribute	Required	Purpose
`url`	yes	The `ws://` or `wss://` endpoint that receives the audio frames
`track`	yes	`inbound_track`, `outbound_track`, or `both_tracks` (mono mix)
`outputSamplingRate`	yes	PCM sample rate sent to the socket and expected back: `8000` (PSTN) or `16000` Hz
`inputSamplingRate`	no	Informational hint for the source rate (e.g. `8000`, `16000`, `48000`)
`name`	no	Friendly identifier echoed back in stream status events
`headerMetadata`	no	Flat JSON object of string key/value pairs sent as WebSocket handshake headers (auth tokens, tenant IDs)
`openMetadata`	no	Opaque UTF-8 data delivered in the first WebSocket text frame
`statusCallback`	no	URL Sautikit POSTs fire-and-forget stream status notifications to
`statusEvents`	no	Space-separated subset of `stream-started`, `stream-stopped`, `stream-error`

Your WebSocket server must accept the audio.drachtio.org subprotocol. Stream status events carry a callSessionState of StreamStarted, StreamStopped, or StreamError, plus a streamSid and the negotiated sampling rates.
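
The attribute table above translates directly into a pre-flight check on the stream object. This Python sketch covers only the constraints stated in the table (required attributes, URL scheme, allowed tracks, output rates, status event names, flat string metadata); check_stream is a hypothetical helper, not part of any Sautikit SDK.

```python
from urllib.parse import urlparse

TRACKS = {"inbound_track", "outbound_track", "both_tracks"}
OUTPUT_RATES = {8000, 16000}
STATUS_EVENTS = {"stream-started", "stream-stopped", "stream-error"}


def check_stream(attrs: dict) -> list[str]:
    """Return problems found in a stream verb's attributes (empty if none)."""
    problems = []
    for required in ("url", "track", "outputSamplingRate"):
        if required not in attrs:
            problems.append(f"missing required attribute {required!r}")
    if "url" in attrs and urlparse(attrs["url"]).scheme not in ("ws", "wss"):
        problems.append("url must use ws:// or wss://")
    if "track" in attrs and attrs["track"] not in TRACKS:
        problems.append(f"unknown track {attrs['track']!r}")
    if ("outputSamplingRate" in attrs
            and attrs["outputSamplingRate"] not in OUTPUT_RATES):
        problems.append("outputSamplingRate must be 8000 or 16000")
    # statusEvents is a space-separated list of event names.
    unknown = set(attrs.get("statusEvents", "").split()) - STATUS_EVENTS
    if unknown:
        problems.append(f"unknown status events: {sorted(unknown)}")
    meta = attrs.get("headerMetadata", {})
    if not (isinstance(meta, dict) and all(
            isinstance(k, str) and isinstance(v, str)
            for k, v in meta.items())):
        problems.append("headerMetadata must be a flat object of string pairs")
    return problems
```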

If you are moving from a PBX XML dialect, the following table shows the equivalent JSON key for each common XML verb:

XML verb	Sautikit VoiceAction	Notes
`<Say>`	`say`	Identical semantics
`<Play>`	`play`	Identical semantics
`<Gather>` / `<GetDigits>`	`getDigits`	Collects DTMF; uses `numDigits` and `finishOnKey`
`<Dial>`	`dial`	Supports `number` or `sip` key
`<Conference>`	`conference`	Top-level verb in Sautikit (not nested inside dial)
`<Record>`	`record`	Identical semantics
`<Redirect>`	`redirect`	Identical semantics
`<Reject>`	`reject`	Identical semantics
`<Hangup>`	`hangup`	Identical semantics
`<Stream>`	`stream`	Real-time media fork to a WebSocket; available via the XML form today
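
When planning a migration, the table above can serve as a lookup to inventory which JSON verbs an existing XML document would need. The Python sketch below is not a converter: it reads only the direct children of the root element, ignores attributes, and will not see verbs nested inside other verbs (for example a conference inside a dial). The function name json_keys_for is made up for this example.

```python
import xml.etree.ElementTree as ET

# Tag-to-key pairs copied from the migration table.
XML_TO_JSON_KEY = {
    "Say": "say", "Play": "play",
    "Gather": "getDigits", "GetDigits": "getDigits",
    "Dial": "dial", "Conference": "conference", "Record": "record",
    "Redirect": "redirect", "Reject": "reject", "Hangup": "hangup",
    "Stream": "stream",
}


def json_keys_for(xml_body: str) -> list[str]:
    """List the VoiceAction key for each top-level verb in an XML response."""
    root = ET.fromstring(xml_body)
    return [XML_TO_JSON_KEY.get(child.tag, f"unmapped:{child.tag}")
            for child in root]
```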

For hold queues, use a conference room with startOnEnter: false rather than a dedicated queue verb.

If your voice_callback_url returns a non-2xx status or does not respond within 10 seconds, the platform hangs up the call. Always respond quickly; do any heavy work asynchronously after returning the initial action set.
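
One common way to stay inside the response window is to enqueue slow work and return the actions immediately. The Python sketch below shows that pattern with the standard library only; handle_callback, slow_work, and the in-memory queue are illustrative stand-ins, and a production service would more likely hand off to a durable job queue.

```python
import json
import queue
import threading

jobs: "queue.Queue[dict]" = queue.Queue()
processed: list[dict] = []  # stands in for a database or CRM write


def slow_work(call_state: dict) -> None:
    """Placeholder for anything slow: lookups, logging, analytics."""
    processed.append(call_state)


def worker() -> None:
    """Drain the queue in the background, one call state at a time."""
    while True:
        call_state = jobs.get()
        try:
            slow_work(call_state)
        finally:
            jobs.task_done()


def handle_callback(call_state: dict) -> str:
    """Queue the slow part, then return the action set right away."""
    jobs.put(call_state)
    return json.dumps({"actions": [
        {"say": {"text": "Habari, karibu Sautikit."}},
        {"hangup": {}},
    ]})


threading.Thread(target=worker, daemon=True).start()
```

The web framework that calls handle_callback is left out on purpose; the point is that the response body is produced without waiting for slow_work to finish.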

If the platform cannot parse your JSON (unknown verb, too many actions, malformed body), it logs the error and hangs up.