Voice Actions DSL
The Sautikit VoiceAction DSL: a set of JSON verbs, returned from your webhook handler, that controls call flow.
Voice Actions are the JSON DSL Sautikit uses to control call flow. When a call arrives or a step completes, Sautikit POSTs the call state to your voice_callback_url. Your server returns a JSON object with an actions array of verbs (Say, Play, GetDigits, Dial, Conference, Record, Redirect, Reject, or Hangup), which the platform executes in order.
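The request/response cycle described above can be sketched with Python's standard library alone. This is a minimal sketch, not a reference implementation: the `build_actions` and `serve` helpers, the port, and the `application/json` content type are assumptions made for illustration. The callback body is read but not interpreted, because its encoding is not described here.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def build_actions():
    """Return a VoiceAction response; verbs run in the order listed."""
    return {
        "actions": [
            {"say": {"text": "Habari, karibu Sautikit."}},
            {"hangup": {}},
        ]
    }


class VoiceCallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Drain the call-state body that Sautikit POSTs (format assumed, not parsed).
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)

        body = json.dumps(build_actions()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")  # assumed content type
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Run the handler locally; the port is an arbitrary choice."""
    HTTPServer(("", port), VoiceCallbackHandler).serve_forever()
```

Calling `serve()` starts the listener; point your voice_callback_url at it to try the flow.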
Voice actions can be expressed in two forms: JSON and XML.
Both forms are shown as tabs on each verb's reference page, with JSON as the default tab.
Runtime note: JSON is the format Sautikit parses and validates. Raw XML you return from your voice URL is forwarded to the PBX unchanged; validation, if any, happens at the PBX, not at Sautikit. Return JSON in your webhook responses unless you have a specific reason to hand the PBX raw XML.
Sautikit accepts two response formats from your voice callback: JSON and raw XML.
The JSON DSL exists so your server can construct call flows with standard JSON libraries, receive type-checked errors from the Sautikit validator, and stay decoupled from the underlying FreeSWITCH XML dialect. If you are moving from an XML-based voice API, the verb names map closely. The main difference is that Sautikit uses getDigits for DTMF collection (see the verb table below).
{
"actions": [
{ "say": { "text": "Habari, karibu Sautikit." } },
{ "getDigits": { "timeout": 5, "numDigits": 1,
"nested": [{ "say": { "text": "Bonyeza 1 kwa Kiswahili, 2 kwa English." } }]
}},
{ "hangup": {} }
]
}

actions is ordered: verbs execute top to bottom.

The number of actions in a single response is capped (MaxActionSteps).

| Verb | Key | Purpose |
|---|---|---|
| Say | say | Synthesise text to the caller via TTS |
| Play | play | Stream an audio file URL to the caller |
| GetDigits | getDigits | Collect DTMF keypad input with an optional prompt |
| Dial | dial | Connect the caller to a phone number or SIP URI |
| Conference | conference | Place the caller in a named conference room |
| Record | record | Record caller audio and POST the file URL to your action endpoint |
| Redirect | redirect | Transfer call flow to another URL |
| Reject | reject | Reject an inbound call with "rejected" or "busy" signal |
| Hangup | hangup | End the call immediately |
| Stream | stream | Fork live call audio to a WebSocket for real-time AI (transcription, LLM voice agents such as Google Gemini) |
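Because each verb is a plain JSON object, a call flow can be assembled from small helper functions and serialised with a standard JSON library. The helper names below (`say`, `get_digits`, `hangup`, `response`) are hypothetical conveniences, not part of any Sautikit SDK; the verb keys and parameters are the ones documented on this page.

```python
import json


def say(text, **options):
    """Build a Say verb; options may carry voice, language, or loop."""
    return {"say": {"text": text, **options}}


def get_digits(prompt_verbs, timeout=5, num_digits=1):
    """Build a GetDigits verb whose nested verbs play while waiting for input."""
    return {
        "getDigits": {
            "timeout": timeout,
            "numDigits": num_digits,
            "nested": list(prompt_verbs),
        }
    }


def hangup():
    """Build a Hangup verb, which takes no parameters."""
    return {"hangup": {}}


def response(*verbs):
    """Wrap verbs in the top-level object; list order is execution order."""
    return {"actions": list(verbs)}


menu = response(
    say("Habari, karibu Sautikit."),
    get_digits([say("Bonyeza 1 kwa Kiswahili, 2 kwa English.")]),
    hangup(),
)
print(json.dumps(menu, indent=2, ensure_ascii=False))
```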
{
"say": {
"text": "Your OTP is 4 8 2 1.",
"voice": "alice",
"language": "en-US",
"loop": 1
}
}

voice and language default to PBX defaults when omitted. loop: 0 means play once.
{
"play": {
"url": "https://cdn.example.com/hold-music.mp3",
"loop": 0
}
}

The URL must resolve to a host on your workspace's CDN allow-list.
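One way to catch allow-list mistakes before the platform does is to check the host when building the verb. This is only a client-side sketch: the `ALLOWED_HOSTS` set and the `play` helper are hypothetical, and the real allow-list is whatever is configured for your workspace.

```python
from urllib.parse import urlparse

# Hypothetical local copy of the workspace allow-list.
ALLOWED_HOSTS = {"cdn.example.com"}


def play(url, loop=0, allowed_hosts=ALLOWED_HOSTS):
    """Build a Play verb, refusing URLs whose host is not allow-listed."""
    host = urlparse(url).hostname
    if host not in allowed_hosts:
        raise ValueError(f"host {host!r} is not on the allow-list")
    return {"play": {"url": url, "loop": loop}}
```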
{
"getDigits": {
"timeout": 5,
"numDigits": 1,
"finishOnKey": "#",
"nested": [
{ "say": { "text": "Press 1 for sales, 2 for support." } }
]
}
}

nested verbs (Say or Play) play while waiting for input. finishOnKey defaults to #. When digits are collected the platform POSTs back to your voice_callback_url with the Digits field populated.
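The callback that follows digit collection can be handled by branching on the Digits field. The sketch below assumes the callback payload has already been parsed into a dict named `fields` (the payload encoding is not specified here), and the support URL is a hypothetical endpoint.

```python
MENU = {
    "getDigits": {
        "timeout": 5,
        "numDigits": 1,
        "finishOnKey": "#",
        "nested": [{"say": {"text": "Press 1 for sales, 2 for support."}}],
    }
}


def handle_callback(fields):
    """Return the next actions for a parsed callback payload."""
    digits = fields.get("Digits")
    if not digits:
        # No input yet (for example, the first request of the call): play the menu.
        return {"actions": [MENU]}
    if digits == "1":
        return {"actions": [{"dial": {"number": "+254722000001"}}]}
    if digits == "2":
        # Hypothetical URL that returns its own VoiceAction response.
        return {"actions": [{"redirect": {"url": "https://ivr.example.com/support", "method": "POST"}}]}
    # Unrecognised key: apologise and offer the menu again.
    return {"actions": [{"say": {"text": "Sorry, that is not a valid option."}}, MENU]}
```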
{
"dial": {
"number": "+254722000001",
"callerId": "+254700000001",
"timeout": 30,
"record": "record-from-answer"
}
}

Use sip instead of number to dial a SIP URI ("sip": "sip:alice@pbx.example.com"). number and sip are mutually exclusive.
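The mutual-exclusion rule is easy to enforce when the verb is built. The `dial` helper below is a hypothetical convenience; it only checks that exactly one of number or sip is supplied and passes any other options through unchanged.

```python
def dial(number=None, sip=None, **options):
    """Build a Dial verb with exactly one target: a phone number or a SIP URI."""
    if (number is None) == (sip is None):
        raise ValueError("provide exactly one of number or sip")
    target = {"number": number} if number is not None else {"sip": sip}
    return {"dial": {**target, **options}}
```

For example, `dial(sip="sip:alice@pbx.example.com", timeout=30)` yields a verb with a sip key and no number key.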
{
"conference": {
"name": "weekly-team-call",
"maxParticipants": 10,
"record": true,
"beep": true,
"waitUrl": "https://cdn.example.com/hold.mp3",
"statusEventsCallbackUrl": "https://ivr.example.com/conference-events",
"statusEvents": "start end join leave"
}
}

{
"record": {
"action": "https://ivr.example.com/recording-done",
"method": "POST",
"timeout": 5,
"maxLength": 120,
"finishOnKey": "#",
"transcribe": false
}
}

{
"redirect": {
"url": "https://ivr.example.com/after-hours",
"method": "POST"
}
}

Transfers call flow to a new URL that returns its own VoiceAction response. Use this to implement menus without a single monolithic handler.
{
"reject": { "reason": "busy" }
}

reason is "rejected" (default) or "busy". Use busy to simulate a busy signal rather than a hard rejection.
{ "hangup": {} }

Ends the call. No parameters required.
Stream forks the live call's audio to a WebSocket endpoint as raw binary PCM frames, so an external service (a transcriber, or an LLM voice agent such as Google Gemini Live) can process the audio in real time. In bidirectional mode, your WebSocket server plays audio back into the call by sending binary PCM frames on the same socket, which makes full-duplex AI voice agents possible.
XML form (forward this via the raw-XML response today; see the note below):
<Response>
<Stream
name="gemini-agent"
url="wss://your-service.example.com/audio"
track="both_tracks"
outputSamplingRate="16000"
statusEvents="stream-started stream-stopped stream-error" />
</Response>

JSON form:
{
"stream": {
"name": "gemini-agent",
"url": "wss://your-service.example.com/audio",
"track": "both_tracks",
"outputSamplingRate": 16000,
"statusEvents": "stream-started stream-stopped stream-error"
}
}

| Attribute | Required | Purpose |
|---|---|---|
| url | yes | The ws:// or wss:// endpoint that receives the audio frames |
| track | yes | inbound_track, outbound_track, or both_tracks (mono mix) |
| outputSamplingRate | yes | PCM sample rate sent to the socket and expected back: 8000 (PSTN) or 16000 Hz |
| inputSamplingRate | no | Informational hint for the source rate (e.g. 8000, 16000, 48000) |
| name | no | Friendly identifier echoed back in stream status events |
| headerMetadata | no | Flat JSON object of string key/value pairs sent as WebSocket handshake headers (auth tokens, tenant IDs) |
| openMetadata | no | Opaque UTF-8 data delivered in the first WebSocket text frame |
| statusCallback | no | URL Sautikit POSTs fire-and-forget stream status notifications to |
| statusEvents | no | Space-separated subset of stream-started, stream-stopped, stream-error |
Your WebSocket server must accept the audio.drachtio.org subprotocol. Stream status events carry a callSessionState of StreamStarted, StreamStopped, or StreamError, plus a streamSid and the negotiated sampling rates.
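The attribute constraints listed for Stream can be checked before a response is sent. The function below is a sketch that mirrors only the rules stated in the attribute table; `validate_stream` is a hypothetical helper and does not reproduce whatever validation the platform or the PBX performs.

```python
from urllib.parse import urlparse

TRACKS = {"inbound_track", "outbound_track", "both_tracks"}
OUTPUT_RATES = {8000, 16000}
STATUS_EVENTS = {"stream-started", "stream-stopped", "stream-error"}


def validate_stream(stream):
    """Return a list of problems found in a stream verb body (empty if none)."""
    problems = []
    if urlparse(stream.get("url") or "").scheme not in ("ws", "wss"):
        problems.append("url must be a ws:// or wss:// endpoint")
    if stream.get("track") not in TRACKS:
        problems.append("track must be one of " + ", ".join(sorted(TRACKS)))
    if stream.get("outputSamplingRate") not in OUTPUT_RATES:
        problems.append("outputSamplingRate must be 8000 or 16000")
    unknown = set(stream.get("statusEvents", "").split()) - STATUS_EVENTS
    if unknown:
        problems.append("unknown statusEvents: " + " ".join(sorted(unknown)))
    metadata = stream.get("headerMetadata", {})
    if not isinstance(metadata, dict) or not all(
        isinstance(k, str) and isinstance(v, str) for k, v in metadata.items()
    ):
        problems.append("headerMetadata must be a flat object of string values")
    return problems
```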
If you are moving from a PBX XML dialect, the following table shows the equivalent JSON key for each common XML verb:
| XML verb | Sautikit VoiceAction | Notes |
|---|---|---|
| <Say> | say | Identical semantics |
| <Play> | play | Identical semantics |
| <Gather> / <GetDigits> | getDigits | Collects DTMF; uses numDigits and finishOnKey |
| <Dial> | dial | Supports number or sip key |
| <Conference> | conference | Top-level verb in Sautikit (not nested inside dial) |
| <Record> | record | Identical semantics |
| <Redirect> | redirect | Identical semantics |
| <Reject> | reject | Identical semantics |
| <Hangup> | hangup | Identical semantics |
| <Stream> | stream | Real-time media fork to a WebSocket; available via the XML form today |
For hold queues, use a conference room with startOnEnter: false rather than a dedicated queue verb.
If your voice_callback_url returns a non-2xx status or does not respond within 10 seconds, the platform hangs up the call. Always respond quickly; do any heavy work asynchronously after returning the initial action set.
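One common way to stay inside the response deadline is to hand slow work to a background worker and return the action set straight away. The sketch below uses a standard-library queue and thread; `handle_call`, `slow_task`, and the spoken text are hypothetical, and a production service would more likely use its own job system.

```python
import queue
import threading

jobs = queue.Queue()


def worker():
    """Run queued tasks off the request path so callbacks return quickly."""
    while True:
        task = jobs.get()
        try:
            task()
        except Exception as exc:
            # Keep the worker alive if a single task fails.
            print(f"background task failed: {exc}")
        finally:
            jobs.task_done()


threading.Thread(target=worker, daemon=True).start()


def handle_call(call_state, slow_task):
    """Queue the slow work, then answer immediately with the action set."""
    jobs.put(lambda: slow_task(call_state))
    return {
        "actions": [
            {"say": {"text": "Thank you. We are processing your request."}},
            {"hangup": {}},
        ]
    }
```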
If the platform cannot parse your JSON (unknown verb, too many actions, malformed body), it logs the error and hangs up.
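Since a parse failure ends the call, it can help to check a response locally before returning it. The sketch below covers only the three failure causes named above and is not Sautikit's validator. The verb set is taken from the list of verbs this page says an actions array may contain, the one-verb-per-action rule is inferred from the examples, and `max_actions` must be supplied by the caller because the actual limit is not stated here.

```python
import json

# Verb keys this page lists as valid entries in the actions array.
KNOWN_VERBS = {"say", "play", "getDigits", "dial", "conference",
               "record", "redirect", "reject", "hangup"}


def check_response(response, max_actions):
    """Raise ValueError if the response would likely be rejected."""
    try:
        json.dumps(response)
    except TypeError as exc:
        raise ValueError(f"response is not JSON-serialisable: {exc}") from exc
    actions = response.get("actions") if isinstance(response, dict) else None
    if not isinstance(actions, list):
        raise ValueError("response must contain an actions array")
    if len(actions) > max_actions:
        raise ValueError(f"{len(actions)} actions exceeds the limit of {max_actions}")
    for i, action in enumerate(actions):
        # Assumption: each action is an object with exactly one verb key.
        if not isinstance(action, dict) or len(action) != 1:
            raise ValueError(f"action {i} must be an object with exactly one verb key")
        (verb,) = action
        if verb not in KNOWN_VERBS:
            raise ValueError(f"action {i} uses unknown verb {verb!r}")
```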