AI agent engineering

Build voice, vision, and multimodal agents on one real-time runtime.

Start from a live experience, then own the pipeline: room or SIP entry, classic or realtime speech, tools and knowledge, observability, and human handoff.

Try a live agent | Connect phone calls

BYOK AI providers | Voice + vision buffers | Tools, traces, and handoff

Choose your operating surface

Use the shortest path that still gives your team enough control.

Start with a ready-made surface, and move into SDK control only when the product requires it.

Website widgets: Add low-configuration meeting, calling, or AI embeds. (Open Widgets Guide)
Phone and SIP setup: Configure providers, PSTN routing, numbers, and call handling. (Open Telephony Guide)
Phone Agents Studio: Configure and test a telephone agent before building custom code. (Open Phone Agents Studio)

Prove it in 30 seconds

Talk to a live agent first, then build your own pipeline.

Try a live agent | Call +1 785 369 1724 (US)

$0.002 per agent minute. Infrastructure only: your STT, LLM, and TTS providers stay direct (BYOK).

Compare agent costs | Best Vapi alternatives

Choose a build goal

Headless agent quickstart

Use MediaSFU as the room and socket engine behind your custom agent UI.

The source-backed flow is consistent across the repo: protect room creation and join credentials with a backend proxy, mount the SDK in headless mode, then wait for the SDK to expose the room socket before you start voice or multimodal buffers.

Backend proxy first | Headless room attach | Voice + vision buffers

Step 01: Proxy room create and join. Keep production keys on your backend. The repo docs recommend a backend proxy or localLink-based routing so public clients never ship real MediaSFU secrets.

Step 02: Mount MediaSFU with no visible UI. Render the SDK in headless mode and feed room state through sourceParameters plus updateSourceParameters. This is the same pattern used by the headless handler and widget blocks.

Step 03: Wait for room attach. Do not start the agent from the REST create or join response. Wait until MediaSFU pushes roomName and socket or localSocket into SDK state after signaling attaches.

Step 04: Start voice or multimodal buffers. After the room is attached, use the exposed socket to emit startDataBuffer, react to startBuffers, and listen for pipelineResult or pipelineResultVision instead of making text sessions your default entry point.
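Step 01 can be sketched on the backend as a small request builder that strips any credential-like fields from the client payload and attaches the real keys server-side. This is a sketch under stated assumptions, not the repo's implementation: `buildUpstreamRequest`, `ProxyCredentials`, and `UPSTREAM_URL` are hypothetical names, and both the upstream URL and the `Bearer user:key` header format are assumptions to verify against the MediaSFU REST documentation.

```typescript
// Hypothetical helper for the backend route behind /api/mediasfu/rooms.
// ASSUMPTION: the upstream URL and the Authorization header format below are
// placeholders; confirm both against the MediaSFU REST documentation.
interface ProxyCredentials {
  apiUserName: string;
  apiKey: string;
}

interface UpstreamRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

const UPSTREAM_URL = "https://mediasfu.com/v1/rooms/"; // assumed endpoint

function buildUpstreamRequest(
  payload: Record<string, unknown>,
  creds: ProxyCredentials
): UpstreamRequest {
  // Copy the client payload and drop anything that looks like a credential,
  // so a browser can never override the server-side keys.
  const safePayload: Record<string, unknown> = { ...payload };
  delete safePayload["apiUserName"];
  delete safePayload["apiKey"];

  return {
    url: UPSTREAM_URL,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        // Real keys are attached here, on the server only.
        Authorization: `Bearer ${creds.apiUserName}:${creds.apiKey}`,
      },
      body: JSON.stringify(safePayload),
    },
  };
}
```

In a route handler you would read the credentials from server-side configuration, call `fetch(request.url, request.init)`, and relay the JSON response to the browser.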

Headless MediaSFU attach pattern

import React, { useEffect, useMemo, useState } from "react";
import {
  MediasfuGeneric,
  PreJoinPage,
  type CreateJoinRoomType,
  type CreateMediaSFURoomOptions,
  type JoinMediaSFURoomOptions,
} from "mediasfu-reactjs";

// /api/mediasfu/rooms injects the real apiUserName/apiKey on the server.
const createOrJoinViaProxy: CreateJoinRoomType = async ({ payload }) => {
  const response = await fetch("/api/mediasfu/rooms", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  });

  const data = await response.json();
  return { success: response.ok, data };
};

export function HeadlessAgentRoom() {
  // MediaSFU pushes room state (roomName, socket, member, ...) into this object.
  const [sourceParameters, setSourceParameters] = useState<Record<string, any>>({});
  const [bufferStarted, setBufferStarted] = useState(false);

  const noUIOptions = useMemo<CreateMediaSFURoomOptions | JoinMediaSFURoomOptions>(
    () => ({
      action: "create",
      duration: 15,
      capacity: 5,
      userName: "agent-user",
      eventType: "conference",
      dataBuffer: true,
      bufferType: "all",
    }),
    []
  );

  const bufferConfig = useMemo(
    () => ({
      audio: {
        format: "wav",
        channels: 1,
        sampleRate: 16000,
        pipeline: ["stt", "ttllm", "tts", "return"],
        sttNickName: "support-stt",
        llmNickName: "support-llm",
        ttsNickName: "support-tts",
        returnAudioFormat: "base64",
      },
      vision: {
        fps: 1.0,
        pipeline: ["visionllm", "tts", "return"],
        llmNickName: "support-vision",
        ttsNickName: "support-tts",
        returnAudioFormat: "base64",
      },
    }),
    []
  );

  // Prefer localSocket when it is connected; otherwise fall back to socket.
  const socket = sourceParameters.localSocket?.id
    ? sourceParameters.localSocket
    : sourceParameters.socket;

  // Keep the buffer listeners attached for as long as the room is attached.
  useEffect(() => {
    if (!socket?.id || !sourceParameters.roomName) return;

    const onStartBuffers = () => {
      socket.emit(
        "startBuffer",
        {
          roomName: sourceParameters.roomName,
          member: sourceParameters.member || "agent-user",
        },
        (ack: { success?: boolean; reason?: string }) => {
          if (!ack?.success) {
            console.error("Buffer attach failed", ack?.reason);
          }
        }
      );
    };

    const onPipelineResult = (data: any) => {
      console.log("voice pipeline", data);
    };

    const onPipelineResultVision = (data: any) => {
      console.log("vision pipeline", data);
    };

    socket.on("startBuffers", onStartBuffers);
    socket.on("pipelineResult", onPipelineResult);
    socket.on("pipelineResultVision", onPipelineResultVision);

    return () => {
      socket.off("startBuffers", onStartBuffers);
      socket.off("pipelineResult", onPipelineResult);
      socket.off("pipelineResultVision", onPipelineResultVision);
    };
  }, [socket, sourceParameters.member, sourceParameters.roomName]);

  // Start the buffer session once. This runs in its own effect so that flipping
  // bufferStarted does not trigger the cleanup that detaches the listeners above.
  useEffect(() => {
    if (bufferStarted || !socket?.id || !sourceParameters.roomName) return;

    socket.emit(
      "startDataBuffer",
      {
        roomName: sourceParameters.roomName,
        config: bufferConfig,
      },
      (ack: { success?: boolean; reason?: string }) => {
        if (!ack?.success) {
          console.error("Buffer session start failed", ack?.reason);
          return;
        }

        setBufferStarted(true);
      }
    );
  }, [bufferConfig, bufferStarted, socket, sourceParameters.roomName]);

  // Render the SDK off-screen: returnUI={false} keeps it headless.
  return (
    <div style={{ width: 0, height: 0, overflow: "hidden" }}>
      <MediasfuGeneric
        PrejoinPage={(options: any) => <PreJoinPage {...options} />}
        credentials={{ apiUserName: "dummy-user", apiKey: "dummy-key" }}
        returnUI={false}
        connectMediaSFU={true}
        noUIPreJoinOptions={noUIOptions}
        sourceParameters={sourceParameters}
        updateSourceParameters={setSourceParameters}
        createMediaSFURoom={createOrJoinViaProxy}
        joinMediaSFURoom={createOrJoinViaProxy}
      />
    </div>
  );
}

Full SDK headless guide

Agent architecture overview

Use this guide to design production MediaSFU agents across audio, vision, and multimodal workflows. Start with the runtime shape, then connect providers, buffers, tools, and operational controls. You will learn how to:

Configure AI Credentials for Voice and Vision services.
Build classic STT/LLM/TTS pipelines, realtime speech-to-speech runtimes, vision, and custom processing steps.
Manage data buffers for real-time audio and video frames.
Handle errors effectively and return results to the client.

By the end of this guide, you'll have a comprehensive understanding of how to integrate speech recognition, text generation, speech synthesis, direct speech-to-speech, and image analysis into your MediaSFU applications.

Note: Dashboard-configured AI credentials take precedence over ephemeral parameters for the same keys (unless the dashboard field is empty). Use ephemeral parameters for additional fields not already set on the dashboard.
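The precedence rule in the note above can be pictured as a field-by-field merge. The following is a minimal sketch of that rule only, not MediaSFU's actual resolution code: `resolveCredentialFields` and the field names in the usage note are hypothetical.

```typescript
// Illustrative merge of dashboard and ephemeral credential fields.
// ASSUMPTION: fields are flat string values; an empty or whitespace-only
// dashboard field counts as "empty" and lets the ephemeral value through.
type CredentialFields = Record<string, string | undefined>;

function hasValue(value: string | undefined): value is string {
  return value !== undefined && value.trim() !== "";
}

function resolveCredentialFields(
  dashboard: CredentialFields,
  ephemeral: CredentialFields
): Record<string, string> {
  const resolved: Record<string, string> = {};

  // Start from ephemeral values so they fill any field the dashboard lacks.
  for (const [key, value] of Object.entries(ephemeral)) {
    if (hasValue(value)) resolved[key] = value;
  }

  // Non-empty dashboard values win for the same key.
  for (const [key, value] of Object.entries(dashboard)) {
    if (hasValue(value)) resolved[key] = value;
  }

  return resolved;
}
```

For example, with a dashboard of `{ model: "dash-model", voice: "" }` and ephemeral parameters of `{ model: "temp-model", voice: "temp-voice", language: "en" }`, the dashboard `model` wins, while `voice` and `language` come from the ephemeral parameters.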

What the newer Media runtime makes explicit

The raw pipeline array is only one layer of the system. The production path also includes runtime selection, context assembly, observability, and escalation design.

Runtime Surface: Route into the real Media runtime. The pipeline does not start at the first STT token. Room attach, widget or SIP entry, speechEngine, and runtime overrides decide whether the turn uses classic stages or realtime speech-to-speech.

Context Assembly: Load tools and approved knowledge first. Useful turns are transcript plus persona, policy, retrieval, and callable tools. MCP integrations belong in the response path before the model answer is finalized.

Observability: Trace what happened on each turn. Capture transcript, latency, tool use, quality checkpoints, and summaries so you can explain why the agent responded the way it did and whether it met your SLA.

Fallback Design: Keep a human handoff path ready. Strong agent flows define escalation triggers, operator-ready summaries, and takeover paths for cases where confidence, policy, or customer intent requires a person.
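One way to make the observability and fallback layers concrete is a per-turn trace record plus a small escalation check over it. This is a sketch only: `TurnTrace`, `metLatencySla`, and `escalationReason` are hypothetical names, none of them come from the MediaSFU SDK, and the confidence score is assumed to come from your own evaluation step.

```typescript
// Illustrative shapes only; these are not MediaSFU SDK types.
interface TurnTrace {
  transcript: string;
  latencyMs: number;
  toolsCalled: string[];
  confidence: number; // assumed 0..1 score from your own evaluation step
  policyFlagged: boolean; // assumed output of your own policy check
  userAskedForHuman: boolean;
}

type EscalationReason = "customer_intent" | "policy" | "low_confidence";

// Observability: did this turn stay within the latency budget?
function metLatencySla(trace: TurnTrace, maxLatencyMs: number): boolean {
  return trace.latencyMs <= maxLatencyMs;
}

// Fallback design: return why a person should take over, or null to continue.
function escalationReason(
  trace: TurnTrace,
  minConfidence: number
): EscalationReason | null {
  if (trace.userAskedForHuman) return "customer_intent";
  if (trace.policyFlagged) return "policy";
  if (trace.confidence < minConfidence) return "low_confidence";
  return null;
}
```

Checking explicit customer intent first means a direct request for a person is never masked by a high confidence score.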

Choose the speech runtime before picking providers

Classic Pipeline: STT -> LLM -> TTS, written as ["stt", "ttllm", "tts", "return"]

Best when you want separate provider choice, transcripts as the main intermediate artifact, custom translation, or per-stage overrides.

Realtime Speech-to-Speech: Native spoken-turn model, written as ["realtime", "return"]

Best when one compatible realtime credential should handle listening, reasoning, and speaking with lower turn friction.
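The choice between the two runtimes can be kept in one place in application code. In this sketch the two pipeline arrays are taken from the text above, while `SpeechEngine`, its `"classic"` and `"realtime"` values, and `audioPipelineFor` are hypothetical names used for illustration; check the SDK for the actual speechEngine values.

```typescript
// ASSUMPTION: "classic" and "realtime" are illustrative labels, not confirmed
// speechEngine values. The pipeline arrays themselves come from the guide.
type SpeechEngine = "classic" | "realtime";

const AUDIO_PIPELINES: Record<SpeechEngine, readonly string[]> = {
  classic: ["stt", "ttllm", "tts", "return"],
  realtime: ["realtime", "return"],
};

// Return a fresh copy so callers can extend the pipeline without
// mutating the shared defaults.
function audioPipelineFor(engine: SpeechEngine): string[] {
  return [...AUDIO_PIPELINES[engine]];
}
```

The returned array can be passed as the `pipeline` value of the audio buffer config when starting the data buffer.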

A production turn is more than STT to LLM to TTS

01. Entry point attaches. A widget, SIP route, or headless room becomes live and exposes the socket and runtime state that will drive the buffers.
02. Turn detection packages input. Voice activity, silence windows, or frame cadence decide when MediaSFU has enough audio or vision data to assemble a turn.
03. Context is assembled. Classic runs assemble transcript, prompts, provider settings, approved knowledge, and callable tools. Realtime runs still need the same policy and context, even when audio stays speech-native.
04. The model answers or chooses an action. The agent can respond directly, call a tool, request clarification, or branch into an escalation and handoff path.
05. Output and audit artifacts are emitted. TTS playback or realtime audio, structured results, latency traces, summaries, and handoff context are returned to the client or operator surface.
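To illustrate the silence-window idea in step 02, here is a generic sketch of how a turn boundary can be found from per-frame voice activity. It is not MediaSFU's turn detector: `AudioFrame`, `findTurnEnd`, and the idea of a precomputed `isSpeech` flag are assumptions made for the example.

```typescript
// Generic silence-window turn detection, for illustration only.
// ASSUMPTION: each frame already carries a voice-activity decision.
interface AudioFrame {
  timestampMs: number;
  isSpeech: boolean;
}

// Returns the timestamp at which the turn is considered finished, or null
// if no speech has been followed by a long enough stretch of silence.
function findTurnEnd(frames: AudioFrame[], silenceWindowMs: number): number | null {
  let lastSpeechMs: number | null = null;

  for (const frame of frames) {
    if (frame.isSpeech) {
      lastSpeechMs = frame.timestampMs;
      continue;
    }
    // Silence only closes a turn once some speech has been heard.
    if (lastSpeechMs !== null && frame.timestampMs - lastSpeechMs >= silenceWindowMs) {
      return frame.timestampMs;
    }
  }

  return null;
}
```

A shorter window makes the agent respond faster but risks cutting off speakers who pause mid-sentence; a longer window does the opposite.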

Building custom apps? Start from these GitHub repos:

VoIP custom apps (Telephony)
Agents custom apps (Web + Vision/Multimodal)