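Building Voice Agents

Audio handling

Some transport layers, such as the default OpenAIRealtimeWebRTC, handle audio input and output for you automatically. For other transport mechanisms, such as OpenAIRealtimeWebSocket, you have to handle session audio yourself: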

import {
  RealtimeAgent,
  RealtimeSession,
  TransportLayerAudio,
} from '@openai/agents/realtime';

const agent = new RealtimeAgent({ name: 'My agent' });
const session = new RealtimeSession(agent);
const newlyRecordedAudio = new ArrayBuffer(0);

session.on('audio', (event: TransportLayerAudio) => {
  // play your audio
});

// send new audio to the agent
session.sendAudio(newlyRecordedAudio);
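
When you handle audio yourself, captured microphone samples usually have to be converted to the session's audio format before they are passed to sendAudio, and received audio has to be converted back for playback. The following is a minimal sketch, assuming the session uses 16-bit little-endian PCM (the 'pcm16' format) and that your capture pipeline produces Float32Array samples in the range [-1, 1], as the Web Audio API does. The helper names float32ToPcm16 and pcm16ToFloat32 are hypothetical and not part of the SDK.

```typescript
// Hypothetical helpers (not part of the SDK) for converting between
// Web Audio style Float32 samples and 16-bit little-endian PCM.

function float32ToPcm16(samples: Float32Array): ArrayBuffer {
  const buffer = new ArrayBuffer(samples.length * 2);
  const view = new DataView(buffer);
  for (let i = 0; i < samples.length; i++) {
    // Clamp to [-1, 1] so out-of-range samples do not wrap around.
    const s = Math.max(-1, Math.min(1, samples[i]));
    // Scale the negative and positive halves separately to cover the int16 range.
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer;
}

function pcm16ToFloat32(pcm: ArrayBuffer): Float32Array {
  const view = new DataView(pcm);
  const out = new Float32Array(pcm.byteLength >> 1);
  for (let i = 0; i < out.length; i++) {
    const v = view.getInt16(i * 2, true);
    out[i] = v < 0 ? v / 0x8000 : v / 0x7fff;
  }
  return out;
}
```

With helpers like these, a captured chunk could be sent with session.sendAudio(float32ToPcm16(micSamples)), provided the sample rate of your capture matches what the session expects.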

Session configuration

You can configure your session by passing additional options either to the RealtimeSession constructor or to connect(...).

import { RealtimeAgent, RealtimeSession } from '@openai/agents/realtime';

const agent = new RealtimeAgent({
  name: 'Greeter',
  instructions: 'Greet the user with cheer and answer questions.',
});

const session = new RealtimeSession(agent, {
  model: 'gpt-realtime',
  config: {
    inputAudioFormat: 'pcm16',
    outputAudioFormat: 'pcm16',
    inputAudioTranscription: {
      model: 'gpt-4o-mini-transcribe',
    },
  },
});

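These transport layers allow you to pass any parameter that matches the session object.

For parameters that are new and do not yet have a matching field in RealtimeSessionConfig, you can use providerData. Anything passed in providerData is forwarded directly as part of the session object.

Handoffs

As with regular agents, you can use handoffs to split your agent into multiple agents and orchestrate between them, which improves agent performance and scopes each problem more tightly.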

import { RealtimeAgent } from '@openai/agents/realtime';

const mathTutorAgent = new RealtimeAgent({
  name: 'Math Tutor',
  handoffDescription: 'Specialist agent for math questions',
  instructions:
    'You provide help with math problems. Explain your reasoning at each step and include examples',
});

const agent = new RealtimeAgent({
  name: 'Greeter',
  instructions: 'Greet the user with cheer and answer questions.',
  handoffs: [mathTutorAgent],
});

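Handoffs behave slightly differently for Realtime Agents than for regular agents. When a handoff is performed, the ongoing session is updated with the new agent's configuration. As a result, the new agent automatically has access to the ongoing conversation history, and input filters are currently not applied.

This also means that the voice or model cannot be changed as part of a handoff, and you can only hand off to other Realtime Agents. If you need a different model, for example a reasoning model such as gpt-5-mini, you can use delegation through tools.

Tools

Just like regular agents, Realtime Agents can call tools to perform actions. You define a tool with the same tool() function that you would use for a regular agent.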

import { tool, RealtimeAgent } from '@openai/agents/realtime';
import { z } from 'zod';

const getWeather = tool({
  name: 'get_weather',
  description: 'Return the weather for a city.',
  parameters: z.object({ city: z.string() }),
  async execute({ city }) {
    return `The weather in ${city} is sunny.`;
  },
});

const weatherAgent = new RealtimeAgent({
  name: 'Weather assistant',
  instructions: 'Answer weather questions.',
  tools: [getWeather],
});

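Realtime Agents support function tools only, and these tools run in the same place as your Realtime Session. If you run your Realtime Session in the browser, your tools also execute in the browser. If you need to perform more sensitive actions, you can make an HTTP request from within your tool to your backend server.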
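
As a rough sketch of that pattern, the helper below forwards a sensitive action to a backend endpoint and returns a string the agent can relay to the user. The endpoint /api/refunds, the payload shape, and the names issueRefundViaBackend and FetchLike are all hypothetical. The fetch implementation is passed in explicitly so the helper can be exercised without a network.

```typescript
// Hypothetical endpoint and payload shape; adjust both to your backend.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; text(): Promise<string> }>;

async function issueRefundViaBackend(
  orderId: string,
  fetchImpl: FetchLike,
): Promise<string> {
  const response = await fetchImpl('/api/refunds', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ orderId }),
  });
  if (!response.ok) {
    // Return a readable message so the agent can tell the user what happened.
    return `Refund request failed with status ${response.status}.`;
  }
  return response.text();
}
```

Inside a tool's execute function you could then return issueRefundViaBackend(orderId, fetch), which keeps credentials and business logic on the server.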

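While a tool is executing, the agent cannot process new requests from the user. One way to improve the experience is to instruct your agent to announce when it is about to execute a tool, or to say specific phrases that buy it time while the tool runs.

Accessing the conversation history

In addition to the arguments the agent passed to a particular tool, you can access a snapshot of the current conversation history tracked by the Realtime Session. This is useful if you need to perform a more complex action based on the current state of the conversation, or if you plan to use tools for delegation.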

import {
  tool,
  RealtimeContextData,
  RealtimeItem,
} from '@openai/agents/realtime';
import { z } from 'zod';

const parameters = z.object({
  request: z.string(),
});

const refundTool = tool<typeof parameters, RealtimeContextData>({
  name: 'Refund Expert',
  description: 'Evaluate a refund',
  parameters,
  execute: async ({ request }, details) => {
    // The history might not be available
    const history: RealtimeItem[] = details?.context?.history ?? [];
    // making your call to process the refund request
  },
});

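Approval before tool execution

If you define your tool with needsApproval: true, the agent emits a tool_approval_requested event before executing the tool.

By listening to this event, you can show a UI to the user to approve or reject the tool call.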

import { session } from './agent';

session.on('tool_approval_requested', (_context, _agent, request) => {
  // show a UI to the user to approve or reject the tool call
  // you can use the `session.approve(...)` or `session.reject(...)` methods to approve or reject the tool call

  session.approve(request.approvalItem); // or session.reject(request.approvalItem);
});

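Guardrails

Guardrails offer a way to monitor whether what the agent has said violates a set of rules and to cut off the response immediately. These checks run against the transcript of the agent's response, so the model's text output must be enabled (it is enabled by default).

The guardrails you provide run asynchronously as the model response is returned, so you can cut off the response based on a predefined classification trigger, for example "mentions a specific banned word".

When a guardrail trips, the session emits a guardrail_tripped event. The event also provides a details object containing the itemId that triggered the guardrail.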

import { RealtimeOutputGuardrail, RealtimeAgent, RealtimeSession } from '@openai/agents/realtime';

const agent = new RealtimeAgent({
  name: 'Greeter',
  instructions: 'Greet the user with cheer and answer questions.',
});

const guardrails: RealtimeOutputGuardrail[] = [
  {
    name: 'No mention of Dom',
    async execute({ agentOutput }) {
      const domInOutput = agentOutput.includes('Dom');
      return {
        tripwireTriggered: domInOutput,
        outputInfo: { domInOutput },
      };
    },
  },
];

const guardedSession = new RealtimeSession(agent, {
  outputGuardrails: guardrails,
});

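By default, guardrails run every 100 characters or when the response text has finished generating. Because speaking the text aloud normally takes longer than generating it, the guardrail should in most cases catch a violation before the user hears it.

If you want to modify this behavior, you can pass an outputGuardrailSettings object to the session.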

import { RealtimeAgent, RealtimeSession } from '@openai/agents/realtime';

const agent = new RealtimeAgent({
  name: 'Greeter',
  instructions: 'Greet the user with cheer and answer questions.',
});

const guardedSession = new RealtimeSession(agent, {
  outputGuardrails: [
    /*...*/
  ],
  outputGuardrailSettings: {
    debounceTextLength: 500, // run guardrail every 500 characters or set it to -1 to run it only at the end
  },
});
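
To make the timing concrete, the sketch below simulates at which transcript lengths a check would fire for a given debounce length as text streams in. It illustrates the behavior described above and is not the SDK's actual implementation. The function name guardrailCheckpoints and its parameters are hypothetical.

```typescript
// Illustration only: returns the cumulative transcript lengths at which a
// guardrail check would run, given the sizes of the streamed text chunks.
function guardrailCheckpoints(
  chunkLengths: number[],
  debounceTextLength: number = 100,
): number[] {
  const checkpoints: number[] = [];
  let total = 0;
  let lastChecked = 0;
  for (const length of chunkLengths) {
    total += length;
    // A negative debounce length means "only check at the end".
    if (debounceTextLength >= 0 && total - lastChecked >= debounceTextLength) {
      checkpoints.push(total);
      lastChecked = total;
    }
  }
  // Check once more when the response text is complete, unless the last
  // chunk was already covered by a check.
  if (total > lastChecked) checkpoints.push(total);
  return checkpoints;
}
```

Under this model, five chunks of 40 characters with a debounce length of 100 produce checks at 120 characters and again at the end (200 characters), while a debounce length of -1 produces a single check at 200.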

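Turn detection / voice activity detection

The Realtime Session automatically detects when the user is speaking and triggers new turns using the built-in voice activity detection modes of the Realtime API.

You can change the voice activity detection mode by passing a turnDetection object to the session.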

import { RealtimeSession } from '@openai/agents/realtime';
import { agent } from './agent';

const session = new RealtimeSession(agent, {
  model: 'gpt-realtime',
  config: {
    turnDetection: {
      type: 'semantic_vad',
      eagerness: 'medium',
      createResponse: true,
      interruptResponse: true,
    },
  },
});

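Modifying the turn detection settings can help you calibrate unwanted interruptions and handle silence. See the Realtime API documentation for more details on the different settings.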
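
For comparison, here is a sketch of a config fragment for the Realtime API's other mode, server_vad, which detects turns from audio volume and silence rather than semantics. The field names below are assumed to be the camelCase equivalents of the Realtime API's threshold, prefix_padding_ms, and silence_duration_ms settings, so check them against the SDK's types before relying on them. The numeric values are examples, not recommendations.

```typescript
// Config fragment only; field names are assumed camelCase equivalents of the
// Realtime API's server_vad settings and should be checked against the SDK types.
const serverVadConfig = {
  turnDetection: {
    type: 'server_vad',
    threshold: 0.5, // higher values require louder audio to count as speech
    prefixPaddingMs: 300, // audio kept from before the detected speech
    silenceDurationMs: 500, // silence needed before the turn is considered over
  },
};
```

An object like this would be passed as the config option of RealtimeSession in place of the semantic_vad settings.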

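Interruptions

When you use the built-in voice activity detection, speaking over the agent automatically triggers it to detect the interruption and update its context based on what was said. The session also emits an audio_interrupted event, which you can use to stop all audio playback immediately (only applicable to WebSocket connections).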

import { session } from './agent';

session.on('audio_interrupted', () => {
  // handle local playback interruption
});

If you want to interrupt manually, for example to offer a "stop" button in your UI, you can call interrupt() yourself:

import { session } from './agent';

session.interrupt();
// this will still trigger the `audio_interrupted` event for you
// to cut off the audio playback when using WebSockets

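Either way, the Realtime Session interrupts the agent's generation, truncates the agent's knowledge of what was actually said to the user, and updates the history.

If you connect to your agent over WebRTC, the audio output is also cleared for you. If you use WebSocket, you need to handle this yourself by stopping playback of any audio that is already queued.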
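
For the WebSocket case, one possible approach is to keep received audio chunks in a queue that you can drop on interruption. The class below is a minimal sketch of such a queue. PlaybackQueue is a hypothetical name and not part of the SDK, and actually playing or stopping a chunk is left to your audio output layer.

```typescript
// Hypothetical playback queue (not part of the SDK). It only tracks chunks
// that are waiting to be played; the audio output itself is up to you.
class PlaybackQueue {
  private pending: ArrayBuffer[] = [];

  // Add a received audio chunk to the end of the queue.
  enqueue(chunk: ArrayBuffer): void {
    this.pending.push(chunk);
  }

  // Returns the next chunk to play, or undefined when nothing is queued.
  next(): ArrayBuffer | undefined {
    return this.pending.shift();
  }

  // Drops everything that has not been played yet and reports how many
  // chunks were discarded.
  clear(): number {
    const dropped = this.pending.length;
    this.pending = [];
    return dropped;
  }

  // Number of chunks still waiting to be played.
  size(): number {
    return this.pending.length;
  }
}
```

You would typically call enqueue from your 'audio' handler and clear from your 'audio_interrupted' handler, and also stop whichever chunk is currently playing.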

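Text input

If you want to send text input to your agent, you can use the sendMessage method on the RealtimeSession.

This is useful if you want to let your user interact with the agent in both modalities, or to provide additional context to the conversation.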

import { RealtimeSession, RealtimeAgent } from '@openai/agents/realtime';

const agent = new RealtimeAgent({
  name: 'Assistant',
});

const session = new RealtimeSession(agent, {
  model: 'gpt-realtime',
});

session.sendMessage('Hello, how are you?');

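Conversation history management

The RealtimeSession automatically manages the conversation history in its history property.

You can use it to render the history to the user or to perform additional actions on it. Because the history changes constantly over the course of the conversation, you can listen for the history_updated event.

If you want to modify the history, for example to remove a message entirely or update its transcript, you can use the updateHistory method.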

import { RealtimeSession, RealtimeAgent } from '@openai/agents/realtime';

const agent = new RealtimeAgent({
  name: 'Assistant',
});

const session = new RealtimeSession(agent, {
  model: 'gpt-realtime',
});

await session.connect({ apiKey: '<client-api-key>' });

// listening to the history_updated event
session.on('history_updated', (history) => {
  // returns the full history of the session
  console.log(history);
});

// Option 1: explicit setting
session.updateHistory([
  /* specific history */
]);

// Option 2: override based on current state like removing all agent messages
session.updateHistory((currentHistory) => {
  return currentHistory.filter(
    (item) => !(item.type === 'message' && item.role === 'assistant'),
  );
});
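
As an example of rendering the history, the sketch below turns a history array into plain transcript lines. It uses a simplified, hypothetical item shape (HistoryItem) defined locally for illustration. The SDK's RealtimeItem type has more variants and fields, so treat the field names here as assumptions.

```typescript
// Simplified, hypothetical item shape for illustration only; the SDK's
// RealtimeItem type has more variants and fields.
type HistoryContent = {
  type: string;
  text?: string;
  transcript?: string | null;
};

type HistoryItem =
  | {
      type: 'message';
      role: 'user' | 'assistant' | 'system';
      content: HistoryContent[];
    }
  | { type: 'function_call'; name: string };

function toTranscriptLines(history: HistoryItem[]): string[] {
  const lines: string[] = [];
  for (const item of history) {
    // Only messages are rendered; tool calls are skipped.
    if (item.type !== 'message') continue;
    const text = item.content
      .map((part) => part.text ?? part.transcript ?? '')
      .filter((s) => s.length > 0)
      .join(' ');
    // Messages without any text or transcript are skipped as well.
    if (text.length > 0) lines.push(`${item.role}: ${text}`);
  }
  return lines;
}
```

A function like this could be called from a history_updated handler to refresh a transcript view each time the history changes.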

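Limitations

You currently cannot update or change function tool calls after the fact.
Text output in the history requires transcripts and the text modality to be enabled.
Responses that were truncated due to an interruption do not have a transcript.

Delegation through tools

By combining the conversation history with a tool call, you can delegate the conversation to another backend agent to perform a more complex action, and then pass the result back to the user.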

import {
  RealtimeAgent,
  RealtimeContextData,
  tool,
} from '@openai/agents/realtime';
import { handleRefundRequest } from './serverAgent';
import z from 'zod';

const refundSupervisorParameters = z.object({
  request: z.string(),
});

const refundSupervisor = tool<
  typeof refundSupervisorParameters,
  RealtimeContextData
>({
  name: 'escalateToRefundSupervisor',
  description: 'Escalate a refund request to the refund supervisor',
  parameters: refundSupervisorParameters,
  execute: async ({ request }, details) => {
    // This will execute on the server
    return handleRefundRequest(request, details?.context?.history ?? []);
  },
});

const agent = new RealtimeAgent({
  name: 'Customer Support',
  instructions:
    'You are a customer support agent. If you receive any requests for refunds, you need to delegate to your supervisor.',
  tools: [refundSupervisor],
});

The code below runs on the server. This example implements it with Server Actions in Next.js.

// This runs on the server
import 'server-only';

import { Agent, run } from '@openai/agents';
import type { RealtimeItem } from '@openai/agents/realtime';
import z from 'zod';

const agent = new Agent({
  name: 'Refund Expert',
  instructions:
    'You are a refund expert. You are given a request to process a refund and you need to determine if the request is valid.',
  model: 'gpt-5-mini',
  outputType: z.object({
    reasoning: z.string(),
    refundApproved: z.boolean(),
  }),
});

export async function handleRefundRequest(
  request: string,
  history: RealtimeItem[],
) {
  const input = `
The user has requested a refund.

The request is: ${request}

Current conversation history:
${JSON.stringify(history, null, 2)}
`.trim();

  const result = await run(agent, input);

  return JSON.stringify(result.finalOutput, null, 2);
}