
Voice / Realtime context: App behavior during voice conversations

ChatGPT Apps
Level 8, Lesson 4

1. Context: what “voice mode” means for the ChatGPT App

To begin with, it’s important to understand that within the ChatGPT Apps SDK you do not write your own audio client, do not control the microphone, and do not stream audio yourself. The ChatGPT client (web or mobile app) does that.

We assume you already have a basic understanding of the widget, callTool, and GiftGenius from previous modules — here we look at the same elements through the lens of voice mode.

From your perspective as an App developer, everything looks like this:

  • The user speaks into the microphone. The ChatGPT client performs speech recognition and sends text to the model.
  • In the message stream you “see” the same thing as if the user were typing; the messages just arrive faster and sound more conversational.
  • The model replies with text, which the client speaks aloud.
  • At the same time the model can call your tools (callTool), change the widget’s displayMode, update widgetState, and suggest follow‑ups — just like in text mode.

The key difference is that the user may barely look at the screen, or only glance at the phone. Your UI stops being the primary interaction channel and becomes a complement to voice, not the other way around.

Two consequences follow from this:

  • Everything that really matters must be understandable by ear, through GPT’s utterances.
  • The widget should be “glanceable” in the good sense: with a quick glance, the status and key options are immediately visible, without reading fine print.

For our GiftGenius this is an immediate hint: the scenario “I’m driving, pick a gift for my mom” is not just a text chat. It’s a multimodal dialogue where voice leads and the UI backs it up.

2. How a voice scenario differs from a text one

To avoid the trap of “it’s all the same, the user just speaks instead of types,” it’s useful to compare text and voice modes across several axes.

| Aspect | Text mode | Voice mode |
| --- | --- | --- |
| User attention | Looks at the screen, reads, scrolls | May not look at all (hands‑free) |
| Form of requests | More structured; people edit | Conversational, fragments, “uh‑huh”, “let’s do more” |
| Tolerance for pauses | 1–2 seconds of silence is fine | Long silence feels painful |
| Role of UI | Primary carrier of details | Auxiliary, a “scoreboard” with brief visual anchors |
| Input errors | Typos, but the text is visible | Unclear speech, noise, false yes/no |

From this come several important conclusions.

  • You cannot rely on the user “reading the card.” Critical things must be spoken: what you understood, what you intend to do, and what result you obtained.
  • The UI must support a “one‑second glance” scenario. Status, progress, and the main choice — all should be up front in large type. Details are secondary.
  • Fill the pauses. While your MCP server is working on a heavy request, the model should say what’s happening, and the widget should show progress, so it doesn’t feel like the assistant is frozen.

You can think of voice mode as an audiobook with illustrations: you have a voice narrator (GPT) and pictures (the widget). You need to synchronise them so they complement rather than duplicate or contradict each other.

3. The widget’s role in voice mode: from “control panel” to “scoreboard”

In a text scenario, the widget often serves as a full-fledged interface: forms, tables, filter carousels, action buttons. In voice mode its role changes. Guidelines for multimodal interfaces and VUI show that in voice scenarios the UI becomes more of an information scoreboard (glanceable UI): it’s needed for quick checks and confirmations, not for sustained reading.

For GiftGenius this means the following.

When the user goes through a voice wizard, show in the inline widget or fullscreen:

  • Large status: “Step 2 of 3: budget and gift type”.
  • Minimal text but clear labels: “Budget up to $50”, “Prefers a digital gift”.
  • A couple of large CTA buttons, if the voice scenario allows taps: “Change budget”, “Continue”.
  • One simple progress bar or stepper, not ten tiny indicators.

An example of a simple “scoreboard” in an inline widget for a voice scenario (TypeScript + React, heavily simplified):

type VoiceUiMode = "default" | "voiceGlance";

interface GiftStepProps {
  step: number;
  totalSteps: number;
  summary: string; // brief description of what has been collected already
  uiMode: VoiceUiMode;
}

export function GiftVoiceStep(props: GiftStepProps) {
  const fontSize = props.uiMode === "voiceGlance" ? "text-lg" : "text-sm";

  return (
    <div className="rounded-xl border p-3 flex flex-col gap-2">
      <div className={`${fontSize} font-semibold`}>
        Step {props.step} of {props.totalSteps}
      </div>
      <div className={`${fontSize} text-muted-foreground`}>
        {props.summary}
      </div>
    </div>
  );
}

There’s nothing “voice‑specific” per se here, but the idea is clear: when uiMode === "voiceGlance" we make everything larger and simpler. The signal that we’re in voice mode can come from different places: from indirect signs to an explicit flag that the model sets in widgetState or a tool response.
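
As an illustration, here is a minimal sketch of turning such a flag into uiMode; the uiHint field is an assumption made for this example, not part of the Apps SDK:

// Minimal sketch: deriving the widget's uiMode from data the model already controls.
// The "uiHint" field is a hypothetical flag for this example, not part of the Apps SDK.
type UiHint = "voice" | "text";

interface GiftToolOutput {
  summary: string;
  uiHint?: UiHint; // set by the model in the tool response or widgetState
}

export function resolveUiMode(output: GiftToolOutput): "default" | "voiceGlance" {
  return output.uiHint === "voice" ? "voiceGlance" : "default";
}

// Usage in the widget:
// <GiftVoiceStep step={2} totalSteps={3} summary={output.summary} uiMode={resolveUiMode(output)} />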

4. Synchronising modalities: what GPT says and what the App shows

The key voice‑UX principle for Apps is modality synchronisation: the voice and the visual UI should tell the same story, but at different levels of detail.

A common mistake is when a developer makes the model read out loud everything shown in the widget: long gift lists, JSON structures with filters, etc. That turns into torture. The recommendation is: the voice gives a brief summary, and the UI shows details.

Here is a good synchronisation example for GiftGenius:

User: “Pick a gift for my mom, she loves gardening, budget up to $50.”

Model (voice): “I’ve found a few options. The best in my view is a garden tool kit for $45. I’ve also shown two similar options on the screen. Want me to tell you more or go straight to choosing?”

Widget (inline): shows three gift cards with brief descriptions and CTA buttons “Select” / “Show similar”.

A dialogue‑like JSON representation of a step (not a real protocol, just a way of thinking):

{
  "user": "Pick a gift for my mom...",
  "assistant_text": "I’ve found several options...",
  "widget": {
    "displayMode": "inline",
    "state": {
      "view": "gift_list",
      "items": [
        { "id": "g1", "title": "Garden tool set", "price": 45 },
        { "id": "g2", "title": "Gardener’s apron", "price": 30 },
        { "id": "g3", "title": "Flower seed set", "price": 20 }
      ]
    }
  }
}

An important detail: in the system prompt you can explicitly specify how the model should talk about the UI so it doesn’t “read JSON”: “If you show a list of options in the widget, don’t read each one fully. Briefly describe the best option and say the rest are visible on the screen.”
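
If you keep your system prompt in code, such an instruction can live next to your tool definitions as a plain string; the wording below is only an illustration, not a required format:

// Illustrative only: a prompt fragment that tells the model how to narrate the widget.
export const VOICE_NARRATION_RULES = `
When you show a list of options in the widget, do not read every item aloud.
Briefly describe the best option, say how many more are visible on the screen,
and go into detail only if the user asks for it.
Never read raw JSON or identifiers out loud.
`;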

In the future, when you work with the Realtime API and your own voice clients, the principle will remain the same: UI and audio must be aligned. You’ll just have direct control over streaming there.

5. Realtime and latency: how to avoid awkward silence

Technically, tool_calls in voice mode are the same as in text: the model decides to call your tool, you return a response, the widget updates. But in voice, a new UX problem appears — latency. While your MCP server hits external APIs or computes a heavy report, the user hears… nothing. And that feels much worse than waiting for text in a chat.

There are two layers of protection: voice and visual.

  • On the voice layer, your system prompt should allow (and encourage) the model to say “I’m working” and ask additional questions while the tool is still computing. For example: “I’m going to pick some gifts; it’ll take a few seconds. In the meantime, tell me if there are any other constraints.”
  • On the visual layer, your widget should very clearly show progress: a loader, a status like “Looking for options…”, current step. Without this the user will assume it’s frozen and start talking again, breaking the voice flow.

In practice this is convenient to handle via a deferred job: the tool immediately returns a "pending" status and a jobId, and the selection runs in the background. The widget shows progress based on "pending", and the voice says it’s “working”.

The simplest scheme for a server‑side tool that returns a “stub” with a job id instead of blocking until the full result might look like this:

// Pseudocode for a server-side GiftGenius tool
export async function startGiftSearch(params: SearchParams) {
  const jobId = await createBackgroundJob(params); // put a task into the queue

  return {
    status: "pending",
    jobId,
    message: "Gift search started"
  };
}

When the widget sees status: "pending", it can switch the UI into a progress mode:

if (toolOutput.status === "pending") {
  return (
    <div className="p-4 rounded-xl border flex items-center gap-3">
      <Spinner />
      <div className="text-base">
        Finding gifts… This will take a few seconds.
      </div>
    </div>
  );
}

And in response to the same tool output, the model, following your instructions, will say roughly the same thing out loud and may ask a clarifying follow‑up question. Later, when the background job finishes and, say, an MCP notification like job.completed arrives, the widget updates to show the list of gifts and the voice reads out a short summary.
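
As an illustration, the widget side could poll for the result with a hook like the one below; the gift.checkSearch tool name, the exact callTool invocation, and the 2‑second interval are assumptions of this sketch:

// A sketch of the widget side, assuming a hypothetical "gift.checkSearch" tool that
// returns { status, items } for a given jobId. Polling every 2 seconds is a
// simplification; a push-style update would work just as well.
import { useEffect, useState } from "react";

interface GiftItem {
  id: string;
  title: string;
  price: number;
}

export function useGiftSearchResult(jobId: string): GiftItem[] | null {
  const [items, setItems] = useState<GiftItem[] | null>(null);

  useEffect(() => {
    const bridge = (window as any).openai; // widget bridge, typed loosely for the sketch
    const timer = setInterval(async () => {
      // Each tick asks the server whether the background job has finished.
      const result = await bridge.callTool("gift.checkSearch", { jobId });
      if (result?.status === "completed") {
        setItems(result.items as GiftItem[]);
        clearInterval(timer);
      }
    }, 2000);
    return () => clearInterval(timer);
  }, [jobId]);

  return items; // null while the job is still "pending"
}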

This gives us behaviour as close to realtime as possible, even if the backend is not instantaneous.

6. Security and confirmations in voice

Voice interfaces are tricky when it comes to critical actions: payments, data deletion, settings changes. Speech recognition isn’t perfect, users talk “on the go,” and “uh‑huh” can easily turn into “yes, buy it.” That’s why confirmation flows are especially important for voice scenarios.

There are two basic patterns.

  • Explicit Voice Confirmation. For dangerous actions you require a specific phrase. For example: “To confirm the purchase, say: ‘I confirm the purchase’” — and in the system prompt you prohibit executing payment on vague “uh‑huh”, “okay”, “let’s do it” (a tool‑side sketch of this pattern follows the list).
  • Visual Confirmation Only. The model guides the user to the action by voice (“I’ve prepared the order; the total and cart contents are shown on the screen”), but the actual trigger is tapping the “Pay” button in the widget. This is especially relevant in commerce scenarios, and we’ll come back to it in module 14.

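For the first pattern, a minimal tool‑side sketch might look like this; the tool name, the input shape, and the exact phrase check are assumptions for illustration:

// A sketch of the "Explicit Voice Confirmation" pattern on the tool side.
// The tool name, input shape, and exact phrase check are assumptions for illustration.
interface ConfirmPurchaseInput {
  orderId: string;
  spokenConfirmation: string; // the confirmation phrase as the model heard it
}

export function confirmPurchase(input: ConfirmPurchaseInput) {
  const normalized = input.spokenConfirmation.trim().toLowerCase();

  if (normalized !== "i confirm the purchase") {
    // Vague agreement ("uh-huh", "okay") never counts as a valid phrase here.
    return {
      status: "rejected" as const,
      message: "Ask the user to say the exact confirmation phrase.",
    };
  }

  return { status: "confirmed" as const, orderId: input.orderId };
}
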
For GiftGenius this might look like this.

Model: “I picked a great gardening set for $45. I can place the order through ChatGPT. The total price and shipping address are shown on the screen. To confirm by voice, say ‘I confirm the purchase’, or tap the ‘Pay’ button on the screen.”

Widget (fullscreen): shows the final order, highlights the amount and address in bold, and two prominent buttons: “Pay” and “Cancel”.

Inside the widget you can reflect the confirmation status:

type CheckoutPhase = "review" | "waiting_voice_confirm" | "confirmed";

interface CheckoutState {
  phase: CheckoutPhase;
}

// "state" here is the widget's current CheckoutState
if (state.phase === "waiting_voice_confirm") {
  return (
    <div className="space-y-3">
      <h2 className="text-xl font-semibold">Almost done</h2>
      <p className="text-base">
        Confirm the purchase by saying
        “I confirm the purchase” or tap the “Pay” button.
      </p>
      <Button variant="primary">Pay</Button>
      <Button variant="ghost">Cancel</Button>
    </div>
  );
}

Thus, if the model misinterprets something in the voice channel, the user still has a visual “safety” layer.

7. Simple voice commands and tool design

A voice user won’t phrase commands exactly like your tool’s parameters. They’ll say “pick the first one”, “show cheaper”, “let’s skip tech”. Your job is to design tools and the system prompt so that the model can easily map such phrases to your tool calls (callTool).

For GiftGenius you can include actions like these:

  • Select one of the shown options by index or id.
  • Adjust the budget: “cheaper”, “up to 30 dollars”.
  • Filter by type: “digital gifts only”, “nothing that needs postal delivery”.

This is convenient to express via a tool with a simple enum parameter action and additional fields:

// Pseudo-schema of a tool in TypeScript
type VoiceActionInput =
  | { action: "select_item"; itemId: string }
  | { action: "refine_budget"; maxPrice: number }
  | { action: "filter_type"; type: "digital" | "physical" };

export function handleVoiceAction(input: VoiceActionInput) {
  switch (input.action) {
    case "select_item":
      // mark the gift as selected
      break;
    case "refine_budget":
      // recompute the selection for the new budget
      break;
    case "filter_type":
      // filter the existing list
      break;
  }
}

In the system prompt you describe how these actions map to voice commands: “If the user says ‘pick the first option’, call the gift.voiceAction tool with action="select_item" and the id of the first gift on screen,” etc.

From a UX standpoint this reduces cognitive load: users don’t need exact formulations like “Adjust the filters so there are only digital gifts under $30.” They speak naturally, and the model translates that into a data structure.

8. GiftGenius voice scenario: three steps

Let’s put it all together and design a complete voice scenario for GiftGenius without going deep into the low‑level Realtime API.

Imagine a user: they’re driving and start ChatGPT’s voice mode. They say: “Please pick a gift for my mom; she loves gardening; budget up to $50.”

Step 1. Collect information by voice

Model: “Great, let’s pick a gift. I’ll clarify a couple of things: when do you need the gift — in the next few days or later? Any constraints, like nothing heavy or bulky?”

Widget (inline): for now just a small panel with the status “Picking a gift for: mom, gardening, up to $50.” Fonts are a bit larger than usual so they can be read at a glance.

Widget state code might look like this:

import { useState } from "react";

interface GiftSessionState {
  mode: "voice" | "text";
  step: 1 | 2 | 3;
  recipientSummary: string;
  budget?: number;
}

const [state, setState] = useState<GiftSessionState>({
  mode: "voice",
  step: 1,
  recipientSummary: "Mom, likes gardening"
});

The server side updates recipientSummary and budget as the user answers, and the widget reacts.
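
A sketch of what that update step might look like on the server side; the shape of the voice answer here is an assumption for the example:

// A server-side sketch: merge a new voice answer into the session state the widget
// renders. GiftSessionState mirrors the interface from the widget snippet above;
// the shape of VoiceAnswer is an assumption made for this example.
interface GiftSessionState {
  mode: "voice" | "text";
  step: 1 | 2 | 3;
  recipientSummary: string;
  budget?: number;
}

interface VoiceAnswer {
  budget?: number;    // e.g. "up to 50 dollars"
  extraInfo?: string; // e.g. "prefers digital gifts"
}

export function applyVoiceAnswer(
  current: GiftSessionState,
  answer: VoiceAnswer
): GiftSessionState {
  const nextStep = current.step < 3 ? ((current.step + 1) as 1 | 2 | 3) : current.step;

  return {
    ...current,
    step: nextStep,
    budget: answer.budget ?? current.budget,
    recipientSummary: answer.extraInfo
      ? `${current.recipientSummary}, ${answer.extraInfo}`
      : current.recipientSummary,
  };
}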

Step 2. Search and waiting

After the model has collected enough information, it calls your gift search tool. The tool can launch a background job if the selection is complex and return status: "pending". While the background job runs, the model says: “I’m going to look for suitable options; this will take a few seconds. In the meantime, tell me whether she prefers physical gifts or if digital certificates are fine.”

The widget switches into a PiP‑like mode if the user navigates elsewhere, or stays inline with progress: “Looking for gifts…” and a small indicator.

Step 3. Results and selection

When results are ready, the model: “I found three options. First — a garden tool kit for $45. Second — a gardener’s apron for $30. I’ve shown them on the screen. Say ‘pick the first’ or ‘show something cheaper’.”

The widget shows three large cards with prices and short descriptions. Each card has a “Select” and a “Similar” CTA. Plus a separate “Show more options” button.

If the user says: “Pick the second,” the model calls your voiceAction tool with action="select_item" and the id of the second gift. The widget highlights it as selected, and the model says: “Great, we’ve selected the gardener’s apron for $30.”
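
As a sketch, the widget could render that selection from a selectedItemId field in its state; the field name and the card layout below are assumptions for this example:

// A sketch of rendering the current selection, assuming widget state keeps a
// hypothetical selectedItemId field updated from the voiceAction tool result.
interface GiftCard {
  id: string;
  title: string;
  price: number;
}

interface GiftListState {
  items: GiftCard[];
  selectedItemId?: string;
}

export function GiftList({ state }: { state: GiftListState }) {
  return (
    <div className="flex flex-col gap-2">
      {state.items.map((item) => (
        <div
          key={item.id}
          className={
            item.id === state.selectedItemId
              ? "rounded-xl border-2 border-primary p-3"
              : "rounded-xl border p-3"
          }
        >
          <div className="text-lg font-semibold">{item.title}</div>
          <div className="text-base text-muted-foreground">${item.price}</div>
        </div>
      ))}
    </div>
  );
}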

Optional Step 4. Checkout

If the App is integrated with payments (in a future module 14), checkout begins. The model states the terms and asks for confirmation by voice or button. The widget switches to a fullscreen wizard with steps “Review order” → “Shipping address” → “Confirmation”.

It’s important that at every step all key points are spoken by voice, and the widget provides visual support, especially if the user has stopped and is looking at the screen.

9. Practical implementation notes and the limits of the Apps SDK

All the described GiftGenius steps are implemented inside a regular ChatGPT App — without your own audio client or WebRTC. Here it’s important to remember the boundaries of the stack.

It’s easy to get carried away with the Realtime API, WebRTC, and audio streaming and start imagining your own voice platform. That’s what module 20 is for. In this lecture, stay within the boundaries of a ChatGPT App running inside the ChatGPT client.

In the current architecture:

  • The audio stream is managed by the ChatGPT client. You don’t send or receive audio bytes in the widget.
  • On the backend you still see ordinary tool calls and text messages, but the model may be in voice mode and its responses will be spoken.
  • The platform may pass indirect signs that it’s currently in voice mode (via user-agent or environment fields). But you should not build a hard dependency on this: the API may change, and your App should remain useful in pure text mode too.

Therefore, a good implementation strategy is as follows. First design a UX that works well both for text and for voice: brief statuses, clear CTAs, understandable progress stages. Then add a few voice‑focused improvements: slightly larger fonts in "voiceGlance" mode, more explicit progress, emphasis on statuses like “Step 2 of 3” and clear states like “Waiting for confirmation”.

Additionally, in the system prompt describe the model’s voice behaviour: how it comments on widget state, which phrases it uses for confirmations, which words it avoids (for example, don’t read JSON, don’t speak every tiny detail in a list).

If later you build your own Custom Voice Client on the Realtime API, all these UX decisions will carry over smoothly. The only difference will be your level of access to events and streaming, not the principles.

10. Common mistakes when working with the Voice / Realtime context

Mistake #1: “Reading the UI out loud” instead of a summary.
Sometimes developers design tools so that the model starts reading the entire JSON response or a full list of cards out loud. In voice mode that kills the UX: the user loses the thread and you waste tokens. It’s better for the voice to give a short summary and focus on one or two options, leaving the rest on screen.

Mistake #2: No visual feedback in voice mode.
It’s tempting to think: “Since the user is talking, they’re listening; UI isn’t needed.” In practice users often glance at the screen or come back to it a minute later. If there’s no status, no progress, and no clear outcome at that moment, they’ll assume the App froze or did nothing. Always show “I’m thinking”, “Step 2 of 3”, “Results are ready”, etc.

Mistake #3: Dangerous actions without strong confirmation.
Even in text mode it’s risky to do “Pay” with a single click, and in voice it’s even more dangerous to execute a purchase on a vague “uh‑huh”. Ignoring explicit confirmation flows (voice and/or visual) leads to wrong purchases and trust issues. Think through which actions require double confirmation, and state that explicitly in the system prompt and UI.

Mistake #4: Designing only for the eye, not for the ear.
Sometimes an App is designed as if the user is always reading: overly complex phrasing, long buttons, overloaded descriptions. In voice mode all of that also needs to be spoken — resulting in a “word salad”. Aim for key meaning to fit into short, simple phrases that are easy to understand by ear.

Mistake #5: Confusing the Apps SDK with your own voice client.
Some students start looking in the Apps SDK for microphone events, audio streaming, WebRTC like in the Realtime API, and get disappointed that “none of that exists.” It’s important to understand: a ChatGPT App lives inside the ChatGPT client, and voice is handled by the platform. You work with text, tool calls, and widget state, designing the UX so that voice mode “just works well.” If you need full control over voice, that’s a separate, more complex project with the Realtime API.

Mistake #6: No strategy for latency.
If you don’t think through what the model says and what the widget shows during long operations, the user will interrupt, ask new questions, and break your flow. Latency in voice feels stronger than in text. Use interim statuses, background job processing, and voice “I’m thinking; meanwhile tell me…” so silence doesn’t turn into a bug.
