
Fault tolerance: step rollback, retries, and error control

ChatGPT Apps
Level 11, Lesson 3

1. Why errors in a ChatGPT App are the norm, not an emergency

In the previous lecture, we discussed how to break a task into steps and build a multi‑step workflow in a ChatGPT App. Now let’s add the honest reality to this scheme: errors, timeouts, and user interruptions.

In classic web apps, logic often revolves around the “happy path,” and errors are treated as rare and exceptional — a red 500 page, etc. In a ChatGPT App, the picture is different: you are working in a distributed system with an LLM, external APIs, MCP, a widget, and a user who can close the tab at any moment. Errors and interruptions are an everyday routine.

There are a few characteristics that make life harder:

  • First, LLMs are nondeterministic. Even with the same prompt, the model might make a slightly different decision: call another tool, change parameters, or decide it’s better to “clarify.”
  • Second, there are network and infrastructure constraints. A ChatGPT tool‑call has timeouts (usually tens of seconds), as does your Next.js/Vercel backend. If an external API is slow, everything can be cut off midway.
  • Third, there’s the UX factor: the user gets distracted, closes the chat, returns a day later — and you can’t keep a database transaction open all that time.

Hence the main thesis of this lecture:

A fault‑tolerant workflow assumes any step can fail and explicitly defines what happens next.

An error is not only a reason to show a message to the user, but also a signal to the model, which can change strategy, propose a rollback, try another tool, or gracefully end the scenario.

2. The error landscape in a workflow: what kinds there are

To handle failures well, you first need to distinguish between them. In LLM applications built on the ChatGPT Apps platform, several error classes are common.

Technical errors. This is the classic set from distributed systems: network timeouts, 5xx from your own or external APIs, MCP server crashes, a bug in a tool handler. For example, in GiftGenius your MCP tool search_products calls the catalog, and it responds with 503 Service Unavailable. That’s a candidate for an automatic retry.

Logical (model) errors. This includes model refusals (it decided the request violates policy), hallucinations, or, for example, a broken JSON in a tool response. The model could have generated invalid arguments for a tool‑call, and your JSON validation rejected them. This is most often an input data error, not an infrastructure issue.

Business errors. These are about meaning: the item is out of stock, the user’s budget is too small for the chosen filters, the promo code is invalid, the reservation expired. In GiftGenius, this is the situation “out of 500 candidates, none fits the constraints.” A retry rarely helps here: you either need to change parameters or explain to the user that the constraint is unrealistic.

UX interruptions. The user breaks the scenario on purpose: closes ChatGPT, clicks “Back” in the widget, cancels an action, changes an answer to a previous step. This should also be considered part of the normal flow, not an error. It’s important to restore and roll back state in such cases — we’ll talk about this shortly.

A particularly tricky case, at the intersection of logical and technical errors, is infinite agent loops: the model gets an error, thinks “hmm, I’ll try again,” gets another error, and so on until the context or budget runs out. Guarding against this behavior is an important part of error design.

3. Basic strategies: retry, fail‑fast, rollback, involve the user

You can view any error as a branching point: we either try to repeat the step, roll back, or involve the user. And importantly, these strategies can be combined.

For technical and transient failures (the network blinked, the API returned 503), it’s logical to do a limited retry with backoff. For logical and business errors (“the validator rejected the budget,” “items ran out”), repeating is pointless — you need to fail fast and ask the user to change the input or parameters.

For operations that already changed something in the external world (created an order, reserved inventory), you need a rollback — either as a “step back” in the UI/context, or as real compensating actions (order cancellation, refund).

Finally, some situations inherently require user participation: for example, if a payment processor rejects a card with “card declined by bank,” you can’t fix that automatically. The model should clearly explain what happened and offer options: try a different card, lower the amount, or abandon the purchase.

For a robust workflow, it’s very useful to explicitly list, for each step, which error types are possible and what you do for each — auto‑retry, rollback, ask the user, or just log and end the branch.
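One lightweight way to make that list explicit is a per-step policy table that your error handler consults when a step fails. A minimal sketch in TypeScript (the step names and the policy shape are illustrative, not a fixed GiftGenius API):

```typescript
type ErrorKind = "technical" | "logical" | "business" | "ux";
type Strategy = "retry" | "fail_fast" | "rollback" | "ask_user" | "log_and_end";

// Illustrative policy table: which strategy applies to which error kind, per step.
const stepErrorPolicies: Record<string, Partial<Record<ErrorKind, Strategy>>> = {
  search_products: { technical: "retry", business: "ask_user" },
  create_order: { technical: "retry", logical: "fail_fast", business: "ask_user" },
  charge_card: { technical: "rollback", business: "ask_user" },
};

function strategyFor(step: string, kind: ErrorKind): Strategy {
  // anything not listed falls back to "log and end the branch"
  return stepErrorPolicies[step]?.[kind] ?? "log_and_end";
}
```

Keeping this table next to the workflow definition also doubles as documentation: a reviewer can see at a glance what happens when each step fails.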

4. Retries and backoff: when and how to use them

Let’s start with the most natural developer reaction: “Well, let’s just try again.” The idea is sound, but as always, the devil is in the details.

Which errors are safe to retry

A good heuristic from integration practice goes like this: network errors and 5xx can be retried with a pause, while 4xx usually should not.

That is, if you get 503, 504, or simply no response from an external API, it makes sense to repeat the request with a short delay. If the server returns 400 Bad Request or 422 Unprocessable Entity, most likely the problem is in the data, and repeating with the same parameters won’t change anything.
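As a tiny helper, that heuristic can be written down directly. A sketch (adjust the edge cases, such as 429 rate limits, to the APIs you actually call):

```typescript
// Sketch: classify an HTTP status as retryable or not.
// Transient server-side failures (5xx) and a missing status (network error,
// no response at all) are worth retrying; client errors (4xx) are not.
function isRetryableStatus(status?: number): boolean {
  if (status === undefined) return true; // network error / timeout, no response
  if (status >= 500) return true;        // 503, 504, ...
  return false;                          // 4xx: fix the request data instead
}
```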

A simple callWithRetry utility in TypeScript

Let’s write a small utility for the MCP or backend layer that you can use in tools:

type RetryOptions = {
  maxRetries: number;
  baseDelayMs: number;
};

async function callWithRetry<T>(
  fn: () => Promise<T>,
  { maxRetries, baseDelayMs }: RetryOptions
): Promise<T> {
  let attempt = 0;

    // the loop is bounded: the maxRetries check below guarantees an exit
  while (true) {
    try {
      return await fn();
    } catch (err: any) {
      attempt++;
      const status = err?.status ?? err?.response?.status;

      // do not retry 4xx
      const isClientError = typeof status === "number" && status >= 400 && status < 500;
      if (attempt > maxRetries || isClientError) {
        throw err;
      }

      const delay = Math.min(baseDelayMs * 2 ** (attempt - 1), 10_000);
      // small pause to avoid a thundering herd on the API
      const jitter = Math.random() * 200;

      await new Promise((r) => setTimeout(r, delay + jitter));
    }
  }
}

This function:

  • retries calling fn a limited number of times;
  • uses exponential backoff with small random jitter to avoid the “thundering herd” on concurrent retries;
  • stops retrying on 4xx.

It’s a good fit inside an MCP tool that calls a product catalog or an internal recommendations API, for example.

Where exactly to perform retries

A common mistake is attempting to retry every request everywhere, including layers you don’t control. In the ChatGPT ecosystem, you have several places for retries:

  • inside your own backend/MCP (as we did in callWithRetry);
  • inside a background worker/queue (we’ll discuss job queues and DLQ in future modules);
  • sometimes — in the widget itself, when it’s a lightweight “refresh list” request with no side effects.

It’s important not to duplicate logic: if your job worker already does three retries with backoff, there’s no point in adding five more retries in the widget. And never retry in an unbounded loop with no attempt cap: that’s a surefire way to DDoS yourself. (The while (true) in callWithRetry is safe only because the maxRetries check guarantees an exit.)

5. Step idempotency: protection against duplicates

Retries create a second problem: how not to perform the same action twice. In the LLM world, this is especially acute: the model can accidentally call the same tool multiple times, ChatGPT can repeat a tool‑call after a timeout, the user can press “Regenerate,” and the UI or agent may add yet another call of its own.

The idea of idempotency is simple: a step is idempotent if repeating it with the same input does not cause additional side effects. Requesting a product feed — fine, recalculating recommendations — fine, but charging money again or creating a second order with the same data — definitely not fine.

Idempotency key in a ChatGPT App

A classic pattern: for each logical step with side effects, you generate an idempotency_key (usually a UUID), pass it through the model to the MCP tool, and store the mapping “key → result” there. If the tool is called a second time with the same key, it does not repeat the action and simply returns the previously saved result.

In our GiftGenius, there is a create_order step. Imagine the user clicks “Pay,” the model calls the tool, the payment succeeds, but somewhere along the way the response is lost. The model or platform decides to repeat the call, and if we don’t have idempotency, we’ll get a duplicate order or double charge.

A simple example of an idempotent tool in TypeScript

Let’s build a very simplified MCP tool handler create_order with an idempotency key. For simplicity, we’ll use an in‑memory Map; in real life, this would be a DB or cache.

type CreateOrderInput = {
  userId: string;
  items: Array<{ sku: string; qty: number }>;
  idempotencyKey: string;
};

type CreateOrderResult = { orderId: string; status: "created" };

const idempotencyStore = new Map<
  string,
  { paramsHash: string; result: CreateOrderResult }
>();

export async function createOrderTool(input: CreateOrderInput): Promise<CreateOrderResult> {
  const { idempotencyKey, ...rest } = input;
  const paramsHash = JSON.stringify(rest);

  const existing = idempotencyStore.get(idempotencyKey);
  if (existing) {
    // if the key already exists, ensure the params match
    if (existing.paramsHash !== paramsHash) {
      throw new Error("Idempotency key reuse with different params");
    }
    return existing.result;
  }

  // here we perform the real order creation and payment
  const result: CreateOrderResult = {
    orderId: "order_" + Math.random().toString(36).slice(2),
    status: "created",
  };

  idempotencyStore.set(idempotencyKey, { paramsHash, result });
  return result;
}

Here we:

  • require idempotencyKey in the tool input;
  • store a parameter hash with it (for simplicity, JSON.stringify here);
  • treat a repeat call with the same key but different data as an error;
  • return the previous result on a repeat call with the same data.

In a real project, you should:

  • store keys in a DB with TTL (so the table doesn’t grow without bound);
  • log the idempotency_key and include it in _meta MCP messages to track it easily via Inspector and dashboards.

6. Step rollbacks and the Saga pattern

Idempotency protects against duplicates but doesn’t solve another case: what to do if one of the steps in the middle of the scenario fails.

In e‑commerce, this is a classic problem: you’ve already created an order and reserved items in the warehouse, and then something goes wrong at the payment stage. You can’t just “forget it” — you need to roll back the previous state somehow.

Logical vs technical rollback

In a ChatGPT workflow, there are two levels of rollback.

Logical rollback is returning to the previous scenario step and adjusting the context. For example, an error occurs at the “payment” step, and you decide to roll back to “choose payment method” or even “choose a gift.” Then it’s important to:

  • update the WorkflowContext on the backend (current step, selected parameters);
  • inform the model about the step change via a tool‑call/ToolOutput so it “forgets” the old branch and adapts subsequent behavior;
  • update the widget UI so steps and buttons reflect the new state.

Technical rollback is at the business level: canceling created entities and compensating external effects. For example: cancel the order, release the warehouse reservation, initiate a refund. This is the Saga pattern: for each “dangerous” step, you plan a compensating action in advance.

Forward/compensate flow for GiftGenius

For a simplified GiftGenius checkout, we can draw this sequence:

flowchart TD
  A[Step 1: create_order] --> B[Step 2: reserve_items]
  B --> C[Step 3: charge_card]

  C -->|success| D[Status: completed]

  C -->|error| E[Compensation: cancel_reservation]
  E --> F[Compensation: cancel_order]
  F --> G[Status: failed + user message]

Each action that changes the external world (creating an order, reserving items, charging a card) has a corresponding compensating action (cancel the order, release the reservation, refund). They are not always symmetric or possible one‑to‑one, but that’s the general principle.

A mini example with compensation in code

Let’s look at a small code snippet that performs these steps:

async function completeCheckout(ctx: { userId: string }) {
  const order = await createOrderInDb(ctx.userId);

  try {
    await reserveItems(order.id);
    await chargeCard(order.id);
    return { orderId: order.id, status: "paid" as const };
  } catch (err) {
    // compensating actions
    await safeCancelReservation(order.id);
    await safeCancelOrder(order.id);
    throw err;
  }
}

Here:

  • createOrderInDb, reserveItems, chargeCard are forward steps;
  • safeCancelReservation and safeCancelOrder are compensating steps that should themselves be idempotent (if we attempt to cancel something already canceled, nothing bad happens).

Note that on error we do not hide it — we propagate it. The model (via ToolOutput) should receive a clear error and then explain it to the user in human terms and propose the next step.
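To make a compensating step idempotent, write it as a no-op on already-cancelled (or never-created) state. A sketch of safeCancelOrder, using an in-memory Map as a stand-in for the orders table:

```typescript
// Sketch of an idempotent compensating action. Cancelling an order that is
// already cancelled, or that was never created, succeeds silently, so the
// compensation itself is safe to retry.
const orderStatuses = new Map<string, "created" | "paid" | "cancelled">();

async function safeCancelOrder(orderId: string): Promise<void> {
  const status = orderStatuses.get(orderId);
  if (status === undefined || status === "cancelled") {
    return; // repeat cancellations do nothing
  }
  orderStatuses.set(orderId, "cancelled");
  // a real implementation would also notify the payment provider here
}
```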

7. Rolling back steps and state sync: avoiding desynchronization

There’s a special kind of “error” that’s easy to underestimate: state desynchronization between the UI, backend, and model.

Typical scenario:

  1. The user goes through steps 1 → 2 → 3.
  2. Something goes wrong at step 3; the user clicks the “Back” button in the widget.
  3. The widget dutifully returns its local state to step 2.
  4. But the model “remembers” we were at step 3 and already tried to pay. In the next message it keeps talking about payment, even though the user sees the gift selection screen.

To prevent this, it’s useful to introduce an explicit step rollback event. The widget sends it to MCP/the model — either as a tool call or as a ToolOutput.

For example, you can create a simple tool user_navigated_to_step that records the current step and its state:

type NavigateInput = {
  workflowId: string;
  stepId: string;
};

export async function userNavigatedToStep(input: NavigateInput) {
  await workflowRepo.setCurrentStep(input.workflowId, input.stepId);
  return {
    message: `User moved to step ${input.stepId}`,
  };
}

When the user clicks “Back,” the widget calls this tool; the model sees its result in the history of tool calls and understands it should continue the dialog from the new step.

On the UI side, the handler looks roughly like this:

async function handleBackClick() {
  const { workflowId, prevStepId } = widgetState;

  await window.openai.tools.call("user_navigated_to_step", {
    workflowId,
    stepId: prevStepId,
  });

  setWidgetState((s) => ({ ...s, currentStepId: prevStepId }));
}

An important point: the backend/agent is the source of truth for the current step, and the model sees it through tools. Then even when restoring a session later, you can synchronize the context correctly.

8. Error UX: what the user sees vs what the model sees

We already learned how to survive errors technically (retries, rollbacks, idempotency, state sync). Now we need to make it look reasonable to both the user and the model.

Even a perfectly implemented retry and rollback won’t help if the error UX is “like old Java servlets”: red text, a stack trace, and a cryptic “Unexpected error.”

In a ChatGPT App, error messages have two audiences:

  • the user, who needs to understand what happened and what they can do next;
  • the model, which needs sufficiently structured information to decide whether to retry, change parameters, suggest an alternative, or end the scenario.

Good practice:

  • at the MCP/tools level, return a structured error with a code, type, retryable flag, and a brief technical message;
  • give the model exactly this structure (for example, in result.structuredContent), not a mile‑long stack trace;
  • show a short, human‑friendly message in the UI.

A mini example of an error structure returned by a tool:

type ToolError = {
  code: string;          // e.g. "PAYMENT_TIMEOUT"
  message: string;       // short technical description
  retryable: boolean;    // whether it is safe to try again
};

const toolError: ToolError = {
  code: "PAYMENT_TIMEOUT",
  message: "Payment provider did not respond in time",
  retryable: true,
};

throw { isError: true, error: toolError };

The model sees retryable: true and can try a different tool or suggest the user retry.

On the widget side, you simply map those codes to user‑friendly texts:

function ErrorBanner({ code }: { code: string }) {
  const text =
    code === "PAYMENT_TIMEOUT"
      ? "The payment service didn’t respond in time. Please try again in a minute."
      : "Something went wrong. Please try again.";

  return <div className="error-banner">{text}</div>;
}

And one more important point: don’t show users exception stacks, tokens, or secrets. It’s both ugly and unsafe. Log the technical details on your side and give users a short, safe message.

Insight

In LLM systems like ChatGPT, incorrect tool calls are the rule rather than the exception. The model regularly generates arguments that fail validation: mixed‑up types, missing fields, invalid values, broken structures. This is not an “error” in the usual engineering sense — it’s part of the stochastic nature of the model, and the entire error interface needs to adapt to it.

The key idea: an error message is not a “something broke” signal; it’s an instruction for fixing the next attempt. Its main audience is the model itself. If the message is structured and includes precise guidance, the model can automatically correct parameters and repeat the call properly. This is exactly what Tool‑Reflection techniques are based on: correct feedback improves the agent’s next action without human involvement.

I recommend sticking to these requirements for error formats:

  • the message should point to the specific field that failed validation — not a generic “Invalid parameters”;
  • it’s important to explicitly describe the expected format or allowed values so the model can pick a suitable one;
  • the message should be concise, formal, and structured: fields such as error_type, field, expected, or allowed_values help the model a lot;
  • when possible, include a minimal example of valid input — this often boosts the model’s recovery accuracy.

The ideal error feedback for the model includes two facts: what went wrong and how to fix it.
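Put together, a validation error that follows these requirements might look like this in TypeScript (the field names and the budget example are a suggested convention, not a platform requirement):

```typescript
// Sketch: a validation error that tells the model exactly what to fix:
// which field failed, what was received, what is expected, and a valid example.
type ValidationError = {
  error_type: "validation_error";
  field: string;
  received: unknown;
  expected: string;
  example?: unknown;
};

function invalidBudgetError(received: unknown): ValidationError {
  return {
    error_type: "validation_error",
    field: "budget",
    received,
    expected: "positive integer (whole dollars)",
    example: { budget: 50 },
  };
}
```

A model that receives this payload can usually repair the call on the next attempt without any human involvement.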

9. Logging and metrics for workflow errors

Even if the error UX is polished, user‑facing messages alone aren’t enough to understand what actually breaks. You need structured logs and per‑step metrics.

A minimally useful set when logging each workflow step:

  • user_id or at least a session_id;
  • workflow_id and step_id;
  • step status (success, failed, retry, rolled_back);
  • error_code (if any);
  • idempotency_key and correlation_id if the step is tied to external calls.

MCP and Agents have _meta fields; they’re a convenient place for idempotency_key and correlation_id so you can see them in logs and in the Inspector.
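For illustration, a tool result that carries those identifiers in _meta might look like this (a sketch: the exact result shape depends on your MCP SDK, but it follows the { content, _meta } convention from the spec):

```typescript
// Sketch: attaching tracing identifiers to a tool result's _meta field so
// they show up in logs and in the Inspector alongside the result itself.
function orderCreatedResult(
  orderId: string,
  idempotencyKey: string,
  correlationId: string
) {
  return {
    content: [{ type: "text" as const, text: `Order ${orderId} created` }],
    _meta: { idempotencyKey, correlationId },
  };
}
```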

The simplest logging example in Node.js/TypeScript (you can use console or winston/pino):

function logStepFailure(params: {
  userId?: string;
  workflowId: string;
  stepId: string;
  errorCode: string;
  idempotencyKey?: string;
}) {
  console.error(
    JSON.stringify({
      level: "error",
      event: "workflow_step_failed",
      ...params,
      timestamp: new Date().toISOString(),
    })
  );
}

Such logs are easy to parse, dashboard, and use to compute:

  • conversion between steps;
  • the most frequent error types;
  • the share of steps ending in retry vs final failure.
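As a sketch of the analytics side, counting error codes from parsed step logs takes only a few lines (the StepLog shape mirrors the fields suggested above; it is an assumption, not a fixed schema):

```typescript
// Sketch: aggregate parsed step logs into per-error-code counts.
type StepLog = {
  stepId: string;
  status: "success" | "failed" | "retry" | "rolled_back";
  errorCode?: string;
};

function errorCodeCounts(logs: StepLog[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const log of logs) {
    if (log.errorCode) {
      counts[log.errorCode] = (counts[log.errorCode] ?? 0) + 1;
    }
  }
  return counts;
}
```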

Not every error should trigger a production alert. Critical ones — MCP outages, systemic timeouts, mass failures on a specific step — should go to monitoring. But “no gift search results” is a business event, not an incident.

10. Evolving GiftGenius: a resilient checkout step

Let’s put it all together now: retries, idempotency, Saga, state sync, error UX, and logging — on the example of a single step in our training app, GiftGenius — checkout.

What we already have

By now we already:

  • have a multi‑step workflow: gather information → generate ideas → choose a gift → checkout;
  • configured tool gating: at the checkout step only commerce tools are available (create_order, get_payment_methods, etc.);
  • have a WorkflowContext storing the selected gift, budget, userId, and the current step.

What we’ll add in this lecture

For the checkout step we’ll add:

  1. an idempotency_key for the create_order tool;
  2. retry for temporary errors from the payment provider;
  3. compensation for partially successful operations;
  4. proper error UX in the widget.

Generating an idempotency key in the widget when the user clicks “Pay”:

import { v4 as uuid } from "uuid";

async function handlePayClick() {
  const idempotencyKey = uuid();
  setWidgetState((s) => ({ ...s, idempotencyKey }));

  await window.openai.tools.call("create_order", {
    userId: widgetState.userId,
    items: [/* ... */],
    idempotencyKey,
  });
}

On the create_order tool side, we use that same idempotent handler we wrote above: it stores the key and result and does not create a new order on repeats.

Wrap the payment API interaction in callWithRetry to try charging a few times on network glitches. Don’t forget to include retryable: true in the error so the model understands it can suggest a retry.
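The translation from a raw provider failure to that structured error might look like this (a sketch: the codes and the ToolError shape repeat the illustrative convention from section 8):

```typescript
// Sketch: map a payment-provider failure to the structured ToolError shape.
// No status (network failure / timeout) or 5xx counts as transient, so the
// model may suggest a retry; anything else needs different input or a human.
type ToolError = { code: string; message: string; retryable: boolean };

function toPaymentToolError(err: { status?: number; message?: string }): ToolError {
  const transient = err.status === undefined || err.status >= 500;
  return {
    code: transient ? "PAYMENT_TIMEOUT" : "PAYMENT_REJECTED",
    message: err.message ?? "Payment failed",
    retryable: transient,
  };
}
```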

If something breaks after successfully creating the order and charging (for example, an external webhook is late), we log it with a correlation_id and workflow_id and then:

  • try a background retry (in a future module about queues and events);
  • or explicitly mark the step as failed, run compensations, and explain to the user what happened.

11. Common mistakes when designing fault‑tolerant workflows

Mistake #1: “We retry everything until it works.”
Automatically retrying any step until success is a reliable way to create local chaos. Network and 5xx errors can be retried with backoff and attempt limits. But 4xx, business errors, and logical model failures should be fixed with data or explained to the user. Otherwise you get unstable behavior, odd charges, and noisy logs.

Mistake #2: No idempotency where money and orders are involved.
If a tool like create_order or charge_card isn’t idempotent, any repeat call (due to a timeout, Regenerate, or an agent bug) can lead to duplicates. In LLM scenarios, repeats happen much more often than in classic REST frontends, so an idempotency_key is not a “nice to have” — it’s mandatory for payment and other critical steps.

Mistake #3: No compensating actions (no Saga).
You created an order, reserved items, then failed on payment and just showed “something went wrong.” As a result, you have half‑orders, lingering reservations, and financial tails in the system. For every step that changes the external world, plan what you’ll do if the next step fails: cancel, refund, mark as “expired,” etc.

Mistake #4: Letting the agent loop in retries forever.
If you don’t limit the number of attempts (for example, via maxRetries in helpers or max_iterations in agent logic) and don’t mark errors as retryable: false where retries are useless, the model can loop: “Try again… again…”. This burns tokens, time, and patience.

Mistake #5: UI/model state desynchronization on rollback.
Developers often implement a “Back” button only in the UI, forgetting to sync the step with the backend and the model. As a result, the user sees step 2 while the model lives on step 3 and makes odd suggestions. The solution is explicit events like user_navigated_to_step and updating the WorkflowContext on each transition.

Mistake #6: Technical messages for users and no logs for developers.
The user gets “Error: ECONNRESET at TcpSocket.onEnd…,” while you get zero info about which workflow_id and step_id failed. The sound approach: user — short, clear text and a suggestion of what to do next; developer — a structured log with workflow_id, step_id, error_code, idempotency_key, and correlation_id.

Mistake #7: No alerting strategy.
Either everything is alerted — including “no suitable gifts for your very narrow filter” — or nothing is alerted, including a real MCP outage. Separate critical system failures (service down, mass timeouts, missing webhooks) from expected business events. The former go to monitoring and on‑call; the latter are just counted in analytics.

Task: Structured tool error for invalid input