1. What is an incident in the world of ChatGPT Apps
On the classic web, an incident is usually something like “the server is down,” “500 errors spiked,” or “latency doubled.” ITIL gives a formal definition: an incident is an unplanned interruption to a service or a reduction in the quality of a service.
In the world of ChatGPT Apps and GiftGenius, the picture is more complex. We have a model layer that can:
- not call the required tool even though everything is available;
- call a tool with incorrect parameters;
- “hallucinate” a result, ignoring your MCP.
Therefore, an incident may be not just an HTTP 500, but also a situation where all backend metrics are green while users complain en masse that “the bot is dumb and doesn’t show gifts,” because the model stopped calling suggest_gifts or confuses its arguments. This is a quality incident.
It’s convenient to think of incidents by category:
| Category | Example symptom | Example metric (SLI) |
|---|---|---|
| Availability | MCP is not responding, “Error talking to app” in ChatGPT | % of successful responses to /mcp |
| Latency | gift selection takes 10+ seconds | p95 duration of suggest_gifts |
| Quality | the model doesn’t call the required tool, mixes up currency | share of requests without a tool call when it’s explicitly required |
| Commerce | checkout no longer succeeds, funds don’t move | checkout_success_rate |
An incident is the moment when an actual metric goes out of the bounds of a pre-agreed SLO. For example:
- we agreed: p95 of gift selection < 4 seconds. Now it’s 9 seconds;
- we want 99% of checkouts per week to succeed, but it dropped to 94%;
- we expect that in purchase scenarios the model almost always calls create_checkout_session, but logs show a sharp rise in “misses.”
Important: an incident is not “someone complained in chat.” A complaint is a trigger, while the decision “yes, this is an incident” is made based on SLO/SLI and dashboards.
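To make “outside the SLO boundary” concrete, here is a minimal sketch of SLOs as data plus a breach check. The metric names and thresholds are illustrative examples matching the ones above, not a real GiftGenius config:

```typescript
// Minimal sketch: an SLO as data plus a breach check.
interface Slo {
  metric: string;
  // 'max' = the value must stay below target (e.g. latency),
  // 'min' = the value must stay above target (e.g. success rate)
  kind: 'max' | 'min';
  target: number;
}

const slos: Slo[] = [
  { metric: 'p95_suggest_gifts_ms', kind: 'max', target: 4000 },
  { metric: 'checkout_success_rate', kind: 'min', target: 0.99 },
];

function isSloBreached(slo: Slo, observed: number): boolean {
  return slo.kind === 'max' ? observed > slo.target : observed < slo.target;
}
```

With this shape, the examples above become calls like `isSloBreached(slos[0], 9000)` (9-second p95 against a 4-second target) returning `true`.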
2. How SLO/SLI turn into incidents
In the observability module you already defined key metrics: latency, availability, error rate, checkout success. Now we use them as “guards at the door.”
The simplest scenario: we have an SLO for checkout_success_rate. We keep structured event logs:
// Example of a checkout log event in the MCP server
logger.info({
event: 'checkout_result',
request_id,
user_id,
checkout_session_id,
status: 'success', // or 'failed'
error_code: null,
});
On top of these logs we build a metric: the share of status = "success" among all checkout_result over the last N minutes/hours. When this share drops below a threshold (for example, 95% over 10 minutes), monitoring sends an alert to the on-call channel. This is incident detection: the SLI went outside the SLO boundary.
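That aggregation can be sketched as a pure function over in-memory events. The field names follow the log example above; in production this would run inside your metrics system over a real event stream, not in application code:

```typescript
interface CheckoutEvent {
  event: 'checkout_result';
  status: 'success' | 'failed';
  timestamp: number; // epoch milliseconds
}

// Share of successful checkouts within the last `windowMs`, relative to `now`.
// Returns null when there were no events: "no data" is not the same as 0%.
function checkoutSuccessRate(
  events: CheckoutEvent[],
  windowMs: number,
  now: number,
): number | null {
  const recent = events.filter((e) => now - e.timestamp <= windowMs);
  if (recent.length === 0) return null;
  const ok = recent.filter((e) => e.status === 'success').length;
  return ok / recent.length;
}
```

An alerting rule then reduces to: compute this over a 10-minute window and page if the result is below 0.95.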
Alerts can likewise fire on:
- a spike in the error_rate of tools suggest_gifts, search_products;
- rising p95/p99 latency;
- an anomalous drop in workflow_completed (people don’t reach the end of the flow);
- an anomalous rise in LLM cost without a traffic increase (an economic incident).
All this is possible only because we log in a structured way instead of writing “something’s wrong with checkout again.” Once metrics and alerts are configured, we’ve learned to notice that something went wrong. The next question: what happens after detection, who reacts, and how?
3. Incident lifecycle: from detection to post-mortem
To avoid living in constant-fire mode, it’s helpful to describe a standard incident pipeline. Many SRE teams formalize it as a chain:
flowchart TD
D["Detection (discovery)"] --> T["Triage (severity assessment)"]
T --> M["Mitigation (quick containment)"]
M --> R["Resolution (final fix)"]
R --> P["Post-mortem (review and improvements)"]
Let’s walk through the stages using GiftGenius.
Detection — how to recognize it’s bad
Problem detection can be automatic or manual.
Automatic — alerts from monitoring based on SLO/SLI:
- PagerDuty / Opsgenie / email / Slack bot is screaming: SEV-1: checkout_success_rate < 60% over 10 minutes;
- latency alert: p95(suggest_gifts) > 10 s;
- cost anomaly: “LLM costs doubled with the same number of workflow_completed.”
Manual detection — when support (or you personally in Telegram) receives a flood of messages like “payment doesn’t go through,” “the widget spins forever.” Sometimes this highlights the problem before monitoring catches up.
Practical takeaway: even if you don’t have perfect monitoring yet, get used to looking at any mass user complaints through the lens of metrics: “which metric backs this and how can we measure it?”
Triage — classification and prioritization
After detection you need to answer two questions: how bad is it and who will fix it.
It’s useful to have a simple severity scale:
- SEV-1: critical — users can’t purchase, the app is broken on a key path (for example, checkout=0 with live traffic).
- SEV-2: serious but degraded — some users can’t complete the flow; latency increased significantly, but not to zero throughput.
- SEV-3: minor bugs — one of the additional tools occasionally fails; only an edge case breaks.
For GiftGenius, commerce incidents are almost always SEV-1: if money isn’t moving, you don’t just have a technical problem, but direct loss in revenue and trust.
At this step an on-call engineer is assigned (or you yourself, if it’s a one-person team) and a decision is made: “Yes, this is an official SEV‑1 incident; we follow runbook N” (a runbook is a prewritten step-by-step instruction; we’ll cover its structure in a separate section).
Mitigation — stop the bleeding
Mitigation is not about finding the root cause; it’s about quick measures to reduce user impact. Examples:
- rollback the latest MCP/Agents/ACP release;
- turn off the problematic feature flag;
- switch GiftGenius into a “viewer-only” mode: show recommendations but don’t allow checkout;
- temporarily reduce load (rate limiting) or disable heavy tools.
Typical code for a “degraded mode” in our MCP:
// Pseudo-code: a global kill switch that can be toggled quickly
// (in production this would live in your feature-flag system, not module state)
let checkoutDisabled = false;

export function setCheckoutDisabled(value: boolean) {
  checkoutDisabled = value;
}

// Illustrative argument shape; your real CheckoutArgs will differ
interface CheckoutArgs {
  cartId: string;
  userId: string;
}

export async function createCheckoutSession(args: CheckoutArgs) {
  if (checkoutDisabled) {
    // Tell the model that payments are temporarily unavailable
    return {
      error: 'checkout_temporarily_disabled',
      message: 'Payment is temporarily unavailable; show the user an explanation.',
    };
  }
  // regular session creation logic
}
In your feature-flag system you can flip setCheckoutDisabled(true) as part of mitigation: at least users won’t receive 500s and stuck payments; they’ll see an honest message instead.
Resolution — the final fix
Once the “bleeding is stopped,” you have time to find the root cause and fix it:
- a bug in MCP/ACP code;
- a problem with a third-party provider (Stripe, payment gateway);
- OpenAI API limits (429, overload);
- a broken prompt or a model change that stopped calling a tool.
Resolution usually includes:
- a fix (patch/rollback/config);
- deploy to staging, then to production;
- verification of all SLI/SLO;
- returning flags to their normal state.
Post-mortem — learn from failures
After an incident, especially SEV‑1/SEV‑2, conduct a post-mortem: a document where you honestly answer:
- what happened (facts and timeline);
- how it was noticed;
- how you responded;
- what worked well and what didn’t;
- what changes you will make to prevent recurrence.
A post-mortem isn’t about blame; it’s about improving the system and process. Use it to update runbooks and alerts, and sometimes even the architecture.
4. Roles and responsibilities: even if you’re “a team of one”
For the pipeline above to work in real life, it’s important to agree in advance who makes which decisions during a fire. Even if your whole team fits in a single elevator, it makes sense to formalize incident roles. This reduces chaos.
Common roles:
- On-call engineer — the first person paged who makes technical stabilization decisions (rollback, feature flags, temporary fallbacks).
- Incident commander — the person running the process: keeps the timeline, sets task priorities, ensures the team doesn’t thrash. In a micro-team it’s the same on-call wearing another “hat.”
- Communications — handles user and stakeholder updates: messages in Slack, on the status page, in the app UI (widget/chat), in the ChatGPT store.
- Scribe — records important steps and facts; the post-mortem is later written from this outline.
In a one-person team you are all four roles; it’s simply useful to consciously switch modes: “now I’m the engineer and fixing,” “now I’m communicating,” “now I’m recording the timeline.”
5. Runbook: policy instead of memory
A runbook is a document describing step by step what to do for a specific incident type: which graphs to check, which buttons to press, what trade-offs are acceptable. It greatly reduces improvisation and stress.
Runbook structure
Typically, a runbook contains:
- A short description of the incident and how it’s detected. Example: “Spike of ACP checkout errors > 5% over 5 minutes” or “Error talking to app for >20% of requests.”
- Scope — who is affected: all traffic, a specific region, a specific tool only.
- Where to look: links to dashboards (checkout SLO, MCP error rate, logs for tool_name = create_checkout_session), to MCP Inspector, etc.
- Quick mitigation steps: “check Stripe status,” “rollback the latest ACP release,” “enable recommendations-only mode (no purchasing).”
- Steps for the final analysis and fix.
- What to update as a result: alerts, code, documentation.
Mini example runbook for GiftGenius (checkout failing)
Let’s describe it as structured data, closer to code:
type Severity = 'SEV-1' | 'SEV-2' | 'SEV-3';
interface RunbookStep {
title: string;
description: string;
}
interface Runbook {
id: string;
title: string;
severity: Severity;
detection: string;
steps: RunbookStep[];
}
export const checkoutFailureRunbook: Runbook = {
id: 'rb-checkout-failure',
title: 'Increase in checkout errors in GiftGenius',
severity: 'SEV-1',
detection: 'Alert: checkout_success_rate < 60% over 10 minutes',
steps: [
{
title: 'Check external statuses',
description: 'Open the status pages for Stripe and the ACP backend; make sure there is no global outage.',
},
{
title: 'Check recent releases',
description: 'Verify whether there were MCP/ACP deploys in the last 30 minutes. Roll back if necessary.',
},
],
};
In a real runbook you’ll add more steps: enable the read-only feature flag, show a banner in the widget, collect logs for the post-mortem.
Example widget copy during a commerce incident
It’s useful to prewrite user-facing text in the runbook. For example, the GiftGenius widget can show:
"We are currently experiencing temporary technical issues with payments. You can still save your favorite gift ideas, and we’ll complete the purchase a bit later."
Then wire that text into a UI state:
// Pseudocode for the widget state (React-style; assumes useState from 'react'
// and an <Alert> component from your UI kit)
const [checkoutAvailable, setCheckoutAvailable] = useState(true);

if (!checkoutAvailable) {
  return (
    <Alert>
      Payment is temporarily unavailable. You can still browse and save gift ideas.
    </Alert>
  );
}
6. Practice in GiftGenius: code around incidents
So the topic isn’t purely organizational, let’s look at a couple of code snippets that directly help with incident management.
Health-check endpoint for MCP/Backend
A simple but important tool is a health check. In Next.js 16 you can implement it using a route handler:
// app/api/health/route.ts
import { NextRequest, NextResponse } from 'next/server';
export function GET(_req: NextRequest) {
// You can add checks for DB, queues, etc.
return NextResponse.json({
status: 'ok',
mcp: 'healthy',
timestamp: new Date().toISOString(),
});
}
The monitoring system will periodically poll /api/health. If timeouts or 5xx come instead of 200 OK, that’s a clear signal of an Availability incident (MCP is down).
Classifying an incident by metrics
On the analytics service side or in an admin backend script you can keep simple severity logic:
type Severity = 'SEV-1' | 'SEV-2' | 'SEV-3';
interface IncidentContext {
checkoutSuccessRate: number; // 0..1
giftSearchErrorRate: number; // 0..1
p95GiftSearchMs: number;
}
export function classifyIncident(ctx: IncidentContext): Severity | null {
if (ctx.checkoutSuccessRate < 0.6) return 'SEV-1'; // payments aren’t flowing
if (ctx.giftSearchErrorRate > 0.3 || ctx.p95GiftSearchMs > 8000) return 'SEV-2';
return null; // not an incident yet
}
You can run this snippet on a cron or trigger it from monitoring: when it returns SEV‑1, automatically create an incident in your system and notify the on-call.
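That wiring can be sketched as follows. The types from the snippet above are repeated so the block is self-contained, and `Notifier` is a hypothetical hook standing in for your real paging integration (PagerDuty, Opsgenie, a Slack webhook):

```typescript
type Severity = 'SEV-1' | 'SEV-2' | 'SEV-3';

interface IncidentContext {
  checkoutSuccessRate: number; // 0..1
  giftSearchErrorRate: number; // 0..1
  p95GiftSearchMs: number;
}

function classifyIncident(ctx: IncidentContext): Severity | null {
  if (ctx.checkoutSuccessRate < 0.6) return 'SEV-1';
  if (ctx.giftSearchErrorRate > 0.3 || ctx.p95GiftSearchMs > 8000) return 'SEV-2';
  return null;
}

// Hypothetical paging hook: swap in your real notification channel here.
type Notifier = (severity: Severity, details: string) => void;

// Run on a schedule: classify the current metrics and page only on an incident.
function runIncidentCheck(ctx: IncidentContext, notify: Notifier): Severity | null {
  const severity = classifyIncident(ctx);
  if (severity !== null) {
    notify(severity, `Auto-detected ${severity}: ${JSON.stringify(ctx)}`);
  }
  return severity;
}
```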
Logging key incident events
Incidents are about events as well as metrics: when an incident is created, mitigated, or resolved. It’s convenient to keep these in separate logs.
// Assumes a structured logger such as pino; logger.warn already implies the level
function logIncidentEvent(event: {
  incidentId: string;
  type: 'created' | 'mitigated' | 'resolved';
  severity: Severity;
  requestId?: string;
  message: string;
}) {
  logger.warn({
    service: 'incident-manager',
    ...event,
    timestamp: new Date().toISOString(),
  });
}
For example, when enabling the “read-only” mode for GiftGenius:
setCheckoutDisabled(true);
logIncidentEvent({
incidentId: 'inc-2025-11-21-001',
type: 'mitigated',
severity: 'SEV-1',
message: 'Checkout disabled, app switched to recommendations-only mode',
});
You can then easily find these events and correlate them with metric time series.
7. Operational calendar: life after “hooray, it’s fixed”
Incident management isn’t only firefighting; it’s regular preventive care. In SRE practices, the operational cycle is often expressed as an operational calendar with regular reviews of SLOs, costs, and security.
You can loosely group activities by cadence.
Weekly
Once a week (or every other week), it makes sense to:
- review the main SLOs: latency, error rate, checkout success, share of incidents by category;
- see whether there were alerts during the week that “self-quenched,” and decide whether to strengthen/relax thresholds;
- briefly review at least one incident (even SEV‑3) — it trains the post-mortem muscle.
Monthly
Once a month it’s good to:
- review costs (LLM, ACP/Stripe fees, infrastructure) and compare them to revenue — a tie-in with topics 1–2 of module 19;
- look at product metrics: activation, retention, conversion workflow_completed → checkout_success — connected to the marketing and growth module;
- scan security logs for anomalies: odd login patterns, authorization errors, unusual bursts of requests (a bridge to the security module).
Quarterly
Once a quarter you:
- rotate secrets: OpenAI API keys, Stripe, OAuth clients, etc.;
- check whether your SLOs are outdated: maybe the app has grown and a p95 of 2 seconds instead of 1 is now normal, or conversely, you can tighten the goals;
- revisit runbooks: new incident types, updated dependencies (SDK, MCP spec, etc.).
You can keep the calendar as a simple wiki page or a README in the GiftGenius repo; the key is for it to be “living” and updated.
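If you prefer keeping the calendar next to the code, it can also be sketched as plain typed data (the task titles below are a shortened subset of the lists above):

```typescript
type Cadence = 'weekly' | 'monthly' | 'quarterly';

interface OpsTask {
  cadence: Cadence;
  title: string;
}

const opsCalendar: OpsTask[] = [
  { cadence: 'weekly', title: 'Review main SLOs and any self-quenched alerts' },
  { cadence: 'monthly', title: 'Cost review: LLM, ACP/Stripe fees, infrastructure vs revenue' },
  { cadence: 'quarterly', title: 'Rotate secrets: OpenAI API keys, Stripe, OAuth clients' },
];

// Tasks due at a given cadence; a cron job could print these as a checklist.
function tasksFor(cadence: Cadence): OpsTask[] {
  return opsCalendar.filter((t) => t.cadence === cadence);
}
```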
8. Incidents, money, and product: why a commerce fire is the hottest
Module 19 is about the economics and “operational life” of the app, and incidents are closely tied to money here. Commerce incidents — when checkout fails, funds are blocked, or charged twice — almost always take priority over, say, a sporadic timeout in gift search.
Reasons are simple:
- immediate revenue losses right now;
- risk of losing trust (a user who was charged but didn’t receive the product is unlikely to return);
- potential legal and reputational consequences.
Therefore, in your GiftGenius incident catalogue, commerce incidents should be explicitly marked as SEV‑1 with strict SLOs for response time (for example, “on-call response within 15 minutes; mitigation within one hour”).
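Response-time targets like these are easiest to keep as data next to the severity scale. The numbers below are illustrative defaults, not a standard; adjust them to your team’s capacity:

```typescript
type Severity = 'SEV-1' | 'SEV-2' | 'SEV-3';

// Illustrative per-severity targets, in minutes.
const responseTargets: Record<Severity, { respondWithinMin: number; mitigateWithinMin: number }> = {
  'SEV-1': { respondWithinMin: 15, mitigateWithinMin: 60 },
  'SEV-2': { respondWithinMin: 60, mitigateWithinMin: 4 * 60 },
  'SEV-3': { respondWithinMin: 24 * 60, mitigateWithinMin: 7 * 24 * 60 },
};

// Did the on-call respond within the target for this severity?
function respondedInTime(severity: Severity, minutesToRespond: number): boolean {
  return minutesToRespond <= responseTargets[severity].respondWithinMin;
}
```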
Economic anomalies (for example, LLM cost suddenly rising without revenue growth) are incidents too, but typically SEV‑2: they don’t break UX immediately but can eat all your margin if unnoticed.
From the product side, any major incident is a reason to think:
- is the workflow too complex (maybe simpler means more reliable);
- should you add a fallback: for example, if MCP is down, the model at least returns advice without external data;
- do you need to change UX to communicate issues honestly instead of hiding them.
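The fallback idea from the list above can be sketched like this. Both `suggestGiftsViaMcp` and the generic-advice result are hypothetical placeholders; the point is the degrade-gracefully pattern, not the specific content:

```typescript
interface GiftSuggestion {
  title: string;
  source: 'catalog' | 'generic-advice';
}

// Hypothetical MCP-backed lookup; throws here to simulate the MCP being down.
async function suggestGiftsViaMcp(query: string): Promise<GiftSuggestion[]> {
  throw new Error('mcp_unavailable'); // stand-in for a real call
}

// Degrades gracefully: if the MCP call fails, return generic advice
// instead of surfacing a raw error to the user.
async function suggestGiftsWithFallback(query: string): Promise<GiftSuggestion[]> {
  try {
    return await suggestGiftsViaMcp(query);
  } catch {
    return [
      {
        title: `General gift ideas for "${query}" (catalog temporarily unavailable)`,
        source: 'generic-advice',
      },
    ];
  }
}
```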
9. Mini exercises (for self-practice)
Although this lecture isn’t a workshop, I highly recommend actually doing the following steps in your GiftGenius:
- Describe at least two runbooks in one document:
- “Mass payment (checkout) failures”;
- “MCP doesn’t respond / ChatGPT shows Error talking to app.”
- Create an operational calendar for a month:
- which SLOs you will review every week;
- what cost review you will do at month’s end;
- what security checks you will include (at least basic ones).
This will take a couple of hours, but it will greatly change how you look at your application: it will stop being “just code” and become a living service.
Common mistakes in incident management for ChatGPT Apps
Error #1: “An incident is only when everything is down”
Many people habitually treat only a full MCP or database outage as an incident. In AI apps, “soft” quality incidents often hurt more: the model stopped calling the required tool, the checkout flow became confusing, users don’t reach the end even though HTTP metrics are green. If you don’t treat such situations as incidents and don’t review them, app quality will degrade unnoticed.
Error #2: No clear SLOs and boundaries for “normal operation”
Without formal SLOs, any debate about an incident turns into “I feel everything is slow” vs. “it’s fast on my machine.” That’s why SLOs are the basis of incident management: they make the severity of a problem objective.
Error #3: Improvisation instead of runbooks
A common picture: an alert fires, everyone panics and jumps into prod, someone rolls back a release, someone edits configs, an hour later “seems fixed,” but no one remembers what actually helped. Without runbooks, every incident is mini-chaos, and the team doesn’t learn. Even one simple checkout runbook greatly reduces stress.
Error #4: Ignoring user communications
Sometimes engineers fix the system in silence while users only see a spinner and “something went wrong.” This is especially toxic for commerce scenarios: people worry about money. It’s important to have prewritten message templates in the widget, in the app description, and, if necessary, in external channels to honestly state the problem and the expected time to fix.
Error #5: Blaming “OpenAI is at fault” without examining your part
It’s easy to pin everything on “OpenAI is flaky,” but in practice even with upstream issues you can do a lot on your side: handle timeouts and errors correctly, switch to a mode without MCP, reduce the number of retries so as not to worsen the situation. The concept of shared responsibility means you own your part of the chain even if one of the providers is unstable.
Error #6: No post-mortems and no operational cycle
If an incident ends with “well, looks done, moving on,” and no documents, alerts, or code change—your system is doomed to repeat the same mistakes. Post-mortems and regular reviews of SLOs, costs, and security are not bureaucracy; they’re a way to make agreements with your future self and team so that in a year GiftGenius is more reliable, not more fragile.