1. Why streams are particularly sensitive to load
In previous parts we covered how MCP events work in GiftGenius: job statuses and the job.progress/job.completed notifications, async jobs, and streaming channels (SSE and HTTP streaming). Now it’s important to look at what happens to this architecture under real load.
While you have a single user who occasionally runs a gift selection, everything looks great. But as soon as GiftGenius hits production and hundreds of requests arrive at once to “pick gifts for all employees for the company party,” you suddenly discover that:
- the server has hundreds of long‑lived SSE connections;
- workers are eagerly sending job.progress on every tiny change;
- logs grow by gigabytes per day;
- the user’s UI starts to stutter, even though “the server doesn’t seem to be crashing.”
A classic HTTP request lives for milliseconds or seconds. An SSE or HTTP stream can live minutes or even hours. It holds a connection, memory, file descriptors. Every event sent means JSON serialization, network copying, GC work. If you treat this as “just one more console.log on the backend,” the system will quickly turn into a space heater.
MCP events have another property: they’re often generated multiple times for the same task. A worker that updates progress every 0.1% produces an impressive number of events per job. As a result, you get “noise”: a huge amount of tiny messages that:
- load the network and CPU;
- clog queues and buffers;
- make debugging and log analysis painful.
So treat both streams and MCP events as seriously as database queries or model calls: they’re expensive resources that require normalization, control, and monitoring.
To handle this, keep three big themes in mind:
- Rate limits — restrict how many and how often you can afford to generate and send events/streams.
- Backpressure — respond when the consumer can’t keep up with the producer.
- Monitoring and metrics — measure what’s happening and notice in time when things start to boil over.
2. Rate‑limiting for streams and events
Let’s start with the most straightforward — limits.
It’s important to understand that in streaming scenarios the “dangerous offender” is often not the client, but the server. In typical REST APIs you limit the number of requests to the server so that a single user can’t turn into a source of DDoS. In the world of MCP and streams it’s very easy to set up a reverse DDoS: a worker or an MCP server bombards the client with thousands of events per second.
Which limits you need
Usually you think along three axes.
First, limits per user or session. You can’t allow one user to open twenty parallel GiftGenius master widgets, each with its own SSE stream. A reasonable constraint is a few active streams per session and a limit on the number of jobs in the running status for a single user or tenant.
Second, limits per job. Here we care about event frequency. It’s sufficient to send job.progress no more than once every N milliseconds or only on noticeable change, for example every 5% of progress. You don’t need to send a message for every processed item in the catalog. It also makes sense to limit payload size: a progress event shouldn’t carry megabytes of text.
Third, limits per IP or organization. This protects against abuse when someone runs a script that spams tasks, or when your app suddenly becomes popular. Here, familiar API gateway and proxy mechanisms come into play.
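Per-IP or per-organization limits usually live in the API gateway, but the underlying mechanism fits in a few lines. Below is a hypothetical in-memory token bucket; the names allowRequest, CAPACITY, and REFILL_PER_SEC are illustrative, not part of any MCP API, and production systems would typically keep this state in Redis or in the gateway itself:

```ts
// A minimal in-memory token bucket, keyed by IP or organization id.
type Bucket = { tokens: number; lastRefill: number };

const buckets = new Map<string, Bucket>();
const CAPACITY = 20;      // maximum burst size
const REFILL_PER_SEC = 5; // sustained allowed rate

function allowRequest(key: string, now = Date.now()): boolean {
  const b = buckets.get(key) ?? { tokens: CAPACITY, lastRefill: now };

  // refill tokens proportionally to elapsed time, capped at CAPACITY
  const elapsedSec = (now - b.lastRefill) / 1000;
  b.tokens = Math.min(CAPACITY, b.tokens + elapsedSec * REFILL_PER_SEC);
  b.lastRefill = now;

  if (b.tokens < 1) {
    buckets.set(key, b);
    return false; // over the limit: respond with 429
  }
  b.tokens -= 1;
  buckets.set(key, b);
  return true;
}
```

The same shape works for per-user and per-session limits; only the key changes.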
A simple implementation of an event frequency limit
Consider a GiftGenius worker that selects gifts in the background for a long list of recipients and periodically reports progress via an MCP notification, event/progress. We want to send an event only when at least 500 milliseconds have passed since the last one, or when the percentage has advanced by at least 5 points.
Pseudo TS code for the worker:
```ts
// suppose there is an mcpClient.sendNotification(...)
// note: lastSentPercent/lastSentAt are module-level here for brevity;
// a real worker should keep this state per jobId
let lastSentPercent = 0;
let lastSentAt = 0;

function reportProgress(jobId: string, percent: number, message: string) {
  const now = Date.now();
  const percentDelta = percent - lastSentPercent;
  const timeDelta = now - lastSentAt;

  // send only if progress grew by >= 5 points OR >= 500 ms have passed
  if (percentDelta >= 5 || timeDelta >= 500) {
    mcpClient.sendNotification("event/progress", {
      jobId,
      percent,
      message,
    });
    lastSentPercent = percent;
    lastSentAt = now;
  }
}
```
This approach is called throttling: we “thin out” the event stream by time and by value change.
If you break the process into stages (“Stage 1 of 3”, “Stage 2 of 3”), the logic is even simpler: send events only when the stage changes.
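As a hypothetical sketch of that stage-based logic (emit here stands in for whatever notification call your worker uses, such as the mcpClient.sendNotification from the example above):

```ts
// Report progress only when the stage number actually changes.
function makeStageReporter(
  totalStages: number,
  emit: (stage: number, label: string) => void,
) {
  let lastStage = 0;
  return (stage: number) => {
    if (stage !== lastStage) {
      lastStage = stage;
      emit(stage, `Stage ${stage} of ${totalStages}`);
    }
  };
}
```

No matter how often the worker calls the reporter, the client sees at most one event per stage.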
A limit on the number of simultaneously open streams
On the MCP server side you likely have an HTTP handler for SSE:
```ts
// app/api/events/[userId]/route.ts (Next.js 16 App Router)
// note: in recent Next.js versions params is a Promise and must be awaited
export async function GET(
  req: Request,
  { params }: { params: Promise<{ userId: string }> },
) {
  const { userId } = await params;

  if (!canOpenMoreStreams(userId)) {
    return new Response("Too many streams", { status: 429 });
  }

  const stream = new ReadableStream({
    start(controller) {
      registerSseClient(userId, controller);
    },
    cancel() {
      unregisterSseClient(userId);
    },
  });

  return new Response(stream, {
    headers: { "Content-Type": "text/event-stream" },
  });
}
```
The canOpenMoreStreams function can check the current number of open connections for the user and compare it to a threshold (for example, no more than three parallel streams). If the limit is exceeded, return 429 and, in GPT instructions, explain to the model that in such a situation it’s better not to start another long‑running wizard, but to suggest to the user that “there’s already an active selection; let’s wait for it to finish.”
In small systems you can implement such checks in process memory. In more serious infrastructure this moves into an MCP gateway or a separate rate‑limit service.
3. Backpressure: what to do when the consumer can’t keep up
Rate limits constrain how much we want to produce events. But even with careful limits it’s still possible for the consumer to “choke”: the user has poor mobile internet, the browser tab is stalled, or ChatGPT is heavily loaded at the moment.
Backpressure is the system’s response to the consumer falling behind. Instead of accumulating data endlessly and eventually crashing with an OOM, we consciously:
- slow down;
- aggregate events;
- drop less important ones.
Where pressure arises
A typical scenario for GiftGenius might look like this. The worker writes events to a queue (for example, Redis Streams or just a table in a DB), the MCP server reads them and pushes them into the SSE channel. If the client is slow (3G, old laptop, lots of other tabs), the TCP buffer starts to fill up, the Node process can’t drain the queue fast enough, and it ends up accumulating events in memory. Then you see the familiar:
```
FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
```
You already have backpressure at the network (TCP) level, but it doesn’t know about your domain entities. It just says: “Hey, slow down, the buffer is full.” Our task is to interpret this at the level of MCP events.
Bounded buffering and dropping events
For progress and statuses we have a nice property: not all events are equally valuable. The user cares about the latest percentage, not the history of all intermediate “51%, 52%, 53%, 54%”. That means we can safely drop some events and send only the last one.
Let’s say we have a layer that receives progress events from workers and puts them into a buffer for each jobId:
```ts
type ProgressEvent = { jobId: string; percent: number; message: string };

const progressBuffers = new Map<string, ProgressEvent[]>();
const MAX_BUFFER = 10;

function bufferProgress(event: ProgressEvent) {
  const buffer = progressBuffers.get(event.jobId) ?? [];
  buffer.push(event);
  // keep only the last MAX_BUFFER events; older ones are dropped
  progressBuffers.set(event.jobId, buffer.slice(-MAX_BUFFER));
}
```
A separate timer, say every 500 ms, looks at the buffer and sends only the last event, ignoring the rest:
```ts
setInterval(() => {
  for (const [jobId, buffer] of progressBuffers.entries()) {
    if (!buffer.length) continue;
    const last = buffer[buffer.length - 1];
    sendProgressToClient(last); // SSE/MCP notification
    progressBuffers.set(jobId, []); // clear the buffer for this job
  }
}, 500);
```
This is an example of conflation: combining several updates into one up‑to‑date one. For progress, it’s a golden pattern.
For events like “log” or partial_result the strategy may differ. There, losing events is often unacceptable: log text matters, and a missing JSON chunk can break data structure. In these cases you can:
- aggregate the message (join several log lines into one packet);
- or send a control signal to the worker to “slow down log generation.”
In asynchronous systems the second option is harder, but it’s worth at least considering it.
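The first option, aggregation, can be sketched as a small batching helper. This is an illustrative example, not a specific library API; send stands in for whatever transport delivers the packet:

```ts
// Collect individual log lines and flush them as one packet,
// either when the batch is full or when a timer calls flush().
function makeLogBatcher(
  maxBatch: number,
  send: (lines: string[]) => void,
) {
  let batch: string[] = [];
  return {
    add(line: string) {
      batch.push(line);
      if (batch.length >= maxBatch) this.flush();
    },
    flush() {
      if (batch.length === 0) return;
      send(batch);
      batch = [];
    },
  };
}
```

Unlike the progress conflation above, nothing is lost here: every line eventually goes out, just in fewer, larger messages.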
Limiting queue depth
Backpressure isn’t limited to the buffer of events right before sending. You need to look at all the queues in the system:
- the task queue waiting for a worker;
- the event queue between the worker and the MCP server;
- buffers inside streaming libraries on the server side.
For each queue it’s important to set a reasonable depth limit. If a queue overflows, you either start responding to clients with “the system is overloaded, please try again later,” or you drop less important jobs, or you move some scenarios to “offline mode” (for example, generate a report and send a link later).
Another useful technique is prioritizing event types. Under overload you can start sending only job.completed and job.failed, and lower the priority of job.progress or turn it off entirely.
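That prioritization can be expressed as a tiny filter. A minimal sketch, assuming a boolean overload flag that your monitoring flips:

```ts
type EventType = "job.progress" | "job.completed" | "job.failed";

// Under overload, only terminal events get through; progress is dropped.
function shouldSend(type: EventType, overloaded: boolean): boolean {
  if (!overloaded) return true;
  return type === "job.completed" || type === "job.failed";
}
```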
4. Monitoring streams and events
Without measurements, all this beauty with rate limits and backpressure turns into guesswork. You need to see that the number of streams looks suspiciously high, events are arriving with lag, and clients are dropping in batches.
Streams behave differently from regular HTTP requests: their duration can be minutes and hours, so classic metrics like “requests per second” and “average latency” don’t give the full picture.
Key metrics
For SSE or HTTP/stream it’s useful to track several groups of indicators.
- Connection metrics. How many SSE streams are active right now? How long does a connection live on average? What percentage of streams end with an error or timeout? A sharp spike in active connections indicates a potential traffic storm or a resource leak (clients aren’t closing connections). A sharp drop indicates a mass disconnect (for example, network problems or a critical server bug).
- Event metrics. How many events do you send per second across all streams (EPS — events per second)? What’s the average event size? How many payload deserialization or validation errors do you observe? If you suddenly see event sizes growing — perhaps someone started sending the entire report text in job.progress instead of a short string.
- Job metrics. Distribution by statuses (pending, running, completed, failed, canceled), average execution time by task type, the percentage of jobs going to retry or dead‑letter. This helps reveal problems not only at the network level but also in workers: an external API slowed down, mass errors appeared.
- Backpressure and system metrics. In streaming systems you often watch buffer and queue depths between components, as well as the percentage of time a stream is blocked waiting for the consumer to free space. If your queues are almost always filled to the brim, that’s a clear signal the system is at its limit. It’s also important to track system indicators: CPU and memory on servers handling streaming, and errors/timeouts at the network level. Sometimes the network throughput between the MCP server and ChatGPT becomes the bottleneck.
Together these four groups tell you how many streams are alive right now, how much data you’re pushing, how your jobs behave, and where exactly the system starts to choke.
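As a rough illustration of how these groups map onto code, here is a hypothetical in-process registry with a few counters; in a real project you would use prom-client, an OpenTelemetry SDK, or your platform’s metrics library instead:

```ts
// Illustrative in-process metrics covering streams, events, and jobs.
const metrics = {
  activeStreamCount: 0,
  eventsSent: 0,
  bytesSent: 0,
  jobsByStatus: new Map<string, number>(),
};

function recordEvent(payloadBytes: number) {
  metrics.eventsSent += 1;
  metrics.bytesSent += payloadBytes; // lets you spot growing event sizes
}

function recordJobStatus(status: string) {
  metrics.jobsByStatus.set(status, (metrics.jobsByStatus.get(status) ?? 0) + 1);
}
```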
What to log
Logs are the second pillar of observability. It’s important to log events and connections so you can later reconstruct the history for a specific job.
Typically, logs for each event and stream include:
- jobId and/or eventId;
- userId and sessionId (if you have multi‑tenancy);
- event type (progress, completed, failed, resource.updated);
- channel type (SSE or HTTP/stream);
- send timestamp and, if possible, the timestamp when the worker produced the event.
This lets you compute the lag: the difference between the moment the worker generated the event and the moment it went out to the socket. Growing lag is a good early indicator of backpressure problems.
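A minimal sketch of tracking that lag, assuming the worker stamps each event with its production time: keep a rolling maximum and drain it once per reporting window.

```ts
// Rolling maximum delivery lag within the current reporting window.
let maxLagMs = 0;

function recordLag(producedAtMs: number, sentAtMs: number): number {
  const lag = Math.max(0, sentAtMs - producedAtMs);
  if (lag > maxLagMs) maxLagMs = lag;
  return lag;
}

// Called by the metrics reporter, e.g. once a minute.
function drainMaxLag(): number {
  const value = maxLagMs;
  maxLagMs = 0;
  return value;
}
```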
Be careful not to let logs become a source of overload themselves. For high‑frequency events like job.progress, logging every single event isn’t always reasonable; you can enable sampling — log every Nth event instead of all of them — or aggregate statistics.
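Sampling can be as simple as a counter wrapped around your logger. A hypothetical helper:

```ts
// Log every Nth high-frequency event and silently count the rest.
function makeSampledLogger(n: number, log: (msg: string) => void) {
  let counter = 0;
  return (msg: string) => {
    counter += 1;
    if (counter % n === 0) {
      log(`${msg} (sampled, ${counter} events so far)`);
    }
  };
}
```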
In code, that can look like a simple helper:
```ts
function logEvent(event: {
  type: string;
  jobId: string;
  userId?: string;
  channel: "sse" | "http-stream";
  payload: unknown;
}) {
  console.info({
    ...event,
    timestamp: new Date().toISOString(),
  });
}
```
In a real project you wrap this with a structured logging library, but the idea is the same: as much useful context as possible in each entry.
5. Alerts and degradation policies
Once you have metrics and logs, the next step is to configure alerts and think through how the system should “degrade” when it’s under strain. The idea is that it’s better to honestly work worse than to suddenly crash.
Alert examples
For GiftGenius it makes sense to watch a few typical situations.
First, an anomalous number of active streams. If you usually have dozens of active SSE connections and suddenly there are thousands, you should find out what’s going on. Maybe you became popular, or maybe you have a bug and connections aren’t closing.
Second, the delay between a job actually finishing and the client receiving job.completed. If this delay starts to exceed a threshold (say, 5–10 seconds), somewhere between the worker and the client events are accumulating or connections are stalling.
Third, a high share of job.failed or job.canceled compared to successful ones. The cause may be in the worker (a broken external API, a new bug) or in increased user sensitivity to delays (they start canceling tasks more often).
Finally, an elevated level of connection errors and stream breaks: if the number of abnormal disconnects is growing, there may be network or client‑side problems, and you should consider fallback scenarios.
Degradation patterns
When the system is overloaded, you can switch on a “resource‑saving mode.” This is better than just starting to answer 500 to everything.
The most common pattern is adaptive event frequency. If you see the event rate (events per second) shoot up ten times higher than usual and lag starts to grow in queues, reduce progress event frequency. If it was every 1%, make it every 10%. If it was every 500 ms, make it once every 2–3 seconds. The user will do fine without ultra‑precise progress, but not with a completely frozen UI.
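One possible shape of that adaptation, with illustrative thresholds (the multipliers and intervals below are arbitrary examples, not a recommendation):

```ts
// Choose the progress-report interval from the observed events-per-second
// rate relative to the normal baseline for this system.
function progressIntervalMs(currentEps: number, normalEps: number): number {
  if (currentEps > normalEps * 10) return 3000; // heavy overload: 2-3 s
  if (currentEps > normalEps * 3) return 1000;  // elevated load
  return 500;                                    // normal operation
}
```

The worker re-reads this interval between reports, so the system slows its own event stream before queues overflow.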
For less important events — for example, resource.updated during background product feed updates — you can temporarily disable sending altogether while the system is under load.
Another technique is to switch some scenarios from streams to periodic polling. If SSE channels are collapsing, the MCP server can send a system event like system.overloaded to the widget, and the widget can switch to the strategy “every N seconds I poll a REST endpoint for the job status.”
6. A small practical fragment for GiftGenius
To tie it all together, let’s imagine we already have:
- an MCP tool startGiftSearch that creates a job and returns a jobId;
- a worker that performs the search and sends event/progress and event/completed;
- an SSE endpoint /api/events/[userId] that a Next.js widget connects to.
Let’s add a simple layer of protection against an “event storm” and minimal monitoring.
Progress limiting by step and time
In the worker we add throttling and conflation as discussed above. Now events are sent no more often than every half second and only when the change is at least 5%.
Tracking active streams
In the SSE endpoint we keep a per‑user counter:
```ts
const activeStreams = new Map<string, number>();
const STREAM_LIMIT = 3;

function canOpenMoreStreams(userId: string) {
  const current = activeStreams.get(userId) ?? 0;
  return current < STREAM_LIMIT;
}

function registerSseClient(
  userId: string,
  controller: ReadableStreamDefaultController,
) {
  const current = activeStreams.get(userId) ?? 0;
  activeStreams.set(userId, current + 1);
  // store the controller in some structure here,
  // so you can write events into this stream later
}

function unregisterSseClient(userId: string) {
  const current = activeStreams.get(userId) ?? 1;
  activeStreams.set(userId, Math.max(0, current - 1));
}
```
The server can additionally export stream metrics, such as the total number of open streams, to Prometheus/Grafana or any other monitoring system.
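For example, a hand-rolled Prometheus-style exposition for the stream counter might look like this (the metric name sse_active_streams is illustrative). Note that the total is the sum of the per-user counters; activeStreams.size would only tell you how many users have an entry in the map:

```ts
// Render a gauge in the Prometheus text exposition format.
function renderStreamMetrics(activeStreams: Map<string, number>): string {
  let total = 0;
  for (const count of activeStreams.values()) total += count;
  return [
    "# HELP sse_active_streams Currently open SSE streams",
    "# TYPE sse_active_streams gauge",
    `sse_active_streams ${total}`,
  ].join("\n");
}
```

In practice you would serve this from a /metrics endpoint or, better, use prom-client, which handles registration and formatting for you.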
A very simple event‑rate metric
To begin with, you can at least roughly count how many events you send:
```ts
let eventsSentLastMinute = 0;

function sendProgressToClient(ev: ProgressEvent) {
  // ... serialization and write to the SSE stream
  eventsSentLastMinute++;
}

setInterval(() => {
  console.info({
    metric: "events_per_minute",
    value: eventsSentLastMinute,
    timestamp: new Date().toISOString(),
  });
  eventsSentLastMinute = 0;
}, 60_000);
```
Over time you can replace this with proper counters and alerts, but as a starting point — it’s already useful.
Putting it all together: limits, backpressure, metrics/alerts, and a reasonable UX fallback turn your GiftGenius from a “demo for demos” into a system that weathers real traffic storms. In the next modules, where we’ll talk about gateways, production architecture, and full observability, these patterns will come in handy again.
7. Typical mistakes when working with streams, rate limits, and monitoring
Mistake #1: no limits on the number of streams and event frequency.
Developers added SSE “because it looks cool,” workers diligently send progress for every processed object, and everything seems to work in the demo. But at the first spike of real users the server starts spending most resources serializing and sending thousands of tiny events, and the ChatGPT UI turns into a slide show.
Mistake #2: attempting to buffer “everything at once” without limits.
An unbounded array of “unsent events” appears in the code and grows until the client recovers. Spoiler: it doesn’t recover; the server dies first. Any buffer must have a hard maximum, and the overflow handling logic must be explicit.
Mistake #3: treating all event types the same.
Progress can be aggregated and dropped (the last percentage matters more than the history of movement). You can’t do that with logs and partial results — losing one chunk may mean corrupted data. When you design the system, group events by importance in advance and design a strategy for each group under overload.
Mistake #4: lack of observability.
No metrics on active streams, no accounting of event rate, and logs that only say “something went wrong.” In this situation you learn about issues only from user feedback and CPU load graphs. Setting up at least basic metrics and logs by jobId and eventId is not a luxury, but a necessity.
Mistake #5: rigid UX that doesn’t account for degradation.
The widget and GPT instructions assume the stream is always available, progress updates “in real time,” and partial results arrive strictly by scenario. At the first sign of network issues the user sees a “stuck” progress bar and no explanation. It’s much better to build an honest fallback into the UX: “Live updates are having issues right now; I’ll keep working and let you know when I’m done” — and switch to less frequent updates or polling.
Mistake #6: trusting that “our users won’t create many simultaneous tasks.”
In practice, if you don’t limit the number of parallel jobs and streams, someone will definitely open five tabs, start a “maxed out” gift search in each, and go grab coffee. The “maybe it’ll be fine” approach in production almost always ends with getting acquainted with monitoring amid the thunder of alerts.