
Cost control and cost instrumentation

ChatGPT Apps
Level 19, Lesson 0

1. Why “works” ≠ “pays off”

LLM applications have an important trait: on top of fixed hosting costs, they incur variable costs for every request that involves a model call.

It’s important to distinguish two worlds:

  • when the model runs on the ChatGPT side (the user interacts with your App in ChatGPT, and it calls mcp-tools) — the user pays for tokens via their ChatGPT subscription;
  • when your backend/MCP server calls the OpenAI API or other LLM services itself — you pay for those tokens.

It’s in the second case that you get classic variable LLM costs that depend on the number and “weight” (tokens_in/tokens_out) of requests.

A classic scenario:

  1. You happily ship GiftGenius to prod, everything is blazing fast, users are delighted.
  2. A month later the bill for OpenAI + cloud + Stripe fees arrives, and it suddenly turns out that “successful growth” actually means “we’re paying more per gift than we earn from the sale.”

The FinOps approach says: cost is just another metric, like latency or error rate. You should log it, aggregate it, and make decisions based on it — not “guess in Excel.”

The goal of this lecture is to get you to the point where you can answer questions like:

  • “How much did this specific gift selection for user user42 cost?”
  • “How much money did the suggest_gifts tool burn this week, and how many orders did it bring in?”

And to make sure the answers come not from thin air but from logs and metrics.

2. Cost structure of a ChatGPT App

Let’s start with a cost map. Without it, everything else is just a chaotic collection of numbers.

LLM costs (variable)

This is everything related to model calls from your backend:

  • Calls to OpenAI models from the MCP server or agents: GPT-5.1 / GPT-5-mini / embeddings / rerank / vision / TTS/STT, etc.
  • Additional models: reranking for search, embeddings for recommendations, image generation.

One subtle point to remember: when you build the interface via the Apps SDK and use only the built-in ChatGPT model, you don’t pay for tokens — the user pays (through their ChatGPT subscription). But as soon as your MCP server starts calling the OpenAI API itself (Agents, Responses API, embeddings, etc.), tokens are billed to your account.

The basic idea: the cost of such calls is proportional to tokens_in and tokens_out, multiplied by the price per token.

An MCP tool invocation by itself costs the developer nothing in tokens; costs appear only where its handler calls the OpenAI API or another LLM.

Infrastructure

This is all the hardware and services around it:

  • MCP servers: Vercel / AWS / GCP / bare metal.
  • Agents (if they run as separate services).
  • Databases: Postgres/MySQL, vector databases, S3/object storage.
  • Caches: Redis/KeyDB.
  • Queues and workers: e.g., for background generation, feed recomputation, etc.

These costs are more often monthly fixed (or stepwise fixed), so they’re usually computed from aggregated cloud provider spend rather than per request.

Payments and external services

GiftGenius uses ACP/Stripe, which means you get:

  • Fees for every successful payment (Stripe on the order of a few percent + a fixed component).
  • Losses from fraud and chargebacks.
  • Cost of external APIs: email/SMS/push notifications, additional analytics, etc.

At the start it’s pennies, but at scale you begin to feel it, so it’s helpful to separate them at least in logs and reports.

A small reminder table

| Category | Examples | How to roughly compute |
| --- | --- | --- |
| LLM | GPT‑5.1, GPT‑5‑mini, embeddings, rerank | tokens_in/out × price_per_token |
| Infrastructure | MCP, Agents, DB, Redis, queues, CDN | Divide the provider bill by traffic/period |
| Payments and services | Stripe, email API, SMS, analytics | Number of events × rate/fee |

Our goal: tie these categories to specific events in the system (tool calls, workflows, checkout), rather than looking only at final monthly totals.

3. Where to capture usage data: three layers

To compute cost not “once a month” but in real time, you need to build instrumentation into the code. There are only three places.

MCP server: each tool invocation

The MCP server is the natural point through which ChatGPT calls your tools. Here we can:

  • Capture the start/end of the call.
  • Measure duration_ms (or latency_ms).
  • Collect tokens from the OpenAI response (if the MCP invokes our model) or at least estimate them.
  • Set user_id, tenant_id, request_id/trace_id to join logs.

Schematically, a tool_invocation log event for GiftGenius looks like this:

{
  "timestamp": "2025-11-20T12:34:56Z",
  "level": "info",
  "event": "tool_invocation",
  "request_id": "abc123",
  "user_id": "user42",
  "service": "mcp-giftgenius",
  "tool_name": "suggest_gifts",
  "tokens_in": 120,
  "tokens_out": 350,
  "cost_estimate_usd": 0.045,
  "latency_ms": 320
}

Now the same as a TypeScript type and a bit of code.

// types/telemetry.ts
export interface ToolInvocationLog {
  event: 'tool_invocation';
  requestId: string;
  userId?: string;
  toolName: string;
  tokensIn?: number;
  tokensOut?: number;
  costEstimateUsd?: number;
  latencyMs: number;
}
// mcp/logger.ts
export function logToolInvocation(payload: ToolInvocationLog) {
  console.log(JSON.stringify({
    timestamp: new Date().toISOString(),
    level: 'info',
    ...payload,
  }));
}

Now a wrapper around the MCP tool handler (say, suggest_gifts).

// mcp/tools/suggestGifts.ts
import { logToolInvocation } from '../logger';
import { estimateCost } from '../cost';

export async function handleSuggestGifts(ctx: Context, input: Input) {
  const started = Date.now();

  const llmResult = await callGiftModel(input); // call OpenAI here

  const duration = Date.now() - started;
  const { prompt_tokens, completion_tokens } = llmResult.usage ?? {};
  const costEstimate = estimateCost(prompt_tokens, completion_tokens);

  logToolInvocation({
    event: 'tool_invocation',
    requestId: ctx.requestId,
    userId: ctx.userId,
    toolName: 'suggest_gifts',
    tokensIn: prompt_tokens,
    tokensOut: completion_tokens,
    costEstimateUsd: costEstimate,
    latencyMs: duration,
  });

  return llmResult.output;
}

Even if you estimate tokens “by eye” via text length, it’s already better than nothing.

Agent level (Agents SDK): workflow steps

If you use the Agents SDK, an agent can call multiple tools in a row. Here it’s important to log the step context: what task the agent is trying to solve.

For example, on each tool invocation by the agent runner you can add fields workflow_name and step_name: “idea search,” “filter by budget,” “prepare checkout.”

This will let you build reports not only by tools but also by scenario steps: perhaps 80% of the cost goes into some useless “extra clarifying step.”

Example of a small “hook” around the agent:

// agents/logStep.ts
export function logAgentStep(data: {
  requestId: string;
  workflow: string;
  step: string;
  toolName: string;
}) {
  console.log(JSON.stringify({
    timestamp: new Date().toISOString(),
    level: 'info',
    event: 'agent_step',
    ...data,
  }));
}

And use it from the runner:

// agents/giftAgent.ts
logAgentStep({
  requestId: run.requestId,
  workflow: 'gift_selection',
  step: 'rank_candidates',
  toolName: 'rerank_gifts',
});

Commerce: checkout and money

In the commerce layer we care about events:

  • checkout_started — purchase started.
  • checkout_success — payment succeeded.
  • checkout_failed — error with code/type.

And we need to attach:

  • amount, currency.
  • request_id of the same session as tool_invocation.

Then we can answer: “This purchase cost us N cents in LLM spend and brought M dollars in revenue.”

Example of a simple checkout event handler:

// api/commerce/logCheckout.ts
export function logCheckoutEvent(e: {
  type: 'checkout_started' | 'checkout_success' | 'checkout_failed';
  requestId: string;
  userId?: string;
  amountCents?: number;
  currency?: string;
  errorCode?: string;
}) {
  console.log(JSON.stringify({
    timestamp: new Date().toISOString(),
    level: 'info',
    service: 'commerce',
    ...e,
  }));
}

4. Structured logs for cost (link to M17)

Key point: no “freeform” text logs like console.log("Tool suggest_gifts used 123 tokens"). Everything in JSON.

In module 17 we already agreed to log requests as JSON with base fields like request_id, user_id, tool_name, etc. Now on top of that we’ll add cost fields.

Fields that must be present in logs related to costs:

  • timestamp, level.
  • event (tool_invocation, agent_step, checkout_success, etc.).
  • request_id, trace_id — to connect the chain of events for one workflow.
  • user_id, tenant_id — to aggregate by users/companies later.
  • tool_name / service.
  • tokens_in, tokens_out, cost_estimate_usd.
  • latency_ms, success/error_code.

In the examples we’ll name the cost field cost_estimate_usd (cost in US dollars) and stick to that name in code and dashboards.

This exact structure allows you to:

  • Build aggregates: mean cost_estimate_usd by tool_name, by user_id, by workflow.
  • Correlate “expensive” requests with increased latency or errors and decide what to optimize first.

If in M17 you already implemented a basic logger.info({...}), adding cost fields is not a new framework, just a couple of extra properties on the object.

5. How to roughly compute LLM cost in code

The formulas here aren’t scary at all. We only need the order of magnitude, not something that matches billing to the last cent.

Take usage from the OpenAI response

When your MCP server calls the OpenAI Responses API, it usually receives a usage object:

{
  "usage": {
    "prompt_tokens": 120,
    "completion_tokens": 350,
    "total_tokens": 470
  }
}

It’s convenient to compute cost from it. Different models have different prices per 1M input/output tokens.

Simplest estimator function in TypeScript:

// mcp/cost.ts
type Usage = { prompt_tokens?: number; completion_tokens?: number };

const PRICING = {
  inputPerMillion: 2.5,   // dollars per 1M input tokens, example
  outputPerMillion: 10.0, // and for output tokens
};

export function estimateCost(
  promptTokens?: number,
  completionTokens?: number,
): number {
  const inTokens = promptTokens ?? 0;
  const outTokens = completionTokens ?? 0;

  const inputCost = (inTokens / 1_000_000) * PRICING.inputPerMillion;
  const outputCost = (outTokens / 1_000_000) * PRICING.outputPerMillion;
  return Number((inputCost + outputCost).toFixed(6)); // round a bit
}

Prices here are examples; you’ll take the real ones from the current OpenAI pricing and put them in config. What matters is that this function is called on every tool invocation, and the result goes into the cost_estimate_usd field in the log.

If usage is unavailable

Sometimes you use a third‑party LLM that doesn’t send usage, or you need preliminary control before the real call. Then you can:

  • Estimate tokens using a library like tiktoken or its analogue for the target model.
  • Use averages from historical logs (median_tokens_in/median_tokens_out for the tool) and multiply by price.

Stub code to estimate length:

// mcp/costEstimateFallback.ts
export function roughTokenEstimate(text: string): number {
  // Rough estimate: 1 token ≈ 4 Latin characters
  return Math.ceil(text.length / 4);
}

It’s not rocket science, but it lets you, for example, refuse to send a 200,000-token prompt before it ever reaches the paid call.
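The historical-medians fallback mentioned above can be sketched roughly like this. The `TOOL_STATS` values and the module name are illustrative assumptions; in practice the medians would come from your own log aggregates:

```typescript
// mcp/costFallback.ts
// Fallback estimate when the provider returns no `usage` object:
// use per-tool median token counts taken from historical logs.
// The numbers below are made-up placeholders.

type HistoricalStats = { medianTokensIn: number; medianTokensOut: number };

const TOOL_STATS: Record<string, HistoricalStats> = {
  suggest_gifts: { medianTokensIn: 150, medianTokensOut: 400 },
};

export function fallbackCostEstimate(
  toolName: string,
  pricePerMillionIn: number,
  pricePerMillionOut: number,
): number | undefined {
  const stats = TOOL_STATS[toolName];
  if (!stats) return undefined; // no history -> no estimate

  const inputCost = (stats.medianTokensIn / 1_000_000) * pricePerMillionIn;
  const outputCost = (stats.medianTokensOut / 1_000_000) * pricePerMillionOut;
  return Number((inputCost + outputCost).toFixed(6));
}
```

Returning `undefined` instead of `0` when there is no history keeps "unknown cost" distinguishable from "free" in downstream aggregates.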

6. Key cost metrics

Collected logs are raw material. Now let’s see which aggregates are vital.

cost_per_tool_call

What it is: average cost of one invocation of a specific tool.

Why:

  • It shows which tools are particularly expensive.
  • You can look for “expensive and useless”: high avg_cost_per_call and low conversion to scenario success.

How to compute from logs:

  • Take logs with event = "tool_invocation" for a period.
  • Group by tool_name.
  • For each, compute avg(cost_estimate_usd) and optionally p95 (95th percentile of cost).
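The grouping above can be sketched as a small in-memory aggregation over parsed log events. This is illustrative only; in production you would run the equivalent query in your log store or BI tool:

```typescript
// analytics/costPerTool.ts
// Sketch: average cost_estimate_usd per tool_name from tool_invocation events.

interface ToolEvent {
  tool_name: string;
  cost_estimate_usd?: number;
}

export function avgCostPerTool(events: ToolEvent[]): Record<string, number> {
  const sums: Record<string, { total: number; count: number }> = {};
  for (const e of events) {
    let bucket = sums[e.tool_name];
    if (!bucket) {
      bucket = { total: 0, count: 0 };
      sums[e.tool_name] = bucket;
    }
    bucket.total += e.cost_estimate_usd ?? 0; // missing cost counts as 0
    bucket.count += 1;
  }
  const result: Record<string, number> = {};
  for (const [tool, { total, count }] of Object.entries(sums)) {
    result[tool] = total / count;
  }
  return result;
}
```

The same shape of loop, grouped by `user_id` or `tenant_id` instead of `tool_name`, gives the per-user and per-tenant aggregates discussed below.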

cost_per_successful_task (or cost_per_workflow)

Task/workflow is a completed user‑level scenario:

  • In GiftGenius this could be “gift selection + showing cards + the user saved N ideas,” or “selection → checkout → successful purchase.”

What we do:

  • On workflow completion, write a workflow_completed event with request_id, workflow_name, and a success flag.
  • Via request_id, “pull in” all tool_invocation events of that workflow and sum their cost_estimate_usd.

This gives “how much a successful task costs” — the key to understanding the unit cost of the scenario.
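Joining events on request_id and summing cost can be sketched in a few lines (field names follow the log format above; the in-memory filter is a stand-in for a real log-store query):

```typescript
// analytics/workflowCost.ts
// Sketch: total LLM cost of one workflow, joined on request_id.

interface CostEvent {
  request_id: string;
  cost_estimate_usd?: number;
}

export function workflowCost(events: CostEvent[], requestId: string): number {
  return events
    .filter((e) => e.request_id === requestId)
    .reduce((sum, e) => sum + (e.cost_estimate_usd ?? 0), 0);
}
```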

cost_per_user / cost_per_tenant

For B2B scenarios the question is often: “How much does one user/one team cost us per month?”

Compute:

  • Group tool_invocation and other cost events by user_id or tenant_id.
  • Sum cost_estimate_usd over the period (day, month).

Then compare with the subscription price. If cost_per_user is approaching the plan price, it’s time either to raise the price or to optimize usage (we’ll talk about this in the next lecture on pricing and “cost ↔ quality” experiments).

7. Example: tool_invocation format and a dashboard for GiftGenius

Now we’ll do what was in the plan exercise: design a log event and a minimal dashboard for tools.

tool_invocation event format for GiftGenius

Earlier we looked at a minimal log for an MCP tool. Now let’s design a more detailed tool_invocation event that you can use in production and dashboards: same idea, we’ve just added fields for services, errors, and linkage to models.

First — a TypeScript type:

// telemetry/events.ts
export interface ToolInvocationEvent {
  timestamp: string;
  level: 'info' | 'error';
  event: 'tool_invocation';
  service: 'mcp-giftgenius';
  requestId: string;
  traceId?: string;
  userId?: string;
  tenantId?: string;
  toolName: string;
  modelId?: string;
  tokensIn?: number;
  tokensOut?: number;
  costEstimateUsd?: number;
  latencyMs: number;
  success: boolean;
  errorCode?: string;
}

And a convenient helper:

// telemetry/emitToolInvocation.ts
export function emitToolInvocation(e: ToolInvocationEvent) {
  console.log(JSON.stringify(e));
  // In real life: send to Logtail/Datadog/ELK, etc.
}

For each tool (e.g., suggest_gifts, rerank_gifts, fetch_catalog) we add a call to emitToolInvocation at the end of the handler (or in a finally block so the log exists even on error).

The simplest tool dashboard

A minimal table for a dashboard (e.g., in Metabase / Grafana / any BI):

Column Description
tool_name
Tool name (suggest_gifts, checkout_create_session, …)
% of traffic
Share of all tool_invocation events that fell on this tool
avg_cost_per_call
Average cost of a single call (from cost_estimate_usd)
error_rate
Percentage of events with success = false
avg_latency_ms
Average latency
avg_revenue_per_call
Average revenue associated with this tool (if available)

Visually, this usually looks like: a table on top and a couple of charts below:

  • Bar chart: tool_name on the X axis, avg_cost_per_call on the Y axis.
  • Scatter plot: X = avg_cost_per_call, Y = error_rate or conversion_to_checkout.

These charts help quickly find optimization candidates: expensive, slow, and no conversion — start there.

Linking cost to revenue is helped by the fact that we log checkout_* together with request_id. Then we can compute avg_revenue_per_call as total revenue divided by the number of tool calls in scenarios where a checkout_success occurred.
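That join can be sketched as follows. One definitional choice (made explicit in the comments) is that the denominator counts all calls of the tool, so tools that rarely lead to a checkout get a low revenue per call:

```typescript
// analytics/revenuePerCall.ts
// Sketch: avg_revenue_per_call for one tool, joining tool_invocation
// and checkout_success events on request_id.

interface ToolCall { request_id: string; tool_name: string }
interface CheckoutSuccess { request_id: string; amount_cents: number }

export function avgRevenuePerCall(
  toolCalls: ToolCall[],
  successes: CheckoutSuccess[],
  toolName: string,
): number {
  // Sessions in which this tool was used at least once.
  const requestsWithTool = new Set(
    toolCalls.filter((t) => t.tool_name === toolName).map((t) => t.request_id),
  );
  // Revenue only from those sessions that converted.
  const revenueCents = successes
    .filter((c) => requestsWithTool.has(c.request_id))
    .reduce((s, c) => s + c.amount_cents, 0);
  // Denominator: every call of the tool, converted or not.
  const callCount = toolCalls.filter((t) => t.tool_name === toolName).length;
  return callCount === 0 ? 0 : revenueCents / callCount / 100; // USD per call
}
```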

8. Accounting for infrastructure costs (without zealotry)

LLM costs are nice: each call has tokens, and you can compute cost right in the log. Infrastructure isn’t that straightforward: you have a monthly bill for Vercel, databases, Redis, etc.

At the start you can take a simple approach:

  1. Take the total infrastructure bill for the month (say, $200).
  2. Divide it by the number of workflows for the month (workflow_completed) — you’ll get an approximate infra_cost_per_task.
  3. Or divide by the number of active users — infra_cost_per_user.

Then you add these numbers to the LLM cost (which we computed in detail from logs) — and you get an approximate full unit cost of a scenario or user.

When the app grows, you can get more granular (allocate spend across services and tools), but for early versions this is more than enough to avoid flying blind.
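The simple allocation above, combined with the per-workflow LLM cost, fits in one small function (a sketch; the function name is ours, not from any library):

```typescript
// analytics/unitCost.ts
// Sketch: approximate full unit cost of one workflow =
// measured LLM cost + an even share of the monthly infrastructure bill.

export function fullUnitCost(
  llmCostUsd: number,        // summed cost_estimate_usd for the workflow
  monthlyInfraUsd: number,   // total cloud/DB/Redis bill for the month
  workflowsPerMonth: number, // count of workflow_completed events
): number {
  const infraShare =
    workflowsPerMonth > 0 ? monthlyInfraUsd / workflowsPerMonth : 0;
  return llmCostUsd + infraShare;
}
```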

9. A small end‑to‑end example for GiftGenius

Let’s put it all together in a mini‑story.

The user describes the gift recipient; ChatGPT suggests enabling GiftGenius. Then:

  1. The widget starts the workflow "gift_selection".
  2. Your backend decides to use an LLM agent to choose gifts more intelligently.
  3. The agent performs three steps:
     • analyze_recipient (analyze the description with an LLM).
     • suggest_gifts (our MCP tool).
     • rerank_gifts (an additional model to improve the list).
  4. The user sees the gift cards and saves several ideas.
  5. Clicks “Buy,” ACP is launched and checkout_create_session runs.
  6. A successful checkout_success arrives with amount 79.00 USD.

What we have in the logs:

  • Three tool_invocation events (each with its own tokens_in/tokens_out, cost_estimate_usd, latency_ms).
  • Several agent_step events with workflow = "gift_selection", step_name.
  • checkout_started and checkout_success with amount=7900, currency="USD".

By request_id we connect all this and can say:

  • LLM cost of the scenario: sum of cost_estimate_usd across the three tools, say $0.19.
  • Infrastructure share (from aggregates): about $0.03 per workflow.
  • Total: $0.22 unit cost.
  • Revenue from the transaction: $79 minus the Stripe fee and other costs.

This is already concrete unit economics, not “it feels like GPT‑4 is expensive.”

10. Typical mistakes with cost instrumentation

Mistake #1: only looking at the monthly bill and having no granularity.
It’s very tempting to look only at the total bill from OpenAI/the cloud. But without linking to tool_name, user_id, workflow you don’t know where exactly the money is being spent. In the end, optimization becomes “blindly downgrading the model” instead of targeted improvements to expensive scenarios.

Mistake #2: writing cost data to unstructured text logs.
Lines like "Tool suggest_gifts used 123 tokens" can’t be aggregated and filtered well. At some point you’ll realize you need to migrate to JSON, and that move will be painful. From the start, write structured logs with fields like request_id, tool_name, tokens_in/tokens_out, cost_estimate_usd.

Mistake #3: ignoring the link between cost and commerce events.
Logging checkout_success without request_id and linkage to tool invocations means voluntarily giving up understanding which scenarios make money and which only burn tokens. Don’t be lazy — propagate request_id all the way from the widget to ACP.

Mistake #4: trying to build “perfect” billing instead of a practical estimate.
Some teams get bogged down attempting to perfectly reproduce OpenAI billing to the last token. In reality you only need the order of magnitude: whether a scenario costs $0.02 or $0.021 isn’t critical. What matters is that it’s not $2. Don’t be afraid to use approximate estimates via usage or even rough heuristics.

Mistake #5: looking only at cost and forgetting about quality.
Sometimes, when you see pretty savings figures, you want to switch everywhere to the cheapest model. That’s how you can “optimize” the app into a state where users stop using it. Cost must be considered together with answer quality and conversion — this link will be the topic of the next lecture in this module — about pricing and “cost ↔ quality” experiments.
