How margin actually works in the three-layer agent-to-tool-to-infrastructure stack. Where margin gets compressed, the four economic levers that matter (batching, caching, tier selection, quality gating), and a worked P&L example at three revenue scales.
Traditional SaaS economics assume a fairly simple value chain: the buyer pays the software vendor, the vendor pays the cloud, and margin falls out of the difference. Tool-calling economics don't work that way. There are at least three distinct parties in every paid tool call — the human operator who pays for agent usage, the agent itself (or its operator) which pays the tool, and the tool which pays for underlying compute and upstream APIs — and margin gets compressed at each handoff. Understanding how that compression works is the difference between a tool that's a healthy business and a tool that looks healthy on revenue but bleeds cash on COGS.
This lesson walks through the three-layer structure, the specific places margin gets squeezed, the four main economic levers tool operators can actually pull (batching, caching, tier selection, quality gating), and what an example P&L looks like for a mid-complexity MCP tool at three scales. It assumes you've already read lesson 1 on pricing for the basics of cost floors; here we go deeper into where the economics break and how to defend against that.
Every paid agent tool call flows through three economic layers. Each layer has its own cost, its own price, and its own expected margin. Getting the layer stack wrong is the most common cause of surprise unprofitability.
The human operator pays for agent access. This can be a subscription to a hosted agent platform (think ChatGPT Pro, Claude Pro, Cursor, or a custom enterprise agent), pay-as-you-go API usage on an LLM provider, or a blend. From the tool operator's perspective, this layer is mostly invisible — you don't see the human's payment, only the agent's downstream behavior.
What you do see indirectly is the human's budget discipline. An agent platform with a tight monthly budget will produce more cost-conscious tool-calling behavior than one with uncapped usage. This is why enterprise agent callers often use tools more aggressively than consumer ones — the enterprise operator has already paid for agent time and wants to maximize the return on it.
This is where your tool revenue comes from. The agent calls your tool, pays your per-call price, and receives the tool's output. Your price has to cover your Layer 3 costs plus your own margin. Your margin on this layer is the direct, top-line number you care about — but it's also the number most easily overestimated, because tool developers often underestimate Layer 3 costs.
This is where your costs come from. If your tool wraps an LLM, Layer 3 includes inference costs on Anthropic or OpenAI or open-weights providers. If your tool makes upstream paid API calls (financial data, enrichment services, map providers), those fees are in Layer 3. If your tool runs its own compute (browser automation, database queries, vector search), your cloud bill is in Layer 3. If you take card payments, Stripe's 2.9% + 30¢ is in Layer 3.
Layer 3 is also where most margin mistakes hide. The cost items are each small enough to feel negligible, but they stack. A tool that's 70% margin on LLM inference alone can easily drop to 20% once you add infrastructure + payment processing + overhead allocation.
Margin doesn't just "exist" at each layer — it gets compressed by the adjacent layers' decisions. Four specific compression forces to watch.
If your Layer 3 cost is dominated by an upstream provider (an LLM API, a paid data feed), that provider can reprice without consulting you. LLM pricing has generally moved down over the past 18 months, which is good for tool operators. But the trend isn't guaranteed: if you price your tool tightly against today's inference rate and the provider later raises prices to cover its own scaling costs, your margin compresses immediately.
Mitigation: benchmark your pricing against a model cost that's 1.5-2× your current cost. That gives you headroom if upstream prices shift against you before you can react. Don't price your tool at exactly your current cost + target margin — price for your cost + target margin + a buffer for upstream volatility.
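To make that buffer concrete, here is a minimal sketch in Python; the per-call cost, target margin, and the 1.75× stress factor are illustrative assumptions, not recommendations:

```python
# Sketch: pricing against a stressed upstream cost rather than today's cost.
# All numbers are illustrative assumptions, not real provider rates.

current_inference_cost = 0.018   # $/call at today's upstream pricing
target_margin = 0.60             # fraction of price you want to keep
upstream_stress = 1.75           # assume upstream could reprice 1.5-2x against you

# Naive price: covers today's cost plus target margin.
naive_price = current_inference_cost / (1 - target_margin)

# Buffered price: covers the stressed cost plus target margin, so a
# provider repricing doesn't immediately push you below target.
buffered_price = (current_inference_cost * upstream_stress) / (1 - target_margin)

print(f"naive price:    ${naive_price:.3f}/call")
print(f"buffered price: ${buffered_price:.3f}/call")

# If upstream actually repriced to the stressed level, the naive price's margin:
margin_after_shock = 1 - (current_inference_cost * upstream_stress) / naive_price
print(f"naive price margin after a {upstream_stress}x upstream shock: {margin_after_shock:.0%}")
```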
Early adopters tolerate lower-quality outputs and higher prices because they're excited about the capability. As a category matures, callers expect the median quality to rise and the median price to fall. A tool that launched at $0.25/call with acceptable quality in 2025 may find itself competing against callers offering similar quality at $0.05 two years later. Your revenue per call can drop without your costs dropping correspondingly, squeezing margin.
Mitigation: keep your quality investment ahead of your pricing. If your tool is 2× better than competitors, you can price at or above the category median without losing callers; if it's merely on par, the market will drag you to the median price.
Fees stack subtly. A 5-cent per-call tool paying 2.9% + 30¢ to Stripe direct would lose 600% of revenue to the 30-cent fee on every call. Per-call tool operators work around this with pre-funded balances and batched settlement — we covered this in lesson 1 — but the underlying lesson is that payment processing fees are a real Layer 3 cost, not an abstraction to ignore. Budget them explicitly.
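To see how badly the fixed fee stacks on micro-priced calls, and what a pre-funded balance changes, a quick sketch; the 2.9% + 30¢ rate is the card fee cited above, while the $0.05 price and 1,000-call top-up are made-up assumptions:

```python
# Sketch: why a 30-cent fixed fee dominates micro-priced calls,
# and what batched settlement of a pre-funded balance changes.

call_price = 0.05          # $ per call (illustrative)
pct_fee, fixed_fee = 0.029, 0.30

# Charging the card on every call: the fixed fee alone is 6x revenue.
per_call_fee = call_price * pct_fee + fixed_fee
print(f"per-call settlement: fee = ${per_call_fee:.3f} on ${call_price:.2f} revenue "
      f"({per_call_fee / call_price:.0%} of revenue)")

# Pre-funded balance: one card charge covers many calls, so the fixed
# fee amortizes across the whole batch.
calls_per_topup = 1000
topup_amount = call_price * calls_per_topup
batch_fee = topup_amount * pct_fee + fixed_fee
print(f"batched settlement:  fee = ${batch_fee:.2f} on ${topup_amount:.2f} revenue "
      f"({batch_fee / topup_amount:.1%} of revenue)")
```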
Tool calls that fail, time out, or produce unsatisfactory outputs sometimes get refunded. Refunded calls carry their cost (you already ran the compute) but reverse their revenue. A 2% refund rate on a 50%-margin tool drops effective margin to 49%; a 10% refund rate drops it to 45%. Refund rate is a proxy for quality — the best way to control it is to improve output consistency, but some baseline rate is unavoidable. The MCP payment retry logic guide covers the operational side of handling failed payments and refunds — the billing-layer decisions you make there directly affect your realized margin.
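The arithmetic behind those figures, as a small sketch (the 10% case rounds to roughly 45%):

```python
# Sketch: effective margin when refunded calls keep their cost but
# reverse their revenue. Matches the 2% / 10% examples above.

def effective_margin(gross_margin: float, refund_rate: float) -> float:
    """Margin after refunds, assuming refunded calls still incur full COGS."""
    cost_fraction = 1 - gross_margin     # cost as a fraction of list price
    revenue_kept = 1 - refund_rate       # fraction of billed revenue retained
    return 1 - cost_fraction / revenue_kept

for rate in (0.02, 0.05, 0.10):
    print(f"refund rate {rate:.0%}: effective margin {effective_margin(0.50, rate):.1%}")
```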
Mitigation: measure refund rate as a dashboard metric, not a quarterly review metric. A refund rate that drifts from 2% to 5% between pricing experiments is telling you something about the experiment — either the new pricing attracted worse-fit callers or something else shifted.
Within this structure, tool operators have four economic levers that materially move margin. Other levers exist (cost of sales, overhead allocation, one-time capex) but these four are where the day-to-day management attention should go.
Many Layer 3 providers offer batch-mode pricing at a meaningful discount. Anthropic's batch API offers a flat 50% discount on asynchronous jobs, per the published pricing. OpenAI offers similar batch pricing. If your tool can tolerate a batch latency window (typically minutes to hours, depending on the provider), routing appropriate calls through batch can double your margin on inference-heavy workloads.
The trade-off is latency. A call that would complete in 2 seconds on the real-time API might take 30 minutes on batch. Not all tool calls can tolerate that — a real-time compliance check can't, but a nightly data enrichment run can. Classify your tool's calls by latency tolerance and route accordingly.
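A sketch of that classification, with hypothetical thresholds and placeholder routing; in a real tool the batched queue would feed your provider's batch API rather than an in-memory list:

```python
# Sketch: route tool calls to real-time vs. batch inference based on
# declared latency tolerance. Names and thresholds are illustrative.
from dataclasses import dataclass

@dataclass
class ToolCall:
    payload: dict
    max_latency_s: float    # how long the caller can wait for a result

REALTIME_CUTOFF_S = 60      # assumption: anything needing <60s goes real-time

batch_queue: list[ToolCall] = []

def route(call: ToolCall) -> str:
    if call.max_latency_s < REALTIME_CUTOFF_S:
        # Real-time API: full price, low latency.
        return "realtime"
    # Batch API: typically ~50% cheaper, but results arrive asynchronously.
    batch_queue.append(call)
    return "batched"

# Example: a compliance check can't wait; a nightly enrichment run can.
print(route(ToolCall({"check": "sanctions"}, max_latency_s=5)))      # realtime
print(route(ToolCall({"enrich": "crm-batch"}, max_latency_s=3600)))  # batched
```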
Prompt caching on the LLM side (Anthropic's Sonnet cache reads at $0.30/MTok, roughly an order of magnitude cheaper than non-cached reads) makes a meaningful difference for tools that reuse large system prompts or RAG context. Caching at the tool level — memoizing idempotent calls — eliminates the cost entirely for repeat inputs.
The practical trick with caching is understanding your cache hit rate before you commit to infrastructure. If your callers rarely repeat inputs, caching won't help; if they frequently do, caching can cut your COGS by 60-80%. Measure first, build second.
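Measuring first can be as simple as counting how often idempotent inputs repeat. A minimal sketch, using an in-memory dictionary as a stand-in for whatever cache your tool would actually use:

```python
# Sketch: memoize idempotent tool calls and track the hit rate so you
# know whether caching is worth real infrastructure.
import hashlib, json

_cache: dict[str, dict] = {}
hits = misses = 0

def cache_key(tool_input: dict) -> str:
    # Stable key over the normalized input; only safe for idempotent calls.
    return hashlib.sha256(json.dumps(tool_input, sort_keys=True).encode()).hexdigest()

def run_tool(tool_input: dict) -> dict:
    global hits, misses
    key = cache_key(tool_input)
    if key in _cache:
        hits += 1
        return _cache[key]               # repeat input: zero marginal COGS
    misses += 1
    result = {"extracted": "..."}        # placeholder for the real (paid) work
    _cache[key] = result
    return result

for url in ["a.com", "b.com", "a.com", "a.com"]:
    run_tool({"url": url})
print(f"cache hit rate: {hits / (hits + misses):.0%}")   # 50% on this toy trace
```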
LLM-wrapping tools have a choice of model: Opus for highest quality, Sonnet for median quality, Haiku for lowest cost. The right choice depends on what your callers actually need. Many tools over-provision — they ship Opus when Sonnet would produce equivalent results for most calls, or ship Sonnet when Haiku would suffice for the 80% of calls that don't need the larger model.
Tier selection can be static (always use Sonnet) or dynamic (route to Opus only when the input is complex, otherwise use Haiku). Dynamic routing is harder to implement correctly but can move your inference cost floor by 3-5×. Measure which tier actually produces winning outputs on your real workload, not which tier sounds right.
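A dynamic router can start as a cheap complexity heuristic in front of the model choice. The sketch below uses made-up scoring and thresholds; the heuristic is exactly the part you'd replace with measurements from your real workload:

```python
# Sketch: route each call to the cheapest model tier expected to handle it.
# Tier names mirror the Opus/Sonnet/Haiku split above; the complexity
# heuristic and thresholds are illustrative assumptions.

def complexity_score(task: dict) -> float:
    """Toy heuristic: longer inputs and more requested fields = more complex."""
    return len(task.get("text", "")) / 10_000 + 0.1 * len(task.get("fields", []))

def pick_tier(task: dict) -> str:
    score = complexity_score(task)
    if score < 0.3:
        return "haiku"    # cheapest tier handles the easy majority of calls
    if score < 0.8:
        return "sonnet"   # mid tier for the typical hard case
    return "opus"         # reserve the expensive tier for genuinely complex inputs

print(pick_tier({"text": "short page", "fields": ["title"]}))          # haiku
print(pick_tier({"text": "x" * 4000, "fields": ["a", "b", "c"]}))      # sonnet
```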
The flip side of tier selection: some tools gate on output quality and route failures to a higher-quality tier for a free retry. A search tool might first try Haiku; if the output fails a quality check, it retries on Sonnet at no additional cost to the caller. This is effectively outcome-based pricing at the tool level (covered in lesson 1), and it shifts the cost curve in ways that static tier selection can't.
Quality gating requires a reliable quality signal — a structured check that determines whether the output is "good enough." For some workloads this is straightforward (did the search return results?); for others it's hard (is this synthesis correct?). When it works, it's the highest-leverage lever on this list.
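A sketch of the gate-and-escalate pattern, assuming a cheap structured check; the extraction stub, the check, and the two-tier escalation are all illustrative:

```python
# Sketch: try the cheap tier first, run a structured quality check, and
# silently retry on a stronger tier when the check fails. The caller pays
# once; the retry cost is absorbed by the tool operator.

def extract(model: str, page: str) -> dict:
    # Placeholder for the real model call; pretend the stronger tier
    # recovers a field the cheap tier missed.
    title = page.strip() or ("recovered title" if model == "sonnet" else None)
    return {"title": title, "model": model}

def passes_quality_gate(result: dict) -> bool:
    # Structured check: did we actually fill the required field?
    return result.get("title") is not None

def handle_call(page: str) -> dict:
    result = extract("haiku", page)          # cheap tier first
    if passes_quality_gate(result):
        return result
    return extract("sonnet", page)           # escalate on failure, no extra charge

print(handle_call("Example Domain"))   # served by the cheap tier
print(handle_call("   "))              # fails the gate, escalates
```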
Theory is easier to evaluate against a concrete example. Consider a hypothetical MCP tool that performs structured data extraction from web pages — input a URL, output a JSON object with pre-defined fields. The underlying implementation calls Claude Sonnet 4.6 for extraction with a ~3K-token system prompt and ~500 tokens of variable page context. Here's what the economics look like at three revenue scales.
At 1,000 calls/month:
- Price: $0.08/call
- Revenue: $80
- Inference: (3.5K input tokens × $3/MTok + 0.5K output tokens × $15/MTok) = 1,000 × $0.0180 = $18
- Infrastructure: $5 flat
- Payment processing: $3 in platform fees
- Gross margin: $80 − $26 = $54 (67.5%)

Good margin on paper. But at $54/month, the tool is barely paying for the developer's coffee. The scale is the problem, not the economics.
At 50,000 calls/month:
- Price: $0.08/call
- Revenue: $4,000
- Inference: 50,000 × $0.0180 = $900
- Infrastructure: $100 (load balancing, additional compute allocations)
- Payment processing: $160 (roughly 4% platform fee at this scale)
- Gross margin: $4,000 − $1,160 = $2,840 (71%)

Meaningfully better. Infrastructure scales sublinearly with volume, and the inference cost is the dominant variable cost. Adding prompt caching (the 3K-token system prompt is identical across calls) would cut inference to ~$300, pushing margin to ~85%.
At 500,000 calls/month:
- Price: $0.08/call
- Revenue: $40,000
- Inference, with prompt caching (cached system-prompt reads at $0.30/MTok + 0.5K × $15/MTok output tokens): ≈ $3,855
- Infrastructure: $800 (database for idempotency keys, monitoring)
- Payment processing: $2,000 (platform fee)
- Overhead allocation: $2,000
- Gross margin: $40,000 − $8,655 = $31,345 (78%)

At this scale, prompt caching has moved from "nice to have" to "required for the business." Without it, inference would be $9,000 and margin would drop to 71%. The same tool, same pricing, same caller behavior — but the tool operator's implementation choices determine whether the business is healthy or marginal.
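If you want to reproduce the arithmetic behind these examples, or rerun it with your own numbers, a small sketch; the inputs approximate the 1,000- and 50,000-call scales above:

```python
# Sketch: gross margin for a per-call tool at different volumes.
# Inputs approximate the worked example above (Sonnet-class pricing,
# $0.08/call); swap in your own costs to model your tool.

def monthly_pnl(calls: int, price: float, inference_per_call: float,
                infra: float, fee_rate: float) -> dict:
    revenue = calls * price
    inference = calls * inference_per_call
    fees = revenue * fee_rate
    costs = inference + infra + fees
    margin = revenue - costs
    return {"revenue": revenue, "costs": costs,
            "margin": margin, "margin_pct": margin / revenue}

# 3.5K input tokens x $3/MTok + 0.5K output tokens x $15/MTok per call
uncached = 3.5e3 * 3e-6 + 0.5e3 * 15e-6     # = $0.018/call

for calls, infra, fee_rate in [(1_000, 5, 0.04), (50_000, 100, 0.04)]:
    p = monthly_pnl(calls, 0.08, uncached, infra, fee_rate)
    print(f"{calls:>7} calls: margin ${p['margin']:,.0f} ({p['margin_pct']:.0%})")
```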
The broader lesson: unit economics don't become "good" automatically as volume grows. Scale creates opportunity to improve unit economics, but only if you invest the engineering time to harvest it. Tools that don't instrument caching, don't measure refund rate, and don't do tier routing can hit margin ceilings well below what their more disciplined competitors achieve at the same scale.
One additional note on the above examples: the per-call price is held constant across scales for illustrative purposes. In practice, your pricing should probably evolve as you scale — volume discounts for large callers, subscription options for enterprise, and occasional repricing experiments to test whether the market will bear more. Those decisions sit in lesson 2; the point of this lesson is that scale doesn't fix bad unit economics — it amplifies whatever economics you already have.
The worked examples above assume the tool is operating roughly as designed. In practice, unit economics can go sideways in three specific ways, and recognizing each early saves you from building on broken foundations.
A tool that launches with a low price to capture early adoption can end up with gross revenue growing while gross margin shrinks. Each new caller brings inference cost that exceeds their per-call revenue after settlement fees. This looks healthy on a top-line chart — revenue is up! — but cash burns faster than it comes in. The fix is almost never to grow out of the problem; it's to reprice immediately, even at the cost of churning some callers. A gross-margin-negative tool at 50K calls is a gross-margin-negative tool at 500K calls, only louder.
Your economics depend on the mix of callers you have. If your best-margin segment (say, enterprise callers on high-volume plans) churns faster than your worst-margin segment (evaluators running one-off tests), your blended margin drops even if your per-segment margins stay constant. This is invisible on a single margin number but obvious when you break margin out by caller cohort. Instrument caller-level margin from day one.
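Instrumenting caller-level margin doesn't require much: tag each call with its caller and cohort, then aggregate. A minimal sketch with made-up records:

```python
# Sketch: per-cohort margin from call-level records, so a mix shift is
# visible before it hits the blended number. Records are illustrative.
from collections import defaultdict

calls = [
    {"caller": "acme-enterprise", "cohort": "enterprise", "revenue": 0.08, "cost": 0.020},
    {"caller": "acme-enterprise", "cohort": "enterprise", "revenue": 0.08, "cost": 0.019},
    {"caller": "dev-eval-123",    "cohort": "evaluator",  "revenue": 0.08, "cost": 0.055},
]

totals = defaultdict(lambda: {"revenue": 0.0, "cost": 0.0})
for c in calls:
    totals[c["cohort"]]["revenue"] += c["revenue"]
    totals[c["cohort"]]["cost"] += c["cost"]

for cohort, t in totals.items():
    margin = 1 - t["cost"] / t["revenue"]
    print(f"{cohort}: margin {margin:.0%} on ${t['revenue']:.2f} revenue")

blended = 1 - sum(c["cost"] for c in calls) / sum(c["revenue"] for c in calls)
print(f"blended: {blended:.0%}")
```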
Tool operators sometimes add features to retain large callers — priority queues, dedicated support channels, custom rate limits — without pricing those services separately. Over time, these unpriced services absorb margin. One large caller consuming 20 hours of support a month on a $299 subscription is unprofitable regardless of what your direct-cost margin looks like. Price for the services explicitly or cap them.
Direct COGS is one side of economics. The other side is fixed and semi-fixed costs that don't scale with call volume — but still eat into profitability.
If you're a solo developer, your time has an opportunity cost. A tool that nets $3,000/month after direct costs but requires 20 hours/week of maintenance is effectively paying you $37/hour before taxes. That may or may not be good economics depending on your alternatives, but it's the honest picture. Bake your time cost into your margin model.
Paid tools come with obligations: security updates, dependency upgrades, API contract maintenance, customer support, tax compliance if you pass thresholds. Plan for roughly 10-20% of gross revenue absorbed by these costs once you're at a scale that triggers them.
If you spend on ads, content, or sponsorships to drive tool discovery, that spend is part of the economics. The directory submission work that lesson 1 pointed at is effectively an unpaid customer acquisition channel — valuable at launch, but limited in scale. Past a certain size, tools that want to keep growing usually invest in paid acquisition, which becomes a real line item.
Most tools that fail don't fail for novel reasons. The patterns repeat, and each one has a signature you can recognize early.
Launch with aggressive pricing to capture market share, assume margin will improve as you scale. This fails because upstream costs don't amortize the way many founders expect — inference costs scale roughly linearly with calls, infrastructure scales sublinearly but non-trivially, and payment processing scales linearly until you hit custom-contract volume (typically $10K+/month spend). Your gross margin at 1K calls is a good predictor of your gross margin at 100K calls. Fix margin at 1K, not at 100K.
Most of your revenue comes from one or two large callers. You build features for them, price around them, and structure your team around their needs. When they churn — because they build the capability in-house, because they switch to a competitor, or because their use case evolved — you lose the majority of your business in one month. Mitigate by enforcing a concentration limit: no single caller should account for more than 25% of revenue unless you've specifically decided to accept that risk.
Your prices are obviously too low, but you keep them because raising them feels risky. Every month you don't reprice, you leave margin on the table. Meanwhile, competitors see your pricing and either race you to the bottom or skip your segment entirely. The longer you wait, the higher the opportunity cost. Reprice when the data supports it, don't wait for "the right moment."
Your free tier converts poorly to paid — say, <2% of free users ever upgrade — but you keep it because it drives "top-of-funnel metrics." Meanwhile, the free tier's direct costs (inference, infrastructure, support) eat real money every month. A free tier that doesn't convert is a marketing budget; decide if it's worth what it costs you. The MCP server free-tier configuration guide covers the implementation of tight free tiers that don't bleed margin.
If you're looking at your economics and wondering what to fix first, a defensible ordering is: measure first (caller-level margin, refund rate, cache hit rate), then work through the four Layer 3 levers, saving quality gating for last since it depends on having a reliable quality signal.
What specifically not to optimize first: the shape of your pricing model (covered in lesson 2 on per-call vs subscription). Pricing-model changes are the highest-risk, highest-disruption change you can make, and they rarely move margin as much as Layer 3 optimizations do.
The economics of tool calling reward tool operators who think in terms of the full three-layer stack, rather than focusing only on their per-call price. Your price sets your revenue ceiling; your implementation choices determine what fraction of that revenue survives as margin. Spend at least as much time on the second question as on the first.
SettleGrid lets you try any pricing model — per-call, tiered, freemium — with two lines of code, and switch without redeploying. Free tier is production-ready: 50K operations/month, 0% take rate on your first $1K/mo of revenue.