Spot GPUs and KV Caches for Stable LLM Inference

Most teams overspend on reserved GPUs. A better LLM setup uses spot instances, KV caches, and strict fallback rules to keep latency stable.

By Tushar Goyal

Reserved GPUs are the wrong default for most startup LLM products because predictability comes more from architecture than from paying on-demand prices all month.

Why predictable inference is mostly a systems problem

Founders usually blame model size when latency is unstable. They should blame scheduling, cold starts, prompt bloat, and bad fallback design first.

We have seen this pattern repeatedly in AI builds. On Harmony.ai, the biggest cost driver was not the model vendor itself. It was prompt token count across chained calls, and we brought costs down by caching intermediate outputs instead of recomputing them every time.

That same principle applies to self-hosted inference. If you keep reprocessing long shared prefixes, route requests blindly, and let pods cold start under load, your latency will swing even on expensive GPUs.

A few numbers make the point:

  • NVIDIA states that KV caching avoids recomputing attention state for prior tokens during autoregressive generation, which directly reduces repeated work for long contexts and multi-turn chats by reusing previously computed keys and values.
  • Google’s 2017 “Attention Is All You Need” paper showed that transformers scale much better than recurrent models through parallel sequence processing, but inference is still dominated by token-by-token decoding costs because generation remains autoregressive.
  • Spot pricing is not a tiny discount. AWS documents that Spot Instances can offer up to 90% lower prices than On-Demand, which is too large a cost gap to ignore if you are serving production traffic.

The wrong conclusion from those numbers is, "Great, we can just run everything on spot and call it optimized." That architecture will fail.

The right conclusion is simpler:

  • Use spot GPUs for the bulk of stateless inference capacity because the savings are too large to ignore.
  • Protect user-facing reliability with warm fallback capacity, queue shaping, and request classes.
  • Exploit KV cache reuse aggressively, because the cheapest token is the one you do not recompute.

Predictable LLM inference is not about eliminating volatility in infrastructure. It is about designing the system so infrastructure volatility never reaches the user.

This is the same engineering pattern we used on Surge. We rebuilt the real-time data layer twice because Supabase Realtime added 200ms+ latency under load, and that was unacceptable once thousands of concurrent users were active. The lesson was blunt: managed convenience is fine until it leaks into user experience, then you replace it.

LLM inference should be treated the same way. If spot volatility or cache misses show up in your p95, your architecture is incomplete.

If you are still choosing your infra stack at the product level, read Choosing The Right Tech Stack alongside this. Stack decisions are not separate from inference reliability. They are the reason you either can or cannot recover from a GPU interruption cleanly.

The architecture that actually works

The best startup architecture is a two-tier inference pool: spot GPUs for primary throughput, on-demand GPUs for minimum guaranteed capacity, with KV-aware routing in front.

Do not build a single undifferentiated cluster. Split the system by failure tolerance.

Here is the architecture I would choose for almost every seed-stage or Series A product shipping chat, agents, summarization, or extraction:

| Layer | What runs here | Why it exists |
| --- | --- | --- |
| API gateway | Auth, rate limits, request classification, streaming transport | This keeps product logic separate from model serving and lets you degrade cleanly. |
| Request router | Model selection, tenant policy, cache-aware routing | This is where you decide whether a request must hit a warm replica or can tolerate retries. |
| Spot GPU pool | Bulk generation traffic and batchable work | This is where you get most of your margin back. |
| On-demand GPU pool | Premium traffic, fallbacks, warm canaries, urgent retries | This protects p95 and absorbs spot interruptions. |
| KV cache layer | Prefix-aware cache index plus per-replica resident KV blocks | This removes repeated compute on shared prompts and long sessions. |
| Durable state | Session metadata, prompt templates, response artifacts, queue state | This lets you reconstruct work after interruptions without confusing the user. |

This is not over-engineering. It is the minimum viable production setup once LLM output is revenue-critical.

A practical request flow looks like this:

  • The API gateway tags the request as interactive, background, or replayable. Interactive requests get strict latency budgets. Background work gets queued and can use cheaper capacity.
  • The router checks for a reusable prompt prefix signature. If the tenant, system prompt, tools, and model version match, it prefers a replica with a warm KV cache.
  • The router sends first choice traffic to spot if the request is replayable within your SLA. If not, it sends it to a warm on-demand pool immediately.
  • The gateway starts streaming as soon as the first token is available. We had to solve similar streaming behavior in Utkrusht.ai, where the challenge was streaming LLM responses without blocking the UI thread. The backend design mattered because frontend smoothness depended on token flow consistency.
  • If a spot node receives an interruption notice, new requests stop immediately and inflight work is either drained or replayed against on-demand capacity.
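
The classification step in that flow can be sketched as a small tagging function that runs before any expensive work starts. The class names, budget values, and `TaggedRequest` shape below are illustrative assumptions, not from any particular gateway framework.

```python
from dataclasses import dataclass

# Illustrative latency budgets per request class; tune these per product.
BUDGETS_MS = {"interactive": 2500, "replayable": 30000, "background": 60000}

@dataclass
class TaggedRequest:
    kind: str        # "interactive" | "replayable" | "background"
    sla_ms: int
    replayable: bool

def classify(user_waiting: bool, retry_safe: bool) -> TaggedRequest:
    """Tag a request at the gateway, before tokenization or placement."""
    if user_waiting:
        # Live chat / copilot traffic: strict budget, replay only if safe.
        return TaggedRequest("interactive", BUDGETS_MS["interactive"], retry_safe)
    if retry_safe:
        # Queued work that can be rerun on a different node after interruption.
        return TaggedRequest("replayable", BUDGETS_MS["replayable"], True)
    return TaggedRequest("background", BUDGETS_MS["background"], False)
```

The point is that the tag exists before the router touches the request, so every downstream decision can read it instead of guessing.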

A simple version of the routing policy can be expressed like this:

function choosePool(req) {
  if (req.priority === 'interactive' && req.slaMs <= 2500) {
    return 'on-demand';
  }

  if (req.cacheHit && req.expectedOutputTokens < 400) {
    return 'spot';
  }

  if (req.replayable && req.userCanTolerateRetry) {
    return 'spot';
  }

  return 'on-demand';
}

That logic is intentionally boring. Good inference architecture is mostly boring. The teams that get hurt are usually chasing some clever autoscaling trick while ignoring classification and replayability.

Why most teams route too late

Many products route after tokenization or after the serving engine has already accepted the request. That is too late.

You need routing decisions before expensive work starts because:

  • Prefix-aware placement only matters if the request lands on the right replica before prefill begins.
  • Spot interruption risk is manageable only if requests are classified before assignment.
  • Queue isolation only works if interactive traffic is separated from batch traffic at the edge.

For startup teams building fast, the same bias applies broadly. Keep the control point early and explicit. We make the same argument in Most Agile Stack For Building Yc Mvp: fast teams win by removing ambiguity from architecture, not by adding abstraction.

Where KV caches change the economics

KV caching is not a micro-optimization. For chat products, agent frameworks, and repeated structured prompts, it is the difference between a viable gross margin and a fake one.

Every time your model sees a long shared prefix, you are paying for prefill work. That includes:

  • The system prompt.
  • Tool schemas.
  • Conversation history.
  • Retrieval context.
  • Safety instructions.
  • Output formatting constraints.

If those tokens are stable across turns or requests, recomputing them is waste.
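
A back-of-envelope sketch makes the waste concrete. All the numbers here are made-up assumptions for illustration:

```python
# Hypothetical workload: a 3,000-token stable prefix reprocessed on every call.
stable_prefix_tokens = 3000
requests_per_day = 50_000
cache_hit_rate = 0.7  # fraction of requests that could have reused the prefix

# Prefill tokens paid for without reuse vs. with reuse on cache hits.
tokens_without_reuse = stable_prefix_tokens * requests_per_day
tokens_with_reuse = int(tokens_without_reuse * (1 - cache_hit_rate))

print(tokens_without_reuse)  # 150000000
print(tokens_with_reuse)     # 45000000
```

At these assumed numbers, reuse removes over a hundred million prefill tokens per day of pure recomputation.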

The important distinction is between two kinds of reuse:

1. Session-local KV reuse

This is the obvious one. A single conversation continues on the same replica, and the model reuses previous attention state.

Most teams stop here. That leaves a lot of savings on the table.

2. Cross-request prefix reuse

This is the bigger lever for products with repeated templates, role prompts, or enterprise workflows.

If 500 users hit the same workflow with the same tool schema and nearly identical preamble, you should precompute and reuse the prefix where your serving stack supports it. Engines like vLLM and TensorRT-LLM are worth evaluating specifically because they were built around higher-throughput serving patterns and memory efficiency, not because they are fashionable.

A few rules matter if you want KV caches to help instead of hurt:

  • Cache only stable prefixes. If the first 3,000 tokens change every request, there is nothing meaningful to reuse.
  • Version your cache key aggressively. Model version, tokenizer version, system prompt hash, tool schema hash, and decoding config should all be part of the key.
  • Evict based on business value, not just LRU. The best cache entry is the one tied to your hottest flow or largest enterprise tenant.
  • Keep cache affinity in routing. A 90% valid cache that lives on the wrong replica is functionally a miss.

A representative cache key might look like this:

{
  "model": "llama-3.1-70b-instruct",
  "tokenizer": "v3",
  "systemPromptHash": "a91c...",
  "toolsHash": "b821...",
  "retrievalTemplateHash": "f440...",
  "tenantTier": "enterprise",
  "decodeProfile": "streaming-balanced"
}
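
One way to turn those fields into an actual key is to hash a canonical serialization of them. The hashing scheme below is a sketch of the versioning idea, not a specific engine's API; the field names follow the example above.

```python
import hashlib
import json

def cache_key(fields: dict) -> str:
    """Derive a stable cache key from versioned prefix components.

    Canonical JSON (sorted keys, no whitespace) keeps the hash stable
    regardless of field ordering; any change to model, tokenizer, prompt,
    tools, or decoding config produces a different key, which is exactly
    the aggressive invalidation behavior you want.
    """
    canonical = json.dumps(fields, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

key = cache_key({
    "model": "llama-3.1-70b-instruct",
    "tokenizer": "v3",
    "systemPromptHash": "a91c",
    "toolsHash": "b821",
    "decodeProfile": "streaming-balanced",
})
```

Because every version field participates in the hash, a silent prompt or tool-schema change can never serve a stale prefix.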

The counterintuitive part is that larger prompts can make your system more predictable if they are stable and cacheable.

Most teams try to shorten every prompt. That is directionally right, but incomplete. A 4,000-token stable prefix with strong reuse can be cheaper and more predictable than a 1,500-token prompt rebuilt differently every time.

We saw the broader version of this on Harmony.ai. Token count was the cost problem, but the real fix was not just "use fewer tokens." It was to stop repeating work across orchestration steps. In inference serving, KV caches are how you stop repeating work at the model execution layer.

If you are building AI features into an MVP, this connects directly to scope discipline. Mvp Scope Example What To Build First matters here because the easiest way to avoid runaway inference cost is to ship one repeated high-volume workflow before five low-volume bespoke ones.

How to survive spot interruptions without user-visible chaos

Spot instances are only dangerous if your product assumes a machine will stay alive just because a request started there.

That assumption is the real bug.

AWS issues a Spot Instance interruption notice with a two-minute warning in most cases. Two minutes is plenty if your system is designed to drain, replay, or fail over. It is useless if your state lives only in process memory and your router has no replacement path.
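
AWS exposes the notice through the instance metadata `spot/instance-action` document, which contains an `action` and a `time` field. The helper below only parses that payload and computes how long you have to drain; fetching the metadata is left out, and the function name is my own.

```python
import json
from datetime import datetime, timezone

def drain_deadline(instance_action_json: str, now: datetime) -> float:
    """Seconds left to drain, given the spot/instance-action metadata payload.

    AWS documents the payload as e.g.
    {"action": "terminate", "time": "2017-09-18T08:22:00Z"}.
    """
    notice = json.loads(instance_action_json)
    t = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    t = t.replace(tzinfo=timezone.utc)
    return (t - now).total_seconds()

now = datetime(2017, 9, 18, 8, 20, 0, tzinfo=timezone.utc)
payload = '{"action": "terminate", "time": "2017-09-18T08:22:00Z"}'
print(drain_deadline(payload, now))  # 120.0
```

Whatever polls this should feed directly into the unschedulable-and-drain path, not into a dashboard someone looks at later.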

The survival plan is straightforward:

  • Stop assigning new requests to the interrupted node immediately. This should happen at the service registry or load balancer level, not by waiting for application health checks to fail.
  • Let short interactive requests finish if they are near completion. If a stream is 90% done, draining is often better than replay.
  • Replay long or batch requests onto a warm fallback pool. This only works if the request envelope and prior artifacts are stored durably.
  • Preserve partial UX state. If a generation restarts, the client should show a brief reconnect state instead of silently hanging.

For interactive products, I recommend three request classes:

| Request class | Example | Where it should run |
| --- | --- | --- |
| Gold | Live user chat, copilot, demo-critical flows | On-demand first, spot overflow only if warm failover exists |
| Silver | Standard user requests with retry tolerance | Spot first, on-demand fallback |
| Bronze | Batch enrichments, nightly jobs, async evaluation | Spot only |

This classification sounds obvious, but most products skip it because they are still thinking in terms of "one inference endpoint."

That is a mistake. Not all tokens have the same business value.

A simple interruption handler might look like this:

async def handle_spot_interruption(node_id: str):
    # Stop the scheduler from placing any new work on this node.
    mark_node_unschedulable(node_id)
    inflight = await list_inflight_requests(node_id)

    for req in inflight:
        if req.progress_pct > 85 and req.class_name == "gold":
            # Nearly finished premium streams: draining beats replaying.
            await try_drain(req)
        else:
            # Persist the request envelope, then replay on warm capacity.
            await checkpoint(req)
            await replay_to_pool(req, pool="on-demand")

There is also a product truth here: graceful degradation beats fake reliability.

If capacity is tight, do this instead of pretending everything is normal:

  • Reduce max output tokens for lower-priority classes.
  • Temporarily disable expensive tools or multi-step reasoning modes.
  • Queue non-interactive tasks visibly with ETA estimates.
  • Route premium or sales-critical flows to guaranteed capacity.
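
The first two of those degradation levers can be expressed as a small policy function. The thresholds and token caps below are assumptions to illustrate the shape, not recommendations:

```python
# Illustrative degradation policy: tighten decode limits as capacity drops.
DEFAULT_MAX_TOKENS = {"gold": 2048, "silver": 1024, "bronze": 1024}

def decode_limits(request_class: str, spare_capacity: float) -> dict:
    """Return decode settings given spare capacity as a fraction in [0, 1].

    Gold traffic is never degraded; lower-priority classes trade output
    length and tool access for continued availability under pressure.
    """
    max_tokens = DEFAULT_MAX_TOKENS[request_class]
    tools_enabled = True
    if spare_capacity < 0.2 and request_class != "gold":
        max_tokens //= 2       # shorter outputs for lower-priority classes
        tools_enabled = False  # disable expensive tool calls under pressure
    return {"max_tokens": max_tokens, "tools_enabled": tools_enabled}
```

The useful property is that degradation is a deliberate, testable code path rather than an emergent failure mode.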

That is the same philosophy behind When To Stop Using Supabase For Postgres. You do not wait for the architecture to collapse in public. You identify the layer that is creating user-visible instability and replace it before it becomes your brand.

What to measure if you care about reliability

Most teams measure average latency, which is almost useless for LLM serving.

You should care about p95 time-to-first-token, p99 completion time, interruption recovery success rate, and effective cost per successful response.

If I were instrumenting this from scratch, I would track these metrics first:

  • Time to first token by request class. This tells you whether interactive UX feels instant or sluggish.
  • Prefill time versus decode time. This tells you whether prompt construction or token generation is actually driving latency.
  • KV cache hit rate by prefix family. This shows whether your caching strategy is real or just theoretical.
  • Spot interruption replay success rate. This tells you whether your fallback architecture works in production rather than in diagrams.
  • Effective tokens per GPU-second. This is a much better capacity metric than raw requests per second.
  • Cost per completed response by model and tenant tier. This exposes which workflows are quietly destroying margin.
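
The last two metrics are simple arithmetic once you log the right counters. The figures below are hypothetical, purely to show the computation:

```python
# Hypothetical hourly figures for one GPU pool.
gpu_hours = 4.0
gpu_cost_per_hour = 1.10      # assumed blended spot price
completed_responses = 1800
generated_tokens = 950_000

# Throughput normalized to GPU time, and cost attributed only to successes.
effective_tokens_per_gpu_second = generated_tokens / (gpu_hours * 3600)
cost_per_successful_response = (gpu_hours * gpu_cost_per_hour) / completed_responses

print(round(effective_tokens_per_gpu_second, 1))   # 66.0
print(round(cost_per_successful_response, 4))      # 0.0024
```

Dividing cost by completed responses rather than attempts is the point: failed and replayed work shows up as worse unit economics instead of hiding in averages.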

A practical SLO table might look like this:

| Metric | Target | Why it matters |
| --- | --- | --- |
| p95 TTFT for gold traffic | < 800ms | This is the threshold users feel immediately in chat UX. |
| p99 completion for silver traffic | < 8s | This keeps normal usage acceptable without overpaying for guaranteed capacity. |
| KV cache hit rate on hot prefixes | > 70% | This proves your prompt architecture is reusable rather than chaotic. |
| Replay success after spot interruption | > 99% | This is what turns cheap infrastructure into reliable product behavior. |
| GPU utilization | 60-80% sustained | This is the band where you are efficient without creating queue spikes. |

The 60-80% utilization target is deliberate. Running GPUs near 100% looks efficient on paper and feels terrible in production because queueing delay spikes before users complain in a clear way.

That is also why I do not recommend autoscaling only on GPU utilization. Scale on a mix of:

  • Queue depth for each request class.
  • p95 time-to-first-token.
  • Cache-miss rate on hot workflows.
  • Number of warm replicas available in fallback capacity.
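
A scale-up decision mixing those signals can be as blunt as this. The thresholds are assumptions you would tune against your own SLOs:

```python
def should_scale_up(queue_depth: int, p95_ttft_ms: float,
                    cache_miss_rate: float, warm_fallback_replicas: int) -> bool:
    """Combine queue, latency, cache, and fallback signals into one decision."""
    if warm_fallback_replicas < 2:
        return True   # never let the reliability layer run dry
    if queue_depth > 50:
        return True   # backlog building faster than the drain rate
    if p95_ttft_ms > 800 and cache_miss_rate > 0.3:
        return True   # latency degrading and caches are not absorbing it
    return False
```

Note that GPU utilization does not appear at all: each condition maps to something a user or an SLO actually feels.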

The architecture decision founders miss

Many founders ask whether they should optimize prompts, switch models, or buy more GPUs. The order is wrong.

Do this first:

  1. Classify requests by business value and retry tolerance.
  2. Make repeated prefixes stable enough to cache.
  3. Add a small on-demand fallback pool.
  4. Only then tune models and quantization.

That ordering is not theoretical. It matches how real product constraints show up.

On Utkrusht.ai, the hard problem was not just getting tokens out of a model. It was streaming them in a way that kept the interface responsive. On Surge, the hard problem was not just real-time updates. It was removing infra choices that added user-visible latency under load. Predictable inference is the same kind of engineering decision: build around the user-facing bottleneck, not the internal abstraction you like most.

What to do next

Audit one production workflow this week and force it through a three-part decision: is it replayable, does it have a stable cacheable prefix, and what exact fallback pool serves it on interruption.

Do not start by adding another serving engine or another model. Start by drawing the routing rules for one high-volume endpoint and assign every request to one of three classes: gold, silver, or bronze.

Then implement this sequence in order:

  • Put interactive and batch traffic behind separate queues. Shared queues are where latency predictability goes to die.
  • Add explicit prefix hashing and log cache-hit candidates even before you enable reuse. You need to know whether your product behavior makes caching viable.
  • Keep a small warm on-demand pool for premium and replay traffic. This pool is your reliability layer, not your main capacity layer.
  • Define p95 TTFT and replay success SLOs before you optimize anything else. If you cannot measure recovery, you do not have reliable spot inference.
  • Trim prompt entropy in your hottest workflow. Stable prompts are more valuable than merely shorter prompts.

If you are early and still shaping the product, connect this work back to feature scope. The simplest path to predictable inference is a narrower set of repeated workflows, which is exactly why we push founders toward disciplined MVP decisions in Building Products From Zero To One.

The best architecture for most startups is not all reserved GPUs and it is not all spot GPUs. It is spot-first capacity with strict request classes, warm fallbacks, and KV cache-aware routing.

That is how you get lower costs without turning latency into a lottery.

If you're at this stage, schedule a call with us.