How to Choose Performance Metrics for Serverless LLM Inference

When deploying Large Language Models (LLMs) in a serverless environment, it's common to focus on a single performance number, like median tokens per second. While this metric is easy to track, it often provides an incomplete picture of an application's real-world performance. Different LLM workloads have distinct bottlenecks and user expectations, meaning a single metric optimized in isolation can lead to a beautifully benchmarked but poorly performing production service. This tutorial will guide you through identifying and prioritizing the critical performance metrics that truly matter for your serverless LLM inference applications.

Step 1: Understand Your LLM Workload's Requirements

Before diving into specific metrics, it's crucial to categorize your LLM workload. The ideal performance measurement heavily depends on how your application interacts with users or other systems. Consider these common workload types:

Batch Processing: Tasks like overnight document summarization, bulk data extraction, or generating embeddings for a large dataset. Here, no human is waiting in real time.
Interactive Chat Interfaces: Applications where users expect immediate and continuous responses, such as chatbots, virtual assistants, or real-time content generation tools.
Real-time API Endpoints: Services that provide quick, single-shot responses to user requests, like content moderation, sentiment analysis, or code completion suggestions.

Each of these scenarios places different emphasis on speed, consistency, and cost. Understanding your primary use case will inform which metrics you should prioritize.

Step 2: Evaluate Throughput for Offline and Batch Processes

Throughput measures the steady-state rate at which an LLM emits tokens once it has started generating a response. It's typically expressed in tokens per second (TPS).
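As a minimal sketch of this definition, the snippet below computes steady-state tokens per second from a list of token arrival times. The function name `decode_tps` and the sample trace are hypothetical, chosen only for illustration; the wait before the first token is deliberately excluded because the rate is measured once generation has started.

```python
from typing import Sequence


def decode_tps(token_times: Sequence[float]) -> float:
    """Steady-state tokens per second, measured from the first token onward.

    token_times holds the arrival time (in seconds) of each generated token.
    The delay before the first token is excluded on purpose.
    """
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure a rate")
    elapsed = token_times[-1] - token_times[0]
    if elapsed <= 0:
        raise ValueError("token timestamps must be increasing")
    # N tokens span N - 1 inter-token intervals.
    return (len(token_times) - 1) / elapsed


# Hypothetical trace: first token at 0.40 s, then one token every 0.05 s.
times = [0.40 + 0.05 * i for i in range(21)]
print(f"{decode_tps(times):.1f} tokens/s")
```

With this made-up trace, 20 intervals span about one second, so the rate works out to roughly 20 tokens per second regardless of how long the first token took to arrive.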

When Throughput Matters Most

Throughput is the primary metric for workloads where no human is actively waiting for an immediate response. This includes:

Batch Summarization: Processing hundreds or thousands of documents.
Data Generation Pipelines: Creating large volumes of synthetic data or embeddings.
Offline Content Rewriting: Updating product descriptions or marketing copy in bulk.

For these applications, maximizing the total number of tokens processed over a given period is key. High throughput directly translates to faster completion of large tasks and efficient resource utilization.

Beyond Single-Stream Throughput

While many benchmarks report single-stream throughput (one request at a time), production services typically handle many requests concurrently. The more relevant figure is therefore aggregate throughput under concurrency: the total tokens per second delivered across all simultaneous requests, tracked as the number of requests increases. Observe whether per-request speed degrades gracefully or sharply under load. Model architecture matters as well: Mixture-of-Experts (MoE) models can deliver significantly different throughput than dense models, which makes model selection a critical factor alongside provider performance.
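The difference between aggregate and per-request throughput can be sketched with a few lines of arithmetic. The `RequestRecord` type, the function names, and the numbers below are all hypothetical; here the per-request rate is taken end to end, so it includes the wait for the first token.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RequestRecord:
    start: float        # seconds, when the request was sent
    end: float          # seconds, when the last token arrived
    output_tokens: int  # tokens generated for this request


def aggregate_tps(records: Sequence[RequestRecord]) -> float:
    """Total tokens delivered divided by the wall-clock window of the run."""
    window = max(r.end for r in records) - min(r.start for r in records)
    return sum(r.output_tokens for r in records) / window


def mean_per_request_tps(records: Sequence[RequestRecord]) -> float:
    """Average end-to-end rate seen by an individual request."""
    rates = [r.output_tokens / (r.end - r.start) for r in records]
    return sum(rates) / len(rates)


# Made-up measurements: one request alone vs. eight sent at the same time.
single = [RequestRecord(start=0.0, end=4.0, output_tokens=200)]
loaded = [RequestRecord(start=0.0, end=8.0, output_tokens=200) for _ in range(8)]

for label, run in (("concurrency 1", single), ("concurrency 8", loaded)):
    print(label, aggregate_tps(run), mean_per_request_tps(run))
```

In this invented example, aggregate throughput rises from 50 to 200 tokens per second while each individual request slows from 50 to 25. Whether that trade is acceptable depends on the workload: a batch job cares about the first number, a waiting user about the second.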

Step 3: Prioritize Time to First Token (TTFT) and Its Stability for Interactive Applications

Time to First Token (TTFT) is the duration between sending a request to the LLM and receiving the very first token of its response. For interactive, user-facing applications, TTFT is the metric users feel most acutely.
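One way to measure TTFT on the client side is to start a clock just before the request is sent and stop it when the first streamed token arrives. The sketch below assumes a hypothetical `open_stream` callable that sends the request and returns an iterator of tokens; `fake_stream` is a stand-in that simulates delays with `time.sleep` and is not a real client.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_ttft(open_stream: Callable[[], Iterable[str]]) -> Tuple[float, int]:
    """Return (ttft_seconds, token_count) for one streamed response.

    open_stream is assumed to send the request and return a token iterator,
    so the clock starts before any network work happens.
    """
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in open_stream():
        if ttft is None:
            # First token observed: record the elapsed time once.
            ttft = time.perf_counter() - start
        count += 1
    if ttft is None:
        raise RuntimeError("stream produced no tokens")
    return ttft, count


def fake_stream(first_delay: float, gap: float, n: int) -> Iterator[str]:
    """Simulated stream: wait first_delay, then emit n tokens gap seconds apart."""
    time.sleep(first_delay)
    for i in range(n):
        if i > 0:
            time.sleep(gap)
        yield f"tok{i}"


ttft, n_tokens = measure_ttft(lambda: fake_stream(0.05, 0.01, 5))
print(f"TTFT: {ttft:.3f} s over {n_tokens} tokens")
```

Measuring from the client captures network and queueing delay as well as model latency, which is closer to what a user actually experiences than a server-side figure.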

The User Experience of TTFT

In a streaming chat interface, TTFT is the perceived delay between a user hitting 'send' and the AI's response beginning to appear. Even if the overall response generation speed (throughput) is moderate, a quick and consistent TTFT can make the application feel highly responsive. Users often prefer to see a response start immediately and stream out rather than wait for a complete response to arrive all at once, even if the complete response would finish sooner.

The Importance of Predictability and Stability

While a fast average TTFT is good, predictability and stability are even more critical. Users notice inconsistencies. A TTFT that is usually 0.2 seconds but occasionally spikes to 5 or 10 seconds due to cold starts or resource contention will lead to a frustrating user experience. The worst cases define user perception, not the median, so track tail latency (for example, the 95th and 99th percentiles) alongside the median. Focus on minimizing the variance in TTFT and keeping response times within a tight range even under varying load conditions.
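To see why the median alone can hide the problem, consider a small sketch that summarizes TTFT samples with nearest-rank percentiles. The sample data is invented to mirror the scenario above (mostly 0.2 seconds with a few 5-second spikes), and `percentile` is an illustrative helper rather than any particular library's implementation.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Invented data: 97 fast responses and 3 slow ones (e.g. cold starts).
ttft_samples = [0.2] * 97 + [5.0] * 3

for pct in (50, 95, 99):
    print(f"p{pct}: {percentile(ttft_samples, pct)} s")
print(f"max: {max(ttft_samples)} s")
```

With this data, the median and the 95th percentile both report 0.2 seconds, and only the 99th percentile and the maximum expose the 5-second spikes. A dashboard that stops at the median would show a healthy service while roughly one request in thirty feels broken.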

Step 4: Assess Model Availability and Reliability

Availability, often overlooked in initial benchmarks, is a decisive metric for production systems. A model that is incredibly fast when it works is worthless if it's frequently unavailable or suffers from long, unpredictable cold starts.

Key Aspects of Availability

Consistent Access: Ensure that the models you rely on are consistently accessible and not gated behind specific endpoint types or subject to erratic performance.
Cold Starts: Serverless environments can introduce