In Part 1 we established the two fundamentals: parameters define how big the model is, and tokens define how much work you’re asking it to do. Now we make it practical.
This post is about the bridge between those concepts and actual hardware — specifically, how you translate a customer’s real-world requirements (“we need to support 500 users”) into a GPU count you can put in a proposal.
The key metric that connects the two sides is tokens per second (TPS). To use it properly, you need to understand what’s actually happening inside the GPU when a model generates a response — because not all tokens are created equal.
Two Phases, Two Different Problems
When an LLM handles a request, it does so in two distinct phases. They look similar from the outside — text goes in, text comes out — but they have fundamentally different performance characteristics under the hood.
Phase 1: Prefill
This is where the model reads and processes the entire input prompt.
- All the input tokens in your prompt are processed in parallel.
- This phase is compute-intensive — the GPU is doing a lot of simultaneous maths.
- It largely determines Time to First Token (TTFT): how long the user waits before they see any response at all.
Phase 2: Decode
This is where the model generates the response, one token at a time.
- Each new token depends on the previous ones, so this phase is inherently sequential.
- The critical insight for sizing: the decode phase is usually limited not by the GPU’s raw FLOPS but by memory bandwidth, that is, how fast the GPU can stream the model’s weights from high-bandwidth memory (HBM) to generate each token.
A quick note on FLOPS
You’ll see FLOPS quoted constantly in GPU spec sheets, so it’s worth understanding what it actually means — and where it does and doesn’t tell the full story.
FLOPS stands for Floating-Point Operations Per Second. It measures how much numerical computation a processor can perform per second. LLMs are essentially enormous stacks of matrix multiplications on floating-point numbers, so FLOPS is a natural unit for describing raw GPU compute power.
Vendors typically quote performance in:
- TFLOPS (tera-FLOPS = 10¹²) or PFLOPS (peta-FLOPS = 10¹⁵)
- Often broken down by precision: FP32 TFLOPS, FP16/BF16 TFLOPS, INT8 TOPS
So when you see “H100: X PFLOPS (FP16)”, that’s the peak theoretical compute at 16-bit precision — not what you’ll observe in a real LLM workload once memory access patterns, batching, and framework overhead come into play.
Here’s how FLOPS maps to the two inference phases:
- Prefill is FLOPS-hungry. Processing all prompt tokens in parallel is a heavy matrix multiplication workload — this is where raw compute throughput matters most. Higher FLOPS directly improves prefill speed and reduces TTFT.
- Decode is usually not FLOPS-bound. Generating tokens sequentially doesn’t saturate the GPU’s arithmetic units, especially at small batch sizes. The bottleneck shifts to memory bandwidth: how fast the GPU can stream model weights from HBM for each token generated.
This distinction matters enormously in practice: a GPU with impressive FLOPS but modest memory bandwidth can underperform for LLM inference compared to one with higher bandwidth, even if the spec sheet comparison looks favourable. It’s why memory bandwidth is often the first number to check when evaluating accelerators for inference workloads — and why the H100 SXM, with its multi-TB/s HBM3 bandwidth, consistently outperforms lower-bandwidth alternatives for decode-heavy deployments.
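The split can be put into rough numbers with two back-of-envelope bounds. The sketch below assumes the common approximation of about 2 FLOPs per parameter per token for a forward pass, and treats single-stream decode as one full read of the weights per generated token. The hardware figures (roughly 989 TFLOPS dense FP16 and 3.35 TB/s for an H100 SXM) are spec-sheet peaks used here as assumptions, and the function names are purely illustrative.

```python
# Back-of-envelope bounds for the two inference phases.
# Assumptions (not benchmark data): ~2 FLOPs per parameter per token,
# spec-sheet peak figures, no KV-cache traffic, no framework overhead.

def prefill_time_s(params: float, prompt_tokens: int, peak_flops: float) -> float:
    """Lower bound on prefill time: total FLOPs divided by peak compute."""
    return 2 * params * prompt_tokens / peak_flops


def decode_tps_ceiling(params: float, bytes_per_param: float,
                       bandwidth_bytes_s: float) -> float:
    """Upper bound on single-stream decode speed: one weight read per token."""
    return bandwidth_bytes_s / (params * bytes_per_param)


H100_FP16_FLOPS = 989e12  # assumed dense FP16 peak, FLOP/s
H100_HBM_BW = 3.35e12     # assumed HBM3 bandwidth, bytes/s

params_7b = 7e9
ttft_floor = prefill_time_s(params_7b, 500, H100_FP16_FLOPS)
tps_ceiling = decode_tps_ceiling(params_7b, 2, H100_HBM_BW)

print(f"7B prefill floor (500-token prompt): {ttft_floor * 1000:.1f} ms")
print(f"7B single-stream decode ceiling: {tps_ceiling:.0f} tokens/s")
```

Both numbers are ceilings under ideal conditions; measured results land below them. The point is the shape: prefill scales with compute, while single-stream decode scales with bandwidth divided by model size.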
The Core Metric: Tokens Per Second
Tokens per second (TPS) is your fundamental unit of inference throughput. Everything in a sizing conversation eventually traces back to it.
There are two ways to look at TPS, and you need to keep them separate:
- Per-user TPS: how fast tokens are delivered to a single user.
  - This drives the perceived experience.
  - Rough guide: below 10–15 tokens/sec starts to feel sluggish; above 30 tokens/sec it feels near-instant for most chat use cases.
- System TPS: the total token output across all concurrent users.
  - This is what you’re actually sizing the hardware to sustain.
The relationship is simple in principle:
System TPS = Concurrent Users × Tokens per Second per User
In practice, batching is what makes this efficient:
- Rather than serving each user’s request on dedicated GPU resources, a well-configured inference server groups multiple requests together and processes them as a single batch.
- This significantly improves GPU utilisation — particularly during the memory-bandwidth-bound decode phase.
- Batching is the primary mechanism that lets you serve many users from a relatively small GPU footprint.
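One way to see why batching helps is an idealised model in which each decode step streams the weights once regardless of batch size, so step time stays flat until compute becomes the limit. The sketch below makes that assumption explicit. It ignores KV-cache reads, scheduling, and framework overhead, and the hardware figures are assumed spec-sheet peaks, so read the output as an upper-bound shape rather than a benchmark.

```python
# Idealised batched-decode model (a sketch under stated assumptions).
# Step time = max(time to stream the weights once, time to do the batch's maths).

def decode_step_time_s(params, bytes_per_param, batch, bandwidth_bytes_s, peak_flops):
    memory_time = params * bytes_per_param / bandwidth_bytes_s  # one weight read per step
    compute_time = 2 * params * batch / peak_flops              # ~2 FLOPs per param per token
    return max(memory_time, compute_time)


def system_tps(params, bytes_per_param, batch, bandwidth_bytes_s, peak_flops):
    # Each decode step emits one token for every request in the batch.
    step = decode_step_time_s(params, bytes_per_param, batch,
                              bandwidth_bytes_s, peak_flops)
    return batch / step


BW = 3.35e12    # assumed HBM bandwidth, bytes/s
FLOPS = 989e12  # assumed dense FP16 peak, FLOP/s

for batch in (1, 8, 32):
    tps = system_tps(7e9, 2, batch, BW, FLOPS)
    print(f"batch={batch:>3}: ~{tps:,.0f} tokens/s aggregate, ~{tps / batch:.0f} per user")
```

In this model, aggregate throughput grows almost linearly with batch size while per-user speed holds steady, which is the effect batching exploits. Real systems fall short of this ideal because of the factors the model leaves out.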
Working Backwards: From Users to GPUs
Here’s the sizing workflow that turns a customer conversation into a hardware recommendation.
Step 1: Define the workload
Start with the usage-side discovery questions from Part 1:
- How many concurrent users?
- What’s the average prompt length (input tokens)?
- What’s the expected response length (output tokens)?
- What’s the acceptable latency — both time to first token (TTFT) and total response time?
A worked example
A customer wants to deploy an internal assistant. Together you define:
| Parameter | Value |
|---|---|
| Concurrent users | 200 |
| Average prompt | 500 tokens |
| Average response | 300 tokens |
| Target response time | ~10 seconds |
| Acceptable TTFT | < 2 seconds |
Step 2: Calculate required system TPS
From the example:
- 300 output tokens in 10 seconds = 30 tokens/sec per user
- 200 users × 30 tokens/sec = 6,000 tokens/sec system throughput
So the platform needs to sustain ~6,000 TPS of decoded tokens under load.
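The same arithmetic is easy to capture in a few lines so it can be rerun as the customer's numbers change. This is a minimal sketch: the `Workload` class and its field names are illustrative, and like the calculation above it treats the full response time as decode time.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    concurrent_users: int
    prompt_tokens: int
    response_tokens: int
    response_time_s: float

    def per_user_tps(self) -> float:
        # Decode rate each user needs to receive the response in time.
        return self.response_tokens / self.response_time_s

    def system_tps(self) -> float:
        # Aggregate decode throughput across all concurrent users.
        return self.concurrent_users * self.per_user_tps()


example = Workload(concurrent_users=200, prompt_tokens=500,
                   response_tokens=300, response_time_s=10)
print(example.per_user_tps())  # 30.0
print(example.system_tps())    # 6000.0
```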
Step 3: Establish per-GPU TPS for your chosen model
This is where model size and GPU choice meet. As a rough reference for inference at FP16 (actual figures vary with batch size, framework, and optimisation):
| Model | GPU | Approx. TPS (decode, batched) |
|---|---|---|
| 7B | H100 80GB | ~2,000–3,000 |
| 70B (tensor parallel, 4×) | 4× H100 80GB | ~800–1,200 |
| 70B (tensor parallel, 8×) | 8× H100 80GB | ~1,500–2,500 |
Note: these are illustrative ranges. Always validate against benchmark data for your specific model, serving framework, optimisation level (TensorRT-LLM, vLLM, etc.), and batch configuration.
Step 4: Calculate GPU or node count
Continuing the example, assume:
- You choose a 70B model hosted on 4× H100 80GB nodes.
- Based on benchmarks, you take a conservative estimate of 1,000 TPS per node (decode, batched).
Then:
- 6,000 system TPS ÷ 1,000 TPS per node ≈ 6 nodes
Add a headroom buffer (typically 20–30% for burst traffic, uneven load, and future growth):
- 6 nodes × 1.25 = 7.5, rounded up to 8 nodes as a starting recommendation (32 GPUs at 4 per node).
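The node calculation, including the round-up, can be sketched as a small helper. The function name and the default 25% headroom are illustrative choices, not a fixed rule.

```python
import math


def nodes_required(system_tps: float, tps_per_node: float,
                   headroom: float = 0.25) -> int:
    """Nodes needed to sustain system_tps with a fractional headroom buffer."""
    if tps_per_node <= 0:
        raise ValueError("tps_per_node must be positive")
    # Always round up: a fraction of a node still means buying a whole one.
    return math.ceil(system_tps / tps_per_node * (1 + headroom))


print(nodes_required(6000, 1000))                # 8
print(nodes_required(6000, 1000, headroom=0.0))  # 6
```

For this example, any headroom between 20% and 30% rounds up to the same 8 nodes, so the recommendation is not sensitive to the exact buffer chosen.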
At this point, you have a defensible answer to “how many GPUs/nodes do we need?” that’s grounded in user requirements, not just “bigger is better.”
Reference Sizing: Two Common Scenarios
The worked example above walks through the methodology. The table below applies it to two reference architectures you’ll encounter regularly — a 7B internal assistant and a 70B RAG system — to give you a practical feel for how the numbers land.
Figures assume FP16 or INT8 precision, batched inference, and a well-optimised serving framework such as TensorRT-LLM or vLLM. Treat these as directional reference points, not guaranteed benchmarks — validate against your specific model, configuration, and workload before quoting.
| Aspect | Ref A: 7B Internal Assistant | Ref B: 70B RAG System |
|---|---|---|
| Typical use case | Employee Q&A, productivity assistant | Legal/finance/engineering RAG over proprietary data |
| Model size | 7B | 70B |
| Quality vs cost | “Good enough” quality, cost-optimised | Higher quality, domain-heavy reasoning |
| Concurrency (peak) | ~500 users | ~100 users |
| Avg prompt (input) | ~400 tokens | ~2,000 tokens (incl. retrieved context) |
| Avg response (output) | ~250 tokens | ~500 tokens |
| Latency target | 8–10 s total, TTFT < 2 s | 12–15 s total, TTFT < 3 s |
| System TPS target | ≈ 12,500 TPS (decode) | ≈ 3,300 TPS (decode) |
| Precision (typical) | INT8 / mixed (weights) | FP16 / mixed, selective quantisation |
| GPUs per node (typical) | 3–4 GPUs per PowerEdge node | 8 GPUs per PowerEdge XE-class node |
| Nodes (illustrative) | ≈ 3–4 nodes (total ~12 GPUs, incl. headroom) | ≈ 3 nodes (24 GPUs total, incl. headroom) |
| Interconnect focus | Good PCIe + 25–100 GbE | NVLink/NVSwitch + 100–400 Gb fabric |
| Workload pattern | High concurrency, chat-like | Lower concurrency, long prompts, RAG + heavier reasoning |
| Sizing conversation hook | “Maximise users per GPU, acceptable quality” | “Maximise quality on key workflows, moderate concurrency” |
Notice the counterintuitive result: the smaller 7B model actually demands nearly four times the system throughput of the 70B RAG system (≈12,500 TPS vs ≈3,300 TPS). That’s not a contradiction; it’s the concurrency effect. Serving 500 chat users simultaneously generates far more aggregate token output than 100 users running deep reasoning queries, even though each individual 7B response is shorter. The 70B system still ends up with roughly twice the GPUs (24 vs ~12), because each GPU delivers far fewer tokens per second on the larger model. The takeaway: a bigger model doesn’t automatically mean a bigger throughput target, and workload pattern matters just as much as parameter count.
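The two throughput targets in the table can be reproduced from the concurrency, response length, and latency rows. The sketch below assumes the upper end of each latency range (10 s and 15 s), which is what appears to match the table's figures; the helper name is illustrative.

```python
def system_tps(users: int, response_tokens: int, total_time_s: float) -> float:
    """Aggregate decode throughput needed to finish every response in time."""
    return users * response_tokens / total_time_s


ref_a = system_tps(users=500, response_tokens=250, total_time_s=10)  # 7B assistant
ref_b = system_tps(users=100, response_tokens=500, total_time_s=15)  # 70B RAG

print(f"Ref A: {ref_a:,.0f} TPS")         # 12,500
print(f"Ref B: {ref_b:,.0f} TPS")         # 3,333
print(f"Ratio: {ref_a / ref_b:.2f}x")     # 3.75x
```

Using the tighter end of each latency range (8 s and 12 s) raises both targets but leaves the ratio unchanged, since both ranges tighten by the same proportion.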
A few additional assumptions behind these figures worth keeping front of mind:
- No fine-tuning overhead — these are inference-only configurations. If the customer plans on-premises fine-tuning, GPU and memory requirements increase substantially.
- Steady-state load — the node counts include a 20–30% headroom buffer but assume reasonably predictable peak concurrency. Highly bursty workloads (e.g. end-of-day batch spikes) may warrant additional headroom or an autoscaling strategy.
- Single-tenant deployment — figures assume dedicated GPU resources per workload. Multi-model or multi-tenant deployments require separate sizing treatment.
- Retrieved context included: the 70B RAG prompt size of ~2,000 tokens already includes retrieved document chunks. If the retrieval pipeline is later tuned to return more or larger chunks, prompt tokens, and therefore prefill work and TTFT, will increase accordingly.
The Trade-offs Every Customer Faces
Once you’ve run this exercise with a customer, three trade-off conversations typically follow.
1. Model quality vs. throughput
- A 70B model usually produces higher-quality outputs than a 7B.
- But it also serves far fewer users per GPU.
- For some use cases — summarising legal documents, writing complex code, specialised reasoning — the quality premium is worth it.
- For a high-volume customer service assistant, a well-tuned 7B model might deliver better economics with acceptable quality.
2. Latency vs. concurrency
- Larger batch sizes improve GPU utilisation and system throughput, but each request can wait longer to join a batch and may receive its tokens more slowly once it is in one.
- If TTFT is critical (live chat, voice interfaces), you’ll accept lower utilisation to keep batches small and responsive.
- If the application is asynchronous (batch document processing, offline analytics), you can run large batches, push utilisation higher, and drive down cost per request.
3. Precision vs. memory footprint
- Running a model at FP16 gives you full quality but also the full VRAM cost.
- Quantising to INT8 or INT4 roughly halves or quarters the weight memory footprint relative to FP16, respectively, allowing either a larger model to fit in the same GPUs, or the same model to fit in fewer GPUs.
- There is a quality trade-off, but for many inference workloads, well-done INT8 quantisation offers an excellent quality-to-cost ratio and is worth including in the conversation.
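The memory side of this trade-off is simple arithmetic: parameters multiplied by bytes per parameter. The sketch below covers weights only. KV cache, activations, and runtime overhead come on top and are deliberately left out, and the names are illustrative.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_memory_gb(params: float, precision: str) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_PARAM[precision] / 1e9


for precision in BYTES_PER_PARAM:
    gb = weight_memory_gb(70e9, precision)
    print(f"70B @ {precision}: ~{gb:.0f} GB of weights")
```

For a 70B model this gives roughly 140 GB at FP16, 70 GB at INT8, and 35 GB at INT4, which is why quantisation can change how many 80 GB GPUs the weights need before anything else is counted.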
What This Means for Platform Selection
By this point in a customer conversation, you have enough to make an informed platform recommendation.
Single-node, lower concurrency, 7B–13B models
A PowerEdge server with 2–4 high-memory GPUs will typically cover the requirement, with room to scale up or out as usage grows.
Multi-node, higher concurrency, or 70B+ models
You’re looking at GPU-dense platforms where high-speed interconnect between GPUs (NVLink, NVSwitch) and network fabric between nodes become as important as raw GPU count. These directly affect prefill and decode performance, and therefore both latency and throughput.
Mixed workloads (inference + fine-tuning)
Fine-tuning demands significantly more memory per GPU than inference alone (optimiser states, gradient storage, larger activations). If a customer plans both, size for fine-tuning; the inference requirement is typically covered as a result.
The specific Dell platform mapping — PowerEdge XE series, GPU configurations, and interconnect options — is what we’ll build out in Part 3.
Next up: Part 3 — Platform and GPU selection: mapping your sizing to Dell PowerEdge XE configurations. We’ll take the TPS-based sizing approach from this post and show how it translates into concrete server configs you can quote.
