LLM Sizing 101 – Part 3: Platform and GPU Selection

[Figure: the LLM sizing chain, with flowcharts covering model size, precision, tokens per second, GPU count, node count, and platform specifications.]

Mapping your sizing to Dell PowerEdge XE configurations

In Part 1 we nailed down the two fundamentals: parameters define how big the model is, and tokens define how much work you’re asking it to do.

In Part 2 we made it practical — translating a customer’s real-world requirements into a target tokens-per-second figure, and from there into a GPU count.

Now we make it concrete.

Building on the methodology from Part 2, we apply it to two representative scenarios — a 7B internal assistant and a 70B RAG system — and map everything to actual Dell PowerEdge XE platform configurations you can put in a proposal. But before we get to the reference designs, there’s a gotcha.


The Gotcha: Model Precision

There’s a variable that can silently double — or halve — your GPU count if you don’t nail it down early in the conversation.

Model precision.

When a customer says “we want to run a 70B model,” that sentence is incomplete. The question you need to ask immediately is: at what model precision?

Here’s why it matters so much. The memory footprint of a model is:

Model VRAM (GB) = number of parameters × bytes per parameter

And the bytes-per-parameter figure is entirely determined by precision:

| Precision | Bytes per parameter | 70B model, weights only | Notes |
| --- | --- | --- | --- |
| FP32 | 4 bytes | ~280 GB | Training only; rare in inference |
| FP16 / BF16 | 2 bytes | ~140 GB | Full quality baseline |
| FP8 | 1 byte | ~70 GB | Requires H100, H200, B300 class |
| INT8 | 1 byte | ~70 GB | Broad hardware support |
| INT4 | 0.5 bytes | ~35 GB | Validate quality before committing |
| FP4 | 0.5 bytes | ~35 GB | Blackwell generation only (e.g. B300/GB300); first-class inference precision |

Run the same 70B model at FP16 versus INT4 and the weights footprint changes by 4×. That’s the difference between needing two 8-GPU nodes and needing one. It’s the difference between a £400k proposal and a £200k proposal. And it’s a variable that’s completely invisible if you skip the precision conversation.
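The weights-only arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a sizing tool: the function name and the lookup table are made up for this example, and the result excludes KV cache, activations and runtime overhead.

```python
# Bytes per parameter for common precisions (weights only).
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16": 2.0,
    "bf16": 2.0,
    "fp8": 1.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weights_footprint_gb(params_billions: float, precision: str) -> float:
    """Weights-only footprint in GB, taking 1 GB as 1e9 bytes."""
    # (params_billions * 1e9 params) * bytes_per_param / 1e9 bytes per GB
    return params_billions * BYTES_PER_PARAM[precision.lower()]


for precision in ("fp32", "fp16", "fp8", "int4"):
    print(f"70B @ {precision}: ~{weights_footprint_gb(70, precision):.0f} GB")
```

Swapping 70 for 7 reproduces the 7B figures used in Reference Design A further down.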

How to find out

The good news: model precision is almost always discoverable before you size anything.

The model card. Every published model has a model card stating the native training precision — typically FP32, BF16, or FP16 — and whether pre-quantised versions exist. Llama 3.1 405B, for example, is published in BF16 with a separate FP8-quantised version available for single-node deployment. That’s not a footnote — it’s a hardware decision.

The deployment framework. When a customer tells you they’re using vLLM, TensorRT-LLM, or NVIDIA NIM, the framework makes precision explicit. NIM profiles are named by precision — tensorrt_llm-h100-fp8-tp2-latency tells you the precision, the GPU, and the parallelism strategy in one string. If the customer has already chosen a framework, ask what precision they’re deploying at — they’ll either know, or the question will prompt them to find out.

The GPU itself. Not all GPUs support all precisions. FP8 requires H100, H200, B300 or AMD MI300X class hardware. FP4 is a Blackwell-generation feature (B200, B300, GB300) and isn’t available on Hopper or earlier generations. INT4 with hardware acceleration requires specific tensor core support. If the customer has already chosen a GPU, that constrains the precision options, and vice versa. The two decisions are linked.

The precision conversation in practice

When a customer names a model, these are the three questions that unlock the sizing:

  • “Are you using the native model weights, or a quantised version?”
  • “What serving framework are you planning to use?”
  • “Is some accuracy trade-off acceptable in exchange for a smaller hardware footprint?”

That last question is the most important one. Modern quantisation techniques — GPTQ, AWQ, SmoothQuant — preserve the vast majority of model quality for most enterprise inference workloads. The difference between BF16 and INT8 is typically imperceptible for summarisation, search, classification and code assistance. For complex multi-step reasoning or fine-tuned models, it’s worth validating. But for the majority of use cases, INT8 or FP8 is a legitimate production choice — not a compromise.

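The rule of thumb: the bigger the model, the more gracefully it quantises, at least for most enterprise inference workloads. At the same precision, a 70B model typically loses proportionally less quality than a 7B model, so aggressive formats such as INT4 are a safer bet on larger models.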

Get precision wrong — or leave it undefined — and every GPU count in your proposal is built on a shaky foundation. Get it right, and you have a sizing conversation that’s grounded, defensible, and often more cost-effective than the customer expected.


Two Reference Designs

With precision established, everything else follows.

Sizing disclaimer: The reference designs below illustrate the methodology — they are not a substitute for your own sizing exercise. TPS figures, GPU counts and node recommendations are directional reference points based on representative workloads. Actual performance will vary with your specific model, serving framework, quantisation approach, batch configuration and workload pattern. Always validate against benchmark data for your environment before quoting or committing to a configuration.

These aren’t rigid prescriptions — they’re starting points you can adapt by adjusting the inputs and re-running the TPS maths from Part 2.


Reference Design A: 7B Internal Assistant

Use case: An internal productivity assistant — employees asking about policies, summarising documents, drafting emails. High concurrency, moderate latency sensitivity, cost-conscious.

1. Define the workload

| Parameter | Value |
| --- | --- |
| Concurrent users (peak) | 500 |
| Average prompt | 400 tokens |
| Average response | 250 tokens |
| Target response time | ~8–10 seconds |
| Acceptable TTFT | < 2 seconds |
| Model | 7B class |

2. Establish precision and memory footprint

For a 7B model:

| Precision | Weights footprint | Fits on a single GPU? |
| --- | --- | --- |
| FP16 / BF16 | ~14 GB | Yes (48–80 GB class) |
| INT8 | ~7 GB | Yes, comfortably |
| INT4 | ~3.5 GB | Yes, with significant headroom |

For a high-concurrency internal assistant, INT8 or mixed precision (weights in INT8, activations in FP16/BF16) is the practical default. It fits cleanly on a single GPU, leaves room for KV cache and batching overhead, and the quality trade-off is negligible for this kind of workload.

3. Translate to TPS

  • 250 output tokens ÷ 10 seconds = 25 tokens/sec per user
  • 500 users × 25 tokens/sec = 12,500 tokens/sec system TPS
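The two lines above are the per-user and system TPS formulas from Part 2 applied to this workload. As a small sketch (function names are illustrative only), using the upper end of the 8–10 second response target:

```python
def per_user_tps(response_tokens: int, response_seconds: float) -> float:
    """Decode rate one user needs to receive the full response in time."""
    return response_tokens / response_seconds


def system_tps(concurrent_users: int, user_tps: float) -> float:
    """Aggregate decode rate across all concurrent users."""
    return concurrent_users * user_tps


user_rate = per_user_tps(response_tokens=250, response_seconds=10)
print(user_rate)                   # 25.0
print(system_tps(500, user_rate))  # 12500.0
```

Tightening the target to 8 seconds raises the per-user rate to 31.25 tokens/sec and the system target to 15,625 TPS, so it is worth agreeing which end of the latency range is the real requirement.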

4. Per-GPU TPS estimate

For a 7B model at INT8/mixed precision, batched decode on a high-end accelerator:

| GPU | Approx. TPS (7B, batched decode) |
| --- | --- |
| H100 80GB SXM | ~2,000–3,000 |
| H200 141GB | ~2,500–3,500 |
| L40S 48GB | ~1,000–1,500 |
| B300 288GB | ~4,000–6,000 (est.) |

Conservative estimate: 1,500 TPS per GPU on current generation; higher on B300.

5. GPU and node count

  • 12,500 TPS ÷ 1,500 TPS/GPU ≈ 8.3 GPUs
  • Add 25% headroom: 8.3 × 1.25 ≈ 10.4 GPUs → round up to 12 for a clean 3 × 4 configuration
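The headroom and rounding steps can be expressed as a short function. This is a sketch under the assumptions stated above (1,500 TPS per GPU, 25% headroom, four GPUs per node); the function name and return layout are invented for illustration.

```python
import math


def gpus_required(target_tps, tps_per_gpu, headroom=0.25, gpus_per_node=4):
    """Return (raw GPUs, GPUs with headroom, whole nodes, GPUs installed)."""
    raw = target_tps / tps_per_gpu
    with_headroom = raw * (1 + headroom)
    # Round up to whole nodes so every node is populated identically.
    nodes = math.ceil(with_headroom / gpus_per_node)
    return raw, with_headroom, nodes, nodes * gpus_per_node


raw, padded, nodes, installed = gpus_required(12_500, 1_500)
print(f"{raw:.1f} GPUs raw, {padded:.1f} with headroom")
print(f"{nodes} nodes, {installed} GPUs installed")
```

The result is sensitive to the per-GPU estimate: at 1,000 TPS per GPU the same call returns 4 nodes and 16 GPUs, which is why the per-GPU figure should come from a benchmark rather than a table.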

6. Platform mapping

A 7B model at INT8 fits on a single GPU — no tensor parallelism required. Each GPU runs an independent model replica and you scale out horizontally across nodes. This is compact, balanced GPU server territory.

The Dell PowerEdge XE7745 is the natural fit for this workload class: a 4U air-cooled platform that supports up to eight double-width PCIe GPUs, designed for exactly this kind of inference deployment. Populating four GPUs per node matches the 3 × 4 layout above and leaves slots free for growth. For organisations planning ahead with Blackwell, the XE7745 also supports NVIDIA RTX Pro 6000 Blackwell Server Edition GPUs, a professional-grade accelerator with 96 GB GDDR7 that offers significant headroom for a 7B workload and future-proofing for multi-model environments, at a lower power and cost envelope than data centre HBM-class GPUs.

“For a 7B internal assistant serving ~500 concurrent users, a small cluster of three PowerEdge XE7745 nodes gives you a responsive chat experience, capacity to grow, and the flexibility to host multiple models or environments — all in a standard rack footprint.”


Reference Design B: 70B RAG System

Use case: Knowledge-heavy workflows — legal, financial or engineering teams querying proprietary documents via a RAG pipeline. Quality matters more than raw user count. Concurrency is moderate.

1. Define the workload

| Parameter | Value |
| --- | --- |
| Concurrent users (peak) | 100 |
| Average prompt | 2,000 tokens |
| Average response | 500 tokens |
| Target response time | ~12–15 seconds |
| Acceptable TTFT | < 3 seconds |
| Model | 70B class |

Prompts are longer here because RAG injects retrieved document snippets, conversation history and system instructions into every request. That longer context window drives up KV cache memory — which is why the platform choice shifts significantly compared to Reference Design A.
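To get a feel for how prompt length drives KV cache memory, here is a rough estimate. The architecture numbers (80 layers, 8 KV heads with grouped-query attention, head dimension 128, 2-byte cache values) are assumptions typical of publicly documented 70B-class models, not figures from this series; check the model card or config of the model actually being deployed.

```python
def kv_cache_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2: one key vector and one value vector per layer per KV head.
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_gb(tokens_per_request, concurrent_requests, per_token_bytes):
    """Worst case: every request holds its full context at the same time."""
    return tokens_per_request * concurrent_requests * per_token_bytes / 1e9


# Assumed 70B-class architecture (hypothetical values for illustration).
per_token = kv_cache_bytes_per_token(layers=80, kv_heads=8, head_dim=128)

rag = kv_cache_gb(2_000 + 500, 100, per_token)   # Reference Design B shape
chat = kv_cache_gb(400 + 250, 100, per_token)    # Design A prompt sizes, same model

print(f"{per_token / 1024:.0f} KiB per token")
print(f"RAG-sized requests:  ~{rag:.1f} GB of KV cache")
print(f"Chat-sized requests: ~{chat:.1f} GB of KV cache")
```

Under these assumptions, 100 concurrent RAG-sized requests need roughly four times the KV cache of 100 chat-sized ones, on top of the weights.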

2. Establish precision and memory footprint

This is where precision has the biggest impact on the proposal — and where the B300 changes the calculus significantly:

| Precision | Weights footprint | H100/H200 GPUs needed | B300 GPUs needed | Notes |
| --- | --- | --- | --- | --- |
| FP16 / BF16 | ~140 GB | 2 minimum | 1 (fits with headroom) | Full quality; B300’s 288 GB changes the equation |
| FP8 | ~70 GB | 1 minimum | 1 | Near-FP16 quality; requires H100/H200/B300 |
| INT8 | ~70 GB | 1 minimum | 1 | Minimal quality loss for most workloads |
| INT4 | ~35 GB | 1 | 1 | Validate quality before committing |
| FP4 | ~35 GB | N/A (no FP4 on H100/H200) | 1, with substantial headroom | Validate quality for RAG use cases |

A key implication for B300 deployments: with 288 GB of HBM3e per GPU, a 70B model at FP16 (~140 GB weights) fits on a single B300. That eliminates the need for tensor parallelism within the node for this model size, simplifying the architecture and reducing interconnect dependency.
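A quick way to sanity-check the "GPUs needed" figures is to divide the weights footprint by the usable memory per GPU. In the sketch below, the 90% usable fraction is an assumption to leave a little room for runtime overhead; it is not a vendor figure, and the result covers weights only, so KV cache for long RAG prompts will push the real requirement higher.

```python
import math


def min_gpus_for_weights(weights_gb, gpu_vram_gb, usable_fraction=0.9):
    """Smallest GPU count whose usable memory holds the weights alone."""
    usable_gb = gpu_vram_gb * usable_fraction
    return math.ceil(weights_gb / usable_gb)


for name, vram in (("H100 80GB", 80), ("H200 141GB", 141), ("B300 288GB", 288)):
    fp16 = min_gpus_for_weights(140, vram)
    fp8 = min_gpus_for_weights(70, vram)
    print(f"{name}: FP16 needs {fp16}, FP8 needs {fp8}")
```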

For a legal or financial RAG workload where output quality is the primary requirement, FP16 or FP8 remains the right starting point. FP4 on B300 is increasingly viable but worth validating explicitly against the customer’s specific domain before committing.

3. Translate to TPS

  • 500 output tokens ÷ 15 seconds = 33 tokens/sec per user
  • 100 users × 33 tokens/sec = 3,300 tokens/sec system TPS

4. Per-node TPS estimate

For a 70B model running on high-end accelerators:

| Configuration | Approx. TPS (70B, batched decode) |
| --- | --- |
| 4× H100 80GB (tensor parallel) | ~800–1,200 |
| 8× H100 80GB (tensor parallel) | ~1,500–2,500 |
| 8× H200 141GB (tensor parallel) | ~2,000–3,500 |
| 8× B300 288GB (HGX B300) | ~4,000–7,000 (est.) |

Conservative estimate on an 8× H100 node: 1,500 TPS. On an 8× B300 node: significantly higher, with the added benefit that each GPU can host the full model independently.

5. Node count

  • 3,300 TPS ÷ 1,500 TPS/node (H100 baseline) ≈ 2.2 nodes
  • Add 25–30% headroom: 2.2 × 1.25 ≈ 2.75 → round to 3
  • Total: 3 nodes × 8 GPUs = 24 GPUs (H100/H200 baseline)

On B300 hardware, the same TPS target is achievable with fewer nodes — or the same node count delivers substantially higher capacity.

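Three nodes also gives you operational flexibility: with one node drained for maintenance, the remaining two still deliver roughly 3,000 TPS at the conservative 1,500 TPS/node estimate. That is slightly under the 3,300 TPS peak target, so schedule maintenance off-peak, or add a fourth node if full peak capacity must survive a node outage.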
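The node count and the maintenance question can both be checked with a few lines of arithmetic. This is a sketch with illustrative function names; the 2,000 TPS/node figure in the second check is simply the midpoint of the 8× H100 range quoted above, used to show how much the answer depends on the per-node estimate.

```python
import math


def nodes_required(target_tps, tps_per_node, headroom=0.25):
    """Whole nodes needed to meet the target plus a headroom buffer."""
    return math.ceil(target_tps / tps_per_node * (1 + headroom))


def survives_one_node_drain(nodes, tps_per_node, target_tps):
    """True if the cluster still meets the target with one node offline."""
    return (nodes - 1) * tps_per_node >= target_tps


nodes = nodes_required(3_300, 1_500)
print(nodes)                                        # 3
print(survives_one_node_drain(nodes, 1_500, 3_300))  # False: 3,000 < 3,300
print(survives_one_node_drain(nodes, 2_000, 3_300))  # True:  4,000 >= 3,300
```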

6. Platform mapping

For H100/H200 deployments, the Dell PowerEdge XE9680 with 8× H100 or H200 GPUs remains a proven reference platform for 70B inference, with NVLink and NVSwitch providing the fast GPU-to-GPU interconnect tensor parallelism requires.

For Blackwell deployments, the Dell PowerEdge XE9780 and XE9785 are the direct successors to the XE9680 — delivering up to 4× faster LLM performance with the 8-way NVIDIA HGX B300. The liquid-cooled XE9780L and XE9785L variants support higher GPU densities for rack-scale deployments.

Infrastructure note: B300 systems require liquid cooling, 800 Gb/s networking, and power densities that most existing facilities cannot support without upgrade. The B300 draws 1,400W TDP per GPU — 40% more than the B200, and double the H100. Factor facility readiness into any B300 sizing conversation before committing to a configuration.
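For a first-pass facility conversation, the GPU power alone is easy to total up. The sketch below uses the 1,400 W figure quoted above and derives the other two from the stated ratios (40% more than B200, double the H100). It counts GPUs only; CPUs, memory, NICs, fans and power-supply losses come on top, so treat the output as a floor rather than a node power budget.

```python
# Per-GPU power in watts: B300 from the text above, the other two
# derived from the ratios it quotes (1,400 / 1.4 and 1,400 / 2).
GPU_WATTS = {"H100": 700, "B200": 1_000, "B300": 1_400}


def gpu_power_kw(gpu: str, gpus_per_node: int = 8, nodes: int = 1) -> float:
    """Total GPU power draw in kW, excluding every other component."""
    return GPU_WATTS[gpu] * gpus_per_node * nodes / 1000


print(gpu_power_kw("H100"))           # one 8-GPU node
print(gpu_power_kw("B300"))           # one 8-GPU node
print(gpu_power_kw("B300", nodes=3))  # the three-node design above
```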

“For a 70B RAG assistant used by specialist teams — legal, finance, engineering — the PowerEdge XE9680 with H100/H200 GPUs remains a strong proven choice. For organisations investing in Blackwell infrastructure, the XE9780/XE9785 with HGX B300 delivers significantly higher throughput and eliminates tensor parallelism requirements for 70B class models — but facility readiness for liquid cooling and power density must be confirmed first.”


The Platform Decision in Summary

| Workload | Model | Precision | Current Platform | Blackwell Platform | GPUs/node | Nodes |
| --- | --- | --- | --- | --- | --- | --- |
| Internal assistant (high concurrency) | 7B | INT8 | PowerEdge XE7745 | XE7745 (RTX Pro 6000 BW) | 4× GPUs | 3 |
| RAG system (quality-first) | 70B | FP16 / FP8 | PowerEdge XE9680 | XE9780 / XE9785 (HGX B300) | 8× GPUs | 3 |

The pattern is consistent: model size drives platform class, precision drives memory footprint, TPS drives node count. Miss any one of those three and the sizing is incomplete.

For organisations moving beyond 70B — frontier models, multi-tenant inference at scale, or combined training and inference workloads — the Dell PowerEdge XE9712 featuring NVIDIA GB300 NVL72 is the next step up. With 72 Blackwell Ultra GPUs and up to 40 TB of fast memory per rack (combining ~20 TB of GPU HBM3e across 72 GPUs and ~17 TB of Grace CPU LPDDR5X), it delivers exascale-class AI performance for workloads that have outgrown the per-node sizing conversation entirely. That’s a different discussion — but it starts with the same methodology.


Three Trade-offs Worth Raising

Once you’ve walked a customer through a reference design, three conversations typically follow.

1. Can we use a smaller model? Sometimes yes — and it’s worth exploring. A well-tuned 13B model can deliver surprisingly strong results for many enterprise use cases, at a fraction of the infrastructure cost of a 70B. The right answer depends on the use case, not just the budget.

2. Can we quantise to reduce the footprint? INT8 quantisation roughly halves the memory footprint relative to FP16, with minimal quality loss for most inference workloads. INT4 goes further, but quality trade-offs become more noticeable and are worth validating before committing. FP4 on B300 hardware is the emerging sweet spot for next-generation inference: near-FP8 quality at half the memory cost, with hardware-accelerated compute. It does, however, require Blackwell-generation infrastructure.

3. What about fine-tuning? If the customer plans to fine-tune as well as infer, size for fine-tuning — it’s the more demanding workload. Fine-tuning requires storing optimiser states and gradients alongside the model weights, which can triple or quadruple the VRAM requirement compared to inference alone. A platform sized for fine-tuning will handle inference comfortably.


What’s Next

With three posts, we’ve built a complete sizing chain:

  • Part 1: Parameters and tokens — the two dials that drive every sizing decision
  • Part 2: From tokens per second to GPU count — the maths that connects users to hardware
  • Part 3: Precision, platform selection, and reference designs — where the maths meets the metal

The natural next conversation is the one that follows a sizing recommendation: how does an on-premises PowerEdge deployment compare to cloud over three years? That’s the cost modelling discussion — and it’s where a well-sized on-premises platform often tells a very different story to the cloud bill the customer is currently paying.



LLM Sizing 101 – Part 2: From Tokens Per Second to GPU Count

[Figure: flowchart of LLM sizing concepts, showing the prefill and decode phases, compute processes, the throughput bridge, and tokens-per-second metrics by GPU count.]

In Part 1 we established the two fundamentals: parameters define how big the model is, and tokens define how much work you’re asking it to do. Now we make it practical.

This post is about the bridge between those concepts and actual hardware — specifically, how you translate a customer’s real-world requirements (“we need to support 500 users”) into a GPU count you can put in a proposal.

The key metric that connects the two sides is tokens per second (TPS). To use it properly, you need to understand what’s actually happening inside the GPU when a model generates a response — because not all tokens are created equal.


Two Phases, Two Different Problems

When an LLM handles a request, it does so in two distinct phases. They look similar from the outside — text goes in, text comes out — but they have fundamentally different performance characteristics under the hood.

Phase 1: Prefill. This is where the model reads and processes the entire input prompt.

  • All the input tokens in your prompt are processed in parallel.
  • This phase is compute-intensive — the GPU is doing a lot of simultaneous maths.
  • It largely determines Time to First Token (TTFT): how long the user waits before they see any response at all.

Phase 2: Decode. This is where the model generates the response, one token at a time.

  • Each new token depends on the previous ones, so this phase is inherently sequential.
  • And here’s the critical insight for sizing: the decode phase is often not limited by the GPU’s raw FLOPS.
  • It’s limited by memory bandwidth — how fast the GPU can stream the model’s weights from high-bandwidth memory (HBM) to generate each token.

A quick note on FLOPS

You’ll see FLOPS quoted constantly in GPU spec sheets, so it’s worth understanding what it actually means — and where it does and doesn’t tell the full story.

FLOPS stands for Floating-Point Operations Per Second. It measures how much numerical computation a processor can perform per second. LLMs are essentially enormous stacks of matrix multiplications on floating-point numbers, so FLOPS is a natural unit for describing raw GPU compute power.

Vendors typically quote performance in:

  • TFLOPS (tera-FLOPS = 10¹²) or PFLOPS (peta-FLOPS = 10¹⁵)
  • Often broken down by precision: FP32 TFLOPS, FP16/BF16 TFLOPS, INT8 TOPS

So when you see “H100: X PFLOPS (FP16)”, that’s the peak theoretical compute at 16-bit precision — not what you’ll observe in a real LLM workload once memory access patterns, batching, and framework overhead come into play.

Here’s how FLOPS maps to the two inference phases:

  • Prefill is FLOPS-hungry. Processing all prompt tokens in parallel is a heavy matrix multiplication workload — this is where raw compute throughput matters most. Higher FLOPS directly improves prefill speed and reduces TTFT.
  • Decode is not FLOPS-bound. Generating tokens sequentially doesn’t saturate the GPU’s arithmetic units. At typical batch sizes the bottleneck shifts to memory bandwidth: how fast the GPU can stream model weights from HBM for each token generated.

This distinction matters enormously in practice: a GPU with impressive FLOPS but modest memory bandwidth can underperform for LLM inference compared to one with higher bandwidth, even if the spec sheet comparison looks favourable. It’s why memory bandwidth is often the first number to check when evaluating accelerators for inference workloads — and why the H100 SXM, with its multi-TB/s HBM3 bandwidth, consistently outperforms lower-bandwidth alternatives for decode-heavy deployments.
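A back-of-envelope model makes the compute-bound versus bandwidth-bound split tangible. The sketch below rests on two common approximations that are assumptions here, not statements from a vendor: a forward pass costs about 2 FLOPs per parameter per token, and each decode step streams the full weights from memory once, shared across the batch. The hardware numbers (1,000 TFLOPS peak, 40% utilisation, 3,350 GB/s) are round illustrative values; substitute figures from the spec sheet and benchmarks for the GPU in question.

```python
def prefill_seconds(params_billions, prompt_tokens, peak_tflops, utilisation=0.4):
    """Compute-bound estimate of the time to process a prompt."""
    flops = 2 * params_billions * 1e9 * prompt_tokens
    return flops / (peak_tflops * 1e12 * utilisation)


def decode_tps_upper_bound(weights_gb, bandwidth_gb_per_s, batch_size=1):
    """Bandwidth-bound ceiling on decode throughput.

    Ignores KV cache reads and the point at which large batches become
    compute-bound, so real throughput will be lower.
    """
    steps_per_second = bandwidth_gb_per_s / weights_gb
    return steps_per_second * batch_size


# 7B model at FP16 (about 14 GB of weights), 400-token prompt.
print(f"prefill: ~{prefill_seconds(7, 400, peak_tflops=1000) * 1000:.0f} ms")
print(f"decode, batch 1:  ~{decode_tps_upper_bound(14, 3350):.0f} tokens/s")
print(f"decode, batch 16: ~{decode_tps_upper_bound(14, 3350, 16):.0f} tokens/s")
```

The single-stream ceiling is set almost entirely by bandwidth divided by model size, which is why batching, rather than extra FLOPS, is what lifts decode throughput.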


The Core Metric: Tokens Per Second

Tokens per second (TPS) is your fundamental unit of inference throughput. Everything in a sizing conversation eventually traces back to it.

There are two ways to look at TPS, and you need to keep them separate:

  • Per-user TPS — how fast tokens are delivered to a single user.
    • This drives the perceived experience.
    • Rough guide: below 10–15 tokens/sec starts to feel sluggish; above 30 tokens/sec it feels near-instant for most chat use cases.
  • System TPS — the total token output across all concurrent users.
    • This is what you’re actually sizing the hardware to sustain.

The relationship is simple in principle:

System TPS = Concurrent Users × Tokens per Second per User

In practice, batching is what makes this efficient:

  • Rather than serving each user’s request on dedicated GPU resources, a well-configured inference server groups multiple requests together and processes them as a single batch.
  • This significantly improves GPU utilisation — particularly during the memory-bandwidth-bound decode phase.
  • Batching is the primary mechanism that lets you serve many users from a relatively small GPU footprint.

Working Backwards: From Users to GPUs

Here’s the sizing workflow that turns a customer conversation into a hardware recommendation.

Step 1: Define the workload

Start with the usage-side discovery questions from Part 1:

  • How many concurrent users?
  • What’s the average prompt length (input tokens)?
  • What’s the expected response length (output tokens)?
  • What’s the acceptable latency — both time to first token (TTFT) and total response time?

A worked example

A customer wants to deploy an internal assistant. Together you define:

| Parameter | Value |
| --- | --- |
| Concurrent users | 200 |
| Average prompt | 500 tokens |
| Average response | 300 tokens |
| Target response time | ~10 seconds |
| Acceptable TTFT | < 2 seconds |

Step 2: Calculate required system TPS

From the example:

  • 300 output tokens in 10 seconds = 30 tokens/sec per user
  • 200 users × 30 tokens/sec = 6,000 tokens/sec system throughput

So the platform needs to sustain ~6,000 TPS of decoded tokens under load.

Step 3: Establish per-GPU TPS for your chosen model

This is where model size and GPU choice meet. As a rough reference for inference at FP16 (actual figures vary with batch size, framework, and optimisation):

| Model | GPU | Approx. TPS (decode, batched) |
| --- | --- | --- |
| 7B | H100 80GB | ~2,000–3,000 |
| 70B (tensor parallel, 4×) | 4× H100 80GB | ~800–1,200 |
| 70B (tensor parallel, 8×) | 8× H100 80GB | ~1,500–2,500 |

Note: these are illustrative ranges. Always validate against benchmark data for your specific model, serving framework, optimisation level (TensorRT-LLM, vLLM, etc.), and batch configuration.

Step 4: Calculate GPU or node count

Continuing the example, assume:

  • You choose a 70B model hosted on 4× H100 80GB nodes.
  • Based on benchmarks, you take a conservative estimate of 1,000 TPS per node (decode, batched).

Then:

  • 6,000 system TPS ÷ 1,000 TPS per node ≈ 6 nodes

Add a headroom buffer (typically 20–30% for burst traffic, uneven load, and future growth):

  • 6 nodes × 1.25 = 7.5 → round up to 8 nodes as a starting recommendation.
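Steps 1 to 4 chain together naturally, so the whole worked example fits in one small sketch. The class and function names are invented for illustration, and the 1,000 TPS per node figure is the conservative assumption from Step 4, not a benchmark.

```python
import math
from dataclasses import dataclass


@dataclass
class Workload:
    concurrent_users: int
    response_tokens: int
    response_seconds: float

    @property
    def per_user_tps(self) -> float:
        return self.response_tokens / self.response_seconds

    @property
    def system_tps(self) -> float:
        return self.concurrent_users * self.per_user_tps


def recommend_nodes(workload: Workload, tps_per_node: float, headroom: float = 0.25) -> int:
    """Nodes needed for the system TPS plus headroom, rounded up."""
    return math.ceil(workload.system_tps / tps_per_node * (1 + headroom))


assistant = Workload(concurrent_users=200, response_tokens=300, response_seconds=10)
print(assistant.per_user_tps)                           # 30.0
print(assistant.system_tps)                             # 6000.0
print(recommend_nodes(assistant, tps_per_node=1_000))   # 8
```

Anywhere in the 20–30% headroom range the result still rounds up to 8 nodes, so the recommendation here is not sensitive to the exact buffer chosen.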

At this point, you have a defensible answer to “how many GPUs/nodes do we need?” that’s grounded in user requirements, not just “bigger is better.”


Reference Sizing: Two Common Scenarios

The worked example above walks through the methodology. The table below applies it to two reference architectures you’ll encounter regularly — a 7B internal assistant and a 70B RAG system — to give you a practical feel for how the numbers land.

Figures assume FP16 or INT8 precision, batched inference, and a well-optimised serving framework such as TensorRT-LLM or vLLM. Treat these as directional reference points, not guaranteed benchmarks — validate against your specific model, configuration, and workload before quoting.

| Aspect | Ref A: 7B Internal Assistant | Ref B: 70B RAG System |
| --- | --- | --- |
| Typical use case | Employee Q&A, productivity assistant | Legal/finance/engineering RAG over proprietary data |
| Model size | 7B | 70B |
| Quality vs cost | “Good enough” quality, cost-optimised | Higher quality, domain-heavy reasoning |
| Concurrency (peak) | ~500 users | ~100 users |
| Avg prompt (input) | ~400 tokens | ~2,000 tokens (incl. retrieved context) |
| Avg response (output) | ~250 tokens | ~500 tokens |
| Latency target | 8–10 s total, TTFT < 2 s | 12–15 s total, TTFT < 3 s |
| System TPS target | 12,500 TPS (decode) | 3,300 TPS (decode) |
| Precision (typical) | INT8 / mixed (weights) | FP16 / mixed, selective quantisation |
| GPUs per node (typical) | 3–4 GPUs per PowerEdge node | 8 GPUs per PowerEdge XE-class node |
| Nodes (illustrative) | 3–4 nodes (total ~12 GPUs, incl. headroom) | 3 nodes (24 GPUs total, incl. headroom) |
| Interconnect focus | Good PCIe + 25–100 GbE | NVLink/NVSwitch + 100–400 Gb fabric |
| Workload pattern | High concurrency, chat-like | Lower concurrency, long prompts, RAG + heavier reasoning |
| Sizing conversation hook | “Maximise users per GPU, acceptable quality” | “Maximise quality on key workflows, moderate concurrency” |

Notice the counterintuitive result: the smaller 7B model actually demands nearly four times the system throughput of the 70B RAG system (≈12,500 TPS vs ≈3,300 TPS). That’s not a contradiction — it’s the concurrency effect. Serving 500 chat users simultaneously generates far more aggregate token output than 100 users running deep reasoning queries, even though each individual 7B response is shorter. Bigger model doesn’t always mean bigger infrastructure footprint; workload pattern matters just as much as parameter count.

A few additional assumptions behind these figures worth keeping front of mind:

  • No fine-tuning overhead — these are inference-only configurations. If the customer plans on-premises fine-tuning, GPU and memory requirements increase substantially.
  • Steady-state load — the node counts include a 20–30% headroom buffer but assume reasonably predictable peak concurrency. Highly bursty workloads (e.g. end-of-day batch spikes) may warrant additional headroom or an autoscaling strategy.
  • Single-tenant deployment — figures assume dedicated GPU resources per workload. Multi-model or multi-tenant deployments require separate sizing treatment.
  • Retrieved context included: the 70B RAG prompt size of ~2,000 tokens already includes retrieved document chunks. If the pipeline is later tuned to retrieve more or larger chunks, prompt tokens, and therefore TTFT and KV cache memory, will increase accordingly.

The Trade-offs Every Customer Faces

Once you’ve run this exercise with a customer, three trade-off conversations typically follow.

1. Model quality vs. throughput

  • A 70B model usually produces higher-quality outputs than a 7B.
  • But it also serves far fewer users per GPU.
  • For some use cases — summarising legal documents, writing complex code, specialised reasoning — the quality premium is worth it.
  • For a high-volume customer service assistant, a well-tuned 7B model might deliver better economics with acceptable quality.

2. Latency vs. concurrency

  • Larger batch sizes improve GPU utilisation and system throughput, but they increase the time an individual request spends waiting to join a batch.
  • If TTFT is critical (live chat, voice interfaces), you’ll accept lower utilisation to keep batches small and responsive.
  • If the application is asynchronous (batch document processing, offline analytics), you can run large batches, push utilisation higher, and drive down cost per request.

3. Precision vs. memory footprint

  • Running a model at FP16 gives you full quality but also the full VRAM cost.
  • Quantising to INT8 or INT4 roughly halves or quarters the memory footprint, allowing either a larger model to fit in the same GPUs, or the same model to fit in fewer GPUs.
  • There is a quality trade-off, but for many inference workloads, well-done INT8 quantisation offers an excellent quality-to-cost ratio and is worth including in the conversation.

What This Means for Platform Selection

By this point in a customer conversation, you have enough to make an informed platform recommendation.

Single-node, lower concurrency, 7B–13B models: A PowerEdge server with 2–4 high-memory GPUs will typically cover the requirement, with room to scale up or out as usage grows.

Multi-node, higher concurrency, or 70B+ models: You’re looking at GPU-dense platforms where high-speed interconnect between GPUs (NVLink, NVSwitch) and network fabric between nodes become as important as raw GPU count. These directly affect prefill and decode performance, and therefore both latency and throughput.

Mixed workloads (inference + fine-tuning): Fine-tuning demands significantly more memory per GPU than inference alone (optimiser states, gradient storage, larger activations). If a customer plans both, size for fine-tuning; the inference requirement is typically covered as a result.

The specific Dell platform mapping — PowerEdge XE series, GPU configurations, and interconnect options — is what we’ll build out in Part 3.


Next up: Part 3 — Platform and GPU selection: mapping your sizing to Dell PowerEdge XE configurations. We’ll take the TPS-based sizing approach from this post and show how it translates into concrete server configs you can quote.


LLM Sizing 101 – Part 1: Tokens and Parameters

[Figure: infographic explaining tokens and parameters in large language models, with definitions, examples, and a chart of the sizing problem they create.]

Every week, another organisation announces it’s deploying a large language model. And every week, a Technical Architect or Pre-sales Engineer gets asked a version of the same question: “How much infrastructure do I actually need for this?”

In my days as a Data Centre Architect and Engineer, I’d size server clusters for databases and VMware environments. The maths was different, but the discipline was the same: understand the workload, match it to the hardware, justify the recommendation. Now the question I get asked is “How do I size for LLMs?” This blog series is all about answering that.

Before you can answer that — before GPUs, nodes, interconnects, or platform choices even enter the conversation — you need two concepts nailed down cold: tokens and parameters. They’re the two dials that drive every LLM server sizing decision you’ll ever make.

Think of it this way. Parameters tell you how big the engine is. Tokens tell you how hard you’re asking it to work. Get those two right, and the rest of the sizing conversation falls into place.


Tokens: The Currency of Language Models

LLMs don’t read sentences the way you do. They don’t even read words. They read tokens — small chunks of text that sit somewhere between a syllable and a word.

  • Sometimes a token is a whole word: server
  • Sometimes it’s a fragment: serv, er
  • Sometimes it’s punctuation or whitespace: “.”, “,” or a single space

For English text, a useful rule of thumb is:

1 token ≈ 3–4 characters, or roughly 0.75 of a word

So the sentence “This is a sizing test.” runs to about 6–7 tokens — not 5, because the model doesn’t count words.
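The rule of thumb is easy to apply to any string. The helper below is a rough estimator only, with an invented name; real counts depend on the model’s tokenizer, so use the tokenizer itself whenever an exact figure matters.

```python
def estimate_tokens(text: str) -> dict:
    """Rough token estimates from the 3-4 characters and 0.75 words rules."""
    chars = len(text)
    words = len(text.split())
    return {
        "from_chars_low": chars / 4,   # 4 characters per token
        "from_chars_high": chars / 3,  # 3 characters per token
        "from_words": words / 0.75,    # 0.75 words per token
    }


estimate = estimate_tokens("This is a sizing test.")
for rule, tokens in estimate.items():
    print(f"{rule}: ~{tokens:.1f} tokens")
```

For the 22-character, 5-word example sentence, the three rules give roughly 5.5, 7.3 and 6.7 tokens, which brackets the 6–7 quoted above.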

When you see pricing or performance metrics quoted in the market, they’re always denominated in tokens:

  • $X per 1,000 tokens
  • Y tokens per second
  • 4k / 8k / 32k / 128k context window

That last one matters a lot. The context window is the maximum amount of text — measured in tokens — the model can hold in view at once. It’s not just the question you asked; it includes everything: system instructions, conversation history, documents you’ve fed in, and the response being generated. Every token in that window costs compute and memory.
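Because everything shares the same window, a simple budget check is a useful habit. The token counts below are hypothetical values chosen for illustration; only the idea that system instructions, history, documents and the response all draw from one window comes from the text above.

```python
def context_budget(window_tokens, system_tokens, history_tokens,
                   document_tokens, response_tokens):
    """Return (tokens used, tokens remaining) for one request."""
    used = system_tokens + history_tokens + document_tokens + response_tokens
    return used, window_tokens - used


# Hypothetical request: short system prompt, some chat history,
# a few retrieved documents, and room reserved for the answer.
request = dict(system_tokens=300, history_tokens=1_200,
               document_tokens=6_000, response_tokens=500)

print(context_budget(8_192, **request))  # fits an 8k window, just
print(context_budget(4_096, **request))  # negative remainder: does not fit 4k
```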

Why Tokens Drive Sizing

Tokens show up in three places in every sizing conversation:

1. Context length (the prompt window): Longer context means the model has to track more information simultaneously. That translates directly into more VRAM for the KV cache, the memory structure the model uses to keep track of what it’s already processed. A customer who wants 128k-token context windows needs significantly more memory per request than one running at 4k.

2. Throughput and concurrency: Tokens per second is the fundamental throughput metric, whether per GPU, per node, or per cluster. In practice, you’ll often work backwards from a customer’s requirements:

“We need to support 500 concurrent users, each generating responses of around 300 tokens, within 3 seconds.”

That’s a tokens-per-second and concurrency problem. Everything else follows from it.

3. Capacity and cost planning: Whether on-premises or cloud, consumption is effectively input tokens + output tokens. On a Dell PowerEdge server deployment, higher sustained tokens per second means more GPU compute, more memory bandwidth, and, beyond a certain point, more nodes or a move to higher-end accelerators.


Parameters: The Size of the Brain

If tokens are the currency of language models, parameters are what you’re buying with your hardware budget.

A parameter is a learned numeric weight — a floating-point number — stored inside the model. Mathematically, an LLM is an enormous function, and parameters are the numbers that define it. When a model is trained, billions of these weights are adjusted, incrementally, until the model gets reliably good at predicting language.

This is why model names look the way they do:

  • 7B → approximately 7 billion parameters
  • 13B, 34B, 70B, 405B → and so on up the scale

More parameters generally mean greater model capability — the model can represent more complex patterns, handle more nuanced reasoning, and produce higher-quality output. But that capability comes at a direct hardware cost, because every parameter has to live somewhere.

The VRAM Equation

The first-order estimate for model memory is straightforward:

Model VRAM (GB) ≈ Parameters × Bytes per Parameter

In practice:

| Model | Precision | Weights-only VRAM |
| --- | --- | --- |
| 7B | FP16/BF16 (2 bytes) | ~14 GB |
| 70B | FP16/BF16 (2 bytes) | ~140 GB |

That’s just for the weights themselves. In a real deployment you also need memory for:

  • KV cache — grows with context length and batch size
  • Activation memory — the working memory during computation
  • Optimiser states — relevant if you’re fine-tuning, not just inferencing
  • Runtime overhead — fragmentation, safety layers, serving framework
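Putting the components together gives a simple memory budget. In the sketch below, the KV cache and activation figures and the 10% overhead fraction are placeholders chosen for illustration; in a real exercise they would come from the model architecture, the expected context length and batch size, and the serving framework.

```python
def total_vram_gb(weights_gb, kv_cache_gb, activation_gb=0.0,
                  optimiser_gb=0.0, overhead_fraction=0.10):
    """Sum the memory components and add a proportional runtime overhead."""
    subtotal = weights_gb + kv_cache_gb + activation_gb + optimiser_gb
    return subtotal * (1 + overhead_fraction)


# 7B model at FP16 for inference: 14 GB of weights plus assumed
# 10 GB of KV cache and 2 GB of activations (hypothetical values).
print(f"~{total_vram_gb(14, kv_cache_gb=10, activation_gb=2):.1f} GB")
```

Even with modest placeholder values, the total is roughly double the weights-only figure, which is why the weights footprint should be read as a floor rather than a requirement.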

The practical consequence is clear:

  • A 7B model can typically run on a single high-memory GPU (24–80 GB class, depending on precision and context requirements).
  • A 70B model generally needs multiple high-VRAM GPUs — and demands fast interconnects between them, whether that’s NVLink, NVSwitch, or a high-bandwidth PCIe fabric.

As a pre-sales engineer, parameter count is what you’ll map to platform choices: how many GPUs per node, whether you need a GPU-dense platform with a high-speed fabric, and whether the workload fits in a 2U form factor or needs something more substantial.


How the Two Interact

Here’s the mental model worth keeping front of mind for every sizing conversation:

| | What it measures | What it drives |
| --- | --- | --- |
| Parameters | How big the model is | VRAM requirement, compute per token, hardware footprint |
| Tokens | How much work you’re asking it to do | Throughput, latency, concurrency, context memory |

Given a fixed GPU budget, customers are always navigating a trade-off:

  • Bigger model (more parameters) versus more throughput (more tokens per second)
  • Longer context windows (more tokens per request) versus more concurrent requests

There’s no universally right answer — it depends on the use case. A legal document analysis platform that processes 100k-token contracts needs a very different configuration from a customer service chatbot handling hundreds of short, concurrent sessions.
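The context-versus-concurrency trade-off can be made concrete with a fixed memory budget: whatever is left after the weights is shared out as KV cache. The per-token KV figure below (128 KiB) is an assumption in the range of publicly documented 7B–8B class models with grouped-query attention, and the 80 GB GPU and 14 GB weights are illustrative; none of these are measured values.

```python
def max_concurrent_requests(gpu_vram_gb, weights_gb, context_tokens,
                            kv_bytes_per_token):
    """How many full-length requests fit in the memory left after weights."""
    free_bytes = (gpu_vram_gb - weights_gb) * 1e9
    per_request_bytes = context_tokens * kv_bytes_per_token
    return max(0, int(free_bytes // per_request_bytes))


KV_BYTES_PER_TOKEN = 131_072  # assumed: 128 KiB per token

short = max_concurrent_requests(80, 14, 4_096, KV_BYTES_PER_TOKEN)
long_ = max_concurrent_requests(80, 14, 100_000, KV_BYTES_PER_TOKEN)
print(f"4k-token sessions:   {short} concurrent requests")
print(f"100k-token contracts: {long_} concurrent requests")
```

Same GPU, same model: short chat sessions fit by the hundred, while 100k-token documents fit only a handful at a time, before any throughput consideration enters the picture.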


Turning This Into a Sizing Conversation

When you strip away the jargon, most customer LLM questions reduce to this:

“Given a model size (parameters) and an expected usage pattern (tokens), how many GPUs and servers do I need to hit my latency and concurrency targets?”

The discovery questions that unlock that answer fall into two groups:

Model side:

  • Are you targeting a 7B, 13B, 70B, or larger class model?
  • Are you planning full precision, mixed precision, or quantised deployment?
  • Is this inference only, or do you also plan fine-tuning?

Usage side:

  • What’s the average prompt size per request (in tokens)?
  • What’s the expected response length?
  • What’s the maximum context length required — 8k, 32k, 128k?
  • How many concurrent users do you need to support, and at what latency?

Once you have those answers, you can map them to Dell platforms — GB10, GB300, PowerEdge XE and XE+ GPU servers, interconnect choices, cluster configurations — in a structured and defensible way. That’s exactly what we’ll build up in the posts that follow.


Next up: Part 2 — from tokens per second to GPU count: the maths that drive inference sizing.