LLM Sizing 101 – Part 3: Platform and GPU Selection
Mapping your sizing to Dell PowerEdge XE configurations
In Part 1 we nailed down the two fundamentals: parameters define how big the model is, and tokens define how much work you’re asking it to do.
In Part 2 we made it practical — translating a customer’s real-world requirements into a target tokens-per-second figure, and from there into a GPU count.
Now we make it concrete.
Building on the methodology from Part 2, we apply it to two representative scenarios — a 7B internal assistant and a 70B RAG system — and map everything to actual Dell PowerEdge XE platform configurations you can put in a proposal. But before we get to the reference designs, there’s a gotcha.
The Gotcha: Model Precision
There’s a variable that can silently double — or halve — your GPU count if you don’t nail it down early in the conversation.
Model precision.
When a customer says “we want to run a 70B model,” that sentence is incomplete. The question you need to ask immediately is: at what model precision?
Here’s why it matters so much. The memory footprint of a model is:
Model VRAM (GB) = number of parameters × bytes per parameter
And the bytes-per-parameter figure is entirely determined by precision:
| Precision | Bytes per parameter | 70B model — weights only | Notes |
|---|---|---|---|
| FP32 | 4 bytes | ~280 GB | Training only; rare in inference |
| FP16 / BF16 | 2 bytes | ~140 GB | Full quality baseline |
| FP8 | 1 byte | ~70 GB | Requires H100, H200, B300 class |
| INT8 | 1 byte | ~70 GB | Broad hardware support |
| INT4 | 0.5 bytes | ~35 GB | Validate quality before committing |
| FP4 | 0.5 bytes | ~35 GB | Blackwell generation only (e.g. B300/GB300); hardware-accelerated inference precision |
Run the same 70B model at FP16 versus INT4 and the weights footprint changes by 4×. That’s the difference between needing two 8-GPU nodes and needing one. It’s the difference between a £400k proposal and a £200k proposal. And it’s a variable that’s completely invisible if you skip the precision conversation.
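The formula is simple enough to sanity-check in a few lines. What follows is a minimal sketch using the bytes-per-parameter values from the table; the name `weights_gb` is illustrative, and the result covers weights only (no KV cache, activations or framework overhead).

```python
# Bytes per parameter for common inference precisions (weights only).
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16": 2.0,
    "bf16": 2.0,
    "fp8": 1.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weights_gb(params_billions: float, precision: str) -> float:
    """Approximate weights footprint in GB, taking 1 GB as 1e9 bytes."""
    # (params_billions * 1e9 parameters) * bytes / 1e9 bytes-per-GB
    return params_billions * BYTES_PER_PARAM[precision.lower()]


for precision in ("fp16", "int8", "int4"):
    print(f"70B @ {precision}: ~{weights_gb(70, precision):.0f} GB")
```

Swapping `70` for `7` or `405` gives the same view for other model sizes, which is usually enough to show a customer why the precision question comes first.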
How to find out
The good news: model precision is almost always discoverable before you size anything.
The model card. Every published model has a model card stating the native training precision — typically FP32, BF16, or FP16 — and whether pre-quantised versions exist. Llama 3.1 405B, for example, is published in BF16 with a separate FP8-quantised version available for single-node deployment. That’s not a footnote — it’s a hardware decision.
The deployment framework. When a customer tells you they’re using vLLM, TensorRT-LLM, or NVIDIA NIM, the framework makes precision explicit. NIM profiles are named by precision — tensorrt_llm-h100-fp8-tp2-latency tells you the precision, the GPU, and the parallelism strategy in one string. If the customer has already chosen a framework, ask what precision they’re deploying at — they’ll either know, or the question will prompt them to find out.
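To show how much a profile name gives away, here is a small sketch that splits the example string into its parts. It assumes the name follows the same five-field shape as the example above (backend, GPU, precision, tensor-parallel degree, optimisation target); real profile names may carry extra fields, and `parse_profile` is a hypothetical helper, not part of NIM.

```python
from typing import NamedTuple


class Profile(NamedTuple):
    backend: str
    gpu: str
    precision: str
    tensor_parallel: int
    target: str


def parse_profile(name: str) -> Profile:
    """Split a name shaped like 'backend-gpu-precision-tpN-target'."""
    parts = name.split("-")
    # Assumption: exactly five hyphen-separated fields, the fourth being 'tpN'.
    if len(parts) != 5 or not parts[3].startswith("tp"):
        raise ValueError(f"unexpected profile format: {name!r}")
    backend, gpu, precision, tp, target = parts
    return Profile(backend, gpu, precision, int(tp[2:]), target)


profile = parse_profile("tensorrt_llm-h100-fp8-tp2-latency")
print(profile.precision, profile.gpu, profile.tensor_parallel)
```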
The GPU itself. Not all GPUs support all precisions. FP8 requires H100, H200, B300 or AMD MI300X class hardware. FP4 is a Blackwell-generation capability (B200/GB200 and B300/GB300); it isn’t available on Hopper or earlier generations. INT4 with hardware acceleration requires specific tensor core support. If the customer has already chosen a GPU, that constrains the precision options, and vice versa. The two decisions are linked.
The precision conversation in practice
When a customer names a model, these are the three questions that unlock the sizing:
- “Are you using the native model weights, or a quantised version?”
- “What serving framework are you planning to use?”
- “Is some accuracy trade-off acceptable in exchange for a smaller hardware footprint?”
That last question is the most important one. Modern quantisation techniques — GPTQ, AWQ, SmoothQuant — preserve the vast majority of model quality for most enterprise inference workloads. The difference between BF16 and INT8 is typically imperceptible for summarisation, search, classification and code assistance. For complex multi-step reasoning or fine-tuned models, it’s worth validating. But for the majority of use cases, INT8 or FP8 is a legitimate production choice — not a compromise.
The rule of thumb, for most enterprise inference workloads: the bigger the model, the more gracefully it quantises. At the same precision, a 70B model typically loses proportionally less quality than a 7B model, so aggressive settings such as INT4 carry more risk at 7B than at 70B.
Get precision wrong — or leave it undefined — and every GPU count in your proposal is built on a shaky foundation. Get it right, and you have a sizing conversation that’s grounded, defensible, and often more cost-effective than the customer expected.
Two Reference Designs
With precision established, everything else follows.
Sizing disclaimer: The reference designs below illustrate the methodology — they are not a substitute for your own sizing exercise. TPS figures, GPU counts and node recommendations are directional reference points based on representative workloads. Actual performance will vary with your specific model, serving framework, quantisation approach, batch configuration and workload pattern. Always validate against benchmark data for your environment before quoting or committing to a configuration.
These aren’t rigid prescriptions — they’re starting points you can adapt by adjusting the inputs and re-running the TPS maths from Part 2.
Reference Design A: 7B Internal Assistant
Use case: An internal productivity assistant — employees asking about policies, summarising documents, drafting emails. High concurrency, moderate latency sensitivity, cost-conscious.
1. Define the workload
| Parameter | Value |
|---|---|
| Concurrent users (peak) | 500 |
| Average prompt | 400 tokens |
| Average response | 250 tokens |
| Target response time | ~8–10 seconds |
| Acceptable TTFT | < 2 seconds |
| Model | 7B class |
2. Establish precision and memory footprint
For a 7B model:
| Precision | Weights footprint | Fits on a single GPU? |
|---|---|---|
| FP16 / BF16 | ~14 GB | Yes (48–80 GB class) |
| INT8 | ~7 GB | Yes — comfortably |
| INT4 | ~3.5 GB | Yes — with significant headroom |
For a high-concurrency internal assistant, INT8 or mixed precision (weights in INT8, activations in FP16/BF16) is the practical default. It fits cleanly on a single GPU, leaves room for KV cache and batching overhead, and the quality trade-off is negligible for this kind of workload.
3. Translate to TPS
- 250 output tokens ÷ 10 seconds = 25 tokens/sec per user
- 500 users × 25 tokens/sec = 12,500 tokens/sec system TPS
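The two steps above are a division and a multiplication, but wrapping them in functions makes it easy to re-run them with different inputs. This is a minimal sketch with illustrative function names; like the worked figures, it uses the slower end of the 8–10 second target.

```python
def per_user_tps(response_tokens: int, response_seconds: float) -> float:
    """Output tokens per second one user needs to hit the response-time target."""
    return response_tokens / response_seconds


def system_tps(concurrent_users: int, response_tokens: int,
               response_seconds: float) -> float:
    """Aggregate output tokens per second across all concurrent users."""
    return concurrent_users * per_user_tps(response_tokens, response_seconds)


# Reference Design A: 500 users, 250-token responses, 10-second target.
print(per_user_tps(250, 10))     # 25.0
print(system_tps(500, 250, 10))  # 12500.0
```

Tightening the target to 8 seconds raises the system requirement to 15,625 tokens/sec, which is why the response-time target is worth agreeing with the customer before anything else is sized.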
4. Per-GPU TPS estimate
For a 7B model at INT8/mixed precision, batched decode on a high-end accelerator:
| GPU | Approx. TPS (7B, batched decode) |
|---|---|
| H100 80GB SXM | ~2,000–3,000 |
| H200 141GB | ~2,500–3,500 |
| L40S 48GB | ~1,000–1,500 |
| B300 288GB | ~4,000–6,000 (est.) |
Conservative planning figure: 1,500 TPS per GPU on current generation. That sits at the top of the L40S range and below the H100/H200 ranges above; expect higher on B300.
5. GPU and node count
- 12,500 TPS ÷ 1,500 TPS/GPU ≈ 8.3 GPUs
- Add 25% headroom: 8.3 × 1.25 ≈ 10.4 GPUs → round up to 12 for a clean 3 × 4 configuration
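The same arithmetic can be expressed as a short sketch. The function names are illustrative, and the 4-GPUs-per-node value is the configuration assumed in this reference design.

```python
import math


def gpus_required(system_tps: float, tps_per_gpu: float,
                  headroom: float = 0.25) -> int:
    """Whole GPUs needed to serve system_tps with fractional headroom on top."""
    return math.ceil(system_tps / tps_per_gpu * (1 + headroom))


def round_to_nodes(gpus: int, gpus_per_node: int) -> tuple[int, int]:
    """Round a GPU count up to whole nodes; returns (nodes, total GPUs)."""
    nodes = math.ceil(gpus / gpus_per_node)
    return nodes, nodes * gpus_per_node


gpus = gpus_required(12_500, 1_500)  # 8.33 * 1.25 = 10.4 -> 11 whole GPUs
print(gpus, round_to_nodes(gpus, 4))
```

Rounding to whole nodes is what turns 11 GPUs into 12: the twelfth GPU is extra headroom that comes with the third node.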
6. Platform mapping
A 7B model at INT8 fits on a single GPU — no tensor parallelism required. Each GPU runs an independent model replica and you scale out horizontally across nodes. This is compact, balanced GPU server territory.
The Dell PowerEdge XE7745 is the natural fit for this workload class: a 4U air-cooled platform that supports up to eight double-wide PCIe GPUs, populated here with four high-memory GPUs per node. It is designed for exactly this kind of inference deployment. For organisations planning ahead with Blackwell, the XE7745 also supports NVIDIA RTX Pro 6000 Blackwell Server Edition GPUs: a professional-grade accelerator with 96 GB GDDR7 that offers significant headroom for a 7B workload and future-proofing for multi-model environments, at a lower power and cost envelope than data centre HBM-class GPUs.
“For a 7B internal assistant serving ~500 concurrent users, a small cluster of three PowerEdge XE7745 nodes gives you a responsive chat experience, capacity to grow, and the flexibility to host multiple models or environments — all in a standard rack footprint.”
Reference Design B: 70B RAG System
Use case: Knowledge-heavy workflows — legal, financial or engineering teams querying proprietary documents via a RAG pipeline. Quality matters more than raw user count. Concurrency is moderate.
1. Define the workload
| Parameter | Value |
|---|---|
| Concurrent users (peak) | 100 |
| Average prompt | 2,000 tokens |
| Average response | 500 tokens |
| Target response time | ~12–15 seconds |
| Acceptable TTFT | < 3 seconds |
| Model | 70B class |
Prompts are longer here because RAG injects retrieved document snippets, conversation history and system instructions into every request. That longer context window drives up KV cache memory — which is why the platform choice shifts significantly compared to Reference Design A.
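To put a rough number on that KV cache pressure, here is a sketch of the standard per-token estimate. The architecture values (80 layers, 8 KV heads with grouped-query attention, head dimension 128) are assumptions taken from published Llama-style 70B configurations, not from this article, so check the actual model’s config. It also assumes an FP16 cache and the worst case where every concurrent user holds a full prompt plus response in cache at once.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """KV cache bytes stored per token across all layers."""
    # The factor of 2 covers the separate key and value tensors in each layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_gb(concurrent_sequences: int, tokens_per_sequence: int,
                per_token_bytes: int) -> float:
    """Total KV cache in GB (1e9 bytes) if every sequence is fully resident."""
    return concurrent_sequences * tokens_per_sequence * per_token_bytes / 1e9


# Assumed Llama-style 70B shape; FP16 cache (2 bytes per value).
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
total = kv_cache_gb(100, 2_000 + 500, per_token)
print(f"{per_token / 1024:.0f} KiB per token, ~{total:.0f} GB at peak")
```

Under these assumptions the cache alone comes to roughly 82 GB at peak, which is more than the FP8 weights themselves. Serving frameworks manage this memory dynamically, so treat the figure as an upper-bound planning number rather than a measurement.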
2. Establish precision and memory footprint
This is where precision has the biggest impact on the proposal — and where the B300 changes the calculus significantly:
| Precision | Weights footprint | H100/H200 GPUs needed | B300 GPUs needed | Notes |
|---|---|---|---|---|
| FP16 / BF16 | ~140 GB | 2 minimum | 1 (fits with headroom) | Full quality; B300’s 288 GB changes the equation |
| FP8 | ~70 GB | 1 minimum | 1 | Near-FP16 quality; requires H100/H200/B300 |
| INT8 | ~70 GB | 1 minimum | 1 | Minimal quality loss for most workloads |
| INT4 | ~35 GB | 1 | 1 | Validate quality before committing |
| FP4 | ~35 GB | N/A (Blackwell only) | 1, with substantial headroom | Validate quality for RAG use cases |
A key implication for B300 deployments: with 288 GB of HBM3e per GPU, a 70B model at FP16 (~140 GB weights) fits on a single B300. That eliminates the need for tensor parallelism within the node for this model size, simplifying the architecture and reducing interconnect dependency.
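The GPU counts for FP16 and FP8 can be reproduced with a short sketch. The GPU memory sizes are the ones quoted in this post; the 90% usable fraction is an assumption to leave room for runtime overhead, and the result covers weights only, so KV cache for long RAG prompts will push the real requirement higher.

```python
import math

# HBM capacity per GPU in GB, as quoted in this post.
GPU_MEMORY_GB = {"H100": 80, "H200": 141, "B300": 288}


def min_gpus_for_weights(weights_gb: float, gpu_memory_gb: float,
                         usable_fraction: float = 0.9) -> int:
    """Fewest GPUs whose usable memory can hold the weights (weights only)."""
    # usable_fraction is an assumed allowance for runtime and framework overhead.
    return math.ceil(weights_gb / (gpu_memory_gb * usable_fraction))


for gpu, memory in GPU_MEMORY_GB.items():
    fp16 = min_gpus_for_weights(140, memory)
    fp8 = min_gpus_for_weights(70, memory)
    print(f"{gpu}: FP16 needs {fp16}, FP8 needs {fp8}")
```

Note that an H200 has 141 GB against ~140 GB of FP16 weights, so it only fits on paper; once any overhead is reserved, two GPUs are needed, which matches the table.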
For a legal or financial RAG workload where output quality is the primary requirement, FP16 or FP8 remains the right starting point. FP4 on B300 is increasingly viable but worth validating explicitly against the customer’s specific domain before committing.
3. Translate to TPS
- 500 output tokens ÷ 15 seconds ≈ 33 tokens/sec per user
- 100 users × 33 tokens/sec = 3,300 tokens/sec system TPS
4. Per-node TPS estimate
For a 70B model running on high-end accelerators:
| Configuration | Approx. TPS (70B, batched decode) |
|---|---|
| 4× H100 80GB (tensor parallel) | ~800–1,200 |
| 8× H100 80GB (tensor parallel) | ~1,500–2,500 |
| 8× H200 141GB (tensor parallel) | ~2,000–3,500 |
| 8× B300 288GB (HGX B300) | ~4,000–7,000 (est.) |
Conservative estimate on an 8× H100 node: 1,500 TPS. On an 8× B300 node: significantly higher, with the added benefit that each GPU can host the full model independently.
5. Node count
- 3,300 TPS ÷ 1,500 TPS/node (H100 baseline) ≈ 2.2 nodes
- Add 25–30% headroom: 2.2 × 1.25 = 2.75 (or 2.2 × 1.30 ≈ 2.86) → round up to 3
- Total: 3 nodes × 8 GPUs = 24 GPUs (H100/H200 baseline)
On B300 hardware, the same TPS target is achievable with fewer nodes — or the same node count delivers substantially higher capacity.
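Three nodes also gives you operational flexibility: you can drain one for maintenance and keep two serving. At the conservative 1,500 TPS per node that leaves about 3,000 TPS, roughly 90% of the 3,300 peak target, so schedule maintenance off-peak unless measured throughput lands in the upper part of the range.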
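The node count and the maintenance scenario can both be checked with a few lines. This is a sketch with illustrative function names, using the conservative 1,500 TPS per node from the H100 baseline.

```python
import math


def nodes_required(system_tps: float, tps_per_node: float,
                   headroom: float = 0.25) -> int:
    """Whole nodes needed to serve system_tps with fractional headroom on top."""
    return math.ceil(system_tps / tps_per_node * (1 + headroom))


def capacity_with_drained(nodes: int, tps_per_node: float,
                          drained: int = 1) -> float:
    """Aggregate TPS left when some nodes are out of service."""
    return (nodes - drained) * tps_per_node


nodes = nodes_required(3_300, 1_500)
remaining = capacity_with_drained(nodes, 1_500)
print(nodes, remaining, f"{remaining / 3_300:.0%} of peak target")
```

At the conservative figure, the two remaining nodes provide 3,000 tokens/sec, about 91% of the 3,300 peak target. Whether a drained node keeps you above the floor therefore depends on real per-node throughput landing above the conservative estimate, which is one more reason to benchmark before quoting.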
6. Platform mapping
For H100/H200 deployments, the Dell PowerEdge XE9680 with 8× H100 or H200 GPUs remains a proven reference platform for 70B inference, with NVLink and NVSwitch providing the fast GPU-to-GPU interconnect tensor parallelism requires.
For Blackwell deployments, the Dell PowerEdge XE9780 and XE9785 are the direct successors to the XE9680 — delivering up to 4× faster LLM performance with the 8-way NVIDIA HGX B300. The liquid-cooled XE9780L and XE9785L variants support higher GPU densities for rack-scale deployments.
Infrastructure note: B300 systems need liquid cooling or high-capacity air cooling, 800 Gb/s networking, and power densities that most existing facilities cannot support without upgrade. The B300 draws 1,400W TDP per GPU, 40% more than the B200 and double the H100. Factor facility readiness into any B300 sizing conversation before committing to a configuration.
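A quick sketch makes the power gap concrete at node level. The B300 figure is the one quoted above; the H100 and B200 values are the ones implied by the stated ratios (half of 1,400W, and 1,400W ÷ 1.4). The result covers GPUs only, so CPUs, NICs, fans and power-supply losses come on top.

```python
# Per-GPU TDP in watts. B300 is quoted in this post; H100 and B200 are the
# values implied by "double the H100" and "40% more than the B200".
GPU_TDP_WATTS = {"H100": 700, "B200": 1000, "B300": 1400}


def node_gpu_power_kw(gpu: str, gpus_per_node: int = 8) -> float:
    """GPU-only power draw for one node, in kW (excludes the rest of the server)."""
    return GPU_TDP_WATTS[gpu] * gpus_per_node / 1000


for gpu in GPU_TDP_WATTS:
    print(f"8x {gpu}: {node_gpu_power_kw(gpu):.1f} kW (GPUs only)")
```

An 8-GPU B300 node comes to 11.2 kW for the GPUs alone, against 5.6 kW for the H100 equivalent, before the rest of the server is counted.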
“For a 70B RAG assistant used by specialist teams — legal, finance, engineering — the PowerEdge XE9680 with H100/H200 GPUs remains a strong proven choice. For organisations investing in Blackwell infrastructure, the XE9780/XE9785 with HGX B300 delivers significantly higher throughput and eliminates tensor parallelism requirements for 70B class models — but facility readiness for liquid cooling and power density must be confirmed first.”
The Platform Decision in Summary
| Workload | Model | Precision | Current Platform | Blackwell Platform | GPUs/node | Nodes |
|---|---|---|---|---|---|---|
| Internal assistant (high concurrency) | 7B | INT8 | PowerEdge XE7745 | XE7745 (RTX Pro 6000 BW) | 4× GPUs | 3 |
| RAG system (quality-first) | 70B | FP16 / FP8 | PowerEdge XE9680 | XE9780 / XE9785 (HGX B300) | 8× GPUs | 3 |
The pattern is consistent: model size drives platform class, precision drives memory footprint, TPS drives node count. Miss any one of those three and the sizing is incomplete.
For organisations moving beyond 70B — frontier models, multi-tenant inference at scale, or combined training and inference workloads — the Dell PowerEdge XE9712 featuring NVIDIA GB300 NVL72 is the next step up. With 72 Blackwell Ultra GPUs and up to 40 TB of fast memory per rack (combining ~20 TB of GPU HBM3e across 72 GPUs and ~17 TB of Grace CPU LPDDR5X), it delivers exascale-class AI performance for workloads that have outgrown the per-node sizing conversation entirely. That’s a different discussion — but it starts with the same methodology.
Three Trade-offs Worth Raising
Once you’ve walked a customer through a reference design, three conversations typically follow.
1. Can we use a smaller model? Sometimes yes — and it’s worth exploring. A well-tuned 13B model can deliver surprisingly strong results for many enterprise use cases, at a fraction of the infrastructure cost of a 70B. The right answer depends on the use case, not just the budget.
2. Can we quantise to reduce the footprint? INT8 quantisation roughly halves the memory footprint relative to FP16/BF16, with minimal quality loss for most inference workloads. INT4 goes further, but quality trade-offs become more noticeable and are worth validating before committing. FP4 on B300 hardware is the emerging sweet spot for next-generation inference: near-FP8 quality at half the memory cost, with hardware-accelerated compute. It does, however, require Blackwell-generation infrastructure.
3. What about fine-tuning? If the customer plans to fine-tune as well as infer, size for fine-tuning — it’s the more demanding workload. Fine-tuning requires storing optimiser states and gradients alongside the model weights, which can triple or quadruple the VRAM requirement compared to inference alone. A platform sized for fine-tuning will handle inference comfortably.
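A rough sketch shows where the multiple comes from. It assumes full fine-tuning with an Adam-style optimiser that keeps two moment tensors per parameter, with weights, gradients and both moments all held in BF16; activations are excluded. These byte counts are illustrative assumptions, and keeping optimiser states in FP32 would push the multiple higher.

```python
def finetune_state_gb(params_billions: float, weight_bytes: float = 2,
                      grad_bytes: float = 2, optimizer_bytes: float = 4) -> float:
    """GB for weights + gradients + optimiser states (activations excluded)."""
    # optimizer_bytes=4 assumes two BF16 moment tensors per parameter.
    return params_billions * (weight_bytes + grad_bytes + optimizer_bytes)


def inference_weights_gb(params_billions: float, weight_bytes: float = 2) -> float:
    """GB for the weights alone at the same precision."""
    return params_billions * weight_bytes


tuning = finetune_state_gb(7)
serving = inference_weights_gb(7)
print(f"7B: ~{serving:.0f} GB to serve, ~{tuning:.0f} GB to fine-tune "
      f"({tuning / serving:.0f}x)")
```

With these assumptions a 7B model goes from about 14 GB for inference to about 56 GB of training state, a 4× increase before activations are counted.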
What’s Next
With three posts, we’ve built a complete sizing chain:
- Part 1: Parameters and tokens — the two dials that drive every sizing decision
- Part 2: From tokens per second to GPU count — the maths that connects users to hardware
- Part 3: Precision, platform selection, and reference designs — where the maths meets the metal
The natural next conversation is the one that follows a sizing recommendation: how does an on-premises PowerEdge deployment compare to cloud over three years? That’s the cost modelling discussion — and it’s where a well-sized on-premises platform often tells a very different story to the cloud bill the customer is currently paying.


