Podcast Episode: The Plumbing Under the Hood: RAG, MCP and the Architecture Nobody Explains


Pip: If you’ve ever nodded along while someone explained AI architecture and understood roughly zero percent of it, garagegeek’s recent posts are for you — and honestly, maybe for that someone too.

Mara: This episode covers the infrastructure behind enterprise AI: how models actually get access to current information, and what that plumbing looks like in practice.

Pip: Let’s start with the architecture itself.

The Plumbing Under the Hood: RAG, MCP and the Architecture Nobody Explains

Mara: The core problem the post is solving is this: large language models are frozen. They’re trained up to a point in time, and then that’s it — they don’t know what changed this morning, or what’s in your CRM, or what your HR team updated last week.

Pip: The post puts it plainly: “An LLM is, for all its capability, a brilliant mind in a sealed room.” Every enterprise deployment is essentially one long attempt to pass notes through the door.

Mara: Right, and two distinct approaches emerged to solve that. The first is RAG — Retrieval Augmented Generation. Documents get chunked, converted into vector embeddings, and stored. When a question comes in, the system finds the chunks whose meaning is closest to the question and hands them to the model as context.

Pip: So it’s a very well-organized library. The catch being the library was shelved at a point in time.

Mara: Exactly. The post is direct about the ceiling: if source documents change, the index goes stale until you re-embed. And there’s a data quality dependency that doesn’t go away — “weak data quality at the input stage leads to flawed outputs downstream. RAG doesn’t solve a data quality problem. It inherits it.”

Pip: That line should probably be laminated and handed out at every vendor presentation.

Mara: The second approach is MCP — the Model Context Protocol — and it works differently. Instead of a pre-built index, the model connects to live systems through their APIs and queries them in real time. SharePoint, CRM, HR platforms, procurement tools — current, not cached.

Pip: The upshot is the model stops seeing a snapshot of your organisation and starts seeing it as it actually is right now. Which is a meaningfully different thing.

Mara: The post is also clear these aren’t competitors. RAG handles large, stable knowledge corpora well. MCP handles live, distributed operational data. A well-architected system can use both — MCP as the orchestration layer, with RAG as one of the tools it calls for specific unstructured content.

Pip: And the practical advice is refreshingly un-hyped: start with MCP, no vector infrastructure to provision, no embedding pipelines to maintain. Reach for RAG when semantic retrieval across messy unstructured content becomes essential.

Mara: There’s a sharp section on what nobody tells you before you commit. RAG brings real infrastructure costs — compute, storage, continuous re-embedding. MCP makes your legacy systems load-bearing: a flaky API or three years of inconsistent CRM data entry becomes a front-and-centre problem, not a peripheral one. Governance and security have to be designed in from the start, not added after.

Pip: The post closes by pointing somewhere interesting — once you have the interface and the plumbing, agents become possible. Models that don’t just answer, but act.

Mara: That’s the thread the next piece picks up.


Pip: The sealed-room framing is the one that sticks — capable model, no current information, and everything that follows is just different ways of passing notes.

Mara: And the note about agents acting rather than answering suggests the architecture conversation is only getting more consequential from here.

The Plumbing Under the Hood: RAG, MCP and the Architecture Nobody Explains

Figure: Blueprint-style diagram of a large language model (LLM) connected to enterprise systems including CRM, ERP, HR and SharePoint.

I’m an under-the-hood type of guy. I hear high-level fluff and I just turn off. I need more. I need to be able to visualise how things work, and the effects of implementation. I guess that’s the Solution Architect in me. Years of seeing projects go south. Experience that says it’s just not that simple.

I learned a long time ago: if you want something doing, do it yourself.

So here it is. No fluff, no hand-waving. The no-nonsense guide to what RAG and MCP actually are, how they work, and why the distinction matters more than most people realise. Enjoy.


The Problem Every Enterprise AI Deployment Hits

Large language models are genuinely extraordinary. The breadth of knowledge, the reasoning capability, the ability to synthesise and explain — it’s real, and it’s useful. But they have a fundamental constraint that every organisation hits the moment they try to deploy one seriously.

They are frozen.

An LLM is trained on a vast corpus of data up to a point in time, and then the weights are fixed. The model doesn’t know what happened last Tuesday. It doesn’t know your organisation’s processes, your customer contracts, your current pipeline, or the policy document your HR team updated this morning. It is, for all its capability, a brilliant mind in a sealed room.

Every enterprise AI deployment is therefore really solving one problem: how do we get relevant, current, organisational knowledge into the model’s hands at the moment it needs to answer?

Two main solutions emerged. They look similar on the surface. They are fundamentally different underneath.


RAG: The Indexed Snapshot

RAG stands for Retrieval Augmented Generation. The name is less important than the mechanism.

Imagine you have a large knowledge base — policy documents, product guides, training materials, technical specifications. RAG takes all of that content and processes it in advance. Each document gets broken into chunks. Each chunk gets converted into a vector embedding — a numerical representation of the meaning of that text, not just its keywords. Those embeddings get stored in a vector database.

When a user asks a question, the question itself gets converted into a vector using the same method. The system then searches the database for the chunks whose meaning is closest to the meaning of the question — semantic similarity, not keyword matching. The most relevant chunks get retrieved and placed into the model’s context window alongside the original question. The model answers using that retrieved material as its working context.
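The retrieval step can be sketched in a few lines. Treat this as a toy illustration, not a production pattern: `embed` is a hypothetical stand-in that simply counts words, whereas a real deployment would call an embedding model and keep the vectors in a vector database. The names `embed`, `cosine` and `retrieve` are my own, chosen for illustration.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Hypothetical stand-in for an embedding model: a bag-of-words count.
    # A real embedding captures meaning; this only captures shared words.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question: str, index: list, k: int = 2) -> list:
    # Rank stored chunks by similarity to the question and keep the top k.
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

chunks = [
    "annual leave policy allows 25 days per year",
    "expense claims must be submitted within 30 days",
    "the vpn client is required for remote access",
]
index = [(c, embed(c)) for c in chunks]  # built in advance, like the library
question = "how many days of annual leave"
top = retrieve(question, index, k=1)
prompt = f"Context: {top[0]}\nQuestion: {question}"  # what the model would receive
```

The shape is the point: index once, embed the question the same way, rank by similarity, and place the winners in the prompt.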

Think of it as a library. Brilliantly organised, perfectly indexed, searchable by meaning rather than title. You walk in, the system finds the most relevant books, opens them to the right pages, and hands them to the model before it answers.

It’s powerful. For stable, curated knowledge bases it works extremely well.

But it has a ceiling, and the ceiling matters.

The library was shelved at a point in time. The moment your source documents change, your index is stale until you re-embed. And the quality of retrieval is entirely dependent on the quality of what went in. Poorly structured documents, inconsistent language, missing metadata — the embeddings become noisy and retrieval underperforms. The foundational principle holds here as firmly as anywhere in AI: weak data quality at the input stage leads to flawed outputs downstream. RAG doesn’t solve a data quality problem. It inherits it.
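One way to see the staleness problem is to compare a fingerprint recorded at embedding time with the document as it exists now. The sketch below assumes a hypothetical store that maps document IDs to content hashes; the function names and sample documents are illustrative only.

```python
import hashlib

def fingerprint(text: str) -> str:
    # Content hash used to tell whether a document changed since indexing.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def stale_documents(indexed: dict, current: dict) -> list:
    # indexed: doc id -> fingerprint recorded at embedding time (hypothetical store)
    # current: doc id -> text as it exists in the source system now
    return sorted(
        doc_id for doc_id, text in current.items()
        if indexed.get(doc_id) != fingerprint(text)
    )

indexed = {
    "hr-policy": fingerprint("25 days leave"),
    "it-guide": fingerprint("use the vpn"),
}
current = {
    "hr-policy": "28 days leave",   # edited since the index was built
    "it-guide": "use the vpn",      # unchanged
    "new-doc": "hello",             # never indexed
}
needs_reembedding = stale_documents(indexed, current)
```

Anything this returns is content the model cannot see correctly until it is re-chunked and re-embedded.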


MCP: The Living Plumbing

MCP — the Model Context Protocol — is a different kind of answer to the same problem. And understanding the difference is where the real business thinking begins.

MCP doesn’t retrieve from a pre-built index. It connects the model to live systems through their APIs — and queries them in real time, at the moment of the conversation.

Here’s what that means practically. Your SharePoint isn’t indexed in advance — the model calls it directly and gets back whatever is there right now, including the contract template someone updated this morning. Your CRM isn’t embedded into vectors — the model queries it and sees the deal that moved stage an hour ago. Your HR system, your procurement platform, your service desk — all of them accessible, all of them current, all of them live.

The model doesn’t see a snapshot of your organisation. It sees your organisation as it actually is, right now.
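To make the contrast with an index concrete, here is a minimal sketch of the idea of a tool that is called at question time. It is not the MCP SDK or wire protocol: the `CRM` dictionary, the tool name `crm.get_deal_stage` and the `call_tool` helper are all hypothetical stand-ins for a real system API.

```python
from typing import Callable

# Hypothetical in-memory stand-in for a live CRM; a real tool would call its API.
CRM = {"ACME": {"stage": "proposal"}}

def get_deal_stage(account: str) -> str:
    # Reads the CRM at call time, so the answer reflects the current record.
    return CRM[account]["stage"]

# Tool registry: names a model could ask to call, mapped to functions.
TOOLS: dict[str, Callable[..., str]] = {"crm.get_deal_stage": get_deal_stage}

def call_tool(name: str, **arguments: str) -> str:
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**arguments)

before = call_tool("crm.get_deal_stage", account="ACME")
CRM["ACME"]["stage"] = "negotiation"  # the deal moves stage in the source system
after = call_tool("crm.get_deal_stage", account="ACME")
```

Nothing was re-indexed between the two calls; the second answer changed because the source system changed.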

And here is the point that changes how you should think about this entirely.

Most enterprise knowledge isn’t in one place. It never has been. It’s fragmented across Salesforce and SAP, ServiceNow and SharePoint, HR platforms and finance systems and procurement tools. Getting RAG to span those systems requires significant data engineering effort — ingesting, normalising, embedding, maintaining. It’s achievable, but it’s heavy.

MCP connects to all of them. Through their APIs. Simultaneously. The model becomes a single conversational interface across the entire technology estate — not just one knowledge base, but the living information fabric of the organisation.

That is not a chatbot connected to some documents. That is a fundamentally different proposition.


Not Competitors — Different Layers

It would be tempting to read this as RAG versus MCP. It isn’t.

They solve overlapping problems at different layers and with different trade-offs. RAG is the right tool for large, stable knowledge corpora where semantic similarity search matters — where you need the model to find relevant material even when the exact words don’t appear in the query. MCP is the right tool where data is live, dynamic, and distributed across operational systems.

And they can work together. A well-architected system might use MCP as the orchestration layer — the model deciding which tools to call — while one of those tools triggers a RAG pipeline for a specific stable knowledge base. The plumbing and the library, working in concert.

The practical guidance is straightforward. Start with MCP. It’s the lower point of entry — no vector infrastructure to provision, no embedding pipelines to build and maintain, no index to keep fresh. You’re connecting to systems and APIs you already have. Reach for RAG when you’ve hit the ceiling — when the corpus is large, messy, and semantic retrieval across unstructured content becomes essential.

Start simple. Earn the complexity.


Before You Lay The Plumbing — What Nobody Tells You

The pitch for both RAG and MCP is compelling. The reality, as always, has a few sharp edges worth knowing about before you commit.

RAG brings infrastructure with it. RAG isn’t just a software pattern you switch on. Behind every vector database is a compute and storage requirement that needs provisioning, maintaining, and scaling as your knowledge base grows. Embedding pipelines need to run continuously — every time source content changes, chunks need re-processing and re-indexing or your library goes stale. For organisations already managing data centre complexity, this is a real cost conversation that rarely appears in the vendor presentation.

MCP makes your legacy systems load-bearing. MCP’s power is connecting to live systems. But those live systems are now dependencies. The legacy HR platform with the flaky API. The procurement system that slows under load. The CRM with three years of inconsistent data entry. Once the LLM is reaching across your technology estate, it is only as reliable as the weakest system it touches. A timeout, a bad API response, a data quality problem in one system degrades the entire interface. What felt like a peripheral legacy problem just became front and centre.
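A small sketch of what "weakest system" means in practice: if tool calls are not isolated, one timeout takes the whole answer down. The lookups below are hypothetical stand-ins, and catching only `TimeoutError` and `ConnectionError` is an assumption about how failures would surface.

```python
from typing import Callable

def flaky_hr_lookup() -> str:
    # Hypothetical legacy system that times out.
    raise TimeoutError("HR API did not respond")

def crm_lookup() -> str:
    return "3 open deals"

def gather(tools: dict[str, Callable[[], str]]) -> tuple[dict, list]:
    # Call every tool; record failures instead of letting one abort the rest.
    results: dict[str, str] = {}
    failures: list[str] = []
    for name, tool in tools.items():
        try:
            results[name] = tool()
        except (TimeoutError, ConnectionError) as exc:
            failures.append(f"{name}: {exc}")
    return results, failures

results, failures = gather({"crm": crm_lookup, "hr": flaky_hr_lookup})
```

The interface can then answer from what it has and say plainly which system was unavailable, rather than failing outright or guessing.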

Governance and security are not optional extras. When a model can traverse your entire technology estate — reading CRM data, querying HR systems, pulling procurement approvals — your entire technology estate needs to be ready for that conversation. Access controls, data classification, audit trails, API security, compliance boundaries. These cannot be bolted on after deployment. They need to be designed in from the start. MCP without a holistic governance and security view isn’t just risky. It’s an exposed surface at scale.
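As a rough illustration of "designed in from the start", every tool call can pass through an authorisation check that also writes an audit record. The roles, tool names and in-memory log here are hypothetical; a real deployment would defer to the organisation's identity and access management and a durable audit store.

```python
from datetime import datetime, timezone

# Hypothetical role-to-tool permissions, for illustration only.
PERMISSIONS = {"sales": {"crm.read"}, "hr_admin": {"crm.read", "hr.read"}}
AUDIT_LOG: list[dict[str, str]] = []

def authorise(role: str, tool: str) -> bool:
    allowed = tool in PERMISSIONS.get(role, set())
    # Record every attempt, allowed or not, so access can be reviewed later.
    AUDIT_LOG.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "role": role,
        "tool": tool,
        "decision": "allow" if allowed else "deny",
    })
    return allowed

sales_reads_hr = authorise("sales", "hr.read")
admin_reads_hr = authorise("hr_admin", "hr.read")
```

The check sits in front of the tool, not inside the model, so the same rule applies however the question was phrased.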

This is AI Reality. The plumbing is powerful. Lay it properly.


The Interface, The Plumbing, The Flow

Here is the frame I want to leave you with — because it’s the one that changes how you brief a customer, evaluate a vendor, or think about your own AI roadmap.

LLMs are becoming the interface to information. Not a search bar, not a dashboard, not a report. A conversational, reasoning interface that sits in front of your organisation’s entire data landscape and makes it accessible in plain language.

MCP is the plumbing. The connective tissue that links the interface to the living systems underneath — the CRM, the ERP, the HR platform, the document store, the data warehouse. Without the plumbing, the interface has nothing to work with. With it, the interface can see everything.

And once you have an interface and plumbing, something else becomes possible.

Agents.

Not models that answer questions. Models that act. That move through systems, make decisions, complete workflows, and hand off to humans at exactly the right moment. Agents ride the pipelines that MCP creates and turn information flow into work getting done.

That’s where this goes. And that’s what the next post is about.

Next: The Agentic Leap — when AI stops answering and starts acting.


The Interface. The Plumbing. The Flow.

LLM Sizing 101 – Part 3: Platform and GPU Selection

Figure: Schematic of the LLM sizing chain, with flowcharts covering model size, precision, tokens per second, GPU count, node count and platform specifications.

Mapping your sizing to Dell PowerEdge XE configurations

In Part 1 we nailed down the two fundamentals: parameters define how big the model is, and tokens define how much work you’re asking it to do.

In Part 2 we made it practical — translating a customer’s real-world requirements into a target tokens-per-second figure, and from there into a GPU count.

Now we make it concrete.

Building on the methodology from Part 2, we apply it to two representative scenarios — a 7B internal assistant and a 70B RAG system — and map everything to actual Dell PowerEdge XE platform configurations you can put in a proposal. But before we get to the reference designs, there’s a gotcha.


The Gotcha: Model Precision

There’s a variable that can silently double — or halve — your GPU count if you don’t nail it down early in the conversation.

Model precision.

When a customer says “we want to run a 70B model,” that sentence is incomplete. The question you need to ask immediately is: at what model precision?

Here’s why it matters so much. The memory footprint of a model is:

Model VRAM (GB) = number of parameters × bytes per parameter

And the bytes-per-parameter figure is entirely determined by precision:

Precision | Bytes per parameter | 70B model, weights only | Notes
FP32 | 4 bytes | ~280 GB | Training only; rare in inference
FP16 / BF16 | 2 bytes | ~140 GB | Full quality baseline
FP8 | 1 byte | ~70 GB | Requires H100, H200, B300 class
INT8 | 1 byte | ~70 GB | Broad hardware support
INT4 | 0.5 bytes | ~35 GB | Validate quality before committing
FP4 | 0.5 bytes | ~35 GB | Blackwell generation only (e.g. B300/GB300); first-class inference precision

Run the same 70B model at FP16 versus INT4 and the weights footprint changes by 4×. That’s the difference between needing two 8-GPU nodes and needing one. It’s the difference between a £400k proposal and a £200k proposal. And it’s a variable that’s completely invisible if you skip the precision conversation.
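The formula is simple enough to keep as a helper. This is a minimal sketch covering weights only; the dictionary and function names are my own, and it uses decimal gigabytes so that billions of parameters times bytes per parameter gives GB directly.

```python
# Bytes per parameter for common inference precisions.
BYTES_PER_PARAM = {
    "FP32": 4.0,
    "FP16": 2.0,
    "BF16": 2.0,
    "FP8": 1.0,
    "INT8": 1.0,
    "INT4": 0.5,
}

def weights_gb(params_billions: float, precision: str) -> float:
    # Weights only: excludes KV cache, activations and framework overhead.
    # Decimal GB (1 GB = 1e9 bytes), so billions of parameters x bytes = GB.
    return params_billions * BYTES_PER_PARAM[precision]

footprints = {p: weights_gb(70, p) for p in BYTES_PER_PARAM}
ratio_fp16_to_int4 = footprints["FP16"] / footprints["INT4"]
```

Swap 70 for the model you are actually sizing and the precision conversation becomes a number rather than an opinion.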

How to find out

The good news: model precision is almost always discoverable before you size anything.

The model card. Every published model has a model card stating the native training precision — typically FP32, BF16, or FP16 — and whether pre-quantised versions exist. Llama 3.1 405B, for example, is published in BF16 with a separate FP8-quantised version available for single-node deployment. That’s not a footnote — it’s a hardware decision.

The deployment framework. When a customer tells you they’re using vLLM, TensorRT-LLM, or NVIDIA NIM, the framework makes precision explicit. NIM profiles are named by precision — tensorrt_llm-h100-fp8-tp2-latency tells you the precision, the GPU, and the parallelism strategy in one string. If the customer has already chosen a framework, ask what precision they’re deploying at — they’ll either know, or the question will prompt them to find out.
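As an illustration of how much that one string carries, it can be split into its fields. This sketch assumes the five-field, hyphen-separated layout seen in the example above; it is not taken from the NIM documentation, and other profile names may not follow it. The field names are my own labels.

```python
def parse_profile(name: str) -> dict[str, str]:
    # Assumes the five-field layout seen in the example string:
    # backend-gpu-precision-parallelism-objective. Other layouts are rejected.
    parts = name.split("-")
    if len(parts) != 5:
        raise ValueError(f"unexpected profile layout: {name}")
    backend, gpu, precision, parallelism, objective = parts
    return {
        "backend": backend,
        "gpu": gpu,
        "precision": precision,
        "parallelism": parallelism,
        "objective": objective,
    }

profile = parse_profile("tensorrt_llm-h100-fp8-tp2-latency")
```

Here `precision` comes back as `fp8` and `parallelism` as `tp2`, which is two of the sizing inputs answered before the meeting has started.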

The GPU itself. Not all GPUs support all precisions. FP8 requires H100, H200, B300 or AMD MI300X class hardware. FP4 requires Blackwell-generation hardware such as B200, B300 and GB300; it isn’t available on Hopper or earlier generations. INT4 with hardware acceleration requires specific tensor core support. If the customer has already chosen a GPU, that constrains the precision options, and vice versa. The two decisions are linked.

The precision conversation in practice

When a customer names a model, these are the three questions that unlock the sizing:

  • “Are you using the native model weights, or a quantised version?”
  • “What serving framework are you planning to use?”
  • “Is some accuracy trade-off acceptable in exchange for a smaller hardware footprint?”

That last question is the most important one. Modern quantisation techniques — GPTQ, AWQ, SmoothQuant — preserve the vast majority of model quality for most enterprise inference workloads. The difference between BF16 and INT8 is typically imperceptible for summarisation, search, classification and code assistance. For complex multi-step reasoning or fine-tuned models, it’s worth validating. But for the majority of use cases, INT8 or FP8 is a legitimate production choice — not a compromise.

The rule of thumb for most enterprise inference workloads: the bigger the model, the more gracefully it quantises. At the same precision, a 70B model typically loses proportionally less quality than a 7B model.

Get precision wrong — or leave it undefined — and every GPU count in your proposal is built on a shaky foundation. Get it right, and you have a sizing conversation that’s grounded, defensible, and often more cost-effective than the customer expected.


Two Reference Designs

With precision established, everything else follows.

Sizing disclaimer: The reference designs below illustrate the methodology — they are not a substitute for your own sizing exercise. TPS figures, GPU counts and node recommendations are directional reference points based on representative workloads. Actual performance will vary with your specific model, serving framework, quantisation approach, batch configuration and workload pattern. Always validate against benchmark data for your environment before quoting or committing to a configuration.

These aren’t rigid prescriptions — they’re starting points you can adapt by adjusting the inputs and re-running the TPS maths from Part 2.


Reference Design A: 7B Internal Assistant

Use case: An internal productivity assistant — employees asking about policies, summarising documents, drafting emails. High concurrency, moderate latency sensitivity, cost-conscious.

1. Define the workload

Parameter | Value
Concurrent users (peak) | 500
Average prompt | 400 tokens
Average response | 250 tokens
Target response time | ~8–10 seconds
Acceptable TTFT | < 2 seconds
Model | 7B class

2. Establish precision and memory footprint

For a 7B model:

Precision | Weights footprint | Fits on a single GPU?
FP16 / BF16 | ~14 GB | Yes (48–80 GB class)
INT8 | ~7 GB | Yes, comfortably
INT4 | ~3.5 GB | Yes, with significant headroom

For a high-concurrency internal assistant, INT8 or mixed precision (weights in INT8, activations in FP16/BF16) is the practical default. It fits cleanly on a single GPU, leaves room for KV cache and batching overhead, and the quality trade-off is negligible for this kind of workload.

3. Translate to TPS

  • 250 output tokens ÷ 10 seconds = 25 tokens/sec per user
  • 500 users × 25 tokens/sec = 12,500 tokens/sec system TPS
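The two steps above can be written as a pair of small functions. A minimal sketch, with function names of my own choosing; it uses the slower end of the target response time (10 seconds), as the worked figures do.

```python
def per_user_tps(output_tokens: int, response_seconds: float) -> float:
    # Tokens per second one user needs to receive a full response in time.
    return output_tokens / response_seconds

def system_tps(concurrent_users: int, output_tokens: int, response_seconds: float) -> float:
    # Aggregate decode throughput if every concurrent user is generating at once.
    return concurrent_users * per_user_tps(output_tokens, response_seconds)

design_a_per_user = per_user_tps(output_tokens=250, response_seconds=10)
design_a_system = system_tps(concurrent_users=500, output_tokens=250, response_seconds=10)
```

Tightening the target to 8 seconds pushes the per-user figure to about 31 tokens/sec, which is why the response-time target deserves as much scrutiny as the user count.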

4. Per-GPU TPS estimate

For a 7B model at INT8/mixed precision, batched decode on a high-end accelerator:

GPU | Approx. TPS (7B, batched decode)
H100 80GB SXM | ~2,000–3,000
H200 141GB | ~2,500–3,500
L40S 48GB | ~1,000–1,500
B300 288GB | ~4,000–6,000 (est.)

Conservative estimate: 1,500 TPS per GPU on current generation; higher on B300.

5. GPU and node count

  • 12,500 TPS ÷ 1,500 TPS/GPU ≈ 8.3 GPUs
  • Add 25% headroom: 8.3 × 1.25 ≈ 10.4 GPUs → round up to 12 for a clean 3 × 4 configuration
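The same arithmetic as a reusable sketch. The function name and defaults are assumptions for illustration: 25% headroom and 4 GPUs per node match this reference design, and the result is rounded up to whole nodes.

```python
import math

def gpus_required(target_tps: float, tps_per_gpu: float,
                  headroom: float = 0.25, gpus_per_node: int = 4) -> tuple[int, int]:
    # Returns (GPU count rounded up to whole nodes, node count).
    raw_gpus = target_tps / tps_per_gpu * (1 + headroom)
    nodes = math.ceil(raw_gpus / gpus_per_node)
    return nodes * gpus_per_node, nodes

gpus, nodes = gpus_required(target_tps=12_500, tps_per_gpu=1_500)
```

Change `tps_per_gpu` to a benchmarked figure for the customer's model and framework, and the node count follows without reworking the rest of the chain.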

6. Platform mapping

A 7B model at INT8 fits on a single GPU — no tensor parallelism required. Each GPU runs an independent model replica and you scale out horizontally across nodes. This is compact, balanced GPU server territory.

The Dell PowerEdge XE7745 is the natural fit for this workload class: a 2U platform supporting up to 4 high-memory GPUs, designed for exactly this kind of inference deployment. For organisations planning ahead with Blackwell, the XE7745 also supports NVIDIA RTX Pro 6000 Blackwell Server Edition GPUs — a professional-grade accelerator with 96 GB GDDR7 that offers significant headroom for a 7B workload and future-proofing for multi-model environments, at a lower power and cost envelope than data centre HBM-class GPUs.

“For a 7B internal assistant serving ~500 concurrent users, a small cluster of three PowerEdge XE7745 nodes gives you a responsive chat experience, capacity to grow, and the flexibility to host multiple models or environments — all in a standard rack footprint.”


Reference Design B: 70B RAG System

Use case: Knowledge-heavy workflows — legal, financial or engineering teams querying proprietary documents via a RAG pipeline. Quality matters more than raw user count. Concurrency is moderate.

1. Define the workload

Parameter | Value
Concurrent users (peak) | 100
Average prompt | 2,000 tokens
Average response | 500 tokens
Target response time | ~12–15 seconds
Acceptable TTFT | < 3 seconds
Model | 70B class

Prompts are longer here because RAG injects retrieved document snippets, conversation history and system instructions into every request. That longer context window drives up KV cache memory — which is why the platform choice shifts significantly compared to Reference Design A.
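To put a rough number on that KV cache pressure, here is a sketch using the standard per-token calculation: one key and one value vector per layer per KV head. The model shape is an assumption (80 layers, 8 KV heads with grouped-query attention, head size 128, as in Llama-style 70B models), as is an FP16 cache and every user holding a full prompt plus response at once.

```python
def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_value: float = 2.0) -> float:
    # Each token stores one key and one value vector per layer per KV head.
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * bytes_per_token / 1e9  # decimal GB

# Assumed worst case: all 100 users hold a 2,000-token prompt and 500-token response.
tokens_in_flight = 100 * (2_000 + 500)
design_b_kv_gb = kv_cache_gb(tokens_in_flight, layers=80, kv_heads=8, head_dim=128)
```

Under those assumptions the cache alone comes to roughly 82 GB, on top of the weights. A model without grouped-query attention, or a longer context, would need considerably more.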

2. Establish precision and memory footprint

This is where precision has the biggest impact on the proposal — and where the B300 changes the calculus significantly:

Precision | Weights footprint | H100/H200 GPUs needed | B300 GPUs needed | Notes
FP16 / BF16 | ~140 GB | 2 minimum | 1 (fits with headroom) | Full quality; B300’s 288 GB changes the equation
FP8 | ~70 GB | 1 minimum | 1 | Near-FP16 quality; requires H100/H200/B300
INT8 | ~70 GB | 1 minimum | 1 | Minimal quality loss for most workloads
INT4 | ~35 GB | 1 | 1 | Validate quality before committing
FP4 | ~35 GB | N/A (Blackwell only) | 1, with substantial headroom | Validate quality for RAG use cases

A key implication for B300 deployments: with 288 GB of HBM3e per GPU, a 70B model at FP16 (~140 GB weights) fits on a single B300. That eliminates the need for tensor parallelism within the node for this model size, simplifying the architecture and reducing interconnect dependency.
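The minimum GPU counts for holding the weights can be reproduced with a one-line calculation. A sketch only: `usable_fraction` is my own assumed allowance for runtime overhead, not a vendor figure, and the result covers weights alone, so KV cache and batching will push real deployments higher.

```python
import math

def min_gpus_for_weights(weights_gb: float, gpu_vram_gb: float,
                         usable_fraction: float = 0.9) -> int:
    # usable_fraction is an assumed allowance for runtime overhead.
    return math.ceil(weights_gb / (gpu_vram_gb * usable_fraction))

fp16_on_h100 = min_gpus_for_weights(140, 80)
fp16_on_h200 = min_gpus_for_weights(140, 141)
fp16_on_b300 = min_gpus_for_weights(140, 288)
fp8_on_h100 = min_gpus_for_weights(70, 80)
```

Note how close the H200 case is: 140 GB of weights against 141 GB of memory fits on paper, but leaves nothing to run with, which is why it still counts as a two-GPU configuration.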

For a legal or financial RAG workload where output quality is the primary requirement, FP16 or FP8 remains the right starting point. FP4 on B300 is increasingly viable but worth validating explicitly against the customer’s specific domain before committing.

3. Translate to TPS

  • 500 output tokens ÷ 15 seconds = 33 tokens/sec per user
  • 100 users × 33 tokens/sec = 3,300 tokens/sec system TPS

4. Per-node TPS estimate

For a 70B model running on high-end accelerators:

Configuration | Approx. TPS (70B, batched decode)
4× H100 80GB (tensor parallel) | ~800–1,200
8× H100 80GB (tensor parallel) | ~1,500–2,500
8× H200 141GB (tensor parallel) | ~2,000–3,500
8× B300 288GB (HGX B300) | ~4,000–7,000 (est.)

Conservative estimate on an 8× H100 node: 1,500 TPS. On an 8× B300 node: significantly higher, with the added benefit that each GPU can host the full model independently.

5. Node count

  • 3,300 TPS ÷ 1,500 TPS/node (H100 baseline) ≈ 2.2 nodes
  • Add 25–30% headroom: 2.2 × 1.25 ≈ 2.75 → round to 3
  • Total: 3 nodes × 8 GPUs = 24 GPUs (H100/H200 baseline)

On B300 hardware, the same TPS target is achievable with fewer nodes — or the same node count delivers substantially higher capacity.

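Three nodes also gives you operational flexibility: you can drain one for maintenance and keep serving. On the conservative 1,500 TPS/node estimate, the remaining two nodes deliver about 3,000 TPS, roughly 90% of the 3,300 TPS peak target, so schedule maintenance outside peak hours or add a fourth node if full peak capacity must be preserved.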

6. Platform mapping

For H100/H200 deployments, the Dell PowerEdge XE9680 with 8× H100 or H200 GPUs remains a proven reference platform for 70B inference, with NVLink and NVSwitch providing the fast GPU-to-GPU interconnect tensor parallelism requires.

For Blackwell deployments, the Dell PowerEdge XE9780 and XE9785 are the direct successors to the XE9680 — delivering up to 4× faster LLM performance with the 8-way NVIDIA HGX B300. The liquid-cooled XE9780L and XE9785L variants support higher GPU densities for rack-scale deployments.

Infrastructure note: B300 systems require liquid cooling, 800 Gb/s networking, and power densities that most existing facilities cannot support without upgrade. The B300 draws 1,400W TDP per GPU — 40% more than the B200, and double the H100. Factor facility readiness into any B300 sizing conversation before committing to a configuration.

“For a 70B RAG assistant used by specialist teams — legal, finance, engineering — the PowerEdge XE9680 with H100/H200 GPUs remains a strong proven choice. For organisations investing in Blackwell infrastructure, the XE9780/XE9785 with HGX B300 delivers significantly higher throughput and eliminates tensor parallelism requirements for 70B class models — but facility readiness for liquid cooling and power density must be confirmed first.”


The Platform Decision in Summary

Workload | Model | Precision | Current Platform | Blackwell Platform | GPUs/node | Nodes
Internal assistant (high concurrency) | 7B | INT8 | PowerEdge XE7745 | XE7745 (RTX Pro 6000 BW) | 4× GPUs | 3
RAG system (quality-first) | 70B | FP16 / FP8 | PowerEdge XE9680 | XE9780 / XE9785 (HGX B300) | 8× GPUs | 3

The pattern is consistent: model size drives platform class, precision drives memory footprint, TPS drives node count. Miss any one of those three and the sizing is incomplete.

For organisations moving beyond 70B — frontier models, multi-tenant inference at scale, or combined training and inference workloads — the Dell PowerEdge XE9712 featuring NVIDIA GB300 NVL72 is the next step up. With 72 Blackwell Ultra GPUs and up to 40 TB of fast memory per rack (combining ~20 TB of GPU HBM3e across 72 GPUs and ~17 TB of Grace CPU LPDDR5X), it delivers exascale-class AI performance for workloads that have outgrown the per-node sizing conversation entirely. That’s a different discussion — but it starts with the same methodology.


Three Trade-offs Worth Raising

Once you’ve walked a customer through a reference design, three conversations typically follow.

1. Can we use a smaller model? Sometimes yes — and it’s worth exploring. A well-tuned 13B model can deliver surprisingly strong results for many enterprise use cases, at a fraction of the infrastructure cost of a 70B. The right answer depends on the use case, not just the budget.

2. Can we quantise to reduce the footprint? INT8 quantisation roughly halves the memory footprint relative to FP16, with minimal quality loss for most inference workloads. INT4 goes further, but quality trade-offs become more noticeable and are worth validating before committing. FP4 on B300 hardware is the emerging sweet spot for next-generation inference: near-FP8 quality at half the memory cost, with hardware-accelerated compute. It does, however, require Blackwell-generation infrastructure.

3. What about fine-tuning? If the customer plans to fine-tune as well as infer, size for fine-tuning — it’s the more demanding workload. Fine-tuning requires storing optimiser states and gradients alongside the model weights, which can triple or quadruple the VRAM requirement compared to inference alone. A platform sized for fine-tuning will handle inference comfortably.
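Where the "triple or quadruple" comes from can be sketched with a simple count of tensors. This assumes full fine-tuning with gradients and optimiser state held at the same precision as the weights: one extra copy for SGD with momentum, two for Adam-style optimisers. It ignores activations, and mixed-precision setups that keep FP32 master weights will land higher. The function names are illustrative.

```python
def finetune_multiplier(optimizer_state_copies: int) -> int:
    # weights (1) + gradients (1) + optimiser state tensors, all at the same precision.
    return 1 + 1 + optimizer_state_copies

def finetune_vram_gb(weights_gb: float, optimizer_state_copies: int) -> float:
    return weights_gb * finetune_multiplier(optimizer_state_copies)

# 7B model at FP16: ~14 GB of weights.
sgd_momentum_gb = finetune_vram_gb(14, optimizer_state_copies=1)  # 3x
adam_gb = finetune_vram_gb(14, optimizer_state_copies=2)          # 4x
```

Treat these as a floor for the conversation, not a quote: batch size and sequence length drive activation memory, which this sketch leaves out entirely.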


What’s Next

With three posts, we’ve built a complete sizing chain:

  • Part 1: Parameters and tokens — the two dials that drive every sizing decision
  • Part 2: From tokens per second to GPU count — the maths that connects users to hardware
  • Part 3: Precision, platform selection, and reference designs — where the maths meets the metal

The natural next conversation is the one that follows a sizing recommendation: how does an on-premises PowerEdge deployment compare to cloud over three years? That’s the cost modelling discussion — and it’s where a well-sized on-premises platform often tells a very different story to the cloud bill the customer is currently paying.



Generative AI in Business: From Hype to Everyday Use

Figure: Schematic of a generative AI architecture for enterprise deployment, showing the flow from data infrastructure to natural language input and response, with the data engineer, the ML backbone and a distribution grid for business users.

We can talk to our data – and it can talk back.

I grew up on science fiction, and it’s still my favourite genre. As a kid I watched Captain Kirk speak to the Enterprise computer: he’d ask a question, the ship would analyse everything it knew, and then calmly talk back in plain language. Later it was HAL in 2001: A Space Odyssey – a computer you could converse with, for better or worse.

Here’s the thing: what was fiction is now reality.

For the first time, we can ask our business systems questions in natural language – about customers, operations, risks, performance – and get meaningful answers back, grounded in our own data. That’s the practical side of generative AI that matters: not party tricks in the browser, but the ability to talk to the business and have it talk back in ways that save time, reduce errors and unlock real ROI.


From finding the golden process to opening it up

In my earlier blogs, I’ve talked about machine learning as the discipline of finding that golden process: the one worth fixing, optimising, or automating. ML is brilliant when you can frame a problem as a narrow, well-defined prediction task: given this input, what is the most likely outcome? It forecasts demand, scores risk, flags likely defects, and fine-tunes routes and schedules.

But here’s what ML cannot do: it cannot explain itself in plain English. It cannot draft the shift handover report. It cannot answer the technician’s question about a procedure buried in a manual. It cannot have a conversation.

That’s where generative AI enters the refinery.

Think back to our crude oil analogy. If machine learning is the refining process – turning raw data into specific, purpose-built outputs – then generative AI is the distribution grid. It takes everything the refinery produces and makes it accessible to people who never needed to understand how the refinery works.

Data Engineers still run the pipelines. Data Scientists still interpret the outputs and build the models. But now, the sales manager, the warehouse supervisor, the contact centre agent – they can walk up to a terminal, ask a question in plain English, and get a meaningful answer grounded in real business data.

That’s the shift. ML found the golden process. GenAI opens it up to everyone.


What’s actually new

Most organisations have already met “classic” AI, even if they don’t call it that. Traditional machine learning classifies, predicts and optimises. It works brilliantly within defined boundaries.

Generative AI adds a different kind of capability on top of that backbone:

  • It can read and summarise large volumes of unstructured text – emails, PDFs, reports, notes.
  • It can draft content: responses, proposals, documentation, code.
  • It provides a conversational interface into your data and systems.
  • It can generate variations – alternative phrasings, translations, test cases, outreach angles.

If traditional ML optimises numbers and events, generative AI optimises words and workflows. That makes it especially powerful in the gaps between your systems: the email trail, the shared drive full of PDFs, the complaint notes, the procedures that nobody can ever quite find.

The value isn’t in talking to an LLM in the cloud for its own sake. It’s in embedding these capabilities directly into the work people are already doing.


Four use cases that are returning real ROI

The patterns that consistently deliver value share a few traits: they’re anchored in existing processes, they use data you already control, and they have clear owners and measurable outcomes. Here are four that I see working across industries right now.

1. Customer service copilots

Contact centres have long used AI for call routing and basic automation. Generative AI takes this further by acting as a copilot for human agents – surfacing customer history and relevant knowledge articles the moment a case comes in, proposing draft replies that agents can review and send, suggesting next best actions based on similar cases, and automatically updating tickets and notes from the conversation.

This doesn’t replace the agent. It removes the searching, copying and typing that sits around the real work.

In practice: A regional insurer deployed a generative copilot alongside its existing contact centre platform for 120 agents. Within three months, average email handle time fell by 23%, first contact resolution improved by nine percentage points, and new hire time to competency dropped from twelve weeks to eight. No new platform was purchased – they configured the generative features already included in the suite they owned.

2. Knowledge assistants for frontline staff

In most organisations, the people who most need answers – technicians, nurses, warehouse supervisors, branch staff – are the ones with the least time to hunt through policy documents and manuals. Generative AI can sit on top of your existing documentation and act as a conversational handbook.

A technician asks: “What’s the safety procedure for this task on model X?” and gets a precise answer with a link back to the source. A warehouse team member asks: “What do we do when a pallet arrives damaged?” and sees the current procedure with the right forms attached.

Under the hood, this uses retrieval augmented generation – pulling relevant content from your own documents, not from the open internet. The answers are grounded in your procedures, your standards, your data.

3. Sales and account management copilots

Sales teams are already surrounded by data: CRM records, emails, meeting notes, product catalogues, pricing rules. Very little of it is easy to use in the moment.

Generative AI can act as a deal desk in your pocket – producing account summaries from CRM history, turning meeting transcripts into action-oriented follow-up emails, and drafting proposal text by combining standard boilerplate with customer-specific details.

In practice: A UK-based B2B distributor introduced a generative assistant inside its CRM. Reps could click “summarise” on any opportunity and get a one-paragraph overview of needs, stakeholders and risks. First-draft proposals were generated from product data, standard terms and prior similar deals. After six months, average proposal turnaround time dropped from 5.2 days to 2.7 days, the number of opportunities receiving a formal proposal before closing increased by 18%, and win rates on AI-assisted proposals ran four to five points above the historic average – not because the AI was magical, but because more prospects received a timely, coherent response instead of a rushed one.

4. Paperwork killers in operations

Manufacturing, logistics and field-based work are full of semi-structured documents: inspection reports, delivery notes, shift handover emails, maintenance logs. Generative AI helps at both ends of this process.

On the input side, it reads and structures information – extracting key fields from free text, normalising inconsistent terminology, flagging anomalies. On the output side, it drafts standardised documents: turning technician notes into formal incident reports, auto-drafting non-conformance records, generating customer-ready summaries of delays and resolutions.

The ROI here is quiet but consistent: less administrative overhead per job or shift, fewer errors in documentation, faster handovers, and better auditability without anyone doing extra work.


This is not just for the Fortune 500

A few years ago, meaningful AI work did feel like the preserve of companies with research labs and large data science teams. Three things have changed that:

Infrastructure is now a service. You don’t need to run your own GPU cluster. Cloud platforms and partners provide the models and the plumbing; you focus on the use case and the data.

Your existing vendors are already there. CRM, ERP, contact centre, productivity suites – most now ship with generative features as standard or as add-ons. For many organisations, adopting generative AI means configuring what you already pay for, not buying an entirely new stack.

The skills barrier is lower than it looks. You still need expertise in data, integration and governance – which is exactly where Charlie earns his keep. But a significant proportion of high-value work now looks like defining good prompts and templates, curating the right documents, and designing sensible review workflows. The data foundation Clare relies on for her models is the same foundation that powers a generative assistant. The platform is the same; the interface is new.

The main differentiators between leaders and laggards are no longer who has the biggest lab. They’re who has their data in reasonable shape, clear processes they want to improve, and the discipline to run small, focused pilots and scale what works.


A straightforward way to get started

Start with a specific problem. Resist the temptation to open with “we need a GenAI strategy.” Start with: “Our agents spend 40% of their time on after-call notes.” Or: “Our engineers hate writing shift handover reports.” Make the problem specific, measurable and owned by a business leader.

Choose a constrained pilot. One team, one process, one geography. Eight to twelve weeks from start to learning. Clear metrics – time saved, error rates, throughput, whatever actually matters.

Use off-the-shelf building blocks first. In 2026, there is rarely a good reason for a typical organisation to start by training its own large language model. Turn on and configure the copilots your existing vendors provide. Use retrieval-augmented assistants over your own documents. Work with partners who understand both your industry and the tooling. Custom models come later, if the business case ever justifies them.

Wrap everything in governance. Decide what data can and cannot be used. Keep humans in the loop for decisions with legal, safety or reputational consequences. Make sure users can see where an answer came from. Monitor for errors and have a feedback mechanism. The goal is not to eliminate all risk; it’s to manage it the way you’d manage any powerful tool.


From backbone to interface

If machine learning is the backbone that finds and optimises the processes worth improving, generative AI is becoming the interface layer that lets people work with those processes more naturally.

Its most valuable contributions are not mysterious. They turn scattered information into usable knowledge. They turn blank pages into workable first drafts. They turn complex systems into conversational tools.

The organisations that will get the most from generative AI over the next few years are unlikely to be the ones with the flashiest demos. They will be the ones quietly embedding these capabilities into everyday processes – and letting the ROI accumulate in faster cycles, fewer errors and better experiences.

In that sense, generative AI is following the same path as every meaningful technology shift in business.

After the hype comes the hard work.

And that’s where it gets interesting.

The Token Cost — New Line on the Spreadsheet


A budget breakdown chart outlining costs related to AI infrastructure, including license costs, infrastructure, staff costs, cloud compute, storage per GB, and a new line item for token costs, with arrows indicating actions required.

The Spreadsheet Never Lies

Back when I was an IT Manager, budget time was the one part of the job I genuinely dreaded.

I was technically biased — give me an infrastructure problem over a finance meeting any day. But the spreadsheet had to be built. So out it came, year after year. Rows and columns of licence costs, support contracts, hardware refresh cycles, staff costs, cloud compute, storage per GB. Every line item accounted for, justified, and defended.

Over the years that spreadsheet grew new rows. Cloud costs arrived and changed everything — suddenly you weren’t buying hardware, you were buying consumption. Then came storage costs per GB, virtual machine sprawl, networking costs, SaaS licensing, and the ongoing headache of software nobody was using but everyone was paying for.

Good IT management has always meant knowing what things cost. Not approximately. Precisely.

Now there is a new line to add to that spreadsheet.

Token cost.

But here is the thing. If you stop at the token line, you are optimising for the meter, not the mission.


Tokens 101 — What They Actually Are

Before the cost makes sense, the concept needs to.

A token is the basic unit an LLM uses to process text. Not a word. Not a character. Something in between — a chunk of text that the model reads, processes, and responds to.

When you type a message to a chatbot, the model doesn’t read it the way you wrote it. It breaks it into tokens first — fragments of words, whole words, punctuation, spaces — and processes each one in sequence. The response it generates is also built token by token, each one predicted from everything that came before.

A rough rule of thumb: one token is approximately four characters, or about three quarters of a word. A typical sentence of fifteen words is roughly twenty tokens. A detailed prompt of five hundred words is somewhere around six hundred and fifty tokens.

It adds up quickly. And every token processed — whether going in or coming out — carries a price.
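
The rule of thumb above can be turned into a quick estimator. This is a minimal sketch, assuming English text and the approximate ratios quoted above; the function names are illustrative, and a real tokenizer for your chosen model will give different counts.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count (a rule of thumb, not a tokenizer)."""
    if not text:
        return 0
    return max(1, round(len(text) / chars_per_token))


def estimate_tokens_from_words(word_count: int, words_per_token: float = 0.75) -> int:
    """Rough token estimate from a word count, using ~0.75 words per token."""
    return round(word_count / words_per_token)


print(estimate_tokens_from_words(15))   # a fifteen-word sentence
print(estimate_tokens_from_words(500))  # a five-hundred-word prompt
```

Use it for budgeting conversations only. For billing-grade numbers, count tokens with the tokenizer that ships with the model you are actually using.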


Tokens Are a Meter, Not a Currency

There is a phrase doing the rounds right now. Tokens are the new currency of AI.

It is a neat soundbite. It is also wrong in all the ways that matter if you are trying to build serious AI capability.

Saying tokens are a currency is like saying you paid your electricity bill in kilowatt hours. You didn’t. You consumed kilowatt hours. You paid in money. The kilowatt hour is a unit of consumption — a meter reading, not a medium of exchange.

Tokens are exactly the same. They measure how much work a model is doing. They are the unit on which vendors calculate your bill. But they are not currency. They are consumption — and like every unit of consumption in IT, they carry a cost that needs understanding, governing, and optimising.

The organisations that treat tokens as a vanity metric — “we consumed X billion tokens last quarter!” — are optimising for the wrong number entirely.


The AI Factory and the Cost Behind Every Token

Dell and NVIDIA use the term AI Factory deliberately — because building AI capability at scale really does look like industrial infrastructure. Data pipelines, compute clusters, model serving layers, orchestration, guardrails. A factory for producing AI output at volume.

And like any factory, every unit of output carries a cost of production.

In an AI Factory, the token is the unit of output. And behind every token sits a cost stack most organisations never fully account for.

Infrastructure — GPU and accelerator time, CPU, RAM, networking, storage, cooling, power. Whether you see this directly or it is baked into a vendor’s price per thousand tokens, it is always there.

Model and platform — licensing for proprietary models, platform margin, optional add-ons for latency, SLAs, and private endpoints. Every provider has a margin sitting in the background of every token.

Data and training — models don’t appear from nowhere. Data acquisition, cleaning, fine-tuning, retrieval pipelines, continuous evaluation. All of it is part of the cost of making your tokens useful in your specific context, not just smart in general.

People — ML engineers, platform teams, application developers, security, compliance, prompt engineers. Labour is amortised over output. From a factory lens, every token carries a share of your people cost.

Guardrails and control — orchestration, content filters, safety checks, observability, caching, A/B testing. These are the conveyor belts and safety systems of your AI Factory. They rarely appear on a per-token price card. They always appear on your balance sheet.

The vendor gives you a clean price per thousand tokens. Your real cost per thousand tokens is considerably messier — and considerably higher.


From Token Cost to Outcome Cost

Here is where the conversation needs to move.

A token is a unit of cost. It is not a unit of value. And on its own, cost per token tells you almost nothing about whether your AI investment is working.

The number that actually matters is cost per outcome.

Swap abstract token consumption for something real: tokens per resolved support ticket. Tokens per sales proposal generated. Tokens per code review completed. Tokens per knowledge worker hour saved. Now you can build a unit economic view that means something.

Cost per outcome = (Tokens per outcome ÷ 1,000 × fully loaded cost per thousand tokens) + overheads

Unit margin = Value per outcome − Cost per outcome

Once you see it this way, the conversations become sharper. A cheaper model per thousand tokens that requires three times the tokens per outcome is not a saving. A use case that looks expensive in tokens but delivers enormous value per outcome is not a problem. A system regenerating the same content repeatedly because nobody implemented caching is a straightforward fix hiding in plain sight.
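
To make the two formulas concrete, here is a small sketch. The prices and token counts are hypothetical, chosen only to illustrate the "cheaper model, three times the tokens" case described above. Note that tokens are divided by 1,000 because the cost is quoted per thousand tokens.

```python
def cost_per_outcome(tokens_per_outcome: float,
                     loaded_cost_per_1k_tokens: float,
                     overheads: float = 0.0) -> float:
    """Cost of one business outcome, e.g. one resolved support ticket."""
    return (tokens_per_outcome / 1000.0) * loaded_cost_per_1k_tokens + overheads


def unit_margin(value_per_outcome: float, cost: float) -> float:
    """Value delivered by one outcome minus what it cost to produce."""
    return value_per_outcome - cost


# Hypothetical figures: the cheap model needs three times the tokens per outcome.
cheap = cost_per_outcome(tokens_per_outcome=9_000, loaded_cost_per_1k_tokens=0.002)
capable = cost_per_outcome(tokens_per_outcome=3_000, loaded_cost_per_1k_tokens=0.005)

print(f"cheap model:   {cheap:.4f} per outcome")
print(f"capable model: {capable:.4f} per outcome")
```

With these assumed numbers the model that is cheaper per thousand tokens ends up more expensive per outcome, which is exactly the trap the per-token price hides.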


The Levers: Token Productivity in the AI Factory

If tokens are the output of your AI Factory, token productivity is your primary optimisation lever.

Use the right model for the job. Not everything needs your largest, most capable model. Smaller, cheaper models handle classification, routing, and simple transforms well. Reserve the heavy models for genuinely complex reasoning. A tiered approach — cheap model first, escalate only when needed — can dramatically change your cost per outcome without touching quality.
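
A tiered approach can be sketched in a few lines. This is an illustration under stated assumptions: the two "models" are stand-in functions, and the idea that the small model returns a usable confidence score is itself an assumption. In practice the escalation signal might be a classifier, a validation check, or a rule.

```python
from typing import Callable, Tuple

# A "model" here is any callable returning (answer, confidence between 0 and 1).
Model = Callable[[str], Tuple[str, float]]


def tiered_answer(prompt: str, small: Model, large: Model,
                  min_confidence: float = 0.8) -> Tuple[str, str]:
    """Try the cheap model first; escalate only if it is not confident enough."""
    answer, confidence = small(prompt)
    if confidence >= min_confidence:
        return answer, "small"
    answer, _ = large(prompt)
    return answer, "large"


# Stand-in models for illustration only; real ones would call an inference API.
def small_stub(prompt: str) -> Tuple[str, float]:
    confident = len(prompt.split()) <= 8  # pretend short prompts are easy
    return "small-model answer", 0.9 if confident else 0.4


def large_stub(prompt: str) -> Tuple[str, float]:
    return "large-model answer", 0.95


easy = tiered_answer("Route this ticket to billing", small_stub, large_stub)
hard = tiered_answer(
    "Compare these two contracts clause by clause and list every material difference",
    small_stub, large_stub)
print(easy)
print(hard)
```

The design point is that the expensive call only happens on the escalation path, so the share of traffic the small model can handle drives your cost per outcome.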

Optimise prompts and context. Long system prompts and bloated context windows feel powerful. They are also expensive. Strip repetition, keep only relevant context, use structured inputs where possible. Every unnecessary sentence in a prompt is scrap material on the factory floor — and in a high-volume system, scrap accumulates fast.

Cache intelligently. A significant proportion of enterprise AI workloads are repetitive — similar questions, standard documents, known sub-tasks. Response caching, retrieval caching, and partial caching of intermediate steps reduce tokens per outcome without any loss of quality. It is one of the highest-return optimisations available and one of the most consistently overlooked.
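
As a minimal sketch of response caching, the class below keys answers on a normalised prompt so that trivially different phrasings of the same question hit the cache. The class name, the normalisation rule, and the in-memory dictionary are all assumptions for illustration; a production system would also need expiry and invalidation when source data changes.

```python
import hashlib
from typing import Callable, Dict


class ResponseCache:
    """Exact-match cache over normalised prompts (illustrative only)."""

    def __init__(self, generate: Callable[[str], str]) -> None:
        self._generate = generate
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prompt: str) -> str:
        # Lower-case and collapse whitespace so near-identical prompts share a key.
        normalised = " ".join(prompt.lower().split())
        return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

    def ask(self, prompt: str) -> str:
        key = self._key(prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        answer = self._generate(prompt)  # only cache misses consume tokens
        self._store[key] = answer
        return answer


calls = []


def fake_generate(prompt: str) -> str:
    """Stand-in for a model call; records each invocation."""
    calls.append(prompt)
    return f"answer to: {prompt}"


cache = ResponseCache(fake_generate)
first = cache.ask("What is our returns policy?")
second = cache.ask("what is our   returns policy?")
print(cache.hits, cache.misses, len(calls))
```

The second question is answered from the cache, so the stand-in model is only called once.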

Design around outcomes, not demos. Demos optimise for the impressive moment. Factories optimise for throughput and margin. Start from the business outcome, the current human cost of achieving it, and the target cost with AI. Then design the system backwards from that constraint — not forwards from whatever the latest model happens to be capable of.


Token Cost as a Governance Question

This is familiar territory for anyone who has managed cloud costs or software licensing.

Token consumption is a shared resource. Different business units, different applications, and different use cases will consume it at different rates and generate very different outcomes per token. Without visibility into that consumption — tracked by application, by business unit, by use case — you have no basis for budgeting, no mechanism for chargeback, and no way to identify where usage is growing faster than the value it is generating.
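
The visibility described above starts with a simple roll-up. The sketch below assumes you can log one record per request with a business unit, an application, token counts, and the number of outcomes delivered. The field names, the sample records, and the cost per thousand tokens are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def summarise_usage(records: List[dict], cost_per_1k_tokens: float) -> Dict[Tuple[str, str], dict]:
    """Aggregate tokens, cost and cost per outcome by (business unit, application)."""
    totals = defaultdict(lambda: {"tokens": 0, "outcomes": 0})
    for r in records:
        key = (r["business_unit"], r["application"])
        totals[key]["tokens"] += r["input_tokens"] + r["output_tokens"]
        totals[key]["outcomes"] += r["outcomes"]

    report = {}
    for key, t in totals.items():
        cost = t["tokens"] / 1000.0 * cost_per_1k_tokens
        report[key] = {
            "tokens": t["tokens"],
            "cost": cost,
            # None when nothing was delivered: tokens spent, no outcome to show for it.
            "cost_per_outcome": cost / t["outcomes"] if t["outcomes"] else None,
        }
    return report


# Hypothetical usage log.
records = [
    {"business_unit": "sales", "application": "proposal-drafter",
     "input_tokens": 40_000, "output_tokens": 20_000, "outcomes": 30},
    {"business_unit": "sales", "application": "proposal-drafter",
     "input_tokens": 20_000, "output_tokens": 10_000, "outcomes": 15},
    {"business_unit": "support", "application": "agent-copilot",
     "input_tokens": 150_000, "output_tokens": 50_000, "outcomes": 400},
]

report = summarise_usage(records, cost_per_1k_tokens=0.01)
for key, row in report.items():
    print(key, row)
```

Once the numbers are grouped this way, chargeback and "where is usage outgrowing value?" become ordinary reporting questions.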

A note on agentic AI: if your organisation is moving into agentic deployments — systems that reason across multiple steps, use tools, retrieve information, and check their own work — the token cost profile changes significantly. A standard chatbot interaction might consume a few hundred tokens. An agentic workflow handling the same underlying task can consume tens of thousands. Model it separately. Budget it separately. The capability gain can be substantial, but the consumption profile is a different order of magnitude.


Optimising for the Mission

Back at that budget spreadsheet, the discipline was always the same. Know what you consume. Know what it costs. Know who is consuming it. And know what value it is generating.

Tokens deserve exactly that discipline. Not because they are a currency. Because they are a cost — the most visible signal of the underlying economics of your AI Factory.

The token line on the bill matters. But the executives asking “what is our token budget this year?” are asking the wrong question.

The right questions are these: Which AI-enabled outcomes matter for our business? What is our target cost per outcome? What mix of models, infrastructure, and data do we need to get there? And how do we measure value per outcome — not just tokens consumed?

Tokens are how you keep score in the background. Outcomes are why you are playing.

If your AI strategy stops at tokens, you are optimising for the meter, not the mission.

LLM Sizing 101 – Part 2: From Tokens Per Second to GPU Count

Flowchart illustrating LLM sizing concepts, featuring phases for prefill and decode, compute processes, throughput bridge, and metrics for tokens per second based on GPU count.

In Part 1 we established the two fundamentals: parameters define how big the model is, and tokens define how much work you’re asking it to do. Now we make it practical.

This post is about the bridge between those concepts and actual hardware — specifically, how you translate a customer’s real-world requirements (“we need to support 500 users”) into a GPU count you can put in a proposal.

The key metric that connects the two sides is tokens per second (TPS). To use it properly, you need to understand what’s actually happening inside the GPU when a model generates a response — because not all tokens are created equal.


Two Phases, Two Different Problems

When an LLM handles a request, it does so in two distinct phases. They look similar from the outside — text goes in, text comes out — but they have fundamentally different performance characteristics under the hood.

Phase 1: Prefill

This is where the model reads and processes the entire input prompt.

  • All the input tokens in your prompt are processed in parallel.
  • This phase is compute-intensive — the GPU is doing a lot of simultaneous maths.
  • It largely determines Time to First Token (TTFT): how long the user waits before they see any response at all.

Phase 2: Decode

This is where the model generates the response, one token at a time.

  • Each new token depends on the previous ones, so this phase is inherently sequential.
  • And here’s the critical insight for sizing: the decode phase is often not limited by the GPU’s raw FLOPS.
  • It’s limited by memory bandwidth — how fast the GPU can stream the model’s weights from high-bandwidth memory (HBM) to generate each token.

A quick note on FLOPS

You’ll see FLOPS quoted constantly in GPU spec sheets, so it’s worth understanding what it actually means — and where it does and doesn’t tell the full story.

FLOPS stands for Floating-Point Operations Per Second. It measures how much numerical computation a processor can perform per second. LLMs are essentially enormous stacks of matrix multiplications on floating-point numbers, so FLOPS is a natural unit for describing raw GPU compute power.

Vendors typically quote performance in:

  • TFLOPS (tera-FLOPS = 10¹²) or PFLOPS (peta-FLOPS = 10¹⁵)
  • Often broken down by precision: FP32 TFLOPS, FP16/BF16 TFLOPS, INT8 TOPS

So when you see “H100: X PFLOPS (FP16)”, that’s the peak theoretical compute at 16-bit precision — not what you’ll observe in a real LLM workload once memory access patterns, batching, and framework overhead come into play.

Here’s how FLOPS maps to the two inference phases:

  • Prefill is FLOPS-hungry. Processing all prompt tokens in parallel is a heavy matrix multiplication workload — this is where raw compute throughput matters most. Higher FLOPS directly improves prefill speed and reduces TTFT.
  • Decode is typically not FLOPS-bound. Generating tokens sequentially doesn’t saturate the GPU’s arithmetic units, especially at small batch sizes. The bottleneck shifts largely to memory bandwidth: how fast the GPU can stream model weights from HBM for each token generated.

This distinction matters enormously in practice: a GPU with impressive FLOPS but modest memory bandwidth can underperform for LLM inference compared to one with higher bandwidth, even if the spec sheet comparison looks favourable. It’s why memory bandwidth is often the first number to check when evaluating accelerators for inference workloads — and why the H100 SXM, with its multi-TB/s HBM3 bandwidth, consistently outperforms lower-bandwidth alternatives for decode-heavy deployments.
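
A back-of-envelope way to see the bandwidth limit: if every generated token requires streaming all the model weights from memory once, then memory bandwidth divided by weight size gives a ceiling on single-stream decode speed. The sketch below is a simplification under stated assumptions. It ignores KV cache reads, multi-GPU communication, and batching, and the ~3,350 GB/s figure is an assumed value for an H100 SXM that you should check against the current datasheet.

```python
def decode_tps_ceiling(params_billion: float, bytes_per_param: float,
                       mem_bandwidth_gb_s: float) -> float:
    """Upper bound on single-stream decode tokens/sec: bandwidth / weight bytes."""
    weight_gb = params_billion * bytes_per_param  # 1e9 params x bytes per param = GB
    return mem_bandwidth_gb_s / weight_gb


H100_SXM_BANDWIDTH_GB_S = 3350.0  # assumed figure, verify against the datasheet

for size in (7, 70):
    ceiling = decode_tps_ceiling(size, bytes_per_param=2.0,  # FP16 = 2 bytes
                                 mem_bandwidth_gb_s=H100_SXM_BANDWIDTH_GB_S)
    print(f"{size}B at FP16: at most ~{ceiling:.0f} tokens/sec for one stream")
```

This is also why batching matters so much in the next section: one pass over the weights can serve every request in the batch, so aggregate throughput can sit well above the single-stream ceiling.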


The Core Metric: Tokens Per Second

Tokens per second (TPS) is your fundamental unit of inference throughput. Everything in a sizing conversation eventually traces back to it.

There are two ways to look at TPS, and you need to keep them separate:

  • Per-user TPS — how fast tokens are delivered to a single user.
    • This drives the perceived experience.
    • Rough guide: below 10–15 tokens/sec starts to feel sluggish; above 30 tokens/sec it feels near-instant for most chat use cases.
  • System TPS — the total token output across all concurrent users.
    • This is what you’re actually sizing the hardware to sustain.

The relationship is simple in principle:

System TPS = Concurrent Users × Tokens per Second per User

In practice, batching is what makes this efficient:

  • Rather than serving each user’s request on dedicated GPU resources, a well-configured inference server groups multiple requests together and processes them as a single batch.
  • This significantly improves GPU utilisation — particularly during the memory-bandwidth-bound decode phase.
  • Batching is the primary mechanism that lets you serve many users from a relatively small GPU footprint.

Working Backwards: From Users to GPUs

Here’s the sizing workflow that turns a customer conversation into a hardware recommendation.

Step 1: Define the workload

Start with the usage-side discovery questions from Part 1:

  • How many concurrent users?
  • What’s the average prompt length (input tokens)?
  • What’s the expected response length (output tokens)?
  • What’s the acceptable latency — both time to first token (TTFT) and total response time?

A worked example

A customer wants to deploy an internal assistant. Together you define:

Parameter | Value
Concurrent users | 200
Average prompt | 500 tokens
Average response | 300 tokens
Target response time | ~10 seconds
Acceptable TTFT | < 2 seconds

Step 2: Calculate required system TPS

From the example:

  • 300 output tokens in 10 seconds = 30 tokens/sec per user
  • 200 users × 30 tokens/sec = 6,000 tokens/sec system throughput

So the platform needs to sustain ~6,000 TPS of decoded tokens under load.
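
The Step 2 arithmetic as a small function. The optional ttft_seconds parameter is an addition for illustration: the worked example spreads the 300 tokens over the full 10 seconds, but if you budget the time to first token inside that window, the decode phase has less time and the required TPS goes up.

```python
def required_system_tps(concurrent_users: int, output_tokens: int,
                        response_seconds: float, ttft_seconds: float = 0.0) -> float:
    """Decode tokens/sec the platform must sustain across all concurrent users."""
    decode_window = response_seconds - ttft_seconds
    if decode_window <= 0:
        raise ValueError("response time must be longer than TTFT")
    per_user_tps = output_tokens / decode_window
    return concurrent_users * per_user_tps


# Worked example from the text: 200 users, 300 output tokens, ~10 seconds.
baseline = required_system_tps(200, 300, 10)

# Same workload if a 2-second TTFT is counted inside the 10-second budget.
with_ttft = required_system_tps(200, 300, 10, ttft_seconds=2)

print(baseline, with_ttft)
```

Agree with the customer up front which definition of "response time" you are sizing to, because the difference here is 25% more throughput.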

Step 3: Establish per-GPU TPS for your chosen model

This is where model size and GPU choice meet. As a rough reference for inference at FP16 (actual figures vary with batch size, framework, and optimisation):

Model | GPU | Approx. TPS (decode, batched)
7B | H100 80GB | ~2,000–3,000
70B (tensor parallel, 4×) | 4× H100 80GB | ~800–1,200
70B (tensor parallel, 8×) | 8× H100 80GB | ~1,500–2,500

Note: these are illustrative ranges. Always validate against benchmark data for your specific model, serving framework, optimisation level (TensorRT-LLM, vLLM, etc.), and batch configuration.

Step 4: Calculate GPU or node count

Continuing the example, assume:

  • You choose a 70B model hosted on 4× H100 80GB nodes.
  • Based on benchmarks, you take a conservative estimate of 1,000 TPS per node (decode, batched).

Then:

  • 6,000 system TPS ÷ 1,000 TPS per node ≈ 6 nodes

Add a headroom buffer (typically 20–30% for burst traffic, uneven load, and future growth):

  • 6 nodes × 1.25 = 7.5, rounded up to 8 nodes as a starting recommendation.

At this point, you have a defensible answer to “how many GPUs/nodes do we need?” that’s grounded in user requirements, not just “bigger is better.”
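
Steps 2 to 4 reduce to a single calculation. This sketch uses the figures from the worked example; the function name and the default 25% headroom are illustrative choices, and the per-node TPS must come from your own benchmark data.

```python
import math


def nodes_required(system_tps: float, tps_per_node: float, headroom: float = 0.25) -> int:
    """Nodes needed to sustain system_tps, with a fractional headroom buffer."""
    base_nodes = system_tps / tps_per_node
    # Always round up: you cannot deploy a fraction of a node.
    return math.ceil(base_nodes * (1 + headroom))


nodes = nodes_required(system_tps=6_000, tps_per_node=1_000, headroom=0.25)
gpus = nodes * 4  # 4x H100 per node in the worked example
print(nodes, gpus)
```

Changing the headroom to 20% or 30% gives the same 8 nodes here, which is a useful sanity check that the recommendation is not sensitive to the exact buffer chosen.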


Reference Sizing: Two Common Scenarios

The worked example above walks through the methodology. The table below applies it to two reference architectures you’ll encounter regularly — a 7B internal assistant and a 70B RAG system — to give you a practical feel for how the numbers land.

Figures assume FP16 or INT8 precision, batched inference, and a well-optimised serving framework such as TensorRT-LLM or vLLM. Treat these as directional reference points, not guaranteed benchmarks — validate against your specific model, configuration, and workload before quoting.

Aspect | Ref A: 7B Internal Assistant | Ref B: 70B RAG System
Typical use case | Employee Q&A, productivity assistant | Legal/finance/engineering RAG over proprietary data
Model size | 7B | 70B
Quality vs cost | “Good enough” quality, cost-optimised | Higher quality, domain-heavy reasoning
Concurrency (peak) | ~500 users | ~100 users
Avg prompt (input) | ~400 tokens | ~2,000 tokens (incl. retrieved context)
Avg response (output) | ~250 tokens | ~500 tokens
Latency target | 8–10 s total, TTFT < 2 s | 12–15 s total, TTFT < 3 s
System TPS target | 12,500 TPS (decode) | 3,300 TPS (decode)
Precision (typical) | INT8 / mixed (weights) | FP16 / mixed, selective quantisation
GPUs per node (typical) | 3–4 GPUs per PowerEdge node | 8 GPUs per PowerEdge XE-class node
Nodes (illustrative) | 3–4 nodes (total ~12 GPUs, incl. headroom) | 3 nodes (24 GPUs total, incl. headroom)
Interconnect focus | Good PCIe + 25–100 GbE | NVLink/NVSwitch + 100–400 Gb fabric
Workload pattern | High concurrency, chat-like | Lower concurrency, long prompts, RAG + heavier reasoning
Sizing conversation hook | “Maximise users per GPU, acceptable quality” | “Maximise quality on key workflows, moderate concurrency”

Notice the counterintuitive result: the smaller 7B model actually demands nearly four times the system throughput of the 70B RAG system (≈12,500 TPS vs ≈3,300 TPS). That’s not a contradiction; it’s the concurrency effect. Serving 500 chat users simultaneously generates far more aggregate token output than 100 users running deep reasoning queries, even though each individual 7B response is shorter. A bigger model doesn’t always mean a bigger throughput requirement, although in this comparison the 70B system still needs roughly twice the GPUs to hit its smaller target. Workload pattern matters just as much as parameter count.

A few additional assumptions behind these figures worth keeping front of mind:

  • No fine-tuning overhead — these are inference-only configurations. If the customer plans on-premises fine-tuning, GPU and memory requirements increase substantially.
  • Steady-state load — the node counts include a 20–30% headroom buffer but assume reasonably predictable peak concurrency. Highly bursty workloads (e.g. end-of-day batch spikes) may warrant additional headroom or an autoscaling strategy.
  • Single-tenant deployment — figures assume dedicated GPU resources per workload. Multi-model or multi-tenant deployments require separate sizing treatment.
  • Retrieved context included — the 70B RAG prompt size of ~2,000 tokens already includes retrieved document chunks. If retrieval quality improves and chunk sizes grow, prompt tokens — and therefore TTFT — will increase accordingly.

The Trade-offs Every Customer Faces

Once you’ve run this exercise with a customer, three trade-off conversations typically follow.

1. Model quality vs. throughput

  • A 70B model usually produces higher-quality outputs than a 7B.
  • But it also serves far fewer users per GPU.
  • For some use cases — summarising legal documents, writing complex code, specialised reasoning — the quality premium is worth it.
  • For a high-volume customer service assistant, a well-tuned 7B model might deliver better economics with acceptable quality.

2. Latency vs. concurrency

  • Larger batch sizes improve GPU utilisation and system throughput, but they increase the time an individual request spends waiting to join a batch.
  • If TTFT is critical (live chat, voice interfaces), you’ll accept lower utilisation to keep batches small and responsive.
  • If the application is asynchronous (batch document processing, offline analytics), you can run large batches, push utilisation higher, and drive down cost per request.

3. Precision vs. memory footprint

  • Running a model at FP16 gives you full quality but also the full VRAM cost.
  • Quantising to INT8 or INT4 roughly halves or quarters the memory footprint, allowing either a larger model to fit in the same GPUs, or the same model to fit in fewer GPUs.
  • There is a quality trade-off, but for many inference workloads, well-done INT8 quantisation offers an excellent quality-to-cost ratio and is worth including in the conversation.
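
The "halves or quarters" point is simple arithmetic on bytes per parameter. The sketch below covers model weights only; KV cache, activations, and framework overhead need memory on top, so treat the result as a floor and not a sizing answer.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Approximate memory for model weights alone, in GB."""
    return params_billion * BYTES_PER_PARAM[precision]


for precision in ("FP16", "INT8", "INT4"):
    print(f"70B at {precision}: ~{weight_memory_gb(70, precision):.0f} GB of weights")
```

For a 70B model that is roughly 140 GB, 70 GB and 35 GB respectively, which is the difference between needing several 80 GB GPUs just to hold the weights and fitting them on one or two.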

What This Means for Platform Selection

By this point in a customer conversation, you have enough to make an informed platform recommendation.

Single-node, lower concurrency, 7B–13B models

A PowerEdge server with 2–4 high-memory GPUs will typically cover the requirement, with room to scale up or out as usage grows.

Multi-node, higher concurrency, or 70B+ models

You’re looking at GPU-dense platforms where high-speed interconnect between GPUs (NVLink, NVSwitch) and network fabric between nodes become as important as raw GPU count. These directly affect prefill and decode performance, and therefore both latency and throughput.

Mixed workloads (inference + fine-tuning)

Fine-tuning demands significantly more memory per GPU than inference alone (optimizer states, gradient storage, larger activations). If a customer plans both, size for fine-tuning; the inference requirement is typically covered as a result.

The specific Dell platform mapping — PowerEdge XE series, GPU configurations, and interconnect options — is what we’ll build out in Part 3.


Next up: Part 3 — Platform and GPU selection: mapping your sizing to Dell PowerEdge XE configurations. We’ll take the TPS-based sizing approach from this post and show how it translates into concrete server configs you can quote.


LLM Sizing 101 – Part 1: Tokens and Parameters

Infographic explaining tokens and parameters in large language models. It includes definitions, examples, and a chart that illustrates the sizing problem related to tokens and parameters.

Every week, another organisation announces it’s deploying a large language model. And every week, a Technical Architect or Pre-sales Engineer gets asked a version of the same question: “How much infrastructure do I actually need for this?”

In my days as a Data Centre Architect and Engineer, I’d size server clusters for databases and VMware environments. The maths was different, but the discipline was the same: understand the workload, match it to the hardware, justify the recommendation. Now the question I get asked is “How do I size for LLMs?” This blog series is all about answering that.

Before you can answer that — before GPUs, nodes, interconnects, or platform choices even enter the conversation — you need two concepts nailed down cold: tokens and parameters. They’re the two dials that drive every LLM server sizing decision you’ll ever make.

Think of it this way. Parameters tell you how big the engine is. Tokens tell you how hard you’re asking it to work. Get those two right, and the rest of the sizing conversation falls into place.


Tokens: The Currency of Language Models

LLMs don’t read sentences the way you do. They don’t even read words. They read tokens — small chunks of text that sit somewhere between a syllable and a word.

  • Sometimes a token is a whole word: server
  • Sometimes it’s a fragment: serv, er
  • Sometimes it’s punctuation or whitespace: a full stop, a comma, or a space

For English text, a useful rule of thumb is:

1 token ≈ 3–4 characters, or roughly 0.75 of a word

So the sentence “This is a sizing test.” runs to about 6–7 tokens — not 5, because the model doesn’t count words.
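As a quick sanity check, the rule of thumb can be sketched in a few lines of Python. The function name `estimate_tokens` is hypothetical, and this is only an approximation based on the two ratios above. A real tokenizer will give different counts depending on the model.

```python
import math


def estimate_tokens(text: str) -> tuple[int, int]:
    """Return a (low, high) token estimate for English text.

    Applies the two rules of thumb: ~4 characters per token
    and ~0.75 words per token. Real tokenizers will differ.
    """
    by_chars = math.ceil(len(text) / 4)
    by_words = math.ceil(len(text.split()) / 0.75)
    return min(by_chars, by_words), max(by_chars, by_words)


low, high = estimate_tokens("This is a sizing test.")
print(f"Estimated tokens: {low}-{high}")  # prints "Estimated tokens: 6-7"
```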

When you see pricing or performance metrics quoted in the market, they’re almost always denominated in tokens:

  • $X per 1,000 tokens
  • Y tokens per second
  • 4k / 8k / 32k / 128k context window

That last one matters a lot. The context window is the maximum amount of text — measured in tokens — the model can hold in view at once. It’s not just the question you asked; it includes everything: system instructions, conversation history, documents you’ve fed in, and the response being generated. Every token in that window costs compute and memory.

Why Tokens Drive Sizing

Tokens show up in three places in every sizing conversation:

1. Context length (the prompt window): Longer context means the model has to track more information simultaneously. That translates directly into more VRAM for the KV cache, the memory structure the model uses to keep track of what it’s already processed. A customer who wants 128k-token context windows needs significantly more memory per request than one running at 4k.

2. Throughput and concurrency: Tokens per second is the fundamental throughput metric, whether per GPU, per node, or per cluster. In practice, you’ll often work backwards from a customer’s requirements:

“We need to support 500 concurrent users, each generating responses of around 300 tokens, within 3 seconds.”

That’s a tokens-per-second and concurrency problem. Everything else follows from it.

3. Capacity and cost planning: Whether on-premises or cloud, consumption is effectively input tokens + output tokens. On a Dell PowerEdge server deployment, higher sustained tokens per second means more GPU compute, more memory bandwidth, and, beyond a certain point, more nodes or a move to higher-end accelerators.
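The example requirement in point 2 can be turned into a first-order number with simple arithmetic. This is a minimal sketch with a hypothetical function name. It assumes the worst case where all 500 users are generating at the same moment, and it ignores prompt processing (prefill) time.

```python
def required_tokens_per_second(concurrent_users: int,
                               tokens_per_response: int,
                               target_seconds: float) -> float:
    """Aggregate output tokens per second needed to meet the latency target."""
    return concurrent_users * tokens_per_response / target_seconds


aggregate = required_tokens_per_second(500, 300, 3.0)
per_user = aggregate / 500

print(f"Aggregate: {aggregate:,.0f} tokens/s")   # 50,000 tokens/s
print(f"Per user:  {per_user:,.0f} tokens/s")    # 100 tokens/s
```

Real traffic is rarely fully simultaneous, so the sustained requirement is often lower than this peak figure. The peak is still a useful upper bound to start the conversation from.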


Parameters: The Size of the Brain

If tokens are the currency of language models, parameters are what you’re buying with your hardware budget.

A parameter is a learned numeric weight — a floating-point number — stored inside the model. Mathematically, an LLM is an enormous function, and parameters are the numbers that define it. When a model is trained, billions of these weights are adjusted, incrementally, until the model gets reliably good at predicting language.

This is why model names look the way they do:

  • 7B → approximately 7 billion parameters
  • 13B, 34B, 70B, 405B → and so on up the scale

More parameters generally mean greater model capability — the model can represent more complex patterns, handle more nuanced reasoning, and produce higher-quality output. But that capability comes at a direct hardware cost, because every parameter has to live somewhere.

The VRAM Equation

The first-order estimate for model memory is straightforward:

Model VRAM (GB) ≈ Parameters (in billions) × Bytes per Parameter

In practice:

Model | Precision | Weights-only VRAM
7B | FP16/BF16 (2 bytes) | ~14 GB
70B | FP16/BF16 (2 bytes) | ~140 GB

That’s just for the weights themselves. In a real deployment you also need memory for:

  • KV cache — grows with context length and batch size
  • Activation memory — the working memory during computation
  • Optimizer states — relevant if you’re fine-tuning, not just inferencing
  • Runtime overhead — fragmentation, safety layers, serving framework
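As a rough sketch of the weights-only equation, the snippet below applies it across a few common precisions. The bytes-per-parameter values are the standard storage sizes for each format. The dictionary and function names are illustrative, and none of the overheads listed above are included.

```python
# Storage size per parameter for common precisions (bytes).
BYTES_PER_PARAM = {"FP32": 4.0, "FP16/BF16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weights_vram_gb(params_billions: float, precision: str) -> float:
    """Weights-only VRAM in GB: billions of parameters x bytes per parameter."""
    return params_billions * BYTES_PER_PARAM[precision]


for model in (7, 70):
    for precision in BYTES_PER_PARAM:
        gb = weights_vram_gb(model, precision)
        print(f"{model}B @ {precision}: ~{gb:.1f} GB")
```

The quantized rows show why precision is one of the first discovery questions: a 70B model at 4-bit needs roughly a quarter of the weight memory it needs at FP16.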

The practical consequence is clear:

  • A 7B model can typically run on a single high-memory GPU (24–80 GB class, depending on precision and context requirements).
  • A 70B model generally needs multiple high-VRAM GPUs — and demands fast interconnects between them, whether that’s NVLink, NVSwitch, or a high-bandwidth PCIe fabric.
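A minimal sketch of how those two bullets fall out of the maths. The `weights_fraction` of 0.7 is purely an assumption for illustration, reserving about 30% of each GPU for KV cache, activations, and runtime overhead. The right figure depends on context length and concurrency.

```python
import math


def min_gpus_for_weights(params_billions: float,
                         bytes_per_param: float,
                         gpu_vram_gb: float,
                         weights_fraction: float = 0.7) -> int:
    """Smallest GPU count whose usable memory holds the model weights.

    weights_fraction is an assumed share of each GPU's VRAM that can
    be given to weights; the rest is left for KV cache and overhead.
    """
    weights_gb = params_billions * bytes_per_param
    usable_per_gpu = gpu_vram_gb * weights_fraction
    return math.ceil(weights_gb / usable_per_gpu)


print(min_gpus_for_weights(7, 2, gpu_vram_gb=24))    # 7B FP16 on a 24 GB GPU
print(min_gpus_for_weights(70, 2, gpu_vram_gb=80))   # 70B FP16 on 80 GB GPUs
```

In practice the count is often rounded up to 2, 4, or 8 GPUs so the model shards evenly, so treat the result as a floor rather than a recommendation.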

As a pre-sales engineer, parameter count is what you’ll map to platform choices: how many GPUs per node, whether you need a GPU-dense platform with a high-speed fabric, and whether the workload fits in a 2U form factor or needs something more substantial.


How the Two Interact

Here’s the mental model worth keeping front of mind for every sizing conversation:

Dial | What it measures | What it drives
Parameters | How big the model is | VRAM requirement, compute per token, hardware footprint
Tokens | How much work you’re asking it to do | Throughput, latency, concurrency, context memory

Given a fixed GPU budget, customers are always navigating a trade-off:

  • Bigger model (more parameters) versus more throughput (more tokens per second)
  • Longer context windows (more tokens per request) versus more concurrent requests

There’s no universally right answer — it depends on the use case. A legal document analysis platform that processes 100k-token contracts needs a very different configuration from a customer service chatbot handling hundreds of short, concurrent sessions.
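The context-versus-concurrency trade-off can be sketched with the standard KV cache estimate: two tensors (keys and values) per layer, per attention head, per token. The architecture numbers below (32 layers, 32 KV heads, head dimension 128) and the 40 GiB cache budget are hypothetical values for a 7B-class model with full multi-head attention. Substitute the real figures for the model being sized.

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """KV cache bytes per token; the factor of 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def max_concurrent_requests(kv_budget_gib: float, context_tokens: int,
                            per_token_bytes: int) -> int:
    """Requests that fit in the budget if each one fills its whole context."""
    per_request_bytes = context_tokens * per_token_bytes
    return int(kv_budget_gib * 1024**3 // per_request_bytes)


# Hypothetical 7B-class architecture, FP16 cache.
per_token = kv_cache_bytes_per_token(layers=32, kv_heads=32, head_dim=128)

for context in (4096, 32768, 131072):
    fit = max_concurrent_requests(40, context, per_token)
    print(f"{context:>7} tokens of context: {fit} concurrent requests in 40 GiB")
```

With these assumed numbers the same memory budget holds 20 requests at 4k context, 2 at 32k, and none at 128k. Models that use grouped-query attention have fewer KV heads and a correspondingly smaller cache.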


Turning This Into a Sizing Conversation

When you strip away the jargon, most customer LLM questions reduce to this:

“Given a model size (parameters) and an expected usage pattern (tokens), how many GPUs and servers do I need to hit my latency and concurrency targets?”

The discovery questions that unlock that answer fall into two groups:

Model side:

  • Are you targeting a 7B, 13B, 70B, or larger class model?
  • Are you planning full precision, mixed precision, or quantized deployment?
  • Is this inference only, or do you also plan fine-tuning?

Usage side:

  • What’s the average prompt size per request (in tokens)?
  • What’s the expected response length?
  • What’s the maximum context length required — 8k, 32k, 128k?
  • How many concurrent users do you need to support, and at what latency?
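One way to keep the discovery answers together is a small record like the sketch below. The class and field names are hypothetical, and the example values are made up for illustration. The only logic included is a check that prompt plus response fits inside the requested context window.

```python
from dataclasses import dataclass


@dataclass
class SizingInputs:
    # Model side
    params_billions: float
    bytes_per_param: float      # 2 for FP16/BF16, lower if quantized
    fine_tuning: bool
    # Usage side
    avg_prompt_tokens: int
    avg_response_tokens: int
    max_context_tokens: int
    concurrent_users: int
    target_latency_s: float

    def tokens_per_request(self) -> int:
        """Prompt and response both count against the context window."""
        return self.avg_prompt_tokens + self.avg_response_tokens

    def fits_context(self) -> bool:
        return self.tokens_per_request() <= self.max_context_tokens


# Illustrative answers from a discovery call (made-up values).
inputs = SizingInputs(70, 2, False, 1500, 300, 8192, 500, 3.0)
print(inputs.tokens_per_request(), inputs.fits_context())
```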

Once you have those answers, you can map them to Dell platforms — GB10, GB300, PowerEdge XE and XE+ GPU servers, interconnect choices, cluster configurations — in a structured and defensible way. That’s exactly what we’ll build up in the posts that follow.


Next up: Part 2 — from tokens per second to GPU count: the maths that drive inference sizing.