Ethics in AI: Part 2

Infographic on 'Ethics in AI' highlighting various types of biases including historical, representation, measurement, and evaluation biases, along with the importance of designing fairness in AI systems. Features a processing node for AI and labels for input and output stages.

Bias and Fairness

Here’s the uncomfortable truth about AI bias. It rarely announces itself. There’s no error message, no warning light, no moment where the system admits it has a problem. The model just does what it was trained to do — faithfully, at scale, and often with complete confidence.

Because bias in AI isn’t primarily about algorithmic malfunction. It’s about inheritance. AI systems learn from data that reflects the world as it was, not necessarily the world as it should be. And if that data carries the fingerprints of historical inequality, underrepresentation, or flawed measurement — the model doesn’t question it. It learns it.

The refinery analogy holds here too, and it holds uncomfortably well. Contaminated feedstock doesn’t trigger an alarm at the intake valve. It moves through the system, gets processed with everything else, and comes out the other end as contaminated output — refined, packaged, and delivered with the same confidence as everything around it. The pipeline worked perfectly. That’s the problem.

So where does the contamination get in? There are four entry points worth understanding.


Historical Bias is perhaps the most insidious. It occurs when training data reflects pre-existing inequality — when past decisions were themselves discriminatory, and those decisions become the ground truth a model learns from. A hiring algorithm trained on a decade of appointment records from a male-dominated industry doesn’t invent a preference for male candidates. It learns one from the data. The discrimination was already there. The model just scaled it.

Representation Bias arises when certain groups are underrepresented in training data. A facial recognition system trained predominantly on lighter-skinned faces will perform less accurately on darker-skinned ones — not because of any deliberate design choice, but because the training set didn’t reflect the full diversity of the population it would eventually serve. Gaps in the data become gaps in performance. And those gaps tend to fall hardest on the people already least served by existing systems.

Measurement Bias is subtler still. It emerges when the features used to train a model capture some populations less accurately than others. Predictive health models built on clinical data may underperform for groups historically less likely to access formal healthcare — not because the model is poorly designed, but because the measurement itself was uneven at source. The data records what was measured. It can’t record what wasn’t.

Evaluation Bias occurs when the test data used to validate a model doesn’t reflect the diversity of the real world it will operate in. A model can pass every benchmark with flying colours and still fail badly for populations underrepresented in the evaluation set. If you don’t test for it, you won’t find it — and you won’t find it until it’s already causing harm in deployment.

Four different entry points. The same result: a model that has learned the wrong lessons, at scale, and has no idea.
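One practical way to catch representation and evaluation bias is to stop reporting a single headline accuracy figure and break it down by group. The sketch below is a minimal illustration, not a full evaluation framework: the group names, the records and the `accuracy_by_group` helper are all hypothetical.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """Return overall accuracy and accuracy per group.

    Each record is a (group, true_label, predicted_label) tuple.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, y_true, y_pred in records:
        totals[group] += 1
        correct[group] += int(y_true == y_pred)
    overall = sum(correct.values()) / sum(totals.values())
    per_group = {g: correct[g] / totals[g] for g in totals}
    return overall, per_group

# Hypothetical evaluation set: group "A" dominates, group "B" is underrepresented.
records = (
    [("A", 1, 1)] * 45 + [("A", 0, 0)] * 45   # 90 records, all predicted correctly
    + [("B", 1, 1)] * 4 + [("B", 1, 0)] * 6   # 10 records, only 4 predicted correctly
)

overall, per_group = accuracy_by_group(records)
print(f"overall: {overall:.2f}")              # overall: 0.94
for group, acc in sorted(per_group.items()):
    print(f"group {group}: {acc:.2f}")        # group A: 1.00, group B: 0.40
```

The headline number looks healthy at 94%, while the smaller group is served correctly only 40% of the time. The aggregate hides exactly the gap that matters.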


Consider a real example. From 2014, Amazon developed an experimental CV screening tool to help filter job applicants. It was trained on roughly a decade of CVs submitted to the company, most of them from men. The problem was that the data reflected a male-dominated industry. The model didn’t know that. It just learned what a successful candidate looked like based on the evidence it was given. Over time it began penalising CVs that included words like “women’s” (as in women’s chess club or women’s rugby team) and downgrading graduates of all-women’s colleges. The model wasn’t told to discriminate against women. It learned to. From data that already did. Amazon abandoned the tool, and the story became public when Reuters reported it in 2018. But the lesson endures: a model trained on biased history will faithfully reproduce that history, at scale, until someone intervenes.

Which brings us to fairness — and here’s where it gets genuinely hard.

Fairness sounds like a simple goal. It isn’t. Multiple formal definitions of fairness exist — statistical parity, equalised error rates, individual fairness, counterfactual fairness — and they are mathematically distinct. More importantly, they frequently conflict. Satisfying one fairness criterion can, and often does, violate another.

A model that achieves equal error rates across demographic groups may not achieve equal outcomes. A model that achieves equal outcomes may not treat similar individuals similarly. There is no single definition of fairness that satisfies all criteria simultaneously — and that means choosing one is not a technical decision. It’s a values decision.
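A small worked example makes the conflict concrete. The sketch below uses made-up numbers: two groups of ten people with different underlying rates of the positive label. It compares selection rate (the quantity statistical parity cares about) with false positive and false negative rates (the quantities equalised error rates care about). The `rates` helper and the data are illustrative assumptions.

```python
def rates(pairs):
    """pairs: list of (true_label, predicted_label) with 0/1 values."""
    n = len(pairs)
    positives = sum(t for t, _ in pairs)
    negatives = n - positives
    false_pos = sum(1 for t, p in pairs if t == 0 and p == 1)
    false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "selection_rate": sum(p for _, p in pairs) / n,
        "fpr": false_pos / negatives if negatives else 0.0,
        "fnr": false_neg / positives if positives else 0.0,
    }

# Hypothetical data: 6 of 10 in group A are true positives, 2 of 10 in group B.
# Scenario 1: a classifier that predicts every label correctly.
group_a = [(1, 1)] * 6 + [(0, 0)] * 4
group_b = [(1, 1)] * 2 + [(0, 0)] * 8

# Scenario 2: group B predictions adjusted so its selection rate matches group A.
group_b_parity = [(1, 1)] * 2 + [(0, 1)] * 4 + [(0, 0)] * 4

for name, data in [("A", group_a), ("B", group_b), ("B (parity)", group_b_parity)]:
    print(name, rates(data))
```

In the first scenario the error rates are equal (both zero) but the selection rates are 0.6 and 0.2. In the second, the selection rates match at 0.6, but group B now carries a false positive rate of 0.5 while group A stays at zero. With different base rates, you cannot have both at once.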

This is the point most AI ethics conversations skirt around. Technical mitigations exist — reweighting training data, adjusting loss functions, post-processing predictions. They matter. But they cannot resolve the underlying question of what fairness actually means in a given context, for a given population, with given consequences. That question requires human judgement, institutional accountability, and explicit deliberation — not optimisation.
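To give a flavour of what reweighting training data can look like, here is a minimal sketch of one commonly described scheme: each (group, label) combination gets a weight equal to the frequency you would expect if group and label were independent, divided by the frequency actually observed. The hiring numbers and function names are hypothetical, and a real project would more likely lean on an established fairness library than hand-rolled code.

```python
from collections import Counter

def reweighing_weights(samples):
    """samples: list of (group, label). Returns a weight per (group, label) pair."""
    n = len(samples)
    group_counts = Counter(g for g, _ in samples)
    label_counts = Counter(y for _, y in samples)
    pair_counts = Counter(samples)
    return {
        (g, y): (group_counts[g] * label_counts[y]) / (n * pair_counts[(g, y)])
        for (g, y) in pair_counts
    }

def weighted_hire_rate(samples, weights, group):
    """Share of a group's total weight that sits on the positive label."""
    hired = sum(weights[(g, y)] for g, y in samples if g == group and y == 1)
    total = sum(weights[(g, y)] for g, y in samples if g == group)
    return hired / total

# Hypothetical history: 40 of 60 men hired, 10 of 40 women hired.
samples = [("M", 1)] * 40 + [("M", 0)] * 20 + [("W", 1)] * 10 + [("W", 0)] * 30

weights = reweighing_weights(samples)
for group in ("M", "W"):
    print(group, f"{weighted_hire_rate(samples, weights, group):.2f}")  # 0.50 for both
```

The raw hire rates are 0.67 and 0.25; after weighting, both groups sit at 0.50. Note what the code did not do: it did not decide that equal selection rates were the right target. That choice was made before the first line was written.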

And without that deliberate specification, optimisation processes will do exactly what they were designed to do: maximise predictive accuracy. Not fairness. Accuracy.

Fairness has to be designed in. Because a system left to optimise on its own will not choose it.


Next: Part 3 — Privacy and Consent.

Ethics in AI: Part 1

Diagram illustrating ethical considerations in AI, focusing on processing node failures and their impact on model accuracy and bias.

The Question That Haunted Me

“But doesn’t AI get it wrong?”

It was a fair question. My answer wasn’t.

At the time I was deep in the AI hype — drinking the Kool-Aid, evangelising the technology, convinced the outputs spoke for themselves. So when someone pushed back and asked whether AI could really be trusted, I defended it. Brushed past the concern. Missed the moment entirely.

What I should have done was lean in. Agreed. Had the honest conversation about bias, about missing and unrepresentative training data, about what it actually means when AI gets it wrong at scale — not just technically wrong, but wrong in ways that shape real people’s access to resources, opportunities, and fair treatment. Wrong in ways that can ruin lives.

That question has haunted me for over three years.

Because “AI getting it wrong” isn’t one problem. It’s two very different ones, and treating them as the same is its own kind of mistake.

There are hallucinations — the model confidently generating plausible but factually wrong outputs. A technical failure. The model doesn’t know what it doesn’t know.

And there is bias — the model learning and amplifying the prejudice, inequality, and exclusion already baked into its training data. Not a malfunction. The model working exactly as designed, just on contaminated inputs. Faithfully. At scale.

The person asking that question deserved an answer that acknowledged both. Instead they got a defence of the technology.

Three years later, older and wiser, I’m finally leaning in.

AI doesn’t operate in a vacuum. It operates inside the same social structures, institutions, and power dynamics that have always shaped who gets access and who gets excluded. And because AI learns from data — data generated by humans, in a world with a complicated history — it inherits everything recorded in that data. The assumptions. The gaps. The inequalities baked in long before any algorithm touched it.

That’s why keeping AI ethical isn’t optional, and it isn’t a feature you bolt on at the end. It has to be built in by design — from the data up.

This series is about what that actually means.


Next: Part 2 — Bias and Fairness.

From Programming by Rules to Learning from Data

For most of software’s history, the intelligence was in the code. Today, it’s in the data. That shift changes everything — especially what you need to invest in.

Infographic comparing traditional programming and machine learning. On the left, traditional programming is depicted with a flowchart showing 'if/then' statements leading to a predetermined result. On the right, machine learning is illustrated with a data funnel leading to a network discovering patterns, highlighting the shift in the programmer's role from coding to curating data.

I had one of those shower moments this morning. You know the ones — your brain wanders somewhere unexpected and suddenly you’re solving a problem you weren’t trying to solve.

I was thinking about the time I taught my son to code a robot we’d built together. The code was beautifully simple. Turn left. Move three steps. Turn right. If you hit a wall, stop. Pure logic. Pure rules. We wrote the instructions, the robot followed them, and when it did something wrong we went back and fixed the rule.

And then I thought: when I’m a grandad — no rush — and I’m sitting down with a grandchild to do the same thing, the conversation is going to look completely different. We won’t be writing turn-left-move-three-steps rules. We’ll be feeding the robot data. We’ll be talking about what it sees, the patterns it learns from, how it gets better not because we updated the instructions but because we gave it more examples to learn from. Computer vision. Convolutional neural networks. A robot that figures out the world rather than following a script we wrote for it.

Same robot. Completely different philosophy. And somewhere between teaching my son and the future grandkids, software itself made that same journey.

For most of computing’s history, we built systems by encoding our understanding of the world directly into logic. If this, then that. If the balance is below zero, deny the transaction. If the email contains “free money” and ten exclamation marks, mark it as spam. Engineers wrote the rules, shipped the code, and the system behaved exactly as specified. The intelligence lived in the logic.

That model hasn’t disappeared — but in the domains that matter most today, it’s no longer the whole story. It’s all about data now. Patterns in the data. And understanding that shift changes how you think about what you need to invest in.


“Software is eating the world,” Marc Andreessen told us in 2011. He was right. What he didn’t mention was that software itself would eventually be powered by data. Follow that thought to its conclusion and the most important infrastructure in your organisation isn’t your application stack. It’s your data platform. And data is the power source.

When Rules Stop Being Enough

Rules-based systems are genuinely good at what they do. They’re predictable. They’re auditable. If something goes wrong, you can usually point at the line of code that caused it. For stable, well-understood processes — tax calculations, eligibility checks, simple approvals — they’re entirely fit for purpose.

The trouble starts when the problem gets messy.

Take fraud detection. You start sensibly: flag transactions above a certain amount from high-risk locations. Block IPs on a denylist. Limit transactions per minute. Clean, logical, explainable.
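Those first three rules might look something like the sketch below. Every threshold, country code and IP address here is a hypothetical placeholder, and the `Transaction` shape is an assumption made for illustration.

```python
from dataclasses import dataclass

# All thresholds and lists are hypothetical placeholders.
AMOUNT_LIMIT = 5_000.00
HIGH_RISK_COUNTRIES = {"XX", "YY"}
IP_DENYLIST = {"203.0.113.7"}
MAX_TX_PER_MINUTE = 5

@dataclass
class Transaction:
    amount: float
    country: str
    ip: str
    tx_in_last_minute: int

def fraud_flags(tx: Transaction) -> list[str]:
    """Apply each hand-written rule and return the reasons a transaction was flagged."""
    flags = []
    if tx.amount > AMOUNT_LIMIT and tx.country in HIGH_RISK_COUNTRIES:
        flags.append("large amount from high-risk location")
    if tx.ip in IP_DENYLIST:
        flags.append("denylisted IP")
    if tx.tx_in_last_minute > MAX_TX_PER_MINUTE:
        flags.append("velocity limit exceeded")
    return flags

print(fraud_flags(Transaction(9_000, "XX", "198.51.100.1", 1)))
# ['large amount from high-risk location']
```

Every behaviour is traceable to a line you can point at. That is the strength of the approach, and it holds for as long as three rules are enough.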

Then the fraudsters adapt. New attack vectors. New geographies. New patterns you didn’t anticipate. So you add more rules. Then exceptions to those rules. Then special handling for VIPs. Then manual overrides for partners. Before long, you’ve got thousands of conditions, constant firefighting, and a system that’s simultaneously brittle and impossible to fully understand — despite being built entirely from logic you wrote yourself.

At some point you hit rule sprawl, and it doesn’t end well.


The Shift: From Code as Truth to Data as Truth

Machine learning doesn’t try to specify the decision logic. Instead, it learns it — directly from examples.

Feed a model enough confirmed fraud cases alongside confirmed legitimate transactions. Give it the signals: transaction history, device fingerprint, location patterns, time of day, merchant data. Let it find the patterns. Then, when the fraud landscape shifts, you don’t sit down and rewrite hundreds of rules. You gather new examples, update the signals, retrain, and redeploy.
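A deliberately tiny sketch of the same idea: instead of hard-coding an amount threshold, let labelled examples choose it, and let new examples move it. This uses a single signal and a one-line "model" purely for illustration; a real fraud system would use many signals and a proper machine learning library. The data and the `fit_threshold` helper are hypothetical.

```python
def fit_threshold(examples):
    """Pick the amount threshold that misclassifies the fewest labelled examples.

    examples: list of (amount, is_fraud). Predict fraud when amount >= threshold.
    """
    candidates = sorted({amount for amount, _ in examples})
    best_threshold, best_errors = None, len(examples) + 1
    for threshold in candidates:
        errors = sum((amount >= threshold) != is_fraud for amount, is_fraud in examples)
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

# Hypothetical confirmed cases: fraud shows up as large transactions.
initial = [(20, False), (35, False), (60, False),
           (900, True), (1200, True), (1500, True)]
print(fit_threshold(initial))   # 900

# The fraudsters adapt and move to smaller amounts. No rule is rewritten:
# new labelled examples are gathered and the same code is run again.
shifted = [(20, False), (35, False), (60, False),
           (150, True), (180, True), (220, True), (900, True)]
print(fit_threshold(shifted))   # 150
```

The code did not change between the two runs. Only the data did, and the behaviour followed it.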

This is a fundamental inversion of where the intelligence lives:

  • In a rules-based system, code is the truth and data is just something to test against it.
  • In a machine learning system, data is the truth and code is the plumbing that carries it.

The code still matters enormously — it defines how data flows, how features are built, how models are trained and served. But the behaviour you see in production is now overwhelmingly a function of which data you chose, how you cleaned and joined it, how frequently you refresh it, and how well you’ve engineered the signals from it.

The same model architecture, trained on different data, can behave like a completely different product.


Output Is a Function of Input

This is the point that gets lost in conversations about AI.

Organisations invest heavily in models. They debate architectures, benchmark performance, evaluate vendors. All of that matters. But if the data flowing into those models is incomplete, inconsistent, biased or stale, no amount of model sophistication will save you.

As I covered in Garbage In, Expensive Garbage Out, the dangerous thing about modern AI isn’t that it fails obviously when the data is bad. It’s that it doesn’t. It learns whatever patterns you give it, optimises confidently for whatever labels you’ve defined, and delivers outputs at scale — even when those outputs are wrong.

The refinery metaphor runs true here. You can have the most sophisticated downstream process in the world. If contaminated feedstock is getting through the early stages, it doesn’t matter how good the refining is — what comes out the other end is still wrong. Processed wrong. Delivered at scale, with complete confidence, in entirely the wrong direction.

Output is a direct function of input. That’s not a caveat. It’s the whole game.


Data Engineering: The Refinery That Makes AI Possible

This is why data engineering has moved from the back office to the front line.

When your AI systems run on data rather than rules, the infrastructure that produces, transforms, governs and delivers that data isn’t supporting the product. It is the product — or at least, it’s what makes the product possible.

Think back to the refinery. Raw crude oil has no value in your car’s engine. It needs to go through a series of deliberate transformation stages — each one removing impurities, each one producing something more usable — before it becomes fuel you can rely on. Data works the same way.

Raw operational data, logs, clickstreams, sensor readings — these are the crude oil. Valuable in potential, useless in practice. To become model-ready, they need to flow through robust pipelines: ingested reliably, cleaned and validated, standardised across sources, transformed into features that actually capture signal, and governed throughout so you know what you have, where it came from, and whether it can be trusted.

That’s the job of the data engineer. And in a world where AI output depends on data input, that job sits at the heart of everything.

A few things have to be true for the refinery to work:

The pipelines have to be reliable. Ingestion from operational systems, logs, events and sensors. Batch and streaming paths where appropriate. Resilience to schema changes, late events and upstream failures. Without this, models starve, drift, or silently degrade on stale inputs.

The data has to be properly modelled. Standardised schemas and clear contracts between the systems that produce data and the teams that consume it. Deduplication, validation and anomaly detection built into the pipeline, not bolted on as an afterthought. Consistent definitions of what “customer” means, what “churn” means, what “conversion” means — because if those definitions vary across systems, your model is quietly learning the noise between them.
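As a rough illustration of validation and deduplication living inside the pipeline, the sketch below checks each record against a simple contract and keeps only the first valid record per customer. The field names, the contract itself and both helper functions are assumptions made for the example, not a recommendation of any specific tool.

```python
# Hypothetical contract: field name -> expected Python type.
REQUIRED_FIELDS = {"customer_id": str, "email": str, "signup_date": str}

def validate(record):
    """Return a list of contract violations for one record (empty means valid)."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record or record[field] in (None, ""):
            problems.append(f"missing {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    return problems

def clean(records):
    """Split records into valid, de-duplicated rows and rejected rows with reasons."""
    seen, valid, rejected = set(), [], []
    for record in records:
        problems = validate(record)
        if problems:
            rejected.append((record, problems))
        elif record["customer_id"] not in seen:
            seen.add(record["customer_id"])
            valid.append(record)
    return valid, rejected

rows = [
    {"customer_id": "c1", "email": "a@example.com", "signup_date": "2024-01-05"},
    {"customer_id": "c1", "email": "a@example.com", "signup_date": "2024-01-05"},  # duplicate
    {"customer_id": "c2", "email": "", "signup_date": "2024-02-11"},               # missing email
    {"customer_id": 3, "email": "c@example.com", "signup_date": "2024-03-02"},     # wrong type
]

valid, rejected = clean(rows)
print(len(valid), "valid;", [problems for _, problems in rejected])
```

Rejected rows are kept alongside their reasons rather than silently dropped, so the upstream team that owns the contract can see what broke.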

Features need to be treated as first-class assets. The signals you engineer from raw data — the features a model actually learns from — should be reusable, versioned and governed. Computed consistently whether you’re training offline or serving in real time. Not scattered across one-off notebook scripts that no one else can maintain.
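The simplest version of "computed consistently" is a single, versioned function that both the training job and the serving path call. The sketch below assumes a hypothetical customer record and two made-up features; a feature store does the same job with far more machinery.

```python
from datetime import date

FEATURE_VERSION = "v1"  # bump when the logic below changes (hypothetical convention)

def build_features(customer, today):
    """Single source of truth for feature logic, used offline and online."""
    return {
        "days_since_signup": (today - customer["signup_date"]).days,
        "orders_per_month": customer["order_count"] / max(customer["months_active"], 1),
    }

# Offline: build a training row from a historical snapshot.
training_row = build_features(
    {"signup_date": date(2023, 1, 1), "order_count": 24, "months_active": 12},
    today=date(2024, 1, 1),
)

# Online: the serving path calls the same function on a live record.
serving_row = build_features(
    {"signup_date": date(2024, 5, 1), "order_count": 3, "months_active": 0},
    today=date(2024, 5, 31),
)

print(training_row)  # {'days_since_signup': 365, 'orders_per_month': 2.0}
print(serving_row)   # {'days_since_signup': 30, 'orders_per_month': 3.0}
```

If the divide-by-zero guard lived in a notebook but not in the serving code, the model would see one definition of the feature in training and another in production.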

Governance can’t be an afterthought. As AI moves closer to consequential decisions — credit, healthcare, hiring, public sector — knowing which data fed which model, who had access to it, and whether it was fit for that purpose stops being a compliance tick-box and becomes part of the safety story.

The loop has to close. How you capture feedback from production — user interactions, implicit signals, explicit labels — and turn it into the next generation of training data is where the compounding advantage comes from. The refinery doesn’t run once. It runs continuously.


Generative AI Turns the Dial Up, Not Off

It’s tempting to think that large language models and generative AI change this equation — that you can just point a capable model at your questions and bypass the data engineering work.

The opposite is true.

Behind every enterprise generative AI application that actually works, there are pipelines fetching the right context from your knowledge bases and data warehouses in real time. There are curated fine-tuning datasets steering the model toward the behaviour you actually want. There are feedback loops turning user interactions into better training data over time. There is, in short, a refinery — just with a different interface at the end of it.
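To make the "fetching the right context" step less abstract, here is a toy sketch. It ranks a handful of documents by simple word overlap with the question and assembles a prompt from the best matches. Word overlap is a stand-in for the embedding search and ranking a production system would use, and the documents, function names and prompt wording are all hypothetical.

```python
def tokenize(text):
    """Lower-case and split on whitespace; deliberately naive."""
    return set(text.lower().split())

def retrieve(question, documents, top_k=2):
    """Rank documents by word overlap with the question (a stand-in for real retrieval)."""
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:top_k]

def build_prompt(question, documents):
    """Assemble the text that would be sent to a language model."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

docs = [
    "Refunds are processed within 14 days of a return.",
    "Our office is closed on public holidays.",
    "Returns must be requested within 30 days of delivery.",
]
question = "How many days do refunds take?"
print(build_prompt(question, docs))
```

The model only ever sees what the retrieval step hands it. If the knowledge base is stale or the ranking is poor, the answer will be too, however capable the model.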

For enterprise use cases, the differentiator is rarely the base model. It’s the quality of the data you connect it to, the rigour of the retrieval and ranking pipelines behind it, and the discipline of the data engineering that makes all of that reliable.

The plumbing is still the point.


If Data Is the Engine, Build the Right Infrastructure

The organisations that are winning with AI aren’t simply the ones with the biggest models. They’re the ones who treat data engineering as a first-class product capability — where data engineers and platform architects are in the room from the start, not brought in to implement decisions that have already been made.

They invest early in shared platform infrastructure: data lakes and warehouses, feature stores, catalogues, quality monitoring, governance and observability. Not one-off pipelines per project, but a proper refinery that serves the whole organisation.

And they build on foundations that can handle the scale and complexity of real enterprise data estates — structured tables alongside documents, images, logs and sensor data; on-premises alongside cloud and edge; batch pipelines alongside real-time streams.

That’s exactly what the Dell AI Data Platform is designed to support: a unified, modular foundation for storing, processing, governing and serving the data that modern AI workloads depend on — so data engineers can focus on building the refinery, rather than firefighting the infrastructure it sits on.


The Refinery Has to Work

The shift from rules-based systems to data-driven AI didn’t just give us more powerful software. It changed where the intelligence lives — and with it, what we need to invest in to make that software trustworthy.

When code was the truth, the bottleneck was engineers writing rules. When data is the truth, the bottleneck is the infrastructure that produces, refines, governs and delivers that data.

The refinery has to work. The pipelines have to be reliable. The fuel has to be clean. Everything downstream — every model, every decision, every output — depends on it.

And if you want to understand what happens when the refinery fails, that’s a story worth reading too.

The Magpie Effect: AI Possibilities vs Practical AI Solutions

An infographic titled 'The Magpie Effect' illustrating a magpie flying alongside various concepts related to AI strategies, constraints, and distractions. Includes labeled arrows pointing to elements like 'Distraction Vector', 'Shiny Object Response', and 'Golden Process - Critical Path'. Features a flowchart at the bottom indicating practical AI strategy.

Yesterday, I was sitting in a room listening to colleagues talk about the latest AI developments. New models. New capabilities. New promises about what AI will be able to do. It was energetic, enthusiastic, and genuinely well-informed.

And somewhere in the middle of it, a quiet thought surfaced: does any of this actually matter right here and right now?

It took me back to a book I read years ago when studying Business and Finance. The Goal by Eliyahu Goldratt. If you haven’t read it, it’s a business novel — a plant manager called Alex Rogo, a factory on the verge of closure, and a series of deceptively simple questions from an old professor that eventually save the business. The central insight is the Theory of Constraints: every system has a bottleneck. Find it, fix it, and the whole system improves. Ignore it, and no amount of optimisation elsewhere will save you.

Sitting in that room, I realised the AI industry has a Goldratt problem. It is endlessly fascinated by what is possible at the frontier, and not nearly interested enough in the bottlenecks that are quietly strangling real businesses today.

That is when the idea of the Magpie Effect crystallised for me.

There is a bird famously distracted by shiny things. It spots something glittering, drops whatever it was doing, and goes to investigate. Every few weeks, a new AI headline lands. A new model. A new capability. A new promise that this is the breakthrough that changes everything. Autonomous agents. Artificial General Intelligence. AI that writes code, runs your supply chain, manages your workforce. The glittering object rotates. The industry pivots. And somewhere in a conference room, a leadership team starts asking whether they should be doing that instead.

The Magpie Effect is quietly one of the biggest obstacles to real AI progress in business today. And Goldratt, I think, would have had little patience for it.

Goldratt was writing about factory floors. Machines, production lines, throughput, inventory. But swap the factory for a financial services firm, a retailer, or a healthcare provider, and the principle holds perfectly. The production lines just look different now. They are the workflows, the approval chains, the data pipelines, the customer journeys that run your business every single day. Constraints in those processes cost just as much as a jammed machine on a shop floor. They are just harder to see.


The Cost of the Chase

Chasing possibilities isn’t free. It has a price, and businesses are paying it in three currencies.

Time. Every pivot towards the next shiny object means restarting conversations, rewriting roadmaps, and pulling teams off work that was already delivering. The opportunity cost is real even when it’s invisible.

Money. Proofs of concept that never graduate. Vendor relationships built on promises. Platforms bought for futures that haven’t arrived. Science projects with no measurable return dressed up as innovation investment.

Momentum. Perhaps most damaging of all. Teams that are perpetually chasing the new never build the deep competency that comes from doing one thing properly, learning from it, and scaling it. They become generalists in hype rather than specialists in value.

The AI industry is not entirely to blame. Vendors need differentiation. Analysts need narratives. Media needs clicks. But the businesses that get swept up in it are making a choice — and there is a different choice available.


Practical AI Is Already Here. It’s Just Less Exciting.

Here is the uncomfortable truth: the AI that will genuinely move the needle for most organisations is not frontier. It is not exotic. And it is definitely not on the cover of a technology magazine.

It is the model that reduces manual data entry by 70%. The system that flags anomalies in financial transactions before a human would spot them. The tool that summarises a week of customer feedback into ten actionable themes in minutes. The scheduler that optimises field service routes and saves 15% on fuel costs.

None of those are science fiction. All of them are in production somewhere today. All of them have a measurable ROI. And none of them required waiting for the next frontier model to drop.

Practical AI solves a real business problem with available technology, accessible data, and a return you can put on a spreadsheet. That is not a consolation prize. That is the goal.


Start With Your Golden Process

Before you evaluate a single AI use case, ask one question: what is the one business process that, if it stopped tomorrow, the business would stop with it?

Not the most complex process. Not the most talked-about. The one that everything else depends on. The process that runs quietly in the background, holding the whole operation together. That is your Golden Process.

It might be order management. It might be claims processing. It might be production scheduling, customer onboarding, or logistics coordination. Every business has one. Most businesses have never explicitly named it.

Name it. Then look at it seriously.

Where does it slow down? Where does it depend on human heroics to keep running? Where does data get re-entered, reformatted, or chased across systems? Where do errors creep in? Those friction points are not just operational annoyances — they are your AI use case shortlist.

This is the starting point of a genuinely business-aligned AI strategy. Not a vendor briefing. Not a technology roadmap. Not a list of capabilities from the latest model release. The Golden Process and the friction within it.

The achievable business outcome — faster, cheaper, more reliable execution of the thing the business depends on most — is worth more than any amount of AI possibility. And it is almost always more achievable than it looks, because the process is already understood, the data already exists, and the ROI is not hypothetical.

Start there. Everything else follows.


How to Tell the Difference: The Practical AI Test

When an AI use case lands on your desk (from a vendor, from a consultant, from an enthusiastic team member who just watched a keynote), run it through these six questions.

1. What specific business problem does this solve? If the answer starts with “it could potentially…” or “imagine if we…” that is a possibility, not a solution. A practical use case has a named problem with a named owner.

2. Do we have the data to support it? AI without good data is not AI. It is noise with a marketing budget. Before you evaluate any use case, ask whether the relevant data exists, whether it is clean, and whether it is accessible. If the answer to any of those is no, the use case is not ready — regardless of how impressive the demo looked.

3. Do we have the skilled resources to support it? Good intentions and good data are not enough. Someone has to build it, maintain it, and own it when something goes wrong. That might be internal talent, a partner, or a combination — but the answer needs to be honest. An AI use case with no clear resource plan is not a use case. It is a wish list item.

4. Can we measure success? Define the KPI before you build anything. Handling time. Error rate. Cost per transaction. Customer satisfaction score. If you cannot articulate what winning looks like in a number, you are not running a business initiative. You are running an experiment with an open-ended budget.

5. Can this be in production within 90 days? Not perfect. Not scaled. But in the hands of real users, generating real outputs, against real data. If the answer is no, the scope is wrong or the foundations are not ready. Either problem needs solving before you proceed.

6. Would this still matter if nobody wrote about it? Strip away the hype. Imagine the use case is completely unglamorous — no AI branding, no press release, just a quiet improvement to a business process. Does it still make the list? If yes, it is probably worth doing. If the answer depends on how it sounds in a strategy presentation, be careful.


Make AI Boring. That Is the Win.

The organisations that are getting the most from AI right now are not the ones chasing the frontier. They are the ones that picked three or four use cases, proved the value, scaled what worked, and moved on to the next problem on the list. Quietly. Methodically. Without waiting for permission from the hype cycle.

They have made AI boring. And boring, in this context, is the highest compliment.

Boring means repeatable. Boring means trusted. Boring means it runs on a Tuesday afternoon without anyone noticing, because it has just become part of how the business works.

The magpies are still circling. The glittering objects will keep appearing. The organisations building durable capability are the ones that have learned to look away — and get back to the work that actually pays.


One Final Thought on Foundations

None of this works without getting data right. Practical AI is only as good as the data that feeds it. The organisations that move fastest are not necessarily the ones with the most advanced models — they are the ones whose data is clean, governed, and ready to use.

That is a less exciting conversation than whatever was announced at the last major AI conference. But it is the conversation that separates progress from theatre.

Data first. Practical use cases second. Everything else can wait.

At the end of The Goal, Alex Rogo doesn’t walk away with a new machine or a bigger budget. He walks away with a better question. Not “how do I optimise everything?” but “what is the goal — and what is stopping us from reaching it?” That shift in thinking is what saved the factory. It is the same shift that separates businesses building real AI capability from those still circling the next shiny object. Find your Golden Process. Name the constraints within it. Ask the right questions about where AI can remove them. The rest follows.


If you found this useful, the Getting AI Right First Time post covers the five steps to moving from AI experiments to durable enterprise capability.

The Token Cost — New Line on the Spreadsheet


A budget breakdown chart outlining costs related to AI infrastructure, including license costs, infrastructure, staff costs, cloud compute, storage per GB, and a new line item for token costs, with arrows indicating actions required.

The Spreadsheet Never Lies

Back when I was an IT Manager, budget time was the one part of the job I genuinely dreaded.

I was technically biased — give me an infrastructure problem over a finance meeting any day. But the spreadsheet had to be built. So out it came, year after year. Rows and columns of licence costs, support contracts, hardware refresh cycles, staff costs, cloud compute, storage per GB. Every line item accounted for, justified, and defended.

Over the years that spreadsheet grew new rows. Cloud costs arrived and changed everything — suddenly you weren’t buying hardware, you were buying consumption. Then came storage costs per GB, virtual machine sprawl, networking costs, SaaS licensing, and the ongoing headache of software nobody was using but everyone was paying for.

Good IT management has always meant knowing what things cost. Not approximately. Precisely.

Now there is a new line to add to that spreadsheet.

Token cost.

But here is the thing. If you stop at the token line, you are optimising for the meter, not the mission.


Tokens 101 — What They Actually Are

Before the cost makes sense, the concept needs to.

A token is the basic unit an LLM uses to process text. Not a word. Not a character. Something in between — a chunk of text that the model reads, processes, and responds to.

When you type a message to a chatbot, the model doesn’t read it the way you wrote it. It breaks it into tokens first — fragments of words, whole words, punctuation, spaces — and processes each one in sequence. The response it generates is also built token by token, each one predicted from everything that came before.

A rough rule of thumb: one token is approximately four characters, or about three quarters of a word. A typical sentence of fifteen words is roughly twenty tokens. A detailed prompt of five hundred words is somewhere around six hundred and fifty tokens.
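To make the rule of thumb concrete, here is a minimal sketch in Python. The function names and the two constants are my own illustration, lifted from the ratios above. A real tokeniser will give different counts depending on the model and the language of the text, so treat this as a budgeting estimate and nothing more.

```python
CHARS_PER_TOKEN = 4       # rule of thumb: about four characters per token
WORDS_PER_TOKEN = 0.75    # rule of thumb: about three quarters of a word per token

def estimate_tokens_from_words(word_count: int) -> int:
    """Rough token estimate from a word count."""
    return round(word_count / WORDS_PER_TOKEN)

def estimate_tokens_from_text(text: str) -> int:
    """Rough token estimate from the number of characters in a string."""
    return round(len(text) / CHARS_PER_TOKEN)

print(estimate_tokens_from_words(15))    # a fifteen-word sentence
print(estimate_tokens_from_words(500))   # a five-hundred-word prompt
```

For anything that feeds a bill, use the token counts your provider reports rather than an estimate like this.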

It adds up quickly. And every token processed — whether going in or coming out — carries a price.


Tokens Are a Meter, Not a Currency

There is a phrase doing the rounds right now. Tokens are the new currency of AI.

It is a neat soundbite. It is also wrong in all the ways that matter if you are trying to build serious AI capability.

Saying tokens are a currency is like saying you paid your electricity bill in kilowatt hours. You didn’t. You consumed kilowatt hours. You paid in money. The kilowatt hour is a unit of consumption — a meter reading, not a medium of exchange.

Tokens are exactly the same. They measure how much work a model is doing. They are the unit on which vendors calculate your bill. But they are not currency. They are consumption — and like every unit of consumption in IT, they carry a cost that needs understanding, governing, and optimising.

The organisations that treat tokens as a vanity metric — “we consumed X billion tokens last quarter!” — are optimising for the wrong number entirely.


The AI Factory and the Cost Behind Every Token

Dell and NVIDIA use the term AI Factory deliberately — because building AI capability at scale really does look like industrial infrastructure. Data pipelines, compute clusters, model serving layers, orchestration, guardrails. A factory for producing AI output at volume.

And like any factory, every unit of output carries a cost of production.

In an AI Factory, the token is the unit of output. And behind every token sits a cost stack most organisations never fully account for.

Infrastructure — GPU and accelerator time, CPU, RAM, networking, storage, cooling, power. Whether you see this directly or it is baked into a vendor’s price per thousand tokens, it is always there.

Model and platform — licensing for proprietary models, platform margin, optional add-ons for latency, SLAs, and private endpoints. Every provider has a margin sitting in the background of every token.

Data and training — models don’t appear from nowhere. Data acquisition, cleaning, fine-tuning, retrieval pipelines, continuous evaluation. All of it is part of the cost of making your tokens useful in your specific context, not just smart in general.

People — ML engineers, platform teams, application developers, security, compliance, prompt engineers. Labour is amortised over output. From a factory lens, every token carries a share of your people cost.

Guardrails and control — orchestration, content filters, safety checks, observability, caching, A/B testing. These are the conveyor belts and safety systems of your AI Factory. They rarely appear on a per-token price card. They always appear on your balance sheet.

The vendor gives you a clean price per thousand tokens. Your real cost per thousand tokens is considerably messier — and considerably higher.


From Token Cost to Outcome Cost

Here is where the conversation needs to move.

A token is a unit of cost. It is not a unit of value. And on its own, cost per token tells you almost nothing about whether your AI investment is working.

The number that actually matters is cost per outcome.

Swap abstract token consumption for something real: tokens per resolved support ticket. Tokens per sales proposal generated. Tokens per code review completed. Tokens per knowledge worker hour saved. Now you can build a unit economic view that means something.

Cost per outcome = (Tokens per outcome ÷ 1,000 × fully loaded cost per thousand tokens) + overheads

Unit margin = Value per outcome − Cost per outcome

Once you see it this way, the conversations become sharper. A cheaper model per thousand tokens that requires three times the tokens per outcome is not a saving. A use case that looks expensive in tokens but delivers enormous value per outcome is not a problem. A system regenerating the same content repeatedly because nobody implemented caching is a straightforward fix hiding in plain sight.
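As a worked example, the two formulas can be sketched in a few lines of Python. Every figure below is hypothetical and chosen only to show the arithmetic. The token count is divided by 1,000 because the price is quoted per thousand tokens.

```python
def cost_per_outcome(tokens_per_outcome, cost_per_1k_tokens, overheads=0.0):
    """Fully loaded cost of producing one outcome."""
    return tokens_per_outcome / 1000 * cost_per_1k_tokens + overheads

def unit_margin(value_per_outcome, cost):
    """Value delivered by one outcome minus what it cost to produce."""
    return value_per_outcome - cost

# Hypothetical comparison: a cheaper model that needs three times the tokens.
cheap_model = cost_per_outcome(tokens_per_outcome=9000, cost_per_1k_tokens=0.02)
premium_model = cost_per_outcome(tokens_per_outcome=3000, cost_per_1k_tokens=0.05)

print(f"cheap model:   {cheap_model:.2f} per outcome")
print(f"premium model: {premium_model:.2f} per outcome")
print(f"premium margin on a 2.00 outcome: {unit_margin(2.00, premium_model):.2f}")
```

With these made-up numbers the model that looks cheaper per thousand tokens is the more expensive one per outcome, which is the point of the paragraph above.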


The Levers: Token Productivity in the AI Factory

If tokens are the output of your AI Factory, token productivity is your primary optimisation lever.

Use the right model for the job. Not everything needs your largest, most capable model. Smaller, cheaper models handle classification, routing, and simple transforms well. Reserve the heavy models for genuinely complex reasoning. A tiered approach — cheap model first, escalate only when needed — can dramatically change your cost per outcome without touching quality.

Optimise prompts and context. Long system prompts and bloated context windows feel powerful. They are also expensive. Strip repetition, keep only relevant context, use structured inputs where possible. Every unnecessary sentence in a prompt is scrap material on the factory floor — and in a high-volume system, scrap accumulates fast.

Cache intelligently. A significant proportion of enterprise AI workloads are repetitive — similar questions, standard documents, known sub-tasks. Response caching, retrieval caching, and partial caching of intermediate steps reduce tokens per outcome without any loss of quality. It is one of the highest-return optimisations available and one of the most consistently overlooked.

Design around outcomes, not demos. Demos optimise for the impressive moment. Factories optimise for throughput and margin. Start from the business outcome, the current human cost of achieving it, and the target cost with AI. Then design the system backwards from that constraint — not forwards from whatever the latest model happens to be capable of.
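To illustrate the caching lever, here is a minimal sketch of an exact-match response cache. The class name, the `fake_model` stand-in, and the flat 120-token cost are all assumptions made for the example. A production system would call a real model and would probably need expiry rules and smarter matching than simple text normalisation.

```python
import hashlib

class ResponseCache:
    """Caches responses by normalised prompt and tracks tokens spent versus saved."""

    def __init__(self, generate):
        self._generate = generate   # callable: prompt -> (response, tokens_used)
        self._store = {}
        self.tokens_spent = 0
        self.tokens_saved = 0

    @staticmethod
    def _key(prompt):
        # Lower-case and collapse whitespace so trivial variations share a key.
        normalised = " ".join(prompt.lower().split())
        return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

    def ask(self, prompt):
        key = self._key(prompt)
        if key in self._store:
            response, tokens = self._store[key]
            self.tokens_saved += tokens     # served from cache, no model call
            return response
        response, tokens = self._generate(prompt)
        self._store[key] = (response, tokens)
        self.tokens_spent += tokens
        return response

def fake_model(prompt):
    """Hypothetical stand-in for a model call; pretends each answer costs 120 tokens."""
    return f"answer to: {' '.join(prompt.lower().split())}", 120

cache = ResponseCache(fake_model)
cache.ask("What is our refund policy?")
cache.ask("what is our   refund policy?")   # same question, different case and spacing
print(cache.tokens_spent, cache.tokens_saved)
```

The second question never reaches the model, so its tokens show up as saved rather than spent. Tracking both numbers is what lets you report the saving per outcome.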


Token Cost as a Governance Question

This is familiar territory for anyone who has managed cloud costs or software licensing.

Token consumption is a shared resource. Different business units, different applications, and different use cases will consume it at different rates and generate very different outcomes per token. Without visibility into that consumption — tracked by application, by business unit, by use case — you have no basis for budgeting, no mechanism for chargeback, and no way to identify where usage is growing faster than the value it is generating.
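A minimal sketch of that visibility, assuming you can log tokens and outcomes per request. The record layout and the sample figures are hypothetical. In practice the data would come from your provider's usage reports or an internal gateway.

```python
from collections import defaultdict

# Hypothetical usage records, one per batch of requests.
usage_log = [
    {"business_unit": "support", "application": "ticket-bot", "tokens": 4200, "outcomes": 3},
    {"business_unit": "support", "application": "ticket-bot", "tokens": 1800, "outcomes": 1},
    {"business_unit": "sales", "application": "proposal-writer", "tokens": 12000, "outcomes": 2},
    {"business_unit": "sales", "application": "lead-scorer", "tokens": 3000, "outcomes": 0},
]

def summarise(log, key):
    """Total tokens and outcomes per group, plus tokens per outcome where defined."""
    totals = defaultdict(lambda: {"tokens": 0, "outcomes": 0})
    for row in log:
        totals[row[key]]["tokens"] += row["tokens"]
        totals[row[key]]["outcomes"] += row["outcomes"]
    summary = {}
    for name, t in totals.items():
        per_outcome = t["tokens"] / t["outcomes"] if t["outcomes"] else None
        summary[name] = {**t, "tokens_per_outcome": per_outcome}
    return summary

by_unit = summarise(usage_log, "business_unit")
by_app = summarise(usage_log, "application")
for name, stats in by_app.items():
    print(name, stats)
```

An application that consumes tokens but records no outcomes, like the made-up lead scorer here, is exactly the kind of line a chargeback review should surface.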

A note on agentic AI: if your organisation is moving into agentic deployments — systems that reason across multiple steps, use tools, retrieve information, and check their own work — the token cost profile changes significantly. A standard chatbot interaction might consume a few hundred tokens. An agentic workflow handling the same underlying task can consume tens of thousands. Model it separately. Budget it separately. The capability gain can be substantial, but the consumption profile is a different order of magnitude.


Optimising for the Mission

Back at that budget spreadsheet, the discipline was always the same. Know what you consume. Know what it costs. Know who is consuming it. And know what value it is generating.

Tokens deserve exactly that discipline. Not because they are a currency. Because they are a cost — the most visible signal of the underlying economics of your AI Factory.

The token line on the bill matters. But the executives asking “what is our token budget this year?” are asking the wrong question.

The right questions are these: Which AI-enabled outcomes matter for our business? What is our target cost per outcome? What mix of models, infrastructure, and data do we need to get there? And how do we measure value per outcome — not just tokens consumed?

Tokens are how you keep score in the background. Outcomes are why you are playing.

If your AI strategy stops at tokens, you are optimising for the meter, not the mission.

AI is Closer Than You Think: Machine Learning Part 4

Flowchart illustrating the agent feedback loop in reinforcement learning, featuring steps: Observe State, Choose Action, Receive Reward, and an environment section with 'Try -> Fail -> Adjust' notation.

Game Over

Miss Smith taught me to predict the future with a straight line.

A shoebox of CDs taught me to find patterns without labels.

Reinforcement learning came from somewhere else entirely.

It came from losing.

The game was Doom. Dark corridors, demons around every corner, no manual, no walkthrough, no one to copy. Just a Marine on a screen, a keyboard under my fingers, and the immediate brutal feedback of getting it wrong.

I died constantly. But somewhere in that cycle of failure, something started to form. Move left here. Don’t open that door without backing up first. The shotgun beats the pistol in a corridor. The game never explained any of it. It just reacted — survival or slaughter — and I adjusted.

Try something. Die. Try something slightly different. Die slightly later. Somewhere in that loop, a strategy emerged.

That feedback loop, as it turns out, is one of the most powerful ideas in machine learning.


The Third Way of Learning

In Part 2, supervised learning gave the machine a cheat sheet — labelled examples with known answers to learn from. In Part 3, unsupervised learning pulled the cheat sheet away and let structure emerge from the data itself.

Reinforcement learning is different from both. There are no labelled examples to study and no static dataset to mine. Instead, there is an agent, an environment, and a simple but profound arrangement: act, observe the consequences, and gradually learn a strategy that leads to better outcomes over time.

Not “get this prediction right.” Not “find what naturally groups together.” But — do well in the long run.

That shift from a single correct answer to long-term cumulative reward is what makes reinforcement learning feel unlike anything else in the machine learning story.


The Loop

The mechanics are simple enough to hold in your head.

An agent observes the current state of its environment. It chooses an action. The environment responds — a reward, a penalty, or silence. The agent updates its understanding of which actions tend to lead where, and chooses again.

Run that loop millions of times and something remarkable happens. What started as near-random behaviour gradually sharpens into strategy. Not because anyone defined the right moves in advance, but because the feedback itself did the teaching.
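For readers who want to see the loop in code, here is a minimal sketch using tabular Q-learning, one common reinforcement learning algorithm. It is not how AlphaGo works. The five-cell corridor, the reward of 1.0 at the far end, and the parameter values are all assumptions chosen to keep the example tiny.

```python
import random

N_STATES = 5                             # corridor cells 0..4; cell 4 is the goal
ACTIONS = (-1, +1)                       # step left, step right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2    # learning rate, discount, exploration rate

def step(state, action):
    """Environment: move along the corridor; reward 1.0 only on reaching the goal."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def choose_action(q, state, rng):
    """Epsilon-greedy: usually the best-known action, sometimes a random one."""
    if rng.random() < EPSILON:
        return rng.randrange(len(ACTIONS))
    best = max(q[state])
    return rng.choice([i for i, value in enumerate(q[state]) if value == best])

def train(episodes=500, max_steps=100, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # value estimate per (state, action)
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            a = choose_action(q, state, rng)                    # choose an action
            next_state, reward, done = step(state, ACTIONS[a])  # environment responds
            target = reward + (0.0 if done else GAMMA * max(q[next_state]))
            q[state][a] += ALPHA * (target - q[state][a])       # update understanding
            state = next_state
            if done:
                break
    return q

q_table = train()
policy = ["left" if row[0] > row[1] else "right" for row in q_table[:-1]]
print(policy)
```

Nobody tells the agent that "right" is the correct move. The preference emerges because moves to the right are the ones that eventually lead to reward.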

This is how DeepMind’s AlphaGo learned to defeat the world’s best human players using strategies no human had ever conceived. Not by studying a rulebook, but by playing millions of games against itself, adjusting after every one. The data didn’t come from a spreadsheet. It came from experience.


Rewards, Not Labels

In supervised learning the feedback is immediate and precise. You predicted £350,000. The true answer was £320,000. Here is exactly how wrong you were.

In reinforcement learning it rarely works like that.

Think of teaching a dog a new trick. You don’t label each tiny movement as right or wrong. You reward certain sequences of behaviour — sit, stay, come — and ignore or discourage others. Over time the dog figures out which actions tend to lead to treats, even though nobody narrated the journey step by step.

Reinforcement learning agents learn in the same way. They explore — often taking random actions just to see what happens. When they stumble onto something that works, the algorithm quietly reinforces the decisions that led there. When they hit bad outcomes, it weakens them.

The hard part is that rewards can be a long time coming. In a game, the winning move might trace back to a decision made hundreds of steps earlier. In a supply chain, a choice that looks costly today might pay off weeks later. Working out which actions deserve the credit — or the blame — across a long sequence of decisions is one of reinforcement learning’s central challenges. And one of its most important unsolved problems.
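One common, partial answer to that credit question is discounting: a reward is passed back along the sequence of decisions, shrinking a little at every step. The sketch below shows the arithmetic only. The reward sequence and the discount factor of 0.5 are made up for illustration.

```python
def discounted_returns(rewards, gamma=0.9):
    """For each step, the reward from that point on, discounted by gamma per step."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

# Four steps with no feedback, then a single reward at the end.
rewards = [0, 0, 0, 0, 1]
print(discounted_returns(rewards, gamma=0.5))
```

The earliest decision still receives some credit for the final reward, just less of it than the decision made immediately before the payoff.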


Closer to Everyday Life Than You Think

The famous examples tend to involve games and robots — AlphaGo, robotic arms, drones balancing in turbulence. These make good headlines. But reinforcement learning is also quietly at work in less dramatic places.

The route your sat-nav recalculates in real time. The dynamic pricing that adjusts what you’re shown based on demand and your behaviour. The recommendation engine that doesn’t just respond to what you clicked last, but learns how to keep you engaged across an entire session. In each case, a sequence of decisions is being optimised for a long-term outcome — not a single prediction, but a strategy playing out over time.

At enterprise scale, the same principles are starting to touch logistics, energy management, and operational scheduling — systems where the cost of a bad sequence of decisions is measured not in lost points but in real money and real consequences. (The data infrastructure that makes that possible is something we explore in our [Data and AI series].)


When the Game Goes Wrong

When I was losing at video games, the stakes were low. The worst outcome was a game over screen and a bruised ego. I could experiment freely because nothing real was at risk.

Reinforcement learning systems don’t always have that luxury — and when the reward is defined badly, the results can be deeply strange.

Agents are remarkably good at finding ways to maximise whatever score they’ve been given, including ways nobody anticipated and nobody wanted. A robot that learns to exploit a bug in the simulation. An ad system that maximises clicks by surfacing content that is technically engaging but clearly not what anyone intended. The agent isn’t being clever or malicious. It’s doing exactly what it was asked. The problem is that what it was asked and what was actually wanted weren’t quite the same thing.

This is why reinforcement learning often starts in simulation, with careful constraints before anything touches the real world. And it’s why the design of the reward itself is not a technical afterthought — it’s one of the most consequential decisions in the whole system.


One Is About Answers. One Is About Patterns. One Is About Behaviour.

Supervised learning, unsupervised learning, and reinforcement learning aren’t separate islands. In practice they layer and interweave — reinforcement learning agents often use supervised models inside themselves, predicting future rewards or modelling parts of their environment. Unsupervised techniques help them compress complex states into something manageable.

But as a map of the territory, the distinction holds.

Supervised learning is about answers. Unsupervised learning is about patterns. Reinforcement learning is about behaviour — learning how to act in an environment to maximise what matters over time.


Trial, Error, and Responsibility

When I finally stopped dying in that game, it wasn’t because someone handed me the solution. It was because the loop — try, fail, adjust, try again — had quietly built something that worked.

Reinforcement learning systems learn the same way. Which means they will find strategies we didn’t anticipate, shortcuts we didn’t design, and solutions we didn’t know were possible. Sometimes that’s extraordinary. Sometimes it’s a warning.

The trial-and-error loop doesn’t go away. But once you give an agent the power to act in the world, the design of its rewards, constraints, and environment becomes an ethical question as much as a technical one.

Because players, as I learned the hard way, will do whatever it takes to win the game you set in front of them — whether or not it’s the game you thought you were designing.


In the next post, we step back from the mechanics and look at the bigger picture — where these three ways of learning meet the real world, and what it means to build systems that don’t just work, but work well.

AI is Closer Than You Think: Machine Learning Part 2

Graph illustrating supervised learning in machine learning, showing a relationship between feature input and price output, with labeled axes for predictions and adjustments, and annotations for training data and patterns found.

Miss Smith’s Straight Line

Miss Smith did a number on me. Back at school, the one class I liked was maths. I liked the logic and solving puzzles through patterns. My teacher, Miss Smith, was awesome. One afternoon she drew a straight line through a cloud of scattered points on graph paper. And something clicked.

“Wait. I can predict the future with this?”

That moment — without knowing it — was my introduction to supervised learning. The idea that you can study past examples, find the pattern underneath, and use it to make confident predictions about something you’ve never seen before.

It’s the same principle that prices houses, filters your inbox, approves your mortgage, and — as I discovered on a campsite in Wales — catches fraud on your bank card in real time.

In Part 2 of my Machine Learning series, I explore how supervised learning actually works. Not the maths. Not the code. The idea — and why it matters.

Because Miss Smith’s straight line didn’t stay on graph paper. It scaled up into some of the most consequential technology running in the world today.


The Cheat Sheet, Revisited

In Part 1, I introduced supervised learning as the approach where machines learn from labelled data — a cheat sheet of examples where the correct answer is already known. Now let’s get under the bonnet.

The best way to understand supervised learning is through a problem everyone has an instinct for: house prices.

Imagine you’re trying to predict what a house will sell for. You have years of sales data — hundreds of houses, each one described by its size, number of bedrooms, age, location, and a dozen other details. And crucially, you know what each one actually sold for.

That dataset is your cheat sheet. The model’s job is to study it — not to memorise it, but to find the pattern underneath. The relationship between what a house is and what it’s worth. Once it has that pattern, you hand it a house it has never seen before, and it makes a prediction.

That’s supervised learning. Past labelled examples, used to make confident predictions about the future.

Miss Smith would recognise it immediately.


Two Ways the Problem Can Look

Supervised learning shows up in two broad flavours, depending on what you’re trying to predict.

The first is regression — predicting a number. House prices are a regression problem. So is forecasting next week’s energy demand, estimating how long before a machine needs maintenance, or predicting a patient’s recovery time. The output is a value on a continuous scale.

The second is classification — predicting a category. Is this email spam or not? Is this transaction fraudulent or legitimate? Is this scan showing signs of disease? The output is a label, a decision, a bucket.

The mechanics underneath both are remarkably similar. In each case, the model studies labelled examples, finds the pattern connecting inputs to outputs, and applies that pattern to new data it hasn’t seen before.

What changes is the shape of the answer.


Learning from Mistakes

Here’s what makes supervised learning work — and what makes it genuinely clever.

The model doesn’t start knowing anything. It begins with a guess — an initial, probably terrible attempt at the pattern. For a house price model, that first guess might be wildly wrong. A three-bedroom terrace in Manchester valued like a penthouse in Chelsea.

But the model knows it’s wrong, because it has the actual sale price right there in the training data. So it adjusts. It tweaks its understanding of which features matter and by how much. Then it guesses again. Checks again. Adjusts again.

Predict. Compare. Adjust. Repeat.

Run that loop across thousands of examples, and something remarkable happens. The model stops guessing and starts understanding. Not because anyone explained the relationship between a house and its value — but because the data did.
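This post deliberately skips the code, but for anyone curious, the predict, compare, adjust loop can be sketched in plain Python. The adjustment rule used here is gradient descent on the mean squared error, one standard choice among several. The five houses and their prices are invented so that the true pattern is known in advance.

```python
# Hypothetical training data: floor area (hundreds of square metres) -> price (£ thousands).
sizes = [0.5, 0.7, 0.9, 1.1, 1.3]
prices = [200.0, 260.0, 320.0, 380.0, 440.0]   # generated from price = 300 * size + 50

def train(sizes, prices, learning_rate=0.1, rounds=5000):
    weight, bias = 0.0, 0.0                    # the initial, terrible guess
    n = len(sizes)
    for _ in range(rounds):
        # Predict: apply the current pattern to every house.
        predictions = [weight * x + bias for x in sizes]
        # Compare: how far off was each prediction?
        errors = [p - y for p, y in zip(predictions, prices)]
        # Adjust: nudge weight and bias in the direction that shrinks the error.
        grad_w = 2 * sum(e * x for e, x in zip(errors, sizes)) / n
        grad_b = 2 * sum(errors) / n
        weight -= learning_rate * grad_w
        bias -= learning_rate * grad_b
    return weight, bias

weight, bias = train(sizes, prices)
print(f"learned pattern: price = {weight:.1f} * size + {bias:.1f}")
print(f"prediction for an unseen house of size 1.0: {weight * 1.0 + bias:.1f}")
```

The model is never told the formula behind the prices. It recovers something very close to it purely by repeating the loop.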

This is the same fundamental shift we explored in Part 1. Logic didn’t drive the output. Data did.


When It Goes Wrong

Supervised learning is powerful. It’s also fallible in ways worth understanding.

The most common trap is overfitting — when a model studies its training data so closely it starts memorising quirks rather than learning patterns. It performs brilliantly on examples it’s seen before and poorly on anything new. Like a student who memorises past exam papers word for word, then struggles the moment a question is phrased differently.

The fix is discipline. You always hold back a portion of your data — a test set the model never sees during training — and use it to check whether what the model learned actually generalises.
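Here is a small sketch of why the held-back test set matters. The "memoriser" below simply returns the price of the most similar house it has seen, a crude stand-in for an overfitted model. The straight line is fitted by ordinary least squares. All the data points are invented.

```python
# Hypothetical data: (floor area in square metres, price in £ thousands).
train_set = [(50, 170), (60, 175), (80, 265), (90, 270), (110, 350), (120, 360)]
test_set = [(72, 226), (98, 304)]   # held back; never used for fitting

def memoriser(x):
    """Return the price of the nearest training house: perfect recall, no pattern."""
    nearest = min(train_set, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

def fit_line(data):
    """Ordinary least squares for a single feature; returns a prediction function."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in data)
             / sum((x - mean_x) ** 2 for x, _ in data))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def mean_abs_error(model, data):
    return sum(abs(model(x) - y) for x, y in data) / len(data)

line = fit_line(train_set)
for name, model in (("memoriser", memoriser), ("straight line", line)):
    print(f"{name}: train error {mean_abs_error(model, train_set):.1f}, "
          f"test error {mean_abs_error(model, test_set):.1f}")
```

The memoriser scores a perfect zero on the houses it has already seen and does far worse on the two it has not. Without the test set, it would look like the better model.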

The deeper trap is bad data. Supervised learning is only as honest as the labels it learns from. If your training data reflects historical biases — in lending decisions, hiring patterns, medical diagnoses — the model will learn those biases and reproduce them at scale. Quietly. Confidently. At speed.

This is why the quality of data isn’t just a technical question. It’s an ethical one. In our [Data and AI series], we explored what it takes to transform raw data into something trustworthy. In supervised learning, that trustworthiness isn’t optional — it’s foundational.


The Straight Line, Scaled Up

Miss Smith’s straight line through a cloud of points is, technically, linear regression — one of the simplest supervised learning models there is. It assumes the relationship between inputs and output can be captured in a weighted sum. Each feature gets a number, and the model learns which numbers make the predictions most accurate.

It sounds simple. In some contexts it’s all you need. In others — where the relationships are non-linear, where the data is images or language or sound — you need something more powerful. Decision trees. Random forests. Gradient boosting. And at the frontier, the neural networks that underpin deep learning.

But here’s what matters: no matter how complex the model, the supervised learning loop doesn’t change.

Past labelled examples. A pattern found in the data. A prediction made about something new.

Miss Smith’s straight line. Just drawn with considerably more dimensions.


Closer Than You Think

Supervised learning isn’t an abstract concept living in a textbook. It priced the house you live in. It filtered the spam you never saw this morning. It flagged the fraudulent transaction on your card — which, as it happens, I know from personal experience it does rather well on a campsite in Wales.

In the next post, I’ll look at unsupervised learning — what happens when there’s no cheat sheet, no labelled answer, and the machine has to find the pattern entirely on its own.

The results, as it turns out, are often the most surprising of all.

Having my Cake and Eating it

A freshly baked round vanilla cake cooling on a wire rack, displaying a golden-brown top with a slight crack.

Last year, during my machine learning studies, I was handed two datasets: one detailing cake ingredients, one recording baking outcomes. My task was to find the optimal ingredient ratios — the mathematical sweet spot that balanced flavour, consistency, cost and more simultaneously.

The optimisation worked. After running the model, I had my answer: five numbers, each between zero and one, representing the theoretically perfect proportions of flour, sugar, butter, eggs and milk.

The problem? Ratios are just numbers. You can’t bake numbers.

So I did something that felt slightly absurd at the time: I connected my ML model to generative AI — specifically to Claude — and asked it to take those optimised ratios and turn them into a recipe a human could actually follow. In the voice of Gordon Ramsay, if possible.

It worked. Rather well, actually.

This post is the write-up of that experiment. It’s the story of what I built, why it works, and what it taught me about the difference between two distinct but complementary types of AI: machine learning that optimises, and generative AI that communicates.

ML + GenAI = Cake.


What does the perfect cake look like?

Not aesthetically — mathematically. If you had to define “perfect” as a set of competing objectives — flavour, consistency, calorie density, ingredient cost, waste — and then search across every possible combination of flour, sugar, butter, eggs and milk to find the recipe that satisfies all of them simultaneously, how would you do it?

You could try every combination. But five ingredients, each varying continuously between zero and one, produce an effectively infinite search space. Brute force is out.

You could guess, bake, taste, adjust, and repeat. That’s essentially what bakers have done for centuries. It works, but it’s slow, expensive, and entirely dependent on the baker’s intuition about which direction to adjust next.

Or you could do what this project does: treat cake baking as a five-dimensional optimisation problem, use a machine learning technique called Bayesian Optimisation to navigate that space intelligently, and then hand the winning recipe ratios to Claude and ask Gordon Ramsay — or Mary Berry, or Jamie Oliver — to turn them into something you’d actually want to bake.

The cake is a vehicle. What we’re really exploring is one of the most elegant and underappreciated ideas in machine learning: how to search for the best answer when testing every answer is too expensive.


The Problem With Trying Everything

In earlier posts in this series, we described machine learning as finding patterns that connect inputs to outputs. Most of the techniques we’ve explored — supervised learning, attention mechanisms, data pipelines — are about learning from large amounts of existing data.

Bayesian Optimisation is different. It’s designed for situations where generating data is costly.

Imagine you’re developing a new drug formulation. Each trial costs £50,000 and takes six months. Or you’re tuning the hyperparameters of a large neural network — each training run takes days of GPU time. Or, more deliciously, you’re trying to find the objectively best cake recipe and you only have so much butter.

In all of these cases, you can’t afford to try everything. You need a strategy for choosing which experiment to run next — one that makes the most of everything you’ve already learned.

That’s exactly what Bayesian Optimisation does.


The Surrogate Model: Building a Mental Map

The key idea behind Bayesian Optimisation is the surrogate model — a simpler, faster model that approximates the expensive function you’re trying to optimise.

Think of it like this. You’re a food critic trying to find the best restaurant in a city you’ve never visited. You can’t eat at every restaurant — there are thousands of them and you only have a week. But after visiting ten or fifteen, you start to build a mental map: good restaurants tend to cluster in certain neighbourhoods, certain cuisine types, certain price ranges. You use that mental model to decide where to try next — not randomly, but informed by what you’ve already learned.

In our cake system, the surrogate model is a Gaussian Process — a mathematical framework that, given the recipes we’ve already evaluated, produces two things for any recipe we haven’t tried yet: a predicted score (our best guess at how good it would be) and a measure of uncertainty (how confident we are in that prediction).

The Gaussian Process is fitted to a starting dataset of twenty recipes — a mix of traditional ingredient ratios and random samples. From there, it builds a probabilistic map of the entire five-dimensional recipe space, identifying where good recipes are likely to live and, crucially, where we don’t yet know enough to say.
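To show what "a predicted score and a measure of uncertainty" means in practice, here is a minimal Gaussian Process sketch in plain Python. It is not the project's code. It works in one dimension instead of five, uses a zero-mean prior with a squared-exponential kernel, and the length scale, noise level, and four scored recipes are all assumptions made for illustration.

```python
import math

def rbf(a, b, length_scale=0.2):
    """Squared-exponential kernel: nearby inputs are assumed to score similarly."""
    return math.exp(-((a - b) ** 2) / (2 * length_scale ** 2))

def cholesky(matrix):
    """Lower-triangular factor L with L times L-transpose equal to matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower

def solve(lower, rhs):
    """Solve (L L-transpose) x = rhs by forward then backward substitution."""
    n = len(rhs)
    z = [0.0] * n
    for i in range(n):
        z[i] = (rhs[i] - sum(lower[i][k] * z[k] for k in range(i))) / lower[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(lower[k][i] * x[k] for k in range(i + 1, n))) / lower[i][i]
    return x

def gp_predict(train_x, train_y, query, noise=1e-6):
    """Posterior mean and standard deviation at a single query point."""
    n = len(train_x)
    K = [[rbf(train_x[i], train_x[j]) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    L = cholesky(K)
    k_star = [rbf(x, query) for x in train_x]
    mean = sum(k * a for k, a in zip(k_star, solve(L, train_y)))
    variance = rbf(query, query) - sum(k * v for k, v in zip(k_star, solve(L, k_star)))
    return mean, math.sqrt(max(variance, 0.0))

# Hypothetical: four recipes already scored, described by a single sugar ratio.
sugar = [0.1, 0.4, 0.6, 0.9]
score = [-0.8, -0.3, -0.2, -0.7]
for q in (0.4, 0.5, 5.0):
    mean, std = gp_predict(sugar, score, q)
    print(f"ratio {q}: predicted score {mean:.3f}, uncertainty {std:.3f}")
```

At a ratio that has already been tested the uncertainty collapses to almost nothing, and far from any tested recipe it climbs back to the prior. The query at 5.0 lies outside the valid ratio range and is included only to make that contrast visible.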


The Acquisition Function: Deciding What to Try Next

Having a map is useful. But which location on the map do you visit next?

This is the job of the acquisition function — specifically, Expected Improvement (EI). It answers a deceptively simple question: given everything we know so far, which untested recipe is most likely to beat our current best?

Expected Improvement balances two competing instincts:

Exploitation — test recipes in regions we already know are promising. If we’ve found that high butter and moderate sugar scores well, try nearby variations.

Exploration — test recipes in regions we’re uncertain about. There might be something surprising in a corner of the space we haven’t visited yet, and if we only exploit what we know, we’ll never find it.

The EI function quantifies this trade-off mathematically. Regions with high predicted scores get attention because they’re likely to be good. Regions with high uncertainty get attention because the potential upside is large. The next recipe to evaluate is the one that maximises expected improvement, balancing both factors simultaneously.

# Simplified Expected Improvement calculation.
# Written for minimisation: a lower penalty score is better.
from scipy.stats import norm  # standard normal cdf and pdf

def expected_improvement(mean, std, best_so_far):
    improvement = best_so_far - mean  # predicted gain over the current best
    Z = improvement / (std + 1e-9)  # gain normalised by our uncertainty
    ei = improvement * norm.cdf(Z) + std * norm.pdf(Z)
    return ei

After evaluating the suggested recipe, the Gaussian Process updates its map, the acquisition function identifies the next best candidate, and the cycle repeats — for forty to fifty iterations in our system. Each iteration makes the map more accurate and the search more targeted.

This is Bayesian Optimisation: an intelligent, iterative loop that learns from every experiment to make the next one count more.
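To see the exploration and exploitation trade-off in numbers, here is a small self-contained sketch that scores three candidate recipes and picks the next one to bake. It uses the standard library's `NormalDist` in place of SciPy and treats the score as a penalty to be minimised. The candidate names, predicted means, and uncertainties are hypothetical.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()   # mean 0, standard deviation 1

def expected_improvement(mean, std, best_so_far):
    """Expected Improvement for minimisation (lower penalty is better)."""
    improvement = best_so_far - mean
    if std <= 0:
        return max(improvement, 0.0)   # no uncertainty: the gain is known exactly
    z = improvement / std
    return improvement * STANDARD_NORMAL.cdf(z) + std * STANDARD_NORMAL.pdf(z)

# Hypothetical surrogate output: (predicted penalty, uncertainty) per candidate.
candidates = {
    "near current best": (0.48, 0.02),   # slightly better, and we are fairly sure
    "uncertain corner": (0.70, 0.40),    # looks worse, but we know very little
    "known bad region": (0.90, 0.02),    # worse, and we are fairly sure
}
best_so_far = 0.50

scores = {name: expected_improvement(mean, std, best_so_far)
          for name, (mean, std) in candidates.items()}
next_candidate = max(scores, key=scores.get)

for name, ei in scores.items():
    print(f"{name}: EI = {ei:.4f}")
print("next recipe to evaluate:", next_candidate)
```

With these numbers the uncertain corner wins, even though its predicted penalty is worse than the current best. The small, safe gain on offer nearby is outweighed by the chance of a large one somewhere unexplored.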


Five Dimensions, Five Objectives

Our recipe space has five ingredients — flour, sugar, butter, eggs, milk — each represented as a continuous ratio between zero and one. Every point in that five-dimensional space is a candidate recipe.

Every candidate recipe is evaluated against five competing objectives:

  • Flavour — rewarding an optimal sugar-to-butter ratio and sensible flour range
  • Consistency — rewarding an appropriate flour-to-liquid ratio
  • Calories — penalising deviation from a target calorie density
  • Waste — penalising extreme quantities at either end of the range
  • Cost — penalising deviation from a target recipe cost, using weighted ingredient prices

Each objective contributes a penalty score. The goal is to find the recipe whose total penalty is closest to zero — the theoretical point of perfect balance across all five dimensions simultaneously.

Real-world baking variability is simulated by adding a small amount of random noise to each evaluation. This is deliberate: in practice, optimisation problems are rarely perfectly smooth, and a robust system needs to navigate noise rather than overfit to it.
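The shape of such a scoring function can be sketched as follows. This is an illustration of the structure only: the post names the five objectives but not the formulas, so every target, weight, and penalty rule below is a hypothetical placeholder, chosen so that one example recipe scores close to zero.

```python
import random

# Hypothetical per-ingredient weights (flour, sugar, butter, eggs, milk) and targets.
CALORIE_WEIGHTS = (0.30, 0.40, 0.70, 0.15, 0.05)
PRICE_WEIGHTS = (0.10, 0.15, 0.50, 0.30, 0.10)
CALORIE_TARGET = 0.75
COST_TARGET = 0.475

def weighted_sum(weights, recipe):
    return sum(w * r for w, r in zip(weights, recipe))

def total_penalty(recipe, noise_scale=0.0, rng=None):
    """Sum of five penalty terms, plus optional Gaussian noise for baking variability."""
    flour, sugar, butter, eggs, milk = recipe
    penalties = {
        "flavour": abs(sugar - butter),                  # assumed: sugar matched to butter
        "consistency": abs(flour - (eggs + milk)),       # assumed: flour matched to liquid
        "calories": abs(weighted_sum(CALORIE_WEIGHTS, recipe) - CALORIE_TARGET),
        "waste": sum(max(0.0, 0.1 - r) + max(0.0, r - 0.9) for r in recipe),
        "cost": abs(weighted_sum(PRICE_WEIGHTS, recipe) - COST_TARGET),
    }
    noise = rng.gauss(0.0, noise_scale) if rng and noise_scale > 0 else 0.0
    return sum(penalties.values()) + noise

balanced = (0.5, 0.5, 0.5, 0.25, 0.25)
extreme = (0.95, 0.05, 0.9, 0.0, 0.0)
print(f"balanced: {total_penalty(balanced):.3f}")
print(f"extreme:  {total_penalty(extreme):.3f}")
print(f"balanced, noisy bake: {total_penalty(balanced, 0.05, random.Random(1)):.3f}")
```

The optimiser only ever sees the final number. It has no idea which of the five terms produced it, which is why the noise matters: two bakes of the same recipe will not return exactly the same score.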

The best recipe found after fifty iterations is not a human-readable recipe. It’s a set of five numbers — ratios that represent an optimal point in an abstract mathematical space. Which is where the second half of the system comes in.


From Ratios to Recipes: Handing Off to Claude

Once optimisation is complete, the winning ratios are passed to the Claude API with a prompt that does three things: scales the abstract ratios into real measurements for a standard nine-inch cake, adds the supporting ingredients any real recipe needs (baking powder, salt, vanilla), and writes the whole thing in the voice of a specific chef persona.

The same optimised ratios, rendered six different ways:

Jamie Oliver gets enthusiastic and accessible — “roughly a good big mug of flour,” “a proper glug of vanilla.” Gordon Ramsay gets precise and demanding — grams, temperatures, technique. Mary Berry gets gentle and reassuring — “fold carefully,” “don’t rush the creaming stage.” The German version arrives in Bavarian style. The Mandarin version in Chinese.

The mathematical optimisation and the language generation are entirely separate concerns. The optimiser doesn’t know anything about chefs. Claude doesn’t know anything about Gaussian Processes. Each does what it does best, and the handoff between them is just five numbers and a prompt.

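# Example inputs so the snippet runs on its own.
chef_persona = "Gordon Ramsay"  # any of the personas described above
ratios = [0.375, 0.243, 0.232, 0.196, 0.188]  # flour, sugar, butter, eggs, milk

prompt = f"""
You are {chef_persona}. Convert these optimised ingredient ratios into a
complete recipe for a 9-inch cake. Ratios: flour={ratios[0]:.3f},
sugar={ratios[1]:.3f}, butter={ratios[2]:.3f}, eggs={ratios[3]:.3f},
milk={ratios[4]:.3f}. Scale to real measurements, add standard supporting
ingredients, and write in your characteristic voice.
"""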

This pattern — optimise mathematically, communicate humanly — is increasingly how AI systems work in practice. The model doesn’t replace the domain expertise. It translates it.
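
The separation of concerns described above can be sketched as a small prompt builder: the ratios stay fixed and only the persona changes. The function name `build_prompt` and the persona list are assumptions made for this example, and the call to the Claude API itself is deliberately left out.

```python
PERSONAS = ["Jamie Oliver", "Gordon Ramsay", "Mary Berry"]  # illustrative subset


def build_prompt(chef_persona, ratios):
    """Build one persona-specific prompt from the same five optimised ratios."""
    flour, sugar, butter, eggs, milk = ratios
    return (
        f"You are {chef_persona}. Convert these optimised ingredient ratios into a "
        f"complete recipe for a 9-inch cake. Ratios: flour={flour:.3f}, "
        f"sugar={sugar:.3f}, butter={butter:.3f}, eggs={eggs:.3f}, "
        f"milk={milk:.3f}. Scale to real measurements, add standard supporting "
        f"ingredients, and write in your characteristic voice."
    )


# The optimiser's output: five numbers, nothing about chefs.
ratios = [0.375, 0.243, 0.232, 0.196, 0.188]

# The language side: one prompt per persona, same numbers every time.
prompts = {persona: build_prompt(persona, ratios) for persona in PERSONAS}

for persona, text in prompts.items():
    print(persona, "->", text[:60], "...")
```

Each prompt would then be sent to the model as a single user message. Nothing on the optimiser side needs to change when a new persona is added.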


What Bayesian Optimisation Is Really For

The cake is fun. But the technique matters far beyond baking.

Bayesian Optimisation is one of the most widely used approaches for hyperparameter tuning — the process of finding the best configuration for a machine learning model itself. Learning rate, batch size, network depth, regularisation strength: these are the knobs that determine whether a model trains well or wastes weeks of GPU time producing something mediocre. Testing every combination is prohibitively expensive. Bayesian Optimisation finds good configurations in a fraction of the trials that random or grid search would require.

It’s used in drug discovery, materials science, engineering design, and financial model calibration — anywhere the cost of a single experiment is high and the search space is large.

The core insight is always the same: you don’t need to try everything if you learn intelligently from what you’ve already tried. Build a model of the space. Balance what you know with what you don’t. Choose your next experiment to maximise what you’ll gain from it.

That principle — careful, informed exploration of an uncertain space — is one of the most useful ideas machine learning has to offer. It just happens to also produce a very good cake.


The Recipe: Try It At Home

For the record: after fifty iterations of Bayesian Optimisation across a five-dimensional ingredient space, the system converged on a recipe that scored within 3% of theoretical perfection across all five objectives simultaneously.

Those optimised ratios — [0.375, 0.243, 0.232, 0.196, 0.188] — went into the Claude API. What came out was this.


AI-OPTIMISED CAKE RECIPE — GORDON RAMSAY STYLE

Right, listen up! Some clever AI has crunched the numbers on the perfect cake ratios, and I’m going to show you how to execute this properly. No shortcuts, no bloody mistakes — we’re making this cake PERFECT!

INGREDIENTS

  • 450g plain flour (that’s your base — don’t you dare use self-raising!)
  • 290g caster sugar (not granulated — CASTER!)
  • 280g unsalted butter, room temperature (if it’s cold, you’ve already failed!)
  • 235ml whole eggs — about 4–5 large eggs, beaten
  • 225ml whole milk, room temperature
  • 2 tsp baking powder (fresh — not that stale rubbish from 2019!)
  • 1 tsp fine sea salt
  • 2 tsp pure vanilla extract (not imitation — I’ll know!)

METHOD

  1. Preheat your oven to 180°C / 350°F. Grease and flour that 9-inch pan properly — every bloody corner.
  2. Cream the butter and sugar for 4–5 minutes until pale and fluffy. I want to see volume. If it looks flat, you’re not beating it enough.
  3. Add eggs gradually — one at a time. Rush this and it’ll curdle faster than a bad relationship. Beat well after each addition.
  4. Sift the flour, baking powder and salt together. SIFT IT. We’re not making concrete here.
  5. Alternate the dry ingredients and milk in three additions: flour, milk, flour, milk, flour. Mix until just combined. Overmix and you’ll have a tough, chewy disaster.
  6. Fold in vanilla with confidence. Don’t overwork it.
  7. Pour into the prepared pan and level properly with an offset spatula.
  8. Bake for 28–32 minutes. Test with a skewer — it should come out with just a few moist crumbs. Dry cake is unforgivable.
  9. Cool in the pan for 10 minutes, then turn out onto a wire rack.

CRITICAL POINTS

  • Room temperature ingredients are non-negotiable
  • Don’t open the oven door for the first 25 minutes or it’ll collapse
  • This AI got us precise ratios — respect them

Get it right, and you’ll have a cake worthy of service. Mess it up, and you’ll be starting over.

Beautiful.


The mathematics gave us the ratios. Gordon gave us the recipe. Between them, something genuinely useful came out — and that, as much as anything, is the point of this whole experiment.
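
As a quick sanity check, the quantities in the recipe can be compared against the optimised ratios. The sketch below assumes a single common scale factor, inferred here from the flour line because the post does not state one, and treats grams and millilitres as interchangeable for the comparison.

```python
# Winning ratios quoted in the post: flour, sugar, butter, eggs, milk.
ratios = {"flour": 0.375, "sugar": 0.243, "butter": 0.232, "eggs": 0.196, "milk": 0.188}

# Quantities from the generated recipe (grams for solids, ml for liquids).
recipe = {"flour": 450, "sugar": 290, "butter": 280, "eggs": 235, "milk": 225}

# Assumption: one common scale factor, inferred from the flour quantity.
scale = recipe["flour"] / ratios["flour"]

scaled = {name: ratio * scale for name, ratio in ratios.items()}
relative_gap = {
    name: abs(scaled[name] - recipe[name]) / recipe[name] for name in ratios
}

for name in ratios:
    print(
        f"{name:<6} scaled={scaled[name]:6.1f}  "
        f"recipe={recipe[name]:4d}  gap={relative_gap[name]:.1%}"
    )
```

Under that assumption every recipe quantity lands within about one percent of its scaled ratio, so the generated recipe appears to have rounded the numbers to kitchen-friendly values and left the proportions intact.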


The Proof Is in the Pudding

So I baked it. Exact recipe, exact measurements, no shortcuts — Gordon would expect nothing less.

The verdict? Mixed. In the best possible way.

The mixture was perfect. Texture, consistency, the way it came together in the bowl — the optimised ratios delivered exactly what the model predicted. Five numbers between zero and one, translated into real ingredients, and the mathematics held up in a real kitchen. That part worked.

The baking time didn’t. Claude said 28–32 minutes. The cake needed 45.

I caught it — checked, tested with the skewer, added time — but it was the first signal that the experiment wasn’t quite finished. Following the recipe to the letter, without judgement, would have produced something underbaked and disappointing.

And that led to the more interesting discovery: the recipe needed more than correcting. It needed refining. The original prompt asked Claude for a cake recipe and got one — technically correct, but generic. A single 9-inch layer, no context, no intended outcome. So I went back and iterated.

Specify two 8-inch tins and you get a Victoria sponge structure. Specify the occasion, the texture, the finish, and the recipe shifts accordingly. Each prompt refinement produced a more useful, more precise output. The AI wasn’t wrong — it was answering the question it had been asked. Better questions produced better answers.

This is prompt engineering: the practice of shaping what you ask an AI in order to shape what it gives back. It’s not a workaround. It’s a skill — and in this experiment, it turned out to be the third essential ingredient.

Which means the original equation needs updating:

ML + GenAI + Prompt Engineering = Cake

The machine learning optimised the ingredient ratios. The generative AI translated them into a human-readable recipe. Prompt engineering refined that recipe into something genuinely usable. And a human with an oven, a skewer and some accumulated baking knowledge held the whole thing together.

That’s not a story about AI falling short. It’s a story about what genuine human-AI partnership looks like in practice. Each part of the system — the optimiser, the language model, the prompt, the person — contributed something the others couldn’t. None of it worked without the others.

The cake, for the record, was excellent.

Gordon would probably still find something to complain about.

LLM Sizing 101 – Part 2: From Tokens Per Second to GPU Count

Flowchart illustrating LLM sizing concepts, featuring phases for prefill and decode, compute processes, throughput bridge, and metrics for tokens per second based on GPU count.

In Part 1 we established the two fundamentals: parameters define how big the model is, and tokens define how much work you’re asking it to do. Now we make it practical.

This post is about the bridge between those concepts and actual hardware — specifically, how you translate a customer’s real-world requirements (“we need to support 500 users”) into a GPU count you can put in a proposal.

The key metric that connects the two sides is tokens per second (TPS). To use it properly, you need to understand what’s actually happening inside the GPU when a model generates a response — because not all tokens are created equal.


Two Phases, Two Different Problems

When an LLM handles a request, it does so in two distinct phases. They look similar from the outside — text goes in, text comes out — but they have fundamentally different performance characteristics under the hood.

Phase 1: Prefill

This is where the model reads and processes the entire input prompt.

  • All the input tokens in your prompt are processed in parallel.
  • This phase is compute-intensive — the GPU is doing a lot of simultaneous maths.
  • It largely determines Time to First Token (TTFT): how long the user waits before they see any response at all.

Phase 2: Decode

This is where the model generates the response, one token at a time.

  • Each new token depends on the previous ones, so this phase is inherently sequential.
  • And here’s the critical insight for sizing: the decode phase is often not limited by the GPU’s raw FLOPS.
  • It’s limited by memory bandwidth — how fast the GPU can stream the model’s weights from high-bandwidth memory (HBM) to generate each token.

A quick note on FLOPS

You’ll see FLOPS quoted constantly in GPU spec sheets, so it’s worth understanding what it actually means — and where it does and doesn’t tell the full story.

FLOPS stands for Floating-Point Operations Per Second. It measures how much numerical computation a processor can perform per second. LLMs are essentially enormous stacks of matrix multiplications on floating-point numbers, so FLOPS is a natural unit for describing raw GPU compute power.

Vendors typically quote performance in:

  • TFLOPS (tera-FLOPS = 10¹²) or PFLOPS (peta-FLOPS = 10¹⁵)
  • Often broken down by precision: FP32 TFLOPS, FP16/BF16 TFLOPS, INT8 TOPS

So when you see “H100: X PFLOPS (FP16)”, that’s the peak theoretical compute at 16-bit precision — not what you’ll observe in a real LLM workload once memory access patterns, batching, and framework overhead come into play.

Here’s how FLOPS maps to the two inference phases:

  • Prefill is FLOPS-hungry. Processing all prompt tokens in parallel is a heavy matrix multiplication workload — this is where raw compute throughput matters most. Higher FLOPS directly improves prefill speed and reduces TTFT.
  • Decode is usually not FLOPS-bound. Generating tokens sequentially rarely saturates the GPU’s arithmetic units, especially at small batch sizes. The bottleneck shifts largely to memory bandwidth: how fast the GPU can stream model weights from HBM for each token generated.

This distinction matters enormously in practice: a GPU with impressive FLOPS but modest memory bandwidth can underperform for LLM inference compared to one with higher bandwidth, even if the spec sheet comparison looks favourable. It’s why memory bandwidth is often the first number to check when evaluating accelerators for inference workloads — and why the H100 SXM, with its multi-TB/s HBM3 bandwidth, consistently outperforms lower-bandwidth alternatives for decode-heavy deployments.
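
A back-of-envelope way to see why bandwidth dominates decode: if generating each token means streaming every weight from HBM once, single-sequence decode speed cannot exceed memory bandwidth divided by model size. The sketch below applies that bound. The 3,350 GB/s figure is an assumed value for H100 SXM that should be checked against the current datasheet, the multi-GPU case assumes ideal scaling with no communication cost, and the function name is illustrative.

```python
H100_SXM_BANDWIDTH_GB_S = 3350  # assumed HBM3 bandwidth per GPU; verify before quoting


def decode_tps_ceiling(params_billion, bytes_per_param, bandwidth_gb_s, num_gpus=1):
    """Upper bound on tokens/sec for ONE sequence when decode is bandwidth-bound.

    Assumes every weight is read once per generated token and, for num_gpus > 1,
    that the weights are sharded evenly with no communication overhead.
    """
    weights_gb = params_billion * bytes_per_param
    return (bandwidth_gb_s * num_gpus) / weights_gb


small = decode_tps_ceiling(7, 2, H100_SXM_BANDWIDTH_GB_S)               # 7B at FP16
large = decode_tps_ceiling(70, 2, H100_SXM_BANDWIDTH_GB_S, num_gpus=4)  # 70B at FP16

print(f"7B  FP16, 1 GPU : at most ~{small:.0f} tokens/sec per sequence")
print(f"70B FP16, 4 GPUs: at most ~{large:.0f} tokens/sec per sequence")
```

This is a ceiling for a single sequence, not a throughput forecast. Batching lets one pass over the weights serve many sequences at once, which is why system-level tokens per second can be far higher than this per-sequence figure.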


The Core Metric: Tokens Per Second

Tokens per second (TPS) is your fundamental unit of inference throughput. Everything in a sizing conversation eventually traces back to it.

There are two ways to look at TPS, and you need to keep them separate:

  • Per-user TPS — how fast tokens are delivered to a single user.
    • This drives the perceived experience.
    • Rough guide: below 10–15 tokens/sec starts to feel sluggish; above 30 tokens/sec it feels near-instant for most chat use cases.
  • System TPS — the total token output across all concurrent users.
    • This is what you’re actually sizing the hardware to sustain.

The relationship is simple in principle:

System TPS = Concurrent Users × Tokens per Second per User

In practice, batching is what makes this efficient:

  • Rather than serving each user’s request on dedicated GPU resources, a well-configured inference server groups multiple requests together and processes them as a single batch.
  • This significantly improves GPU utilisation — particularly during the memory-bandwidth-bound decode phase.
  • Batching is the primary mechanism that lets you serve many users from a relatively small GPU footprint.

Working Backwards: From Users to GPUs

Here’s the sizing workflow that turns a customer conversation into a hardware recommendation.

Step 1: Define the workload

Start with the usage-side discovery questions from Part 1:

  • How many concurrent users?
  • What’s the average prompt length (input tokens)?
  • What’s the expected response length (output tokens)?
  • What’s the acceptable latency — both time to first token (TTFT) and total response time?

A worked example

A customer wants to deploy an internal assistant. Together you define:

Parameter | Value
Concurrent users | 200
Average prompt | 500 tokens
Average response | 300 tokens
Target response time | ~10 seconds
Acceptable TTFT | < 2 seconds

Step 2: Calculate required system TPS

From the example:

  • 300 output tokens in 10 seconds = 30 tokens/sec per user
  • 200 users × 30 tokens/sec = 6,000 tokens/sec system throughput

So the platform needs to sustain ~6,000 TPS of decoded tokens under load.
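
The same arithmetic as a small helper, which makes it easy to rerun when the customer changes a number. The function name is illustrative. The optional `ttft_seconds` argument is an assumption added here: it shows a stricter reading in which the time to first token is subtracted from the response budget before the decode rate is calculated.

```python
def required_tps(concurrent_users, output_tokens, response_seconds, ttft_seconds=0.0):
    """Return (per-user TPS, system TPS) for a decode workload."""
    decode_seconds = response_seconds - ttft_seconds
    if decode_seconds <= 0:
        raise ValueError("response time must be longer than TTFT")
    per_user_tps = output_tokens / decode_seconds
    return per_user_tps, concurrent_users * per_user_tps


# Worked example: 200 users, 300 output tokens, ~10 second response.
per_user, system = required_tps(200, 300, 10)
print(per_user, system)

# Stricter reading: 2 of the 10 seconds are spent waiting for the first token.
strict_per_user, strict_system = required_tps(200, 300, 10, ttft_seconds=2)
print(strict_per_user, strict_system)
```

The first call reproduces the 30 tokens/sec per user and 6,000 TPS above. The stricter reading gives 37.5 tokens/sec per user and 7,500 TPS, a reminder to agree with the customer on what "response time" includes.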

Step 3: Establish per-GPU TPS for your chosen model

This is where model size and GPU choice meet. As a rough reference for inference at FP16 (actual figures vary with batch size, framework, and optimisation):

Model | GPU | Approx. TPS (decode, batched)
7B | H100 80GB | ~2,000–3,000
70B (tensor parallel, 4×) | 4× H100 80GB | ~800–1,200
70B (tensor parallel, 8×) | 8× H100 80GB | ~1,500–2,500

Note: these are illustrative ranges. Always validate against benchmark data for your specific model, serving framework, optimisation level (TensorRT-LLM, vLLM, etc.), and batch configuration.

Step 4: Calculate GPU or node count

Continuing the example, assume:

  • You choose a 70B model hosted on 4× H100 80GB nodes.
  • Based on benchmarks, you take a conservative estimate of 1,000 TPS per node (decode, batched).

Then:

  • 6,000 system TPS ÷ 1,000 TPS per node ≈ 6 nodes

Add a headroom buffer (typically 20–30% for burst traffic, uneven load, and future growth):

  • 6 nodes × 1.25 = 7.5, rounded up to 8 nodes as a starting recommendation.

At this point, you have a defensible answer to “how many GPUs/nodes do we need?” that’s grounded in user requirements, not just “bigger is better.”
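
Step 4 can be sketched as a single function. The name `nodes_required` is illustrative, and the inputs are the worked example's own figures: 6,000 system TPS, a conservative 1,000 TPS per 4-GPU node, and 25% headroom. Rounding up matters because you cannot deploy a fraction of a node.

```python
import math


def nodes_required(system_tps, tps_per_node, headroom=0.25):
    """Nodes needed to sustain system_tps, with headroom, rounded up."""
    base_nodes = system_tps / tps_per_node
    return math.ceil(base_nodes * (1.0 + headroom))


GPUS_PER_NODE = 4  # the 4x H100 80GB node from the example

nodes = nodes_required(6000, 1000, headroom=0.25)
gpus = nodes * GPUS_PER_NODE
print(f"{nodes} nodes, {gpus} GPUs")

# Sensitivity check across the 20-30% headroom range mentioned above.
for headroom in (0.20, 0.25, 0.30):
    print(headroom, nodes_required(6000, 1000, headroom))
```

For this example the answer is 8 nodes (32 GPUs) anywhere in the 20–30% headroom range, so the recommendation is not sensitive to the exact buffer chosen.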


Reference Sizing: Two Common Scenarios

The worked example above walks through the methodology. The table below applies it to two reference architectures you’ll encounter regularly — a 7B internal assistant and a 70B RAG system — to give you a practical feel for how the numbers land.

Figures assume FP16 or INT8 precision, batched inference, and a well-optimised serving framework such as TensorRT-LLM or vLLM. Treat these as directional reference points, not guaranteed benchmarks — validate against your specific model, configuration, and workload before quoting.

Aspect | Ref A: 7B Internal Assistant | Ref B: 70B RAG System
Typical use case | Employee Q&A, productivity assistant | Legal/finance/engineering RAG over proprietary data
Model size | 7B | 70B
Quality vs cost | “Good enough” quality, cost-optimised | Higher quality, domain-heavy reasoning
Concurrency (peak) | ~500 users | ~100 users
Avg prompt (input) | ~400 tokens | ~2,000 tokens (incl. retrieved context)
Avg response (output) | ~250 tokens | ~500 tokens
Latency target | 8–10 s total, TTFT < 2 s | 12–15 s total, TTFT < 3 s
System TPS target | 12,500 TPS (decode) | 3,300 TPS (decode)
Precision (typical) | INT8 / mixed (weights) | FP16 / mixed, selective quantisation
GPUs per node (typical) | 3–4 GPUs per PowerEdge node | 8 GPUs per PowerEdge XE-class node
Nodes (illustrative) | 3–4 nodes (total ~12 GPUs, incl. headroom) | 3 nodes (24 GPUs total, incl. headroom)
Interconnect focus | Good PCIe + 25–100 GbE | NVLink/NVSwitch + 100–400 Gb fabric
Workload pattern | High concurrency, chat-like | Lower concurrency, long prompts, RAG + heavier reasoning
Sizing conversation hook | “Maximise users per GPU, acceptable quality” | “Maximise quality on key workflows, moderate concurrency”

Notice the counterintuitive result: the smaller 7B model actually demands nearly four times the system throughput of the 70B RAG system (≈12,500 TPS vs ≈3,300 TPS). That’s not a contradiction — it’s the concurrency effect. Serving 500 chat users simultaneously generates far more aggregate token output than 100 users running deep reasoning queries, even though each individual 7B response is shorter. Bigger model doesn’t always mean bigger infrastructure footprint; workload pattern matters just as much as parameter count.
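
The two System TPS targets can be reproduced from the table's own inputs. The sketch below assumes the upper end of each latency range (10 seconds for Ref A, 15 seconds for Ref B), which is the reading that matches the quoted figures; the dictionary layout and function name are illustrative.

```python
scenarios = {
    "Ref A: 7B internal assistant": {
        "users": 500, "output_tokens": 250, "response_seconds": 10,
    },
    "Ref B: 70B RAG system": {
        "users": 100, "output_tokens": 500, "response_seconds": 15,
    },
}


def system_tps(users, output_tokens, response_seconds):
    """Aggregate decode throughput needed across all concurrent users."""
    return users * output_tokens / response_seconds


targets = {name: system_tps(**inputs) for name, inputs in scenarios.items()}
ratio = targets["Ref A: 7B internal assistant"] / targets["Ref B: 70B RAG system"]

for name, tps in targets.items():
    print(f"{name}: ~{tps:,.0f} TPS")
print(f"Ref A needs ~{ratio:.2f}x the throughput of Ref B")
```

Ref A comes out at 12,500 TPS and Ref B at roughly 3,333 TPS, a ratio of 3.75. Using the lower end of each latency range would raise both targets, so it is worth stating which end of the range a quote is based on.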

A few additional assumptions behind these figures worth keeping front of mind:

  • No fine-tuning overhead — these are inference-only configurations. If the customer plans on-premises fine-tuning, GPU and memory requirements increase substantially.
  • Steady-state load — the node counts include a 20–30% headroom buffer but assume reasonably predictable peak concurrency. Highly bursty workloads (e.g. end-of-day batch spikes) may warrant additional headroom or an autoscaling strategy.
  • Single-tenant deployment — figures assume dedicated GPU resources per workload. Multi-model or multi-tenant deployments require separate sizing treatment.
  • Retrieved context included — the 70B RAG prompt size of ~2,000 tokens already includes retrieved document chunks. If retrieval quality improves and chunk sizes grow, prompt tokens — and therefore TTFT — will increase accordingly.

The Trade-offs Every Customer Faces

Once you’ve run this exercise with a customer, three trade-off conversations typically follow.

1. Model quality vs. throughput

  • A 70B model usually produces higher-quality outputs than a 7B.
  • But it also serves far fewer users per GPU.
  • For some use cases — summarising legal documents, writing complex code, specialised reasoning — the quality premium is worth it.
  • For a high-volume customer service assistant, a well-tuned 7B model might deliver better economics with acceptable quality.

2. Latency vs. concurrency

  • Larger batch sizes improve GPU utilisation and system throughput, but they increase the time an individual request spends waiting to join a batch.
  • If TTFT is critical (live chat, voice interfaces), you’ll accept lower utilisation to keep batches small and responsive.
  • If the application is asynchronous (batch document processing, offline analytics), you can run large batches, push utilisation higher, and drive down cost per request.

3. Precision vs. memory footprint

  • Running a model at FP16 gives you full quality but also the full VRAM cost.
  • Quantising to INT8 or INT4 roughly halves or quarters the memory footprint, allowing either a larger model to fit in the same GPUs, or the same model to fit in fewer GPUs.
  • There is a quality trade-off, but for many inference workloads, well-done INT8 quantisation offers an excellent quality-to-cost ratio and is worth including in the conversation.
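
To put numbers on the precision trade-off, the sketch below estimates the weights-only footprint of a 70B model at three precisions and the minimum number of 80 GB GPUs needed just to hold those weights. It ignores KV cache, activations and runtime overhead, so real deployments need more than this floor; the names are illustrative.

```python
import math

BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weights_footprint_gb(params_billion, precision):
    """Weights-only memory in GB: billions of parameters x bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[precision]


def min_gpus_for_weights(params_billion, precision, gpu_vram_gb=80):
    """Fewest GPUs whose combined VRAM can hold the weights alone."""
    return math.ceil(weights_footprint_gb(params_billion, precision) / gpu_vram_gb)


footprints = {p: weights_footprint_gb(70, p) for p in BYTES_PER_PARAM}
gpus = {p: min_gpus_for_weights(70, p) for p in BYTES_PER_PARAM}

for precision in BYTES_PER_PARAM:
    print(f"70B @ {precision}: ~{footprints[precision]:.0f} GB weights, "
          f"at least {gpus[precision]} x 80 GB GPU(s)")
```

At FP16 the weights alone exceed a single 80 GB GPU. At INT8 they fit with only around 10 GB to spare, which is why the memory left over for KV cache often decides whether a quantised model really fits on fewer GPUs.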

What This Means for Platform Selection

By this point in a customer conversation, you have enough to make an informed platform recommendation.

Single-node, lower concurrency, 7B–13B models

A PowerEdge server with 2–4 high-memory GPUs will typically cover the requirement, with room to scale up or out as usage grows.

Multi-node, higher concurrency, or 70B+ models

You’re looking at GPU-dense platforms where high-speed interconnect between GPUs (NVLink, NVSwitch) and network fabric between nodes become as important as raw GPU count. These directly affect prefill and decode performance, and therefore both latency and throughput.

Mixed workloads (inference + fine-tuning)

Fine-tuning demands significantly more memory per GPU than inference alone (optimizer states, gradient storage, larger activations). If a customer plans both, size for fine-tuning; the inference requirement is typically covered as a result.

The specific Dell platform mapping — PowerEdge XE series, GPU configurations, and interconnect options — is what we’ll build out in Part 3.


Next up: Part 3 — Platform and GPU selection: mapping your sizing to Dell PowerEdge XE configurations. We’ll take the TPS-based sizing approach from this post and show how it translates into concrete server configs you can quote.


LLM Sizing 101 – Part 1: Tokens and Parameters

Infographic explaining tokens and parameters in large language models. It includes definitions, examples, and a chart that illustrates the sizing problem related to tokens and parameters.

Every week, another organisation announces it’s deploying a large language model. And every week, a Technical Architect or Pre-sales Engineer gets asked a version of the same question: “How much infrastructure do I actually need for this?”

In my days as a Data Centre Architect and Engineer, I’d size server clusters for databases and VMware environments. The maths was different, but the discipline was the same: understand the workload, match it to the hardware, justify the recommendation. Now the question I get asked is “How do I size for LLMs?” This blog series is all about answering that.

Before you can answer that — before GPUs, nodes, interconnects, or platform choices even enter the conversation — you need two concepts nailed down cold: tokens and parameters. They’re the two dials that drive every LLM server sizing decision you’ll ever make.

Think of it this way. Parameters tell you how big the engine is. Tokens tell you how hard you’re asking it to work. Get those two right, and the rest of the sizing conversation falls into place.


Tokens: The Currency of Language Models

LLMs don’t read sentences the way you do. They don’t even read words. They read tokens — small chunks of text that sit somewhere between a syllable and a word.

  • Sometimes a token is a whole word: server
  • Sometimes it’s a fragment: serv, er
  • Sometimes it’s punctuation or whitespace: “.”, “,” or a single space

For English text, a useful rule of thumb is:

1 token ≈ 3–4 characters, or roughly 0.75 of a word

So the sentence “This is a sizing test.” runs to about 6–7 tokens — not 5, because the model doesn’t count words.
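
The rule of thumb is easy to apply in code. The sketch below is only an estimator built from the two ratios above; it is not a tokenizer, and real counts depend on the model's vocabulary. The function name is illustrative.

```python
def estimate_tokens(text):
    """Estimate token count from the 3-4 characters and 0.75 words rules of thumb.

    Returns ((low, high) estimate from characters, estimate from words).
    """
    chars = len(text)
    words = len(text.split())
    by_chars = (chars / 4, chars / 3)  # 1 token is roughly 3-4 characters
    by_words = words / 0.75            # 1 token is roughly 0.75 of a word
    return by_chars, by_words


by_chars, by_words = estimate_tokens("This is a sizing test.")
print(f"by characters: {by_chars[0]:.1f} to {by_chars[1]:.1f} tokens")
print(f"by words:      {by_words:.1f} tokens")
```

For the example sentence (22 characters, 5 words) both rules land between roughly 5.5 and 7.3 tokens, consistent with the 6–7 estimate. For sizing work, use the target model's actual tokenizer once the model is chosen.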

When you see pricing or performance metrics quoted in the market, they’re always denominated in tokens:

  • $X per 1,000 tokens
  • Y tokens per second
  • 4k / 8k / 32k / 128k context window

That last one matters a lot. The context window is the maximum amount of text — measured in tokens — the model can hold in view at once. It’s not just the question you asked; it includes everything: system instructions, conversation history, documents you’ve fed in, and the response being generated. Every token in that window costs compute and memory.

Why Tokens Drive Sizing

Tokens show up in three places in every sizing conversation:

1. Context length (the prompt window)

Longer context means the model has to track more information simultaneously. That translates directly into more VRAM for the KV cache, the memory structure the model uses to keep track of what it’s already processed. A customer who wants 128k-token context windows needs significantly more memory per request than one running at 4k.

2. Throughput and concurrency

Tokens per second is the fundamental throughput metric, whether per GPU, per node, or per cluster. In practice, you’ll often work backwards from a customer’s requirements:

“We need to support 500 concurrent users, each generating responses of around 300 tokens, within 3 seconds.”

That’s a tokens-per-second and concurrency problem. Everything else follows from it.

3. Capacity and cost planning

Whether on-premises or cloud, consumption is effectively input tokens + output tokens. On a Dell PowerEdge server deployment, higher sustained tokens per second means more GPU compute, more memory bandwidth, and, beyond a certain point, more nodes or a move to higher-end accelerators.
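
The context-length point can be made concrete with the standard KV cache estimate: two tensors (keys and values) per layer, each holding one vector per KV head for every token. The architecture numbers below (32 layers, 32 KV heads, head dimension 128, 2-byte values) are an assumed, illustrative 7B-class configuration and do not describe any specific model in this series.

```python
GIB = 1024 ** 3


def kv_cache_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache added by one token of context."""
    # Factor of 2: one key vector and one value vector per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_gib(context_tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """KV cache size in GiB for a single request at the given context length."""
    per_token = kv_cache_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value)
    return context_tokens * per_token / GIB


config = dict(layers=32, kv_heads=32, head_dim=128)  # hypothetical architecture

short_context = kv_cache_gib(4 * 1024, **config)
long_context = kv_cache_gib(128 * 1024, **config)

print(f"4k context:   ~{short_context:.0f} GiB per request")
print(f"128k context: ~{long_context:.0f} GiB per request")
```

Under these assumptions a single 4k request needs about 2 GiB of KV cache and a 128k request about 64 GiB, before multiplying by the number of concurrent requests. Models that use fewer KV heads than attention heads (grouped-query attention) need proportionally less, so check the architecture before quoting.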


Parameters: The Size of the Brain

If tokens are the currency of language models, parameters are what you’re buying with your hardware budget.

A parameter is a learned numeric weight — a floating-point number — stored inside the model. Mathematically, an LLM is an enormous function, and parameters are the numbers that define it. When a model is trained, billions of these weights are adjusted, incrementally, until the model gets reliably good at predicting language.

This is why model names look the way they do:

  • 7B → approximately 7 billion parameters
  • 13B, 34B, 70B, 405B → and so on up the scale

More parameters generally mean greater model capability — the model can represent more complex patterns, handle more nuanced reasoning, and produce higher-quality output. But that capability comes at a direct hardware cost, because every parameter has to live somewhere.

The VRAM Equation

The first-order estimate for model memory is straightforward:

Model VRAM (GB) ≈ Parameters (in billions) × Bytes per Parameter

In practice:

Model | Precision | Weights-only VRAM
7B | FP16/BF16 (2 bytes) | ~14 GB
70B | FP16/BF16 (2 bytes) | ~140 GB

That’s just for the weights themselves. In a real deployment you also need memory for:

  • KV cache — grows with context length and batch size
  • Activation memory — the working memory during computation
  • Optimizer states — relevant if you’re fine-tuning, not just inferencing
  • Runtime overhead — fragmentation, safety layers, serving framework
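
One practical way to use the weights-only estimate is to ask how much VRAM is left over for everything else in that list. The sketch below does just that; the function names and the GPU sizes chosen are illustrative, and it makes no attempt to estimate the overheads themselves.

```python
def weights_gb(params_billion, bytes_per_param):
    """First-order weights-only estimate: billions of parameters x bytes each."""
    return params_billion * bytes_per_param


def vram_left_gb(params_billion, bytes_per_param, total_vram_gb):
    """VRAM remaining for KV cache, activations and overhead (negative = no fit)."""
    return total_vram_gb - weights_gb(params_billion, bytes_per_param)


cases = {
    "7B FP16 on 1 x 24 GB": vram_left_gb(7, 2, 24),
    "7B FP16 on 1 x 80 GB": vram_left_gb(7, 2, 80),
    "70B FP16 on 1 x 80 GB": vram_left_gb(70, 2, 80),
    "70B FP16 on 4 x 80 GB": vram_left_gb(70, 2, 4 * 80),
}

for name, left in cases.items():
    verdict = "weights do not fit" if left < 0 else f"~{left:.0f} GB left"
    print(f"{name}: {verdict}")
```

A 7B model at FP16 leaves about 10 GB free on a 24 GB GPU, which can be tight for long contexts or large batches, while a 70B model at FP16 cannot fit its weights on a single 80 GB GPU at all.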

The practical consequence is clear:

  • A 7B model can typically run on a single high-memory GPU (24–80 GB class, depending on precision and context requirements).
  • A 70B model generally needs multiple high-VRAM GPUs — and demands fast interconnects between them, whether that’s NVLink, NVSwitch, or a high-bandwidth PCIe fabric.

As a pre-sales engineer, parameter count is what you’ll map to platform choices: how many GPUs per node, whether you need a GPU-dense platform with a high-speed fabric, and whether the workload fits in a 2U form factor or needs something more substantial.


How the Two Interact

Here’s the mental model worth keeping front of mind for every sizing conversation:

Concept | What it measures | What it drives
Parameters | How big the model is | VRAM requirement, compute per token, hardware footprint
Tokens | How much work you’re asking it to do | Throughput, latency, concurrency, context memory

Given a fixed GPU budget, customers are always navigating a trade-off:

  • Bigger model (more parameters) versus more throughput (more tokens per second)
  • Longer context windows (more tokens per request) versus more concurrent requests

There’s no universally right answer — it depends on the use case. A legal document analysis platform that processes 100k-token contracts needs a very different configuration from a customer service chatbot handling hundreds of short, concurrent sessions.


Turning This Into a Sizing Conversation

When you strip away the jargon, most customer LLM questions reduce to this:

“Given a model size (parameters) and an expected usage pattern (tokens), how many GPUs and servers do I need to hit my latency and concurrency targets?”

The discovery questions that unlock that answer fall into two groups:

Model side:

  • Are you targeting a 7B, 13B, 70B, or larger class model?
  • Are you planning full precision, mixed precision, or quantised deployment?
  • Is this inference only, or do you also plan fine-tuning?

Usage side:

  • What’s the average prompt size per request (in tokens)?
  • What’s the expected response length?
  • What’s the maximum context length required — 8k, 32k, 128k?
  • How many concurrent users do you need to support, and at what latency?

Once you have those answers, you can map them to Dell platforms — GB10, GB300, PowerEdge XE and XE+ GPU servers, interconnect choices, cluster configurations — in a structured and defensible way. That’s exactly what we’ll build up in the posts that follow.


Next up: Part 2 — from tokens per second to GPU count: the maths that drive inference sizing.