Part 1: From Clusters to Predictions – How Clare Chose the Right Algorithm

A flowchart illustrating an algorithm decision logic map and confusion matrix, detailing steps for selecting algorithms based on supervised or unsupervised learning, task type, data size, interpretability, and corresponding algorithms such as linear regression, random forest, and neural networks, accompanied by a confusion matrix showing true positives, false negatives, false positives, and true negatives.

As I’ve studied AI and Machine Learning, I’ve come to realise it’s called data science for a reason. And that reason is simple: it’s not easy. Finding the right signal in the noise takes a genuinely analytical approach, logical thinking, and a willingness to sit with uncertainty before reaching for an answer. There’s no shortcut through the discipline. I’ve learned that the hard way more than once, working on assignments late into the night. In this post, our data scientist hero Clare faces exactly that challenge. She has her clusters, she has her labels, and now she has to decide how supervised learning, and which algorithm, will predict wine quality with the fewest errors.


The Work Doesn’t End With the Clusters

When we last left Clare, she had done something quietly impressive. Working through the winery’s historical tasting data with an unsupervised learning approach, she had let the data organise itself. No labels, no guidance — just structure emerging from pattern. Three clusters had formed, and the winemaker had given them names that meant something: Premium, Standard, and Reject.

That was discovery. What Clare needs now is prediction.

The distinction matters more than it first appears. Unsupervised learning asks the data what it contains. Supervised learning asks a model to learn from what the data already knows, so it can make decisions about data it has never seen before. Clare’s clusters gave her the labels. Now those labels become the target. The question sitting on her desk is straightforward to state and genuinely hard to answer: given the measurable chemical properties of a new vintage, can a model predict its quality grade before it ever reaches the tasting panel?

That shift — from discovering structure to predicting outcomes — is the conceptual foundation of supervised learning. The model is given inputs and known outputs. It learns the mapping between them. Then it applies that mapping to inputs where the output is unknown. The discipline lies in how rigorously that estimation is made.

Clare opens her notebook. The data is ready. The labels exist. Now she has to make some decisions.


Reading the Target

The first decision in supervised learning is not which algorithm to use. It is what kind of thing you are trying to predict.

Clare’s target variable is wine quality grade: Premium, Standard, or Reject. That is not a number on a continuous scale. It is a category. And that single observation rules out an entire family of approaches before she has written a line of code.

Linear regression predicts continuous numerical outputs — price, temperature, yield, revenue. It is the right tool when the answer sits anywhere along a spectrum. Classification algorithms predict discrete categorical outputs. They assign an observation to one of a defined set of classes. Clare’s problem is a multi-class classification problem. Linear regression is off the table.

This is not a trivial distinction. Applying linear regression to a categorical target produces nonsense. Treating category labels as if they were ordered numbers — as if Reject equals one, Standard equals two, and Premium equals three — imposes a mathematical relationship the data does not actually contain. The model learns the wrong thing. Weak conceptual understanding at this stage leads to flawed modelling decisions later, and Clare knows it.

She writes classification at the top of her notebook and moves on.


The Logic Map

Professional guidance frameworks for algorithm selection begin with a logic map. The questions are simple. The discipline is in answering them honestly.

Is the problem supervised or unsupervised? Clare’s problem is supervised. She has labelled data. The clusters gave her that.

Is the target continuous or categorical? Categorical. Three classes. Classification confirmed.

How many samples are available? Clare’s dataset contains 1,599 wines — modest by enterprise standards. That steers her toward algorithms that generalise well on smaller datasets rather than those demanding millions of observations.

Are the features interpretable? Clare will need to explain her model’s recommendations to the winemaker. If the model flags a batch as Reject, there will be a conversation. Interpretability is not a nice-to-have. It is a functional requirement.

Is computational cost a constraint? At 1,599 wines and eleven features, this is a modest workload — Clare’s Dell Precision 7 laptop handles the full pipeline without breaking a sweat. Compute is not the bottleneck here. That matters, because it means her selection criteria can focus entirely on what the model learns and how well it explains itself, rather than what the hardware can afford to run.

Each answer narrows the field. By the time Clare reaches the end of the logic map, she is not choosing between dozens of algorithms. She is choosing between a handful of serious candidates.
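The logic map can be written down as a small function. This is a minimal sketch rather than a standard rule: the function name, the 10,000-sample threshold, and the candidate lists are all assumptions chosen for illustration.

```python
def shortlist(supervised, target, n_samples, needs_explanation,
              small_data_threshold=10_000):
    """Walk the logic map and return candidate algorithm families.

    The threshold and the candidate lists are illustrative assumptions.
    """
    if not supervised:
        return ["k-means", "hierarchical clustering"]
    if target == "continuous":
        return ["linear regression", "random forest regressor"]
    if target != "categorical":
        raise ValueError("target must be 'continuous' or 'categorical'")

    candidates = ["logistic regression", "decision tree",
                  "random forest", "k-nearest neighbours"]
    # Data-hungry, opaque models only join the list when there is plenty
    # of data and nobody needs to follow the model's reasoning.
    if n_samples >= small_data_threshold and not needs_explanation:
        candidates.append("neural network")
    return candidates


# Clare's answers to the logic map questions.
clare = shortlist(supervised=True, target="categorical",
                  n_samples=1599, needs_explanation=True)
print(clare)
```

With Clare’s answers the neural network never makes the list, because both the sample count and the need for an explanation rule it out.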


Shortlisting the Candidates

Clare’s shortlist takes shape around four candidates, each considered on its own terms.

Logistic regression is the natural starting point. It is interpretable, computationally inexpensive, and well understood. For a three-class problem it extends cleanly through a one-versus-rest approach. Its coefficients can be read directly — Clare can tell the winemaker that a unit increase in volatile acidity pushes a wine toward Reject with a quantifiable effect. The limitation is that logistic regression assumes a roughly linear relationship between features and the log-odds of class membership. If the true decision boundary is more complex, the model will underfit.

Decision trees offer a different kind of interpretability. The model produces a flowchart of decisions the winemaker can follow without any statistical training. They handle non-linear boundaries well and make no assumptions about feature distributions. The problem is instability — small changes in the training data can produce substantially different trees, and a single decision tree overfits easily.

Random forest addresses that instability by building many trees and aggregating their votes. It is more robust, typically more accurate, and handles real-world data noise gracefully. The trade-off is interpretability. The winemaker cannot follow the logic of a hundred trees simultaneously.

K-nearest neighbours classifies a new observation by finding the most similar training examples and taking a majority vote. It is intuitive but sensitive to irrelevant features and offers no explanatory power. Clare sets it aside.
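The majority vote is simple enough to sketch in a few lines of plain Python. The feature pairs below are made up for illustration and are not rows from Clare’s dataset; the function name `knn_predict` is likewise an assumption.

```python
from collections import Counter
import math


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs; distance is Euclidean.
    """
    by_distance = sorted(train, key=lambda row: math.dist(row[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Made-up (alcohol, volatile acidity) pairs, for illustration only.
train = [
    ((12.5, 0.30), "Premium"),
    ((12.1, 0.35), "Premium"),
    ((10.5, 0.50), "Standard"),
    ((10.2, 0.55), "Standard"),
    ((9.4, 0.80), "Reject"),
    ((9.2, 0.90), "Reject"),
]

print(knn_predict(train, (12.0, 0.40)))
```

Because the distance treats every feature equally, an unscaled or irrelevant column can dominate the vote. And the output is only a label: there is no coefficient or rule to show the winemaker.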

Then there is the question that sits at the edge of her shortlist. Deep neural networks are capable of extraordinary things. But Clare’s dataset is modest, and the winemaker is waiting for an explanation. Neural networks are opaque by nature: the internal representations they learn do not translate into human-readable reasoning. Clare notes them for future reference and removes them from consideration.

She circles random forest as her primary candidate, with logistic regression as a baseline to measure against. The reasoning is sound. The decision is defensible. But before she commits, she needs to know whether her model actually works.


Did It Actually Work?

A model that produces predictions is not necessarily a model that produces good predictions. Clare runs both classifiers on a held-out test set — 320 wines the models have never seen during training — and looks at what comes back.

The first instinct is to reach for overall accuracy. Logistic regression scores 61%. Random forest scores 75%. Random forest wins — but Clare has learned not to stop there.

Her dataset is imbalanced. Of the 1,599 wines, 46.5% are Reject, 39.9% are Standard, and just 13.6% are Premium. A model that never predicted Premium at all could still be right on up to 86.4% of wines, a deceptively high accuracy figure for something commercially useless. The warning is in the numbers before the model even runs.
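The arithmetic behind that warning takes a few lines. The class shares are the ones quoted above; the variable names are only for this sketch.

```python
# Class shares quoted in the text for the 1,599 wines.
shares = {"Reject": 0.465, "Standard": 0.399, "Premium": 0.136}

# Accuracy of a model that always predicts the largest class.
always_majority = max(shares.values())

# Best accuracy available to a model that never predicts Premium:
# right on every Reject and Standard wine, wrong on every Premium wine.
never_premium_ceiling = shares["Reject"] + shares["Standard"]

print(f"always-majority accuracy: {always_majority:.1%}")
print(f"never-Premium ceiling:    {never_premium_ceiling:.1%}")
```

The second figure is a ceiling, not a prediction of what any real model scores. It shows how much accuracy is available to a model that has a Premium recall of zero.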

The confusion matrix gives her a more honest picture. It lays out, for each actual class, how many examples the model correctly identified and how many it misclassified — and crucially, what it misclassified them as. From it, two metrics earn Clare’s attention.

Precision asks: of all the wines the model predicted as Premium, how many actually were Premium? Recall asks: of all the wines that actually were Premium, how many did the model correctly identify?
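Both questions can be answered straight from the confusion matrix. Here is a minimal sketch in plain Python, using ten made-up wines rather than Clare’s test set; the function names are assumptions for this example.

```python
LABELS = ["Reject", "Standard", "Premium"]


def confusion_matrix(actual, predicted, labels=LABELS):
    """Rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def per_class_metrics(matrix, labels=LABELS):
    """Return {label: (precision, recall, f1)} from a confusion matrix."""
    metrics = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        predicted_as = sum(row[i] for row in matrix)  # column total
        actually_is = sum(matrix[i])                  # row total
        precision = tp / predicted_as if predicted_as else 0.0
        recall = tp / actually_is if actually_is else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = (precision, recall, f1)
    return metrics


# Ten made-up wines, for illustration only.
actual = ["Reject", "Reject", "Reject", "Standard", "Standard",
          "Standard", "Standard", "Premium", "Premium", "Premium"]
predicted = ["Reject", "Reject", "Standard", "Standard", "Standard",
             "Reject", "Premium", "Premium", "Standard", "Standard"]

matrix = confusion_matrix(actual, predicted)
for label, (p, r, f1) in per_class_metrics(matrix).items():
    print(f"{label:<9} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In this toy example, half of the wines predicted as Premium really are Premium, but only one of the three genuine Premium wines is found. That is the pattern to look for: reasonable precision hiding poor recall.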

The results are illuminating. Logistic regression finds only 26% of actual Premium wines — a recall of 0.26. Three quarters of the winery’s best bottles are being miscalled as Standard. Random forest improves this substantially, reaching a Premium recall of 0.58. Still not perfect, but it is now finding more than half the genuine Premium wines rather than missing most of them.

Reject performance is strong across both models — F1 of 0.73 for logistic regression, 0.81 for random forest. The majority class, with the most training examples, is the easiest to learn.

Standard sits in the middle, chemically and statistically. It is the hardest class to call cleanly, and both models reflect that. The soft boundaries the unsupervised clustering revealed — a silhouette score of 0.1892, indicating genuine overlap rather than clean separation — are still present in the supervised results. The models did not invent that difficulty. It was always in the data.

The winemaker asks Clare a pointed question: which error costs more — calling a Premium wine Standard and pricing it down, or calling a Standard wine Premium and disappointing a customer? That question does not have a statistical answer. It has a business answer. And it is exactly where algorithm evaluation and operational reality meet.
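One way to bring that business answer into the evaluation is to put a price on each kind of mistake. The sketch below is purely hypothetical: the cost figures and error counts are invented to show the mechanics, and the real numbers would have to come from the winemaker.

```python
# Hypothetical cost, in arbitrary units per bottle, of each kind of mistake.
# Keys are (actual, predicted); correct calls cost nothing.
COST = {
    ("Premium", "Standard"): 8.0,  # good wine priced down
    ("Standard", "Premium"): 5.0,  # disappointed customer
}


def total_cost(errors, cost=COST):
    """Sum the business cost of a model's mistakes.

    `errors` maps (actual, predicted) pairs to how often they occurred.
    Pairs without an entry in `cost` are treated as free.
    """
    return sum(count * cost.get(pair, 0.0) for pair, count in errors.items())


# Two made-up models with the same number of mistakes, distributed differently.
model_a = {("Premium", "Standard"): 30, ("Standard", "Premium"): 10}
model_b = {("Premium", "Standard"): 10, ("Standard", "Premium"): 30}

print(total_cost(model_a), total_cost(model_b))
```

Both invented models make forty mistakes, so accuracy cannot separate them. The cost table can, and changing the two numbers in it can reverse the ranking.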

Clare documents both models. The random forest performs better across all three classes. But before she declares a winner, she runs one more check.


What Clare Learned

Clare closes her Jupyter Notebook with something that took longer to arrive than she expected: not confidence in the model, but confidence in the process.

Algorithm selection is not a guess dressed up in technical language. It begins with understanding what the data contains and what the prediction task actually requires. It follows a logic that can be stated, examined, and defended. It weighs interpretability, computational cost, and domain requirements alongside statistical performance. And it ends not with a declaration that the model is correct, but with evidence — real numbers, per class, honestly read.

The feature importance analysis added one final thread. The random forest’s three most influential predictors were alcohol, sulphates, and volatile acidity — in that order. These were precisely the features that defined the unsupervised clusters in the previous analysis. The supervised model, trained independently and without knowledge of the clustering work, arrived at the same conclusion the clusters had already reached. The data gave the same answer twice, through two entirely different methods. That is not coincidence. That is signal.

Supervised learning is not a collection of tools. It is a way of thinking about prediction — rigorously, honestly, and always in service of a decision that someone, somewhere, actually needs to make.

The data told Clare what the wine was. The model learned to say it again, about wine it had never tasted.

In Part 2, Clare’s model leaves the winery and enters environments where the stakes are higher, the constraints are harder, and the bias-variance results raise a question she wasn’t expecting.

AI Winter is Coming – Again

Infographic depicting the evolution of artificial intelligence from the 1980s to the present, highlighting key eras such as AI Winter, Quiet AI/Edge AI, and Frontier AI. Includes illustrations of mainframe computers, decentralized architecture, and hyperscale data centers, along with notes on investment, data availability, and technology challenges.

AI Winter is nothing new. We’ve been there before. The last AI winter ran from 1987 to 1993. Businesses over-invested in “expert systems”: expensive, specialised computers designed to mimic corporate decision-making. When these systems proved too difficult to maintain and update, the commercial market for AI collapsed. This time winter may come because Frontier AI is hitting a scaling ceiling it can’t engineer its way past. This post suggests the answer isn’t bigger. It’s smaller. And the proof of concept has been running quietly for twenty years.

If you want an overview of how AI actually developed — from 1950s foundations through to today — I’ve covered that in AI Didn’t Come From Nowhere. The short version: AI has been building for seventy years. ChatGPT didn’t start it. It just made it visible.

The Clock Got Reset

When ChatGPT launched in late 2022 it did something remarkable and something damaging simultaneously.

Remarkable — it made AI legible to everyone. For the first time, a non-technical person could sit down, ask a question, and feel the capability directly. That democratisation matters. It changed the conversation permanently.

Damaging — it reset the public clock to zero. Suddenly AI had a birthday. And a generation of business leaders started building AI strategies as if nothing had existed before that date.

I meet smart, senior people regularly who genuinely believe AI started with ChatGPT. That 2023 was year one.

It wasn’t. Not even close.

This is what I’ve called the Magpie Effect — organisations distracted by the latest glittering capability, dropping whatever was already working to chase the new thing. The frontier AI conversation feeds it perfectly. New model. New benchmark. New promise. The magpie pivots. The roadmap restarts. The quiet AI that was already delivering sits ignored in the corner.

But before we get to what came before, we need to talk about where frontier AI is heading. Because the ceiling that’s approaching makes the Magpie Effect not just wasteful — but genuinely risky.

The Scaling Wall

The Frontier AI story has been built on a single premise: scale it out and it gets better. Bigger models. More parameters. Larger data centres. More powerful GPUs. The scaling hypothesis held, the results were real, and the industry committed.

But scaling isn’t a strategy. It’s a phase. The laws of physics and economics always impose limits, and this phase is approaching a physical wall that no amount of engineering enthusiasm will move.

A single large-scale AI training run consumes energy comparable to a small town. Inference at scale runs continuously: millions of queries, every day, globally, without pause. The major cloud providers are signing nuclear power agreements not because nuclear is cheap or fast, but because conventional grid capacity cannot keep pace with demand. Microsoft has signed a long-term deal to bring a reactor at Three Mile Island back into service. Amazon is investing in small modular reactors. They have run out of other options.

Data centre construction can’t keep up. From planning decision to operational facility takes four to seven years in most developed jurisdictions. Grid connection queues in the UK, the US, and across the EU are already measured in years. Water rights for cooling are becoming a site selection constraint. Liquid cooling manages the thermal symptom. It does not touch the energy consumption problem.

The ceiling is lower than the industry’s public projections suggest. The people closest to the infrastructure already know it. The investment cycle is still running hot enough that saying so publicly is commercially inconvenient.

Efficiency gains — model distillation, inference optimisation, new chip architectures — will buy time. They won’t move the ceiling. The physics and the economics don’t negotiate.

As I write this, I use Claude daily. Claude for many of us has become a genuine thinking partner — including in writing this post. I’m not dismissing Frontier AI. I’m being honest about its infrastructure cost and scalability limits. Even Claude will hit this ceiling. That’s not a criticism. It’s physics, it’s economics. It’s a practical viewpoint.


The Transistor Parallel

We’ve been here before.

In the 1940s the valve computer proved the concept. ENIAC, Colossus — genuinely capable machines. But enormous, power hungry, heat generating, and impossible to scale to meet the demand that was already visible. The resource constraint was real. The engineering community knew it.

The answer wasn’t a bigger valve. It was the transistor.

The transistor didn’t just miniaturise the valve. It changed the trajectory entirely. Compute escaped the specialist room. It went into consumer products, industrial equipment, vehicles, eventually into every pocket on the planet. Nobody called it a compromise. The smartphone was always a better destination than a room full of valves — we just couldn’t see it from inside the valve paradigm.

Frontier AI is the valve computer. Capable. Proven. Impressive. And running into the same kind of physical ceiling that made the transistor not just desirable but inevitable.

The transistor moment for AI is smaller, distributed, efficient models — running where the work actually happens, without depending on infrastructure the planet cannot sustainably provide at the rate being demanded.

And here is the thing that gets lost in the Frontier AI conversation.

That transistor moment isn’t coming. It’s already here. It’s been here for twenty years. We just haven’t been paying attention to it.


The AI That Never Made the Headlines

While the world was discovering chatbots, a different AI had already been running quietly for years. Doing real work. Solving defined problems. Delivering measurable outcomes with ROI.

Netflix’s recommendation engine has been shaping what you watch since 2006. Gmail’s spam filter has used machine learning since 2004. Amazon’s demand forecasting and warehouse optimisation runs on ML models that have been refining themselves for over a decade. Google Maps learns from billions of journeys in real time.

Credit card fraud detection catches billions of fraudulent transactions annually using models that make decisions in milliseconds, invisibly, at the edge.

Aircraft engines predict their own failures. Factory production lines inspect their own output. Rail networks scan their own infrastructure at track speed.

None of this started in 2023. None of it needed a press release. None of it requires a nuclear power agreement or a seven-year planning queue.

In my post on Machine Learning: The Backbone of Enterprise AI I looked at what this looks like in practice. DHL’s parcel inspection system uses computer vision models running at the edge — cameras over conveyor belts, ML models trained on millions of images, making pass/fail decisions on every parcel at line speed. Duos Technologies scans every railcar at track speed using AI-enabled imaging, catching defects before they become safety incidents. Factory quality control systems watch production lines continuously, catching defects too subtle for the human eye.

None of these run on hyperscale infrastructure. None send data to a cloud API and wait for a response. They run locally, efficiently, on hardware sized to the task — because the task is defined, bounded, and understood.

This is the transistor moment for AI. Already in production. Already sustainable. Already delivering ROI.


The Right Tool for the Right Problem

The AI decisions that make genuine business sense share a common structure.

A defined problem — not “use AI” but “reduce defect escape rate on line 4” or “inspect every railcar at track speed without taking it out of service” or “predict supply chain disruption before it hits.”

A defined process — the AI is embedded in a workflow with a measurable before and after. It replaces or augments a specific activity with a specific outcome.

A defined return — because the problem is specific and the process is clear, the ROI calculation is straightforward. Not “we have AI” but “our cost per inspection dropped, our throughput increased, our safety incidents reduced.”

When AI is aligned to business process optimisation this way, the technology choices become obvious. Computer vision for inspection. Predictive models for maintenance. Anomaly detection for risk. Small, focused, efficient. Running on hardware sized to the job, not to the ambition.

The organisations getting genuine return from AI are overwhelmingly doing this. They’re not in the headlines. They’re on the production floor. They’ve been there for years.

I’ve written about what that structured approach looks like in practice in Getting AI Right First Time. The organisations that get AI right treat it as process optimisation first, technology second. Start small. Prove value. Escape the pilot graveyard. Make AI boring. That five-step discipline is what separates durable capability from expensive experimentation.

The starting point is finding your critical process — the one business process that, if it stopped tomorrow, the business would stop with it. Name it. Find the friction within it. That friction is your AI use case shortlist. Not a vendor briefing. Not a capability demo. The process the business already depends on, and the specific points where better prediction, classification, or inspection removes a real constraint.

The Obvious Outcome

The Frontier AI scaling ceiling isn’t a maybe. It’s a when. Efficiency gains will buy time. The laws of physics and economics won’t be moved.

When it arrives, the organisations that built deep dependencies on hyperscale infrastructure for processes that didn’t require it will be most exposed. The expertise allowed to go to waste won’t be quickly rebuilt. The processes restructured around availability assumptions that stop being true won’t easily revert. That’s what winter looks like.

But for the organisations that chose the right tool for the right business problem — summer is already here.

Distributed. Local. Efficient. Sustainable. Aligned to real business processes. Delivering real ROI.

Not because they anticipated the ceiling. Because they were asking the right question from the start — not “how do we use AI” but “where does better prediction, classification, or inspection create measurable value, and what is the smallest, most reliable tool that delivers it.”

Electronics couldn’t scale with valves. It scaled with the transistor. And everything that followed — the PC, the phone, the computer in your car — emerged from that single shift in direction.

AI will follow the same arc. Not because anyone planned it. Because the ceiling makes it inevitable.

The quiet AI already running on the production floor isn’t the consolation prize.

It’s the destination.


From 22 Binders to a Box on the Workbench

Blueprint illustration of a distributed AI architecture, highlighting challenges such as slow distribution and high space usage, and benefits including localized inference and enhanced security.

How AI is following the same arc as every technology before it — and why that’s the most important thing happening in enterprise AI right now.


The 22 Binders Problem

In the 1990s I worked for Volvo Trucks, responsible for technical documentation. Every authorised dealership workshop had 22 steel binders crammed with service manuals, torque specifications, fault code references, and replacement procedures. Covering every model, every variant, every engine configuration.

Getting those binders to the right workshop, in the right language, at the right time was a logistical nightmare. Print runs. Translation cycles. Physical distribution across multiple countries. And the moment a binder left the building, you started losing control of it.

Because trucks don’t stand still. Specifications change. Procedures get revised. Safety-critical updates happen. So we’d issue interim service bulletins — printed updates mailed out to dealerships, hoping they’d find their way into the right binder, in the right place, before a technician needed them.

The knowledge was good. The people who wrote those manuals understood those trucks deeply. The content was authoritative, structured, procedural. But the distribution model was broken by design. The moment you printed something, the clock started on it becoming out of date.

I spent years managing that problem. I never solved it. The technology didn’t exist to solve it.

It does now.


The Sledgehammer Era

When generative AI arrived at scale it came in like a sledgehammer. Vast data centres. Enormous compute. Hundreds of billions of parameters. Everything centralised, everything cloud-dependent, everything expensive.

That was necessary. You needed that scale to prove the capability existed. GPT-4, Gemini, Claude — these required hyperscale infrastructure to demonstrate what large language models could actually do. The sledgehammer era wasn’t wrong. It was the only way to get here.

But it’s not the end state.

There’s a pattern in technology that repeats so reliably it’s almost boring once you’ve seen it enough times. A new capability emerges at scale, centralised, specialist, expensive. Then the algorithms get smarter. The hardware gets smaller. The economics shift. And compute escapes the specialist environment and goes where the work actually happens.

We’ve seen this cycle before. Several times.


The Transistor Moment

In the 1940s the valve computer proved the concept. ENIAC, Colossus — genuinely capable machines doing real computation. Enormous, power hungry, heat generating, fragile, requiring specialist environments and specialist people just to keep running.

Then the transistor arrived. It didn’t just miniaturise the valve. It changed what was possible. Compute escaped the air-conditioned room. It went into consumer products, industrial equipment, vehicles. Eventually into the Volvo workshop — diagnostic computers, engine management systems, electronic service tools.

Nobody staring at a room full of valves in 1955 predicted the smartphone. But it was inevitable once the transistor existed. The use cases emerged from the distribution.

We are at that moment with AI.

The evidence is already here. Models like Mistral 7B, Microsoft’s Phi-3, and Meta’s Llama 3.1 8B perform remarkably well on focused tasks. Smaller models combined with quantisation mean a focused task that needed 80GB of GPU memory two years ago can run comfortably in 8GB today with minimal quality loss. NVIDIA has put one petaflop of AI compute into a desktop machine, built on the GB10 Grace Blackwell chip, that sits on a workbench. Apple put a capable language model in a phone.
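A rough back-of-envelope calculation shows why quantisation matters. This sketch counts model weights only and assumes the round parameter counts implied by the model names; activations, the attention cache, and runtime overhead add to these figures.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate memory for model weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Round parameter counts taken from the model names (an approximation).
for name, n_params in [("7B model", 7e9), ("8B model", 8e9)]:
    for bits in (16, 8, 4):
        gb = weight_memory_gb(n_params, bits)
        print(f"{name} at {bits:>2}-bit: {gb:.1f} GB")
```

Going from 16-bit to 4-bit weights cuts the footprint to a quarter, which is what brings a 7B or 8B model within reach of a workbench machine or a laptop GPU.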

The sledgehammer is giving way to the scalpel. Smarter algorithms. Smaller models. Distributed inference. The transistor moment.


The Architecture That Changes Everything

So let me describe what I’d build for Volvo Trucks today.

A GB10 desktop in every dealership workshop. On it: a small language model, a local MCP server, and a local cache of the workshop knowledge base. No internet connection required. No cloud dependency. No data leaving the building.

The knowledge base lives centrally — one authoritative source, maintained by a team of technical authors with a proper authoring and approval workflow. Single version of the truth. When a torque specification changes, it changes once, in one place, approved by the right person. Overnight, a delta sync pushes only the changed content to every workshop box in every country. By morning every technician in Europe has the current procedure.
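The delta sync itself does not need to be clever. One simple way to do it, sketched here under assumptions, is to compare content hashes and send only what differs. The document ids and torque values are placeholders, not real specifications, and a production sync would also need signing, retries, and an audit trail.

```python
import hashlib


def fingerprint(text):
    """Stable content hash used to detect changed documents."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def compute_delta(central, local_hashes):
    """Compare the central knowledge base with one workshop's cache.

    `central` maps document id -> current approved text.
    `local_hashes` maps document id -> hash of the copy held locally.
    Returns (documents to send, ids to delete locally).
    """
    to_send = {
        doc_id: text
        for doc_id, text in central.items()
        if local_hashes.get(doc_id) != fingerprint(text)
    }
    to_delete = sorted(set(local_hashes) - set(central))
    return to_send, to_delete


# Hypothetical document ids and placeholder contents.
central = {
    "torque/wheel-nut": "Tighten to 600 Nm",
    "service/interval": "Every 12 months",
}
local_hashes = {
    "torque/wheel-nut": fingerprint("Tighten to 550 Nm"),  # out of date
    "service/interval": fingerprint("Every 12 months"),    # unchanged
    "bulletin/old": fingerprint("Withdrawn bulletin"),     # removed centrally
}

to_send, to_delete = compute_delta(central, local_hashes)
print(sorted(to_send), to_delete)
```

Only the changed torque document travels to the workshop, and the withdrawn bulletin is flagged for removal. The unchanged document generates no traffic at all.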

The technician doesn’t type freeform queries. They select from predefined prompt templates — fault code lookup, torque specification, replacement procedure, service interval. Each template fires a carefully engineered retrieval query behind the scenes, pulling the right content from the local knowledge base, passing it to the local model, generating a precise, grounded answer.

The model stays on domain. There’s no internet to reach out to, and because every answer is generated from retrieved content, there is far less room for the model to hallucinate by confabulating a procedure it half-remembers from training. The answer comes from the authoritative knowledge base, retrieved precisely, generated locally, watermarked with the sync timestamp so you know exactly which version of the truth informed it.
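The template-to-answer path might look something like the sketch below. Everything named here is hypothetical: the template names, the key patterns, the placeholder content, and the sync date. The call to the local model is left out, so the code stops at the point where grounded content would be handed over as context.

```python
from datetime import datetime, timezone

# Hypothetical templates: each maps a menu choice to a retrieval key pattern.
TEMPLATES = {
    "torque_specification": "torque/{component}",
    "fault_code_lookup": "fault/{code}",
}


def answer(template, knowledge_base, synced_at, **fields):
    """Look up governed content for a template and stamp it with the sync time.

    Returns None when nothing is found, rather than letting a model guess.
    """
    key = TEMPLATES[template].format(**fields)
    content = knowledge_base.get(key)
    if content is None:
        return None
    return {
        "source": key,
        "content": content,  # would be passed to the local model as context
        "synced_at": synced_at.isoformat(),
    }


kb = {"torque/wheel-nut": "Tighten to 600 Nm"}  # placeholder value
synced = datetime(2025, 1, 15, 2, 0, tzinfo=timezone.utc)

print(answer("torque_specification", kb, synced, component="wheel-nut"))
print(answer("fault_code_lookup", kb, synced, code="P0000"))
```

The second call returns nothing because the knowledge base has no entry for that code. Refusing to answer is the safe behaviour when the governed source is silent.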

This isn’t a chatbot. It’s a precision workshop tool that happens to use an AI model internally. The AI is an implementation detail. The value proposition is the right answer, instantly, for a truck that needs to move.

And crucially — this couldn’t be done with a naive RAG implementation bolted onto an ungoverned file store. The intelligence isn’t in the retrieval mechanism. It’s in the governance that happened before retrieval was ever involved. The single version of truth. The approval workflow. The deprecation process that ensures superseded procedures stop being retrieved. The content discipline that technical authors like my 1990s self spent years trying to maintain manually.
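In code, that governance shows up as a filter that runs before any retrieval. A minimal sketch follows, assuming each record carries an approval flag and a pointer to whatever replaced it; the `Procedure` fields and the example document are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Procedure:
    doc_id: str
    version: int
    approved: bool
    superseded_by: Optional[int] = None  # version that replaced this one


def retrievable(procedures):
    """Keep only approved procedures that have not been superseded."""
    return [p for p in procedures if p.approved and p.superseded_by is None]


# Hypothetical version history for one procedure.
records = [
    Procedure("brake-pad-replacement", 1, approved=True, superseded_by=2),
    Procedure("brake-pad-replacement", 2, approved=True),
    Procedure("brake-pad-replacement", 3, approved=False),  # still in draft
]

print([(p.doc_id, p.version) for p in retrievable(records)])
```

Only version 2 survives. The superseded version and the unapproved draft never reach the retrieval index, so the model cannot quote them however the question is phrased.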

The AI amplifies good knowledge management. It doesn’t replace it.


The Pendulum

I’ve watched enough technology cycles to see the pattern clearly.

The PC era distributed compute out of the data centre and into the hands of individuals. The internet centralised it again — everything running on servers you didn’t own. Mobile distributed it once more, putting capable compute in every pocket. Cloud AI is centralising again — everything phoning home to a hyperscale data centre to answer a query.

Each time, the dominant narrative says this is how it will always be now. Each time, the pendulum swings back. Not because the technology fails, but because the same forces reassert themselves: latency becomes unacceptable, data sovereignty pressure builds, cost economics shift, and capability catches up to the point where distribution becomes viable again.

We are at peak centralisation with AI right now. The distribution forces are already building. Sovereignty regulation is tightening globally. Edge hardware is catching up fast. Algorithmic efficiency is compressing capable models into deployable sizes. Enterprises are growing uncomfortable with sensitive data leaving their premises to answer a query.

The pendulum will swing. It always does.


The Control Plane

But there’s something different this time that could break the historical pattern — and it’s the idea I find most interesting in enterprise AI right now.

Every previous distribution wave eventually lost coherence. The PC era fragmented into version chaos, security nightmares, and unmanageable sprawl. Mobile created a device estate that IT departments are still trying to govern. Distribution without discipline becomes a different kind of problem — one that often ends up being worse than the centralisation it replaced.

The question is whether AI distribution can be done differently. Whether you can have local inference without local chaos.

I think you can. The architecture looks like this: local inference running in your server room, your data never leaving your premises, your domain knowledge locked in a governed local knowledge base. But the model’s values, safety alignment, and capability updates maintained centrally by the people who built it. The enterprise controls the execution environment. The model provider maintains the model itself.

Distribution without fragmentation. Sovereignty without chaos.

It’s the Volvo KB architecture applied to the model itself. Central truth. Distributed execution. The same principle that would have solved my 22-binder problem in 1993 — one authoritative source, pushed to the edge, with discipline about what changes and who approves it.

This isn’t a theoretical position. The infrastructure to do it exists today. What’s missing, in most organisations, is the governance thinking that makes it safe. And governance thinking, it turns out, is not a technology problem. It’s a knowledge management problem. Which is a very old problem indeed.


What This Means

The use cases for distributed, domain-locked AI are not going to come from hyperscale thinking. They’re going to emerge from people who understand specific domains deeply — who know where the knowledge lives, what governance it needs, and what questions actually matter in that environment.

The Volvo workshop is one example. But the same architecture applies anywhere that has a bounded domain, authoritative knowledge, and real decisions being made by people who need the right answer quickly.

Consider a hospital ward. A clinician needs the current drug interaction protocol for a specific combination — not a general answer from a model trained on the internet, but the approved formulary for this trust, this version, signed off by the chief pharmacist last Tuesday. The architecture is identical to the workshop: local inference, governed knowledge base, delta sync, no data leaving the building. The AI is an implementation detail. The value is the right answer, for this patient, right now.

Or a field engineer on an offshore platform, no reliable connectivity, needing the current maintenance procedure for a specific valve configuration. Or a legal team needing to retrieve the approved contract clause library — not a hallucinated approximation, but the version that compliance signed off.

In each case the AI isn’t the interesting part. The interesting part is the governed knowledge layer underneath it — built by people who understand the domain, maintained with discipline, versioned and approved and auditable.

The sledgehammer era gave us the capability. The transistor moment gives us the distribution. What makes it useful is the same thing that always made the difference — knowing your domain, respecting your data, and being honest about what the technology actually does.

I learned that managing 22 steel binders for Volvo Trucks in the 1990s.

Some lessons don’t change.

The Plumbing Under the Hood: RAG, MCP and the Architecture Nobody Explains

A diagram illustrating the architecture of a large language model (LLM) with connections to various systems including CRM, ERP, HR, and SharePoint, displayed on a blueprint-style background.

I’m an under-the-hood type of guy. I hear high-level fluff and I just turn off. I need more. I need to be able to visualise how things work, and the effects of implementation. I guess that’s the Solution Architect in me. Years of seeing projects go south. Experience that says it’s just not that simple.

I learned a long time ago: if you want something doing, do it yourself.

So here it is. No fluff, no hand-waving. The no-nonsense guide to what RAG and MCP actually are, how they work, and why the distinction matters more than most people realise. Enjoy.


The Problem Every Enterprise AI Deployment Hits

Large language models are genuinely extraordinary. The breadth of knowledge, the reasoning capability, the ability to synthesise and explain — it’s real, and it’s useful. But they have a fundamental constraint that every organisation hits the moment they try to deploy one seriously.

They are frozen.

An LLM is trained on a vast corpus of data up to a point in time, and then the weights are fixed. The model doesn’t know what happened last Tuesday. It doesn’t know your organisation’s processes, your customer contracts, your current pipeline, or the policy document your HR team updated this morning. It is, for all its capability, a brilliant mind in a sealed room.

Every enterprise AI deployment is therefore really solving one problem: how do we get relevant, current, organisational knowledge into the model’s hands at the moment it needs to answer?

Two main solutions emerged. They look similar on the surface. They are fundamentally different underneath.


RAG: The Indexed Snapshot

RAG stands for Retrieval-Augmented Generation. The name is less important than the mechanism.

Imagine you have a large knowledge base — policy documents, product guides, training materials, technical specifications. RAG takes all of that content and processes it in advance. Each document gets broken into chunks. Each chunk gets converted into a vector embedding — a numerical representation of the meaning of that text, not just its keywords. Those embeddings get stored in a vector database.

When a user asks a question, the question itself gets converted into a vector using the same method. The system then searches the database for the chunks whose meaning is closest to the meaning of the question — semantic similarity, not keyword matching. The most relevant chunks get retrieved and placed into the model’s context window alongside the original question. The model answers using that retrieved material as its working context.
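
The flow above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline. The `embed` function is a toy bag-of-words stand-in that I am assuming purely so the example runs without dependencies. It matches on shared words, whereas a real embedding model captures meaning. The chunk texts and function names are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector.
    A real RAG pipeline would call a learned embedding model here."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Ingestion: chunks are embedded in advance and stored (the "vector database").
chunks = [
    "Annual leave requests must be approved by a line manager.",
    "Expense claims are reimbursed within thirty days of submission.",
    "Laptops are refreshed on a four year hardware cycle.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(question, k=1):
    """Embed the question the same way and return the k closest chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# Query time: retrieved chunks go into the prompt alongside the question.
question = "Who has to approve annual leave?"
context = retrieve(question)
prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: " + question
```

Swap the toy `embed` for a real embedding model and the list for a vector database and the shape of the pipeline stays the same.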

Think of it as a library. Brilliantly organised, perfectly indexed, searchable by meaning rather than title. You walk in, the system finds the most relevant books, opens them to the right pages, and hands them to the model before it answers.

It’s powerful. For stable, curated knowledge bases it works extremely well.

But it has a ceiling, and the ceiling matters.

The library was shelved at a point in time. The moment your source documents change, your index is stale until you re-embed. And the quality of retrieval is entirely dependent on the quality of what went in. Poorly structured documents, inconsistent language, missing metadata — the embeddings become noisy and retrieval underperforms. The foundational principle holds here as firmly as anywhere in AI: weak data quality at the input stage leads to flawed outputs downstream. RAG doesn’t solve a data quality problem. It inherits it.


MCP: The Living Plumbing

MCP — the Model Context Protocol — is a different kind of answer to the same problem. And understanding the difference is where the real business thinking begins.

MCP doesn’t retrieve from a pre-built index. It connects the model to live systems through their APIs — and queries them in real time, at the moment of the conversation.

Here’s what that means practically. Your SharePoint isn’t indexed in advance — the model calls it directly and gets back whatever is there right now, including the contract template someone updated this morning. Your CRM isn’t embedded into vectors — the model queries it and sees the deal that moved stage an hour ago. Your HR system, your procurement platform, your service desk — all of them accessible, all of them current, all of them live.

The model doesn’t see a snapshot of your organisation. It sees your organisation as it actually is, right now.
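
A rough sketch of that idea, with heavy caveats: this is not the MCP SDK or its wire protocol, just the underlying pattern of a named tool that reads a live system at call time. The in-memory `crm` dictionary, the tool name and the `call_tool` helper are all hypothetical stand-ins.

```python
# Hypothetical in-memory stand-in for a live CRM; in practice this is an API call.
crm = {"deal-42": {"stage": "Proposal"}}

def get_deal_stage(deal_id):
    """Tool: read the deal's stage from the live system at call time."""
    return crm[deal_id]["stage"]

# Registry of tools exposed to the model: name -> callable.
TOOLS = {"get_deal_stage": get_deal_stage}

def call_tool(name, **arguments):
    """Dispatch a tool call requested by the model to the live system."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**arguments)

before = call_tool("get_deal_stage", deal_id="deal-42")
crm["deal-42"]["stage"] = "Negotiation"  # someone moves the deal an hour later
after = call_tool("get_deal_stage", deal_id="deal-42")
print(before, "->", after)
```

Nothing was re-indexed between the two calls. The second answer is current because the lookup happens at the moment of the question.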

And here is the point that changes how you should think about this entirely.

Most enterprise knowledge isn’t in one place. It never has been. It’s fragmented across Salesforce and SAP, ServiceNow and SharePoint, HR platforms and finance systems and procurement tools. Getting RAG to span those systems requires significant data engineering effort — ingesting, normalising, embedding, maintaining. It’s achievable, but it’s heavy.

MCP connects to all of them. Through their APIs. Simultaneously. The model becomes a single conversational interface across the entire technology estate — not just one knowledge base, but the living information fabric of the organisation.

That is not a chatbot connected to some documents. That is a fundamentally different proposition.


Not Competitors — Different Layers

It would be tempting to read this as RAG versus MCP. It isn’t.

They solve overlapping problems at different layers and with different trade-offs. RAG is the right tool for large, stable knowledge corpora where semantic similarity search matters — where you need the model to find relevant material even when the exact words don’t appear in the query. MCP is the right tool where data is live, dynamic, and distributed across operational systems.

And they can work together. A well-architected system might use MCP as the orchestration layer — the model deciding which tools to call — while one of those tools triggers a RAG pipeline for a specific stable knowledge base. The plumbing and the library, working in concert.

The practical guidance is straightforward. Start with MCP. It’s the lower barrier to entry: no vector infrastructure to provision, no embedding pipelines to build and maintain, no index to keep fresh. You’re connecting to systems and APIs you already have. Reach for RAG when you’ve hit the ceiling, when the corpus is large, messy, and semantic retrieval across unstructured content becomes essential.

Start simple. Earn the complexity.


Before You Lay The Plumbing — What Nobody Tells You

The pitch for both RAG and MCP is compelling. The reality, as always, has a few sharp edges worth knowing about before you commit.

RAG brings infrastructure with it. RAG isn’t just a software pattern you switch on. Behind every vector database is a compute and storage requirement that needs provisioning, maintaining, and scaling as your knowledge base grows. Embedding pipelines need to run continuously — every time source content changes, chunks need re-processing and re-indexing or your library goes stale. For organisations already managing data centre complexity, this is a real cost conversation that rarely appears in the vendor presentation.

MCP makes your legacy systems load-bearing. MCP’s power is connecting to live systems. But those live systems are now dependencies. The legacy HR platform with the flaky API. The procurement system that slows under load. The CRM with three years of inconsistent data entry. Once the LLM is reaching across your technology estate, it is only as reliable as the weakest system it touches. A timeout, a bad API response, a data quality problem in one system degrades the entire interface. What felt like a peripheral legacy problem just became front and centre.

Governance and security are not optional extras. When a model can traverse your entire technology estate — reading CRM data, querying HR systems, pulling procurement approvals — your entire technology estate needs to be ready for that conversation. Access controls, data classification, audit trails, API security, compliance boundaries. These cannot be bolted on after deployment. They need to be designed in from the start. MCP without a holistic governance and security view isn’t just risky. It’s an exposed surface at scale.
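
To make the last two points concrete, here is a minimal sketch of a gateway that sits between the model and the tools. It checks access, writes an audit entry for every call, and contains a failing dependency instead of letting it take the whole interface down. The roles, tool names and policy table are hypothetical, and a real deployment would lean on the organisation's existing identity and logging systems.

```python
from datetime import datetime, timezone

# Hypothetical policy: which roles may call which tools.
POLICY = {"sales": {"crm.read"}, "hr_partner": {"crm.read", "hr.read"}}
audit_log = []

def crm_read():
    return "3 deals closing this week"

def hr_read():
    # Simulates the legacy platform with the flaky API.
    raise TimeoutError("HR API did not respond in time")

TOOLS = {"crm.read": crm_read, "hr.read": hr_read}

def call_tool(user, role, tool):
    """Check access, record an audit entry, and contain downstream failures."""
    entry = {"time": datetime.now(timezone.utc).isoformat(),
             "user": user, "role": role, "tool": tool}
    if tool not in POLICY.get(role, set()):
        entry["outcome"] = "denied"
        audit_log.append(entry)
        return {"ok": False, "error": "access denied"}
    try:
        result = {"ok": True, "data": TOOLS[tool]()}
        entry["outcome"] = "ok"
    except (TimeoutError, ConnectionError) as exc:
        # One slow system degrades one answer, not the whole interface.
        result = {"ok": False, "error": str(exc)}
        entry["outcome"] = "failed"
    audit_log.append(entry)
    return result

allowed = call_tool("ana", "sales", "crm.read")
denied = call_tool("ana", "sales", "hr.read")
degraded = call_tool("ben", "hr_partner", "hr.read")
```

The design choice worth noting is that the audit entry is written on every path, including denials and failures.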

This is AI Reality. The plumbing is powerful. Lay it properly.


The Interface, The Plumbing, The Flow

Here is the frame I want to leave you with — because it’s the one that changes how you brief a customer, evaluate a vendor, or think about your own AI roadmap.

LLMs are becoming the interface to information. Not a search bar, not a dashboard, not a report. A conversational, reasoning interface that sits in front of your organisation’s entire data landscape and makes it accessible in plain language.

MCP is the plumbing. The connective tissue that links the interface to the living systems underneath — the CRM, the ERP, the HR platform, the document store, the data warehouse. Without the plumbing, the interface has nothing to work with. With it, the interface can see everything.

And once you have an interface and plumbing, something else becomes possible.

Agents.

Not models that answer questions. Models that act. That move through systems, make decisions, complete workflows, and hand off to humans at exactly the right moment. Agents ride the pipelines that MCP creates and turn information flow into work getting done.

That’s where this goes. And that’s what the next post is about.

Next: The Agentic Leap — when AI stops answering and starts acting.


The Interface. The Plumbing. The Flow.

LLM Sizing 101 – Part 3: Platform and GPU Selection

A schematic diagram illustrating the LLM sizing chain, featuring flowcharts that detail model size, precision, tokens per second, GPU count, node count, and platform specifications.

Mapping your sizing to Dell PowerEdge XE configurations

In Part 1 we nailed down the two fundamentals: parameters define how big the model is, and tokens define how much work you’re asking it to do.

In Part 2 we made it practical — translating a customer’s real-world requirements into a target tokens-per-second figure, and from there into a GPU count.

Now we make it concrete.

Building on the methodology from Part 2, we apply it to two representative scenarios — a 7B internal assistant and a 70B RAG system — and map everything to actual Dell PowerEdge XE platform configurations you can put in a proposal. But before we get to the reference designs, there’s a gotcha.


The Gotcha: Model Precision

There’s a variable that can silently double — or halve — your GPU count if you don’t nail it down early in the conversation.

Model precision.

When a customer says “we want to run a 70B model,” that sentence is incomplete. The question you need to ask immediately is: at what model precision?

Here’s why it matters so much. The memory footprint of a model is:

Model VRAM (GB) = number of parameters × bytes per parameter

And the bytes-per-parameter figure is entirely determined by precision:

| Precision | Bytes per parameter | 70B model, weights only | Notes |
| --- | --- | --- | --- |
| FP32 | 4 bytes | ~280 GB | Training only; rare in inference |
| FP16 / BF16 | 2 bytes | ~140 GB | Full quality baseline |
| FP8 | 1 byte | ~70 GB | Requires H100, H200, B300 class |
| INT8 | 1 byte | ~70 GB | Broad hardware support |
| INT4 | 0.5 bytes | ~35 GB | Validate quality before committing |
| FP4 | 0.5 bytes | ~35 GB | B300/GB300 only; first-class inference precision |

Run the same 70B model at FP16 versus INT4 and the weights footprint changes by 4×. That’s the difference between needing two 8-GPU nodes and needing one. It’s the difference between a £400k proposal and a £200k proposal. And it’s a variable that’s completely invisible if you skip the precision conversation.
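
The formula is simple enough to keep as a helper. A minimal sketch follows. The function and dictionary names are mine, and the figures are weights only, so KV cache and runtime overhead come on top.

```python
# Bytes per parameter for the most common precisions.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16/BF16": 2.0, "FP8": 1.0, "INT8": 1.0, "INT4": 0.5}

def weights_gb(params_billions, precision):
    """Weights-only footprint in GB: parameters x bytes per parameter."""
    return params_billions * BYTES_PER_PARAM[precision]

for precision in BYTES_PER_PARAM:
    print(f"{precision:>9}: ~{weights_gb(70, precision):.0f} GB")

# FP16 versus INT4 for the same 70B model.
ratio = weights_gb(70, "FP16/BF16") / weights_gb(70, "INT4")
```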

How to find out

The good news: model precision is almost always discoverable before you size anything.

The model card. Every published model has a model card stating the native training precision — typically FP32, BF16, or FP16 — and whether pre-quantised versions exist. Llama 3.1 405B, for example, is published in BF16 with a separate FP8-quantised version available for single-node deployment. That’s not a footnote — it’s a hardware decision.

The deployment framework. When a customer tells you they’re using vLLM, TensorRT-LLM, or NVIDIA NIM, the framework makes precision explicit. NIM profiles are named by precision — tensorrt_llm-h100-fp8-tp2-latency tells you the precision, the GPU, and the parallelism strategy in one string. If the customer has already chosen a framework, ask what precision they’re deploying at — they’ll either know, or the question will prompt them to find out.

The GPU itself. Not all GPUs support all precisions. FP8 requires H100, H200, B300 or AMD MI300X class hardware. FP4 is exclusive to B300 and GB300 — it isn’t available on earlier generations. INT4 with hardware acceleration requires specific tensor core support. If the customer has already chosen a GPU, that constrains the precision options — and vice versa. The two decisions are linked.
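
As a small illustration of how much a profile name gives away, the example string quoted above can be split into its parts. This sketch assumes the five hyphen-separated fields seen in that one example. Real profile names may not all follow that layout, so treat the parser as hypothetical rather than as NIM's own naming rule.

```python
def parse_profile(name):
    """Split a profile name shaped like the example into labelled fields.
    Assumes: backend-gpu-precision-tpN-objective."""
    backend, gpu, precision, parallelism, objective = name.split("-")
    if not parallelism.startswith("tp"):
        raise ValueError(f"unexpected parallelism field: {parallelism}")
    return {
        "backend": backend,
        "gpu": gpu,
        "precision": precision,
        "tensor_parallel": int(parallelism[2:]),
        "objective": objective,
    }

profile = parse_profile("tensorrt_llm-h100-fp8-tp2-latency")
print(profile)
```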

The precision conversation in practice

When a customer names a model, these are the three questions that unlock the sizing:

  • “Are you using the native model weights, or a quantised version?”
  • “What serving framework are you planning to use?”
  • “Is some accuracy trade-off acceptable in exchange for a smaller hardware footprint?”

That last question is the most important one. Modern quantisation techniques — GPTQ, AWQ, SmoothQuant — preserve the vast majority of model quality for most enterprise inference workloads. The difference between BF16 and INT8 is typically imperceptible for summarisation, search, classification and code assistance. For complex multi-step reasoning or fine-tuned models, it’s worth validating. But for the majority of use cases, INT8 or FP8 is a legitimate production choice — not a compromise.

The rule of thumb, for most enterprise inference workloads: the bigger the model, the more gracefully it quantises. A 70B model quantised to INT8 typically gives up a smaller share of its quality than a 7B model quantised to INT4.

Get precision wrong — or leave it undefined — and every GPU count in your proposal is built on a shaky foundation. Get it right, and you have a sizing conversation that’s grounded, defensible, and often more cost-effective than the customer expected.


Two Reference Designs

With precision established, everything else follows.

Sizing disclaimer: The reference designs below illustrate the methodology — they are not a substitute for your own sizing exercise. TPS figures, GPU counts and node recommendations are directional reference points based on representative workloads. Actual performance will vary with your specific model, serving framework, quantisation approach, batch configuration and workload pattern. Always validate against benchmark data for your environment before quoting or committing to a configuration.

These aren’t rigid prescriptions — they’re starting points you can adapt by adjusting the inputs and re-running the TPS maths from Part 2.


Reference Design A: 7B Internal Assistant

Use case: An internal productivity assistant — employees asking about policies, summarising documents, drafting emails. High concurrency, moderate latency sensitivity, cost-conscious.

1. Define the workload

| Parameter | Value |
| --- | --- |
| Concurrent users (peak) | 500 |
| Average prompt | 400 tokens |
| Average response | 250 tokens |
| Target response time | ~8–10 seconds |
| Acceptable TTFT | < 2 seconds |
| Model | 7B class |

2. Establish precision and memory footprint

For a 7B model:

| Precision | Weights footprint | Fits on a single GPU? |
| --- | --- | --- |
| FP16 / BF16 | ~14 GB | Yes (48–80 GB class) |
| INT8 | ~7 GB | Yes, comfortably |
| INT4 | ~3.5 GB | Yes, with significant headroom |

For a high-concurrency internal assistant, INT8 or mixed precision (weights in INT8, activations in FP16/BF16) is the practical default. It fits cleanly on a single GPU, leaves room for KV cache and batching overhead, and the quality trade-off is negligible for this kind of workload.

3. Translate to TPS

  • 250 output tokens ÷ 10 seconds = 25 tokens/sec per user
  • 500 users × 25 tokens/sec = 12,500 tokens/sec system TPS

4. Per-GPU TPS estimate

For a 7B model at INT8/mixed precision, batched decode on a high-end accelerator:

| GPU | Approx. TPS (7B, batched decode) |
| --- | --- |
| H100 80GB SXM | ~2,000–3,000 |
| H200 141GB | ~2,500–3,500 |
| L40S 48GB | ~1,000–1,500 |
| B300 288GB | ~4,000–6,000 (est.) |

Conservative estimate: 1,500 TPS per GPU on current generation; higher on B300.

5. GPU and node count

  • 12,500 TPS ÷ 1,500 TPS/GPU ≈ 8.3 GPUs
  • Add 25% headroom: 8.3 × 1.25 ≈ 10.4 GPUs → round up to 12 for a clean 3 × 4 configuration
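
Steps 3 to 5 can be wrapped into two small functions so the inputs are easy to change. This is a sketch of the arithmetic above, nothing more. The function names are mine, and the 4-GPU node size is taken from the 3 × 4 configuration in the text.

```python
import math

def system_tps(users, output_tokens, response_seconds):
    """Peak system throughput: every user streaming a response at once."""
    return users * output_tokens / response_seconds

def gpus_needed(target_tps, tps_per_gpu, headroom=0.25, gpus_per_node=4):
    """Raw GPU count, count with headroom, and total rounded up to whole nodes."""
    raw = target_tps / tps_per_gpu
    padded = raw * (1 + headroom)
    nodes = math.ceil(padded / gpus_per_node)
    return raw, padded, nodes * gpus_per_node

target = system_tps(users=500, output_tokens=250, response_seconds=10)
raw, padded, total = gpus_needed(target, tps_per_gpu=1500)
print(f"{target:,.0f} TPS -> {raw:.1f} GPUs raw, {padded:.1f} with headroom, {total} deployed")
```

Change the per-GPU figure or the response time and the GPU count moves with it, which is usually the fastest way to show a customer how sensitive the design is to each assumption.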

6. Platform mapping

A 7B model at INT8 fits on a single GPU — no tensor parallelism required. Each GPU runs an independent model replica and you scale out horizontally across nodes. This is compact, balanced GPU server territory.

The Dell PowerEdge XE7745 is the natural fit for this workload class: a 2U platform supporting up to 4 high-memory GPUs, designed for exactly this kind of inference deployment. For organisations planning ahead with Blackwell, the XE7745 also supports NVIDIA RTX Pro 6000 Blackwell Server Edition GPUs — a professional-grade accelerator with 96 GB GDDR7 that offers significant headroom for a 7B workload and future-proofing for multi-model environments, at a lower power and cost envelope than data centre HBM-class GPUs.

“For a 7B internal assistant serving ~500 concurrent users, a small cluster of three PowerEdge XE7745 nodes gives you a responsive chat experience, capacity to grow, and the flexibility to host multiple models or environments — all in a standard rack footprint.”


Reference Design B: 70B RAG System

Use case: Knowledge-heavy workflows — legal, financial or engineering teams querying proprietary documents via a RAG pipeline. Quality matters more than raw user count. Concurrency is moderate.

1. Define the workload

| Parameter | Value |
| --- | --- |
| Concurrent users (peak) | 100 |
| Average prompt | 2,000 tokens |
| Average response | 500 tokens |
| Target response time | ~12–15 seconds |
| Acceptable TTFT | < 3 seconds |
| Model | 70B class |

Prompts are longer here because RAG injects retrieved document snippets, conversation history and system instructions into every request. That longer context window drives up KV cache memory — which is why the platform choice shifts significantly compared to Reference Design A.
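
To put a rough number on that KV cache pressure, here is a back-of-envelope sketch. The architecture figures are my assumptions, not something stated in this series: 80 layers, 8 key/value heads (grouped-query attention) and a head dimension of 128, which is typical of a Llama-style 70B model, with the cache held in FP16. It also assumes the worst case, where all 100 users hold a full prompt and response in memory at once.

```python
def kv_cache_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value):
    """Per-token KV cache: one key and one value vector per layer (factor of 2)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value

# Assumed Llama-style 70B architecture with an FP16 cache (2 bytes per value).
per_token = kv_cache_bytes_per_token(layers=80, kv_heads=8, head_dim=128, bytes_per_value=2)

# Worst case for Reference Design B: 100 users x (2,000 prompt + 500 response) tokens.
tokens_in_flight = 100 * (2000 + 500)
total_gb = per_token * tokens_in_flight / 1e9

print(f"{per_token / 1024:.0f} KiB per token, ~{total_gb:.0f} GB of KV cache at peak")
```

Under those assumptions the cache alone is in the same order of magnitude as the FP8 weights, which is why long-context RAG workloads push you toward high-memory platforms.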

2. Establish precision and memory footprint

This is where precision has the biggest impact on the proposal — and where the B300 changes the calculus significantly:

| Precision | Weights footprint | H100/H200 GPUs needed | B300 GPUs needed | Notes |
| --- | --- | --- | --- | --- |
| FP16 / BF16 | ~140 GB | 2 minimum | 1 (fits with headroom) | Full quality; B300’s 288 GB changes the equation |
| FP8 | ~70 GB | 1 minimum | 1 | Near-FP16 quality; requires H100/H200/B300 |
| INT8 | ~70 GB | 1 minimum | 1 | Minimal quality loss for most workloads |
| INT4 | ~35 GB | 1 | 1 | Validate quality before committing |
| FP4 | ~35 GB | N/A (B300/GB300 only) | 1, with substantial headroom | Validate quality for RAG use cases |

A key implication for B300 deployments: with 288 GB of HBM3e per GPU, a 70B model at FP16 (~140 GB weights) fits on a single B300. That eliminates the need for tensor parallelism within the node for this model size, simplifying the architecture and reducing interconnect dependency.
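
The "does it fit?" check is a one-liner worth keeping to hand. In this sketch the 10% memory reserve is an illustrative assumption of mine, standing in for runtime overhead. A long-context RAG workload will need considerably more room for KV cache, so treat the result as a floor, not a recommendation.

```python
import math

def min_gpus(weights_gb, gpu_memory_gb, reserve=0.10):
    """Smallest GPU count whose usable memory holds the weights.
    `reserve` is an assumed fraction held back for runtime overhead."""
    usable = gpu_memory_gb * (1 - reserve)
    return math.ceil(weights_gb / usable)

GPU_MEMORY_GB = {"H100": 80, "H200": 141, "B300": 288}

for precision, weights in [("FP16/BF16", 140), ("FP8", 70), ("INT4", 35)]:
    counts = {gpu: min_gpus(weights, mem) for gpu, mem in GPU_MEMORY_GB.items()}
    print(precision, counts)
```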

For a legal or financial RAG workload where output quality is the primary requirement, FP16 or FP8 remains the right starting point. FP4 on B300 is increasingly viable but worth validating explicitly against the customer’s specific domain before committing.

3. Translate to TPS

  • 500 output tokens ÷ 15 seconds = 33 tokens/sec per user
  • 100 users × 33 tokens/sec = 3,300 tokens/sec system TPS

4. Per-node TPS estimate

For a 70B model running on high-end accelerators:

| Configuration | Approx. TPS (70B, batched decode) |
| --- | --- |
| 4× H100 80GB (tensor parallel) | ~800–1,200 |
| 8× H100 80GB (tensor parallel) | ~1,500–2,500 |
| 8× H200 141GB (tensor parallel) | ~2,000–3,500 |
| 8× B300 288GB (HGX B300) | ~4,000–7,000 (est.) |

Conservative estimate on an 8× H100 node: 1,500 TPS. On an 8× B300 node: significantly higher, with the added benefit that each GPU can host the full model independently.

5. Node count

  • 3,300 TPS ÷ 1,500 TPS/node (H100 baseline) ≈ 2.2 nodes
  • Add 25–30% headroom: 2.2 × 1.25 ≈ 2.75 → round to 3
  • Total: 3 nodes × 8 GPUs = 24 GPUs (H100/H200 baseline)

On B300 hardware, the same TPS target is achievable with fewer nodes — or the same node count delivers substantially higher capacity.
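
The same node-count arithmetic can be run across every configuration in the step 4 table. This sketch uses the low end of each quoted TPS range as the conservative per-node figure and the 25% headroom from step 5. The names are mine and the outputs are directional, like the table they come from.

```python
import math

# Low end of each TPS range quoted in the per-node table (70B, batched decode).
NODE_TPS = {"4x H100": 800, "8x H100": 1500, "8x H200": 2000, "8x B300": 4000}

def nodes_needed(target_tps, tps_per_node, headroom=0.25):
    """Whole nodes required to meet the target with headroom applied."""
    return math.ceil(target_tps * (1 + headroom) / tps_per_node)

plan = {config: nodes_needed(3300, tps) for config, tps in NODE_TPS.items()}
for config, nodes in plan.items():
    print(f"{config}: {nodes} nodes")
```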

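Three nodes also gives you operational flexibility: you can drain one for maintenance and still hold roughly 3,000 TPS on the conservative 1,500 TPS/node estimate, about 90% of the 3,300 TPS peak target. Schedule maintenance off-peak, or add a fourth node if the full TPS floor must hold during a drain.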

6. Platform mapping

For H100/H200 deployments, the Dell PowerEdge XE9680 with 8× H100 or H200 GPUs remains a proven reference platform for 70B inference, with NVLink and NVSwitch providing the fast GPU-to-GPU interconnect tensor parallelism requires.

For Blackwell deployments, the Dell PowerEdge XE9780 and XE9785 are the direct successors to the XE9680 — delivering up to 4× faster LLM performance with the 8-way NVIDIA HGX B300. The liquid-cooled XE9780L and XE9785L variants support higher GPU densities for rack-scale deployments.

Infrastructure note: B300 systems require liquid cooling, 800 Gb/s networking, and power densities that most existing facilities cannot support without upgrade. The B300 draws 1,400W TDP per GPU — 40% more than the B200, and double the H100. Factor facility readiness into any B300 sizing conversation before committing to a configuration.

“For a 70B RAG assistant used by specialist teams — legal, finance, engineering — the PowerEdge XE9680 with H100/H200 GPUs remains a strong proven choice. For organisations investing in Blackwell infrastructure, the XE9780/XE9785 with HGX B300 delivers significantly higher throughput and eliminates tensor parallelism requirements for 70B class models — but facility readiness for liquid cooling and power density must be confirmed first.”


The Platform Decision in Summary

| Workload | Model | Precision | Current Platform | Blackwell Platform | GPUs/node | Nodes |
| --- | --- | --- | --- | --- | --- | --- |
| Internal assistant (high concurrency) | 7B | INT8 | PowerEdge XE7745 | XE7745 (RTX Pro 6000 BW) | 4× GPUs | 3 |
| RAG system (quality-first) | 70B | FP16 / FP8 | PowerEdge XE9680 | XE9780 / XE9785 (HGX B300) | 8× GPUs | 3 |

The pattern is consistent: model size drives platform class, precision drives memory footprint, TPS drives node count. Miss any one of those three and the sizing is incomplete.

For organisations moving beyond 70B — frontier models, multi-tenant inference at scale, or combined training and inference workloads — the Dell PowerEdge XE9712 featuring NVIDIA GB300 NVL72 is the next step up. With 72 Blackwell Ultra GPUs and up to 40 TB of fast memory per rack (combining ~20 TB of GPU HBM3e across 72 GPUs and ~17 TB of Grace CPU LPDDR5X), it delivers exascale-class AI performance for workloads that have outgrown the per-node sizing conversation entirely. That’s a different discussion — but it starts with the same methodology.


Three Trade-offs Worth Raising

Once you’ve walked a customer through a reference design, three conversations typically follow.

1. Can we use a smaller model? Sometimes yes — and it’s worth exploring. A well-tuned 13B model can deliver surprisingly strong results for many enterprise use cases, at a fraction of the infrastructure cost of a 70B. The right answer depends on the use case, not just the budget.

2. Can we quantise to reduce the footprint? INT8 quantisation roughly halves the memory footprint with minimal quality loss for most inference workloads. INT4 goes further — but quality trade-offs become more noticeable and are worth validating before committing. FP4 on B300 hardware is the emerging sweet spot for next-generation inference: near-FP8 quality at half the memory cost, with hardware-accelerated compute — but it requires Blackwell Ultra infrastructure.

3. What about fine-tuning? If the customer plans to fine-tune as well as infer, size for fine-tuning — it’s the more demanding workload. Fine-tuning requires storing optimiser states and gradients alongside the model weights, which can triple or quadruple the VRAM requirement compared to inference alone. A platform sized for fine-tuning will handle inference comfortably.
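
The "triple or quadruple" figure in the fine-tuning point falls out of simple counting, under one simplifying assumption that I am making here: gradients and optimiser states are held at the same precision as the weights. Mixed-precision setups that keep FP32 master copies need more, and activations are excluded entirely, so read this as a lower bound.

```python
def training_multiplier(optimiser_states_per_param):
    """Copies of the model held in memory: weights + gradients + optimiser states."""
    return 1 + 1 + optimiser_states_per_param

sgd_momentum = training_multiplier(1)  # one momentum buffer per parameter
adam = training_multiplier(2)          # first and second moment per parameter

weights_gb = 14  # 7B model at FP16, weights only
print(f"SGD with momentum: {sgd_momentum}x -> ~{weights_gb * sgd_momentum} GB")
print(f"Adam:              {adam}x -> ~{weights_gb * adam} GB")
```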


What’s Next

With three posts, we’ve built a complete sizing chain:

  • Part 1: Parameters and tokens — the two dials that drive every sizing decision
  • Part 2: From tokens per second to GPU count — the maths that connects users to hardware
  • Part 3: Precision, platform selection, and reference designs — where the maths meets the metal

The natural next conversation is the one that follows a sizing recommendation: how does an on-premises PowerEdge deployment compare to cloud over three years? That’s the cost modelling discussion — and it’s where a well-sized on-premises platform often tells a very different story to the cloud bill the customer is currently paying.



When the Wine Lies: Finding Quality in the Chemistry

Illustration of a wine bottle with measurements and analysis labels including alcohol, volatile acidity, sulphates, pH, and density, alongside a scatter plot showing data points with ratings.

Sometimes the data tells you what your wallet couldn’t.


Last weekend I cooked steak. Proper steak — the kind that deserves a decent red wine alongside it. So I did what any self-respecting wine buyer does: I spent more than usual. Higher price, better wine. That’s how it works, right?

Wrong.

The wine was awful. Bitter, sharp, aggressive — more paint stripper than Pinot. The kind of wine that makes you wonder whether the person who priced it had ever actually tasted it. As I pushed the glass aside and reached for the water, a question formed: in the real world, how do we actually measure wine quality?

Not price. Clearly not price. So what then?


Meet Charlie and Clare

Regular readers will know Charlie and Clare. Charlie is our Data Engineer — he builds the pipelines, aggregates the sources, and delivers clean, structured data. Clare is our Data Analyst and Data Scientist — she works with what Charlie hands her and finds the patterns worth knowing about.

This week they’re working with wine.

Charlie has been busy. A winery client wants to understand quality across their production batches. The data lives in multiple places — laboratory analysis systems, fermentation monitoring sensors, batch production records. Individually, each source is a fragment. Together, they tell a story. Charlie’s data platform pulls from all of them, normalises the formats, handles the joins, and delivers Clare a clean pipeline: 1,599 red wine samples, each described by eleven physicochemical measurements.

Alcohol content. Fixed and volatile acidity. Citric acid. Sulphates. Residual sugar. Chlorides. Density. pH. Free and total sulphur dioxide.

Eleven numbers per wine.

And crucially — no quality labels.


Clare’s Problem

Clare looks at the dataset and faces a genuinely interesting challenge. There are no categories here, no pre-assigned groups, no right answers to train against. Just eleven continuous measurements and 1,599 rows of chemistry.

This is the domain of unsupervised learning — the branch of machine learning that finds structure in data without being told what to look for. Where supervised learning optimises toward a target, unsupervised learning asks a different question entirely: what patterns exist that we haven’t defined yet?

Clare’s task is to let the data organise itself, then ask whether the organisation means anything.

Before she touches a model, she does something essential. She scales the data.

This matters more than it sounds. The eleven features live on very different scales — alcohol ranges from roughly 8 to 15, total sulphur dioxide from 6 to 289. Feed raw numbers into a distance-based algorithm and the large-scale variables will dominate purely through magnitude, drowning out the signal from smaller-range features. StandardScaler transforms everything to zero mean and unit variance — now every feature competes on equal terms.
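
What the scaling step does is easy to see on a handful of numbers. This sketch reproduces the same zero-mean, unit-variance transform with the standard library only. Like StandardScaler, it divides by the population standard deviation. The sample values are illustrative and are not taken from the winery dataset.

```python
from statistics import fmean, pstdev

def standardise(values):
    """Z-score scaling: subtract the mean, divide by the population std dev."""
    mean = fmean(values)
    std = pstdev(values)
    return [(v - mean) / std for v in values]

# Illustrative values only: two features on very different scales.
alcohol = [9.4, 9.8, 10.5, 12.0, 13.1]
total_so2 = [34.0, 67.0, 54.0, 120.0, 210.0]

scaled_alcohol = standardise(alcohol)
scaled_so2 = standardise(total_so2)

print([round(v, 2) for v in scaled_alcohol])
print([round(v, 2) for v in scaled_so2])
```

After scaling, both columns sit in the same range, so neither dominates a distance calculation simply because of its units.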

Charlie’s pipeline has already handled the missing values and format inconsistencies. Clare inherits clean data. That’s not an accident — it’s the platform working as intended.


Eleven Dimensions Are Too Many to See

Before clustering, Clare reduces complexity using Principal Component Analysis — PCA.

Think of PCA as finding the angles from which the data is most spread out. Eleven features create eleven dimensions, which is impossible to visualise and cognitively overwhelming to reason about. PCA finds new axes — principal components — that capture the maximum variance in the fewest dimensions.

The results are telling. Nine components are needed to explain 95% of the variance. No single axis dominates. The data genuinely is high-dimensional — there’s no shortcut that captures most of the story. The first two components together explain just 45.7% of variance.
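
The "how many components for 95%?" question is a cumulative sum over the explained-variance ratios. In the sketch below the ratios are made up for illustration: I chose them so the first two add up to 45.7% and the ninth crosses 95%, matching the two figures above. They are not Clare's actual PCA output.

```python
def components_for(explained_variance_ratio, threshold=0.95):
    """Number of leading components needed to reach the variance threshold."""
    cumulative = 0.0
    for count, ratio in enumerate(explained_variance_ratio, start=1):
        cumulative += ratio
        if cumulative >= threshold:
            return count
    return len(explained_variance_ratio)

# Illustrative ratios only (eleven components, summing to 1.0).
ratios = [0.28, 0.177, 0.14, 0.11, 0.09, 0.06, 0.05, 0.04, 0.03, 0.015, 0.008]

needed = components_for(ratios)
first_two = sum(ratios[:2])
print(f"{needed} components reach 95%; first two explain {first_two:.1%}")
```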

That’s an important caveat Clare keeps front of mind: when she later visualises clusters on a two-dimensional PCA plot, she’s seeing less than half the structure. The scatter plot is illustrative, not definitive.

PC1 is driven primarily by acidity-related features — it broadly separates sharper, more acidic wines from rounder ones. PC2 captures alcohol and fermentation character — higher alcohol and sulphate concentrations, reflecting more complete fermentation and stronger microbial stability. Even this compressed view starts to suggest that wine chemistry has meaningful directions of variation.


How k-means Works — The Wine Tasting Table

Imagine Clare pours all 1,599 wine samples into glasses and lines them up on a long tasting table. She doesn’t know how many groups there are yet, but she suspects wines with similar chemistry will naturally belong together.

She picks three glasses to act as reference points (her starting cluster centres; three is her “k”) and assigns every other wine to whichever reference glass it’s closest to, chemically speaking. Then she looks at each group, works out the average chemistry of all its members, and moves her reference point to that average, which is an imaginary blended glass rather than any real one on the table. Wines get reassigned. Reference points shift again. The process repeats until nothing moves anymore: the groups have stabilised around their natural centres.

That’s k-means. Not magic, not mystery. An algorithm that keeps nudging reference glasses along the table until the groupings settle. The “k” is simply the number of reference points — Clare’s job is to choose it wisely, which is where the elbow method comes in.
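The tasting-table loop can be sketched in a few lines of NumPy. This is a minimal illustration of the standard assign-then-average loop on synthetic two-dimensional data, not Clare's pipeline. The function names, the seeds and the three made-up blobs are assumptions. A library implementation such as scikit-learn's KMeans adds smarter initialisation and multiple restarts.

```python
import numpy as np


def assign(X, centres):
    """Label each point with the index of its nearest centre."""
    dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    return dists.argmin(axis=1)


def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Pick k distinct samples as the starting "reference glasses".
    centres = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = assign(X, centres)
        # Move each centre to the mean of its members (keep it if the group is empty).
        new_centres = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        settled = np.allclose(new_centres, centres)
        centres = new_centres
        if settled:
            break
    return assign(X, centres), centres


# Synthetic data: three well-separated blobs of 50 points each.
rng = np.random.default_rng(1)
blob_centres = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(size=(50, 2)) for c in blob_centres])

labels, centres = k_means(X, k=3)
inertia = float(((X - centres[labels]) ** 2).sum())
total_ss = float(((X - X.mean(axis=0)) ** 2).sum())
print("cluster sizes:", np.bincount(labels, minlength=3))
print(f"inertia {inertia:.1f} versus total sum of squares {total_ss:.1f}")
```

Because the starting glasses are chosen at random, different seeds can settle on different groupings. That is one reason library implementations rerun the loop several times and keep the best result.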


Letting the Data Find Its Own Groups

Clare runs k-means clustering — an algorithm that partitions observations into k groups by minimising the distance between each point and its cluster centre.

The question is: what should k be?

She uses two methods in parallel. The elbow method plots inertia (the total within-cluster sum of squared distances from each point to its cluster centre) against increasing values of k. As k grows, inertia falls; the question is where the rate of improvement flattens. The silhouette coefficient measures how well each point sits within its assigned cluster compared to the nearest neighbouring cluster. It ranges from -1 to 1, and higher is better.

Both methods point toward k=3 as a defensible choice. Three groups. Clare fits the final model.
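To make the two measures concrete, here is a hedged sketch that computes inertia and the mean silhouette coefficient by hand on four made-up points. The tiny dataset and function names are assumptions for illustration. On real data, scikit-learn's KMeans exposes inertia and silhouette_score computes the second measure.

```python
import numpy as np


def inertia(X, labels):
    """Sum of squared distances from each point to its own cluster mean."""
    total = 0.0
    for j in np.unique(labels):
        members = X[labels == j]
        total += float(((members - members.mean(axis=0)) ** 2).sum())
    return total


def silhouette(X, labels):
    """Mean silhouette coefficient over all points."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        if not same.any():
            scores.append(0.0)  # usual convention for a single-member cluster
            continue
        a = dists[i, same].mean()  # mean distance to own cluster
        b = min(  # mean distance to the nearest other cluster
            dists[i, labels == j].mean()
            for j in np.unique(labels) if j != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))


# Four made-up one-feature samples forming two obvious pairs.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
good_labels = np.array([0, 0, 1, 1])  # pairs grouped sensibly
bad_labels = np.array([0, 1, 0, 1])   # pairs split across clusters

good_inertia, good_sil = inertia(X, good_labels), silhouette(X, good_labels)
bad_inertia, bad_sil = inertia(X, bad_labels), silhouette(X, bad_labels)
print(f"sensible grouping: inertia={good_inertia:.1f}, silhouette={good_sil:.3f}")
print(f"poor grouping:     inertia={bad_inertia:.1f}, silhouette={bad_sil:.3f}")
```

The sensible grouping has low inertia and a silhouette close to 1. The poor grouping has high inertia and a negative silhouette. Real data, as the post goes on to show, tends to sit somewhere far less clear-cut.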

The clusters contain 722, 502, and 375 wines respectively.


What the Clusters Actually Found

Now Clare looks at the chemistry of each group.

Cluster 1 — 502 wines — stands out immediately. Highest alcohol (10.72%), lowest volatile acidity (0.41 g/dm³), highest sulphates (0.75 g/dm³). These are markers experienced winemakers recognise: lower volatile acidity means less acetic acid, less of that sharp, vinegary edge. Higher sulphates support microbial stability and structure.

Cluster 0 — 722 wines — shows the inverse pattern. Highest volatile acidity (0.61), lowest sulphates (0.61). More of that aggressive sharpness Clare’s colleague experienced at the weekend.

Cluster 2 — 375 wines — is characterised by elevated sulphur dioxide levels and the lowest alcohol of the three groups (9.88%), suggesting less complete fermentation.

Three chemical profiles. Found without a single quality label in sight.


The Reveal

Now Clare looks at quality.

She takes the quality scores — held back throughout the entire analysis — and calculates the mean score per cluster. This is post-hoc interpretation only. The clustering didn’t know about quality. But the results are striking.

Cluster | Profile | Mean Quality
1 | High alcohol, low volatile acidity, high sulphates | 5.96
0 | High volatile acidity, lower sulphates | 5.55
2 | High sulphur dioxide, lowest alcohol | 5.36
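The post-hoc step itself is a simple group-and-average. The sketch below shows the shape of that calculation on a handful of hypothetical (cluster, quality) pairs; the values are invented and will not reproduce the means in the table. With a DataFrame, a pandas groupby would likely be the more natural tool.

```python
from collections import defaultdict

# Hypothetical (cluster label, held-back quality score) pairs; not the real wines.
samples = [(0, 5), (0, 6), (1, 6), (1, 7), (1, 6), (2, 5), (2, 5), (0, 5)]


def mean_quality_by_cluster(pairs):
    """Average the quality scores within each cluster label."""
    totals = defaultdict(lambda: [0, 0])  # label -> [sum of scores, count]
    for cluster, quality in pairs:
        totals[cluster][0] += quality
        totals[cluster][1] += 1
    return {label: s / n for label, (s, n) in sorted(totals.items())}


means = mean_quality_by_cluster(samples)
best = max(means, key=means.get)
for label, value in means.items():
    print(f"cluster {label}: mean quality {value:.2f}")
```

The key design point is that the quality scores are only joined back on after the cluster labels exist, so they cannot have influenced the grouping.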

The algorithm, working only from chemistry, has separated wines in a way that aligns meaningfully with human quality judgement. The group with the most favourable chemical profile scores highest. The group with the lowest alcohol and elevated sulphur dioxide scores lowest, with the high-volatile-acidity group sitting in between.

Clare didn’t tell the model what quality meant. The chemistry already knew.


An Honest Number

The silhouette score is 0.19.

By textbook standards that’s weak. Some analysts would look at that number and worry. Clare doesn’t, and it’s worth understanding why.

Wine chemistry is continuous. There are no hard walls between a quality-6 wine and a quality-7 wine — no moment where one chemical compound crosses a threshold and suddenly the wine is better. The boundaries between clusters are gradual, overlapping, real-world messy. A low silhouette score in this context isn’t a sign that the analysis failed. It’s information about the nature of the data itself.

The clusters are soft. The patterns are genuine. These two things are not contradictory.

This matters for how Clare reports her findings. She isn’t presenting three neat buckets — “good wine, average wine, poor wine.” She’s presenting three chemical tendencies, with meaningful separations on the features that wine science already tells us matter.


Why Charlie’s Platform Made This Possible

It’s worth pausing on something easy to take for granted.

Clare’s analysis worked because she had complete, clean, comparable data. Every one of the 1,599 samples described by the same eleven features, scaled and pipeline-ready.

In the real world, that’s rarely the starting point. Laboratory analysis lives in one system. Sensor data from fermentation monitoring lives in another. Batch and production records in a third. Pricing and commercial data somewhere else entirely. Each system uses different formats, different naming conventions, different update frequencies.

Without Charlie’s data platform aggregating those sources into a coherent, governed pipeline, Clare isn’t doing unsupervised learning on 1,599 wines. She’s manually reconciling spreadsheets and hoping nothing got lost in the joins.

The insight, that chemical profile aligns with perceived quality without any reference to price, is only discoverable when the data foundation exists to support the question. Structure has to be built in, not bolted on.


What Clare Does Next

Unsupervised learning is Clare’s first move with an unfamiliar dataset. It reveals what’s there before asking what predicts what.

The natural next step is supervised learning. Now that three chemical profiles have been identified, Clare can use cluster membership to inform stratified sampling — ensuring any training dataset for a quality prediction model includes representative coverage of all three groups rather than accidentally over-representing one chemical type.
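One way to picture that stratified sampling is sketched below with the standard library only. The cluster sizes are the ones reported earlier; the 20% test fraction, the seed and the function name are assumptions. With scikit-learn, passing the cluster labels to the stratify argument of train_test_split should give a similar effect.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Split indices so each cluster keeps roughly the same share in both sets."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for idx, label in enumerate(labels):
        by_group[label].append(idx)

    train, test = [], []
    for indices in by_group.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Cluster membership with the sizes reported above: 722, 502 and 375 wines.
labels = [0] * 722 + [1] * 502 + [2] * 375
train_idx, test_idx = stratified_split(labels)

test_counts = Counter(labels[i] for i in test_idx)
print("test set per cluster:", dict(sorted(test_counts.items())))
print("train size:", len(train_idx), "test size:", len(test_idx))
```

Each cluster contributes about a fifth of its wines to the test set, so no chemical profile is accidentally left out of either side of the split.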

She could also bring price into the analysis. If Charlie’s platform connects to commercial data, Clare can ask the question that started this whole investigation: does chemical profile correlate with price? Is the expensive-but-terrible wine an outlier, or is the price-quality assumption systematically weak across the portfolio?

That’s a question worth answering before anyone’s next steak dinner.


The Takeaway

Eleven numbers. No labels. Three meaningful groups.

Clare found chemical structure that aligns with human quality judgement — not because she told the algorithm what quality meant, but because the chemistry already encoded it. Unsupervised learning didn’t give her answers. It gave her the right questions.

And behind all of it, doing the unglamorous work that makes the glamorous work possible, was Charlie’s data platform.

Quality has to be found in the data. But first, the data has to be there to find it.


Next in the series: Clare takes the cluster profiles into supervised learning — and finds out whether chemistry can predict quality well enough to save the rest of us from expensive mistakes.

Ethics in AI: Part 5

An infographic illustrating the feedback loops in content recommendation and credit approval systems, emphasizing the cumulative effect of small errors on structural harm.

Social Impact

Years ago — life before iTunes, before Spotify, before algorithms knew what you liked before you did — I heard the most beautiful song on the radio.

I rushed for a pen and paper. Too late. The DJ had already moved on. I had no title, no artist, no way to find it. Just the music, lodged somewhere in memory, with nowhere to go.

That song stayed with me for years.

Then one day I heard it again. I recognised it immediately — the same ache, the same sound. This time I was ready. Moments later I was on Amazon, searching for what I’d finally caught: Nick Drake. River Man. I bought the CD — Five Leaves Left — and discovered one of the most quietly extraordinary artists I have ever heard.

Nick Drake never had a hit in his lifetime. He sold a few thousand records. He died at twenty-six, largely unknown. His reputation grew slowly, entirely through human recommendation — one person telling another, a song surfacing unexpectedly on a radio programme, a stranger pointing someone in the right direction. Decades after his death, he is considered a towering influence.

I think about that story when I think about what recommendation algorithms do — and what they can’t do.

Those moments of genuine discovery are becoming rarer. And it is not an accident.


In previous parts of this series we have examined bias and fairness, privacy and consent, and transparency and explainability. Each of those topics asks what happens when AI gets something wrong in the moment — a biased decision, a privacy violation, an unexplained outcome. Social impact asks a different and harder question: what happens when AI gets things right, consistently, at scale — and the cumulative effect is still harmful?

This is the part of AI ethics that is easiest to overlook. There is no single decision to challenge. No obvious moment of failure. Just a system doing exactly what it was designed to do, and a world quietly changing around it.


The Feedback Loop

Machine learning systems influence social structures when deployed at scale. Decisions about credit approval, employment screening, content recommendation, and public resource allocation affect opportunities and outcomes for individuals and communities — not just once, but continuously, and often invisibly.

The mechanism behind many social harms is the feedback loop: a system trained on past behaviour makes decisions that shape future behaviour, which then becomes the training data for the next version of the model. Each cycle reinforces what came before. Small biases become structural ones. Initial disparities widen. And because every individual decision appears reasonable, the cumulative drift goes unnoticed until the damage is done.


Example One: The Playlist That Narrows

Consider a music streaming platform that recommends songs based on what users have previously listened to. A user starts with a few popular mainstream artists. The system, doing its job, recommends more of the same. Over time, the user is repeatedly exposed to the same genres, the same sounds, the same familiar names — while niche and emerging artists remain invisible.

The feedback loop runs like this: past listening shapes recommendations, recommendations reinforce listening patterns, and those patterns feed the next round of recommendations. Popular artists become more popular. Smaller artists remain underrepresented. Not because the algorithm intended to marginalise them — but because it was optimised for engagement, and engagement follows familiarity.
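A toy model can make that loop visible. In the sketch below, each cycle gives already-popular artists slightly more than their proportional share of recommendations, and listening then follows exposure. The starting shares, the `boost` exponent and the number of cycles are all assumptions chosen for illustration; no real platform is being modelled.

```python
def next_shares(shares, boost=1.1):
    """One recommendation cycle.

    `boost` > 1 is an assumed knob: popular artists receive slightly more
    than their proportional share of exposure, and listening follows exposure.
    """
    weights = [s ** boost for s in shares]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical starting listening shares for four artists.
shares = [0.40, 0.30, 0.20, 0.10]
history = [shares]
for _ in range(10):
    shares = next_shares(shares)
    history.append(shares)

print("start:          ", [round(s, 3) for s in history[0]])
print("after 10 cycles:", [round(s, 3) for s in history[-1]])
```

No single cycle changes much, yet the leading artist's share rises every round while the smallest artist's share shrinks. With `boost` set to exactly 1 the shares would never move, which shows how small the per-cycle tilt needs to be.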

Each recommendation, taken alone, is perfectly reasonable. A user who likes one thing probably likes similar things. But the cumulative effect reshapes what people discover, what gains cultural traction, and ultimately who earns a living from their music. A technical optimisation becomes a cultural force. And nobody pressed a button that said “narrow the culture.”


Example Two: The Loan That Was Never Offered

Now consider a model used to decide who gets approved for a loan.

If the historical data that trained the model reflects decades of biased lending practices — and it often does — the model will learn to reject applicants from certain demographic groups at higher rates. It is not making a racist decision in the way a human might. It is making a statistically grounded one, based on patterns in the data. But those patterns are the residue of past discrimination.

The feedback loop here is more severe, and the stakes are higher:

  • Fewer approved loans → fewer opportunities to build credit, start businesses, or buy homes
  • Fewer opportunities → continued financial disadvantage
  • Continued disadvantage → future data that confirms the model’s original assessment

The system appears statistically accurate. It is. And it is also socially harmful. The two things are not mutually exclusive — which is what makes this so difficult to resolve by purely technical means.
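The same dynamic can be sketched with two hypothetical applicants who start either side of an approval threshold. Every number here (the scores, the threshold, the gain from an approved loan and the loss from a rejection) is an assumption for illustration only, and real credit scoring is far more complex.

```python
def simulate(score_a, score_b, threshold=600, rounds=5, gain=15, loss=5):
    """Track two applicants' scores when approval itself changes future scores.

    Approved applicants build credit (+gain); rejected applicants lose ground (-loss).
    """
    history = [(score_a, score_b)]
    for _ in range(rounds):
        score_a += gain if score_a >= threshold else -loss
        score_b += gain if score_b >= threshold else -loss
        history.append((score_a, score_b))
    return history


# Two hypothetical applicants starting 20 points apart, either side of the threshold.
history = simulate(score_a=610, score_b=590)
start_gap = history[0][0] - history[0][1]
end_gap = history[-1][0] - history[-1][1]
print("scores by round:", history)
print(f"gap at start: {start_gap}, gap after 5 rounds: {end_gap}")
```

At every round the model's decision looks justified by the current scores, yet the gap between the two applicants grows from 20 points to 120 because each decision feeds the next round's data.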


Scale Changes Everything

A small systematic error, repeated at scale, produces significant societal consequences. This is the central insight of social impact analysis in AI.

A single biased loan decision is a wrong that can be appealed. A biased model making ten thousand decisions a day, over years, without review, is a structural shift in who gets access to capital. A streaming algorithm that slightly deprioritises independent artists across a platform of three hundred million users does not just affect listening habits — it shapes the economics of an entire industry.
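The arithmetic of scale is worth writing down. The daily volume below comes from the paragraph above; the five-year horizon and the 1% systematic error rate are assumptions picked purely to show the order of magnitude.

```python
DECISIONS_PER_DAY = 10_000   # figure from the text
YEARS = 5                    # assumed horizon
ERRORS_PER_10K = 100         # assumed: 100 wrong decisions per 10,000, i.e. 1%

total_decisions = DECISIONS_PER_DAY * 365 * YEARS
affected = total_decisions * ERRORS_PER_10K // 10_000

print(f"total decisions: {total_decisions:,}")       # 18,250,000
print(f"decisions affected by the error: {affected:,}")  # 182,500
```

An error rate that would look negligible in a model evaluation report translates, under these assumptions, into well over a hundred thousand affected decisions.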

This is why social impact analysis requires evaluating not only individual predictions but cumulative effects. It requires monitoring mechanisms capable of detecting harm early — before it becomes entrenched. It requires stakeholder engagement with affected communities, because the people most likely to identify risks are often the ones the system is making decisions about. Technical analysis, however rigorous, cannot see what it has not been designed to look for.

Machine learning systems are not neutral tools. They are components of socio-technical systems — embedded in institutions, shaped by history, and capable of reinforcing or redirecting the structures they operate within. Their evaluation must extend beyond statistical metrics to include institutional and societal considerations. That is not a soft requirement. It is an engineering one.


Asking “does the model perform well?” is no longer sufficient. The question that matters is: “What does the world look like after this model has been running for five years?”

Social impact has to be built in, not bolted on.


Next: Part 6 — Ethical Trade-offs. The honest conclusion: there are no perfect answers. Only deliberate choices.

Do AI Projects Fail — Or Do We Fail AI?

Blueprint illustration featuring a bird labeled 'DIStraction VECTOR,' and a tower structure with three layered sections labeled 'PROCESS ALIGNMENT,' 'SKILLS & PEOPLE,' and 'DATA FOUNDATIONS.'

If you look back over the last 30 years, our technology history is riddled with failed projects, projects that never got off the ground, and projects that went massively over budget. And then there are the scandals.

The UK Post Office Horizon scandal stands as one of the most serious technology failures in modern British history. Between 1999 and 2015, around 1,000 sub-postmasters were wrongfully prosecuted after the Fujitsu-supplied Horizon accounting software recorded losses that did not exist. The total cost of redress now stands at around £2 billion. More than 13 people took their own lives. The technology did not just fail — it was trusted when it should have been questioned, and the consequences were devastating.

AI is just another technology. That is worth saying plainly. Although many of the possibilities being talked about are genuinely achievable, reality always kicks in. Every project — AI or otherwise — is a complicated combination of people, software, hardware, and process. That has been true for 30 years. It remains true now.

But we can always learn from the past. The question is whether we choose to.

AI is no different. And right now, the data on AI project failure should give everyone pause. The numbers are in. And they are not improving.

RAND Corporation, in one of the most rigorous independent analyses of AI project failure to date, interviewed 65 experienced data scientists and engineers across industries and company sizes. Their finding: more than 80% of AI projects fail to reach meaningful production deployment. That is twice the failure rate of IT projects without an AI component.

S&P Global’s 2025 survey of over 1,000 organisations across North America and Europe found that 42% of companies abandoned most of their AI initiatives that year. In 2024, that figure was 17%. The abandonment rate more than doubled in a single year. MIT’s Project NANDA, published in July 2025, found that 95% of organisations deploying generative AI saw zero measurable return. Not low return. Zero.

With global AI spending projected to reach $630 billion by 2028, these failure rates are not just a statistic. They represent hundreds of billions of dollars in wasted investment, stalled initiatives, and businesses no closer to the outcomes they were promised.

What makes this harder to ignore is that the failure rate is moving in the wrong direction. The technology is more capable than it has ever been. The investment is larger than it has ever been. And yet more organisations are abandoning AI initiatives today than they were twelve months ago.

So what is actually going wrong?


The Research Points to Three Things

Across RAND, McKinsey, Gartner, MIT, and Informatica’s CDO Insights survey, the diagnosis is remarkably consistent. AI projects fail for three reasons, and they repeat themselves across industries, geographies, and organisation sizes.

The wrong use case. The process targeted for AI is not the right one, the problem definition is vague, or the initiative is chasing a technology rather than solving a business problem.

Data that is not ready. The data that would be needed to make AI work in production does not exist in the form required — it is fragmented, inconsistent, ungoverned, or simply not there.

A skills gap. The people needed to build, deploy, and sustain AI in a business context are not in place — and the organisation has not yet found a way to close that gap.

None of these are surprises. But the uncomfortable truth is that most organisations are still walking into all three of them, often simultaneously.


The Wrong Use Case: The Magpie Effect at Work

The first failure mode — choosing the wrong process to target — is something I have written about in depth. The Magpie Effect describes what happens when AI strategy is driven by possibility rather than process: the endless pivot towards the latest model, the newest capability, the most impressive vendor demo. Every pivot consumes time, burns budget, and erodes the momentum that comes from doing one thing properly and scaling it.

The RAND report is direct on this point. Stakeholders often misunderstand or miscommunicate what problem needs to be solved. Models get built and deployed optimised for the wrong metrics, or ones that simply do not fit into the real business workflow.

The antidote is starting with the Golden Process — the one business process that everything else hinges on — and asking what AI can do to remove the constraints within it. That conversation has to happen before any vendor is in the room, before any use case is evaluated, and certainly before any infrastructure is selected.

If you have not read the Magpie Effect post, the practical AI test at the end of it is a useful filter for any use case that lands on your desk.


Data That Is Not Ready: The Silent Failure

The second failure mode is the one that catches organisations by surprise — because it tends not to surface until a project is already in trouble.

The pattern is familiar. A proof of concept is built on a carefully selected, cleaned-up sample dataset. The demo runs well. Leadership approves production. And then everything stalls.

Production data is fragmented across systems that were never designed to talk to each other. Basic business terms — “customer”, “order”, “revenue” — are defined differently across departments. Historical records have gaps. Formatting is inconsistent. The clean sample that powered the demo bears almost no resemblance to the messy reality of how the business actually runs.

Informatica’s 2025 CDO Insights survey found that data quality and readiness was the top obstacle to AI success, cited by 43% of organisations. McKinsey found that organisations reporting significant AI returns are twice as likely to have invested in data infrastructure before selecting modelling techniques. Gartner predicts that 60% of AI projects lacking AI-ready data will be abandoned entirely.

This is not a new problem. It is the same problem that has existed since organisations first tried to build anything useful on top of their data. The difference is that AI amplifies it. A flawed report can be corrected. A model trained on broken foundations will confidently produce broken outputs at scale — and the damage compounds before anyone notices.

I have covered the data readiness problem in detail in Garbage In, Expensive Garbage Out and A Million Rows of Nothing. The short version: AI does not fix bad data. It scales it.


The Skills Gap: The Barrier Nobody Budgets For

The third failure mode is the one that receives the least attention — and may be the hardest to solve quickly.

The last 30 years of enterprise IT have built deep, hard-won expertise in infrastructure. Server architecture. Storage design. Networking and virtualisation. Security and compliance. That expertise is not obsolete — it remains essential. The physical and virtual foundations that AI runs on still need people who understand them properly.

But AI demands a different and additional skill set that most organisations are still building.

Traditional IT Skills (Still Relevant) | New Skills Now Required
Server architecture and management | Data engineering
Storage design and optimisation | AI/ML engineering
Networking and virtualisation | Data science
Security and compliance | Business domain knowledge

McKinsey’s 2024 State of AI survey found that 58% of businesses are hampered by internal AI skill shortages. Informatica’s CDO Insights survey placed skills and data literacy third in the list of top AI obstacles. PwC found that almost 65% of executives acknowledge their AI initiatives are not succeeding because of a lack of executive sponsorship — a leadership gap that is itself a symptom of not having people in the room who can translate between business outcomes and AI capability.

The skills gap is not a technology problem. It is a people and partnership problem. And it is the reason most AI pilots never leave the room they were born in.

Picture a typical scenario. An organisation identifies a strong AI use case, aligned to a real business process. The data is in reasonable shape. Leadership is supportive. A vendor is engaged. And then the questions start: who is going to build the data pipeline? Who owns the model in production? Who bridges the gap between what the model does and what the business actually needs it to do? The room goes quiet. Not because the will isn’t there — but because the people aren’t.

The honest message is that organisations do not need to replace their existing IT teams — they need to extend them. The people who understand the infrastructure are still essential. But they need colleagues who understand data pipelines, model behaviour, and the business processes those models need to serve. That combination is rare. It takes time to build. And in its absence, even the right use case with clean data will stall.

This is why partner selection matters as much as use case selection. The right partner does not just bring technical capability; they bring the scars of what does not work. An AI and data practice built over years has already made the mistakes most organisations have not made yet. That is not a credential. That is insurance.

AI is not a product. It is an ecosystem built on partnership.


Three Causes. One Pattern.

What the research is describing — even when it does not use these words — is organisations that jumped to the model before they had solved the three problems that sit upstream of it.

They chased the possibility rather than the process. They skipped the data foundations. And they underestimated how different the skills requirement would be.

McKinsey’s summary of what separates the 6% of genuine AI high performers from everyone else is as sharp as it gets: AI is 20% algorithms, 80% organisational rewiring.

The organisations building durable AI capability are not necessarily the ones with the most sophisticated models. They are the ones that got the process right, made the data ready, and put the right people in place — before they wrote a single line of model code.

The failure statistics look alarming until you understand the causes. Then they look entirely predictable.

The good news: all three are solvable. None of them require waiting for the next frontier model. They just require doing the less glamorous work first.

I have written about what that looks like in practice. Getting AI Right First Time sets out a five-step path from honest awareness through to durable, enterprise-wide capability — the journey from AI as a science project to AI as part of how the business actually runs. And In Praise of Boring, Everyday AI makes the case for what success actually looks like in production: not the demo, not the headline, but the quiet system that runs on a Tuesday afternoon without anyone noticing — because it has simply become part of how the business works.

That is the goal. Not AI that impresses. AI that endures.


AI success has to be built on the right foundations — not retrofitted onto broken ones.

A Million Rows of Nothing

A graphic illustrating a grid labeled 'A MILLION ROWS OF NOTHING,' featuring numerical values, with most cells showing '0.00' and select cells highlighted in orange displaying '1.00.' A crossed-out server icon is on the left, and a note at the bottom reads 'DO NOT SKIP.'

Why business use case and data strategy must come before AI strategy

At a customer event last year, an IT Director told me — with some confidence — that they already had an AI strategy.

“Great,” I said. “Now tell me about your data strategy.”

“We have 250TB,” he replied.

I nodded. And thought: there is a very big difference between data and storage.

That moment has stayed with me because it wasn’t an isolated conversation. It was a pattern. Organisations are arriving at the AI table with infrastructure plans, vendor commitments and boardroom ambition — but without first validating the business use case, the predicted ROI, or the data required to support either one.

That is the gap. And it is an expensive one.


The gap is earlier than most organisations think

Walk any AI conference floor and the energy is real. The technology is genuinely impressive. GPU servers are being specced, procured and racked. Data scientists are being hired. AI roadmaps are being presented to boards.

And somewhere near the bottom of the slide deck, almost as an afterthought: “We’ll need to look at data readiness.”

For organisations serious about AI delivering real outcomes, this is the wrong order.

The first question should not be “what infrastructure should we buy?” It should be “what business problem are we solving, what return do we expect, and does our data actually support that outcome?”

If those questions haven’t been answered, the AI strategy isn’t yet a strategy. It’s an ambition.


Start with the business use case and predicted ROI

Before talking about models or servers, organisations need clarity on three things: what specific business problem are we trying to solve, what result would make this investment worthwhile, and what evidence suggests the data can support that result?

This matters because businesses don’t invest in AI for the sake of AI. They invest in outcomes — lower cost, higher revenue, reduced risk, better service, faster decisions, improved productivity.

The business use case and predicted ROI have to come first. They set the standard the data must meet, the model must prove, and the infrastructure must eventually support. Without that anchor, teams end up building technical capability in search of commercial justification.


Then comes data strategy

This is where many organisations confuse capacity with capability.

Saying “we have 250TB” is not describing a data strategy. It is describing a storage estate.

A real data strategy answers different questions. What data actually matters for the use case? Where does it live? Who owns it? How is it governed? How trustworthy is it? How easily can it be accessed, joined, prepared and used?

AI doesn’t begin with infrastructure. It begins with understanding whether the organisation has data that is usable, relevant, governed and connected to a business objective. That is why data strategy has to come before AI strategy. If you don’t understand the asset you’re asking AI to learn from, you don’t yet know whether the strategy is viable.


Data engineering is not pre-work. It is the work.

The foundational argument is simple, even if it’s routinely ignored: data engineering is not a precursor to AI work. It is AI work.

The pipelines, the schemas, the quality checks, the lineage, the transformation logic — these are not the boring bit before the interesting bit starts. They are the work.

A model is only ever as good as the data it learns from. If that data is incomplete, inconsistently formatted, poorly labelled or structurally flawed, the model will learn the wrong things with great efficiency. Garbage in, amplified garbage out at scale.

The data engineering layer needs to be in place — and understood — before a model is trusted in production. That means clean, documented pipelines with known lineage, a clear system of record for the domain you’re working in, variables that are what they say they are, and critically — someone who has actually interrogated the data, not just counted the rows.

250TB of storage tells you nothing about any of that.


Even clean data can still be useless

Here is where the conversation gets more uncomfortable. Because the problem isn’t always dirty data.

Sometimes the data looks clean. The schema is tidy. The row counts are impressive. The formatting is consistent. It passes the hygiene checks. And then you run the analysis — and discover the data tells you very little. Not because it’s messy. Because it’s empty of useful signal.

This is the moment EDA — Exploratory Data Analysis — earns its place. Not as a technical formality, not as a box to tick before the real work starts, but as the moment of truth. The point at which you find out whether your data can actually answer the question you’re asking of it.

That means looking at distributions, missingness, outliers, feature relationships, basic correlations, and whether the patterns you expected to see are actually present. If they aren’t, that isn’t a minor issue. It’s the whole issue.


A million rows of nothing is still nothing

This is why volume can be so misleading.

Take a look at this correlation matrix.

A correlation matrix displaying the relationships among numerical features including Price, Discount, Tax Rate, Stock Level, Customer Age Group, Shipping Cost, Return Rate, Seasonality, and Popularity Index. The matrix is color-coded with a gradient scale from blue to red indicating strength of correlation.

To the untrained eye it looks impressive. Professional. The kind of output that gets nodded at in a boardroom. But look closer. That red diagonal? Every variable correlating perfectly with itself — mathematically guaranteed, analytically meaningless. Everything else is zero. Price and Discount: no relationship. Seasonality and Stock Level: no relationship. Shipping Cost and Return Rate: no relationship.

In a real retail dataset those relationships should exist. The fact that this data shows none of them is a signal worth taking seriously. A flat correlation view doesn’t prove there is absolutely nothing to learn — but it does tell you there is no obvious predictive signal in this view of the data. That should trigger caution, not confidence.
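A quick check along these lines can be run before any modelling starts. The sketch below builds two synthetic tables with NumPy: one where every column is independent noise, and one where a discount column is made to depend on price. The column names, the row count, the seed and the planted relationship are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n_rows = 100_000
names = ["price", "discount", "tax_rate", "stock_level"]  # hypothetical columns

# Table 1: every column is independent noise, so there is nothing to learn.
empty = rng.normal(size=(n_rows, len(names)))

# Table 2: same noise, except discount is made to depend on price.
signal = empty.copy()
signal[:, 1] = 0.8 * signal[:, 0] + 0.6 * rng.normal(size=n_rows)


def max_off_diagonal(X):
    """Largest absolute correlation between two different columns."""
    corr = np.corrcoef(X, rowvar=False)
    off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
    return float(np.abs(off_diag).max())


empty_max = max_off_diagonal(empty)
signal_max = max_off_diagonal(signal)
print(f"strongest off-diagonal correlation, noise table:  {empty_max:.3f}")
print(f"strongest off-diagonal correlation, signal table: {signal_max:.3f}")
```

Both tables have the same number of rows and the same tidy shape. Only the off-diagonal values reveal which one contains a relationship worth modelling. Correlation only captures linear relationships, so a flat result is a prompt for further questions rather than a final verdict.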

You shouldn’t respond by buying more infrastructure. You should respond by asking better questions. Are these the right features? Is the data aggregated at the wrong level? Are important variables missing? Is the business question badly framed? Are we trying to predict something the data cannot meaningfully support?

If you can’t answer those questions, you are not ready to build the model. You are ready to do more analysis.


Model readiness comes after data readiness

Only once the business case is clear and the data has been tested should the conversation move to model readiness.

At that point the focus becomes more disciplined. Can the data support the target outcome? Which features actually carry useful predictive weight? What baseline performance is realistic? What error level is acceptable for the business use case? What would success look like in practice, not just in a notebook?
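One way to make the baseline question concrete is to compare a model against the laziest possible predictor: always guess the most common outcome. The sketch below does that for a made-up churn example. The labels and the "model" predictions are hypothetical and hard-coded for illustration, not taken from any real system.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy achieved by always predicting the most common label."""
    _, top_count = Counter(labels).most_common(1)[0]
    return top_count / len(labels)


def accuracy(predictions, labels):
    """Share of predictions that match the true labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Hypothetical outcomes: 90 customers stayed, 10 churned.
labels = ["stay"] * 90 + ["churn"] * 10

# Hypothetical model output: 88 of the stayers and 3 of the churners are correct.
model_predictions = ["stay"] * 88 + ["churn"] * 2 + ["churn"] * 3 + ["stay"] * 7

baseline = majority_baseline_accuracy(labels)
model = accuracy(model_predictions, labels)
lift = model - baseline

print(f"baseline accuracy: {baseline:.2f}")
print(f"model accuracy:    {model:.2f}")
print(f"lift over baseline: {lift:.2f}")
```

A headline figure of 91% accuracy sounds strong until it sits next to a 90% baseline that needed no model at all. Whether a one-point lift is worth the investment is a business question, which is the point of asking it at this stage.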

This is the stage where organisations find out whether the use case is genuinely model-worthy — or whether it looked better in a strategy deck than it does in reality. Model readiness is not about enthusiasm. It is about proof.


Infrastructure should be the consequence, not the starting point

The infrastructure conversation is seductive. More compute, faster processing, bigger clusters — these feel like progress. And they are progress, in the right context.

When you have a validated business case, a believable ROI, signal-rich data and a well-framed modelling problem, the right infrastructure genuinely accelerates outcomes. But infrastructure applied to unvalidated data doesn’t solve the problem. It scales it.

A model trained on the wrong data, running on the best hardware available, will produce wrong answers faster and at greater cost than anyone planned for. The servers don’t know the business case is weak. The GPUs don’t know the data is empty of signal. They will process bad assumptions with perfect efficiency.

That is why the sequence matters.

Business use case → Predicted ROI → Data strategy → Data engineering and EDA → Feature validation and model readiness → Infrastructure investment

Getting that sequence right is the difference between an AI investment that delivers and one that quietly disappoints.

The IT Director with 250TB has storage. What he needs first is a conversation about what’s in it, whether it’s been tested, whether it contains usable signal, and whether it can answer the questions the business is asking. That is the conversation worth having before the servers arrive.


Closing thought

There is a version of the AI hype cycle that ends badly — and it ends badly in a specific way. Not with dramatic failure, but with quiet disappointment. Models that don’t perform. Investments that don’t deliver. Data scientists hired to build things the data was never capable of supporting.

The organisations that avoid that outcome are the ones that did the unglamorous work first. They validated the use case. They estimated the ROI. They looked at the data before they bought the infrastructure. They ran EDA before they committed to the model. They asked hard questions before they made bold commitments.

The emperor’s new clothes are always convincing until someone asks the uncomfortable question. In AI, that question is usually the same:

Have you actually tested the data?

Data readiness has to be built in, not bolted on.

Ethics in AI: Part 4

Transparency and Explainability: inside the black magic box

Diagram illustrating the concept of explainability in AI ethics, featuring a blue background, large text that says 'ETHICS IN AI / PART 04,' and visual representations of explainable data, predictions, and algorithms, linked to an 'ACCOUNTABLE DECISION' node.

Ever since I can remember, I’ve wanted to know how things work. Not just that they work — but why, and what’s going on inside.

My parents had a name for it: Fiddle Fingers. Clocks, radios, household appliances: nothing was safe. I’d take them apart with genuine curiosity and varying degrees of success. The parts I couldn’t reassemble quietly disappeared under my bed. I even unscrewed the back of a plug once, purely to see electricity. The thunderbolt that shot up my arm was, in hindsight, a reasonable price for the lesson. My parents, to their credit, were more understanding than the appliances deserved.

Later I studied motor vehicle engineering — which at least meant I could finally take things apart professionally. And now, working through my MSc, I’m still doing the same thing. Still looking under the hood. Still asking what’s in the box.

Which makes the subject of this post a personal one. Because one of the most troubling things about modern AI systems is that they often don’t let you look inside. Not because the technology prevents it — but because transparency and explainability haven’t been treated as priorities. The box stays closed. And when the box is closed, accountability becomes very difficult to defend.

In the previous parts of this series we’ve looked at bias, fairness and accountability — the ethical challenges that emerge when AI systems make decisions that affect people’s lives. This instalment moves into territory that sits underneath all of those: if you can’t see how a system works, and can’t explain what it decided, the other ethical principles become very difficult to uphold.

Transparency and explainability are the mechanisms that make accountability possible. Without them, everything else is aspiration.


Two related concepts, one shared purpose

The terms transparency and explainability are often used interchangeably. They shouldn’t be — they address different things, and the distinction matters.

Transparency concerns the visibility of the system as a whole. How was it built? What data was it trained on? What modelling choices were made, and why? What does the performance data actually show? Transparency enables external scrutiny. It supports governance, auditability and regulatory oversight. Without it, independent evaluation becomes impossible — you’re simply asked to trust the outcome without any means of checking it.

Explainability concerns the individual decision. Not the system in aggregate, but this specific output: why did the model produce this result for this person, in this context, at this moment? In high-stakes settings — healthcare, criminal justice, financial services — that question isn’t academic. It’s a matter of rights.

Think of it this way. Transparency lets you audit the factory. Explainability lets you understand why one particular product came off the line the way it did.

Both matter. And in most real-world deployments of AI today, both are harder to achieve than the marketing suggests.


The three questions explainability has to answer

When we talk about making an AI system explainable, we’re really asking three distinct questions — and each requires a different kind of answer.

The first is about data. What information was used to train the model, and why was it chosen? This isn’t just a technical question. Training data encodes assumptions about the world, and those assumptions shape every output the model produces. If the data can’t be explained and justified, the decisions downstream can’t be either.

The second is about predictions. What features and weights drove this particular output? Why did the model score this applicant lower than another? Which variables carried the most influence, and in what direction? This is where post hoc explanation techniques — tools that interpret model behaviour after the fact — do most of their work.
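For a simple linear scoring model, the "features and weights" question can be answered directly, because each feature's contribution is just its weight multiplied by its value. The sketch below shows that breakdown. The weights, bias and applicant values are all hypothetical, and this direct arithmetic only holds for linear models. For more complex models, post hoc tools aim to approximate a similar per-feature breakdown.

```python
def explain_linear_score(weights, bias, features):
    """Return the score and per-feature contributions, largest effect first."""
    contributions = {
        name: weights[name] * value for name, value in features.items()
    }
    score = bias + sum(contributions.values())
    ranked = sorted(
        contributions.items(), key=lambda item: abs(item[1]), reverse=True
    )
    return score, ranked


# Hypothetical model parameters and applicant, invented for illustration only.
weights = {"income_k": 0.02, "missed_payments": -0.8, "years_at_address": 0.1}
bias = 1.0
applicant = {"income_k": 40, "missed_payments": 2, "years_at_address": 3}

score, ranked = explain_linear_score(weights, bias, applicant)

print(f"score: {score:.2f}")
for name, contribution in ranked:
    direction = "raised" if contribution > 0 else "lowered"
    print(f"{name} {direction} the score by {abs(contribution):.2f}")
```

The output names the variable that carried the most influence and the direction it pushed, which is the minimum an explanation of an individual decision needs to contain.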

The third is about the algorithm itself. What are the layers, the thresholds, the decision boundaries? How does the model move from input to output? For simpler models, this question has a direct answer. For more complex ones, it often doesn’t — which is where the central tension of this topic lives.


COMPAS: when a black box meets a courtroom

No case study illustrates the stakes of transparency and explainability more starkly than COMPAS — the Correctional Offender Management Profiling for Alternative Sanctions tool, widely used in the United States to assess the risk that a defendant will reoffend.

Judges used COMPAS scores to inform decisions about bail, sentencing and parole. The scores carried real weight in outcomes that determined whether people went home or went to prison. And yet the algorithm that produced those scores was proprietary. Defendants had no means of understanding how their score was calculated, no ability to identify errors in the underlying data, and no realistic way to challenge the output in court.

In 2016, ProPublica published an investigation reporting that Black defendants who did not go on to reoffend were almost twice as likely as white defendants to have been labelled higher risk by COMPAS. The tool wasn’t just opaque. It was producing outcomes that were racially skewed in one of the highest-stakes contexts imaginable.

The issue reached the Wisconsin Supreme Court in State v. Loomis, where the defendant argued that using a proprietary, unexplainable algorithm in sentencing violated his right to due process. The court upheld the use of the tool, while cautioning that a risk score should not be the determinative factor in a sentence. The algorithm remained a black box.

COMPAS sits at the intersection of everything that matters in this conversation. Transparency was absent — no visibility into the model’s design, data or validation. Explainability was absent — no way to interrogate individual decisions. And the consequences were borne by people who had no recourse and no means of understanding why.


The tension that doesn’t go away

Here is the dilemma that transparency and explainability force us to confront — and it doesn’t have a clean resolution.

The models that tend to perform best on complex, real-world prediction tasks are also the least interpretable. Deep neural networks, gradient boosting models, large ensemble methods — these approaches can achieve superior predictive accuracy precisely because they capture subtle, non-linear relationships in data that simpler models miss. But that complexity comes at a cost: the internal workings become difficult, sometimes impossible, to explain in terms a human can meaningfully interpret.

Simpler models — linear regression, decision trees, rule-based systems — offer genuine interpretability. You can follow the logic from input to output, identify which variables matter and by how much, and explain a decision to the person it affects. But they often sacrifice accuracy to do it. In a noisy, high-dimensional real world, simpler models sometimes just get more things wrong.

This is not a technical problem waiting for a technical solution. It is a genuine ethical trade-off. In some contexts — say, a recommendation engine for a streaming service — that trade-off sits comfortably on the side of performance. In others — a credit decision, a medical diagnosis, a criminal risk score — the question of what we’re willing to sacrifice for accuracy becomes a question of values, not engineering.

Regulatory frameworks are beginning to codify where that line falls. The EU AI Act classifies certain AI applications as high-risk and attaches transparency and explainability obligations to them. The GDPR restricts solely automated decisions that significantly affect individuals and requires meaningful information about the logic involved, which is often summarised as a right to explanation. But regulation sets a floor, not a ceiling, and the honest truth is that many organisations are still well below it.


What good looks like in practice

Transparency and explainability aren’t binary. They exist on a spectrum, and the appropriate level depends on context — the stakes involved, the people affected, and the regulatory environment in play.

For high-risk applications, the baseline should include clear documentation of training data, modelling choices and performance metrics across demographic groups; post hoc explanation tools that can surface the key drivers of individual decisions; human review mechanisms for decisions that significantly affect individuals; and the ability to audit the system independently — not just internally.
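Reporting performance across demographic groups can start with something as simple as splitting one error metric by group. The sketch below computes the false positive rate per group: of the people who did not have the outcome, how many were flagged anyway. The group names and records are hypothetical toy data, and a real audit would look at several metrics, not just this one.

```python
from collections import defaultdict


def false_positive_rate_by_group(records):
    """Compute the false positive rate for each group.

    records: iterable of (group, predicted_positive, actual_positive).
    Groups with no actual negatives are left out of the result.
    """
    false_positives = defaultdict(int)
    negatives = defaultdict(int)
    for group, predicted, actual in records:
        if not actual:
            negatives[group] += 1
            if predicted:
                false_positives[group] += 1
    return {g: false_positives[g] / negatives[g] for g in negatives}


# Hypothetical records: (group, flagged as high risk, outcome actually occurred).
records = (
    [("A", True, False)] * 4
    + [("A", False, False)] * 6
    + [("A", True, True)] * 5
    + [("B", True, False)] * 2
    + [("B", False, False)] * 8
    + [("B", True, True)] * 5
)

rates = false_positive_rate_by_group(records)
for group, rate in sorted(rates.items()):
    print(f"group {group}: false positive rate = {rate:.2f}")
```

In this toy data both groups have the same number of correct high-risk flags, yet group A is wrongly flagged twice as often as group B. A single overall accuracy figure would hide that gap, which is why the breakdown belongs in the documentation.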

For lower-risk applications, lighter-touch approaches may be proportionate. But the principle remains: the system should be able to account for itself, and the people it affects should have a meaningful way to understand and, where necessary, challenge its outputs.

The temptation to treat explainability as a presentation problem — a dashboard, a label, a percentage confidence score — should be resisted. A number on a screen is not an explanation. An explanation is something a person can interrogate, reason about and act on.


Closing thought

There is a version of AI development where transparency and explainability are treated as compliance tasks — boxes to tick, documentation to file, a report to produce before launch. That version produces systems that look accountable without being accountable.

The harder version asks the question earlier: before a model is selected, before a dataset is assembled, before a use case is approved. It treats interpretability as a design constraint, not an afterthought. It asks whether a complex model is actually necessary, or whether a simpler, more explainable one would serve the purpose well enough.

That version is also the honest version. Because when a system makes a decision that changes someone’s life — and they ask why — “the algorithm is proprietary” is not an answer any ethical organisation should be comfortable giving.

Transparency and explainability have to be built in, not bolted on.