The Plumbing Under the Hood: RAG, MCP and the Architecture Nobody Explains

A diagram illustrating the architecture of a large language model (LLM) with connections to various systems including CRM, ERP, HR, and SharePoint, displayed on a blueprint-style background.

I’m an under-the-hood type of guy. I hear high-level fluff and I just turn off. I need more. I need to be able to visualise how things work, and the effects of implementation. I guess that’s the Solution Architect in me. Years of seeing projects go south. Experience that says it’s just not that simple.

I learned a long time ago: if you want something doing, do it yourself.

So here it is. No fluff, no hand-waving. The no-nonsense guide to what RAG and MCP actually are, how they work, and why the distinction matters more than most people realise. Enjoy.


The Problem Every Enterprise AI Deployment Hits

Large language models are genuinely extraordinary. The breadth of knowledge, the reasoning capability, the ability to synthesise and explain — it’s real, and it’s useful. But they have a fundamental constraint that every organisation hits the moment they try to deploy one seriously.

They are frozen.

An LLM is trained on a vast corpus of data up to a point in time, and then the weights are fixed. The model doesn’t know what happened last Tuesday. It doesn’t know your organisation’s processes, your customer contracts, your current pipeline, or the policy document your HR team updated this morning. It is, for all its capability, a brilliant mind in a sealed room.

Every enterprise AI deployment is therefore really solving one problem: how do we get relevant, current, organisational knowledge into the model’s hands at the moment it needs to answer?

Two main solutions emerged. They look similar on the surface. They are fundamentally different underneath.


RAG: The Indexed Snapshot

RAG stands for Retrieval-Augmented Generation. The name is less important than the mechanism.

Imagine you have a large knowledge base — policy documents, product guides, training materials, technical specifications. RAG takes all of that content and processes it in advance. Each document gets broken into chunks. Each chunk gets converted into a vector embedding — a numerical representation of the meaning of that text, not just its keywords. Those embeddings get stored in a vector database.

When a user asks a question, the question itself gets converted into a vector using the same method. The system then searches the database for the chunks whose meaning is closest to the meaning of the question — semantic similarity, not keyword matching. The most relevant chunks get retrieved and placed into the model’s context window alongside the original question. The model answers using that retrieved material as its working context.
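The index-then-retrieve loop described above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: `embed` is a toy bag-of-words stand-in for a real embedding model (so it matches on shared words rather than true meaning), the two policy snippets are made up, and the in-memory `index` list stands in for a vector database.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def chunk(document: str, size: int = 12) -> list[str]:
    """Break a document into fixed-size word windows."""
    words = document.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

# Index time: chunk every document and store each chunk with its vector.
documents = [  # made-up example content
    "Annual leave policy: employees receive 25 days of paid leave per year.",
    "Expenses policy: claims must be submitted within 30 days with receipts.",
]
index = [(text, embed(text)) for doc in documents for text in chunk(doc)]

def retrieve(question: str, top_k: int = 1) -> list[str]:
    """Query time: embed the question the same way and rank chunks by similarity."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]

question = "How many days of paid leave do employees receive?"
context = retrieve(question)
# The retrieved chunk is placed in the prompt alongside the original question.
prompt = f"Answer using only this context:\n{context[0]}\n\nQuestion: {question}"
```

Swapping `embed` for a real embedding model and `index` for a vector store changes the quality of the match, not the shape of the loop.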

Think of it as a library. Brilliantly organised, perfectly indexed, searchable by meaning rather than title. You walk in, the system finds the most relevant books, opens them to the right pages, and hands them to the model before it answers.

It’s powerful. For stable, curated knowledge bases it works extremely well.

But it has a ceiling, and the ceiling matters.

The library was shelved at a point in time. The moment your source documents change, your index is stale until you re-embed. And the quality of retrieval is entirely dependent on the quality of what went in. Poorly structured documents, inconsistent language, missing metadata — the embeddings become noisy and retrieval underperforms. The foundational principle holds here as firmly as anywhere in AI: weak data quality at the input stage leads to flawed outputs downstream. RAG doesn’t solve a data quality problem. It inherits it.
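One common way to keep an index honest is to record a content fingerprint for each document at indexing time and compare it with the source later. The sketch below assumes that approach; the document ids and contents are hypothetical, and a real pipeline would also need to handle deletions.

```python
import hashlib

def fingerprint(text: str) -> str:
    """Stable hash of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def stale_documents(sources: dict[str, str], indexed: dict[str, str]) -> list[str]:
    """Ids of documents that are new, or whose content changed since indexing."""
    return sorted(
        doc_id for doc_id, text in sources.items()
        if indexed.get(doc_id) != fingerprint(text)
    )

# Fingerprints recorded when the index was last built (hypothetical content).
indexed_fingerprints = {
    "leave-policy": fingerprint("25 days of paid leave"),
    "expenses-policy": fingerprint("Claims within 30 days"),
}

# What the source systems hold today.
current_sources = {
    "leave-policy": "27 days of paid leave",     # edited since indexing
    "expenses-policy": "Claims within 30 days",  # unchanged
    "remote-work": "Two office days per week",   # never indexed
}

to_reembed = stale_documents(current_sources, indexed_fingerprints)
print(to_reembed)
```

Only the documents in `to_reembed` need re-chunking and re-embedding, which keeps the refresh cost proportional to what actually changed.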


MCP: The Living Plumbing

MCP — the Model Context Protocol — is a different kind of answer to the same problem. And understanding the difference is where the real business thinking begins.

MCP doesn’t retrieve from a pre-built index. It is an open protocol that connects the model to live systems through their APIs, each exposed as a set of tools by an MCP server, and queries them in real time, at the moment of the conversation.

Here’s what that means practically. Your SharePoint isn’t indexed in advance — the model calls it directly and gets back whatever is there right now, including the contract template someone updated this morning. Your CRM isn’t embedded into vectors — the model queries it and sees the deal that moved stage an hour ago. Your HR system, your procurement platform, your service desk — all of them accessible, all of them current, all of them live.

The model doesn’t see a snapshot of your organisation. It sees your organisation as it actually is, right now.
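To make "live, not snapshot" concrete, here is a minimal sketch of the message shape involved. MCP is built on JSON-RPC 2.0, and a tool invocation is a `tools/call` request. Everything else here is an assumption for illustration: the `CRM` dictionary, the `get_deal_stage` tool and the in-process `handle` function stand in for a real CRM and a real MCP server, and this does not use any MCP SDK.

```python
import json

# Hypothetical stand-in for a live CRM system.
CRM = {"ACME-42": {"account": "Acme Ltd", "stage": "Negotiation"}}

def get_deal_stage(deal_id: str) -> str:
    """Hypothetical tool: read the deal's current stage at call time."""
    deal = CRM[deal_id]
    return f"{deal['account']}: {deal['stage']}"

TOOLS = {"get_deal_stage": get_deal_stage}

def handle(request_json: str) -> str:
    """Toy in-process handler for a JSON-RPC 2.0 'tools/call' request."""
    request = json.loads(request_json)
    if request.get("method") != "tools/call":
        return json.dumps({"jsonrpc": "2.0", "id": request.get("id"),
                           "error": {"code": -32601, "message": "Method not found"}})
    params = request["params"]
    text = TOOLS[params["name"]](**params["arguments"])
    return json.dumps({"jsonrpc": "2.0", "id": request["id"],
                       "result": {"content": [{"type": "text", "text": text}],
                                  "isError": False}})

request = json.dumps({
    "jsonrpc": "2.0", "id": 1, "method": "tools/call",
    "params": {"name": "get_deal_stage", "arguments": {"deal_id": "ACME-42"}},
})

before = json.loads(handle(request))
CRM["ACME-42"]["stage"] = "Closed Won"  # the deal moves stage in the source system
after = json.loads(handle(request))     # the same request now sees the new state
```

No re-indexing happens between the two calls. The second answer is different simply because the source system is different.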

And here is the point that changes how you should think about this entirely.

Most enterprise knowledge isn’t in one place. It never has been. It’s fragmented across Salesforce and SAP, ServiceNow and SharePoint, HR platforms and finance systems and procurement tools. Getting RAG to span those systems requires significant data engineering effort — ingesting, normalising, embedding, maintaining. It’s achievable, but it’s heavy.

MCP connects to all of them. Through their APIs. Simultaneously. The model becomes a single conversational interface across the entire technology estate — not just one knowledge base, but the living information fabric of the organisation.

That is not a chatbot connected to some documents. That is a fundamentally different proposition.


Not Competitors — Different Layers

It would be tempting to read this as RAG versus MCP. It isn’t.

They solve overlapping problems at different layers and with different trade-offs. RAG is the right tool for large, stable knowledge corpora where semantic similarity search matters — where you need the model to find relevant material even when the exact words don’t appear in the query. MCP is the right tool where data is live, dynamic, and distributed across operational systems.

And they can work together. A well-architected system might use MCP as the orchestration layer — the model deciding which tools to call — while one of those tools triggers a RAG pipeline for a specific stable knowledge base. The plumbing and the library, working in concert.

The practical guidance is straightforward. Start with MCP. It’s the lower point of entry — no vector infrastructure to provision, no embedding pipelines to build and maintain, no index to keep fresh. You’re connecting to systems and APIs you already have. Reach for RAG when you’ve hit the ceiling — when the corpus is large, messy, and semantic retrieval across unstructured content becomes essential.

Start simple. Earn the complexity.


Before You Lay The Plumbing — What Nobody Tells You

The pitch for both RAG and MCP is compelling. The reality, as always, has a few sharp edges worth knowing about before you commit.

RAG brings infrastructure with it. RAG isn’t just a software pattern you switch on. Behind every vector database is a compute and storage requirement that needs provisioning, maintaining, and scaling as your knowledge base grows. Embedding pipelines need to run continuously — every time source content changes, chunks need re-processing and re-indexing or your library goes stale. For organisations already managing data centre complexity, this is a real cost conversation that rarely appears in the vendor presentation.

MCP makes your legacy systems load-bearing. MCP’s power is connecting to live systems. But those live systems are now dependencies. The legacy HR platform with the flaky API. The procurement system that slows under load. The CRM with three years of inconsistent data entry. Once the LLM is reaching across your technology estate, it is only as reliable as the weakest system it touches. A timeout, a bad API response, a data quality problem in one system degrades the entire interface. What felt like a peripheral legacy problem just became front and centre.
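One way to stop a single slow system from degrading the whole interface is to put a time budget on every call and hand the model an explicit "source unavailable" result instead of hanging. The sketch below assumes that pattern; `crm_lookup` and `hr_lookup` are hypothetical stand-ins, and the 0.1 second budget is arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def crm_lookup() -> str:
    """Hypothetical healthy system: answers immediately."""
    return "3 open deals"

def hr_lookup() -> str:
    """Hypothetical legacy system: simulates a slow API."""
    time.sleep(0.5)
    return "12 open vacancies"

def call_with_timeout(pool: ThreadPoolExecutor, fn, timeout_s: float) -> dict:
    """Run one tool call with a time budget and report failure explicitly."""
    future = pool.submit(fn)
    try:
        return {"ok": True, "data": future.result(timeout=timeout_s)}
    except FutureTimeout:
        return {"ok": False, "data": "source unavailable (timed out)"}
    except Exception as exc:  # a bad response from the system itself
        return {"ok": False, "data": f"source error: {exc}"}

with ThreadPoolExecutor(max_workers=2) as pool:
    results = {
        name: call_with_timeout(pool, fn, timeout_s=0.1)
        for name, fn in [("crm", crm_lookup), ("hr", hr_lookup)]
    }
# Leaving the 'with' block still waits for the slow call to finish in the
# background; only the caller stopped waiting for its result.

print(results)
```

The model can then answer from the CRM data and say plainly that the HR figure was unavailable, rather than failing the whole response.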

Governance and security are not optional extras. When a model can traverse your entire technology estate — reading CRM data, querying HR systems, pulling procurement approvals — your entire technology estate needs to be ready for that conversation. Access controls, data classification, audit trails, API security, compliance boundaries. These cannot be bolted on after deployment. They need to be designed in from the start. MCP without a holistic governance and security view isn’t just risky. It’s an exposed surface at scale.
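"Designed in from the start" can be as simple, structurally, as a gate that every tool call passes through before it reaches a system. The sketch below is one possible shape under stated assumptions: the roles, the tool names and the in-memory `audit_log` are all hypothetical, and a real deployment would lean on the organisation's identity provider and a tamper-evident log store.

```python
from datetime import datetime, timezone

# Hypothetical role-to-tool allow-list; anything not listed is denied.
ALLOWED_TOOLS = {
    "sales": {"crm.read"},
    "hr_partner": {"crm.read", "hr.read"},
}

audit_log: list[dict] = []

def authorise_and_log(user: str, role: str, tool: str) -> bool:
    """Check the allow-list and record every attempt, allowed or not."""
    allowed = tool in ALLOWED_TOOLS.get(role, set())
    audit_log.append({
        "at": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "role": role,
        "tool": tool,
        "allowed": allowed,
    })
    return allowed

# A sales user may read the CRM but not the HR system.
can_read_crm = authorise_and_log("sam", "sales", "crm.read")
can_read_hr = authorise_and_log("sam", "sales", "hr.read")
```

The denied attempt is logged as well as the permitted one, which is what makes the audit trail useful after the fact.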

This is AI Reality. The plumbing is powerful. Lay it properly.


The Interface, The Plumbing, The Flow

Here is the frame I want to leave you with — because it’s the one that changes how you brief a customer, evaluate a vendor, or think about your own AI roadmap.

LLMs are becoming the interface to information. Not a search bar, not a dashboard, not a report. A conversational, reasoning interface that sits in front of your organisation’s entire data landscape and makes it accessible in plain language.

MCP is the plumbing. The connective tissue that links the interface to the living systems underneath — the CRM, the ERP, the HR platform, the document store, the data warehouse. Without the plumbing, the interface has nothing to work with. With it, the interface can see everything.

And once you have an interface and plumbing, something else becomes possible.

Agents.

Not models that answer questions. Models that act. That move through systems, make decisions, complete workflows, and hand off to humans at exactly the right moment. Agents ride the pipelines that MCP creates and turn information flow into work getting done.

That’s where this goes. And that’s what the next post is about.

Next: The Agentic Leap — when AI stops answering and starts acting.


The Interface. The Plumbing. The Flow.

LLM Sizing 101 – Part 3: Platform and GPU Selection

A schematic diagram illustrating the LLM sizing chain, featuring flowcharts that detail model size, precision, tokens per second, GPU count, node count, and platform specifications.

Mapping your sizing to Dell PowerEdge XE configurations

In Part 1 we nailed down the two fundamentals: parameters define how big the model is, and tokens define how much work you’re asking it to do.

In Part 2 we made it practical — translating a customer’s real-world requirements into a target tokens-per-second figure, and from there into a GPU count.

Now we make it concrete.

Building on the methodology from Part 2, we apply it to two representative scenarios — a 7B internal assistant and a 70B RAG system — and map everything to actual Dell PowerEdge XE platform configurations you can put in a proposal. But before we get to the reference designs, there’s a gotcha.


The Gotcha: Model Precision

There’s a variable that can silently double — or halve — your GPU count if you don’t nail it down early in the conversation.

Model precision.

When a customer says “we want to run a 70B model,” that sentence is incomplete. The question you need to ask immediately is: at what model precision?

Here’s why it matters so much. The memory footprint of a model is:

Model VRAM (GB) = parameters (in billions) × bytes per parameter

And the bytes-per-parameter figure is entirely determined by precision:

| Precision | Bytes per parameter | 70B model (weights only) | Notes |
|---|---|---|---|
| FP32 | 4 bytes | ~280 GB | Training only; rare in inference |
| FP16 / BF16 | 2 bytes | ~140 GB | Full quality baseline |
| FP8 | 1 byte | ~70 GB | Requires H100, H200, B300 class |
| INT8 | 1 byte | ~70 GB | Broad hardware support |
| INT4 | 0.5 bytes | ~35 GB | Validate quality before committing |
| FP4 | 0.5 bytes | ~35 GB | Blackwell class only (B300/GB300 here); first-class inference precision |

Run the same 70B model at FP16 versus INT4 and the weights footprint changes by 4×. That’s the difference between needing two 8-GPU nodes and needing one. It’s the difference between a £400k proposal and a £200k proposal. And it’s a variable that’s completely invisible if you skip the precision conversation.
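The weights-only arithmetic is simple enough to keep in a helper. This is a minimal sketch of the formula above, counting 1 GB as 10^9 bytes; the function and dictionary names are my own, and it deliberately ignores KV cache, activations and framework overhead.

```python
# Bytes per parameter for the common inference precisions.
BYTES_PER_PARAM = {
    "FP32": 4.0,
    "FP16/BF16": 2.0,
    "FP8": 1.0,
    "INT8": 1.0,
    "INT4": 0.5,
}

def weights_gb(params_billions: float, precision: str) -> float:
    """Weights-only footprint in GB: billions of parameters x bytes per parameter."""
    return params_billions * BYTES_PER_PARAM[precision]

for precision in BYTES_PER_PARAM:
    print(f"{precision:>9}: {weights_gb(70, precision):6.1f} GB")

# The FP16-to-INT4 gap for the same 70B model.
ratio = weights_gb(70, "FP16/BF16") / weights_gb(70, "INT4")
```

Change the first argument to size any other model class, for example `weights_gb(7, "INT8")` for a 7B model.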

How to find out

The good news: model precision is almost always discoverable before you size anything.

The model card. Most published models have a model card stating the native training precision (typically FP32, BF16, or FP16) and whether pre-quantised versions exist. Llama 3.1 405B, for example, is published in BF16 with a separate FP8-quantised version available for single-node deployment. That’s not a footnote; it’s a hardware decision.

The deployment framework. When a customer tells you they’re using vLLM, TensorRT-LLM, or NVIDIA NIM, the framework makes precision explicit. NIM profiles are named by precision — tensorrt_llm-h100-fp8-tp2-latency tells you the precision, the GPU, and the parallelism strategy in one string. If the customer has already chosen a framework, ask what precision they’re deploying at — they’ll either know, or the question will prompt them to find out.
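As a small illustration of how much a profile name carries, the sketch below splits the example string into its parts. It assumes the five-field layout of that one example (backend, GPU, precision, parallelism, objective); real profile names may have more or fewer fields, so treat `parse_profile` as a hypothetical helper rather than anything NIM provides.

```python
def parse_profile(name: str) -> dict:
    """Split a five-field profile name into its parts (assumed field order)."""
    backend, gpu, precision, parallelism, objective = name.split("-")
    if not parallelism.startswith("tp"):
        raise ValueError(f"expected a tensor-parallel field like 'tp2', got {parallelism!r}")
    return {
        "backend": backend,
        "gpu": gpu.upper(),
        "precision": precision.upper(),
        "tensor_parallel": int(parallelism[2:]),
        "objective": objective,
    }

profile = parse_profile("tensorrt_llm-h100-fp8-tp2-latency")
print(profile)
```

Three of the fields feed straight into sizing: the GPU, the precision, and how many GPUs a single model replica spans.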

The GPU itself. Not all GPUs support all precisions. FP8 requires H100, H200, B300 or AMD MI300X class hardware. FP4 is a Blackwell-generation capability (B300 and GB300 in the platforms covered here); it isn’t available on Hopper or earlier generations. INT4 with hardware acceleration requires specific tensor core support. If the customer has already chosen a GPU, that constrains the precision options, and vice versa. The two decisions are linked.

The precision conversation in practice

When a customer names a model, these are the three questions that unlock the sizing:

  • “Are you using the native model weights, or a quantised version?”
  • “What serving framework are you planning to use?”
  • “Is some accuracy trade-off acceptable in exchange for a smaller hardware footprint?”

That last question is the most important one. Modern quantisation techniques — GPTQ, AWQ, SmoothQuant — preserve the vast majority of model quality for most enterprise inference workloads. The difference between BF16 and INT8 is typically imperceptible for summarisation, search, classification and code assistance. For complex multi-step reasoning or fine-tuned models, it’s worth validating. But for the majority of use cases, INT8 or FP8 is a legitimate production choice — not a compromise.

The rule of thumb: the bigger the model, the more gracefully it quantises, at least for most enterprise inference workloads. At the same precision, a 70B model typically loses proportionally less quality than a 7B model, so INT4 is a safer bet on the former than on the latter.

Get precision wrong — or leave it undefined — and every GPU count in your proposal is built on a shaky foundation. Get it right, and you have a sizing conversation that’s grounded, defensible, and often more cost-effective than the customer expected.


Two Reference Designs

With precision established, everything else follows.

Sizing disclaimer: The reference designs below illustrate the methodology — they are not a substitute for your own sizing exercise. TPS figures, GPU counts and node recommendations are directional reference points based on representative workloads. Actual performance will vary with your specific model, serving framework, quantisation approach, batch configuration and workload pattern. Always validate against benchmark data for your environment before quoting or committing to a configuration.

These aren’t rigid prescriptions — they’re starting points you can adapt by adjusting the inputs and re-running the TPS maths from Part 2.


Reference Design A: 7B Internal Assistant

Use case: An internal productivity assistant — employees asking about policies, summarising documents, drafting emails. High concurrency, moderate latency sensitivity, cost-conscious.

1. Define the workload

| Parameter | Value |
|---|---|
| Concurrent users (peak) | 500 |
| Average prompt | 400 tokens |
| Average response | 250 tokens |
| Target response time | ~8–10 seconds |
| Acceptable TTFT | < 2 seconds |
| Model | 7B class |

2. Establish precision and memory footprint

For a 7B model:

| Precision | Weights footprint | Fits on a single GPU? |
|---|---|---|
| FP16 / BF16 | ~14 GB | Yes (48–80 GB class) |
| INT8 | ~7 GB | Yes, comfortably |
| INT4 | ~3.5 GB | Yes, with significant headroom |

For a high-concurrency internal assistant, INT8 or mixed precision (weights in INT8, activations in FP16/BF16) is the practical default. It fits cleanly on a single GPU, leaves room for KV cache and batching overhead, and the quality trade-off is negligible for this kind of workload.

3. Translate to TPS

  • 250 output tokens ÷ 10 seconds = 25 tokens/sec per user
  • 500 users × 25 tokens/sec = 12,500 tokens/sec system TPS
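The two steps above reduce to one small function, which makes it easy to re-run with a customer’s own numbers. A minimal sketch; the function and parameter names are my own.

```python
def system_tps(concurrent_users: int, output_tokens: int, response_seconds: float) -> float:
    """Aggregate output tokens per second the platform must sustain at peak."""
    per_user_tps = output_tokens / response_seconds
    return concurrent_users * per_user_tps

# Reference Design A: 500 users, 250-token responses, 10-second target.
target_tps = system_tps(concurrent_users=500, output_tokens=250, response_seconds=10)
print(target_tps)
```

Tightening the response target to 8 seconds, the lower end of the stated range, raises the requirement to 15,625 tokens/sec, which is why it pays to agree the latency target before the GPU count.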

4. Per-GPU TPS estimate

For a 7B model at INT8/mixed precision, batched decode on a high-end accelerator:

| GPU | Approx. TPS (7B, batched decode) |
|---|---|
| H100 80GB SXM | ~2,000–3,000 |
| H200 141GB | ~2,500–3,500 |
| L40S 48GB | ~1,000–1,500 |
| B300 288GB | ~4,000–6,000 (est.) |

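Conservative estimate: 1,500 TPS per GPU on current generation; higher on B300. Note that 1,500 sits at the top of the L40S range above, so treat it as conservative only for H100/H200-class GPUs and re-run the maths with a lower figure for L40S-class cards.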

5. GPU and node count

  • 12,500 TPS ÷ 1,500 TPS/GPU ≈ 8.3 GPUs
  • Add 25% headroom: 8.3 × 1.25 ≈ 10.4 GPUs → round up to 12 for a clean 3 × 4 configuration
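The same arithmetic as a reusable sketch: divide, add headroom, then round up to whole nodes. The function name and defaults are my own, and the per-GPU TPS figure remains the directional estimate from the previous step.

```python
import math

def gpus_required(target_tps: float, tps_per_gpu: float,
                  headroom: float = 0.25, gpus_per_node: int = 4) -> tuple[int, int]:
    """Return (total GPUs, node count), rounded up to fully populated nodes."""
    raw_gpus = target_tps / tps_per_gpu
    with_headroom = raw_gpus * (1 + headroom)
    nodes = math.ceil(with_headroom / gpus_per_node)
    return nodes * gpus_per_node, nodes

# Reference Design A: 12,500 TPS target at 1,500 TPS per GPU.
gpus, nodes = gpus_required(target_tps=12_500, tps_per_gpu=1_500)
print(gpus, nodes)
```

Rounding at the node level rather than the GPU level is a design choice: it keeps every node identically configured, which is simpler to operate than a mixed fleet.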

6. Platform mapping

A 7B model at INT8 fits on a single GPU — no tensor parallelism required. Each GPU runs an independent model replica and you scale out horizontally across nodes. This is compact, balanced GPU server territory.

The Dell PowerEdge XE7745 is the natural fit for this workload class: a 4U air-cooled platform that supports up to eight double-width PCIe GPUs (this design populates four per node), built for exactly this kind of inference deployment. For organisations planning ahead with Blackwell, the XE7745 also supports NVIDIA RTX Pro 6000 Blackwell Server Edition GPUs: a professional-grade accelerator with 96 GB GDDR7 that offers significant headroom for a 7B workload and future-proofing for multi-model environments, at a lower power and cost envelope than data centre HBM-class GPUs.

“For a 7B internal assistant serving ~500 concurrent users, a small cluster of three PowerEdge XE7745 nodes gives you a responsive chat experience, capacity to grow, and the flexibility to host multiple models or environments — all in a standard rack footprint.”


Reference Design B: 70B RAG System

Use case: Knowledge-heavy workflows — legal, financial or engineering teams querying proprietary documents via a RAG pipeline. Quality matters more than raw user count. Concurrency is moderate.

1. Define the workload

| Parameter | Value |
|---|---|
| Concurrent users (peak) | 100 |
| Average prompt | 2,000 tokens |
| Average response | 500 tokens |
| Target response time | ~12–15 seconds |
| Acceptable TTFT | < 3 seconds |
| Model | 70B class |

Prompts are longer here because RAG injects retrieved document snippets, conversation history and system instructions into every request. That longer context window drives up KV cache memory — which is why the platform choice shifts significantly compared to Reference Design A.
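To put a rough number on the KV cache, the sketch below uses the standard per-token formula: keys and values, for every layer and every KV head. The architecture figures are assumptions for a Llama-style 70B model with grouped-query attention (80 layers, 8 KV heads, head dimension 128), with an FP16 cache and the worst case where all 100 users hold a full 2,500-token context at once. Check the model’s own config before relying on them.

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             bytes_per_value: int) -> int:
    """Bytes of KV cache per token: keys and values (x2) for each layer and KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value

# Assumed Llama-style 70B architecture with an FP16 (2-byte) cache.
per_token = kv_cache_bytes_per_token(layers=80, kv_heads=8, head_dim=128, bytes_per_value=2)

tokens_per_request = 2_000 + 500   # prompt + response from the workload table
concurrent_requests = 100          # worst case: every user at full context

total_gb = per_token * tokens_per_request * concurrent_requests / 1e9
print(f"{per_token / 1024:.0f} KiB per token, {total_gb:.1f} GB at peak")
```

Under these assumptions the cache alone is in the same range as the FP8 weights, which is the practical reason long-prompt RAG workloads push towards high-memory GPUs.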

2. Establish precision and memory footprint

This is where precision has the biggest impact on the proposal — and where the B300 changes the calculus significantly:

| Precision | Weights footprint | H100/H200 GPUs needed | B300 GPUs needed | Notes |
|---|---|---|---|---|
| FP16 / BF16 | ~140 GB | 2 minimum | 1 (fits with headroom) | Full quality; B300’s 288 GB changes the equation |
| FP8 | ~70 GB | 1 minimum | 1 | Near-FP16 quality; requires H100/H200/B300 |
| INT8 | ~70 GB | 1 minimum | 1 | Minimal quality loss for most workloads |
| INT4 | ~35 GB | 1 | 1 | Validate quality before committing |
| FP4 | ~35 GB | N/A (Blackwell class only) | 1, with substantial headroom | Validate quality for RAG use cases |

A key implication for B300 deployments: with 288 GB of HBM3e per GPU, a 70B model at FP16 (~140 GB weights) fits on a single B300. That eliminates the need for tensor parallelism within the node for this model size, simplifying the architecture and reducing interconnect dependency.

For a legal or financial RAG workload where output quality is the primary requirement, FP16 or FP8 remains the right starting point. FP4 on B300 is increasingly viable but worth validating explicitly against the customer’s specific domain before committing.

3. Translate to TPS

  • 500 output tokens ÷ 15 seconds = 33 tokens/sec per user
  • 100 users × 33 tokens/sec = 3,300 tokens/sec system TPS

4. Per-node TPS estimate

For a 70B model running on high-end accelerators:

| Configuration | Approx. TPS (70B, batched decode) |
|---|---|
| 4× H100 80GB (tensor parallel) | ~800–1,200 |
| 8× H100 80GB (tensor parallel) | ~1,500–2,500 |
| 8× H200 141GB (tensor parallel) | ~2,000–3,500 |
| 8× B300 288GB (HGX B300) | ~4,000–7,000 (est.) |

Conservative estimate on an 8× H100 node: 1,500 TPS. On an 8× B300 node: significantly higher, with the added benefit that each GPU can host the full model independently.

5. Node count

  • 3,300 TPS ÷ 1,500 TPS/node (H100 baseline) ≈ 2.2 nodes
  • Add 25–30% headroom: 2.2 × 1.25 ≈ 2.75 → round to 3
  • Total: 3 nodes × 8 GPUs = 24 GPUs (H100/H200 baseline)

On B300 hardware, the same TPS target is achievable with fewer nodes — or the same node count delivers substantially higher capacity.

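Three nodes also gives you operational flexibility. With one node drained for maintenance, the remaining two deliver about 3,000 TPS at the conservative estimate: roughly 90% of the 3,300 TPS target, so the service degrades gracefully at peak rather than collapsing. If maintenance must never dip below the TPS floor, size for a fourth node.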

6. Platform mapping

For H100/H200 deployments, the Dell PowerEdge XE9680 with 8× H100 or H200 GPUs remains a proven reference platform for 70B inference, with NVLink and NVSwitch providing the fast GPU-to-GPU interconnect tensor parallelism requires.

For Blackwell deployments, the Dell PowerEdge XE9780 and XE9785 are the direct successors to the XE9680 — delivering up to 4× faster LLM performance with the 8-way NVIDIA HGX B300. The liquid-cooled XE9780L and XE9785L variants support higher GPU densities for rack-scale deployments.

Infrastructure note: B300 systems require liquid cooling, 800 Gb/s networking, and power densities that most existing facilities cannot support without upgrade. The B300 draws 1,400W TDP per GPU — 40% more than the B200, and double the H100. Factor facility readiness into any B300 sizing conversation before committing to a configuration.

“For a 70B RAG assistant used by specialist teams — legal, finance, engineering — the PowerEdge XE9680 with H100/H200 GPUs remains a strong proven choice. For organisations investing in Blackwell infrastructure, the XE9780/XE9785 with HGX B300 delivers significantly higher throughput and eliminates tensor parallelism requirements for 70B class models — but facility readiness for liquid cooling and power density must be confirmed first.”


The Platform Decision in Summary

| Workload | Model | Precision | Current Platform | Blackwell Platform | GPUs/node | Nodes |
|---|---|---|---|---|---|---|
| Internal assistant (high concurrency) | 7B | INT8 | PowerEdge XE7745 | XE7745 (RTX Pro 6000 BW) | 4× GPUs | 3 |
| RAG system (quality-first) | 70B | FP16 / FP8 | PowerEdge XE9680 | XE9780 / XE9785 (HGX B300) | 8× GPUs | 3 |

The pattern is consistent: model size drives platform class, precision drives memory footprint, TPS drives node count. Miss any one of those three and the sizing is incomplete.

For organisations moving beyond 70B — frontier models, multi-tenant inference at scale, or combined training and inference workloads — the Dell PowerEdge XE9712 featuring NVIDIA GB300 NVL72 is the next step up. With 72 Blackwell Ultra GPUs and up to 40 TB of fast memory per rack (combining ~20 TB of GPU HBM3e across 72 GPUs and ~17 TB of Grace CPU LPDDR5X), it delivers exascale-class AI performance for workloads that have outgrown the per-node sizing conversation entirely. That’s a different discussion — but it starts with the same methodology.


Three Trade-offs Worth Raising

Once you’ve walked a customer through a reference design, three conversations typically follow.

1. Can we use a smaller model? Sometimes yes — and it’s worth exploring. A well-tuned 13B model can deliver surprisingly strong results for many enterprise use cases, at a fraction of the infrastructure cost of a 70B. The right answer depends on the use case, not just the budget.

2. Can we quantise to reduce the footprint? INT8 quantisation roughly halves the memory footprint relative to FP16, with minimal quality loss for most inference workloads. INT4 goes further, but quality trade-offs become more noticeable and are worth validating before committing. FP4 on B300 hardware is the emerging sweet spot for next-generation inference: near-FP8 quality at half the memory cost, with hardware-accelerated compute. The catch is that it requires Blackwell-generation infrastructure.

3. What about fine-tuning? If the customer plans to fine-tune as well as infer, size for fine-tuning, because it’s the more demanding workload. Fine-tuning requires storing optimiser states and gradients alongside the model weights, which can triple or quadruple the VRAM requirement compared to inference alone, and push it higher still for full fine-tuning with Adam-style optimisers. A platform sized for fine-tuning will handle inference comfortably.


What’s Next

With three posts, we’ve built a complete sizing chain:

  • Part 1: Parameters and tokens — the two dials that drive every sizing decision
  • Part 2: From tokens per second to GPU count — the maths that connects users to hardware
  • Part 3: Precision, platform selection, and reference designs — where the maths meets the metal

The natural next conversation is the one that follows a sizing recommendation: how does an on-premises PowerEdge deployment compare to cloud over three years? That’s the cost modelling discussion — and it’s where a well-sized on-premises platform often tells a very different story to the cloud bill the customer is currently paying.



When the Wine Lies: Finding Quality in the Chemistry

Illustration of a wine bottle with measurements and analysis labels including alcohol, volatile acidity, sulphates, pH, and density, alongside a scatter plot showing data points with ratings.

Sometimes the data tells you what your wallet couldn’t.


Last weekend I cooked steak. Proper steak — the kind that deserves a decent red wine alongside it. So I did what any self-respecting wine buyer does: I spent more than usual. Higher price, better wine. That’s how it works, right?

Wrong.

The wine was awful. Bitter, sharp, aggressive — more paint stripper than Pinot. The kind of wine that makes you wonder whether the person who priced it had ever actually tasted it. As I pushed the glass aside and reached for the water, a question formed: in the real world, how do we actually measure wine quality?

Not price. Clearly not price. So what then?


Meet Charlie and Clare

Regular readers will know Charlie and Clare. Charlie is our Data Engineer — he builds the pipelines, aggregates the sources, and delivers clean, structured data. Clare is our Data Analyst and Data Scientist — she works with what Charlie hands her and finds the patterns worth knowing about.

This week they’re working with wine.

Charlie has been busy. A winery client wants to understand quality across their production batches. The data lives in multiple places — laboratory analysis systems, fermentation monitoring sensors, batch production records. Individually, each source is a fragment. Together, they tell a story. Charlie’s data platform pulls from all of them, normalises the formats, handles the joins, and delivers Clare a clean pipeline: 1,599 red wine samples, each described by eleven physicochemical measurements.

Alcohol content. Fixed and volatile acidity. Citric acid. Sulphates. Residual sugar. Chlorides. Density. pH. Free and total sulphur dioxide.

Eleven numbers per wine.

And crucially — no quality labels.


Clare’s Problem

Clare looks at the dataset and faces a genuinely interesting challenge. There are no categories here, no pre-assigned groups, no right answers to train against. Just eleven continuous measurements and 1,599 rows of chemistry.

This is the domain of unsupervised learning — the branch of machine learning that finds structure in data without being told what to look for. Where supervised learning optimises toward a target, unsupervised learning asks a different question entirely: what patterns exist that we haven’t defined yet?

Clare’s task is to let the data organise itself, then ask whether the organisation means anything.

Before she touches a model, she does something essential. She scales the data.

This matters more than it sounds. The eleven features live on very different scales — alcohol ranges from roughly 8 to 15, total sulphur dioxide from 6 to 289. Feed raw numbers into a distance-based algorithm and the large-scale variables will dominate purely through magnitude, drowning out the signal from smaller-range features. StandardScaler transforms everything to zero mean and unit variance — now every feature competes on equal terms.
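What StandardScaler does to each column can be shown in a few lines of plain Python. This is a sketch of the same calculation (subtract the mean, divide by the population standard deviation), not Clare’s actual code, and the four sample values per feature are illustrative rather than taken from her dataset.

```python
from statistics import mean, pstdev

def standardise(column: list[float]) -> list[float]:
    """Rescale one feature to zero mean and unit variance (z-scores)."""
    mu, sigma = mean(column), pstdev(column)
    return [(x - mu) / sigma for x in column]

# Illustrative values on very different scales.
alcohol = [9.4, 9.8, 10.5, 12.8]
total_so2 = [34.0, 67.0, 54.0, 145.0]

alcohol_z = standardise(alcohol)
total_so2_z = standardise(total_so2)

# Before scaling, sulphur dioxide spans over 100 units and alcohol under 4.
# After scaling, both are measured in standard deviations from their own mean.
print([round(z, 2) for z in alcohol_z])
print([round(z, 2) for z in total_so2_z])
```

Once both columns are in the same units, a distance between two wines reflects how unusual each measurement is, not which feature happens to have the biggest numbers.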

Charlie’s pipeline has already handled the missing values and format inconsistencies. Clare inherits clean data. That’s not an accident — it’s the platform working as intended.


Eleven Dimensions Are Too Many to See

Before clustering, Clare reduces complexity using Principal Component Analysis — PCA.

Think of PCA as finding the angles from which the data is most spread out. Eleven features create eleven dimensions, which is impossible to visualise and cognitively overwhelming to reason about. PCA finds new axes — principal components — that capture the maximum variance in the fewest dimensions.

The results are telling. Nine components are needed to explain 95% of the variance. No single axis dominates. The data genuinely is high-dimensional — there’s no shortcut that captures most of the story. The first two components together explain just 45.7% of variance.

That’s an important caveat Clare keeps front of mind: when she later visualises clusters on a two-dimensional PCA plot, she’s seeing less than half the structure. The scatter plot is illustrative, not definitive.

PC1 is driven primarily by acidity-related features — it broadly separates sharper, more acidic wines from rounder ones. PC2 captures alcohol and fermentation character — higher alcohol and sulphate concentrations, reflecting more complete fermentation and stronger microbial stability. Even this compressed view starts to suggest that wine chemistry has meaningful directions of variation.


How k-means Works — The Wine Tasting Table

Imagine Clare pours all 1,599 wine samples into glasses and lines them up on a long tasting table. She doesn’t know how many groups there are yet, but she suspects wines with similar chemistry will naturally belong together.

She picks three glasses to act as reference points (her starting cluster centres; three is her “k”) and assigns every other wine to whichever reference glass it’s closest to, chemically speaking. Then she looks at each group, works out the average chemistry of all its members, and moves her reference point there. The new reference point is that average itself, so it need not match any real glass on the table. Wines get reassigned. Reference points shift again. The process repeats until nothing moves anymore: the groups have stabilised around their natural centres.

That’s k-means. Not magic, not mystery. An algorithm that keeps nudging reference glasses along the table until the groupings settle. The “k” is simply the number of reference points — Clare’s job is to choose it wisely, which is where the elbow method comes in.
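The tasting-table loop translates almost line for line into code. This is a minimal sketch of the standard assign-then-update iteration on six made-up two-dimensional points, with hand-picked starting centres; it is not Clare’s implementation, and a library version would add smarter initialisation and multiple restarts.

```python
import math

def kmeans(points, centres, max_iter=100):
    """Alternate assignment and update steps until the centres stop moving."""
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centre.
        labels = [min(range(len(centres)), key=lambda j: math.dist(p, centres[j]))
                  for p in points]
        # Update step: each centre moves to the mean of its members.
        new_centres = []
        for j, old in enumerate(centres):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centres.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centres.append(old)  # keep an empty cluster's centre in place
        if new_centres == centres:
            break  # nothing moved: the groups have settled
        centres = new_centres
    return labels, centres

# Made-up, already-scaled measurements for six wines.
points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.8), (9.0, 1.0), (9.2, 1.2)]
starting_centres = [points[0], points[2], points[4]]  # three reference glasses

labels, centres = kmeans(points, starting_centres)
print(labels)
```

The result depends on where the reference glasses start. A poor starting choice can settle into a worse grouping, which is why library implementations typically try several starts and keep the best.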


Letting the Data Find Its Own Groups

Clare runs k-means clustering — an algorithm that partitions observations into k groups by minimising the distance between each point and its cluster centre.

The question is: what should k be?

She uses two methods in parallel. The elbow method plots inertia (the sum of squared distances from each point to its cluster centre) against increasing values of k. As k grows, inertia falls; the question is where the rate of improvement flattens. The silhouette coefficient measures how well each point sits within its assigned cluster compared to the nearest neighbouring cluster. It ranges from -1 to 1, and higher is better.
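Both measures are short enough to write out by hand, which helps demystify them. The sketch below computes inertia and the mean silhouette coefficient on six made-up points; the data and the two candidate groupings are illustrative assumptions, and a library implementation would be the sensible choice on 1,599 rows.

```python
import math

def inertia(points, labels, centres):
    """Sum of squared distances from each point to its assigned centre."""
    return sum(math.dist(p, centres[label]) ** 2 for p, label in zip(points, labels))

def silhouette(points, labels):
    """Mean silhouette coefficient across all points."""
    clusters = set(labels)
    scores = []
    for i, p in enumerate(points):
        def mean_dist(cluster):
            dists = [math.dist(p, q) for j, q in enumerate(points)
                     if labels[j] == cluster and j != i]
            return sum(dists) / len(dists) if dists else 0.0
        own = labels[i]
        if labels.count(own) == 1:
            scores.append(0.0)  # usual convention for a single-member cluster
            continue
        a = mean_dist(own)                                  # cohesion
        b = min(mean_dist(c) for c in clusters if c != own)  # nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) else 0.0)
    return sum(scores) / len(scores)

points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.8), (9.0, 1.0), (9.2, 1.2)]

good_labels = [0, 0, 1, 1, 2, 2]   # groups that match the obvious structure
bad_labels = [0, 1, 0, 1, 0, 1]    # an arbitrary split for comparison
good_centres = [(1.1, 0.9), (5.1, 4.9), (9.1, 1.1)]

good_score = silhouette(points, good_labels)
bad_score = silhouette(points, bad_labels)
good_inertia = inertia(points, good_labels, good_centres)
```

On tidy toy data like this the good grouping scores close to 1. Real measurements with gradual boundaries score far lower even when the grouping is useful.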

Both methods point toward k=3 as a defensible choice. Three groups. Clare fits the final model.

The clusters contain 722, 502, and 375 wines respectively.


What the Clusters Actually Found

Now Clare looks at the chemistry of each group.

Cluster 1 — 502 wines — stands out immediately. Highest alcohol (10.72%), lowest volatile acidity (0.41 g/dm³), highest sulphates (0.75 g/dm³). These are markers experienced winemakers recognise: lower volatile acidity means less acetic acid, less of that sharp, vinegary edge. Higher sulphates support microbial stability and structure.

Cluster 0 — 722 wines — shows the inverse pattern. Highest volatile acidity (0.61), lowest sulphates (0.61). More of that aggressive sharpness Clare’s colleague experienced at the weekend.

Cluster 2 — 375 wines — is characterised by elevated sulphur dioxide levels and the lowest alcohol of the three groups (9.88%), suggesting less complete fermentation.

Three chemical profiles. Found without a single quality label in sight.


The Reveal

Now Clare looks at quality.

She takes the quality scores — held back throughout the entire analysis — and calculates the mean score per cluster. This is post-hoc interpretation only. The clustering didn’t know about quality. But the results are striking.

Cluster | Profile | Mean Quality
1 | High alcohol, low volatile acidity, high sulphates | 5.96
0 | High volatile acidity, lower sulphates | 5.55
2 | High sulphur dioxide, lowest alcohol | 5.36
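The step itself is simple: the quality column is joined back on only after the clusters exist, then averaged per cluster. A minimal sketch, using a handful of made-up (cluster, quality) pairs rather than the 1,599 real ones:

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical (cluster label, quality score) pairs, for illustration only.
rows = [(0, 5), (0, 6), (1, 6), (1, 7), (1, 5), (2, 5), (2, 6), (2, 5)]

# Group the held-back quality scores by the cluster each wine landed in.
by_cluster = defaultdict(list)
for cluster, quality in rows:
    by_cluster[cluster].append(quality)

mean_quality = {cluster: fmean(scores) for cluster, scores in sorted(by_cluster.items())}
print(mean_quality)
```

The important point is the order of operations: quality never enters the clustering, so any alignment between cluster and score is something the chemistry produced on its own. With pandas, the same step would typically be a one-line `groupby`.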

The algorithm, working only from chemistry, has separated wines in a way that aligns meaningfully with human quality judgement. The group with the most favourable chemical profile scores highest. The group with the most aggressive volatile acidity scores lower, and the group with the lowest alcohol scores lowest of all.

Clare didn’t tell the model what quality meant. The chemistry already knew.


An Honest Number

The silhouette score is 0.19.

By textbook standards that’s weak. Some analysts would look at that number and worry. Clare doesn’t, and it’s worth understanding why.

Wine chemistry is continuous. There are no hard walls between a quality-6 wine and a quality-7 wine — no moment where one chemical compound crosses a threshold and suddenly the wine is better. The boundaries between clusters are gradual, overlapping, real-world messy. A low silhouette score in this context isn’t a sign that the analysis failed. It’s information about the nature of the data itself.

The clusters are soft. The patterns are genuine. These two things are not contradictory.

This matters for how Clare reports her findings. She isn’t presenting three neat buckets — “good wine, average wine, poor wine.” She’s presenting three chemical tendencies, with meaningful separations on the features that wine science already tells us matter.


Why Charlie’s Platform Made This Possible

It’s worth pausing on something easy to take for granted.

Clare’s analysis worked because she had complete, clean, comparable data. Every one of the 1,599 samples described by the same eleven features, scaled and pipeline-ready.

In the real world, that’s rarely the starting point. Laboratory analysis lives in one system. Sensor data from fermentation monitoring lives in another. Batch and production records in a third. Pricing and commercial data somewhere else entirely. Each system uses different formats, different naming conventions, different update frequencies.

Without Charlie’s data platform aggregating those sources into a coherent, governed pipeline, Clare isn’t doing unsupervised learning on 1,599 wines. She’s manually reconciling spreadsheets and hoping nothing got lost in the joins.

The insight, that chemical profile aligns with perceived quality without any reference to price, is only discoverable when the data foundation exists to support the question. Structure has to be built in, not bolted on.


What Clare Does Next

Unsupervised learning is Clare’s first move with an unfamiliar dataset. It reveals what’s there before asking what predicts what.

The natural next step is supervised learning. Now that three chemical profiles have been identified, Clare can use cluster membership to inform stratified sampling — ensuring any training dataset for a quality prediction model includes representative coverage of all three groups rather than accidentally over-representing one chemical type.
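Stratified sampling by cluster can be sketched as follows. The function `stratified_split`, the 20% test fraction and the seed are assumptions for the example; only the three cluster sizes come from the analysis above.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so that each cluster appears in the same proportion in both."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_cluster[lab].append(idx)
    train, test = [], []
    for idxs in by_cluster.values():
        rng.shuffle(idxs)                          # random pick within each cluster
        n_test = round(len(idxs) * test_fraction)  # same fraction taken from every cluster
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)

# Cluster sizes from the analysis: 722, 502 and 375 wines.
labels = [0] * 722 + [1] * 502 + [2] * 375
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

A purely random split would usually land close to these proportions on 1,599 wines, but stratifying guarantees it. If Clare were using scikit-learn, `train_test_split` accepts a `stratify` argument that does the same thing.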

She could also bring price into the analysis. If Charlie’s platform connects to commercial data, Clare can ask the question that started this whole investigation: does chemical profile correlate with price? Is the expensive-but-terrible wine an outlier, or is the price-quality assumption systematically weak across the portfolio?

That’s a question worth answering before anyone’s next steak dinner.


The Takeaway

Eleven numbers. No labels. Three meaningful groups.

Clare found chemical structure that aligns with human quality judgement — not because she told the algorithm what quality meant, but because the chemistry already encoded it. Unsupervised learning didn’t give her answers. It gave her the right questions.

And behind all of it, doing the unglamorous work that makes the glamorous work possible, was Charlie’s data platform.

Quality has to be found in the data. But first, the data has to be there to find it.


Next in the series: Clare takes the cluster profiles into supervised learning — and finds out whether chemistry can predict quality well enough to save the rest of us from expensive mistakes.

Ethics in AI: Part 5

An infographic illustrating the feedback loops in content recommendation and credit approval systems, emphasizing the cumulative effect of small errors on structural harm.

Social Impact

Years ago — life before iTunes, before Spotify, before algorithms knew what you liked before you did — I heard the most beautiful song on the radio.

I rushed for a pen and paper. Too late. The DJ had already moved on. I had no title, no artist, no way to find it. Just the music, lodged somewhere in memory, with nowhere to go.

That song stayed with me for years.

Then one day I heard it again. I recognised it immediately — the same ache, the same sound. This time I was ready. Moments later I was on Amazon, searching for what I’d finally caught: Nick Drake. River Man. I bought the CD — Five Leaves Left — and discovered one of the most quietly extraordinary artists I have ever heard.

Nick Drake never had a hit in his lifetime. He sold a few thousand records. He died at twenty-six, largely unknown. His reputation grew slowly, entirely through human recommendation — one person telling another, a song surfacing unexpectedly on a radio programme, a stranger pointing someone in the right direction. Decades after his death, he is considered a towering influence.

I think about that story when I think about what recommendation algorithms do — and what they can’t do.

Those moments of genuine discovery are becoming rarer. And it is not an accident.


In previous parts of this series we have examined bias and fairness, privacy and consent, and transparency and explainability. Each of those topics asks what happens when AI gets something wrong in the moment — a biased decision, a privacy violation, an unexplained outcome. Social impact asks a different and harder question: what happens when AI gets things right, consistently, at scale — and the cumulative effect is still harmful?

This is the part of AI ethics that is easiest to overlook. There is no single decision to challenge. No obvious moment of failure. Just a system doing exactly what it was designed to do, and a world quietly changing around it.


The Feedback Loop

Machine learning systems influence social structures when deployed at scale. Decisions about credit approval, employment screening, content recommendation, and public resource allocation affect opportunities and outcomes for individuals and communities — not just once, but continuously, and often invisibly.

The mechanism behind many social harms is the feedback loop: a system trained on past behaviour makes decisions that shape future behaviour, which then becomes the training data for the next version of the model. Each cycle reinforces what came before. Small biases become structural ones. Initial disparities widen. And because every individual decision appears reasonable, the cumulative drift goes unnoticed until the damage is done.


Example One: The Playlist That Narrows

Consider a music streaming platform that recommends songs based on what users have previously listened to. A user starts with a few popular mainstream artists. The system, doing its job, recommends more of the same. Over time, the user is repeatedly exposed to the same genres, the same sounds, the same familiar names — while niche and emerging artists remain invisible.

The feedback loop runs like this: past listening shapes recommendations, recommendations reinforce listening patterns, and those patterns feed the next round of recommendations. Popular artists become more popular. Smaller artists remain underrepresented. Not because the algorithm intended to marginalise them — but because it was optimised for engagement, and engagement follows familiarity.

Each recommendation, taken alone, is perfectly reasonable. A user who likes one thing probably likes similar things. But the cumulative effect reshapes what people discover, what gains cultural traction, and ultimately who earns a living from their music. A technical optimisation becomes a cultural force. And nobody pressed a button that said “narrow the culture.”
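A toy simulation shows how quickly this kind of loop can concentrate attention. It is a sketch under stated assumptions, not a model of any real platform: ten hypothetical artists with nearly equal starting play counts, a recommender that only surfaces the three most-played, and 100 new plays per round shared among those three.

```python
def simulate(plays, rounds=10, slots=3, new_plays=100):
    """Each round, only the `slots` most-played artists are recommended and share the new plays."""
    plays = list(plays)
    for _ in range(rounds):
        # The recommender surfaces whoever is already most played.
        top = sorted(range(len(plays)), key=lambda i: plays[i], reverse=True)[:slots]
        total_top = sum(plays[i] for i in top)
        # New listening goes to the recommended artists, in proportion to their existing plays.
        for i in top:
            plays[i] += new_plays * plays[i] / total_top
    return plays

def top_share(plays, slots=3):
    """Fraction of all plays held by the `slots` most-played artists."""
    return sum(sorted(plays, reverse=True)[:slots]) / sum(plays)

start = [12, 11, 10, 10, 9, 9, 8, 8, 7, 6]   # hypothetical starting play counts
end = simulate(start)
print(f"top-3 share before: {top_share(start):.0%}, after: {top_share(end):.0%}")
```

In this toy world the top three go from roughly a third of all plays to about 95% in ten rounds, while the other seven artists never gain a single play. No step in the loop is unreasonable on its own, which is exactly the problem.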


Example Two: The Loan That Was Never Offered

Now consider a model used to decide who gets approved for a loan.

If the historical data that trained the model reflects decades of biased lending practices — and it often does — the model will learn to reject applicants from certain demographic groups at higher rates. It is not making a racist decision in the way a human might. It is making a statistically grounded one, based on patterns in the data. But those patterns are the residue of past discrimination.

The feedback loop here is more severe, and the stakes are higher:

  • Fewer approved loans → fewer opportunities to build credit, start businesses, or buy homes
  • Fewer opportunities → continued financial disadvantage
  • Continued disadvantage → future data that confirms the model’s original assessment

The system appears statistically accurate. It is. And it is also socially harmful. The two things are not mutually exclusive — which is what makes this so difficult to resolve by purely technical means.


Scale Changes Everything

A small systematic error, repeated at scale, produces significant societal consequences. This is the central insight of social impact analysis in AI.

A single biased loan decision is a wrong that can be appealed. A biased model making ten thousand decisions a day, over years, without review, is a structural shift in who gets access to capital. A streaming algorithm that slightly deprioritises independent artists across a platform of three hundred million users does not just affect listening habits — it shapes the economics of an entire industry.

This is why social impact analysis requires evaluating not only individual predictions but cumulative effects. It requires monitoring mechanisms capable of detecting harm early — before it becomes entrenched. It requires stakeholder engagement with affected communities, because the people most likely to identify risks are often the ones the system is making decisions about. Technical analysis, however rigorous, cannot see what it has not been designed to look for.

Machine learning systems are not neutral tools. They are components of socio-technical systems — embedded in institutions, shaped by history, and capable of reinforcing or redirecting the structures they operate within. Their evaluation must extend beyond statistical metrics to include institutional and societal considerations. That is not a soft requirement. It is an engineering one.


Asking “does the model perform well?” is no longer sufficient. The question that matters is: “What does the world look like after this model has been running for five years?”

Social impact has to be built in, not bolted on.


Next: Part 6 — Ethical Trade-offs. The honest conclusion: there are no perfect answers. Only deliberate choices.

Do AI Projects Fail — Or Do We Fail AI?

Blueprint illustration featuring a bird labeled 'DIStraction VECTOR,' and a tower structure with three layered sections labeled 'PROCESS ALIGNMENT,' 'SKILLS & PEOPLE,' and 'DATA FOUNDATIONS.'

If you look back over the last 30 years, our technology history is riddled with failed projects, projects that never got off the ground, and projects that went massively over budget. And then there are the scandals.

The UK Post Office Horizon scandal stands as one of the most serious technology failures in modern British history. Between 1999 and 2015, around 1,000 sub-postmasters were wrongfully prosecuted after the Fujitsu-supplied Horizon accounting software recorded losses that did not exist. The total cost of redress now stands at around £2 billion. More than 13 people took their own lives. The technology did not just fail — it was trusted when it should have been questioned, and the consequences were devastating.

AI is just another technology. That is worth saying plainly. Although many of the possibilities being talked about are genuinely achievable, reality always kicks in. Every project — AI or otherwise — is a complicated combination of people, software, hardware, and process. That has been true for 30 years. It remains true now.

But we can always learn from the past. The question is whether we choose to.

AI is no different. And right now, the data on AI project failure should give everyone pause. The numbers are in. And they are not improving.

RAND Corporation, in one of the most rigorous independent analyses of AI project failure to date, interviewed 65 experienced data scientists and engineers across industries and company sizes. Their finding: more than 80% of AI projects fail to reach meaningful production deployment. That is twice the failure rate of IT projects without an AI component.

S&P Global’s 2025 survey of over 1,000 organisations across North America and Europe found that 42% of companies abandoned most of their AI initiatives that year. In 2024, that figure was 17%. The abandonment rate more than doubled in a single year. MIT’s Project NANDA, published in July 2025, found that 95% of organisations deploying generative AI saw zero measurable return. Not low return. Zero.

With global AI spending projected to reach $630 billion by 2028, these failure rates are not just a statistic. They represent hundreds of billions of dollars in wasted investment, stalled initiatives, and businesses no closer to the outcomes they were promised.

What makes this harder to ignore is that the failure rate is moving in the wrong direction. The technology is more capable than it has ever been. The investment is larger than it has ever been. And yet more organisations are abandoning AI initiatives today than they were twelve months ago.

So what is actually going wrong?


The Research Points to Three Things

Across RAND, McKinsey, Gartner, MIT, and Informatica’s CDO Insights survey, the diagnosis is remarkably consistent. AI projects fail for three reasons, and they repeat themselves across industries, geographies, and organisation sizes.

The wrong use case. The process targeted for AI is not the right one, the problem definition is vague, or the initiative is chasing a technology rather than solving a business problem.

Data that is not ready. The data that would be needed to make AI work in production does not exist in the form required — it is fragmented, inconsistent, ungoverned, or simply not there.

A skills gap. The people needed to build, deploy, and sustain AI in a business context are not in place — and the organisation has not yet found a way to close that gap.

None of these are surprises. But the uncomfortable truth is that most organisations are still walking into all three of them, often simultaneously.


The Wrong Use Case: The Magpie Effect at Work

The first failure mode — choosing the wrong process to target — is something I have written about in depth. The Magpie Effect describes what happens when AI strategy is driven by possibility rather than process: the endless pivot towards the latest model, the newest capability, the most impressive vendor demo. Every pivot consumes time, burns budget, and erodes the momentum that comes from doing one thing properly and scaling it.

The RAND report is direct on this point. Stakeholders often misunderstand or miscommunicate what problem needs to be solved. Models get built and deployed optimised for the wrong metrics, or ones that simply do not fit into the real business workflow.

The antidote is starting with the Golden Process — the one business process that everything else hinges on — and asking what AI can do to remove the constraints within it. That conversation has to happen before any vendor is in the room, before any use case is evaluated, and certainly before any infrastructure is selected.

If you have not read the Magpie Effect post, the practical AI test at the end of it is a useful filter for any use case that lands on your desk.


Data That Is Not Ready: The Silent Failure

The second failure mode is the one that catches organisations by surprise — because it tends not to surface until a project is already in trouble.

The pattern is familiar. A proof of concept is built on a carefully selected, cleaned-up sample dataset. The demo runs well. Leadership approves production. And then everything stalls.

Production data is fragmented across systems that were never designed to talk to each other. Basic business terms — “customer”, “order”, “revenue” — are defined differently across departments. Historical records have gaps. Formatting is inconsistent. The clean sample that powered the demo bears almost no resemblance to the messy reality of how the business actually runs.

Informatica’s 2025 CDO Insights survey found that data quality and readiness was the top obstacle to AI success, cited by 43% of organisations. McKinsey found that organisations reporting significant AI returns are twice as likely to have invested in data infrastructure before selecting modelling techniques. Gartner predicts that 60% of AI projects lacking AI-ready data will be abandoned entirely.

This is not a new problem. It is the same problem that has existed since organisations first tried to build anything useful on top of their data. The difference is that AI amplifies it. A flawed report can be corrected. A model trained on broken foundations will confidently produce broken outputs at scale — and the damage compounds before anyone notices.

I have covered the data readiness problem in detail in Garbage In, Expensive Garbage Out and A Million Rows of Nothing. The short version: AI does not fix bad data. It scales it.


The Skills Gap: The Barrier Nobody Budgets For

The third failure mode is the one that receives the least attention — and may be the hardest to solve quickly.

The last 30 years of enterprise IT have built deep, hard-won expertise in infrastructure. Server architecture. Storage design. Networking and virtualisation. Security and compliance. That expertise is not obsolete — it remains essential. The physical and virtual foundations that AI runs on still need people who understand them properly.

But AI demands a different and additional skill set that most organisations are still building.

Traditional IT Skills (Still Relevant) | New Skills Now Required
Server architecture and management | Data engineering
Storage design and optimisation | AI/ML engineering
Networking and virtualisation | Data science
Security and compliance | Business domain knowledge

McKinsey’s 2024 State of AI survey found that 58% of businesses are hampered by internal AI skill shortages. Informatica’s CDO Insights survey placed skills and data literacy third in the list of top AI obstacles. PwC found that almost 65% of executives acknowledge their AI initiatives are not succeeding because of a lack of executive sponsorship — a leadership gap that is itself a symptom of not having people in the room who can translate between business outcomes and AI capability.

The skills gap is not a technology problem. It is a people and partnership problem. And it is the reason most AI pilots never leave the room they were born in.

Picture a typical scenario. An organisation identifies a strong AI use case, aligned to a real business process. The data is in reasonable shape. Leadership is supportive. A vendor is engaged. And then the questions start: who is going to build the data pipeline? Who owns the model in production? Who bridges the gap between what the model does and what the business actually needs it to do? The room goes quiet. Not because the will isn’t there — but because the people aren’t.

The honest message is that organisations do not need to replace their existing IT teams — they need to extend them. The people who understand the infrastructure are still essential. But they need colleagues who understand data pipelines, model behaviour, and the business processes those models need to serve. That combination is rare. It takes time to build. And in its absence, even the right use case with clean data will stall.

This is why Partner selection matters as much as use case selection. The right Partner does not just bring technical capability — they bring the scars of what does not work. An AI and data practice built over years has already made the mistakes most organisations have not made yet. That is not a credential. That is insurance.

AI is not a product. It is an ecosystem built on partnership.


Three Causes. One Pattern.

What the research is describing — even when it does not use these words — is organisations that jumped to the model before they had solved the three problems that sit upstream of it.

They chased the possibility rather than the process. They skipped the data foundations. And they underestimated how different the skills requirement would be.

McKinsey’s summary of what separates the 6% of genuine AI high performers from everyone else is as sharp as it gets: AI is 20% algorithms, 80% organisational rewiring.

The organisations building durable AI capability are not necessarily the ones with the most sophisticated models. They are the ones that got the process right, made the data ready, and put the right people in place — before they wrote a single line of model code.

The failure statistics look alarming until you understand the causes. Then they look entirely predictable.

The good news: all three are solvable. None of them require waiting for the next frontier model. They just require doing the less glamorous work first.

I have written about what that looks like in practice. Getting AI Right First Time sets out a five-step path from honest awareness through to durable, enterprise-wide capability — the journey from AI as a science project to AI as part of how the business actually runs. And In Praise of Boring, Everyday AI makes the case for what success actually looks like in production: not the demo, not the headline, but the quiet system that runs on a Tuesday afternoon without anyone noticing — because it has simply become part of how the business works.

That is the goal. Not AI that impresses. AI that endures.


AI success has to be built on the right foundations — not retrofitted onto broken ones.

A Million Rows of Nothing

A graphic illustrating a grid labeled 'A MILLION ROWS OF NOTHING,' featuring numerical values, with most cells showing '0.00' and select cells highlighted in orange displaying '1.00.' A crossed-out server icon is on the left, and a note at the bottom reads 'DO NOT SKIP.'

Why business use case and data strategy must come before AI strategy

At a customer event last year, an IT Director told me — with some confidence — that they already had an AI strategy.

“Great,” I said. “Now tell me about your data strategy.”

“We have 250TB,” he replied.

I nodded. And thought: there is a very big difference between data and storage.

That moment has stayed with me because it wasn’t an isolated conversation. It was a pattern. Organisations are arriving at the AI table with infrastructure plans, vendor commitments and boardroom ambition — but without first validating the business use case, the predicted ROI, or the data required to support either one.

That is the gap. And it is an expensive one.


The gap is earlier than most organisations think

Walk any AI conference floor and the energy is real. The technology is genuinely impressive. GPU servers are being specced, procured and racked. Data scientists are being hired. AI roadmaps are being presented to boards.

And somewhere near the bottom of the slide deck, almost as an afterthought: “We’ll need to look at data readiness.”

For organisations serious about AI delivering real outcomes, this is the wrong order.

The first question should not be “what infrastructure should we buy?” It should be “what business problem are we solving, what return do we expect, and does our data actually support that outcome?”

If those questions haven’t been answered, the AI strategy isn’t yet a strategy. It’s an ambition.


Start with the business use case and predicted ROI

Before talking about models or servers, organisations need clarity on three things: what specific business problem are we trying to solve, what result would make this investment worthwhile, and what evidence suggests the data can support that result?

This matters because businesses don’t invest in AI for the sake of AI. They invest in outcomes — lower cost, higher revenue, reduced risk, better service, faster decisions, improved productivity.

The business use case and predicted ROI have to come first. They set the standard the data must meet, the model must prove, and the infrastructure must eventually support. Without that anchor, teams end up building technical capability in search of commercial justification.


Then comes data strategy

This is where many organisations confuse capacity with capability.

Saying “we have 250TB” is not describing a data strategy. It is describing a storage estate.

A real data strategy answers different questions. What data actually matters for the use case? Where does it live? Who owns it? How is it governed? How trustworthy is it? How easily can it be accessed, joined, prepared and used?

AI doesn’t begin with infrastructure. It begins with understanding whether the organisation has data that is usable, relevant, governed and connected to a business objective. That is why data strategy has to come before AI strategy. If you don’t understand the asset you’re asking AI to learn from, you don’t yet know whether the strategy is viable.


Data engineering is not pre-work. It is the work.

The foundational argument is simple, even if it’s routinely ignored: data engineering is not a precursor to AI work. It is AI work.

The pipelines, the schemas, the quality checks, the lineage, the transformation logic — these are not the boring bit before the interesting bit starts. They are the work.

A model is only ever as good as the data it learns from. If that data is incomplete, inconsistently formatted, poorly labelled or structurally flawed, the model will learn the wrong things with great efficiency. Garbage in, amplified garbage out at scale.

The data engineering layer needs to be in place — and understood — before a model is trusted in production. That means clean, documented pipelines with known lineage, a clear system of record for the domain you’re working in, variables that are what they say they are, and critically — someone who has actually interrogated the data, not just counted the rows.
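Checking that variables are what they say they are can start very small. The sketch below is a hypothetical example: the column names, types and allowed ranges in `EXPECTED` are invented for illustration, and a real pipeline would usually lean on a dedicated validation tool rather than hand-written checks.

```python
# Hypothetical expectations for one table: column -> (type, minimum, maximum).
EXPECTED = {
    "price": (float, 0.0, 10_000.0),
    "discount": (float, 0.0, 1.0),
    "units": (int, 0, 1_000_000),
}

def check_row(row):
    """Return a list of human-readable problems; an empty list means the row passes."""
    problems = []
    for column, (kind, low, high) in EXPECTED.items():
        value = row.get(column)
        if value is None:
            problems.append(f"{column}: missing")
        elif not isinstance(value, kind):
            problems.append(f"{column}: expected {kind.__name__}, got {type(value).__name__}")
        elif not low <= value <= high:
            problems.append(f"{column}: {value} outside [{low}, {high}]")
    return problems

good = {"price": 12.5, "discount": 0.1, "units": 3}
bad = {"price": "12.50", "discount": 1.5}   # price stored as text, discount over 100%, units absent

print(check_row(good))
print(check_row(bad))
```

Checks like these count nothing and prove nothing about signal, but they catch the quiet failures (a number stored as text, a percentage stored as a fraction) before a model learns from them.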

250TB of storage tells you nothing about any of that.


Even clean data can still be useless

Here is where the conversation gets more uncomfortable. Because the problem isn’t always dirty data.

Sometimes the data looks clean. The schema is tidy. The row counts are impressive. The formatting is consistent. It passes the hygiene checks. And then you run the analysis — and discover the data tells you very little. Not because it’s messy. Because it’s empty of useful signal.

This is the moment EDA — Exploratory Data Analysis — earns its place. Not as a technical formality, not as a box to tick before the real work starts, but as the moment of truth. The point at which you find out whether your data can actually answer the question you’re asking of it.

That means looking at distributions, missingness, outliers, feature relationships, basic correlations, and whether the patterns you expected to see are actually present. If they aren’t, that isn’t a minor issue. It’s the whole issue.
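A first pass at those checks does not need heavy tooling. The sketch below profiles one column at a time; the five order records and the `profile` helper are hypothetical, made up purely to show the idea.

```python
from statistics import fmean, median

# Hypothetical order records; None marks a missing value.
rows = [
    {"price": 10.0, "discount": 0.00, "units": 120},
    {"price": 12.0, "discount": 0.10, "units": 150},
    {"price": 11.0, "discount": None, "units": 130},
    {"price": 95.0, "discount": 0.05, "units": 125},   # suspiciously high price
    {"price": 10.5, "discount": 0.20, "units": 180},
]

def profile(rows, column):
    """Missingness, centre and range for one numeric column."""
    values = [r[column] for r in rows]
    present = [v for v in values if v is not None]
    return {
        "missing": 1 - len(present) / len(values),   # share of rows with no value
        "mean": fmean(present),
        "median": median(present),                   # a big gap from the mean hints at skew or outliers
        "min": min(present),
        "max": max(present),
    }

for column in ("price", "discount", "units"):
    print(column, profile(rows, column))
```

Here the price column gives itself away: a median of 11 against a mean of nearly 28 says one value is dragging the average, and that is a question for the business before it is a question for the model. With pandas, `describe()` and `isna()` cover the same ground on a full table.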


A million rows of nothing is still nothing

This is why volume can be so misleading.

Take a look at this correlation matrix.

A correlation matrix displaying the relationships among numerical features including Price, Discount, Tax Rate, Stock Level, Customer Age Group, Shipping Cost, Return Rate, Seasonality, and Popularity Index. The matrix is color-coded with a gradient scale from blue to red indicating strength of correlation.

To the untrained eye it looks impressive. Professional. The kind of output that gets nodded at in a boardroom. But look closer. That red diagonal? Every variable correlating perfectly with itself — mathematically guaranteed, analytically meaningless. Everything else is zero. Price and Discount: no relationship. Seasonality and Stock Level: no relationship. Shipping Cost and Return Rate: no relationship.

In a real retail dataset those relationships should exist. The fact that this data shows none of them is a signal worth taking seriously. A flat correlation view doesn’t prove there is absolutely nothing to learn — but it does tell you there is no obvious predictive signal in this view of the data. That should trigger caution, not confidence.
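It is easy to reproduce this kind of matrix, which is part of why it deserves suspicion. The sketch below builds a synthetic table in which most columns are drawn independently at random, plus one column deliberately constructed to depend on price for contrast. The column names, ranges and the `pearson` helper are assumptions for illustration, not the dataset behind the image.

```python
import random
from math import sqrt
from statistics import fmean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

rng = random.Random(42)
n = 2000
price = [rng.uniform(5, 100) for _ in range(n)]
columns = {
    "price": price,
    "discount": [rng.uniform(0, 0.5) for _ in range(n)],          # drawn independently of price
    "return_rate": [rng.uniform(0, 0.2) for _ in range(n)],       # also independent
    "shipping_cost": [0.1 * p + rng.gauss(0, 1) for p in price],  # built to depend on price
}

names = list(columns)
matrix = {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}
for a in names:
    print(a.ljust(14), " ".join(f"{matrix[a][b]:6.2f}" for b in names))
```

The independent columns produce the same picture as the matrix above: ones on the diagonal, near-zeros everywhere else. The only strong off-diagonal value is the relationship that was built in on purpose. When a real dataset looks like the random columns, that is the cue to go back and ask where the data came from.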

You shouldn’t respond by buying more infrastructure. You should respond by asking better questions. Are these the right features? Is the data aggregated at the wrong level? Are important variables missing? Is the business question badly framed? Are we trying to predict something the data cannot meaningfully support?

If you can’t answer those questions, you are not ready to build the model. You are ready to do more analysis.


Model readiness comes after data readiness

Only once the business case is clear and the data has been tested should the conversation move to model readiness.

At that point the focus becomes more disciplined. Can the data support the target outcome? Which features actually carry useful predictive weight? What baseline performance is realistic? What error level is acceptable for the business use case? What would success look like in practice, not just in a notebook?
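One practical way to ground the baseline question is to compare a candidate model against the dullest possible alternative: always predicting the average. The sketch below does that on held-out rows. The numbers, the single feature and the helper names `fit_line` and `mae` are hypothetical.

```python
from statistics import fmean

def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    mx, my = fmean(xs), fmean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

def mae(actual, predicted):
    """Mean absolute error."""
    return fmean([abs(a - p) for a, p in zip(actual, predicted)])

# Hypothetical feature and target, split into rows to learn from and rows held back.
train_x, train_y = [1, 3, 5], [2.1, 6.2, 9.8]
test_x, test_y = [2, 4, 6], [3.9, 8.1, 12.1]

# Baseline: ignore the feature and always predict the training average.
baseline = fmean(train_y)
baseline_mae = mae(test_y, [baseline] * len(test_y))

# Candidate: a straight-line fit on the single feature.
slope, intercept = fit_line(train_x, train_y)
model_mae = mae(test_y, [slope * x + intercept for x in test_x])

print(f"baseline MAE: {baseline_mae:.2f}, model MAE: {model_mae:.2f}")
```

If the model cannot clearly beat the average on data it has not seen, the features are not carrying predictive weight, and no amount of infrastructure will change that. Whether the remaining error is acceptable is a business decision, not a modelling one.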

This is the stage where organisations find out whether the use case is genuinely model-worthy — or whether it looked better in a strategy deck than it does in reality. Model readiness is not about enthusiasm. It is about proof.


Infrastructure should be the consequence, not the starting point

The infrastructure conversation is seductive. More compute, faster processing, bigger clusters — these feel like progress. And they are progress, in the right context.

When you have a validated business case, a believable ROI, signal-rich data and a well-framed modelling problem, the right infrastructure genuinely accelerates outcomes. But infrastructure applied to unvalidated data doesn’t solve the problem. It scales it.

A model trained on the wrong data, running on the best hardware available, will produce wrong answers faster and at greater cost than anyone planned for. The servers don’t know the business case is weak. The GPUs don’t know the data is empty of signal. They will process bad assumptions with perfect efficiency.

That is why the sequence matters.

Business use case → Predicted ROI → Data strategy → Data engineering and EDA → Feature validation and model readiness → Infrastructure investment

Getting that sequence right is the difference between an AI investment that delivers and one that quietly disappoints.

The IT Director with 250TB has storage. What he needs first is a conversation about what’s in it, whether it’s been tested, whether it contains usable signal, and whether it can answer the questions the business is asking. That is the conversation worth having before the servers arrive.


Closing thought

There is a version of the AI hype cycle that ends badly — and it ends badly in a specific way. Not with dramatic failure, but with quiet disappointment. Models that don’t perform. Investments that don’t deliver. Data scientists hired to build things the data was never capable of supporting.

The organisations that avoid that outcome are the ones that did the unglamorous work first. They validated the use case. They estimated the ROI. They looked at the data before they bought the infrastructure. They ran EDA before they committed to the model. They asked hard questions before they made bold commitments.

The emperor’s new clothes are always convincing until someone asks the uncomfortable question. In AI, that question is usually the same:

Have you actually tested the data?

Data readiness has to be built in, not bolted on.

Ethics in AI: Part 4

Transparency and Explainability: inside the black magic box

Diagram illustrating the concept of explainability in AI ethics, featuring a blue background, large text that says 'ETHICS IN AI / PART 04,' and visual representations of explainable data, predictions, and algorithms, linked to an 'ACCOUNTABLE DECISION' node.

Ever since I can remember, I’ve wanted to know how things work. Not just that they work — but why, and what’s going on inside.

My parents had a name for it: Fiddle Fingers. Clocks, radios, household appliances — nothing was safe. I’d take them apart with genuine curiosity and varying degrees of success. The parts I couldn’t reassemble quietly disappeared under my bed. I even unscrewed the back of a plug once — purely to see electricity. The thunderbolt that shot up my arm was, in hindsight, a reasonable price for the lesson. My parents, to their credit, were more understanding than the appliances deserved.

Later I studied motor vehicle engineering — which at least meant I could finally take things apart professionally. And now, working through my MSc, I’m still doing the same thing. Still looking under the hood. Still asking what’s in the box.

Which makes the subject of this post a personal one. Because one of the most troubling things about modern AI systems is that they often don’t let you look inside. Not because the technology prevents it — but because transparency and explainability haven’t been treated as priorities. The box stays closed. And when the box is closed, accountability becomes very difficult to defend.

In the previous parts of this series we’ve looked at bias, fairness and accountability — the ethical challenges that emerge when AI systems make decisions that affect people’s lives. This instalment moves into territory that sits underneath all of those: if you can’t see how a system works, and can’t explain what it decided, the other ethical principles become very difficult to uphold.

Transparency and explainability are the mechanisms that make accountability possible. Without them, everything else is aspiration.


Two related concepts, one shared purpose

The terms transparency and explainability are often used interchangeably. They shouldn’t be — they address different things, and the distinction matters.

Transparency concerns the visibility of the system as a whole. How was it built? What data was it trained on? What modelling choices were made, and why? What does the performance data actually show? Transparency enables external scrutiny. It supports governance, auditability and regulatory oversight. Without it, independent evaluation becomes impossible — you’re simply asked to trust the outcome without any means of checking it.

Explainability concerns the individual decision. Not the system in aggregate, but this specific output: why did the model produce this result for this person, in this context, at this moment? In high-stakes settings — healthcare, criminal justice, financial services — that question isn’t academic. It’s a matter of rights.

Think of it this way. Transparency lets you audit the factory. Explainability lets you understand why one particular product came off the line the way it did.

Both matter. And in most real-world deployments of AI today, both are harder to achieve than the marketing suggests.


The three questions explainability has to answer

When we talk about making an AI system explainable, we’re really asking three distinct questions — and each requires a different kind of answer.

The first is about data. What information was used to train the model, and why was it chosen? This isn’t just a technical question. Training data encodes assumptions about the world, and those assumptions shape every output the model produces. If the data can’t be explained and justified, the decisions downstream can’t be either.

The second is about predictions. What features and weights drove this particular output? Why did the model score this applicant lower than another? Which variables carried the most influence, and in what direction? This is where post hoc explanation techniques — tools that interpret model behaviour after the fact — do most of their work.

The third is about the algorithm itself. What are the layers, the thresholds, the decision boundaries? How does the model move from input to output? For simpler models, this question has a direct answer. For more complex ones, it often doesn’t — which is where the central tension of this topic lives.
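To make the second question concrete, here is a minimal sketch of a per-decision explanation for a simple linear scoring model. The weights, baseline values and feature names are invented for illustration. For a linear model, each feature's contribution is just its weight multiplied by how far the applicant sits from the baseline. Complex models need dedicated post hoc tools such as SHAP or LIME to approximate the same kind of answer.

```python
# Hypothetical linear scoring model: weights and baseline are made-up values.
WEIGHTS = {"income": 0.4, "missed_payments": -1.5, "years_at_address": 0.2}
BASELINE = {"income": 3.0, "missed_payments": 1.0, "years_at_address": 5.0}


def explain(applicant: dict[str, float]) -> list[tuple[str, float]]:
    """Return each feature's contribution relative to the baseline applicant,
    sorted so the most influential feature comes first."""
    contributions = {
        name: WEIGHTS[name] * (applicant[name] - BASELINE[name])
        for name in WEIGHTS
    }
    return sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)


applicant = {"income": 2.0, "missed_payments": 3.0, "years_at_address": 6.0}
ranked = explain(applicant)
for name, contribution in ranked:
    direction = "raised" if contribution > 0 else "lowered"
    print(f"{name} {direction} the score by {abs(contribution):.1f}")
```

The output is a ranked list a person can interrogate: which variable mattered most, and in which direction.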


COMPAS: when a black box meets a courtroom

No case study illustrates the stakes of transparency and explainability more starkly than COMPAS — the Correctional Offender Management Profiling for Alternative Sanctions tool, widely used in the United States to assess the risk that a defendant will reoffend.

Judges used COMPAS scores to inform decisions about bail, sentencing and parole. The scores carried real weight in outcomes that determined whether people went home or went to prison. And yet the algorithm that produced those scores was proprietary. Defendants had no means of understanding how their score was calculated, no ability to identify errors in the underlying data, and no realistic way to challenge the output in court.

In 2016, ProPublica published an investigation showing that COMPAS assigned significantly higher reoffending risk scores to Black defendants than to white defendants with comparable profiles. The tool wasn’t just opaque — it was producing outcomes that were racially skewed in one of the highest-stakes contexts imaginable.

The issue reached the Wisconsin Supreme Court in State v. Loomis (2016), where the defendant argued that using a proprietary, unexplainable algorithm in sentencing violated his right to due process. The court upheld the use of the tool. The algorithm remained a black box.

COMPAS sits at the intersection of everything that matters in this conversation. Transparency was absent — no visibility into the model’s design, data or validation. Explainability was absent — no way to interrogate individual decisions. And the consequences were borne by people who had no recourse and no means of understanding why.


The tension that doesn’t go away

Here is the dilemma that transparency and explainability force us to confront — and it doesn’t have a clean resolution.

The models that tend to perform best on complex, real-world prediction tasks are also the least interpretable. Deep neural networks, gradient boosting models, large ensemble methods — these approaches can achieve superior predictive accuracy precisely because they capture subtle, non-linear relationships in data that simpler models miss. But that complexity comes at a cost: the internal workings become difficult, sometimes impossible, to explain in terms a human can meaningfully interpret.

Simpler models — linear regression, decision trees, rule-based systems — offer genuine interpretability. You can follow the logic from input to output, identify which variables matter and by how much, and explain a decision to the person it affects. But they often sacrifice accuracy to do it. In a noisy, high-dimensional real world, simpler models sometimes just get more things wrong.

This is not a technical problem waiting for a technical solution. It is a genuine ethical trade-off. In some contexts — say, a recommendation engine for a streaming service — that trade-off sits comfortably on the side of performance. In others — a credit decision, a medical diagnosis, a criminal risk score — the question of what we’re willing to sacrifice for accuracy becomes a question of values, not engineering.

Regulatory frameworks are beginning to codify where that line falls. The EU AI Act classifies high-risk AI applications and mandates transparency and explainability requirements accordingly. The GDPR restricts solely automated decisions that significantly affect individuals and requires meaningful information about the logic involved, which is often described as a right to explanation. But regulation sets a floor, not a ceiling — and the honest truth is that many organisations are still well below it.


What good looks like in practice

Transparency and explainability aren’t binary. They exist on a spectrum, and the appropriate level depends on context — the stakes involved, the people affected, and the regulatory environment in play.

For high-risk applications, the baseline should include clear documentation of training data, modelling choices and performance metrics across demographic groups; post hoc explanation tools that can surface the key drivers of individual decisions; human review mechanisms for decisions that significantly affect individuals; and the ability to audit the system independently — not just internally.

For lower-risk applications, lighter-touch approaches may be proportionate. But the principle remains: the system should be able to account for itself, and the people it affects should have a meaningful way to understand and, where necessary, challenge its outputs.

The temptation to treat explainability as a presentation problem — a dashboard, a label, a percentage confidence score — should be resisted. A number on a screen is not an explanation. An explanation is something a person can interrogate, reason about and act on.


Closing thought

There is a version of AI development where transparency and explainability are treated as compliance tasks — boxes to tick, documentation to file, a report to produce before launch. That version produces systems that look accountable without being accountable.

The harder version asks the question earlier: before a model is selected, before a dataset is assembled, before a use case is approved. It treats interpretability as a design constraint, not an afterthought. It asks whether a complex model is actually necessary, or whether a simpler, more explainable one would serve the purpose well enough.

That version is also the honest version. Because when a system makes a decision that changes someone’s life — and they ask why — “the algorithm is proprietary” is not an answer any ethical organisation should be comfortable giving.

Transparency and explainability have to be built in, not bolted on.

The Unglamorous Guide to Agentic AI

A flowchart illustrating a framework with sections labeled 'Experiments,' 'Mapping,' 'Assisted,' 'Narrow Autonomy,' 'Governance,' 'Tools,' and 'Ownership,' showing connections and percentages for each section within a 'Home Lab Zone' and 'Enterprise Zone.'

A practical map from chatbots to digital colleagues

Every AI conference this year ends the same way. Someone walks off stage having promised you digital colleagues, autonomous operations, and a workforce that never sleeps. The audience files out buzzing. Then Monday arrives, and the to-do list still starts with “fix the CMDB.”

The hype is real. The map is missing.

This post is the map. It’s written for business leaders, IT decision-makers and pre-sales architects who are back from the conference circuit with a head full of agents and a backlog full of reality. I’m not going to predict the entire future of agentic AI. I’m going to draw a practical route from today’s chatbots to tomorrow’s digital colleagues — and be honest about what has to be true before you make that journey safely.


From clever autocomplete to colleagues that actually do things

Let’s get the terminology out of the way in plain language, because the distinction matters more than most people admit.

An LLM or chatbot is great at language. You ask, it responds. It drafts documents, summarises threads, generates emails. But once the text stops, so does the system. It is, if we’re honest, very clever autocomplete with no hands.

An agentic AI — or AI agent — is great at doing. You give it a goal, not just a prompt. It plans multi-step work. It calls tools and systems: APIs, RPA bots, workflows, orchestration engines. It observes outcomes, adapts its plan, and retries. It can operate with humans “on the loop” rather than in every loop.

That last point is precisely where the hype tends to skip ahead. An agent can only plan, act and adapt meaningfully if your data is accessible and trustworthy, your systems expose safe and well-defined actions, you have guardrails and governance that decide where autonomy is allowed, and someone actually owns the agent as a product rather than a side project.

If those conditions aren’t in place, your “agent” is just a more expensive chatbot wearing a new badge. So before you flip the agentic switch, what actually needs to be true?
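For the under-the-hood types, the plan, act, observe and retry loop described above can be sketched in a few lines. This is a simplified illustration, not how any particular framework does it: the plan is a fixed list of step names, the tools are plain functions that return success or failure, and names such as run_agent and flaky_lookup are hypothetical.

```python
from typing import Callable


def run_agent(
    goal: str,
    plan: Callable[[str], list[str]],
    tools: dict[str, Callable[[], bool]],
    max_retries: int = 2,
) -> list[tuple[str, str]]:
    """Run each planned step, retrying failed tool calls a bounded number of times.
    If a step still fails, stop and hand off to a human."""
    history: list[tuple[str, str]] = []
    for step in plan(goal):
        for _attempt in range(1 + max_retries):
            succeeded = tools[step]()  # act
            history.append((step, "ok" if succeeded else "failed"))  # observe
            if succeeded:
                break
        else:
            # Retries exhausted: escalate rather than press on blindly.
            history.append((step, "escalated to human"))
            break
    return history


calls = {"count": 0}


def flaky_lookup() -> bool:
    """Hypothetical tool that fails on its first call only."""
    calls["count"] += 1
    return calls["count"] > 1


history = run_agent(
    "resolve ticket",
    plan=lambda goal: ["lookup", "update"],
    tools={"lookup": flaky_lookup, "update": lambda: True},
)
print(history)
```

Notice that everything interesting sits outside the loop: the tools it can call and the point at which it gives up and asks a human.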


“Can’t I just install something and it works?”

Yes. In your home lab, absolutely — and there’s nothing wrong with that. Download an open-source agent framework, point it at a few APIs, chain some tasks together and you can have something impressive running by the weekend. The ecosystem is genuinely good and moving fast. More importantly, experimenting in a home lab is one of the best ways to develop real intuition: what agents can actually do, where they get confused, which tools behave well and which don’t, and what the supporting infrastructure starts to look like when things get serious.

Think of it as the design phase. Low stakes, high learning.

But your home lab has a very specific set of conditions that don’t exist in your business. It has one user who knows where all the data lives, trusts their own judgement on every call, and absorbs the full blast radius if something goes wrong. It has no compliance team, no change management process, no SLAs, and no regulator asking what happened at 2am on a Tuesday. And it has no customers.

The technology has never been the hard part. Dropping an agent into a business environment and having it behave well — reliably, at scale, with governance — is a different problem entirely. Not harder in a technical sense. Harder in the sense that it requires decisions, ownership and trust that no framework installs for you.

Good business is practical by design. That’s not a constraint on ambition. It’s the architecture that makes ambition safe to deploy. Every guardrail, every policy, every named owner exists because the blast radius of getting it wrong is real — and someone else’s problem to live with.

The home lab shows you what’s possible. The foundations in this post are how you make it production-ready.

Before we get to those foundations, there’s an even more basic stack that has to exist: a data platform with clean-enough pipelines, a working LLM capability, and an API and tool layer your systems expose to the outside world. Agentic AI sits on top of those three. If you’re missing any one of them, you don’t have a gap in agents — you have a gap in data, language or automation. That distinction matters, because it tells you where to invest first.


Four foundations before you go agentic

Think of these as readiness pillars built on top of three prerequisites: a data platform, a working LLM capability and an API and automation layer. The foundations below are how you turn that stack into something you’d trust with real work. You don’t need them perfect. You need them good enough for a real workflow.

1. Data the agent can actually use

Agents don’t run on slide decks. They run on the reality of your systems — and the gap between the two is where most early agent projects quietly fail.

For a given workflow, you need an authoritative system of record. You need reasonable data hygiene: IDs that line up across systems, critical fields mostly filled in, obvious duplicates under control. And you need retrieval paths an agent can actually call — APIs, search indices, vector stores — not just dashboards that a human reads.

Sniff test: Pick one workflow, say resolving a recurring support issue. Ask a human expert: “Using only the tools available today, can you find the relevant data and context in under five minutes?” If the answer is “not really”, you’ve found a data readiness gap for agents too.

Practical move: Start with one domain — tickets, orders, incidents, invoices — and make sure it has a clear system of record. Document how to query it programmatically. Fix the top three to five quality issues that keep surfacing in real conversations. You don’t need an enterprise-wide AI data lake to start. You need one solid patch of ground.
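A quick way to put numbers on "reasonable data hygiene" is a small report over an extract from the system of record. The sketch below assumes records arrive as plain dictionaries. The field names and sample tickets are invented for illustration.

```python
def hygiene_report(records: list[dict], id_field: str, required: list[str]) -> dict:
    """Count duplicate IDs and measure how often each required field is filled in."""
    ids = [record[id_field] for record in records]
    fill_rate = {
        name: sum(1 for r in records if r.get(name) not in (None, "")) / len(records)
        for name in required
    }
    return {"duplicate_ids": len(ids) - len(set(ids)), "fill_rate": fill_rate}


# Hypothetical ticket extract.
tickets = [
    {"id": "T1", "owner": "ana", "customer_id": "C9"},
    {"id": "T2", "owner": "", "customer_id": "C4"},
    {"id": "T2", "owner": "raj", "customer_id": "C4"},
    {"id": "T3", "owner": "li", "customer_id": None},
]
report = hygiene_report(tickets, "id", ["owner", "customer_id"])
print(report)
```

If the fill rate on a critical field is low, or duplicates keep appearing, that is one of the top quality issues to fix before an agent relies on the data.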


2. Tools, not just text

A lot of agentic demos quietly assume something you may not have: reliable, callable actions. The agent in the video creates a record, triggers a workflow, updates a case — and it looks seamless. What’s invisible is the API scaffolding underneath.

Agents become genuinely useful when they can create or update records, trigger automations, and kick off jobs in your ITSM, CRM, RPA or orchestration platforms. For that to work safely, your tools need clear contracts — input and output schemas, error codes and behaviours, latency expectations. They also need safety properties: idempotency where possible so retries don’t cause chaos, a bounded blast radius so a misfiring agent can only touch tickets in a certain state, and sensible timeouts and fallbacks.
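Here is a minimal sketch of what a well-behaved tool might look like, using an in-memory ticket store as a stand-in for a real ITSM API. The class names, states and result strings are assumptions for illustration. It shows two of the safety properties above: an idempotency key so retries return the original result, and a state check that bounds the blast radius.

```python
from dataclasses import dataclass, field


@dataclass
class Ticket:
    id: str
    state: str  # e.g. "open", "resolved", "closed"


@dataclass
class CloseTicketTool:
    tickets: dict[str, Ticket]
    # Bounded blast radius: the tool only touches tickets in these states.
    allowed_states: frozenset[str] = frozenset({"resolved"})
    _seen: dict[str, str] = field(default_factory=dict)

    def call(self, ticket_id: str, idempotency_key: str) -> str:
        # Idempotency: a retried request returns the first result unchanged.
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]
        ticket = self.tickets.get(ticket_id)
        if ticket is None:
            result = "error:not_found"
        elif ticket.state not in self.allowed_states:
            result = "refused:out_of_scope"
        else:
            ticket.state = "closed"
            result = "closed"
        self._seen[idempotency_key] = result
        return result


tool = CloseTicketTool({"T1": Ticket("T1", "resolved"), "T2": Ticket("T2", "open")})
first = tool.call("T1", idempotency_key="req-1")
retry = tool.call("T1", idempotency_key="req-1")  # agent retried after a timeout
blocked = tool.call("T2", idempotency_key="req-2")
print(first, retry, blocked)
```

Without the idempotency key, the retry would hit an already closed ticket and come back as a refusal, which is exactly the kind of confusing signal that sends an agent off in the wrong direction.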

Self-check: Take a target workflow — “close low-risk tickets automatically”, for example. List the steps a human takes today. Mark which already have an API or automation, which could be automated with tools you already own but haven’t yet exposed, and which are inherently human — judgement calls, negotiation, relationship management. If there’s nothing in the first two categories, you’re not ready for an agent. You’re ready to build some tools first.

Practical move: Pick one or two high-volume, low-risk actions and expose them as well-behaved tools. Add basic protection: permissions, rate limiting, logging. Treat them as building blocks for multiple future agents, not a single-point solution. Tools built once should serve many agents.


3. Guardrails, governance and observability

This is where the uncomfortable questions live. What if an agent misconfigures production? Who approved the change? How do you roll it back?

The answer can’t be “we’ll figure it out when it happens.” You need policy decisions made upfront: which actions can an agent perform autonomously, which require human approval, and which are strictly observe-only. Agents should have their own identities — service principals or technical users — not borrowed developer accounts. Access should match the agent’s actual mandate, not the path of least resistance.

Equally important is observability. You need logs that capture the goal the agent was working toward, the plan it generated, every tool call with inputs, outputs and timestamps, and the final outcome or handoff. Without that, you’re not managing an agent. You’re hoping.
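As a rough sketch, one structured log line per tool call is enough to answer most "what did it do?" questions later. The field names below are an assumption, not a standard, and the sample values are invented.

```python
import json
from datetime import datetime, timezone


def tool_call_record(goal: str, plan: list[str], tool: str,
                     inputs: dict, outputs: dict, outcome: str) -> str:
    """Build one JSON log line describing a single tool call made by an agent."""
    return json.dumps(
        {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "goal": goal,
            "plan": plan,
            "tool": tool,
            "inputs": inputs,
            "outputs": outputs,
            "outcome": outcome,
        },
        sort_keys=True,
    )


line = tool_call_record(
    goal="close low-risk tickets",
    plan=["find_resolved_tickets", "close_ticket"],
    tool="close_ticket",
    inputs={"ticket_id": "T1"},
    outputs={"status": "closed"},
    outcome="completed",
)
print(line)
```

Because each line is plain JSON, it can go to whatever log platform you already run and be searched by goal, tool or outcome.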

Self-check: “If an agent did something wrong yesterday, could we explain what it did, undo it, and prevent a repeat — without a digital forensics exercise?” If the honest answer is no, stay in assisted modes until observability and governance catch up.

Practical move: Start with human-in-the-loop patterns. The agent drafts actions; a human clicks Approve or Edit. Add a kill switch per agent, simple policies like “never act on Premium Tier customers without human sign-off”, and alerting on unusual error or action rates. Only then consider fully autonomous zones.
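The policy decisions in this section can be written down as data rather than left in people's heads. A minimal sketch, with hypothetical action names and a single customer-tier rule, might look like this:

```python
from enum import Enum


class Mode(Enum):
    AUTONOMOUS = "autonomous"
    NEEDS_APPROVAL = "needs_approval"
    OBSERVE_ONLY = "observe_only"


# Hypothetical policy table: which actions the agent may take on its own.
POLICY = {
    "add_ticket_note": Mode.AUTONOMOUS,
    "close_ticket": Mode.NEEDS_APPROVAL,
    "change_prod_config": Mode.OBSERVE_ONLY,
}


def decide(action: str, customer_tier: str, kill_switch_on: bool = False) -> str:
    """Return 'execute', 'ask_human' or 'blocked' for a proposed action."""
    if kill_switch_on:
        return "blocked"
    # Actions missing from the table default to the strictest mode.
    mode = POLICY.get(action, Mode.OBSERVE_ONLY)
    if mode is Mode.OBSERVE_ONLY:
        return "blocked"
    # Premium customers always get a human sign-off, whatever the action.
    if mode is Mode.NEEDS_APPROVAL or customer_tier == "premium":
        return "ask_human"
    return "execute"


print(decide("add_ticket_note", "standard"))
print(decide("add_ticket_note", "premium"))
print(decide("change_prod_config", "standard"))
```

The design choice worth copying is the default: an action nobody has classified is treated as observe-only, not as allowed.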


4. Org ownership and operating model

Even with perfect data, well-behaved tools and solid governance, you still need to answer one question nobody wants to get stuck on: who owns this thing?

Every non-trivial agent needs a product owner who defines its purpose and scope, maintains a backlog of improvements, and keeps business stakeholders aligned on value. It also needs an operations owner responsible for platform health, reliability, security and DR. And it needs feedback loops — channels for the frontline teams who use it to report what’s working, what’s misfiring, and what’s missing.

Treat meaningful agents as products, not toys. For each one, write down its mission in one sentence, define its scope by system and region, set clear success metrics — resolution rate, cycle time, human override rate — and document how new capabilities get promoted and who signs off.

Without this, you end up with shadow agents nobody wants to admit they depend on. That’s not digital transformation. That’s technical debt with a better name.


A staged roadmap: from copilots to digital colleagues

With the foundations clear, here’s a practical maturity path — one you can use as a roadmap or, if you need it, a slide.

Stage 0 — Experiments: chatbots and copilots

Objective: Learn what good looks like with minimal risk.

Deploy chat or document copilots on internal knowledge: FAQs, SOPs, policies, runbooks, playbooks. Let teams use them for drafting responses, answering “how do I…” questions, and summarising documents and threads. This isn’t about automation yet. It’s about building intuition.

Exit when you’ve identified three to five workflows where users say: “If this system could not only answer but also do things, it would be a big deal.” Those are your candidate agent use cases. Capture them.


Stage 1 — Workflow mapping and instrumentation

Objective: Understand the real process end to end before touching it.

Pick one or two candidate workflows with clear business value and contained risk. Run a value stream mapping session with the people who actually do the work: map the steps, systems, handoffs and wait times, and surface the common exceptions and failure modes. Instrument what you can — volumes, cycle times, where work gets stuck or loops.

Exit when you can draw the workflow on a single page and annotate which steps are data lookups, which are decisions with clear rules, and which are actions that could be automated. That diagram becomes the blueprint for your first agent.


Stage 2 — Assisted agents: recommend, don’t press the buttons

Objective: Put an agent in the loop safely.

Build an agent that reads the current context — a ticket, case, order or incident — calls read-only tools to gather information, and proposes a plan with recommended actions. Integrate it into the tools people already use: the service desk console, the CRM screen, the ops dashboard. Humans review the plan, approve or edit recommendations, and execute final actions via normal tooling.

Exit when a healthy proportion of recommendations are being accepted with minimal edits and are clearly improving speed or quality — and when you’ve surfaced and fixed the worst data issues and most brittle automations in the process.

At this stage, the agent is a smart junior colleague sitting beside an expert. Not an unsupervised intern loose in production.


Stage 3 — Narrow autonomy with strong guardrails

Objective: Let the agent own low-risk slices end to end.

Choose narrow scenarios where the rules are clear, the actions are reversible and the impact is measurable. Define hard policies: which customers, regions, ticket types or transaction sizes are in scope, the maximum number of actions per hour or day, and the conditions that always require human sign-off. Implement autonomous execution within those constraints, alongside kill switches and alerts on rising error or override rates.
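A cap on actions per hour is one of the simpler hard policies to implement. The sketch below uses a sliding window over timestamps. The class name and the numbers are illustrative, and time is passed in explicitly so the behaviour is easy to test.

```python
from collections import deque


class ActionBudget:
    """Allow at most max_actions within any window of window_seconds."""

    def __init__(self, max_actions: int, window_seconds: float) -> None:
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self._times: deque[float] = deque()

    def allow(self, now: float) -> bool:
        # Forget actions that have aged out of the window.
        while self._times and now - self._times[0] >= self.window_seconds:
            self._times.popleft()
        if len(self._times) >= self.max_actions:
            return False  # over budget: stop and alert a human
        self._times.append(now)
        return True


budget = ActionBudget(max_actions=2, window_seconds=3600)
decisions = [budget.allow(t) for t in (0, 10, 20, 3600)]
print(decisions)
```

In a real deployment the "over budget" branch is where the alert fires, because an agent hitting its ceiling is often the first visible sign that something upstream has changed.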

Exit when you can show clear business impact — 30% faster resolution for a class of issues, for example — with errors within agreed budgets, and when stakeholders trust the agent for that specific scope and are actively asking to expand it.

This is where “digital workforce” starts to mean something. Not because of a leap of faith — because the groundwork is there.


Stage 4 — A portfolio of agents, not a one-off hero

The destination isn’t one giant agent that does everything. It’s a portfolio of specialised agents — for support, infrastructure, finance, sales operations — built on a shared platform that handles identity and access, logging, monitoring, guardrails, policy and tool discovery consistently across all of them.

At this level, agent performance reviews become a normal part of the operating rhythm: what did each agent deliver this quarter, where did it fail, what new capabilities are now justified? Onboarding a new agent starts to look like product onboarding — a repeatable process with known steps — rather than a bespoke science project that requires a specialist to lead it every time.

This is the digital colleagues future people describe on stage. It’s reached via boring, disciplined steps. Not a leap of faith.


Start here Monday

If you’ve skimmed to the end, here’s the short version for your next leadership or architecture meeting.

You don’t have to fix everything everywhere. For a single outcome, you can build a thin vertical slice of the whole stack: one domain of data with clean-enough pipelines; LLM patterns that understand and plan for that domain; a handful of safe, well-wrapped APIs and tools; and an assisted agent workflow on top. Then go deep, not wide. Master the slice before you scale it.

Pick one real outcome — not “do something with agents”, but something measurable: reduce the L1 ticket backlog by 20%, cut quote cycle time by 30%, shorten onboarding by 25%.

Map the true workflow. Get the people who actually do the work into a room. Draw the steps, systems and handoffs. Capture volumes and pain points.

List systems and tools for that workflow. Where does the truth live? What APIs, RPA bots or workflows already exist? What’s missing?

Choose three to five candidate tools — actions an agent could safely call, with clear inputs and outputs, that are reversible or low risk, and high volume or high annoyance.

Decide guardrails upfront. What is read-only? What is recommend-only? What can be fully automated, and under precisely what conditions?

Assign owners. Name a business or product owner for the agent and a technical or platform owner. Both. Before you start.

Build an assisted agent first. Let it plan, gather data and recommend. Keep humans in the loop until the numbers — and your honest comfort level — say otherwise.

Then turn the lessons from that first agent into your second and third. Use what broke or surprised you to improve the platform, refine the guardrails, and standardise patterns and tooling. That’s how a portfolio grows.


Closing thought

Agentic AI is not magic. It’s automation with a better brain — plus all the unglamorous work of data, tools, governance and ownership that nobody puts on the keynote slide.

Invest in those foundations first and your digital colleagues won’t arrive as an uncontrolled experiment. They’ll arrive as well-behaved members of the team, delivering measurable value from day one.

And that’s the bit we don’t hear enough about on stage.

Generative AI in Business: From Hype to Everyday Use

A schematic diagram illustrating Generative AI architecture for enterprise deployment, showing flow from data infrastructure to natural language input and response, featuring components including data engineer, ML backbone, and distribution grid for business users.

We can talk to our data – and it can talk back.

I grew up on science fiction, and it’s still my favourite genre. As a kid I watched Captain Kirk speak to the Enterprise computer: he’d ask a question, the ship would analyse everything it knew, and then calmly talk back in plain language. Later it was HAL in 2001: A Space Odyssey – a computer you could converse with, for better or worse.

Here’s the thing: what was fiction is now reality.

For the first time, we can ask our business systems questions in natural language – about customers, operations, risks, performance – and get meaningful answers back, grounded in our own data. That’s the practical side of generative AI that matters: not party tricks in the browser, but the ability to talk to the business and have it talk back in ways that save time, reduce errors and unlock real ROI.


From finding the golden process to opening it up

In my earlier blogs, I’ve talked about machine learning as the discipline of finding that golden process – the one worth fixing, optimising, or automating. ML is brilliant when you can frame a problem as a narrow, well-defined prediction task: given this input, what is the most likely outcome? It forecasts demand, scores risk, flags likely defects, and fine-tunes routes and schedules.

But here’s what ML cannot do: it cannot explain itself in plain English. It cannot draft the shift handover report. It cannot answer the technician’s question about a procedure buried in a manual. It cannot have a conversation.

That’s where generative AI enters the refinery.

Think back to our crude oil analogy. If machine learning is the refining process – turning raw data into specific, purpose-built outputs – then generative AI is the distribution grid. It takes everything the refinery produces and makes it accessible to people who never needed to understand how the refinery works.

Data Engineers still run the pipelines. Data Scientists still interpret the outputs and build the models. But now, the sales manager, the warehouse supervisor, the contact centre agent – they can walk up to a terminal, ask a question in plain English, and get a meaningful answer grounded in real business data.

That’s the shift. ML found the golden process. GenAI opens it up to everyone.


What’s actually new

Most organisations have already met “classic” AI, even if they don’t call it that. Traditional machine learning classifies, predicts and optimises. It works brilliantly within defined boundaries.

Generative AI adds a different kind of capability on top of that backbone:

  • It can read and summarise large volumes of unstructured text – emails, PDFs, reports, notes.
  • It can draft content: responses, proposals, documentation, code.
  • It provides a conversational interface into your data and systems.
  • It can generate variations – alternative phrasings, translations, test cases, outreach angles.

If traditional ML optimises numbers and events, generative AI optimises words and workflows. That makes it especially powerful in the gaps between your systems: the email trail, the shared drive full of PDFs, the complaint notes, the procedures that nobody can ever quite find.

The value isn’t in talking to an LLM in the cloud for its own sake. It’s in embedding these capabilities directly into the work people are already doing.


Four use cases that are returning real ROI

The patterns that consistently deliver value share a few traits: they’re anchored in existing processes, they use data you already control, and they have clear owners and measurable outcomes. Here are four that I see working across industries right now.

1. Customer service copilots

Contact centres have long used AI for call routing and basic automation. Generative AI takes this further by acting as a copilot for human agents – surfacing customer history and relevant knowledge articles the moment a case comes in, proposing draft replies that agents can review and send, suggesting next best actions based on similar cases, and automatically updating tickets and notes from the conversation.

This doesn’t replace the agent. It removes the searching, copying and typing that sits around the real work.

In practice: A regional insurer deployed a generative copilot alongside its existing contact centre platform for 120 agents. Within three months, average email handle time fell by 23%, first contact resolution improved by nine percentage points, and new hire time to competency dropped from twelve weeks to eight. No new platform was purchased – they configured the generative features already included in the suite they owned.

2. Knowledge assistants for frontline staff

In most organisations, the people who most need answers – technicians, nurses, warehouse supervisors, branch staff – are the ones with the least time to hunt through policy documents and manuals. Generative AI can sit on top of your existing documentation and act as a conversational handbook.

A technician asks: “What’s the safety procedure for this task on model X?” and gets a precise answer with a link back to the source. A warehouse team member asks: “What do we do when a pallet arrives damaged?” and sees the current procedure with the right forms attached.

Under the hood, this uses retrieval-augmented generation (RAG) – pulling the most relevant passages from your own documents, not from the open internet, and handing them to the model alongside the question. The answers are grounded in your procedures, your standards, your data.
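
For the under-the-hood types, here is a minimal sketch of that retrieval step. It is an illustration under stated assumptions, not a production design: the chunks, source names and function names are all hypothetical, and plain word counts stand in for the learned embeddings and vector database a real deployment would use.

```python
import math
import re
from collections import Counter

# Hypothetical document chunks. In practice these come from your own manuals.
CHUNKS = [
    {"source": "warehouse-handbook.pdf#p12",
     "text": "When a pallet arrives damaged, photograph it and complete the damage report form."},
    {"source": "safety-manual.pdf#p4",
     "text": "Before servicing model X, isolate the power supply and wear insulated gloves."},
    {"source": "hr-policy.pdf#p2",
     "text": "Annual leave requests must be submitted two weeks in advance."},
]


def vectorise(text):
    """Stand-in for an embedding model: lower-cased word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, top_k=1):
    """Return the top_k chunks most similar to the question."""
    q = vectorise(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, vectorise(c["text"])), reverse=True)
    return ranked[:top_k]


def build_prompt(question, hits):
    """Assemble the text sent to the model, keeping each source visible."""
    context = "\n".join(f"[{h['source']}] {h['text']}" for h in hits)
    return (
        "Answer using only the context below and cite the source.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


if __name__ == "__main__":
    question = "What do we do when a pallet arrives damaged?"
    print(build_prompt(question, retrieve(question, CHUNKS)))
```

The point to notice is that the source reference travels with the text all the way into the prompt. That is what makes the link back to the source possible in the answer.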

3. Sales and account management copilots

Sales teams are already surrounded by data: CRM records, emails, meeting notes, product catalogues, pricing rules. Very little of it is easy to use in the moment.

Generative AI can act as a deal desk in your pocket – producing account summaries from CRM history, turning meeting transcripts into action-oriented follow-up emails, drafting proposal text by combining standard boilerplate with customer-specific details.

In practice: A UK-based B2B distributor introduced a generative assistant inside its CRM. Reps could click “summarise” on any opportunity and get a one-paragraph overview of needs, stakeholders and risks. First-draft proposals were generated from product data, standard terms and prior similar deals. After six months, average proposal turnaround time dropped from 5.2 days to 2.7 days, the number of opportunities receiving a formal proposal before closing increased by 18%, and win rates on AI-assisted proposals ran four to five points above the historic average – not because the AI was magical, but because more prospects received a timely, coherent response instead of a rushed one.

4. Paperwork killers in operations

Manufacturing, logistics and field-based work are full of semi-structured documents: inspection reports, delivery notes, shift handover emails, maintenance logs. Generative AI helps at both ends of this process.

On the input side, it reads and structures information – extracting key fields from free text, normalising inconsistent terminology, flagging anomalies. On the output side, it drafts standardised documents: turning technician notes into formal incident reports, auto-drafting non-conformance records, generating customer-ready summaries of delays and resolutions.

The ROI here is quiet but consistent: less administrative overhead per job or shift, fewer errors in documentation, faster handovers, and better auditability without anyone doing extra work.


This is not just for the Fortune 500

A few years ago, meaningful AI work did feel like the preserve of companies with research labs and large data science teams. Three things have changed that:

Infrastructure is now a service. You don’t need to run your own GPU cluster. Cloud platforms and partners provide the models and the plumbing; you focus on the use case and the data.

Your existing vendors are already there. CRM, ERP, contact centre, productivity suites – most now ship with generative features as standard or as add-ons. For many organisations, adopting generative AI means configuring what you already pay for, not buying an entirely new stack.

The skills barrier is lower than it looks. You still need expertise in data, integration and governance – which is exactly where Charlie earns his keep. But a significant proportion of high-value work now looks like defining good prompts and templates, curating the right documents, and designing sensible review workflows. The data foundation Clare relies on for her models is the same foundation that powers a generative assistant. The platform is the same; the interface is new.

What separates leaders from laggards is no longer who has the biggest lab. It’s who has their data in reasonable shape, clear processes they want to improve, and the discipline to run small, focused pilots and scale what works.


A straightforward way to get started

Start with a specific problem. Resist the temptation to open with “we need a GenAI strategy.” Start with: “Our agents spend 40% of their time on after-call notes.” Or: “Our engineers hate writing shift handover reports.” Make the problem specific, measurable and owned by a business leader.

Choose a constrained pilot. One team, one process, one geography. Eight to twelve weeks from start to learning. Clear metrics – time saved, error rates, throughput, whatever actually matters.

Use off-the-shelf building blocks first. In 2026, there is rarely a good reason for a typical organisation to start by training its own large language model. Turn on and configure the copilots your existing vendors provide. Use retrieval-augmented assistants over your own documents. Work with partners who understand both your industry and the tooling. Custom models come later, if the business case ever justifies them.

Wrap everything in governance. Decide what data can and cannot be used. Keep humans in the loop for decisions with legal, safety or reputational consequences. Make sure users can see where an answer came from. Monitor for errors and have a feedback mechanism. The goal is not to eliminate all risk; it’s to manage it the way you’d manage any powerful tool.
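
To make the human-in-the-loop rule concrete, here is a small sketch of a routing gate. Everything in it is an assumption for illustration: the topic labels, the Draft structure and the route function are hypothetical, and your own risk categories will differ.

```python
from dataclasses import dataclass, field

# Hypothetical risk labels, taken from the categories named above.
HIGH_RISK_TOPICS = {"legal", "safety", "reputational"}


@dataclass
class Draft:
    text: str
    topic: str
    sources: list = field(default_factory=list)  # where the answer came from


def route(draft):
    """Decide whether a generated draft needs a person to approve it."""
    if draft.topic in HIGH_RISK_TOPICS:
        return "human_review"  # consequential decisions always get a reviewer
    if not draft.sources:
        return "human_review"  # no traceable source, so no automatic send
    return "auto_send"


if __name__ == "__main__":
    print(route(Draft("Your refund is approved.", "billing", ["refund-policy.pdf"])))
    print(route(Draft("You may have a claim.", "legal", ["terms.pdf"])))
```

The design choice here is to fail towards review: anything high-risk or unsourced goes to a person by default.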


From backbone to interface

If machine learning is the backbone that finds and optimises the processes worth improving, generative AI is becoming the interface layer that lets people work with those processes more naturally.

Its most valuable contributions are not mysterious. They turn scattered information into usable knowledge. They turn blank pages into workable first drafts. They turn complex systems into conversational tools.

The organisations that will get the most from generative AI over the next few years are unlikely to be the ones with the flashiest demos. They will be the ones quietly embedding these capabilities into everyday processes – and letting the ROI accumulate in faster cycles, fewer errors and better experiences.

In that sense, generative AI is following the same path as every meaningful technology shift in business.

After the hype comes the hard work.

And that’s where it gets interesting.

Ethics in AI: Part 3

Infographic on Ethics in AI, illustrating different consent types (genuine, passive, implicit) and key processes including data collection, storage, processing, deployment, and retirement, with emphasis on privacy and data minimization.

Privacy and Consent

Think about the last time you clicked “I agree.”

Chances are you didn’t read what you were agreeing to. Neither did most people. And somewhere in that moment — buried in a wall of legal text nobody was realistically going to parse — a decision was made about your data. What would be collected. How it would be stored. What it would be used for. How long it would be kept. Who else might see it.

That’s passive consent. And it’s the foundation a significant proportion of AI training data is built on.

AI systems depend on personal data. That’s not a criticism — it’s a reality. The predictive power that makes modern AI useful is inseparable from the data that feeds it. But the collection, storage, and processing of personal data introduce privacy considerations that ethical design cannot treat as an afterthought. Because behind every data point is a person. And that person had a reasonable expectation about what would happen to it.

The refinery analogy applies here, but at a more fundamental level than processing. Before the question of how data is refined, there’s a more important question: did you have the right to extract it in the first place? Privacy and consent aren’t about the quality of the pipeline. They’re about the right to dig.


Consider what happened in 2015 when Google’s DeepMind division received 1.6 million patient records from the Royal Free London NHS Foundation Trust. The stated purpose was to develop Streams, a clinical app designed to detect acute kidney injury. The intent was genuinely beneficial. But the 1.6 million patients whose records were transferred were never informed. They didn’t consent. Many of them had no idea their data had changed hands at all. In 2017 the Information Commissioner’s Office ruled that the Trust had failed to comply with the Data Protection Act, not because the technology failed, but because the legitimacy of the data source was never established.

Good intentions are not a substitute for consent. The pipeline was clean. The extraction wasn’t.


Genuine consent requires clarity. The person providing data should understand what it will be used for, the scope of that use, and how long it will be retained. Not in principle — in practice. In language a reasonable person can understand, not language engineered to satisfy a legal requirement while obscuring the reality.

Passive or implicit consent — the pre-ticked box, the buried clause, the “by continuing to use this service” small print — undermines that legitimacy entirely. If someone wouldn’t consent knowing the full picture, then designing the consent mechanism to obscure the full picture isn’t a workaround. It’s a violation.

Data minimisation offers a practical discipline: collect only what is genuinely necessary for a defined objective. Not what might be useful one day. Not what’s technically available. What is actually required. The instinct in data-driven organisations is to collect everything and decide later what matters. Ethically, that instinct needs to be resisted.
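
In code, that discipline can be as simple as an allow-list per purpose. This is a sketch only: the purposes, field names and minimise function are hypothetical, chosen to show the shape of the idea.

```python
# Hypothetical purposes and the fields each one genuinely needs.
ALLOWED_FIELDS = {
    "delivery": {"name", "address", "postcode"},
    "billing": {"name", "email", "card_token"},
}


def minimise(record, purpose):
    """Keep only the fields approved for a defined purpose."""
    if purpose not in ALLOWED_FIELDS:
        # No defined objective means no collection at all.
        raise ValueError(f"No defined objective for purpose: {purpose!r}")
    allowed = ALLOWED_FIELDS[purpose]
    return {key: value for key, value in record.items() if key in allowed}


if __name__ == "__main__":
    submitted = {
        "name": "A. Customer",
        "address": "1 High Street",
        "postcode": "AB1 2CD",
        "date_of_birth": "1980-01-01",  # available, but not needed for delivery
    }
    print(minimise(submitted, "delivery"))
```

The refusal matters as much as the filter. A purpose nobody has defined gets an error, not a default of collecting everything.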


Technical safeguards exist and they matter. Encryption protects data at rest and in transit. Federated learning allows models to be trained across distributed data sources without centralising sensitive information. Differential privacy adds carefully calibrated noise to query results or to the training process, so that no individual’s presence in the data can be confidently inferred, while preserving statistical utility.
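
Of the three, differential privacy is the easiest to show in a few lines. The sketch below uses the Laplace mechanism on a simple counting query. The function names and the example data are my own assumptions, and a real deployment should use a vetted library instead of hand-rolled noise.

```python
import random


def laplace_noise(scale, rng):
    """Draw from Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of the records matching the predicate.

    Adding or removing one person changes a count by at most 1, so the
    sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    patients = [{"age": age} for age in range(100)]  # hypothetical data
    # Smaller epsilon means more noise and stronger privacy.
    print(private_count(patients, lambda p: p["age"] >= 65, epsilon=0.5))
```

The epsilon parameter is the trade-off made explicit: turn it down and individuals are better protected, but every answer gets less precise.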

But these tools come with trade-offs. Restricting information access frequently reduces predictive performance — a model trained on anonymised data may be less accurate than one trained on raw personal data. That gap is real. And navigating it honestly requires asking a genuine question about proportionality: is the predictive gain worth the privacy intrusion? Not as a rhetorical question. As a design decision, documented and accountable.


Privacy considerations don’t begin at model training and end at deployment. They run the full length of the workflow — from the moment data is collected, through every transformation and pipeline stage, through deployment, monitoring, and eventual retirement. Data retention policies and governance frameworks aren’t compliance bureaucracy. They’re the institutional memory that makes accountability possible.
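
A retention policy only counts if something enforces it. Here is a minimal sketch of such a check, with hypothetical categories, retention periods and record fields. Treating a record with no policy as overdue is my own design choice, so that unclassified data gets noticed.

```python
from datetime import date, timedelta

# Hypothetical retention periods per data category.
RETENTION = {
    "support_tickets": timedelta(days=365),
    "call_recordings": timedelta(days=90),
}


def due_for_deletion(records, today):
    """Return the ids of records held past their retention period.

    Records in a category with no policy are flagged as well.
    """
    overdue = []
    for record in records:
        limit = RETENTION.get(record["category"])
        if limit is None or today - record["collected_on"] > limit:
            overdue.append(record["id"])
    return overdue


if __name__ == "__main__":
    held = [
        {"id": 1, "category": "support_tickets", "collected_on": date(2024, 6, 1)},
        {"id": 2, "category": "call_recordings", "collected_on": date(2025, 12, 1)},
    ]
    print(due_for_deletion(held, today=date(2026, 1, 1)))
```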

And like fairness, privacy cannot be retrofitted. The decisions that matter most are made at the very beginning — what to collect, how to collect it, and whether you had the right to collect it at all.

Privacy, like fairness, has to be built in. Not bolted on.


Next: Part 4 — Transparency and Explainability.