Sometimes the data tells you what your wallet couldn’t.
Last weekend I cooked steak. Proper steak — the kind that deserves a decent red wine alongside it. So I did what any self-respecting wine buyer does: I spent more than usual. Higher price, better wine. That’s how it works, right?
Wrong.
The wine was awful. Bitter, sharp, aggressive — more paint stripper than Pinot. The kind of wine that makes you wonder whether the person who priced it had ever actually tasted it. As I pushed the glass aside and reached for the water, a question formed: in the real world, how do we actually measure wine quality?
Not price. Clearly not price. So what then?
Meet Charlie and Clare
Regular readers will know Charlie and Clare. Charlie is our Data Engineer — he builds the pipelines, aggregates the sources, and delivers clean, structured data. Clare is our Data Analyst and Data Scientist — she works with what Charlie hands her and finds the patterns worth knowing about.
This week they’re working with wine.
Charlie has been busy. A winery client wants to understand quality across their production batches. The data lives in multiple places — laboratory analysis systems, fermentation monitoring sensors, batch production records. Individually, each source is a fragment. Together, they tell a story. Charlie’s data platform pulls from all of them, normalises the formats, handles the joins, and delivers Clare a clean pipeline: 1,599 red wine samples, each described by eleven physicochemical measurements.
Alcohol content. Fixed and volatile acidity. Citric acid. Sulphates. Residual sugar. Chlorides. Density. pH. Free and total sulphur dioxide.
Eleven numbers per wine.
And crucially — no quality labels.
Clare’s Problem
Clare looks at the dataset and faces a genuinely interesting challenge. There are no categories here, no pre-assigned groups, no right answers to train against. Just eleven continuous measurements and 1,599 rows of chemistry.
This is the domain of unsupervised learning — the branch of machine learning that finds structure in data without being told what to look for. Where supervised learning optimises toward a target, unsupervised learning asks a different question entirely: what patterns exist that we haven’t defined yet?
Clare’s task is to let the data organise itself, then ask whether the organisation means anything.
Before she touches a model, she does something essential. She scales the data.
This matters more than it sounds. The eleven features live on very different scales — alcohol ranges from roughly 8 to 15, total sulphur dioxide from 6 to 289. Feed raw numbers into a distance-based algorithm and the large-scale variables will dominate purely through magnitude, drowning out the signal from smaller-range features. StandardScaler transforms everything to zero mean and unit variance — now every feature competes on equal terms.
Charlie’s pipeline has already handled the missing values and format inconsistencies. Clare inherits clean data. That’s not an accident — it’s the platform working as intended.
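To make the scaling step concrete, here is a minimal sketch using scikit-learn’s StandardScaler. The two columns are synthetic stand-ins generated on the spot (they borrow the alcohol and total sulphur dioxide ranges mentioned above, nothing more), so treat it as an illustration rather than Clare’s actual code.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two of the eleven features (NOT the winery's data):
# alcohol on a narrow scale, total sulphur dioxide on a much wider one.
rng = np.random.default_rng(0)
alcohol = rng.uniform(8.0, 15.0, size=200)
total_so2 = rng.uniform(6.0, 289.0, size=200)
X = np.column_stack([alcohol, total_so2])

# Before scaling, the wide-range feature dominates Euclidean distance.
raw_spread = X.std(axis=0)

# StandardScaler subtracts each column's mean and divides by its
# standard deviation, so every feature ends up on the same footing.
X_scaled = StandardScaler().fit_transform(X)
scaled_mean = X_scaled.mean(axis=0)
scaled_spread = X_scaled.std(axis=0)

print("raw std per feature:    ", raw_spread.round(2))
print("scaled mean per feature:", scaled_mean.round(6))
print("scaled std per feature: ", scaled_spread.round(6))
```

After scaling, a one-unit difference means "one standard deviation" for every feature, which is what lets a distance-based algorithm weigh them equally.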
Eleven Dimensions Are Too Many to See
Before clustering, Clare reduces complexity using Principal Component Analysis — PCA.
Think of PCA as finding the angles from which the data is most spread out. Eleven features create eleven dimensions, which is impossible to visualise and cognitively overwhelming to reason about. PCA finds new axes — principal components — that capture the maximum variance in the fewest dimensions.
The results are telling. Nine components are needed to explain 95% of the variance. No single axis dominates. The data genuinely is high-dimensional — there’s no shortcut that captures most of the story. The first two components together explain just 45.7% of variance.
That’s an important caveat Clare keeps front of mind: when she later visualises clusters on a two-dimensional PCA plot, she’s seeing less than half the structure. The scatter plot is illustrative, not definitive.
PC1 is driven primarily by acidity-related features — it broadly separates sharper, more acidic wines from rounder ones. PC2 captures alcohol and fermentation character — higher alcohol and sulphate concentrations, reflecting more complete fermentation and stronger microbial stability. Even this compressed view starts to suggest that wine chemistry has meaningful directions of variation.
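The "how many components for 95%" question can be sketched in a few lines. The matrix below is synthetic (random data with some built-in correlation, shaped 1,599 by 11 to mirror the dataset), so the printed figures will not reproduce the 45.7% or the nine components reported above. Only the mechanics carry over.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 1,599 x 11 feature matrix (NOT the real data),
# with correlated columns so PCA has some structure to find.
rng = np.random.default_rng(0)
latent = rng.normal(size=(1599, 4))
mixing = rng.normal(size=(4, 11))
X = latent @ mixing + rng.normal(scale=0.8, size=(1599, 11))

X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)  # keep all eleven components

# Cumulative share of variance explained by the first 1, 2, ... components.
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches 95%.
n_for_95 = int(np.searchsorted(cumulative, 0.95) + 1)

print("first two components explain:", round(float(cumulative[1]), 3))
print("components needed for 95%:", n_for_95)
```

The second cumulative value is the number to keep in mind when reading any two-dimensional PCA scatter plot: it is the share of the structure the plot can actually show.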
How k-means Works — The Wine Tasting Table
Imagine Clare pours all 1,599 wine samples into glasses and lines them up on a long tasting table. She doesn’t know how many groups there are yet, but she suspects wines with similar chemistry will naturally belong together.
She picks three glasses to act as reference points, her starting cluster centres (so k is three), and assigns every other wine to whichever reference glass it’s closest to, chemically speaking. Then she looks at each group, works out the average chemistry of all its members, and moves her reference point to that average. The new reference point is usually not a real glass at all, more an imaginary blend that represents the whole group. Wines get reassigned. Reference points shift again. The process repeats until nothing moves anymore: the groups have stabilised around their natural centres.
That’s k-means. Not magic, not mystery. An algorithm that keeps nudging reference glasses along the table until the groupings settle. The “k” is simply the number of reference points — Clare’s job is to choose it wisely, which is where the elbow method comes in.
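For readers who like to see the loop itself, here is a bare-bones sketch of the assign-then-re-average cycle (often called Lloyd’s algorithm) in plain NumPy. The function names `assign` and `kmeans` and the toy two-dimensional data are made up for illustration. In practice Clare would more likely reach for a library implementation.

```python
import numpy as np

def assign(X, centres):
    """Label each point with the index of its nearest centre."""
    dists = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: assign to nearest centre, then re-average."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows chosen at random (the "reference glasses").
    centres = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign(X, centres)
        # Move each centre to the mean of its members; leave it if empty.
        new_centres = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new_centres, centres):
            break  # nothing moved, so the groups have stabilised
        centres = new_centres
    return assign(X, centres), centres

# Synthetic stand-in: three blobs in two dimensions (NOT wine data).
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])
labels, centres = kmeans(X, k=3)
print("cluster sizes:", np.bincount(labels, minlength=3))
```

Because the starting centres are random, different seeds can settle on different groupings. That is why library versions typically run several initialisations and keep the best one.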
Letting the Data Find Its Own Groups
Clare runs k-means clustering, an algorithm that partitions observations into k groups by minimising the total squared distance between each point and its cluster centre.
The question is: what should k be?
She uses two methods in parallel. The elbow method plots inertia (the sum of squared distances from each point to its assigned cluster centre) against increasing values of k. As k grows, inertia falls; the question is where the rate of improvement flattens. The silhouette coefficient measures how well each point sits within its assigned cluster compared with the nearest neighbouring cluster. It ranges from -1 to 1, and higher is better.
Both methods point toward k=3 as a defensible choice. Three groups. Clare fits the final model.
The clusters contain 722, 502, and 375 wines respectively.
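A sketch of how the two diagnostics might be computed side by side with scikit-learn. The data is a synthetic stand-in with three deliberately separated blobs, and the range of k values and the `random_state` are arbitrary choices for the example, not settings taken from Clare’s analysis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled feature matrix (NOT the real data).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=c, scale=1.0, size=(200, 11))
    for c in (-2.0, 0.0, 2.0)
])
X_scaled = StandardScaler().fit_transform(X)

inertias, silhouettes = {}, {}
for k in range(2, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    inertias[k] = model.inertia_  # within-cluster sum of squared distances
    silhouettes[k] = silhouette_score(X_scaled, model.labels_)

for k in inertias:
    print(k, round(inertias[k], 1), round(silhouettes[k], 3))

# The silhouette gives a single best k; the elbow still needs a human eye.
best_k = max(silhouettes, key=silhouettes.get)
print("highest silhouette at k =", best_k)
```

On real, overlapping data the two methods rarely agree perfectly, which is why "defensible" is the right word for the final choice.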
What the Clusters Actually Found
Now Clare looks at the chemistry of each group.
Cluster 1 — 502 wines — stands out immediately. Highest alcohol (10.72%), lowest volatile acidity (0.41 g/dm³), highest sulphates (0.75 g/dm³). These are markers experienced winemakers recognise: lower volatile acidity means less acetic acid, less of that sharp, vinegary edge. Higher sulphates support microbial stability and structure.
Cluster 0, with 722 wines, shows the inverse pattern. Highest volatile acidity (0.61 g/dm³), lowest sulphates (0.61 g/dm³). More of that aggressive sharpness from the weekend’s disappointing bottle.
Cluster 2 — 375 wines — is characterised by elevated sulphur dioxide levels and the lowest alcohol of the three groups (9.88%), suggesting less complete fermentation.
Three chemical profiles. Found without a single quality label in sight.
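One practical wrinkle: the model is fitted on scaled values, but a profile is only readable in the original units. A minimal sketch of mapping the cluster centres back, assuming a fitted StandardScaler is still to hand. The three columns and their distributions are synthetic placeholders, so the printed profiles will not match the figures above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with three of the eleven features (NOT the real data).
# Column names and distribution parameters are illustrative assumptions.
feature_names = ["alcohol", "volatile_acidity", "sulphates"]
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(10.4, 1.0, 600),
    rng.normal(0.53, 0.18, 600),
    rng.normal(0.66, 0.17, 600),
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_scaled)

# Centres live in scaled space; map them back to the original units.
centres_original = scaler.inverse_transform(model.cluster_centers_)
sizes = np.bincount(model.labels_, minlength=3)

for j, (centre, size) in enumerate(zip(centres_original, sizes)):
    summary = ", ".join(f"{n}={v:.2f}" for n, v in zip(feature_names, centre))
    print(f"cluster {j} ({size} wines): {summary}")
```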
The Reveal
Now Clare looks at quality.
She takes the quality scores — held back throughout the entire analysis — and calculates the mean score per cluster. This is post-hoc interpretation only. The clustering didn’t know about quality. But the results are striking.
| Cluster | Profile | Mean Quality |
|---|---|---|
| 1 | High alcohol, low volatile acidity, high sulphates | 5.96 |
| 0 | High volatile acidity, lower sulphates | 5.55 |
| 2 | High sulphur dioxide, lowest alcohol | 5.36 |
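The algorithm, working only from chemistry, has separated wines in a way that aligns with human quality judgement. The group with the most favourable chemical profile scores highest. The group with the highest volatile acidity scores lower, and the low-alcohol, high-sulphur-dioxide group scores lowest. The gaps are modest, about 0.6 of a point between the top and bottom clusters, but they run in the direction wine chemistry would lead you to expect.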
Clare didn’t tell the model what quality meant. The chemistry already knew.
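The discipline that makes the reveal honest is keeping the quality column out of the model until after it has been fitted. A sketch of that pattern with pandas, again on synthetic data: the quality scores here are random integers, so no relationship with the clusters should be expected in the output.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (NOT the real data): two features plus a quality score.
# The 3-to-8 quality range is an assumption made for this example.
rng = np.random.default_rng(0)
wines = pd.DataFrame({
    "alcohol": rng.normal(10.4, 1.0, 600),
    "volatile_acidity": rng.normal(0.53, 0.18, 600),
    "quality": rng.integers(3, 9, 600),
})

# Hold quality back: the model is fitted on the chemistry columns only.
features = wines.drop(columns="quality")
X_scaled = StandardScaler().fit_transform(features)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_scaled)

# Only after fitting is quality brought back in, purely for interpretation.
quality_by_cluster = (
    wines["quality"].groupby(labels).agg(["mean", "count"]).round(2)
)
print(quality_by_cluster)
```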
An Honest Number
The silhouette score is 0.19.
By textbook standards that’s weak. Some analysts would look at that number and worry. Clare doesn’t, and it’s worth understanding why.
Wine chemistry is continuous. There are no hard walls between a quality-6 wine and a quality-7 wine — no moment where one chemical compound crosses a threshold and suddenly the wine is better. The boundaries between clusters are gradual, overlapping, real-world messy. A low silhouette score in this context isn’t a sign that the analysis failed. It’s information about the nature of the data itself.
The clusters are soft. The patterns are genuine. These two things are not contradictory.
This matters for how Clare reports her findings. She isn’t presenting three neat buckets — “good wine, average wine, poor wine.” She’s presenting three chemical tendencies, with meaningful separations on the features that wine science already tells us matter.
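A single silhouette score is an average, and the per-wine values behind it say more about how soft the boundaries are. A small sketch using scikit-learn’s `silhouette_samples`, run on a deliberately structureless synthetic cloud rather than the wine data, so the numbers it prints are not Clare’s.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic stand-in: one continuous cloud with no hard boundaries.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 11))

labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

per_wine = silhouette_samples(X, labels)  # one value per sample, in [-1, 1]
overall = silhouette_score(X, labels)     # the mean of those values

# A negative value means a sample is, on average, closer to the members
# of a neighbouring cluster than to the members of its own.
share_negative = float((per_wine < 0).mean())

print("overall silhouette:", round(float(overall), 3))
print("share of samples with negative silhouette:", round(share_negative, 3))
```

Looking at the spread of per-sample values is one way to report "three tendencies" rather than "three buckets".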
Why Charlie’s Platform Made This Possible
It’s worth pausing on something easy to take for granted.
Clare’s analysis worked because she had complete, clean, comparable data. Every one of the 1,599 samples described by the same eleven features, scaled and pipeline-ready.
In the real world, that’s rarely the starting point. Laboratory analysis lives in one system. Sensor data from fermentation monitoring lives in another. Batch and production records in a third. Pricing and commercial data somewhere else entirely. Each system uses different formats, different naming conventions, different update frequencies.
Without Charlie’s data platform aggregating those sources into a coherent, governed pipeline, Clare isn’t doing unsupervised learning on 1,599 wines. She’s manually reconciling spreadsheets and hoping nothing got lost in the joins.
The insight, that chemical profile alone lines up with perceived quality without any price data in the mix, is only discoverable when the data foundation exists to support the question. Structure has to be built in, not bolted on.
What Clare Does Next
Unsupervised learning is Clare’s first move with an unfamiliar dataset. It reveals what’s there before asking what predicts what.
The natural next step is supervised learning. Now that three chemical profiles have been identified, Clare can use cluster membership to inform stratified sampling — ensuring any training dataset for a quality prediction model includes representative coverage of all three groups rather than accidentally over-representing one chemical type.
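One way the stratified split could look, using scikit-learn’s `train_test_split` with its `stratify` argument. The features and quality scores are synthetic placeholders, and the 80/20 split is an arbitrary choice. Only the cluster sizes (722, 502, 375) come from the analysis above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: random features and quality scores (NOT the real data),
# with cluster labels matching the group sizes reported earlier.
rng = np.random.default_rng(0)
cluster = np.repeat([0, 1, 2], [722, 502, 375])
X = rng.normal(size=(cluster.size, 11))
quality = rng.integers(3, 9, size=cluster.size)

# stratify=cluster keeps each chemical profile's share roughly equal
# in the training and test sets.
X_train, X_test, y_train, y_test, c_train, c_test = train_test_split(
    X, quality, cluster, test_size=0.2, stratify=cluster, random_state=42
)

train_shares = np.bincount(c_train) / c_train.size
test_shares = np.bincount(c_test) / c_test.size
print("train shares:", train_shares.round(3))
print("test shares: ", test_shares.round(3))
```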
She could also bring price into the analysis. If Charlie’s platform connects to commercial data, Clare can ask the question that started this whole investigation: does chemical profile correlate with price? Is the expensive-but-terrible wine an outlier, or is the price-quality assumption systematically weak across the portfolio?
That’s a question worth answering before anyone’s next steak dinner.
The Takeaway
Eleven numbers. No labels. Three meaningful groups.
Clare found chemical structure that aligns with human quality judgement — not because she told the algorithm what quality meant, but because the chemistry already encoded it. Unsupervised learning didn’t give her answers. It gave her the right questions.
And behind all of it, doing the unglamorous work that makes the glamorous work possible, was Charlie’s data platform.
Quality has to be found in the data. But first, the data has to be there to find it.
Next in the series: Clare takes the cluster profiles into supervised learning — and finds out whether chemistry can predict quality well enough to save the rest of us from expensive mistakes.
