Part 1: From Clusters to Predictions – How Clare Chose the Right Algorithm

As I’ve studied AI and Machine Learning, I’ve come to realise it’s called data science for a reason. And that reason is simple: it’s not easy. Finding the right signal in the noise takes a genuinely analytical approach, logical thinking, and a willingness to sit with uncertainty before reaching for an answer. There’s no shortcut through the discipline. I’ve learned that the hard way more than once, working on assignments late into the night. In this post, our data scientist hero Clare faces exactly that challenge. She has her clusters, she has her labels, and now she has to decide how supervised learning, and which algorithm, will predict wine quality with the fewest errors.
The Work Doesn’t End With the Clusters
When we last left Clare, she had done something quietly impressive. Working through the winery’s historical tasting data with an unsupervised learning approach, she had let the data organise itself. No labels, no guidance — just structure emerging from pattern. Three clusters had formed, and the winemaker had given them names that meant something: Premium, Standard, and Reject.
That was discovery. What Clare needs now is prediction.
The distinction matters more than it first appears. Unsupervised learning asks the data what it contains. Supervised learning asks a model to learn from what the data already knows, so it can make decisions about data it has never seen before. Clare’s clusters gave her the labels. Now those labels become the target. The question sitting on her desk is straightforward to state and genuinely hard to answer: given the measurable chemical properties of a new vintage, can a model predict its quality grade before it ever reaches the tasting panel?
That shift, from discovering structure to predicting outcomes, is the conceptual foundation of supervised learning. The model is given inputs and known outputs. It learns the mapping between them. Then it applies that mapping to inputs where the output is unknown. The discipline lies in how rigorously that mapping is learned and tested.
Clare opens her notebook. The data is ready. The labels exist. Now she has to make some decisions.
Reading the Target
The first decision in supervised learning is not which algorithm to use. It is what kind of thing you are trying to predict.
Clare’s target variable is wine quality grade: Premium, Standard, or Reject. That is not a number on a continuous scale. It is a category. And that single observation rules out an entire family of approaches before she has written a line of code.
Linear regression predicts continuous numerical outputs — price, temperature, yield, revenue. It is the right tool when the answer sits anywhere along a spectrum. Classification algorithms predict discrete categorical outputs. They assign an observation to one of a defined set of classes. Clare’s problem is a multi-class classification problem. Linear regression is off the table.
This is not a trivial distinction. Applying linear regression to a categorical target produces nonsense. Coding the grades as numbers, with Reject as one, Standard as two, and Premium as three, tells the model that the step from Reject to Standard is exactly as large as the step from Standard to Premium. That is a mathematical relationship the data does not actually contain. The model learns the wrong thing. Weak conceptual understanding at this stage leads to flawed modelling decisions later, and Clare knows it.
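A small sketch makes the problem concrete. The integer codes below are the hypothetical ones from the paragraph above, not anything Clare actually used; the arithmetic simply shows what a regression would be forced to assume about them.

```python
# Hypothetical integer coding from the text: Reject=1, Standard=2, Premium=3.
codes = {"Reject": 1, "Standard": 2, "Premium": 3}

# A regression treats the codes as quantities, so both steps get the same size.
gap_low = codes["Standard"] - codes["Reject"]     # Reject -> Standard
gap_high = codes["Premium"] - codes["Standard"]   # Standard -> Premium
print(gap_low == gap_high)  # True

# It also treats averages as meaningful: a Reject and a Premium "average" to Standard.
mean_code = (codes["Reject"] + codes["Premium"]) / 2
print(mean_code == codes["Standard"])  # True

# A continuous prediction such as 2.4 does not correspond to any grade.
prediction = 2.4
print(prediction in codes.values())  # False
```

None of those three implications is something the winemaker ever claimed about the grades, which is why a classifier is the better fit.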
She writes classification at the top of her notebook and moves on.
The Logic Map
Professional guidance frameworks for algorithm selection begin with a logic map. The questions are simple. The discipline is in answering them honestly.
Is the problem supervised or unsupervised? Clare’s problem is supervised. She has labelled data. The clusters gave her that.
Is the target continuous or categorical? Categorical. Three classes. Classification confirmed.
How many samples are available? Clare’s dataset contains 1,599 wines — modest by enterprise standards. That steers her toward algorithms that generalise well on smaller datasets rather than those demanding millions of observations.
Are the features interpretable? Clare will need to explain her model’s recommendations to the winemaker. If the model flags a batch as Reject, there will be a conversation. Interpretability is not a nice-to-have. It is a functional requirement.
Is computational cost a constraint? At 1,599 wines and eleven features, this is a modest workload — Clare’s Dell Precision 7 laptop handles the full pipeline without breaking a sweat. Compute is not the bottleneck here. That matters, because it means her selection criteria can focus entirely on what the model learns and how well it explains itself, rather than what the hardware can afford to run.
Each answer narrows the field. By the time Clare reaches the end of the logic map, she is not choosing between dozens of algorithms. She is choosing between a handful of serious candidates.
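The narrowing can be sketched as a small function. This is only an illustration of the reasoning above: the function name, the candidate list, and the 10,000-row cut-off for "modest" data are all assumptions made for the example, and the compute question is left out because it is not a constraint in Clare's case.

```python
def shortlist_algorithms(supervised, target_kind, n_samples, needs_interpretability):
    """Toy logic map. Thresholds and candidate names are illustrative assumptions."""
    if not supervised:
        return ["clustering (unsupervised)"]
    if target_kind == "continuous":
        return ["linear regression"]

    # Categorical target: start from a set of common classifiers.
    candidates = [
        "logistic regression",
        "decision tree",
        "random forest",
        "k-nearest neighbours",
        "deep neural network",
    ]
    # Modest data or a need for explanations both count against deep networks.
    if n_samples < 10_000 or needs_interpretability:
        candidates.remove("deep neural network")
    # K-nearest neighbours offers no explanation of its votes.
    if needs_interpretability:
        candidates.remove("k-nearest neighbours")
    return candidates


print(shortlist_algorithms(True, "categorical", 1599, True))
# ['logistic regression', 'decision tree', 'random forest']
```

The value of writing it down this way is that every exclusion has a stated reason that someone else can challenge.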
Shortlisting the Candidates
Clare’s shortlist takes shape around four candidates, each considered on its own terms.
Logistic regression is the natural starting point. It is interpretable, computationally inexpensive, and well understood. For a three-class problem it extends cleanly through a one-versus-rest approach. Its coefficients can be read directly — Clare can tell the winemaker that a unit increase in volatile acidity pushes a wine toward Reject with a quantifiable effect. The limitation is that logistic regression assumes a roughly linear relationship between features and the log-odds of class membership. If the true decision boundary is more complex, the model will underfit.
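Reading coefficients from a one-versus-rest model might look something like the sketch below. It assumes scikit-learn and uses synthetic data: the two features and the labelling rule are invented for the example and are not the winery's data. Because the features are standardised, each coefficient is the change in log-odds per one standard deviation of that feature.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300
# Synthetic stand-ins for two chemical features (not real measurements).
volatile_acidity = rng.normal(0.5, 0.15, n)
alcohol = rng.normal(10.5, 1.0, n)
X = np.column_stack([volatile_acidity, alcohol])

# Hypothetical labelling rule, invented so the example has structure to learn:
# more alcohol raises the score, more volatile acidity lowers it.
score = (alcohol - 10.5) - 5.0 * (volatile_acidity - 0.5) + rng.normal(0, 0.3, n)
y = np.where(score > 0.8, "Premium", np.where(score < -0.5, "Reject", "Standard"))

X_scaled = StandardScaler().fit_transform(X)
# One binary "this class versus the rest" model is fitted per class.
ovr = OneVsRestClassifier(LogisticRegression()).fit(X_scaled, y)

for label, estimator in zip(ovr.classes_, ovr.estimators_):
    acid_coef, alcohol_coef = estimator.coef_[0]
    print(f"{label:9s} volatile_acidity={acid_coef:+.2f} alcohol={alcohol_coef:+.2f}")
```

In this toy setup the Reject model's volatile acidity coefficient comes out positive, which is the kind of statement Clare could take to the winemaker.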
Decision trees offer a different kind of interpretability. The model produces a flowchart of decisions the winemaker can follow without any statistical training. They handle non-linear boundaries well and make no assumptions about feature distributions. The problem is instability — small changes in the training data can produce substantially different trees, and a single decision tree overfits easily.
Random forest addresses that instability by building many trees and aggregating their votes. It is more robust, typically more accurate, and handles real-world data noise gracefully. The trade-off is interpretability. The winemaker cannot follow the logic of a hundred trees simultaneously.
K-nearest neighbours classifies a new observation by finding the most similar training examples and taking a majority vote. It is intuitive but sensitive to irrelevant features and offers no explanatory power. Clare sets it aside.
Then there is the question that sits at the edge of her shortlist. Deep neural networks are capable of extraordinary things. But Clare’s dataset is modest, her compute is a single laptop, and the winemaker is waiting for an explanation. Neural networks are opaque by nature: the internal representations they learn do not translate into human-readable reasoning. Clare notes them for future reference and removes them from consideration.
She circles random forest as her primary candidate, with logistic regression as a baseline to measure against. The reasoning is sound. The decision is defensible. But before she commits, she needs to know whether her model actually works.
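A minimal sketch of that comparison, assuming scikit-learn, is shown here. The data is generated with `make_classification` to resemble the shape of Clare's problem (1,599 rows, eleven features, three imbalanced classes); it is not her dataset, so the scores it prints say nothing about her results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: same row count, feature count, and class shares as the text.
X, y = make_classification(
    n_samples=1599, n_features=11, n_informative=5, n_classes=3,
    weights=[0.465, 0.399, 0.136], random_state=42,
)

# Stratify so the held-out set keeps the same class proportions as the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Baseline: scaled logistic regression. Primary candidate: random forest.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
forest = RandomForestClassifier(n_estimators=100, random_state=42)

for name, model in [("logistic regression", baseline), ("random forest", forest)]:
    model.fit(X_train, y_train)
    print(name, round(model.score(X_test, y_test), 3))  # accuracy on unseen rows
```

Keeping the baseline in the same loop as the primary candidate means both models are always judged on exactly the same held-out rows.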
Did It Actually Work?
A model that produces predictions is not necessarily a model that produces good predictions. Clare runs both classifiers on a held-out test set — 320 wines the models have never seen during training — and looks at what comes back.
The first instinct is to reach for overall accuracy. Logistic regression scores 61%. Random forest scores 75%. Random forest wins — but Clare has learned not to stop there.
Her dataset is imbalanced. Of the 1,599 wines, 46.5% are Reject, 39.9% are Standard, and just 13.6% are Premium. A model that never predicted Premium at all could still reach an accuracy as high as 86.4%, the combined share of the other two classes, while being commercially useless. The warning is in the numbers before the model even runs.
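The arithmetic behind that warning is short enough to check directly. The class shares below are the ones quoted in the text; the two scenarios are simple thought experiments rather than models Clare trained.

```python
# Class shares quoted in the text.
shares = {"Reject": 0.465, "Standard": 0.399, "Premium": 0.136}

# Always predicting the majority class scores exactly that class's share.
majority_class = max(shares, key=shares.get)
majority_accuracy = shares[majority_class]

# A model that never predicts Premium can be right on, at most, the other two classes.
ceiling_without_premium = shares["Reject"] + shares["Standard"]

print(majority_class, f"{majority_accuracy:.1%}")   # Reject 46.5%
print(f"{ceiling_without_premium:.1%}")             # 86.4%
```

In both cases Premium recall is zero, which is exactly what a single accuracy figure hides.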
The confusion matrix gives her a more honest picture. It lays out, for each actual class, how many examples the model correctly identified and how many it misclassified — and crucially, what it misclassified them as. From it, two metrics earn Clare’s attention.
Precision asks: of all the wines the model predicted as Premium, how many actually were Premium? Recall asks: of all the wines that actually were Premium, how many did the model correctly identify?
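Both questions can be answered straight from a confusion matrix. The counts in this sketch are invented for illustration and are not Clare's results; rows are actual classes and columns are predicted classes.

```python
labels = ["Reject", "Standard", "Premium"]

# Hypothetical counts: confusion[actual][predicted], in the order of `labels`.
confusion = [
    [40, 9, 1],    # actual Reject
    [10, 25, 5],   # actual Standard
    [0, 4, 6],     # actual Premium
]


def precision_recall(matrix, index):
    """Return (precision, recall) for the class at position `index`."""
    true_positive = matrix[index][index]
    predicted_total = sum(row[index] for row in matrix)  # column sum: everything called this class
    actual_total = sum(matrix[index])                    # row sum: everything that really is this class
    return true_positive / predicted_total, true_positive / actual_total


precision, recall = precision_recall(confusion, labels.index("Premium"))
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
print(precision, recall)  # 0.5 0.6
```

Precision is read down a column and recall along a row, which is why the same matrix can make a model look careful and forgetful at the same time.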
The results are illuminating. Logistic regression finds only 26% of actual Premium wines, a recall of 0.26. Roughly three quarters of the winery’s best bottles are being miscalled as Standard. Random forest improves this substantially, reaching a Premium recall of 0.58. Still not perfect, but it is now finding more than half the genuine Premium wines rather than missing most of them.
Reject performance is strong across both models — F1 of 0.73 for logistic regression, 0.81 for random forest. The majority class, with the most training examples, is the easiest to learn.
Standard sits in the middle, chemically and statistically. It is the hardest class to call cleanly, and both models reflect that. The soft boundaries the unsupervised clustering revealed — a silhouette score of 0.1892, indicating genuine overlap rather than clean separation — are still present in the supervised results. The models did not invent that difficulty. It was always in the data.
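For a sense of what a low silhouette score looks like, the sketch below compares tight and heavily overlapping synthetic clusters. It assumes scikit-learn and uses k-means purely for illustration; the text does not say which clustering method produced Clare's 0.1892, and these blobs are not her data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

centers = [[0, 0], [3, 0], [0, 3]]
scores = {}

# Same three centres each time; only the spread around them changes.
for spread in (0.4, 2.0):
    X, _ = make_blobs(n_samples=600, centers=centers, cluster_std=spread, random_state=0)
    cluster_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    scores[spread] = silhouette_score(X, cluster_labels)
    print(f"spread={spread}: silhouette={scores[spread]:.3f}")
```

The wider the spread, the closer the score falls toward zero, and the more any classifier trained on those labels will struggle near the boundaries.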
The winemaker asks Clare a pointed question: which error costs more — calling a Premium wine Standard and pricing it down, or calling a Standard wine Premium and disappointing a customer? That question does not have a statistical answer. It has a business answer. And it is exactly where algorithm evaluation and operational reality meet.
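One way to carry the winemaker's question into the evaluation is to attach a cost to each kind of error. Every number in this sketch is hypothetical: the costs are arbitrary units that only the winery could set, and the counts are invented rather than taken from Clare's confusion matrix.

```python
# Hypothetical cost of each outcome, in arbitrary units: cost[actual][predicted].
cost = {
    "Premium": {"Premium": 0, "Standard": 12, "Reject": 30},
    "Standard": {"Premium": 20, "Standard": 0, "Reject": 8},
    "Reject": {"Premium": 40, "Standard": 15, "Reject": 0},
}

# Hypothetical prediction counts in the same layout: counts[actual][predicted].
counts = {
    "Premium": {"Premium": 6, "Standard": 4, "Reject": 0},
    "Standard": {"Premium": 5, "Standard": 25, "Reject": 10},
    "Reject": {"Premium": 1, "Standard": 9, "Reject": 40},
}


def total_cost(counts, cost):
    """Sum of (number of cases x cost per case) over every actual/predicted pair."""
    return sum(
        counts[actual][predicted] * cost[actual][predicted]
        for actual in cost
        for predicted in cost[actual]
    )


print(total_cost(counts, cost))  # 403
```

Two models with the same accuracy can produce very different totals here, which is the sense in which the question has a business answer rather than a statistical one.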
Clare documents both models. The random forest performs better across all three classes. But before she declares a winner, she runs one more check: which features the model is actually relying on.
What Clare Learned
Clare closes her Jupyter Notebook with something that took longer to arrive than she expected: not confidence in the model, but confidence in the process.
Algorithm selection is not a guess dressed up in technical language. It begins with understanding what the data contains and what the prediction task actually requires. It follows a logic that can be stated, examined, and defended. It weighs interpretability, computational cost, and domain requirements alongside statistical performance. And it ends not with a declaration that the model is correct, but with evidence — real numbers, per class, honestly read.
The feature importance analysis added one final thread. The random forest’s three most influential predictors were alcohol, sulphates, and volatile acidity, in that order. These were precisely the features that defined the unsupervised clusters in the previous analysis. The supervised model saw only the labels, never the clustering procedure that produced them, and it still ranked the same features highest. The data gave the same answer twice, through two different methods. That is not coincidence. That is signal.
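Extracting a ranking like that from a fitted random forest might look like the following sketch, assuming scikit-learn. The data is synthetic and the labelling rule is invented so that only three columns matter, which means the ranking here is built in by construction; the point is the mechanics, not the result.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
names = ["alcohol", "sulphates", "volatile_acidity", "pH", "chlorides"]
X = rng.normal(size=(800, len(names)))  # standardised synthetic features

# Invented rule: only the first three columns influence the grade.
score = 1.5 * X[:, 0] + 1.0 * X[:, 1] - 0.8 * X[:, 2] + rng.normal(0, 0.3, 800)
# Cut the score so the class shares roughly match those quoted in the text.
low, high = np.quantile(score, [0.465, 0.864])
y = np.where(score < low, "Reject", np.where(score < high, "Standard", "Premium"))

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# feature_importances_ holds impurity-based importances that sum to 1.
ranked = sorted(
    zip(names, forest.feature_importances_), key=lambda pair: pair[1], reverse=True
)
for name, importance in ranked:
    print(f"{name:17s} {importance:.3f}")
```

Impurity-based importances are a quick first look; they describe what the forest leaned on, not why the chemistry matters.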
Supervised learning is not a collection of tools. It is a way of thinking about prediction — rigorously, honestly, and always in service of a decision that someone, somewhere, actually needs to make.
The data told Clare what the wine was. The model learned to say it again, about wine it had never tasted.
In Part 2, Clare’s model leaves the winery and enters environments where the stakes are higher, the constraints are harder, and the bias-variance results raise a question she wasn’t expecting.