[Figure: flowchart of the agent feedback loop in reinforcement learning. Steps: Observe State, Choose Action, Receive Reward, with an environment section annotated 'Try -> Fail -> Adjust'.]

Game Over

Miss Smith taught me to predict the future with a straight line.

A shoebox of CDs taught me to find patterns without labels.

Reinforcement learning came from somewhere else entirely.

It came from losing.

The game was Doom. Dark corridors, demons around every corner, no manual, no walkthrough, no one to copy. Just a Marine on a screen, a keyboard under my fingers, and the immediate brutal feedback of getting it wrong.

I died constantly. But somewhere in that cycle of failure, something started to form. Move left here. Don’t open that door without backing up first. The shotgun beats the pistol in a corridor. The game never explained any of it. It just reacted — survival or slaughter — and I adjusted.

Try something. Die. Try something slightly different. Die slightly later. Somewhere in that loop, a strategy emerged.

That feedback loop, as it turns out, is one of the most powerful ideas in machine learning.


The Third Way of Learning

In Part 2, supervised learning gave the machine a cheat sheet — labelled examples with known answers to learn from. In Part 3, unsupervised learning pulled the cheat sheet away and let structure emerge from the data itself.

Reinforcement learning is different from both. There are no labelled examples to study and no static dataset to mine. Instead, there is an agent, an environment, and a simple but profound arrangement: act, observe the consequences, and gradually learn a strategy that leads to better outcomes over time.

Not “get this prediction right.” Not “find what naturally groups together.” But — do well in the long run.

That shift from a single correct answer to long-term cumulative reward is what makes reinforcement learning feel unlike anything else in the machine learning story.


The Loop

The mechanics are simple enough to hold in your head.

An agent observes the current state of its environment. It chooses an action. The environment responds — a reward, a penalty, or silence. The agent updates its understanding of which actions tend to lead where, and chooses again.

Run that loop millions of times and something remarkable happens. What started as near-random behaviour gradually sharpens into strategy. Not because anyone defined the right moves in advance, but because the feedback itself did the teaching.
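To make the loop concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: a made-up corridor of six cells, a reward of 1 only at the far end, and tabular Q-learning as the update rule (one well-known choice among many, not the method of any particular system). The names `step`, `train` and `q_table` are hypothetical.

```python
import random

N_CELLS = 6          # assumed toy world: cells 0..5, goal at cell 5
MOVES = (-1, +1)     # action 0 = step left, action 1 = step right

def step(state, action):
    """The environment: apply the move, return (next_state, reward, done)."""
    next_state = min(max(state + MOVES[action], 0), N_CELLS - 1)
    done = next_state == N_CELLS - 1
    return next_state, (1.0 if done else 0.0), done

def train(episodes=200, alpha=0.1, gamma=0.9, seed=0):
    rng = random.Random(seed)
    # q[state][action]: current estimate of how good each action is.
    q = [[0.0, 0.0] for _ in range(N_CELLS)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Choose: take the better-looking action, guess when they tie.
            left, right = q[state]
            action = rng.randrange(2) if left == right else int(right > left)
            # Act and observe the consequence.
            next_state, reward, done = step(state, action)
            # Update: nudge the estimate towards reward + discounted future value.
            future = 0.0 if done else gamma * max(q[next_state])
            q[state][action] += alpha * (reward + future - q[state][action])
            state = next_state
    return q

q_table = train()
policy = ["right" if right > left else "left" for left, right in q_table[:-1]]
print(policy)  # ['right', 'right', 'right', 'right', 'right']
```

The first episode is a blind stumble down the corridor. Once the far end has been reached a few times, the table tips towards "right" in every cell. This toy gets away with guessing only when its estimates tie; harder problems need more deliberate exploration.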

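This is, in large part, how DeepMind’s AlphaGo learned to defeat the world’s best human players, sometimes with moves that professionals had never seriously considered. The original system first studied records of human games, then sharpened itself by playing millions of games against itself. Its successor, AlphaGo Zero, dropped the human games and learned from self-play alone, given nothing but the rules. The data didn’t come from a spreadsheet. It came from experience.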


Rewards, Not Labels

In supervised learning the feedback is immediate and precise. You predicted £350,000. The true answer was £320,000. Here is exactly how wrong you were.

In reinforcement learning it rarely works like that.

Think of teaching a dog a new trick. You don’t label each tiny movement as right or wrong. You reward certain sequences of behaviour — sit, stay, come — and ignore or discourage others. Over time the dog figures out which actions tend to lead to treats, even though nobody narrated the journey step by step.

Reinforcement learning agents learn in the same way. They explore — often taking random actions just to see what happens. When they stumble onto something that works, the algorithm quietly reinforces the decisions that led there. When they hit bad outcomes, it weakens them.
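A small sketch of that explore-and-reinforce habit, under loudly stated assumptions: three imaginary slot machines with invented payout rates, and an "epsilon-greedy" rule that tries a random machine 10% of the time and otherwise plays the one that currently looks best. The function `run_bandit` and all of its parameters are hypothetical, chosen only to illustrate the idea.

```python
import random

def run_bandit(payout_probs, pulls=5000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    values = [0.0] * len(payout_probs)   # estimated reward per arm
    counts = [0] * len(payout_probs)     # how often each arm was tried
    for _ in range(pulls):
        if rng.random() < epsilon:
            # Explore: try an arm at random, just to see what happens.
            arm = rng.randrange(len(values))
        else:
            # Exploit: play the arm with the highest estimate so far.
            arm = max(range(len(values)), key=values.__getitem__)
        reward = 1.0 if rng.random() < payout_probs[arm] else 0.0
        counts[arm] += 1
        # Running average: a win pulls this arm's estimate up, a miss pulls it down.
        values[arm] += (reward - values[arm]) / counts[arm]
    return values, counts

# Invented payout rates; the agent never sees these numbers directly.
values, counts = run_bandit([0.2, 0.5, 0.8])
best_arm = max(range(len(values)), key=values.__getitem__)
print(best_arm, counts)
```

The agent is never told which machine is best. It only sees wins and misses, and most of its pulls should drift towards the machine that pays out most often.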

The hard part is that rewards can be a long time coming. In a game, the winning move might trace back to a decision made hundreds of steps earlier. In a supply chain, a choice that looks costly today might pay off weeks later. Working out which actions deserve the credit, or the blame, across a long sequence of decisions is known as the credit assignment problem. It is one of reinforcement learning’s central challenges, and it is still far from solved.
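One standard first step towards sharing out that credit is the discounted return: each step is scored by the reward it receives plus a shrinking share of everything that follows. The sketch below assumes a five-step episode where only the last step pays, and a discount factor of 0.9. Both numbers are invented for illustration, and `discounted_returns` is a hypothetical helper.

```python
def discounted_returns(rewards, gamma=0.9):
    """For each step t, compute r_t + gamma * r_(t+1) + gamma^2 * r_(t+2) + ..."""
    returns = []
    running = 0.0
    # Walk backwards so each step can reuse the total already built up behind it.
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

# Only the final step is rewarded, yet every earlier step gets some credit.
rewards = [0.0, 0.0, 0.0, 0.0, 1.0]
print([round(g, 3) for g in discounted_returns(rewards)])
# [0.656, 0.729, 0.81, 0.9, 1.0]
```

The decision made four steps before the payoff still earns about two thirds of the credit. Discounting is a blunt instrument, though: it rewards being close in time to the payoff, not being responsible for it, which is part of why the problem remains hard.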


Closer to Everyday Life Than You Think

The famous examples tend to involve games and robots: AlphaGo, robotic arms, drones balancing in turbulence. These make good headlines. But the same kind of problem, and sometimes reinforcement learning itself, is also quietly at work in less dramatic places.

The route your sat-nav recalculates in real time. The dynamic pricing that adjusts what you’re shown based on demand and your behaviour. The recommendation engine that doesn’t just respond to what you clicked last, but learns how to keep you engaged across an entire session. Not all of these run on reinforcement learning under the hood (route planning, for one, usually relies on classic shortest-path search), but in each case a sequence of decisions is being optimised for a long-term outcome: not a single prediction, but a strategy playing out over time.

At enterprise scale, the same principles are starting to touch logistics, energy management, and operational scheduling — systems where the cost of a bad sequence of decisions is measured not in lost points but in real money and real consequences. (The data infrastructure that makes that possible is something we explore in our [Data and AI series].)


When the Game Goes Wrong

When I was losing at video games, the stakes were low. The worst outcome was a game over screen and a bruised ego. I could experiment freely because nothing real was at risk.

Reinforcement learning systems don’t always have that luxury — and when the reward is defined badly, the results can be deeply strange.

Agents are remarkably good at finding ways to maximise whatever score they’ve been given, including ways nobody anticipated and nobody wanted. A robot that learns to exploit a bug in the simulation. An ad system that maximises clicks by surfacing content that is technically engaging but clearly not what anyone intended. The agent isn’t being clever or malicious. It’s doing exactly what it was asked. The problem is that what it was asked and what was actually wanted weren’t quite the same thing.
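The gap between what was asked and what was wanted can be shown without any learning at all. In this toy sketch, the catalogue, the click rates and the satisfaction scores are all invented, and scoring "what was wanted" as click rate times satisfaction is just one illustrative assumption. The hypothetical `best_item` simply picks whatever maximises the reward it is handed.

```python
# Hypothetical catalogue: (title, click_rate, satisfaction after reading)
ITEMS = [
    ("in-depth guide", 0.10, 0.9),
    ("useful summary", 0.20, 0.7),
    ("outrage headline", 0.45, 0.1),
]

def best_item(items, reward):
    """Return the title of the item that scores highest under `reward`."""
    return max(items, key=reward)[0]

# What was asked for: maximise clicks.
chosen_by_clicks = best_item(ITEMS, reward=lambda item: item[1])
# What was wanted (assumed here): clicks that leave the reader satisfied.
chosen_by_intent = best_item(ITEMS, reward=lambda item: item[1] * item[2])

print(chosen_by_clicks)  # outrage headline
print(chosen_by_intent)  # useful summary
```

The optimiser is identical in both calls. Only the reward changed, and with it the behaviour.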

This is why reinforcement learning often starts in simulation, with careful constraints before anything touches the real world. And it’s why the design of the reward itself is not a technical afterthought — it’s one of the most consequential decisions in the whole system.


One Is About Answers. One Is About Patterns. One Is About Behaviour.

Supervised learning, unsupervised learning, and reinforcement learning aren’t separate islands. In practice they layer and interweave — reinforcement learning agents often use supervised models inside themselves, predicting future rewards or modelling parts of their environment. Unsupervised techniques help them compress complex states into something manageable.

But as a map of the territory, the distinction holds.

Supervised learning is about answers. Unsupervised learning is about patterns. Reinforcement learning is about behaviour — learning how to act in an environment to maximise what matters over time.


Trial, Error, and Responsibility

When I finally stopped dying in that game, it wasn’t because someone handed me the solution. It was because the loop — try, fail, adjust, try again — had quietly built something that worked.

Reinforcement learning systems learn the same way. Which means they will find strategies we didn’t anticipate, shortcuts we didn’t design, and solutions we didn’t know were possible. Sometimes that’s extraordinary. Sometimes it’s a warning.

The trial-and-error loop doesn’t go away. But once you give an agent the power to act in the world, the design of its rewards, constraints, and environment becomes an ethical question as much as a technical one.

Because players, as I learned the hard way, will do whatever it takes to win the game you set in front of them — whether or not it’s the game you thought you were designing.


In the next post, we step back from the mechanics and look at the bigger picture — where these three ways of learning meet the real world, and what it means to build systems that don’t just work, but work well.
