A Million Rows of Nothing
Why business use case and data strategy must come before AI strategy
At a customer event last year, an IT Director told me — with some confidence — that they already had an AI strategy.
“Great,” I said. “Now tell me about your data strategy.”
“We have 250TB,” he replied.
I nodded. And thought: there is a very big difference between data and storage.
That moment has stayed with me because it wasn’t an isolated conversation. It was a pattern. Organisations are arriving at the AI table with infrastructure plans, vendor commitments and boardroom ambition — but without first validating the business use case, the predicted ROI, or the data required to support either one.
That is the gap. And it is an expensive one.
The gap is earlier than most organisations think
Walk any AI conference floor and the energy is real. The technology is genuinely impressive. GPU servers are being specced, procured and racked. Data scientists are being hired. AI roadmaps are being presented to boards.
And somewhere near the bottom of the slide deck, almost as an afterthought: “We’ll need to look at data readiness.”
For organisations serious about AI delivering real outcomes, this is the wrong order.
The first question should not be “what infrastructure should we buy?” It should be “what business problem are we solving, what return do we expect, and does our data actually support that outcome?”
If those questions haven’t been answered, the AI strategy isn’t yet a strategy. It’s an ambition.
Start with the business use case and predicted ROI
Before talking about models or servers, organisations need clarity on three things: what specific business problem are we trying to solve, what result would make this investment worthwhile, and what evidence suggests the data can support that result?
This matters because businesses don’t invest in AI for the sake of AI. They invest in outcomes — lower cost, higher revenue, reduced risk, better service, faster decisions, improved productivity.
The business use case and predicted ROI have to come first. They set the standard the data must meet, the model must prove, and the infrastructure must eventually support. Without that anchor, teams end up building technical capability in search of commercial justification.
Then comes data strategy
This is where many organisations confuse capacity with capability.
Saying “we have 250TB” is not describing a data strategy. It is describing a storage estate.
A real data strategy answers different questions. What data actually matters for the use case? Where does it live? Who owns it? How is it governed? How trustworthy is it? How easily can it be accessed, joined, prepared and used?
AI doesn’t begin with infrastructure. It begins with understanding whether the organisation has data that is usable, relevant, governed and connected to a business objective. That is why data strategy has to come before AI strategy. If you don’t understand the asset you’re asking AI to learn from, you don’t yet know whether the strategy is viable.
Data engineering is not pre-work. It is the work.
The foundational argument is simple, even if it’s routinely ignored: data engineering is not a precursor to AI work. It is AI work.
The pipelines, the schemas, the quality checks, the lineage, the transformation logic — these are not the boring bit before the interesting bit starts. They are the work.
A model is only ever as good as the data it learns from. If that data is incomplete, inconsistently formatted, poorly labelled or structurally flawed, the model will learn the wrong things with great efficiency. Garbage in, amplified garbage out at scale.
The data engineering layer needs to be in place, and understood, before a model is trusted in production. That means clean, documented pipelines with known lineage, a clear system of record for the domain you’re working in, variables that are what they say they are and, critically, someone who has actually interrogated the data, not just counted the rows.
250TB of storage tells you nothing about any of that.
Even clean data can still be useless
Here is where the conversation gets more uncomfortable. Because the problem isn’t always dirty data.
Sometimes the data looks clean. The schema is tidy. The row counts are impressive. The formatting is consistent. It passes the hygiene checks. And then you run the analysis — and discover the data tells you very little. Not because it’s messy. Because it’s empty of useful signal.
This is the moment EDA — Exploratory Data Analysis — earns its place. Not as a technical formality, not as a box to tick before the real work starts, but as the moment of truth. The point at which you find out whether your data can actually answer the question you’re asking of it.
That means looking at distributions, missingness, outliers, feature relationships, basic correlations, and whether the patterns you expected to see are actually present. If they aren’t, that isn’t a minor issue. It’s the whole issue.
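As a rough illustration of the first of those checks, here is a minimal sketch in plain Python that profiles a single numeric column for missingness, spread and outliers. The column values are invented for illustration, the function name profile_column is hypothetical, and the 1.5 × IQR outlier rule is one common convention rather than anything this article prescribes.

```python
import statistics


def profile_column(values):
    """Summarise one numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    missing_share = 1 - len(present) / len(values)

    # Quartiles give a robust view of spread; the 1.5 * IQR fence is a
    # widely used (but arbitrary) rule of thumb for flagging outliers.
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in present if v < low or v > high]

    return {
        "missing_share": missing_share,
        "mean": statistics.fmean(present),
        "stdev": statistics.stdev(present),
        "outliers": outliers,
    }


# Hypothetical toy column, invented purely for illustration.
price = [10.0, 11.0, 9.5, None, 10.5, 10.0, 95.0, 9.0, None, 11.5]
report = profile_column(price)
print(report)
```

On this toy column the profile flags two missing values out of ten and a single suspicious value (95.0). In practice the same questions would be asked of every column that matters to the use case, usually with a dataframe library rather than by hand.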
A million rows of nothing is still nothing
This is why volume can be so misleading.
Take a look at this correlation matrix.
[Figure: correlation matrix heatmap of the retail variables]
To the untrained eye it looks impressive. Professional. The kind of output that gets nodded at in a boardroom. But look closer. That red diagonal? Every variable correlating perfectly with itself — mathematically guaranteed, analytically meaningless. Everything else is zero. Price and Discount: no relationship. Seasonality and Stock Level: no relationship. Shipping Cost and Return Rate: no relationship.
In a real retail dataset you would expect at least some of those relationships to exist. The fact that this data shows none of them is a signal worth taking seriously. A flat correlation view doesn’t prove there is absolutely nothing to learn, because a correlation matrix only captures pairwise, typically linear, relationships. But it does tell you there is no obvious predictive signal in this view of the data. That should trigger caution, not confidence.
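To make the point concrete, the sketch below builds a small table of columns that are pure independent noise and computes their Pearson correlation matrix. The column names are borrowed from the example above, but the values are randomly generated, not real retail data, and the helper names pearson and correlation_matrix are hypothetical. The row count is kept small so the example runs quickly.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Nested dict of pairwise correlations, keyed by column name."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Assumption: every column is independent Gaussian noise, so there is
# no relationship to find however many rows are generated.
rng = random.Random(0)
n_rows = 5000
columns = {
    name: [rng.gauss(0, 1) for _ in range(n_rows)]
    for name in ("price", "discount", "return_rate")
}

matrix = correlation_matrix(columns)
max_off_diagonal = max(
    abs(matrix[a][b]) for a in matrix for b in matrix if a != b
)
print(f"diagonal: {matrix['price']['price']:.3f}")
print(f"largest off-diagonal magnitude: {max_off_diagonal:.3f}")
```

The diagonal comes out at 1 by construction, and every off-diagonal value sits close to zero. Adding more rows only pushes those values closer to zero; it does not create signal that was never there.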
You shouldn’t respond by buying more infrastructure. You should respond by asking better questions. Are these the right features? Is the data aggregated at the wrong level? Are important variables missing? Is the business question badly framed? Are we trying to predict something the data cannot meaningfully support?
If you can’t answer those questions, you are not ready to build the model. You are ready to do more analysis.
Model readiness comes after data readiness
Only once the business case is clear and the data has been tested should the conversation move to model readiness.
At that point the focus becomes more disciplined. Can the data support the target outcome? Which features actually carry useful predictive weight? What baseline performance is realistic? What error level is acceptable for the business use case? What would success look like in practice, not just in a notebook?
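One hedged way to approach the baseline question is to compare a candidate model against the most naive predictor available on held-out data. The sketch below does this with a mean-value baseline and a one-feature least-squares line. The numbers are invented for illustration, and the choice of mean absolute error as the metric is an assumption, not something this article prescribes.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical toy data, invented purely for illustration.
train_x = [1, 2, 3, 4, 5, 6]
train_y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
test_x = [2.5, 4.5]
test_y = [5.1, 8.9]

# Naive baseline: always predict the training mean.
baseline_prediction = sum(train_y) / len(train_y)
baseline_mae = mean_absolute_error(test_y, [baseline_prediction] * len(test_y))

# Candidate model: a straight line fitted on the training data only.
slope, intercept = fit_line(train_x, train_y)
model_mae = mean_absolute_error(test_y, [slope * x + intercept for x in test_x])

print(f"baseline MAE: {baseline_mae:.2f}, model MAE: {model_mae:.2f}")
```

If the candidate cannot clearly beat the naive baseline on data it has not seen, that is evidence the features carry little predictive weight. Whether the remaining error is acceptable is a business decision, not a technical one.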
This is the stage where organisations find out whether the use case is genuinely model-worthy — or whether it looked better in a strategy deck than it does in reality. Model readiness is not about enthusiasm. It is about proof.
Infrastructure should be the consequence, not the starting point
The infrastructure conversation is seductive. More compute, faster processing, bigger clusters — these feel like progress. And they are progress, in the right context.
When you have a validated business case, a believable ROI, signal-rich data and a well-framed modelling problem, the right infrastructure genuinely accelerates outcomes. But infrastructure applied to unvalidated data doesn’t solve the problem. It scales it.
A model trained on the wrong data, running on the best hardware available, will produce wrong answers faster and at greater cost than anyone planned for. The servers don’t know the business case is weak. The GPUs don’t know the data is empty of signal. They will process bad assumptions with perfect efficiency.
That is why the sequence matters.
Business use case → Predicted ROI → Data strategy → Data engineering and EDA → Feature validation and model readiness → Infrastructure investment
Getting that sequence right is the difference between an AI investment that delivers and one that quietly disappoints.
The IT Director with 250TB has storage. What he needs first is a conversation about what’s in it, whether it’s been tested, whether it contains usable signal, and whether it can answer the questions the business is asking. That is the conversation worth having before the servers arrive.
Closing thought
There is a version of the AI hype cycle that ends badly — and it ends badly in a specific way. Not with dramatic failure, but with quiet disappointment. Models that don’t perform. Investments that don’t deliver. Data scientists hired to build things the data was never capable of supporting.
The organisations that avoid that outcome are the ones that did the unglamorous work first. They validated the use case. They estimated the ROI. They looked at the data before they bought the infrastructure. They ran EDA before they committed to the model. They asked hard questions before they made bold commitments.
The emperor’s new clothes are always convincing until someone asks the uncomfortable question. In AI, that question is usually the same:
Have you actually tested the data?
Data readiness has to be built in, not bolted on.



