Revolutionizing AI: From Centralization to Localized Efficiency

Blueprint illustration of a distributed AI architecture, highlighting challenges such as slow distribution and high space usage, and benefits including localized inference and enhanced security.

How AI is following the same arc as every technology before it — and why that’s the most important thing happening in enterprise AI right now.

The 22 Binders Problem

In the 1990s I worked for Volvo Trucks, responsible for technical documentation. Every authorised dealership workshop had 22 steel binders crammed with service manuals, torque specifications, fault code references, and replacement procedures. Covering every model, every variant, every engine configuration.

Getting those binders to the right workshop, in the right language, at the right time was a logistical nightmare. Print runs. Translation cycles. Physical distribution across multiple countries. And the moment a binder left the building, you started losing control of it.

Because trucks don’t stand still. Specifications change. Procedures get revised. Safety-critical updates happen. So we’d issue interim service bulletins — printed updates mailed out to dealerships, hoping they’d find their way into the right binder, in the right place, before a technician needed them.

The knowledge was good. The people who wrote those manuals understood those trucks deeply. The content was authoritative, structured, procedural. But the distribution model was broken by design. The moment you printed something, the clock started on it becoming out of date.

I spent years managing that problem. I never solved it. The technology didn’t exist to solve it.

It does now.

The Sledgehammer Era

When generative AI arrived at scale it came in like a sledgehammer. Vast data centres. Enormous compute. Hundreds of billions of parameters. Everything centralised, everything cloud-dependent, everything expensive.

That was necessary. You needed that scale to prove the capability existed. GPT-4, Gemini, Claude — these required hyperscale infrastructure to demonstrate what large language models could actually do. The sledgehammer era wasn’t wrong. It was the only way to get here.

But it’s not the end state.

There’s a pattern in technology that repeats so reliably it’s almost boring once you’ve seen it enough times. A new capability emerges at scale, centralised, specialist, expensive. Then the algorithms get smarter. The hardware gets smaller. The economics shift. And compute escapes the specialist environment and goes where the work actually happens.

We’ve seen this cycle before. Several times.

The Transistor Moment

In the 1940s the valve computer proved the concept. ENIAC, Colossus: genuinely capable machines doing real computation. Enormous, power-hungry, heat-generating, fragile, requiring specialist environments and specialist people just to keep running.

Then the transistor arrived. It didn’t just miniaturise the valve. It changed what was possible. Compute escaped the air-conditioned room. It went into consumer products, industrial equipment, vehicles. Eventually into the Volvo workshop — diagnostic computers, engine management systems, electronic service tools.

Nobody staring at a room full of valves in 1955 predicted the smartphone. But it was inevitable once the transistor existed. The use cases emerged from the distribution.

We are at that moment with AI.

The evidence is already here. Models like Mistral 7B, Microsoft’s Phi-3, and Meta’s Llama 3.1 8B perform remarkably well on focused tasks. Quantisation shrinks models several-fold with only a small quality loss. Storing weights at 4 bits instead of 16 cuts their size by roughly four times, so an 8-billion-parameter model that needs about 16GB for its weights at 16-bit precision runs comfortably on an 8GB GPU. NVIDIA has put up to a petaflop of AI compute (at FP4 precision) into a desktop machine that sits on a workbench: the DGX Spark, built on the GB10 Grace Blackwell Superchip. Apple put a capable language model in a phone.
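The memory side of that claim is back-of-the-envelope arithmetic: weights take roughly parameters × bits per weight ÷ 8 bytes. The helper below is an illustrative sketch of that rule of thumb, not a sizing tool. The function name is hypothetical, and it deliberately ignores the KV cache, activations and runtime overhead that a real deployment adds on top.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Rough memory for the model weights alone, in GB (1 GB = 10**9 bytes).

    Ignores the KV cache, activations and runtime overhead, which come on top.
    """
    # (params_billion * 1e9 weights) * (bits / 8 bytes each) / (1e9 bytes per GB)
    return params_billion * bits_per_weight / 8


for params in (8, 70):
    for bits in (16, 8, 4):
        print(f"{params}B parameters at {bits}-bit: ~{weight_memory_gb(params, bits):.0f} GB")
```

On these rough numbers, an 8B model falls from about 16GB of weights at 16-bit to about 4GB at 4-bit, which leaves headroom on an 8GB card. A 70B model still needs around 35GB at 4-bit. Quantisation does much of the work, but the desktop story depends on combining it with smaller, focused models.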

The sledgehammer is giving way to the scalpel. Smarter algorithms. Smaller models. Distributed inference. The transistor moment.

The Architecture That Changes Everything

So let me describe what I’d build for Volvo Trucks today.

A GB10 desktop in every dealership workshop. On it: a small language model, a local MCP server, and a local cache of the workshop knowledge base. No internet connection required. No cloud dependency. No data leaving the building.

The knowledge base lives centrally — one authoritative source, maintained by a team of technical authors with a proper authoring and approval workflow. Single version of the truth. When a torque specification changes, it changes once, in one place, approved by the right person. Overnight, a delta sync pushes only the changed content to every workshop box in every country. By morning every technician in Europe has the current procedure.
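The overnight delta sync can be sketched in a few lines. This is a minimal in-memory illustration under stated assumptions: the class names, the single global revision counter and the document IDs are all hypothetical, and a real deployment would add transport, authentication and error handling. It shows the core idea: each workshop box remembers the last revision it saw and asks only for what has changed since then.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    doc_id: str
    version: int
    body: str


@dataclass
class CentralKB:
    """Hypothetical central store: every approved change bumps one global revision number."""
    revision: int = 0
    docs: dict[str, Document] = field(default_factory=dict)
    changed_at: dict[str, int] = field(default_factory=dict)  # doc_id -> revision of last change

    def publish(self, doc_id: str, body: str) -> None:
        """Record an approved change to a single document."""
        self.revision += 1
        previous = self.docs.get(doc_id)
        version = previous.version + 1 if previous else 1
        self.docs[doc_id] = Document(doc_id, version, body)
        self.changed_at[doc_id] = self.revision

    def delta_since(self, revision: int) -> tuple[list[Document], int]:
        """Return only documents changed after `revision`, plus the current revision."""
        changed = [self.docs[d] for d, rev in self.changed_at.items() if rev > revision]
        return changed, self.revision


@dataclass
class WorkshopCache:
    """Local copy on a workshop box; remembers how far it has synced."""
    synced_revision: int = 0
    docs: dict[str, Document] = field(default_factory=dict)

    def sync(self, central: CentralKB) -> int:
        changed, latest = central.delta_since(self.synced_revision)
        for doc in changed:
            self.docs[doc.doc_id] = doc
        self.synced_revision = latest
        return len(changed)


kb = CentralKB()
kb.publish("torque-spec-A", "original procedure text")
kb.publish("fault-code-B", "original procedure text")
box = WorkshopCache()
print(box.sync(kb))  # 2: the first sync pulls everything
kb.publish("torque-spec-A", "revised procedure text")
print(box.sync(kb))  # 1: only the changed document travels
print(box.docs["torque-spec-A"].version)  # 2
```

This sketch propagates new and updated documents only. Withdrawing a document would need an explicit tombstone record in the delta, which is the job of the deprecation process.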

The technician doesn’t type freeform queries. They select from predefined prompt templates — fault code lookup, torque specification, replacement procedure, service interval. Each template fires a carefully engineered retrieval query behind the scenes, pulling the right content from the local knowledge base, passing it to the local model, generating a precise, grounded answer.
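A predefined template is, in effect, a function with a fixed signature. The sketch below uses hypothetical template names and fields. It substitutes naive keyword overlap for the retriever; a real system would more likely use a search index or embeddings. It shows the two properties the workshop tool relies on: a technician can only ask questions that exist as templates, and the model receives retrieved text with an instruction to answer from it alone. The call to the local model itself is left out.

```python
import re

# Hypothetical templates: each names the fields a technician must supply and the
# retrieval query built from them. Anything not listed here cannot be asked.
TEMPLATES = {
    "fault_code_lookup": {"fields": ("code",), "query": "fault code {code}"},
    "torque_specification": {"fields": ("model", "component"), "query": "torque {model} {component}"},
}


def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, kb: dict[str, str], k: int = 2) -> list[str]:
    """Rank documents by keyword overlap; a stand-in for whatever retriever is really used."""
    q = tokenize(query)
    scored = sorted(((len(q & tokenize(body)), doc_id) for doc_id, body in kb.items()), reverse=True)
    return [doc_id for score, doc_id in scored[:k] if score > 0]


def build_prompt(template_name: str, kb: dict[str, str], **fields: str) -> tuple[str, list[str]]:
    template = TEMPLATES[template_name]  # raises KeyError for anything not predefined
    missing = [f for f in template["fields"] if f not in fields]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    query = template["query"].format(**fields)
    doc_ids = retrieve(query, kb)
    context = "\n".join(f"[{d}] {kb[d]}" for d in doc_ids)
    prompt = (
        "Answer using ONLY the sources below. If they do not contain the answer, say so.\n"
        f"{context}\nQuestion: {query}"
    )
    return prompt, doc_ids


kb = {
    "FC-1234": "Fault code 1234: placeholder diagnostic procedure",
    "TQ-0001": "Torque specification placeholder for wheel nuts",
}
prompt, sources = build_prompt("fault_code_lookup", kb, code="1234")
print(sources)  # ['FC-1234']
```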

The model never wanders off domain. There’s no internet to reach out to. And there is far less risk of the model hallucinating, confabulating a procedure it half-remembers from training, because it answers from retrieved text rather than from memory. The answer comes from the authoritative knowledge base, retrieved precisely, generated locally, watermarked with the sync timestamp so you know exactly which version of the truth informed it.
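Watermarking an answer with its sync timestamp requires only a thin wrapper around generation. In this sketch, `generate` is a placeholder for whatever local model runtime is used, and no specific API is assumed. The refusal message and stamp format are illustrative choices. If retrieval finds nothing, the wrapper declines to answer rather than letting the model fall back on what it half-remembers.

```python
from datetime import datetime, timezone
from typing import Callable


def answer_with_provenance(
    generate: Callable[[str], str],
    prompt: str,
    source_ids: list[str],
    synced_at: datetime,
) -> str:
    """Attach the sources and knowledge-base sync time to a locally generated answer.

    `generate` stands in for the local model: any callable from prompt to text.
    """
    if not source_ids:
        # Nothing retrieved: refuse instead of letting the model answer from memory.
        return "No matching procedure in the local knowledge base."
    answer = generate(prompt)
    stamp = synced_at.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    return f"{answer}\n\nSources: {', '.join(source_ids)} | Knowledge base synced: {stamp}"


def fake_model(prompt: str) -> str:
    """Placeholder for the local model; always returns the same text."""
    return "<generated answer>"


synced = datetime(2024, 5, 1, 2, 0, tzinfo=timezone.utc)
print(answer_with_provenance(fake_model, "<prompt>", ["FC-1234"], synced))
print(answer_with_provenance(fake_model, "<prompt>", [], synced))
```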

This isn’t a chatbot. It’s a precision workshop tool that happens to use an AI model internally. The AI is an implementation detail. The value proposition is the right answer, instantly, for a truck that needs to move.

And crucially — this couldn’t be done with a naive RAG implementation bolted onto an ungoverned file store. The intelligence isn’t in the retrieval mechanism. It’s in the governance that happened before retrieval was ever involved. The single version of truth. The approval workflow. The deprecation process that ensures superseded procedures stop being retrieved. The content discipline that technical authors like my 1990s self spent years trying to maintain manually.
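The deprecation rule can live in a single filter that sits in front of the retriever. The lifecycle states and field names below are hypothetical. What matters is that drafts and superseded procedures never become retrievable, however closely they match a query.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KBEntry:
    doc_id: str
    status: str  # hypothetical lifecycle: "draft", "approved", "superseded"
    approved_by: Optional[str] = None
    superseded_by: Optional[str] = None


def retrievable(entries: list[KBEntry]) -> list[KBEntry]:
    """Only approved, signed-off, non-superseded entries may ever reach the retriever."""
    return [
        e for e in entries
        if e.status == "approved" and e.approved_by and e.superseded_by is None
    ]


def supersede(entries: list[KBEntry], old_id: str, new_id: str) -> None:
    """Retire an old procedure once its replacement has been approved."""
    for e in entries:
        if e.doc_id == old_id:
            e.status = "superseded"
            e.superseded_by = new_id


entries = [
    KBEntry("proc-v1", "approved", approved_by="lead author"),
    KBEntry("proc-v2", "draft"),
]
print([e.doc_id for e in retrievable(entries)])  # ['proc-v1']

# The revision is approved, and the old procedure is retired in the same step.
entries[1].status = "approved"
entries[1].approved_by = "lead author"
supersede(entries, "proc-v1", "proc-v2")
print([e.doc_id for e in retrievable(entries)])  # ['proc-v2']
```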

The AI amplifies good knowledge management. It doesn’t replace it.

The Pendulum

I’ve watched enough technology cycles to see the pattern clearly.

The PC era distributed compute out of the data centre and into the hands of individuals. The internet centralised it again — everything running on servers you didn’t own. Mobile distributed it once more, putting capable compute in every pocket. Cloud AI is centralising again — everything phoning home to a hyperscale data centre to answer a query.

Each time, the dominant narrative says this is how it will always be now. Each time, the pendulum swings back. Not because the technology fails, but because the same forces reassert themselves: latency becomes unacceptable, data sovereignty pressure builds, cost economics shift, and capability catches up to the point where distribution becomes viable again.

We are at peak centralisation with AI right now. The distribution forces are already building. Sovereignty regulation is tightening globally. Edge hardware is catching up fast. Algorithmic efficiency is compressing capable models into deployable sizes. Enterprises are growing uncomfortable with sensitive data leaving their premises to answer a query.

The pendulum will swing. It always does.

The Control Plane

But there’s something different this time that could break the historical pattern — and it’s the idea I find most interesting in enterprise AI right now.

Every previous distribution wave eventually lost coherence. The PC era fragmented into version chaos, security nightmares, and unmanageable sprawl. Mobile created a device estate that IT departments are still trying to govern. Distribution without discipline becomes a different kind of problem — one that often ends up being worse than the centralisation it replaced.

The question is whether AI distribution can be done differently. Whether you can have local inference without local chaos.

I think you can. The architecture looks like this: local inference running in your server room, your data never leaving your premises, your domain knowledge locked in a governed local knowledge base. But the model’s values, safety alignment, and capability updates maintained centrally by the people who built it. The enterprise controls the execution environment. The model provider maintains the model itself.
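One way to balance central model updates with local control is a two-part gate on every update. First, the enterprise must have approved the version. Second, the delivered weights must match what the provider published. The sketch assumes a hypothetical manifest with `version` and `sha256` fields; it illustrates the idea rather than any provider's actual update mechanism.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def accept_model_update(artifact: bytes, manifest: dict[str, str], approved_versions: set[str]) -> bool:
    """Accept a centrally published model update only if it is locally approved AND intact.

    `manifest` is a hypothetical record from the model provider, e.g.
    {"version": "2.1", "sha256": "<hex digest>"}; `approved_versions` is the
    enterprise's own allow-list of versions it has signed off.
    """
    if manifest["version"] not in approved_versions:
        return False  # the provider has shipped it, but the enterprise has not approved it yet
    return sha256_of(artifact) == manifest["sha256"]  # reject corrupted or altered weights


weights = b"placeholder model bytes"
manifest = {"version": "2.1", "sha256": sha256_of(weights)}
print(accept_model_update(weights, manifest, {"2.0"}))         # False: not yet approved
print(accept_model_update(weights, manifest, {"2.0", "2.1"}))  # True
print(accept_model_update(weights + b"x", manifest, {"2.1"}))  # False: checksum mismatch
```

A checksum proves integrity only if the manifest itself comes from a trusted source. A production system would verify a cryptographic signature on the manifest and hash multi-gigabyte weight files in chunks rather than holding them in memory.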

Distribution without fragmentation. Sovereignty without chaos.

It’s the Volvo KB architecture applied to the model itself. Central truth. Distributed execution. The same principle that would have solved my 22-binder problem in 1993 — one authoritative source, pushed to the edge, with discipline about what changes and who approves it.

This isn’t a theoretical position. The infrastructure to do it exists today. What’s missing, in most organisations, is the governance thinking that makes it safe. And governance thinking, it turns out, is not a technology problem. It’s a knowledge management problem. Which is a very old problem indeed.

What This Means

The use cases for distributed, domain-locked AI are not going to come from hyperscale thinking. They’re going to emerge from people who understand specific domains deeply — who know where the knowledge lives, what governance it needs, and what questions actually matter in that environment.

The Volvo workshop is one example. But the same architecture applies anywhere that has a bounded domain, authoritative knowledge, and real decisions being made by people who need the right answer quickly.

Consider a hospital ward. A clinician needs the current drug interaction protocol for a specific combination — not a general answer from a model trained on the internet, but the approved formulary for this trust, this version, signed off by the chief pharmacist last Tuesday. The architecture is identical to the workshop: local inference, governed knowledge base, delta sync, no data leaving the building. The AI is an implementation detail. The value is the right answer, for this patient, right now.

Or a field engineer on an offshore platform, no reliable connectivity, needing the current maintenance procedure for a specific valve configuration. Or a legal team needing to retrieve the approved contract clause library — not a hallucinated approximation, but the version that compliance signed off.

In each case the AI isn’t the interesting part. The interesting part is the governed knowledge layer underneath it — built by people who understand the domain, maintained with discipline, versioned and approved and auditable.

The sledgehammer era gave us the capability. The transistor moment gives us the distribution. What makes it useful is the same thing that always made the difference — knowing your domain, respecting your data, and being honest about what the technology actually does.

I learned that while managing 22 steel binders for Volvo Trucks in the 1990s.

Some lessons don’t change.

Discover more from garagegeekblog