Why AI Fails in Supply Chain Compliance Workflows
The failure mode I see most often is not that teams build bad harnesses. It is that they build no harness at all, then wonder why scaling is hard.
The pattern looks like this. A capable person discovers that a language model can help with supplier screening, regulation mapping, or corrective action drafting. It works well enough in a conversation. They share it with their team. The team uses it in conversations. And then, six months later, someone asks: how do we know the same logic was applied to every supplier? Can we audit why a particular rating was given? Can we re-run this when the regulation updates? Can we process 3,000 records consistently?
The answer, in a pure chat-based workflow, is almost always no.
This is not a criticism of the people building these tools. Chat is a genuinely useful starting point and exploration mode. The problem is treating exploration as production. General-purpose assistants are excellent for testing what is possible. They are not, by themselves, a substitute for a controlled AI workflow in a regulated context.
What a Real Harness Looks Like in This Domain
Let me make this concrete. Imagine you are building an AI-assisted supplier due diligence workflow for a company in scope for CSDDD and LkSG, with several hundred new suppliers to screen each quarter.
The surface problem sounds tractable enough. Public information exists: corporate websites, sustainability reports, certifications, policy documents. The challenge is turning it into something consistent, structured, and usable at scale, without the result depending entirely on how the question happened to be phrased that day.
A naive implementation would ask a model to summarize a supplier's ESG posture based on whatever it can find. You would get outputs. Some would even be good. But you would have no control over what questions were asked, no consistent evidence standard, no ability to audit why one supplier was flagged differently from another, and no traceability back to source material.

A real AI harness for supplier risk assessment looks quite different.
The first stage is not generation; it is classification and grounding. What is the correct supplier identity? Which legal entity or subsidiary is in scope? Is there an engagement history? Which regulation is primarily relevant here? This is mostly deterministic work. It sets the context for everything downstream and prevents the model from reasoning about the wrong entity, which, without this step, happens more often than you would expect.
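A minimal sketch of what this grounding step could look like, assuming a pre-built entity register. Everything here is hypothetical and for illustration only: the names `ENTITY_REGISTER`, `SupplierContext`, and `ground_supplier`, the sample entities, and the regulation tags, which are placeholder values rather than a legal scoping rule. The point it shows is that identity resolution is a plain lookup, and that an unmatched name is handed to a person instead of being guessed.

```python
import string
from dataclasses import dataclass
from typing import Optional, Set, Tuple


@dataclass(frozen=True)
class SupplierContext:
    """Context passed to every downstream stage (hypothetical shape)."""
    legal_entity_id: str
    legal_name: str
    regulations: Tuple[str, ...]
    has_engagement_history: bool


# Hypothetical entity register keyed by normalized legal name.
# The regulation tags are illustrative placeholders, not a scoping rule.
ENTITY_REGISTER = {
    "acme textiles gmbh": ("LE-001", "Acme Textiles GmbH", ("LkSG", "CSDDD")),
    "acme textiles vietnam co ltd": ("LE-002", "Acme Textiles Vietnam Co. Ltd", ("CSDDD",)),
}

_PUNCTUATION = str.maketrans("", "", string.punctuation)


def normalize_name(raw: str) -> str:
    """Strip punctuation, lowercase, and collapse whitespace."""
    return " ".join(raw.translate(_PUNCTUATION).lower().split())


def ground_supplier(raw_name: str, engaged_ids: Set[str]) -> Optional[SupplierContext]:
    """Resolve a raw supplier name to one legal entity, or None if unmatched.

    Returning None routes the record to manual resolution, so no model
    ever reasons about an entity that was only guessed at.
    """
    entry = ENTITY_REGISTER.get(normalize_name(raw_name))
    if entry is None:
        return None
    entity_id, legal_name, regulations = entry
    return SupplierContext(
        legal_entity_id=entity_id,
        legal_name=legal_name,
        regulations=regulations,
        has_engagement_history=entity_id in engaged_ids,
    )
```

Because the lookup involves no model call, the same input always resolves to the same entity, which is what makes the later stages auditable.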
The second stage is structured evidence retrieval. Not "ask the model what it thinks," but "find what actually exists." The model's job here is to absorb variability: normalizing language across suppliers, identifying relevant passages across inconsistently structured sources, surfacing what is present and what is absent. It is doing what it is genuinely good at (handling messy, unstructured input), but within a constrained and repeatable frame.
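The "constrained and repeatable frame" can be sketched roughly as follows. This is an illustration under assumptions, not a reference implementation: the topic list, the `Document` and `Passage` types, and `retrieve_evidence` are hypothetical names, and `keyword_extractor` is a deliberately crude stand-in for the model call so that the example runs on its own. What matters is the shape: a fixed set of questions applied to every supplier, with each returned passage tied to its source.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass(frozen=True)
class Document:
    url: str
    text: str


@dataclass(frozen=True)
class Passage:
    topic: str
    quote: str
    source_url: str


# Fixed question set: every supplier is asked the same things (hypothetical topics).
TOPICS = {
    "child_labour": "Does the supplier publish a policy prohibiting child labour?",
    "grievance_mechanism": "Does the supplier describe a grievance mechanism?",
}

# Stand-in for the model: simple keyword matching, for illustration only.
_KEYWORDS = {
    "child_labour": ("child labour", "child labor"),
    "grievance_mechanism": ("grievance", "whistleblow"),
}

Extractor = Callable[[str, Document], List[str]]


def keyword_extractor(topic: str, doc: Document) -> List[str]:
    """Return the sentences in doc that mention the topic's keywords."""
    sentences = [s.strip() for s in doc.text.split(".") if s.strip()]
    return [s for s in sentences if any(k in s.lower() for k in _KEYWORDS[topic])]


def retrieve_evidence(
    docs: Sequence[Document], extractor: Extractor = keyword_extractor
) -> Dict[str, List[Passage]]:
    """Run every topic against every document; absent topics keep an empty list."""
    results: Dict[str, List[Passage]] = {topic: [] for topic in TOPICS}
    for topic in TOPICS:
        for doc in docs:
            for quote in extractor(topic, doc):
                results[topic].append(Passage(topic, quote, doc.url))
    return results
```

In a real workflow the extractor would be a model call that handles paraphrase and inconsistent structure. The surrounding loop stays the same, so a topic with nothing found still appears in the result with an empty list rather than silently disappearing.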
The third stage is gap detection, before any risk signal is produced. For each relevant topic, the harness distinguishes between strong evidence, partial evidence, and no evidence found. This is not a confidence score bolted on at the end. It is a structural checkpoint. The output is not "low risk" when evidence is missing. It is "no evidence found", a materially different and more honest signal, and one that tells the analyst something actionable.
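One way to make this checkpoint structural is to model the evidence level as a closed set of values that has no "low risk" member. The sketch below assumes each passage arrives with an `is_direct` flag assigned by the retrieval stage. That flag, the rule that one direct passage counts as strong evidence, and the names `EvidenceLevel` and `detect_gaps` are all illustrative assumptions rather than a prescribed standard.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Sequence


class EvidenceLevel(Enum):
    STRONG = "strong evidence"
    PARTIAL = "partial evidence"
    NONE = "no evidence found"


@dataclass(frozen=True)
class Passage:
    quote: str
    source_url: str
    is_direct: bool  # assumption: set upstream when the passage directly answers the question


def evidence_level(passages: Sequence[Passage]) -> EvidenceLevel:
    """Classify a topic by what was actually found, never by what is presumed."""
    if not passages:
        return EvidenceLevel.NONE
    if any(p.is_direct for p in passages):
        return EvidenceLevel.STRONG
    return EvidenceLevel.PARTIAL


def detect_gaps(
    topics: Sequence[str], evidence: Dict[str, List[Passage]]
) -> Dict[str, EvidenceLevel]:
    """Assign a level to every expected topic, including ones with no entry at all."""
    return {topic: evidence_level(evidence.get(topic, [])) for topic in topics}
```

Iterating over the expected topics rather than over the retrieved evidence is the design choice that matters: a topic the retrieval stage never mentioned still shows up, labelled "no evidence found".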
The fourth stage is the output itself, built for human review. Every result surfaces the underlying question logic, the evidence level, an explanation, direct quotes from the source, and a link back to the original material. There is no opaque score to accept or reject. A compliance analyst can see exactly why a supplier was flagged, what was found, and what was not.
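As a rough illustration, a single reviewable result might be represented like this. The `Finding` type, its field names, and the sample values are hypothetical. The only thing taken from the description above is the list of what each result carries: the question, the evidence level, an explanation, direct quotes, and a link to the source. There is deliberately no numeric score field.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class Finding:
    """One reviewable result per supplier and question (hypothetical shape)."""
    supplier_id: str
    question: str
    evidence_level: str  # e.g. "strong evidence", "partial evidence", "no evidence found"
    explanation: str
    quotes: List[str]
    source_url: str


def to_review_record(finding: Finding) -> str:
    """Serialize a finding as stable, human-readable JSON for review and audit."""
    return json.dumps(asdict(finding), indent=2, sort_keys=True)


# Example with made-up values.
example = Finding(
    supplier_id="LE-001",
    question="Does the supplier describe a grievance mechanism?",
    evidence_level="no evidence found",
    explanation="No passage in the retrieved documents mentions a grievance channel.",
    quotes=[],
    source_url="https://example.com/sustainability-report",
)
record = to_review_record(example)
```

Sorting the keys keeps records comparable across runs, which helps when the same supplier is re-screened after a regulation changes.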
The analyst then decides what to do: whether to trigger a deeper assessment, prioritize outreach, or note it alongside country risk, news monitoring, and other due diligence inputs. The AI surfaces and structures. The human judges and acts. That is what a usable human-in-control AI compliance workflow looks like.
That design is not hypothetical. It is the pattern that makes AI genuinely useful in this context rather than just impressive in a demo.