May 7, 2026
George Karapetyan

AI for Supply Chain Sustainability: How Harness Engineering Makes AI Work in Practice

This is the second in a series of posts I'm writing about bringing AI into practice in sustainability and product compliance. The first covered what BOM extraction taught us about building AI for regulated workflows. This one goes one level up: not what the model does, but how you build the AI system and workflow around it.

Imagine you hire a highly capable new analyst. Day one, you could give them access to your supplier database and say "Assess risk for this portfolio." They will produce something. It might even be impressive.

Or you could give them the workflow. Which regulation applies to which supplier type. Which data fields are required before a risk rating is valid. When missing evidence means escalation, not estimation. What threshold separates "monitor" from "engage" from "remediate." What an audit-ready output looks like in your system.

Same analyst. Completely different reliability.

This is the gap I keep seeing in AI projects for supply chain sustainability and product compliance. Someone comes to you and says: "We plugged in the new model. It's so much better at answering supplier risk questions." And the demo is compelling. You ask about a supplier, the model gives a nuanced answer. You ask about a regulation, it knows the details. You ask it to flag risk in a batch of supplier records, and the outputs look thoughtful.

Here is the uncomfortable thing: they are probably right. The model probably is better. And that is not the main problem.

Why Better AI Models Don’t Solve Supplier Risk and Compliance

The real question is not whether a better model gives better single answers. Of course it does.

The question is what happens when your compliance analyst asks it to screen 400 new suppliers against your CSDDD scope criteria, using your internal risk thresholds, pulling from your supplier portal, escalating to the right person when evidence is missing, and producing output that will hold up six months from now when an auditor asks where a rating came from.

That is not a model problem. That is a harness problem. And it is the problem that separates a useful AI assistant from an AI system your organization can actually rely on.

A quick definition before going further. In the research literature, "harness" refers to the orchestration layer around a model: what information it receives, in what order, through what tools, with what verification, producing what kind of output, under what constraints. LangChain puts it simply: Agent = Model + Harness.

Recent research makes this hard to ignore. Work from Stanford and Tsinghua in early 2026 shows that the performance of AI systems often depends as much on the surrounding harness as on the model itself. In one line of work, researchers point to prior evidence that changing only the harness, while keeping the model fixed, can lead to performance gaps of up to 6× on the same benchmark (SWE-Bench Mobile, Meta-Harness, 2026).

In another, a team redesigning the orchestration layer for a desktop automation task cut runtime from roughly six hours to two, reduced AI calls from about 1,200 to 34, and improved accuracy, all without changing the underlying model (Natural-Language Agent Harnesses, 2026).

Thirty-four calls instead of twelve hundred. Same model. Different harness.

What Is an AI Harness? (And Why It Matters for Compliance Workflows)

In supply chain sustainability and product compliance, this maps to something most of your teams already think about, just not in these terms. It is the difference between asking a very smart person a question and giving that same person a structured workflow to execute.

The harness is the workflow. It is where your organization's compliance logic becomes machine-executable, not as a vague instruction, but as steps, thresholds, evidence requirements, escalation rules, and definitions of done.
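
To make "machine-executable" concrete, here is a minimal sketch of compliance logic written as explicit rules rather than as a prompt. The field names, threshold values, and the `decide` function are all illustrative assumptions, not values or interfaces from any real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HarnessRules:
    # All values are illustrative placeholders, not recommended thresholds.
    required_fields: tuple = ("legal_entity", "country", "sector")
    engage_threshold: float = 0.4
    remediate_threshold: float = 0.7


def decide(record: dict, risk_score: float, rules: HarnessRules = HarnessRules()) -> str:
    """Return the next workflow step for one supplier record."""
    missing = [f for f in rules.required_fields if not record.get(f)]
    if missing:
        # Missing evidence means escalation, not estimation.
        return "escalate: missing " + ", ".join(missing)
    if risk_score >= rules.remediate_threshold:
        return "remediate"
    if risk_score >= rules.engage_threshold:
        return "engage"
    return "monitor"
```

Because the thresholds and required fields live in one versionable object, the same logic is applied to every record, and a change to the rules is a visible change rather than a rephrased question.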

Why AI Fails in Supply Chain Compliance Workflows

The failure mode I see most often is not that teams build bad harnesses. It is that they build no harness at all, then wonder why scaling is hard.

The pattern looks like this. A capable person discovers that a language model can help with supplier screening, regulation mapping, or corrective action drafting. It works well enough in a conversation. They share it with their team. The team uses it in conversations. And then, six months later, someone asks: how do we know the same logic was applied to every supplier? Can we audit why a particular rating was given? Can we re-run this when the regulation updates? Can we process 3,000 records consistently?

The answer, in a pure chat-based workflow, is almost always no.

This is not a criticism of the people building these tools. Chat is a genuinely useful starting point and exploration mode. The problem is treating exploration as production. General-purpose assistants are excellent for testing what is possible. They are not, by themselves, a substitute for a controlled AI workflow in a regulated context.

What a Real Harness Looks Like in This Domain

Let me make this concrete. Imagine you are building an AI-assisted supplier due diligence workflow for a company in scope for CSDDD and LkSG, with several hundred new suppliers to screen each quarter.

The surface problem sounds tractable enough. Public information exists: corporate websites, sustainability reports, certifications, policy documents. The challenge is turning it into something consistent, structured, and usable at scale, without the result depending entirely on how the question happened to be phrased that day.

A naive implementation would ask a model to summarize a supplier's ESG posture based on whatever it can find. You would get outputs. Some would even be good. But you would have no control over what questions were asked, no consistent evidence standard, no ability to audit why one supplier was flagged differently from another, and no traceability back to source material.

A real AI harness for supplier risk assessment looks quite different.

The first stage is not generation; it is classification and grounding. What is the correct supplier identity? Which legal entity or subsidiary is in scope? Is there an engagement history? Which regulation is primarily relevant here? This is mostly deterministic work. It sets the context for everything downstream and prevents the model from reasoning about the wrong entity, which, without this step, happens more often than you would expect.
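
As a rough illustration of how deterministic this first stage can be, the sketch below resolves a supplier name against a registry and refuses to guess when the match is ambiguous. The registry shape, the suffix list, and the function names are assumptions made for the example.

```python
import re
from typing import Optional

# Hypothetical, deliberately short list of legal-form suffixes to ignore when matching.
LEGAL_SUFFIXES = {"gmbh", "ag", "ltd", "inc", "llc", "sa"}


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and strip common legal-form suffixes."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    return " ".join(t for t in cleaned.split() if t not in LEGAL_SUFFIXES)


def resolve_entity(query: str, registry: dict) -> Optional[str]:
    """Return the registry ID when exactly one entity matches, else None."""
    key = normalize(query)
    matches = [eid for eid, name in registry.items() if normalize(name) == key]
    # Zero or several matches is a case for a human, not for the model.
    return matches[0] if len(matches) == 1 else None
```

With a registry holding both "Acme Textiles GmbH" and "Acme Textiles Ltd.", a query for "Acme Textiles" returns `None`, so the workflow stops and asks which legal entity is in scope instead of letting the model pick one.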

The second stage is structured evidence retrieval. Not "ask the model what it thinks," but "find what actually exists." The model's job here is to absorb variability: normalizing language across suppliers, identifying relevant passages across inconsistently structured sources, surfacing what is present and what is absent. It is doing what it is genuinely good at, handling messy, unstructured input, but within a constrained and repeatable frame.
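
The "constrained and repeatable frame" can be sketched as a fixed question set that every supplier is checked against, with each hit tied to its source. In the sketch below, plain keyword matching stands in for the model's passage-finding step, purely so the example runs on its own; the topics, keywords, and names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    topic: str
    quote: str
    source_url: str


# Hypothetical fixed question set: the same topics are checked for every supplier.
TOPIC_KEYWORDS = {
    "child_labour": ("child labour", "child labor", "minimum age"),
    "grievance_mechanism": ("grievance", "whistleblow"),
}


def retrieve(documents: dict) -> dict:
    """Map each topic to the sentences that mention it, keeping the source URL.

    `documents` maps a source URL to its plain text. Keyword matching is a
    stand-in for the model step that would handle varied wording.
    """
    found = {topic: [] for topic in TOPIC_KEYWORDS}
    for url, text in documents.items():
        for sentence in text.split("."):
            lowered = sentence.lower()
            for topic, keywords in TOPIC_KEYWORDS.items():
                if any(k in lowered for k in keywords):
                    found[topic].append(Passage(topic, sentence.strip(), url))
    return found
```

Every topic appears in the result even when nothing was found, which is what lets a later step tell "absent" apart from "not checked".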

The third stage is gap detection, before any risk signal is produced. For each relevant topic, the harness distinguishes between strong evidence, partial evidence, and no evidence found. This is not a confidence score bolted on at the end. It is a structural checkpoint. The output is not "low risk" when evidence is missing. It is "no evidence found," a materially different and more honest signal, and one that tells the analyst something actionable.
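
One way to picture this checkpoint is as an explicit three-valued type, so that an empty result can never be silently read as a low score. The grading rule below (counting quotes) is only an illustrative stand-in for whatever evidence standard a team actually defines.

```python
from enum import Enum


class EvidenceLevel(Enum):
    STRONG = "strong evidence"
    PARTIAL = "partial evidence"
    NONE = "no evidence found"


def grade_evidence(quotes: list, strong_minimum: int = 2) -> EvidenceLevel:
    """Grade one topic. Counting quotes is a placeholder rule for illustration."""
    if not quotes:
        # An empty result is reported as a gap, never as "low risk".
        return EvidenceLevel.NONE
    if len(quotes) >= strong_minimum:
        return EvidenceLevel.STRONG
    return EvidenceLevel.PARTIAL


def checkpoint(topic_quotes: dict) -> dict:
    """Grade every topic before any risk signal is derived from it."""
    return {topic: grade_evidence(quotes) for topic, quotes in topic_quotes.items()}
```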

The fourth stage is the output itself, built for human review. Every result surfaces the underlying question logic, the evidence level, an explanation, direct quotes from the source, and a link back to the original material. There is no opaque score to accept or reject. A compliance analyst can see exactly why a supplier was flagged, what was found, and what was not.
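
A review-ready result like the one described above might be represented as a small record that refuses to exist without its supporting material. This is a sketch under assumed field names, not a description of any product's actual schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Finding:
    question: str
    evidence_level: str  # "strong evidence", "partial evidence", or "no evidence found"
    explanation: str
    quotes: list = field(default_factory=list)
    source_url: str = ""

    def __post_init__(self) -> None:
        has_evidence = self.evidence_level != "no evidence found"
        if has_evidence and not (self.quotes and self.source_url):
            # A claim of evidence without a quote and a link is not reviewable.
            raise ValueError("a finding with evidence needs a quote and a source link")


def to_review_json(finding: Finding) -> str:
    """Serialize a finding so an analyst or auditor can read it later."""
    return json.dumps(asdict(finding), indent=2, ensure_ascii=False)
```

The validation in `__post_init__` is the design choice worth noting: traceability is enforced when the record is created, instead of being checked after the fact.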

The analyst then decides what to do, whether to trigger a deeper assessment, prioritize outreach, or note it alongside country risk, news monitoring, and other due diligence inputs. The AI surfaces and structures. The human judges and acts. That is what a usable human-in-control AI compliance workflow looks like.

That design is not hypothetical. It is the pattern that makes AI genuinely useful in this context rather than just impressive in a demo.

What AI Research Reveals About Workflow vs. Model Performance

The academic finding that stuck with me most was not the headline 6× performance gap. It was a subtler result about structure and efficiency.

In one set of experiments, researchers compared two harness configurations on the same benchmark. The heavier one used more verification stages, more candidate generation, more parallelism. The lighter one used a disciplined self-improvement loop and stripped out most of the extra structure.

The lighter configuration was both more accurate and far cheaper. It processed tasks in a fraction of the time with a fraction of the compute.

The lesson is not that structure is always good. It is that the right structure, the structure that mirrors what actually needs to happen, is what matters. In supply chain terms: a harness built around your actual compliance workflow will outperform a generic AI assistant with more sophistication layered on top of it. Domain knowledge is not context that helps the model. It is the architecture of the harness itself.

Why AI Systems Require Continuous Optimization (Not One-Time Setup)

There is a subtler point from Anthropic's engineering guidance that matters practically for anyone building in this space. Harnesses encode assumptions about what the model cannot do on its own, and those assumptions go stale.

A workflow you built for an earlier model generation may have included context-resetting steps, explicit re-summarization instructions, or carefully managed retrieval patterns the model needed at the time. A year later, a newer model handles those things natively. The extra scaffolding is no longer helping. It has become dead weight that adds latency and cost.

This means harness engineering is not a one-time build. It is ongoing maintenance against a moving capability baseline. Teams that understand this treat their harnesses the way they treat any software: version-controlled, tested, and revisited when the underlying model changes.
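
One lightweight way to treat a harness like software is to stamp every result with a fingerprint of the harness configuration and the model it ran against. The sketch below assumes the configuration is a JSON-serializable dict and that the model is identified by a string; both are assumptions for illustration.

```python
import hashlib
import json


def harness_fingerprint(config: dict, model_id: str) -> str:
    """Short, stable hash of the harness configuration plus the model identifier."""
    payload = json.dumps({"config": config, "model": model_id}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def needs_revalidation(stored_fingerprint: str, config: dict, model_id: str) -> bool:
    """True when the rules or the model have changed since a result was produced."""
    return stored_fingerprint != harness_fingerprint(config, model_id)
```

Stored alongside each rating, the fingerprint shows which version of the logic produced it, and a model upgrade automatically marks earlier results as candidates for re-testing.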

How to Operationalize AI in Supply Chain Sustainability

If you work in supply chain sustainability, due diligence, or product compliance and you are leading AI development, you are probably already sitting on the most valuable raw material for a well-engineered harness.

You know which regulations apply under which conditions. You know what evidence is acceptable for what purpose. You know where data is reliably structured and where it always arrives messy. You know what a risk threshold should be and what triggers escalation. You know what an auditor will ask for.

That is not supplementary context for a model. It is the harness, waiting to be built.

The teams that will build the most reliable AI systems in this space are not the ones waiting for a smarter model. They are the ones translating their operational expertise into structured, auditable, human-controlled workflows, and then putting AI to work inside those workflows, doing exactly what it is genuinely good at.

The model is not the system. The workflow around it is.

How IntegrityNext Applies This in Practice

This is the design philosophy behind how we build at IntegrityNext. Take AI Screening as one example. The work is not "ask a model what it thinks about a supplier." It is a structured workflow: identify the correct legal entity first, retrieve evidence from public sources within a constrained frame, distinguish strong evidence from partial evidence from no evidence found, and surface every result with the underlying question, the source quote, and a link back to the original material.

A compliance analyst can see exactly why a supplier was flagged, what was found, and what was not — and decide what to do next. The same harness logic runs through supply chain mapping, continuous risk monitoring, supplier data collection, and regulatory tracking across CSDDD, CSRD, CBAM, and LkSG. The point is not the feature list. The point is that AI operates inside a controlled, auditable workflow rather than next to one. That is what makes the output usable six months later when an auditor asks where a rating came from.

The model is not the system. The workflow around it is.