May 27 / James Kavanagh

How I Fix Broken AI Governance

A practical approach to identifying the causes of governance problems and then designing the right fixes
There's a story about Taiichi Ohno, the engineer who built the Toyota Production System. I don't know if it's true or not, but I like it. Apparently, he used to draw a chalk circle on the factory floor and make a manager stand in it. Then he'd leave, sometimes for hours. And the manager's job was just to watch. No clipboard, no checklist, no framework, nothing. Just observe how work actually flows, where things get stuck, where people have invented workarounds that aren't in any process document, and where the procedure so carefully drawn out and posted on the wall bears no resemblance to what's actually happening in front of them.
Ohno's point was that you can't improve a system you haven't observed. And that what you observe won't match what you're told.
If you're entering the AI governance field as a career, you might imagine your work will be designing an elegant governance framework from the ground up. Or perhaps auditing sophisticated AI management systems against well-defined standards. Or perhaps navigating obscure procedures of legislation like the EU AI Act to classify a system as "high-risk" whether or not it poses any risk at all. Or designing and implementing new technical controls for evaluation of models and guardrails for their safe execution. Maybe you will. 
But more likely, your experience will be quite different from all of these, and honestly much more interesting and meaningful. Because if you stop, if you stand in the chalk circle, and observe, what you'll usually find is a bit of a mess and a puzzle. You'll see things that sort of work, things that were designed for different purposes, things that exist on paper but not in practice, and things that do happen in practice but not the way they exist on paper. And your job becomes figuring out how to get it all working effectively, bit by bit. I find that's the real practical skill of AI governance. It doesn't start with mapping the EU AI Act, issuing an audit report, implementing a GRC tool or writing a Pulitzer Prize-winning policy. It starts with standing in the circle and watching.

Approaches to diagnosis

Now there are plenty of approaches out there for assessing governance readiness and identifying gaps. I think I've applied or seen most of them in action, and some of them have merit. But I've also watched a lot of them fail to produce the kind of insight that leads to genuine improvement. And I've seen many do little more than add to the pile of documents that can best be described as governance theatre. 
Let me go through a few. Maturity models are everywhere. You rate each capability 1-5 across dimensions, produce a spider chart, present it to the board. Executives love them because they're visual and they track year-on-year. Bonus points if you draw sparklines on the spreadsheet. The problem is that a "Level 3" in incident management tells you nothing about what's actually broken. You get a score without a diagnosis. It's like a doctor telling you your health is a 6 out of 10 and sending you home.
Clause-by-clause gap analysis against a standard is probably the most common method. Take ISO 42001, go through every clause, mark each as conformant, partially conformant, or non-conformant. This seems rigorous and it has its place. But it operates at the control level and only superficially at that, assessing whether a control is claimed to be in place or not, whether there are some sampled artifacts as 'evidence'. It doesn't really get at the work. You might mark incident management as "partially conformant" but that finding doesn't tell you whether the tooling is wrong, the ownership is missing, or the culture is resistant.
GRC platform-driven approaches give you a dashboard and a compliance tracker. You import the control framework, assign owners, click through a few wizards, collect evidence. Present the sparkline-infused dashboard like magic. The tool structures the work, and that has value. But it doesn't teach anyone how to diagnose whether a control is actually functioning. You end up tracking whether someone uploaded a document, or ticked a box, not whether a mechanism is working to meaningfully improve outcomes.
Policy-first approaches are very common, especially when governance is driven by functions with good intent but little understanding or engagement of how work is really done, or how products are built and operated. Map all the controls, write all the policies, then throw them over the fence for others to figure out implementation. You end up with beautifully written policies that describe mechanisms that don't exist. And may never exist.
And then there's the questionnaire. Send surveys asking people about their practices, collect the responses, aggregate the findings. This brings you straight back to Ohno's chalk circle. People report what should happen, or what they believe happens. Not what they'd see if they stood in the circle and watched. 
Then there are much more technical and complex approaches: failure modes analysis, five-whys, threat modelling, red-teaming, simulation, safety cases, and even system-theoretic process analysis (if you want to bring the big guns). These have analytical power, they get at detail and some at least have deep roots in safety science and engineering. They're fun if you geek out on safety science. But they're specialist tools that require specialist skills, and most organisations adopting AI governance just don't have the expertise, time or the budget to deploy them broadly. They have their place, particularly for high-risk systems, and as a practitioner you want a variety of tools in your toolbox, but they're not where most governance practitioners will start their diagnostic work.
My honest assessment is that most of these partially work (except questionnaires - they're useless). Gap analysis and audit approaches are rigorous but stop at "there's a gap." Maturity models give you a communication tool for leadership but not a diagnostic tool for practitioners. GRC platforms give you tracking but not insight. Questionnaires give you self-reported opinions, not observed reality. Sophisticated technical approaches are expensive and narrow. And some of these unfortunately devolve into theatre, ticking boxes to satisfy a certification body or a board paper, without anyone asking whether the governance is meaningful.
What I've found missing is a way to explore, efficiently but systematically, the layer beneath the control. Not just "is this control implemented?" but "what's the implementation behind it, can it adapt to change, which of its components are functioning, which aren't, and what type of intervention does each gap need?"
That's what I want to walk through now. It's my method and it works for me. I was inspired to write it down after a conversation with some Australian risk professionals yesterday.

How thinking in adaptive mechanisms can help diagnose and fix your broken governance

I've written before about mechanisms, the closed-loop systems that make adaptive governance real and sustained. A mechanism senses something, triggers a decision, produces an action, generates an outcome, and feeds back so it can improve. Policies describe what you want to happen. Mechanisms make it happen. If you're not familiar with that distinction, it might be worth reading the earlier article before this one.
That article focused on the design of mechanisms. But before you can design mechanisms, you need to assess what already exists. And that assessment is a craft in itself. It's diagnostic work, and it takes time to develop a method that works for you. That method (or at least my method) is what I want to share in this article, because I think it's one of the skills that can help governance practitioners produce real value and avoid the trap of producing impressive-looking frameworks that don't create value or sustain.
Let me use incident management as a running example, because it's a domain every organisation has some version of, and because the gap between what organisations think they have and what they actually have is usually wide enough to demonstrate the task.
So imagine you're in a mid-sized company in an industry like education or healthcare, and they've been building and deploying some software systems for their own use and to sell. They've been operating for a few years. They have an IT incident process that handles production outages and security events. It works reasonably well for those purposes. They hold ISO 27001 certification, maybe SOC2, so the process was built to satisfy information security requirements. And they need those because, operating in health or education, they have sensitive personal data and compliance requirements to meet.
And now they're building AI into those systems, consuming AI from external service providers too. Most employees are using some AI tools in their daily work whether the company knows about it or not. And AI governance lands on the agenda. The EU AI Act requires serious incident reporting for high-risk systems. ISO 42001 requires nonconformity and corrective action processes. NIST AI RMF expects incident response and documentation. So whether you start with compliance mapping or with risk identification (and that's a topic for another article), you recognise that they need controls in, say, the incident management domain. And those controls need operational mechanisms behind them.
The temptation at this point is to start designing. Build the AI incident management framework, the incident handling standards, nominate incident owners and escalation processes, define regulatory notification steps. But that of course skips a step, and it's the step where I think most of the real work and value lives.
That step is the diagnostic. Before you design anything, you stand in the circle and find out what's actually there. The method I use asks a few questions in sequence. First: does a mechanism exist that's scoped to this control? Second: if it does, is it actually working? Third: if it's not working, or not working well, what specific issues exist? And finally: what are the right interventions to fix the mechanism? Let me walk through these four steps for you:

Step 1: Does a mechanism exist?

The first diagnostic question sounds simple. Is there a mechanism scoped to implement this control? But the answer is rarely a clean yes or no. In practice, you'll find one of four situations.
  • Nothing exists. The control is on paper, but there's no operational reality behind it. Nobody has built anything for real. This is a straightforward gap and the easiest to diagnose, though not necessarily the easiest to fix.
  • Informal practice. This is by far the most common finding, and it's the one that matters most to understand. People are doing things. Engineers respond to production issues. Someone handles client complaints about AI outputs. There are habits, Slack channels, people who get called. But none of it was designed as a governance mechanism. Instead, it just grew from operational necessity. It overlaps with what the control requires, but usually only partially and is often quite rigid.
  • A mechanism exists but its scope doesn't match. The organisation has intentionally built something, but it was built for different purposes. That IT incident process designed for ISO 27001 handles system outages and security breaches. It was never extended to cover AI-specific failure modes like bias drift, safety degradation, or adversarial misuse. So sure, the mechanism is real, but it just doesn't cover what the AI governance control requires.
  • A mechanism exists and it's properly scoped. Celebrate, go to step 2! Sadly, at least in my experience, this is the least common finding, especially when organisations first start governing AI seriously.
Back to our example company. What do you actually find when you look at their incident management? In this domain, probably some form of informal practice with partial coverage. Their engineers handle production outages competently. When something breaks, people mobilise, diagnose, fix, and move on. That works, within its scope.
But then you have to look at what it doesn't cover. Someone reports that the AI-powered university admissions tool consistently recommends less challenging courses to students from certain demographics. A clinician notices that a patient triage tool seems to deprioritise referrals from a particular postcode. These are likely AI incidents. They may be serious incidents under the EU AI Act's definition. But they don't look like outages, and they don't trigger monitoring alerts. They arrive as informal complaints and somebody just handles them informally, or not at all. You only see this by pausing and carefully watching. Standing in the chalk circle.
The scope mismatch is what's important. It's not that governance is absent. It's that what exists was never designed for the governance territory the control addresses. And this distinction matters enormously for what you do next, because extending something that works is very different from building from scratch.
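If it helps to see the Step 1 triage written down, here is a minimal sketch in Python. The three yes/no observations, the function name and the "next move" wording are made up for illustration; the four situations are the ones listed above. Treat it as a note-taking aid under those assumptions, not as a formal procedure.

```python
from enum import Enum


class Existence(Enum):
    """The four situations Step 1 can find for a given control."""
    NOTHING = "nothing exists"
    INFORMAL_PRACTICE = "informal practice"
    SCOPE_MISMATCH = "mechanism exists, scope doesn't match"
    PROPERLY_SCOPED = "mechanism exists and is properly scoped"


def classify_existence(designed: bool, practised: bool, covers_control: bool) -> Existence:
    """Map three observations (assumed for this sketch) to one of the four situations.

    designed:       someone intentionally built a mechanism, for whatever purpose
    practised:      people actually do something when the situation arises
    covers_control: what exists covers the territory this control addresses
    """
    if not practised:
        # A control that lives only on paper has no operational reality behind it.
        return Existence.NOTHING
    if not designed:
        return Existence.INFORMAL_PRACTICE
    return Existence.PROPERLY_SCOPED if covers_control else Existence.SCOPE_MISMATCH


# Illustrative next moves; the wording is an assumption, not a prescription.
NEXT_MOVE = {
    Existence.NOTHING: "build from scratch",
    Existence.INFORMAL_PRACTICE: "extend and formalise what people already do",
    Existence.SCOPE_MISMATCH: "extend the existing mechanism to the new territory",
    Existence.PROPERLY_SCOPED: "go to step 2",
}

# The ISO 27001 incident process: designed, in daily use, never extended to AI failure modes.
it_process = classify_existence(designed=True, practised=True, covers_control=False)
# Complaints about AI outputs: handled by habit, never designed as a mechanism.
ai_complaints = classify_existence(designed=False, practised=True, covers_control=False)

for finding in (it_process, ai_complaints):
    print(finding.value, "->", NEXT_MOVE[finding])
```

Note the ordering of the checks: a designed process that nobody practises is deliberately treated as "nothing exists", because paper alone is not a mechanism.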

Step 2: Is the mechanism working?

Once you've confirmed something exists and has at least partial scope, the second question is whether it's functioning effectively. This is where I go back to the seven mechanism components and they become useful as a diagnostic. Remember, a mechanism is a closed-loop system designed to achieve a specific governance objective, with inputs, outputs, tooling, ownership, adoption, inspection and continuous improvement. It needs all seven of those components to function.
So I walk through those seven and for each one, I'm asking myself two things: is this component designed appropriately for what the control requires, and is it operating as designed? Let me walk through all seven for our incident management example to give you an idea of how this works.
  • Inputs. These are about what triggers the mechanism and what information feeds it. There are likely existing channels that work for what they were built for. Monitoring detects system outages. Client reports come through support tickets. Engineers escalate production issues. But none of these channels were designed to detect AI-specific failure modes. Bias drift doesn't generate a monitoring alert. A safety safeguard degrading over months doesn't trigger an escalation. A data quality issue corrupting scores without producing errors looks like normal operation from the inside. The inputs that exist are ok. They just don't cover the territory the control addresses. That's a design issue.
  • Outputs. What the mechanism produces and who consumes it. So the informal process I described produces fixes. Something breaks, someone fixes it, everyone moves on. But you have to go back and look at what the governance controls require. Structured incident records with classification data. Root cause analysis docs. Notifications that contain the information Article 73 of the EU AI Act specifies for serious incidents. Corrective action plans that feed into improvement cycles. The informal process most likely produces none of this. And think about the knock on effects, like how the outputs of one mechanism become the inputs to other mechanisms downstream. If incident management doesn't produce structured corrective action plans, whatever mechanism is supposed to track corrective actions has nothing to consume. Poorly designed outputs in one mechanism starve every mechanism downstream. Design issue again.
  • Tooling. So let's say the company has a general IT ticketing system. It handles what it was built for. But look at what the AI governance controls need. Incident classification that distinguishes a bias-related incident from a service degradation. Templates that capture the fields a regulator requires. Workflow routing that sends a safety incident to a different response team than a performance glitch. The ticketing system isn't configured for this - it probably could be, it just hasn't been designed for it. Design issue. Now you have to play out why this matters beyond the tooling component itself. You see, the strongest form of adoption is architectural, where using the mechanism is easier than not using it. If the tooling required AI-specific classification before a ticket could progress, incidents would be classified correctly by default. But the tooling doesn't support that, which means architectural adoption for AI incidents is impossible. A design gap in tooling creates an adoption constraint. It's a design issue with knock-on effects on other components in the mechanism - so this is most likely an issue to fix first.
  • Ownership. Alright, let's assume an engineering lead handles incidents when they occur, and they're good. But nobody designed incident management ownership into the governance structure, so it's not in their objectives, they've no time allocated for it. They just use their personal relationships to pull together cross-functional teams for post-incident review. So there's an owner, but they're not fully equipped to properly own the mechanism. My ownership test is pretty simple: I think genuine ownership requires accountability (as in your performance is assessed on whether the mechanism works), authority (you can direct the resources and people it requires), and capacity (you have the time and bandwidth). Remove any one of these and ownership is nominal, not real. You guessed it, design gap.
  • Adoption. This is where something interesting happens. So for production outages, the informal process is adopted. The system is down, clients are affected, people mobilise. That works. But that doesn't work for AI incidents for the reasons I discussed under tooling. This selective adoption is one of the hardest patterns to diagnose because it looks like the mechanism is working. If you count tickets, the numbers look cool. The gap is in what never becomes a ticket. And the gap is both a design problem (the tooling doesn't support it, the inputs don't detect it) and an adaptive problem (nobody thinks of these observations as "incidents" in the first place). Think about it this way: if the tooling was fully capable, but people still wouldn't use it because they don't think a drifting AI recommendation is an incident, then that's not just a design issue. Both an operational and a design issue here.
  • Inspection. Ok, so clearly here we have a problem - without tooling and adoption, nobody can tell you how many AI-related incidents were detected. Nobody tracks whether detected anomalies actually became incident tickets. Nobody measures whether triage classifications were accurate, whether the same root causes are recurring, how long it takes from detection to resolution, or whether regulatory notification deadlines were met. There are no leading indicators telling you whether the mechanism is running. There are no lagging indicators telling you whether it succeeded. And without indicators, there are no targets to measure against. Design issue.
  • Continuous improvement. And this is where everything connects. The improvement cycle depends on inspection data. Indicators deviate from targets. Those deviations signal that something needs to change. The change could target any component: update the inputs, redesign the outputs, extend the tooling, strengthen adoption, tighten inspection itself. The mechanism owner does that inspection and drives the improvement continuously. They're not waiting for some line in an audit report once a year. They're continuously monitoring, reporting and adjusting or fixing. That's how a mechanism adapts over time. But without inspection, the improvement cycle has nothing to work with. No indicators means no deviations. No deviations means no signal. No signal means no improvement. The same categories of incident recur, and each time they're treated as new. The mechanism also can't absorb regulatory change. When a delegated act clarifies Article 73 reporting requirements, a mechanism with a functioning improvement cycle absorbs that as routine input. A mechanism without one requires a special project every time. Given the pace of regulatory development right now, that doesn't scale. Design gap.
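The link between inspection and continuous improvement in the last two bullets can be shown in a few lines. This is a rough sketch, assuming a simple indicator-versus-target comparison; the indicator names echo the questions raised under inspection, and every number is invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Indicator:
    name: str
    value: float
    target: float
    higher_is_better: bool = True

    def deviates(self) -> bool:
        """True when the observed value misses its target."""
        if self.higher_is_better:
            return self.value < self.target
        return self.value > self.target


def improvement_signals(indicators: list[Indicator]) -> list[str]:
    """Return the names of indicators that deviate from target."""
    return [i.name for i in indicators if i.deviates()]


# Invented figures, for illustration only.
indicators = [
    Indicator("share of detected anomalies that became tickets", 0.40, 0.95),
    Indicator("median days from detection to resolution", 21, 10, higher_is_better=False),
    Indicator("share of regulatory notifications sent on time", 1.0, 1.0),
]

print(improvement_signals(indicators))

# With no indicators at all the list is empty: no deviations, so no signal.
print(improvement_signals([]))
```

The second call is the uncomfortable part: from the outside, a mechanism with no inspection produces the same empty list as one where everything is on target.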
Now when I'm finished mentally working through all those seven components, I step back from the component-by-component findings and look at the profile as a whole. In this case, almost every finding is a design gap. Some components have operational presence but are designed for the wrong territory. Others were never designed at all. The people may be competent, sometimes through heroics it works. But the channels that exist function within their designed scope and it's just not designed for the governance territory.
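One way to hold that profile is as a small table of findings, one per component, each answering the two questions from the walk-through. The sketch below is an assumption about how you might record it, not a required format; the field names are made up, and the entries mirror the example findings above.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

COMPONENTS = ("inputs", "outputs", "tooling", "ownership",
              "adoption", "inspection", "continuous improvement")


@dataclass(frozen=True)
class Finding:
    component: str
    designed_for_control: bool             # designed appropriately for what the control requires?
    operating_as_designed: Optional[bool]  # None when there is no design to operate

    def gaps(self) -> list[str]:
        """List the kinds of gap this finding represents."""
        found = []
        if not self.designed_for_control:
            found.append("design")
        if self.operating_as_designed is False:
            found.append("operational")
        return found


def profile(findings: list[Finding]) -> Counter:
    """Count design and operational gaps across all components."""
    return Counter(gap for f in findings for gap in f.gaps())


# The incident management example, as diagnosed above.
findings = [
    Finding("inputs", False, True),        # channels work, but not for AI failure modes
    Finding("outputs", False, True),       # produces fixes, not structured records
    Finding("tooling", False, True),       # ticketing works, no AI classification
    Finding("ownership", False, True),     # a capable owner, not formally equipped
    Finding("adoption", False, False),     # selective: both kinds of gap
    Finding("inspection", False, None),    # never designed
    Finding("continuous improvement", False, None),  # nothing to work with
]

print(profile(findings))
```

Seeing seven design gaps against one operational gap is what tells you the people are mostly fine and the design is the problem.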
I can't stand vague "needs improvement" findings. Even when they're specific, they're often cherry-picked or almost random in what they pick up, and they usually just state the control gap without any clear view on what specific change is needed. But with this method, at least you now have a list of specific component-level issues that are hopefully actionable. In our example, it might tell you that inputs need extending to detect AI-specific failure modes, that outputs need restructuring for multiple consumers, that tooling needs AI-specific classification, that ownership needs formalisation with authority, that adoption has both a design problem and an operational training problem, that inspection needs to be built from the ground up, and that continuous improvement has no signal to work with until inspection exists. Each of those points to a different kind of work. Which brings me to step 3 - what kind of issue is it?

Step 3: What kind of issue are we seeing?

And that brings us to what I think is maybe the most important diagnostic skill of all: correctly classifying the type of issue we see, which in turn will drive the right kind of intervention each issue requires. I see this go wrong all the time. 
Whether it's a design issue or an operational issue, you'll see this happen. Some issues are resolved through technical intervention. You need tooling that doesn't exist yet. You need monitoring coverage that hasn't been configured. You need a template that captures the fields a regulator requires. You need workflow routing. These are well-defined problems with known or knowable solutions. You need to specify the issue, specify the solution, then just execute.
Other gaps require adaptive intervention. They need people to change how they think about the work. Adoption is selective because people don't see AI-specific incidents as "real" incidents. That's a mental model problem. Ownership is informal because nobody wants accountability for a process that crosses functional boundaries. That's an organisational politics problem. Post-incident reviews don't happen because the culture treats reporting as blame rather than learning. That's a values problem. No amount of tooling or process design fixes any of these issues.
Applying a technical fix to an adaptive issue is the single most common mistake in compliance improvement (at least in my distinctly biased, personal experience). You buy the platform. Nobody uses it. The dashboards go unread. Six months later someone asks why incident management hasn't improved, and the answer is that the problem was never the tool. The problem was that nobody believed AI-specific incidents warranted the same response as system outages. There's no procurement decision or tool implementation that solves that. 
Then the reverse is equally expensive. Throwing training at a tooling problem wastes months and builds frustration. You can run workshops on incident classification all year, but if the ticketing system has no fields for AI-specific types, or forces a cumbersome multi-step workflow in a different system, the people you've trained will struggle to apply what they've learned.
Most mechanisms have both kinds of gap. The diagnostic skill is matching the right intervention to each one.
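To make the matching step concrete, here is a small sketch that checks whether a proposed intervention is the same type as the issue it is aimed at. The catalogue of interventions is hypothetical and deliberately tiny; in practice the classification is a judgment call, not a lookup.

```python
from enum import Enum


class IssueType(Enum):
    TECHNICAL = "technical"
    ADAPTIVE = "adaptive"


# Hypothetical catalogue: which type of issue each intervention is suited to.
SUITED_TO = {
    "configure ticket fields": IssueType.TECHNICAL,
    "buy a platform": IssueType.TECHNICAL,
    "run classification workshops": IssueType.ADAPTIVE,
    "reframe what counts as an incident": IssueType.ADAPTIVE,
}


def check_match(issue_type: IssueType, intervention: str) -> str:
    """Flag an intervention whose type differs from the issue it targets."""
    fix_type = SUITED_TO[intervention]
    if fix_type is issue_type:
        return "match"
    return f"mismatch: {fix_type.value} fix applied to {issue_type.value} issue"


# The two expensive mistakes described above, then a sensible pairing.
print(check_match(IssueType.ADAPTIVE, "buy a platform"))
print(check_match(IssueType.TECHNICAL, "run classification workshops"))
print(check_match(IssueType.TECHNICAL, "configure ticket fields"))
```

The value is less in the code than in the habit it forces: you have to write down the issue type before you are allowed to pick the fix.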

Step 4: Plan and do your interventions

So at this stage, you have a big long list of specific issues. They form a component-level map, each classified by type, each pointing to a specific kind of intervention. The mechanism owner (if they exist) owns all those issues. The natural instinct at this point is to design some form of transformation program. A big piece of work with a project plan, a steering committee, a timeline measured in quarters, and a target state that represents "done." Looks pretty, consultants love it, and ... it's mostly pointless (IMHO :) ). 
Over and over again, I've seen far more governance improvement come from a Kaizen-style approach: lots of small, sequenced changes rather than one large program. 
Kaizen
The philosophy of making small incremental changes to enhance efficiency, quality, and overall effectiveness over time in both personal and professional aspects of life.
This is Ohno's legacy again. The Toyota Production System advocates against a single transformation program. It works on the idea of thousands of small improvements made by the people doing the work, each one informed by what they observed in the chalk circle. Governance improvement works the same way. Add AI-specific classification fields to the ticketing system this sprint. Update the triage criteria next week. Run a single tabletop exercise with the response team the week after. Formalise the ownership role in next month's objectives cycle.
Each of these changes is small enough to implement without a steering committee. Each is visible enough to build momentum. And each is owned by the person who actually runs the mechanism, not by a central governance team designing from above.
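As an example of how small one of these changes can be, here is a sketch of the first one: an AI-specific classification that a ticket must carry before it can progress. It is not tied to any real ticketing product. The `Ticket` class, the `progress` function and the category list are assumptions for illustration; the categories borrow the failure modes mentioned earlier, plus a "not AI-related" option that is added here so ordinary outages can still pass.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed category list; a real one would be the organisation's own.
AI_CATEGORIES = {
    "bias drift",
    "safety degradation",
    "adversarial misuse",
    "data quality",
    "not AI-related",
}


@dataclass
class Ticket:
    title: str
    status: str = "new"
    ai_category: Optional[str] = None


def progress(ticket: Ticket, new_status: str) -> Ticket:
    """Move a ticket on, but only once it carries a valid AI classification."""
    if ticket.status == "new" and ticket.ai_category not in AI_CATEGORIES:
        raise ValueError("classify the ticket before progressing it")
    ticket.status = new_status
    return ticket


ticket = Ticket("Triage tool deprioritises referrals from one postcode")

try:
    progress(ticket, "triage")
except ValueError as err:
    print("blocked:", err)

ticket.ai_category = "bias drift"
print(progress(ticket, "triage").status)
```

This is what architectural adoption looks like in miniature: classifying the incident becomes the only way forward, so nobody has to remember to do it.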
I think this really matters for two reasons. First, it's faster. A transformation program takes months to scope, approve, and staff before any actual governance improvement happens. Small changes start producing results immediately. You learn what works and what doesn't in days and weeks rather than quarters.
Second, and more importantly, it puts improvement in the hands of the mechanism owners. The engineering lead who handles incidents, the data science team lead, the operations manager. These are the people who understand the operational reality. They know where the friction is. They know which changes will be adopted and which will be ignored. When they own the improvement, the changes stick. When a central team designs improvements and hands them over, the adoption issue we just diagnosed will show up all over again in a different form.
Don't get me wrong. This doesn't mean there's no coordination. Someone needs to hold the diagnostic map, track which gaps are being addressed, and make sure the sequencing makes sense. Someone needs to manage budget. Technical fixes that enable architectural adoption should usually come early, because they reduce the adaptive burden. Adaptive work starts in parallel, because cultural shifts and mental model changes take longer than platform configurations. Inspection gets established as soon as possible, even in rudimentary form, because it provides the data that tells you whether your interventions are actually working.
But the unit of improvement is really small. As small as possible and the cadence is continuous. And the people closest to the work drive the changes. That's how governance improvement scales in practice.
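The sequencing guidance above can be sketched as a simple ordering rule over a backlog of small changes. This is a toy under stated assumptions: the `Intervention` fields and the two-track split are made up for illustration, and the backlog items are the small changes mentioned earlier plus one hypothetical inspection item.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intervention:
    name: str
    kind: str                       # "technical" or "adaptive"
    enables_adoption: bool = False  # technical fix that makes the mechanism the easy path
    is_inspection: bool = False     # establishes indicators, even rudimentary ones


def plan(backlog: list[Intervention]) -> dict[str, list[Intervention]]:
    """Split the backlog into a parallel adaptive track and an ordered technical track."""
    adaptive = [i for i in backlog if i.kind == "adaptive"]
    technical = [i for i in backlog if i.kind == "technical"]
    # Stable sort: adoption-enabling and inspection items move to the front,
    # everything else keeps the order it was given in.
    technical.sort(key=lambda i: 0 if (i.enables_adoption or i.is_inspection) else 1)
    return {"adaptive track": adaptive, "technical track": technical}


backlog = [
    Intervention("update the triage criteria", "technical"),
    Intervention("run a tabletop exercise with the response team", "adaptive"),
    Intervention("add AI-specific classification fields", "technical", enables_adoption=True),
    Intervention("count detected anomalies that became tickets", "technical", is_inspection=True),
    Intervention("formalise the ownership role", "adaptive"),
]

for track, items in plan(backlog).items():
    print(track)
    for item in items:
        print("  -", item.name)
```

Whoever holds the diagnostic map can keep a list like this; the changes themselves still belong to the mechanism owners.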

Wrapping Up

So that's my method, hope it helps to share it. Genuinely, I think the real work of AI governance starts in the chalk circle. You stand in it. You observe what's actually happening, not what the policy says should happen, or the textbook said would happen. Then diagnose: what exists, how well it's scoped, where the components are strong and where they fall short, whether each issue needs a technical fix or adaptive work. And then improve, in really small continuous increments owned by the people who run the mechanisms.
It's not always glamorous work. There's no moment where you unveil the elegant framework or the pretty dashboard with traffic lights and sparklines to rapturous applause. But it's the work that produces governance that works, and that keeps functioning as the organisation, the technology, and the regulatory landscape just keep on changing around it.
This kind of diagnostic thinking is central to what I teach in the AI Governance Practitioner Program, and especially in the upcoming Compliance and Risk speciality courses. That's not because I think everyone needs my specific approach. Practitioners create their own craft. But I think the practice of assessing and improving governance mechanisms is the skill that lets practitioners enjoy making a real difference rather than succumb to the frustration of governance theatre. None of us enjoy that particular show.
I'm always keen to learn from other people's methods and ideas. If you have thoughts or your own approaches that work, please share them. Makes us all better practitioners.
You can always find out more about our AI Governance Practitioner Program here, and you're always welcome to join if you'd like to learn these practices in depth.