Why AI Agents Approve Their Own Bad Work (And How to Fix It)

Anthropic just admitted that Claude approves its own mediocre output. Their fix — borrowed from GANs — separates the agent doing work from the agent judging it. Here's how adversarial evaluation changes everything for agent systems.

March 26, 2026

Anthropic published something uncomfortably honest in March 2026: if you ask an AI agent to evaluate its own work, it approves it. Confidently. Almost every time. Even when the output is mediocre.

This was not a hypothetical finding. They tested it. They built agent harnesses that generated websites, games, and full applications — then asked the same model to judge the quality. The model found issues, acknowledged them, then talked itself out of flagging them. It declared tasks done prematurely. It tested superficially instead of probing edge cases.

If you are building agent systems and relying on self-evaluation for quality control, your quality control is broken. Here is why, and what actually works.

The Self-Evaluation Problem

The failure mode is not that the model lacks critical ability. It can identify problems. The failure mode is that it identifies problems and then rationalizes them away. This is not a prompting problem. It is structural.

When an agent checks its own work, three things happen consistently:

It finds issues and dismisses them. The model identifies a legitimate defect — a broken edge case, a missing interaction, a layout regression — then generates a justification for why it is acceptable. "This is a minor visual issue that does not affect core functionality." The issue is real. The dismissal is fabricated.

It tests the happy path and stops. Self-evaluation gravitates toward confirming that the main flow works. It does not probe the boundaries — what happens with empty inputs, concurrent state changes, viewport extremes, or malformed data. The model treats "it works in the obvious case" as "it works."

It rushes to finish as context fills. Anthropic documented a pattern they call context anxiety. As the model's context window fills up, it starts racing toward task completion. It shortcuts evaluation. It declares "done" with decreasing rigor. The longer the task runs, the less trustworthy the self-assessment becomes.

INSIGHT

You cannot be both the creator and the objective critic. This is less a limitation of any one model than a structural conflict: the reasoning process that generated the work carries the same blind spots into evaluating it. The biases that shaped creation also shape evaluation.

None of this is fixable with better prompts. You can add "be extremely critical" to the system prompt and the model will be slightly more critical for a few turns before regressing to self-approval. The problem is architectural, and the fix has to be architectural.

The GAN-Inspired Fix: Adversarial Evaluation

Anthropic's solution borrows from a pattern that has been productive in machine learning for a decade: the Generative Adversarial Network. In a GAN, two networks exist in tension — a generator that creates outputs and a discriminator that judges them. Neither network can improve without the other. The tension between them drives quality upward.

Applied to agent harnesses, the pattern becomes adversarial evaluation: separate the agent doing the work from the agent judging it. Not as an afterthought. Not as an optional review step. As a core loop in the harness architecture.

The generator agent creates the work — writes the code, builds the UI, produces the output. The evaluator agent receives the output with no access to the generator's reasoning chain, no shared context about what was "intended," and no incentive to approve. Its job is to find problems. The generator receives the evaluator's critique and iterates. This loop repeats until the evaluator's criteria are met.
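The loop described above can be sketched in a few lines. This is a minimal illustration, not Anthropic's harness: the `Generator` and `Evaluator` signatures, the `Verdict` type, and the toy stand-in functions are all assumptions made so the example runs without any model API.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Verdict:
    passed: bool
    critique: str  # what the evaluator wants fixed; empty when passed


# Hypothetical signatures: in a real harness these would wrap model calls.
Generator = Callable[[str, Optional[str]], str]  # (task, critique) -> output
Evaluator = Callable[[str, str], Verdict]        # (task, output) -> verdict


def adversarial_loop(task: str, generate: Generator, evaluate: Evaluator,
                     max_rounds: int = 10) -> Tuple[str, List[Verdict]]:
    """Iterate until the evaluator passes the output or rounds run out."""
    history: List[Verdict] = []
    critique: Optional[str] = None
    output = ""
    for _ in range(max_rounds):
        output = generate(task, critique)
        # The evaluator sees only the task and the artifact, never the
        # generator's reasoning chain.
        verdict = evaluate(task, output)
        history.append(verdict)
        if verdict.passed:
            break
        critique = verdict.critique
    return output, history


# Toy stand-ins so the sketch runs end to end.
def toy_generate(task: str, critique: Optional[str]) -> str:
    return "draft" if critique is None else "draft + empty-input handling"


def toy_evaluate(task: str, output: str) -> Verdict:
    if "empty-input" in output:
        return Verdict(True, "")
    return Verdict(False, "Form crashes on empty input.")


final, rounds = adversarial_loop("build a signup form", toy_generate, toy_evaluate)
print(final, len(rounds))
```

The design choice that matters is the narrow interface: the only thing that crosses from generator to evaluator is the output, and the only thing that crosses back is the critique.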

The concept is not new. Multi-agent systems and LLM-as-judge patterns have existed for years. What Anthropic did differently is wire this into a production loop with structured criteria, weighted scoring, and interactive evaluation — not ad hoc "have another model review it" passes.

Three Requirements for an Effective Evaluator Agent

Anthropic's experiments revealed that a naive evaluator — "look at this output and tell me if it's good" — barely outperforms self-evaluation. The evaluator needs structure. Three requirements emerged.

Make Subjective Quality Gradable

"Is this good?" is not an evaluable question. The evaluator needs rubrics with specific, scoreable dimensions. Anthropic used four: design quality, originality, craft (technical execution), and functionality. Each dimension gets a numeric score. Each has explicit criteria for what constitutes a low, medium, and high score.

This transforms evaluation from a vibes-based assessment into a structured grading exercise. The model is no longer deciding whether something is "good enough" — it is scoring specific dimensions against defined criteria. The aggregated score determines pass/fail, not the model's gestalt impression.

Evaluation dimensions: 4 (design quality, originality, craft, functionality), each scored and weighted independently
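A graded rubric of this kind might look like the following sketch. The four dimension names come from the article; the 1-10 scale, the equal weights, the 7.5 threshold, and the `Rubric` class itself are illustrative assumptions, not values Anthropic published.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Rubric:
    weights: Dict[str, float]  # relative importance of each dimension
    pass_threshold: float      # minimum weighted average, assumed 1-10 scale

    def aggregate(self, scores: Dict[str, float]) -> float:
        """Weighted average of per-dimension scores."""
        missing = set(self.weights) - set(scores)
        if missing:
            raise ValueError(f"unscored dimensions: {sorted(missing)}")
        total = sum(self.weights.values())
        return sum(w * scores[d] for d, w in self.weights.items()) / total

    def passes(self, scores: Dict[str, float]) -> bool:
        # Pass/fail comes from the number, not from an overall impression.
        return self.aggregate(scores) >= self.pass_threshold


# Hypothetical weights and threshold, chosen only for illustration.
rubric = Rubric(
    weights={"design_quality": 1.0, "originality": 1.0,
             "craft": 1.0, "functionality": 1.0},
    pass_threshold=7.5,
)
scores = {"design_quality": 6, "originality": 4, "craft": 9, "functionality": 9}
print(rubric.aggregate(scores), rubric.passes(scores))
```

Here strong craft and functionality scores cannot carry a weak originality score over the bar, which is the behavior a gestalt "looks fine" judgment tends to lose.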

Weight Criteria Toward Model Weaknesses

This is where the approach gets sharp. Anthropic found that Opus scored well on two of the four criteria but consistently underperformed on the other two. Their response: weight the weak criteria more heavily.

If the model reliably produces functional code but generates derivative designs, weight originality higher than functionality. If craft execution is strong but design quality is mediocre, weight design quality up. The rubric compensates for the model's known slop patterns — the failure modes that self-evaluation would rationalize away.

This requires knowing your model's weaknesses, which means running evaluation rounds and tracking where scores cluster. You cannot weight criteria effectively without empirical data on where the model underperforms.
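One way to turn tracked scores into weights is sketched below. The heuristic (weight each dimension by its average shortfall from the maximum score, plus a floor) is an assumption made for illustration, not the method Anthropic describes, and the score history is made-up sample data.

```python
from statistics import mean
from typing import Dict, List

MAX_SCORE = 10.0  # assumed top of the scoring scale


def weights_from_history(history: List[Dict[str, float]],
                         floor: float = 1.0) -> Dict[str, float]:
    """Weight each dimension by its average shortfall from MAX_SCORE.

    `floor` keeps consistently strong dimensions from dropping to zero weight.
    The returned weights are normalized to sum to 1.
    """
    if not history:
        raise ValueError("need at least one scored round")
    dims = list(history[0].keys())
    raw = {d: floor + (MAX_SCORE - mean(r[d] for r in history)) for d in dims}
    total = sum(raw.values())
    return {d: w / total for d, w in raw.items()}


# Made-up evaluation rounds: originality clusters low, craft clusters high.
history = [
    {"design_quality": 6, "originality": 3, "craft": 9, "functionality": 9},
    {"design_quality": 6, "originality": 5, "craft": 9, "functionality": 9},
]
weights = weights_from_history(history)
print(weights)
```

With this sample history, originality ends up with the largest weight and craft and functionality with the smallest, so the rubric pushes hardest where the scores cluster lowest.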

Let the Evaluator Interact With the Output

Reading code is not evaluation. Reading a screenshot is not evaluation. The evaluator needs to interact with the output the way a user would.

Anthropic equipped their evaluator agent with Playwright MCP tools — it could navigate the generated website, click buttons, fill forms, screenshot states, and test responsive behavior. The evaluator was not reading a description of the output or scanning static code. It was using the output.

This is the difference between a code review and a QA test. Code reviews catch structural issues. QA catches experiential issues. An evaluator that can only read code is doing code review. An evaluator that can interact with the running output is doing QA. You need both, and most agent evaluation pipelines do neither well.
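The same idea can be shown without a browser. The sketch below swaps Playwright for plain function calls so it stays self-contained: the evaluator runs the artifact against edge-case inputs instead of reading its source. The `parse_quantity` function, the probe list, and the acceptance checks are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Finding:
    probe: str
    detail: str


def parse_quantity(text: str) -> int:
    """Hypothetical generator output under test: parse a quantity form field."""
    return int(text)


# Each probe is (label, input, check). The check returns True when the
# observed behavior is acceptable. These criteria are illustrative.
Probe = Tuple[str, str, Callable[[Any], bool]]
PROBES: List[Probe] = [
    ("happy path", "3", lambda r: r == 3),
    ("surrounding whitespace", " 3 ", lambda r: r == 3),
    ("empty input", "", lambda r: r == 0),
    ("malformed input", "three", lambda r: r is None),
]


def probe(artifact: Callable[[str], Any], probes: List[Probe]) -> List[Finding]:
    """Exercise the artifact and record every probe it fails."""
    findings: List[Finding] = []
    for label, value, check in probes:
        try:
            result = artifact(value)
        except Exception as exc:  # an unhandled crash is itself a finding
            findings.append(Finding(label, f"raised {type(exc).__name__}: {exc}"))
            continue
        if not check(result):
            findings.append(Finding(label, f"unexpected result {result!r}"))
    return findings


findings = probe(parse_quantity, PROBES)
for f in findings:
    print(f.probe, "->", f.detail)
```

A read-only review of `parse_quantity` could easily wave it through, since the happy path works. Running it surfaces the crashes on empty and malformed input, which are exactly the boundary cases self-evaluation tends to skip.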

The Results

The impact is not marginal.

Anthropic ran 10 rounds of adversarial feedback on a Dutch art museum website. By round 10, the generator model made a creative leap to a 3D room-based design — something that had never emerged from single-pass generation. The iterative pressure from the evaluator pushed the model past its default patterns into genuinely novel output.

For a more complex task, building a full digital audio workstation (DAW), the numbers are stark. The solo harness (no evaluator) produced a DAW that did not work. The full harness with adversarial evaluation took approximately 4 hours and $125 in API costs, and produced a functional, playable application.

Solo vs. adversarial: broken vs. functional (same model, same task, same budget; the only variable was the evaluator agent in the loop)

The evaluator did not write better code. It forced the generator to iterate past the point where self-evaluation would have declared "done." That distinction matters. The generator had the capability to produce functional output the entire time. Without external pressure, it stopped too early.

Beyond Coding: Where This Applies

The self-evaluation failure is not specific to code generation. It manifests everywhere an agent produces and judges its own output.

Trading signal validation. A signal generator that also validates its own confidence scores will overfit. The same reasoning that produced a high-conviction signal will produce justification for why that conviction is warranted. Separating the scorer from the validator — having an independent agent probe the signal's assumptions, check for contradictory data, and stress-test the thesis — catches the false confidence that self-validation misses. This is the same separation of collection and analysis that Tesseract Intelligence applies to competitive intelligence: the unit gathering information should never be the same unit interpreting it.

Content pipelines. The agent writing the article should not grade its own quality. An evaluator with rubrics for originality, depth, audience relevance, and factual grounding catches the AI slop that self-review misses. The most common self-evaluation failure in content: "This covers the topic adequately" — which is another way of saying "this is generic and derivative, but I cannot see that because I generated it."

Code audits. Anthropic themselves documented that Claude identifies security issues then talks itself out of flagging them as critical. A dedicated evaluator agent with an explicit system prompt — "your job is to find problems, not confirm quality" — changes the dynamic. The evaluator has no investment in the code being correct. Its incentive structure is pure: find defects, grade severity, report.

The Meta-Insight: Harness Evolution

Here is the part that most people building agent systems will miss: every harness component encodes an assumption about model limitations, and those assumptions have a shelf life.

Anthropic found that context resets — periodically clearing the agent's context and restarting with a summary — were necessary for Sonnet 4.5 but unnecessary for Opus 4.6. The newer model handled long contexts without the degradation that required the workaround. A harness optimized for Sonnet 4.5's limitations would carry dead weight when running Opus 4.6.

This is harness evolution: the engineering discipline of knowing which assumptions in your agent harness are still load-bearing and which have become dead weight as models improve. Evaluators matter more when you are stretching model limits — asking for creative output, complex multi-step reasoning, or novel problem-solving. They matter less when the task is squarely in the model's wheelhouse. Over-engineering the harness for a task the model handles natively wastes compute and time.
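One way to check which assumptions are still load-bearing is a simple ablation: disable one harness component at a time and compare the evaluation score against the full harness. The sketch below assumes a hypothetical `run_suite` callable that scores a harness configuration; the component names, the tolerance, and the toy scoring numbers are placeholders, not measurements.

```python
from typing import Callable, Dict, FrozenSet, List

Scorer = Callable[[FrozenSet[str]], float]  # enabled components -> suite score


def ablate(components: List[str], run_suite: Scorer,
           tolerance: float = 0.1) -> Dict[str, str]:
    """Disable one component at a time and compare against the full harness."""
    baseline = run_suite(frozenset(components))
    report: Dict[str, str] = {}
    for name in components:
        score = run_suite(frozenset(c for c in components if c != name))
        drop = baseline - score
        report[name] = "load-bearing" if drop > tolerance else "candidate for removal"
    return report


def toy_suite(enabled: FrozenSet[str]) -> float:
    # Placeholder numbers for illustration only, not real results.
    score = 5.0
    if "evaluator" in enabled:
        score += 3.0
    # "context_reset" deliberately contributes nothing in this toy.
    return score


report = ablate(["evaluator", "context_reset"], toy_suite)
print(report)
```

In practice `run_suite` would be an expensive evaluation run against a specific model, so the ablation is something to repeat when the underlying model changes rather than on every task.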

The real skill is not building a harness. It is maintaining one — pruning assumptions that models have outgrown, adding structure where models still fail, and knowing the difference. The InDecision Framework applies the same principle to decision-making under uncertainty: every assumption must be periodically stress-tested against current reality, not treated as permanently valid.

INSIGHT

A harness is a theory about what the model cannot do reliably on its own. Like any theory, it needs to be tested against new evidence — and discarded when the evidence no longer supports it.

Practical Takeaway

If you are building agent systems — whether for code generation, content, trading, or any domain where output quality matters — here is the checklist:

Separate generation from evaluation. Always. The agent producing work should never be the agent judging work. This is the single highest-leverage architectural change you can make.

Define graded rubrics, not vibes-based assessment. "Is this good?" is not a rubric. Specific dimensions, numeric scores, explicit criteria for each score level. Make quality measurable.

Weight your rubrics toward known model weaknesses. Run evaluation rounds, track where scores cluster, and weight the dimensions where the model consistently underperforms. The rubric should compensate for the model's blind spots.

Give evaluators tools to interact with outputs, not just read them. Playwright, browser automation, API testing, actual user flows. An evaluator that can only read code is doing half the job.

Revisit your harness assumptions as models improve. What Sonnet 4.5 needed, Opus 4.6 may not. What Opus 4.6 needs, the next generation may not. Dead assumptions in your harness are not harmless — they add latency, cost, and complexity with zero quality return.

The models are getting better. But "better" does not mean "self-aware about their own quality." The self-evaluation gap is structural, not capability-based. Adversarial evaluation is not a temporary workaround — it is a permanent architectural pattern for any system where output quality is non-negotiable.
