EVALUATING EVALUATOR

Under what conditions does delegating thinking to an LLM genuinely extend and improve human judgment, and when does it replace it with AI slop?

This month we tried to answer a question that has become trickier as the tools have gotten better. The core takeaway was that the bottleneck on producing good work has shifted from production to evaluation.

The good beneath the surface: LLMs excel at adjacent thinking, sense‑making, and structured expansion of what you already know, but they are poor investigators in domains where you lack base knowledge.

The diagnostic lives in the process, and the only part of the system that can carry the consequence of the work is the evaluator.

Takeaways: Lessons learned during the session

From Theory:

Establish base knowledge before reaching for the tool: Use the tool to organize what you already know, not to decide on what you don't.
Read the distribution, not the response: Convergence across runs suggests stable ground. Divergence suggests the model is entering a domain where it has no settled position.
Look for contradiction, not confirmation: The places where outputs diverge are the places that need your judgment most.
The evaluator carries the consequence: Design the review process around what the artefact cannot show.

From Exchange:

Open with personal examples before structural arguments: The order matters as much as the content.
Choose a question that requires the group to find the answer together: If the answer is retrievable from individual reflection, the session will be confirmation rather than discovery.

Evaluating Evaluator

We opened by looking at calculators entering classrooms in the 1970s, when mathematical reasoning was expected to atrophy and estimation was supposed to disappear. The fear was not fully justified. The research at the time showed that students who used calculators without first developing number sense lost the capacity to notice when an answer was wrong, while students who developed number sense first became more capable. The variable was whether the user had the internal model to evaluate the result. LLMs reproduce this shape, scaled up into qualitative domains where the equivalent of number sense is much harder to detect when it is missing.

The Washing Machine

Someone in the room offered a recent example. He had been buying a washing machine and used an LLM to build a comparison table, structured around the variables he could think of: capacity, energy class, spin speed, price. The model produced realistic-looking numbers, and many of them were wrong. Yet the same conversation surfaced something he had not been thinking about at all: how repairable the different models were, which vendors had local specialists, and what the long-tail cost of ownership looked like. The model was unreliable on the figures he asked it to verify, and genuinely useful on the dimensions he had not thought to consider. After we explored the example further, he also admitted that he had been using the LLM as confirmation infrastructure for a decision already made. The genuine value was the adjacent thinking it triggered, not the comparison it was nominally performing.

This pattern recurred across the other examples. Participants described using AI for sense-making: checking their intuition against perspectives they had not yet considered, constructing comprehensible versions of half-formed ideas, and planning. The positive examples we explored shared one feature: the LLM is good at organising what you already know, and poor at investigating what you don't. Greenfield use, stepping into a domain without prior knowledge and asking the model to navigate it, is where the failure mode lives. Confirmation, expansion, sense-making, devil's advocacy against your own thinking, and rapid exploration of adjacent possibilities are all places where the tool extends judgment. In these uses the model is forced to play in the space the practitioner has already mapped, and the surprise comes from the dimensions the practitioner had not thought to map yet. The washing machine showed both modes at once.

Reading the Distribution

Part of the explanation is divergence. Ask a language model the same question twice, and you can get two different answers. On harder questions these are not differently worded versions of one conclusion, but different conclusions. This is not a quirk that better prompting fixes. Even at the most deterministic settings a user can choose, the same prompt produces different conclusions across runs, because the underlying systems batch and compute in ways no user setting controls (Cui and Alexander, 2026). One study generated eighty distinct outputs from a thousand identical queries.

Divergence is also the most revealing thing the model does, because the variation exposes uncertainty that the prose itself conceals. Convergence across runs suggests stable ground. Divergence suggests the question is touching something the underlying distribution has not integrated. The cross-check most workflows currently teach (verify the citation, confirm the fact, ask another model) looks for confirmation. The useful cross-check looks for contradiction, and it has to be inserted at the moment of decision, not after the conclusion has already been reached.
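Reading the distribution rather than the response can be sketched in a few lines. The example below is a minimal illustration, not a method from the session: it assumes a human reader has already run one prompt several times and labelled each run with the conclusion it reached. The function name read_distribution and the sample labels are hypothetical.

```python
from collections import Counter
from math import log2


def read_distribution(conclusions):
    """Summarise how far repeated runs of one prompt agree.

    `conclusions` is a list of short labels, one per run, assigned by a
    human reader (an assumption of this sketch, not an automated step).
    """
    if not conclusions:
        raise ValueError("need at least one run")
    # Normalise case and whitespace so trivial differences are not
    # counted as divergence.
    counts = Counter(c.strip().lower() for c in conclusions)
    total = sum(counts.values())
    top_label, top_count = counts.most_common(1)[0]
    # Shannon entropy in bits: 0.0 when every run reaches the same
    # conclusion, higher as the runs spread over more conclusions.
    entropy = sum((n / total) * log2(total / n) for n in counts.values())
    return {
        "distinct": len(counts),
        "top": top_label,
        "agreement": top_count / total,
        "entropy_bits": entropy,
    }


# Hypothetical labels for five runs of the same question.
runs = ["Repair it", "repair it", "Replace it", "repair it ", "Replace it"]
summary = read_distribution(runs)
print(summary)
```

High agreement with low entropy is the convergence case. A near-even split is the signal to slow down and look at where the runs contradict each other, which is still a judgment the reader has to make.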

Language as the Surface that is Being Measured

A 2023 JAMA study compared physician and ChatGPT responses to real patient questions on a public medical forum. Licensed clinicians preferred the chatbot's responses in 79% of cases and rated them as more empathetic. The finding has been used to argue that AI is already outperforming humans in domains that supposedly required human judgment. More recent controlled work, comparing LLMs and human therapists across full cognitive-behavioral therapy sessions rather than isolated responses, inverts the finding. Human therapists outperform the LLM on agenda-setting, eliciting feedback, applying technique, and building therapeutic alliance. The chatbot's single-turn empathy advantage, the same studies note, comes from the linguistic features of the responses. They are longer, and they convey more positive sentiment.

What is being measured in the single-turn study is not diagnostic accuracy. It is the formulation of the answer, the linguistic surface on which the answer is delivered, the warmth and length of the response. The LLM excels at this because that is what the model does. What it cannot do is sustain a therapeutic alliance, follow the patient's reasoning across sessions, or read the tacit signals that move through a conversation. The empathy advantage is the form looking right. The substance underneath is something else.

Several people in the room, working between Dutch and English with colleagues, noted that conversations feel different in the two languages. What can be said in one is not quite available in the other. The thinking an LLM makes available is bounded by the medium it inhabits. The therapy finding is the same observation in a different domain. The LLM is good at the linguistic surface, and the surface is all it is. The fact that this surface can lead someone to assume the model outperforms therapists is reason enough to read the good-looking slop one more time, even when the conclusion seems obvious, because individual use, over time, accumulates into a broader social consequence.

Where the Policy Response is Aimed

The institutional response to the past two years has been almost entirely procedural: written AI policies, governance committees, training programs, prohibitions. Yet the pattern of misleading output entering decision-making is older than LLMs. For forty years, the tobacco industry produced peer-reviewed research arguing that smoking was safe. The sugar industry spent ninety years funding studies that shifted public health attention away from refined carbohydrates and onto fat (Neurath, 2025). Dietary guidelines were shaped on corrupted foundations. The argument about corrupted policies goes much deeper, but if we imagine treating this kind of research as our entry point into a new domain, we have to admit that misleading information is not an LLM-created phenomenon. The LLM is the most extreme instance of the same pattern, a structurally consequenceless producer whose output enters decision-making processes that were not designed for producers without stakes. The failure is at the moment the artefact is evaluated, and the evaluator is the only part of the system that can carry the consequence.

The job market reflects this already. The skills being selected for are no longer the ones the previous generation accumulated. Pattern-matching to memorised syntax, the kind of capability LeetCode interviews used to test, is now substantially automated, and the consensus that the test was always a poor proxy for actual engineering has caught up with the reality on the ground. What is being selected for instead is calibrated evaluation, the capacity to read whether output is good without being the one who produced it, and to know when to push back against fluent confidence. That capability does not yet have a clean credential or a clean training pipeline, which is part of why hiring at the moment is difficult and why the new generation's gaps are also their advantages. They have less of the old expertise that is being automated, and they have built earlier the meta-skill of working with the tool.

The Skill is Older than the Tool

None of these signals is new. Even the linguistic details (formatting, vocabulary, citation density, professional tone) were never reliable markers separating good work from slop.

Nisbett and Wilson showed in 1977 that verbal reports about cognitive processes are often plausible reconstructions rather than direct introspection. Gary Klein's decades of work extended the finding. Firefighters asked how they decided to evacuate a building gave procedural accounts that did not match what the timing of the decision could possibly have allowed. They recognised a pattern, ran a mental simulation, and acted, but the verbal account was a post-hoc reconstruction. Polanyi named the underlying condition earlier: we know more than we can tell, and we often tell more than we actually know about our own knowing. The LLM asked to explain its reasoning produces the same kind of plausible reconstruction, and reading past it to the shape of the work behind it is the same skill in both cases. The LLM moment is not introducing a new evaluation problem; it is generalising an old one.

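A real conversation about reasoning cannot happen with an LLM, because its account of its own reasoning is itself a reconstruction. The only way to know whether the output is sound is to be the expert yourself or to verify independently.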

Designing for What the Artefact Cannot Show

To use the LLM well, the producer must have base knowledge in the domain. The harder question is how anyone can tell whether the base knowledge is there. Base knowledge cannot be reliably detected from the artefact alone, only from the process: from the questions the producer asked before producing the artefact, from how they engaged with the model's output, from the follow-ups they pursued, from the places they slowed down.

This has direct consequences for how review processes are designed. Review that consumes artefacts cannot detect slop reliably, because the artefact is exactly what the LLM is best at producing. The diagnostic skill lives in senior judgment that is scarce, unevenly distributed, and slow to transfer. The answer is therefore not better policy but slower process, specifically process that puts the conversation back into the workflow, because the conversation is where the diagnostic lives.

One observation the group kept coming back to was that children should not be using LLMs before their own critical thinking has developed, because the calculator finding is visible at this stage. The tool extends the capacity that already exists and substitutes for the capacity that does not. The same observation, transposed, applies to anyone entering a new domain. The tool is genuinely useful to the practitioner who already has base knowledge and integrity, and structurally dangerous to the one who does not, because it produces artefacts indistinguishable from the work of the one who does. The number-sense problem returns at organisational scale. The detection signals are not in the artefact; they are in the process.

Exchange

Closer to the Room

Two things from this session are worth documenting: one encouraging, and one to consider in improving the next session. The encouraging part: opening with personal examples positioned the group closer to their own experience for the rest of the conversation, and the effect was noticeable. We kept returning to personal examples throughout, more than in any session before. Eugene Gendlin, building on his work with Carl Rogers, described this as the difference between intellectual understanding and felt sense: theory that has been confirmed by personal recognition lands differently than theory that has merely been understood.

When the Answer Arrives Too Early

The harder observation is that I had trouble getting the conversation deep, and on reflection, I think the question itself did not require depth. The answer was clear from early in the session: the LLM is good for what you already know, poor at what you don't, and the diagnostic is whether you have the base knowledge to evaluate. Everyone acknowledged this within the first stretch of conversation, and once it was on the table, there was less for the group to discover together. The remaining time was spent confirming the answer from different angles rather than pushing into new territory.

This is a facilitation problem, not a group problem. Karl Weick's work on sensemaking distinguishes between situations that reward collective inquiry and situations where the answer is already structurally available. A good session question creates the first condition. This one created the second, possibly because the framing had been discussed among people who had already been thinking about it individually, and the room arrived half-convinced. The right adjustment for next time is to choose a question whose answer is not retrievable from individual reflection, only from collective work. What worked this session was that personal examples opened the room. What did not was that the room reached its conclusion before the conversation had genuinely tested it.

THELO-X

THEORY-PHILOSOPHY-EXCHANGE