EVALUATING EVALUATOR

Under what conditions does delegating thinking to an LLM genuinely extend and improve human judgment, and when does it replace it with AI slop?

This month we tried to answer a question that has become trickier as the tools have gotten better. The core takeaway: the bottleneck on good work has shifted from production to evaluation.

The good beneath the surface: LLMs excel at adjacent thinking, sense‑making, and structured expansion of what you already know, but they are poor investigators in domains where you lack base knowledge.

The diagnostic lives in the process, not in the finished artefact, and the evaluator is the only part of the system that can carry the consequences of the work.

Takeaways: Lessons learned during the session

From Theory:

  • Establish base knowledge before reaching for the tool: Use the tool to organize what you already know, not to decide on what you don't.
  • Read the distribution, not the response: Convergence across runs suggests stable ground. Divergence suggests the model is entering a domain where it has no settled position.
  • Look for contradiction, not confirmation: The places where outputs diverge are the places that need your judgment most.
  • The evaluator carries the consequence: Design the review process around what the artefact cannot show.
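One way to make "read the distribution, not the response" concrete is to run the same prompt several times and compare the outputs with each other. The sketch below is a minimal illustration, not a method from the session. It assumes you have already collected the responses as plain strings, and it uses word overlap (Jaccard similarity) as a crude stand-in for agreement. The function names are hypothetical. Lexical overlap cannot detect two runs that say the same thing in different words, so treat a low score as a prompt to read the runs yourself, not as a verdict.

```python
from itertools import combinations


def token_set(text: str) -> set[str]:
    """Lowercase a response and reduce it to a set of word tokens."""
    tokens = {w.strip(".,;:!?\"'()").lower() for w in text.split()}
    return tokens - {""}


def jaccard(a: str, b: str) -> float:
    """Share of tokens two responses have in common, from 0.0 to 1.0."""
    ta, tb = token_set(a), token_set(b)
    if not ta and not tb:
        return 1.0  # two empty responses are trivially identical
    return len(ta & tb) / len(ta | tb)


def mean_agreement(responses: list[str]) -> float:
    """Average pairwise overlap across all runs (assumed proxy for convergence)."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        raise ValueError("need at least two responses to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def most_divergent_pair(responses: list[str]) -> tuple[int, int]:
    """Indices of the two runs that overlap least: the pair to read first."""
    index_pairs = combinations(range(len(responses)), 2)
    return min(
        index_pairs,
        key=lambda ij: jaccard(responses[ij[0]], responses[ij[1]]),
    )


if __name__ == "__main__":
    # Hypothetical outputs from three runs of the same prompt.
    runs = [
        "Start with user interviews, then prototype.",
        "Start with user interviews, then build a prototype.",
        "Skip research and ship a landing page first.",
    ]
    print(f"mean agreement: {mean_agreement(runs):.2f}")
    print(f"read these two runs first: {most_divergent_pair(runs)}")
```

The second helper reflects the "look for contradiction, not confirmation" point: it surfaces the two runs that disagree most and leaves the judgment about why to the evaluator.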

From Exchange:

  • Open with personal examples before structural arguments: The order matters as much as the content.
  • Choose a question that requires the group to find the answer together: If the answer is retrievable from individual reflection, the session will be confirmation rather than discovery.