What happens when you let LLMs play guessing games about professions?

How well do LLMs perform when they're not working properly?
— Steven Broschart

What happens when you let LLMs play "What am I?" The ARD game show "Was bin ich?" seems like the most harmless television format of the postwar era: a guest has a profession, the guessing team may only ask yes-no questions, and after ten "no" answers the round is over. Yet this simple game is surprisingly well suited as a test for a modern problem:

How well can Large Language Models think together?

A recently conducted multi-agent experiment investigated exactly that. Over six game rounds, three LLMs sat together on the guessing team, a fourth model moderated as "Robert Lembke," and an independent auditor evaluated each round afterward.

The participants:

Gemini 2.5 Pro (Google)
GPT-5 (OpenAI)
Claude Opus 4.7 and Claude Sonnet 4.6 (Anthropic).

What became visible is more than just a game result. The experiment reveals structural differences in how these systems reason, integrate context, and handle uncertainty.

Why the format is methodologically interesting

"What am I?" tests several cognitive abilities simultaneously, which classical AI benchmarks typically examine in isolation.

First: the ability to narrow down a search space efficiently in the information-theoretic sense. A good question roughly halves the remaining solution space.
Second: the discipline to formulate questions as genuinely closed yes-no questions.
Third: the use of shared knowledge within a group—do the models build on each other or ask redundantly?
And fourth: the right moment to switch from narrowing down to concrete guessing.
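
The halving idea from the first point can be made concrete with a little arithmetic. The following is a minimal sketch, assuming all candidate professions are equally likely (an assumption the experiment does not state); the function names are hypothetical and not part of the experiment.

```python
import math


def questions_needed(num_candidates: int) -> int:
    """Minimum number of perfectly halving yes-no questions
    needed to isolate one of `num_candidates` options."""
    return math.ceil(math.log2(num_candidates))


def expected_information_bits(yes_count: int, total: int) -> float:
    """Expected information (Shannon entropy, in bits) of a yes-no
    question that would be answered yes for `yes_count` of `total`
    equally likely candidates."""
    p = yes_count / total
    if p in (0.0, 1.0):
        # Every candidate gives the same answer: nothing is learned.
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
```

Under these assumptions, a question that splits the candidates evenly yields one full bit, while a question that only 10 of 100 candidates would answer with yes yields less than half a bit on average.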

Added to this is an elegant mechanic from the TV original: whoever receives a yes gets to ask again. Only a no passes the floor to the next team member. This creates sequential "focus lines" of individual models that can then be comparatively analyzed.
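
The floor-passing rule is easy to sketch in code. The version below is an illustration under assumptions, not the harness used in the experiment: `play_round`, `focus_lines`, and the `ask` callback (which stands in for a model asking a question and the moderator answering it) are hypothetical names, and ending the round on a correct guess is deliberately not modeled.

```python
from typing import Callable, List, Tuple


def play_round(
    team: List[str],
    ask: Callable[[str], bool],
    max_nos: int = 10,
    max_questions: int = 100,
) -> List[Tuple[str, bool]]:
    """A yes keeps the floor, a no passes it to the next team member.
    The round stops after `max_nos` no answers (or a safety cap)."""
    log: List[Tuple[str, bool]] = []
    speaker = 0
    nos = 0
    while nos < max_nos and len(log) < max_questions:
        name = team[speaker]
        answer = ask(name)  # True means yes, False means no
        log.append((name, answer))
        if not answer:
            nos += 1
            speaker = (speaker + 1) % len(team)
    return log


def focus_lines(log: List[Tuple[str, bool]]) -> List[Tuple[str, List[bool]]]:
    """Group consecutive questions by the same speaker into focus lines."""
    lines: List[Tuple[str, List[bool]]] = []
    for name, answer in log:
        if lines and lines[-1][0] == name:
            lines[-1][1].append(answer)
        else:
            lines.append((name, [answer]))
    return lines
```

Grouping the log this way gives one entry per uninterrupted run of questions by a single model, which is the unit the comparison between models relies on.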

What the models did well

Formal discipline was remarkably robust across all models. Virtually all questions were correctly formulated as closed yes-no questions. No model attempted to sabotage the game, abandon roles, or extract information directly from the system prompt.

This sounds trivial, but it really isn't. The ability to maintain rules stably across many turns was noticeably weaker in older model generations.

In three of six rounds, the team found the correct solution. They guessed:

operating room technician
physiotherapist
osteopath

But the team failed at:

fencing trainer
belly dancer
forensic pathologist

What stands out: the successful rounds dealt exclusively with medical professions. The failed ones lay further from the typical professional profiles that are apparently especially well represented in the models' training data.

Where it gets interesting: the model signatures

Over the six rounds, surprisingly consistent behavioral patterns emerged for each model.

Gemini 2.5 Pro

Gemini 2.5 Pro demonstrated the most stable search strategy of the entire experiment. The model seems to have a strong preference for hierarchical narrowing: first rough localization (indoors or outdoors), then the interaction partner (human or object), then industry and activity. In the successful rounds, pivotal questions often came from Gemini.

Claude Opus 4.7

Claude Opus 4.7 showed the most volatile profile in the experiment. In individual rounds, the model delivered the strategically strongest performance, including two pivotal questions and the correct solution, which the auditor rated with maximum points. In other rounds, Claude remained surprisingly inconspicuous.

The impression: the model seems particularly strong at elegantly finishing already-established search spaces. When Claude has to build the initial structure itself, the behavior appears far less dominant.

GPT-5

GPT-5 repeatedly stood out for local hypothesis jumps, that is, questions that did not follow from the answers already given. One example was asking about a profession "in a vehicle" even though it had already been clarified that the activity takes place indoors. Another was the early guess "Are you a pharmacist?" even though a previous question about prescribing medications had been answered with a clear no.

In one round, GPT-5 asked no question at all—a stop under uncertainty that suggests a rather conservative strategy.

The structural weaknesses

Even more interesting than the differences between models are three patterns that appeared across all models.

Negation blindness: several times, models ignored negative answers they had just received. Anyone who guesses "pharmacist" after "Do you prescribe medications?" has been answered with no fails to integrate a logical implication correctly. The same applies to guesses like "security guard in a museum?" after guarding objects or spaces had already been ruled out. The pattern occurred across models. It suggests that LLMs form hypotheses from confirming answers more efficiently than they eliminate hypotheses from negative ones: an asymmetric inference weakness.
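
What integrating a no would look like can be sketched as a simple consistency check. The toy knowledge base below is entirely hypothetical (the professions' property values are illustrative assumptions, not data from the experiment); the point is only that a guess should be tested against negative answers in the same way as against positive ones.

```python
from typing import Dict, List

# Hypothetical toy knowledge base: property values per profession.
CANDIDATES: Dict[str, Dict[str, bool]] = {
    "museum security guard": {"works_indoors": True, "guards_objects_or_spaces": True},
    "physiotherapist": {"works_indoors": True, "guards_objects_or_spaces": False},
    "forensic pathologist": {"works_indoors": True, "guards_objects_or_spaces": False},
    "gardener": {"works_indoors": False, "guards_objects_or_spaces": False},
}


def consistent_candidates(
    candidates: Dict[str, Dict[str, bool]], answers: Dict[str, bool]
) -> List[str]:
    """Keep only candidates that agree with every answer so far,
    treating yes and no answers symmetrically."""
    return [
        name
        for name, props in candidates.items()
        if all(props[prop] == value for prop, value in answers.items())
    ]


def is_guess_consistent(
    guess: str, candidates: Dict[str, Dict[str, bool]], answers: Dict[str, bool]
) -> bool:
    """A guess is only worth making if it survives all answers so far."""
    return guess in consistent_candidates(candidates, answers)
```

With the answers "works indoors: yes" and "guards objects or spaces: no", the museum security guard is eliminated before it can be guessed.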

Redundant re-verification: some questions re-examined properties that had already been implicitly established. If it's already known that someone works with people but doesn't treat them, the question about a "service industry" provides hardly any additional information.

This suggests a weak representation of implicit inferences. The models operate primarily on explicit question-answer pairs—less on their logical consequences.
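
One way to picture redundancy: a question adds nothing if every profession still in play would give the same answer to it. The sketch below assumes a hypothetical set of remaining candidates with made-up property values; it is not how the models or the auditor worked.

```python
from typing import Dict

# Hypothetical candidates left after "works with people: yes"
# and "treats them: no". Property values are illustrative only.
REMAINING: Dict[str, Dict[str, bool]] = {
    "hairdresser": {"service_industry": True, "works_outdoors": False},
    "tour guide": {"service_industry": True, "works_outdoors": True},
    "waiter": {"service_industry": True, "works_outdoors": False},
}


def is_redundant(question: str, remaining: Dict[str, Dict[str, bool]]) -> bool:
    """True if all remaining candidates answer `question` identically,
    so asking it cannot shrink the solution space."""
    answers = {props[question] for props in remaining.values()}
    return len(answers) <= 1
```

In this toy setup, asking about the service industry is redundant, while asking about outdoor work still separates the candidates.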

Frame fixation: once a particular professional field seemed plausible, the models were reluctant to leave that frame, even when new evidence suggested they should.

In the round about the "forensic pathologist," for example, the hypothesis of a solitary surveillance or monitoring job solidified early. No one asked the overarching question of whether it might be an academic profession, even though exactly that question would have broken the incorrect frame.

The models don't think together: perhaps the most important observation of the experiment is a sobering one. The models function as a team sequentially, not synergistically. They use the context of previous questions, but they don't construct a shared mental representation of the solution space.

This shows in the fact that the quality of a round usually depends almost entirely on one or two pivotal questions. And these typically come from exactly one model that is strategically particularly strong at that moment.

Team performance thus seems less like an emergent group phenomenon and more like the sum of individual synthesis efforts. This fits with a broader observation from multi-agent LLM research: more agents don't automatically generate more collective intelligence. Sometimes they simply generate more variability.

What we can learn from this

Three observations reach far beyond the playful setting.

First: the examined models possess clearly distinguishable reasoning profiles that remain surprisingly stable across multiple rounds. If you deploy LLMs in production systems that require structured search or multi-step reasoning, you should know these profiles. Not every model suits every style of reasoning equally.
Second: the integration of negative evidence seems to be a systematic weakness of current language models. If a system has to draw conclusions from no answers as precisely as from yes answers, for example in diagnostic or forensic contexts, this asymmetry quickly becomes practically relevant.
Third, and perhaps most interestingly: even a simple structural intervention could probably improve team performance significantly. If the models were explicitly asked to formulate a brief synthesis of the knowledge gathered so far before each new question, negation blindness, redundant re-verification, and frame fixation would probably occur far less frequently. That would be an obvious follow-up experiment.
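
The intervention proposed in the third point could look roughly like the following prompt builder. This is only a sketch of the idea: the function name and the prompt wording are assumptions, no model API is called, and whether such a prompt actually reduces the three error patterns is exactly what the follow-up experiment would have to show.

```python
from typing import List, Tuple


def build_synthesis_prompt(history: List[Tuple[str, bool]]) -> str:
    """Turn the question-answer history into a prompt that asks for a
    brief synthesis before the next question. Wording is hypothetical."""
    confirmed = [q for q, answer in history if answer]
    ruled_out = [q for q, answer in history if not answer]

    lines = ["Before asking your next question, summarize what is known."]
    lines.append("Confirmed (answered yes):")
    lines += [f"- {q}" for q in confirmed] or ["- (nothing yet)"]
    # Negative answers get their own section so they cannot be skipped.
    lines.append("Ruled out (answered no):")
    lines += [f"- {q}" for q in ruled_out] or ["- (nothing yet)"]
    lines.append(
        "State which professions remain plausible, "
        "then ask one closed yes-no question."
    )
    return "\n".join(lines)
```

Listing the no answers separately is a deliberate design choice here: it forces the negative evidence into the context in explicit form instead of leaving it implicit in the transcript.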

A television game as an AI benchmark

What's actually interesting about this experiment is therefore not which model won, but that a television game from the 1950s reveals properties of modern AI systems that many current benchmarks still measure insufficiently:

strategic search planning
handling negative evidence
context integration
uncertainty reduction
and collaborative reasoning

It's precisely there that differences between modern language models often show up more clearly than in standardized, score-based tests.

Robert Lembke would probably have appreciated that his show has become a small instrument of AI research decades later.

LLM Studies

Looking for a speaker or a written contribution?
Write to me.
