Teaching AI Agents to Ask Better Questions by Playing "Battleship"

By Alex Shipps · Source: MIT News · Posted: June 13, 2026

MIT researchers use the classic game as a test bed for AI agents, finding a small model can outperform the biggest ones at 1 percent of the cost.

At Data Tribes, we encountered this compelling piece from MIT News and felt it deserved a closer look, especially for anyone working in AI research, data science, or building intelligent agents.

The Problem With How AI Asks Questions

AI agents are everywhere in 2026, handling tasks in customer service, software development, and beyond. But in high-stakes fields like medical diagnosis or scientific discovery, where an agent must navigate vast uncertainty and ask the right questions, today's language models fall short.

Researchers at MIT's CSAIL and Harvard's SEAS decided to investigate why, using an unlikely test bed: the classic board game Battleship.

How the Game Works

The team reframed Battleship as a natural language exercise. One AI plays the "captain," asking questions about where hidden ships are located. Another plays the "spotter," answering in real time. The researchers first had more than 40 people play the game to build a benchmark dataset called BattleshipQA, then tested leading models like GPT-5 and smaller ones like Llama 4 Scout against it.

The finding: top models can beat humans, but smaller models struggle badly, not because they can't reason, but because they don't know how to ask useful questions.

Three Things the Research Got Right

The Monte Carlo Advantage: Given a Monte Carlo inference strategy, essentially a method for weighing which possibilities are most likely after each answer, even small models became dramatically better questioners. Llama 4 Scout went from beating humans just 8 percent of the time to 82 percent, while running at roughly 1 percent of the cost of frontier models like GPT-5.
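To make the Monte Carlo idea above more concrete, here is a minimal sketch of one way it could work, not the researchers' actual implementation. It assumes a toy 4x4 board with a single three-cell ship, samples many possible boards, and picks the yes/no question whose answer is most uncertain across those samples. Every name here (GRID, sample_board, best_question, update) and the three candidate questions are hypothetical and exist only for illustration.

```python
import math
import random

GRID = 4  # hypothetical 4x4 board, far smaller than real Battleship


def sample_board(rng, ship_len=3):
    """Place one ship at random and return the set of occupied (row, col) cells."""
    if rng.random() < 0.5:  # horizontal placement
        r = rng.randrange(GRID)
        c = rng.randrange(GRID - ship_len + 1)
        return frozenset((r, c + i) for i in range(ship_len))
    r = rng.randrange(GRID - ship_len + 1)  # vertical placement
    c = rng.randrange(GRID)
    return frozenset((r + i, c) for i in range(ship_len))


def entropy(p):
    """Binary entropy in bits of an answer that is 'yes' with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def best_question(questions, boards):
    """Return the question whose answer is most uncertain over the sampled boards."""
    def score(item):
        _, predicate = item
        p_yes = sum(predicate(b) for b in boards) / len(boards)
        return entropy(p_yes)
    return max(questions.items(), key=score)[0]


def update(boards, predicate, answer):
    """Keep only the sampled boards that agree with the spotter's answer."""
    return [b for b in boards if predicate(b) == answer]


rng = random.Random(0)
boards = [sample_board(rng) for _ in range(2000)]

# Hypothetical candidate questions, each paired with a check against a board.
questions = {
    "Is any ship cell in the top half?": lambda b: any(r < GRID // 2 for r, _ in b),
    "Is the ship horizontal?": lambda b: len({r for r, _ in b}) == 1,
    "Is cell (0, 0) occupied?": lambda b: (0, 0) in b,
}

chosen = best_question(questions, boards)
remaining = update(boards, questions[chosen], True)  # suppose the spotter says "yes"
print(chosen, "| boards still possible:", len(remaining))
```

In this toy setup the orientation question splits the sampled boards roughly in half, so it carries the most information, while asking about a single corner cell rules out very little. A real agent would have a language model propose the candidate questions and would re-sample or re-weight boards after every answer.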

Turning Questions Into Code: Smaller models also struggled to answer questions accurately. The fix was elegant: convert each natural language question into Python code that tells the model exactly how to verify its answer. On average, models saw a 15 percent accuracy boost. GPT-4o-mini improved by nearly 30 percent, and even Claude 4 Opus gained around eight points.
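As a rough illustration of the question-to-code idea above, the sketch below pairs a natural language question with a short Python function and runs that function against the board to produce the answer. This is an assumption about how such a pipeline might look, not the paper's code: the board layout, the question, and the "generated" snippet are all hand-written stand-ins for what a model would produce, and run_generated is a hypothetical helper.

```python
# Hypothetical board: "." is water, letters mark ship cells.
BOARD = [
    "..A..",
    "..A..",
    ".....",
    "BBB..",
    ".....",
]

QUESTION = "Is any part of a ship in row 4?"

# Stand-in for model output. In a real pipeline a language model would write
# this function from the question text; here it is hand-written.
GENERATED_CODE = '''
def answer(board):
    # Rows are 1-indexed in the question and 0-indexed in the list.
    return any(cell != "." for cell in board[3])
'''


def run_generated(code, board):
    """Execute the snippet in its own namespace and call its answer() function."""
    namespace = {}
    # exec is acceptable for this trusted, hand-written snippet; anything a
    # model actually writes should run in a sandbox instead.
    exec(code, namespace)
    return bool(namespace["answer"](board))


result = run_generated(GENERATED_CODE, BOARD)
print(QUESTION, "->", "yes" if result else "no")
```

The appeal of this design is that the answer comes from executing an explicit check rather than from the model's reading of the board, so a mistake shows up as a bug in a few lines of code that can be inspected.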

Beyond Battleship: The approach transferred. When tested on Guess Who?, Llama 4 Scout jumped from a 30 percent success rate to over 72 percent, and GPT-4o climbed from 62 percent to 90 percent.

Bottom Line

AI agents still lag behind human experts at complex question-answering, and expert Battleship players remain undefeated against all models tested. But the gap is closing fast, and the implications go well beyond board games. Better question-asking could unlock AI's potential in scientific discovery, drug development, and any domain where finding the right answer depends first on asking the right question.

The firms and researchers that treat information-seeking as a core AI capability, not an afterthought, will be the ones best positioned for what comes next.

 
