<aside>
💡
I’m deeply fascinated by the idea of making AI play video games, not just for entertainment, but as a reflection of its ability to perform long-horizon planning and strategic reasoning. Recently, I’ve been using SoTA language models to see how well they can play games.
</aside>
Modern LLMs are designed for heavy tool use and are becoming increasingly agentic. This raises a core question: do they possess the raw intelligence to navigate complex game environments using only raw pixel input, or do they still need external tools? The challenge with giving them tool access in games is generalisation: what works as a tool in Pokémon won’t necessarily work in DOOM.
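To make the “raw pixels only” setting concrete, here’s a minimal sketch of what such an agent loop could look like. The `Emulator` class and `query_vlm` function are placeholders I made up for illustration; a real harness would wrap an actual emulator binding and an actual vision-language model API.

```python
from dataclasses import dataclass

BUTTONS = ["UP", "DOWN", "LEFT", "RIGHT", "A", "B", "START"]


@dataclass
class Emulator:
    """Placeholder emulator: returns raw frames and accepts button presses."""
    frame_count: int = 0

    def screenshot(self) -> bytes:
        self.frame_count += 1
        return b"<raw RGB frame bytes>"  # real code: a PNG / pixel array from the emulator

    def press(self, button: str) -> None:
        pass  # real code: forward the input to the game


def query_vlm(frame: bytes, history: list[str]) -> str:
    """Stand-in for the model call: send the frame + recent actions to a VLM
    and parse a single button name out of its reply."""
    return "A"


def run_episode(steps: int = 10) -> list[str]:
    emu, history = Emulator(), []
    for _ in range(steps):
        frame = emu.screenshot()            # the model only ever sees pixels...
        action = query_vlm(frame, history)  # ...plus its own past actions
        if action not in BUTTONS:           # fall back when the reply doesn't parse
            action = "START"
        emu.press(action)
        history.append(action)
    return history


if __name__ == "__main__":
    print(run_episode())
```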
My initial thought was that since these models are already great at coding, maybe they can just build their own tools 🤔 But I quickly realized it’s not that simple. This brings us to the broader question: should we benchmark LLMs on pure screen-input performance, or should we allow them to use tools and custom scaffolds?
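For the curious, here’s roughly what “let the model write its own tools” could look like: the harness executes model-emitted code and registers whatever functions it defines. Everything here (`MODEL_WRITTEN_CODE`, `register_model_tools`, the `TOOLS` registry) is a hypothetical sketch of mine, not any real framework, and it also hints at the catch: the helper the model writes is Pokémon-specific.

```python
# Hypothetical sketch: the harness exec()s code the model wrote and keeps
# whatever functions it defines as callable tools.

# Imagine this string came back from the model mid-game:
MODEL_WRITTEN_CODE = '''
def track_badges(memory, screen_text):
    """Pokemon-specific helper: remember which gym badges were mentioned."""
    for badge in ("Boulder", "Cascade", "Thunder"):
        if badge in screen_text:
            memory.setdefault("badges", set()).add(badge)
    return memory
'''

TOOLS: dict = {}


def register_model_tools(code: str) -> None:
    """Run the model's code in a fresh namespace and keep its functions."""
    namespace: dict = {}
    exec(code, namespace)  # a real system would sandbox this heavily
    for name, obj in namespace.items():
        if callable(obj) and not name.startswith("__"):
            TOOLS[name] = obj


register_model_tools(MODEL_WRITTEN_CODE)
memory = TOOLS["track_badges"]({}, "Obtained the Boulder Badge!")
print(memory)  # {'badges': {'Boulder'}}

# The catch: this tool is useless in DOOM, so "just write your own tools"
# doesn't solve the generalisation problem by itself.
```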
- Handicapping through no tool use
    - The public benchmark used to test this gameplay ability is https://www.vgbench.com/: it supplies only raw screen frames plus trivial auxiliary inputs (hint text for Pokémon, a basic memory store) and avoids any custom agent modules in the prompt, in order to isolate the model’s core reasoning (a rough sketch of what that per-step input might look like follows this list)
    - Prohibiting internal tool creation or external API calls removes the biases introduced by handcrafted scaffolds, but it severely limits what VLMs can accomplish, even in relatively simple games
- The role of tool use in real-world AI
    - Tools such as code execution, state tracking, and planning utilities are central to how current models solve tasks; removing them may mask a model’s true potential and unreasonably penalize it (a sketch of such a tool scaffold also follows this list)
- Generalization vs. specialization
    - Allowing richer scaffolding can boost performance on specific benchmarks, but it risks overfitting to those settings and reducing a model’s ability to generalize across tasks and domains. This creates a gap between the benchmark environment and how AI systems operate in practice. All that said, a “pure” evaluation (no scaffolding) yields insight into a model’s latent capabilities.
- Difficulty gradient in game environments
    - Even very simple games (like elementary platformers) pose challenges for unscaffolded agents, illustrating that raw reasoning alone may be insufficient for meaningful progress in interactive tasks
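To make the “raw frames plus trivial auxiliary inputs” constraint from the first bullet concrete, here’s a rough guess at what a per-step observation could look like under that kind of setup. The field names and prompt below are my own illustration, not the benchmark’s actual API.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    frame_png: bytes                                   # raw screenshot, the primary signal
    hint_text: str = ""                                # e.g. in-game dialogue for Pokémon
    memory: list[str] = field(default_factory=list)    # the model's own notes so far


def build_prompt(obs: Observation) -> str:
    """Everything the model gets: no pathfinding, no RAM reads, no custom tools."""
    notes = "\n".join(obs.memory[-20:]) or "(empty)"
    return (
        "You control the game. Choose one button: UP/DOWN/LEFT/RIGHT/A/B/START.\n"
        f"Hint: {obs.hint_text or '(none)'}\n"
        f"Your notes so far:\n{notes}\n"
        "The current screen is attached as an image."
    )


obs = Observation(frame_png=b"<png bytes>", hint_text="Professor Oak wants to see you.")
print(build_prompt(obs))
```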
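And here, for the tool-use bullet above, is the kind of handcrafted scaffold a no-tool-use rule strips away: named tools (code execution, state tracking, a simple planner) that the model can invoke between frames. The registry and tool names are hypothetical, not taken from any particular agent framework.

```python
import subprocess
import sys

GAME_STATE: dict = {}
PLAN: list[str] = []


def run_python(snippet: str) -> str:
    """Code-execution tool: run a short snippet and return its output."""
    result = subprocess.run([sys.executable, "-c", snippet],
                            capture_output=True, text=True, timeout=5)
    return result.stdout or result.stderr


def update_state(key: str, value: str) -> str:
    """State-tracking tool: persist a fact across turns."""
    GAME_STATE[key] = value
    return f"stored {key}"


def set_plan(steps: str) -> str:
    """Planning tool: overwrite the current multi-step plan."""
    PLAN[:] = [s.strip() for s in steps.split(";") if s.strip()]
    return f"plan now has {len(PLAN)} steps"


TOOLS = {"run_python": run_python, "update_state": update_state, "set_plan": set_plan}


def dispatch(tool_call: dict) -> str:
    """Route a model-emitted call like {'name': 'set_plan', 'args': {...}}."""
    return TOOLS[tool_call["name"]](**tool_call["args"])


print(dispatch({"name": "set_plan",
                "args": {"steps": "heal at the Pokémon Center; beat Brock"}}))
print(dispatch({"name": "update_state",
                "args": {"key": "location", "value": "Pewter City"}}))
```

Handy, but notice how quickly even generic tools accumulate game-specific usage, which is exactly the overfitting worry from the generalization bullet.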