<aside> 💡

I’m deeply fascinated by the idea of making AI play video games, not just for entertainment, but as a reflection of its ability to perform long-horizon planning and strategic reasoning. Recently, I’ve been using SoTA language models to see how well they can play games.

</aside>

Modern LLMs are designed for heavy tool use and are becoming increasingly agentic. This raises a core question: do they possess the intelligence to navigate complex game environments from raw pixel input alone, or do they still need external tools? The challenge with giving them tool access in games is generalization: what works as a tool in Pokémon won’t necessarily work in DOOM.

My initial thought was that since these models are already great at coding, maybe they could just build their own tools 🤔 But I quickly realized it’s not that simple. This brings us to a broader question: should we benchmark LLMs on pure screen-input performance, or should we allow them to use tools and custom scaffolds?

  1. Handicapping through no tool use
  2. The role of tool use in real-world AI
  3. Generalization vs. specialization
  4. Difficulty gradient in game environments