Why Google Gemini’s Pokémon success isn’t all it’s cracked up to be

Earlier this year, we took a look at how and why Anthropic's Claude large language model was struggling to beat Pokémon Red (a game, let's remember, designed for young children). But while Claude 3.7 is still struggling to make consistent progress at the game weeks later, a similar Twitch-streamed effort using Google's Gemini 2.5 model finally managed to complete Pokémon Blue this weekend after more than 106,000 in-game actions, earning accolades from followers, including Google CEO Sundar Pichai.

Before you start using this achievement as a way to compare the relative performance of these two AI models—or even the advancement of LLM capabilities over time—there are some important caveats to keep in mind. As it happens, Gemini needed some fairly significant outside help on its path to eventual Pokémon victory.

Strap in to the agent harness

Gemini Plays Pokémon developer JoelZ (who's ...

Copyright of this story solely belongs to arstechnica.com.
