
xAI’s Grok 4 Performs Poorly On A Dynamic Strategic Challenge, Shows Improvements In Reasoning Abilities

Posted on July 14, 2025

This is not investment advice. The author has no position in any of the stocks mentioned. Wccftech.com has a disclosure and ethics policy.

xAI’s Grok 4 AI model is all the rage these days, helped along by the incessant hype created by Elon Musk himself. Yet, under the hood, the model appears to be specifically tuned to ace AI benchmark tests, and falls flat when it encounters a dynamic, strategic challenge.

xAI’s Grok 4 has already managed to embroil itself in a number of controversies, despite debuting on the market just a few days back.

For instance, Grok 4 attracted eyeballs a few days back for parroting Elon Musk’s well-established, and at times controversial, opinions on immigration and on the causes of tensions in various geopolitical hotspots around the globe.

This rampage, which was triggered by an update to Grok 4’s system prompts, culminated in the LLM referring to itself as “MechaHitler,” replete with a paean to Adolf Hitler himself.

Grok 4 takes fifth place on the Multi-Agent Step Race Benchmark: Collaboration and Deception Under Pressure (TrueSkill score: 7.9). o3 remains in first place with 9.4. pic.twitter.com/mmGmWM23h1

— Lech Mazur (@LechMazur) July 12, 2025

This brings us to the crux of the matter. xAI’s Grok 4 recently took fifth place on the Multi-Agent Step Race benchmark, which pits AI models against one another in a multi-agent game of collaboration and deception, requiring each model to strategize and “think on its toes.” Even Gemini 2.5 Flash performed better than Grok 4!

Given Grok 4’s high scores across various standardized benchmarks, this result suggests that the model may have been tuned to ace those benchmarks via a process called overfitting, where a model “rote learns” its training data instead of capturing the more important patterns within the data set.
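For readers unfamiliar with overfitting, here is a minimal toy sketch in Python. It says nothing about how Grok 4 was actually trained. The data-generating rule (y = 2x + 1 plus noise), the function names, and the sample sizes are all assumptions made purely for illustration. A "rote learner" that memorizes its training points is compared with a simple straight-line fit that captures the overall pattern.

```python
import random


def make_data(n, rng):
    """Draw n noisy samples from the assumed pattern y = 2x + 1."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least-squares straight line: learns the overall trend."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def memorize(xs, ys):
    """Rote learner: returns the stored answer of the nearest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(model, xs, ys):
    """Mean squared error of a model on a data set."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)  # fixed seed so the run is repeatable
train_x, train_y = make_data(50, rng)   # the "benchmark" the models have seen
test_x, test_y = make_data(500, rng)    # fresh data from the same pattern

line = fit_line(train_x, train_y)
rote = memorize(train_x, train_y)

for name, model in (("line fit", line), ("rote learner", rote)):
    print(
        f"{name:12s} train MSE = {mse(model, train_x, train_y):.2f}, "
        f"test MSE = {mse(model, test_x, test_y):.2f}"
    )
```

The rote learner scores perfectly on the data it has already seen, because it simply replays the stored answers, noise included. On fresh data its error should typically be noticeably higher than that of the plain line fit. That gap between seen and unseen performance is the signature of overfitting.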

Of course, this does not mean that xAI’s Grok 4 isn’t a very useful model. After all, its reasoning abilities appear to have improved dramatically.

Grok 4 Heavy is better than any model available at identifying issues in your codebase. Here’s the JS prompt I use with my game code to have Grok 4 Heavy find the bugs.

Python prompt in Comments👇 pic.twitter.com/HFpW1hGvMM

— Tetsuo (@tetsuoai) July 13, 2025

According to some users, it also outperforms almost every other model at identifying coding mistakes and bugs.

I took Grok 4 for a spin this weekend to build this game prototype.

I used SuperGrok Chat to generate the initial game prototype and then brought it over to Cursor to continue coding with Grok 4 MAX.

Grok 4 in Cursor is like a no-nonsense agent. Doesn’t speak much, but… pic.twitter.com/wyib2vRvsd

— Danny Limanseta (@DannyLimanseta) July 13, 2025

People are also using the LLM to generate the code for a game, and then porting that code over to Cursor.

However, the model is still not as capable as Elon Musk would have you believe. Look no further than the betting platform Kalshi, where, as we noted in a previous post, Grok 4 is attracting only middling bets so far.

Meanwhile, the Financial Times recently reported that xAI, which also happens to be the parent entity of the X social media platform, is targeting a $200 billion valuation in an upcoming funding round. Bear in mind that xAI raised $300 million via a secondary stock offering in June, and another $10 billion in early July.

This comes as SpaceX is also reportedly investing $2 billion in xAI from a recent $5 billion funding round.

It’s not up to me. If it was up to me, Tesla would have invested in xAI long ago.

We will have a shareholder vote on the matter.

— Elon Musk (@elonmusk) July 13, 2025

Finally, it looks like Elon Musk is also preparing the ground for Tesla to take a stake in xAI, completing the circular financing “hot potato” game that has been ongoing for a while now between various Musk-linked entities.