xAI’s Grok 4 Performs Poorly On A Dynamic Strategic Challenge, Shows Improvements In Reasoning Abilities

This is not investment advice. The author has no position in any of the stocks mentioned. Wccftech.com has a disclosure and ethics policy.
xAI’s Grok 4 AI model is all the rage these days, helped along by the incessant hype created by Elon Musk himself. Yet, under the hood, the model appears to be tuned specifically to ace AI benchmark tests, and falls flat when it encounters a dynamic, strategic challenge.
xAI’s Grok 4 has already managed to embroil itself in a number of controversies, despite debuting on the market just a few days back.
For instance, Grok 4 attracted eyeballs a few days back for perfectly parroting Elon Musk’s well-established, and at times controversial, opinions on immigration and the stimuli behind tensions across the various geopolitical hotspots around the globe.
This rampage, which was triggered by an update to Grok 4’s system prompts, culminated in the LLM referring to itself as “MechaHitler,” replete with a paean for Adolf Hitler himself.
This brings us to the crux of the matter. xAI’s Grok 4 recently placed fifth on the multi-agent Step Race benchmark, which uses the New York Times’ Connections puzzles to evaluate the performance of various AI models, requiring each model to strategize and “think on its toes.” Even Gemini 2.5 Flash performed better than Grok 4!
Set against Grok 4’s high scores across various standardized benchmarks, this result compels one to theorize that the model has been tuned to ace those benchmarks via a process called overfitting, where a model “rote learns” its training data instead of capturing the more important patterns within the data set.
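To make the idea of “rote learning” concrete, here is a minimal Python sketch. It is purely illustrative and says nothing about how Grok 4 was actually trained or how Step Race is scored. The rule y = 2x + 1, the two toy model classes, and every name in the code are assumptions made up for this example. A lookup table that memorizes its training pairs stands in for an extreme case of overfitting, while a simple line fit stands in for a model that captures the underlying pattern.

```python
def make_data(xs):
    # Hypothetical ground-truth rule, invented for this illustration: y = 2x + 1
    return [(x, 2 * x + 1) for x in xs]


class RoteLearner:
    """Memorizes every training pair; has nothing useful to say about unseen inputs."""

    def fit(self, pairs):
        self.table = dict(pairs)
        # For inputs it has never seen, fall back to the average training answer.
        self.fallback = sum(y for _, y in pairs) / len(pairs)

    def predict(self, x):
        return self.table.get(x, self.fallback)


class LineFit:
    """Ordinary least-squares straight line: learns the pattern, not the examples."""

    def fit(self, pairs):
        n = len(pairs)
        mean_x = sum(x for x, _ in pairs) / n
        mean_y = sum(y for _, y in pairs) / n
        sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
        self.slope = sxy / sxx
        self.intercept = mean_y - self.slope * mean_x

    def predict(self, x):
        return self.slope * x + self.intercept


def mean_abs_error(model, pairs):
    return sum(abs(model.predict(x) - y) for x, y in pairs) / len(pairs)


train = make_data(range(0, 5))   # the "benchmark" the models get to study
test = make_data(range(5, 10))   # a fresh challenge drawn from the same rule

rote = RoteLearner()
rote.fit(train)
line = LineFit()
line.fit(train)

for name, model in (("rote learner", rote), ("line fit", line)):
    print(name,
          "| train error:", mean_abs_error(model, train),
          "| test error:", mean_abs_error(model, test))
```

Both toy models score perfectly on the data they studied, but only the line fit holds up on the unseen inputs. That gap between familiar and unfamiliar problems is the pattern the overfitting theory points to.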
Of course, this does not mean that xAI’s Grok 4 isn’t a very useful model. After all, its reasoning abilities appear to have improved dramatically.
It also outperforms almost every other model at identifying coding mistakes and bugs.
People are also using the LLM to generate the code for a game, and then porting that code over to Cursor.
However, the model is still not as capable as Elon Musk would have you believe. Look no further than the betting platform Kalshi, where, as we noted in a previous post, Grok 4 is attracting only middling bets so far.
Meanwhile, the Financial Times recently reported that xAI, which also happens to be the parent entity of the X social media platform, is targeting a $200 billion valuation in an upcoming funding round. Bear in mind that xAI raised $300 million via a secondary stock offering in June, and another $10 billion in early July.
This comes as SpaceX is also reportedly investing $2 billion in xAI from a $5 billion recent funding round.
Finally, it looks like Elon Musk is also preparing the ground for Tesla to take a stake in xAI, completing the circular financing “hot potato” game that has been ongoing for a while now between various Musk-linked entities.