Musk's xAI may have fudged Grok 3's AI benchmark results

By Dwaipayan Roy

Feb 23, 2025

10:25 am

What's the story

Elon Musk's AI firm, xAI, has been accused by an OpenAI employee of releasing deceptive benchmark results for Grok 3.

The controversy started when xAI shared a graph on its blog showing Grok 3's performance on AIME 2025, a benchmark built from the problems of a recent invitational mathematics exam.

The graph showed two versions of Grok 3 beating OpenAI's best model. However, the OpenAI employee pointed out that the graph omitted a crucial data point for that model.

Benchmark controversy

xAI's graph under scrutiny

The missing data point was o3-mini-high's AIME 2025 score at "cons@64," a metric that gives a model multiple attempts to solve each problem in a benchmark.

Some experts question the validity of AIME as an AI benchmark, but it is commonly used to assess a model's mathematical capabilities.

Metric omission

Omission of 'cons@64' could distort comparison

The term "cons@64" refers to "consensus@64," a metric that allows an AI model 64 tries to solve each problem in a benchmark.

The answer generated most often across those tries is then taken as the model's final answer for that problem.

This metric can greatly improve a model's benchmark score, so leaving it out of a graph could easily mislead people into believing one model is superior to another.
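
To make the difference concrete, here is a minimal Python sketch of the two ways of scoring described in this article: grading only the first attempt ("@1") versus grading the most common answer across several attempts ("cons@k"). The function names, the answers, and the answer key are all made up for illustration, and five attempts stand in for 64 to keep the example short. This is not the evaluation code used by xAI or OpenAI.

```python
from collections import Counter

def consensus_answer(samples):
    """Return the answer that appears most often among the samples.

    Ties go to the answer that was seen first (Counter ordering, Python 3.7+).
    """
    return Counter(samples).most_common(1)[0][0]

def score_at_1(samples_per_problem, answer_key):
    """Fraction of problems where the first sampled answer is correct."""
    correct = sum(
        samples[0] == truth
        for samples, truth in zip(samples_per_problem, answer_key)
    )
    return correct / len(answer_key)

def score_cons_at_k(samples_per_problem, answer_key, k):
    """Fraction of problems where the majority answer over the first k samples is correct."""
    correct = sum(
        consensus_answer(samples[:k]) == truth
        for samples, truth in zip(samples_per_problem, answer_key)
    )
    return correct / len(answer_key)

# Hypothetical data: three problems, five sampled answers each.
answer_key = ["204", "70", "588"]
samples_per_problem = [
    ["113", "204", "204", "204", "113"],  # first try wrong, majority right
    ["70", "70", "12", "70", "70"],       # first try right, majority right
    ["16", "16", "588", "16", "9"],       # first try wrong, majority wrong
]

at_1 = score_at_1(samples_per_problem, answer_key)
cons_at_5 = score_cons_at_k(samples_per_problem, answer_key, k=5)
print(f"@1: {at_1:.2f}  cons@5: {cons_at_5:.2f}")
```

In this toy example the same model answers one of three problems correctly at "@1" but two of three at "cons@5," which is why a chart that mixes the two settings across models is not a like-for-like comparison.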

Performance

Grok 3 models trail behind OpenAI's in certain metrics

When assessed at "@1" (the first score the models got on the benchmark), both Grok 3 Reasoning Beta and Grok 3 mini Reasoning performed worse than o3-mini-high.

Grok 3 Reasoning Beta also narrowly trailed OpenAI's o1 model set to "medium" computing.

Nevertheless, xAI still touts Grok 3 as the "world's smartest AI."

Defense stance

xAI defends itself amid AI benchmark controversy

In response to the accusations, xAI's Igor Babuschkin defended his company's actions.

He argued that OpenAI has previously released similarly misleading benchmark charts, albeit only comparing the performance of its own models.

The remark was meant to justify xAI's omission of certain data points from its graph comparing Grok 3's performance with that of OpenAI's models.