OpenAI’s new “o3” language model has achieved an impressive IQ score of 136 on the public Mensa Norway intelligence test, clearing for the first time the admission threshold for the country’s Mensa chapter, which sits at roughly the 98th percentile of a standardized bell-curve IQ distribution. The figures were reported by the independent platform TrackingAI.org and highlight a broader trend of closed-source proprietary models outperforming their open-source counterparts in cognitive evaluations.
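
To put the number in context, a percentile can be read straight off the bell curve conventionally used for IQ, with a mean of 100 and a standard deviation of 15. The short Python sketch below relies only on that standard convention; it is an illustration, not TrackingAI’s scoring procedure.

```python
from math import erf, sqrt

def iq_percentile(score: float, mean: float = 100.0, sd: float = 15.0) -> float:
    """Fraction of the population scoring below `score`, assuming IQ ~ Normal(mean, sd)."""
    z = (score - mean) / sd
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))  # standard normal CDF

# Mensa admission is typically set at the top 2 percent, about IQ 131 on an SD-15 scale.
print(f"IQ 131 is above {iq_percentile(131):.1%} of the population")  # ~98.1%
print(f"IQ 136 is above {iq_percentile(136):.1%} of the population")  # ~99.2%
```

Under those assumptions, a score of 136 clears the admission bar with room to spare.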

The “o3” model belongs to the “o-series” of large language models, which have consistently ranked highly across the test types evaluated by TrackingAI. The model excelled on both a proprietary “Offline Test” and the public Mensa Norway test, scoring significantly higher on the latter. The Offline Test consists of pattern-recognition questions written specifically to avoid overlap with data used to train AI models, so the strong result suggests the model performs well across different evaluation formats.

However, the transparency of TrackingAI’s testing methodology is limited. While its standardized prompt format is intended to keep responses comparable across models, crucial details, such as the exact prompting strategy and how raw results are converted to an IQ scale, are not fully disclosed. That opacity hinders reproducibility and interpretation of the results, raising questions about the validity of the assessments.
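
For readers unfamiliar with the term, a scoring scale conversion usually means standardizing a raw test score against a human norming sample and mapping it onto the familiar mean-100, SD-15 IQ scale. The sketch below shows that generic procedure only; the norming values are hypothetical placeholders, since TrackingAI’s actual conversion is not published.

```python
def raw_to_iq(raw_score: float, norm_mean: float, norm_sd: float,
              iq_mean: float = 100.0, iq_sd: float = 15.0) -> float:
    """Map a raw test score onto an IQ scale via z-score standardization.

    `norm_mean` and `norm_sd` describe the human norming sample for the test;
    the values used below are hypothetical, not TrackingAI's.
    """
    z = (raw_score - norm_mean) / norm_sd
    return iq_mean + iq_sd * z

# Hypothetical example: 30 of 35 items correct on a test normed at mean 21, SD 5.
print(round(raw_to_iq(30, norm_mean=21, norm_sd=5)))  # -> 127
```

Without the norming parameters and the exact prompts, a reported IQ figure cannot be independently reproduced.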

The spread across model types was also evident on the Mensa Norway test, where “o3” took a clear lead over models such as GPT-4o. While some openly released models, such as Meta’s Llama 4 Maverick, performed respectably, most Apache-licensed models landed in a noticeably lower IQ range, underscoring the gap between community-built architectures and corporate-backed research pipelines.

Multimodal models that accept image input tended to score lower than their text-only counterparts, pointing to unresolved reasoning inefficiencies in some multimodal pretraining methods. The “o3” model, however, analyzes and interprets images to a very high standard, setting it apart from earlier models. Even so, IQ benchmarks offer only a narrow view of a model’s reasoning capability; broader cognitive behaviors such as planning and factual accuracy are not fully captured by these tests.

As model developers continue to limit disclosures about internal architectures and training methods, independent evaluators play a crucial role in filling the transparency gap. Projects such as EleutherAI’s LM Evaluation Harness (lm-eval) and organizations such as MLCommons provide third-party assessments that help shape the norms of large language model testing. While the IQ scores achieved by the “o3” model are impressive, they serve more as signals of short-context proficiency than as definitive indicators of broader capability. Further analysis of format-dependent performance spreads and evaluation reliability will be needed to validate current benchmarks and refine comparative metrics going forward.
