Benchmarks
Comprehensive benchmarks comparing Ninja’s performance to leading models across reasoning, coding, and agent tasks.
SuperNinja
GAIA Benchmark
GAIA is a benchmark for evaluating General AI Assistants on real-world problems. SuperNinja has surpassed frontier models on Level 1.

Deep Research 2.0
Ninja’s Deep Research is rigorously tested against top AI benchmarks. These evaluations confirm its ability to analyze complex topics, adapt its approach, and deliver high-quality research efficiently.
SimpleQA is one of the best proxies for measuring a model's tendency to hallucinate. Ninja scores 91.2% accuracy on the SimpleQA benchmark: Ninja's Deep Research has demonstrated exceptional performance in accurately identifying factual information, surpassing leading models in the field. This result is based on rigorous tests using several thousand questions designed specifically to assess factuality. One reason our system outperforms others is the volume of user feedback we received when we launched the first iteration of Deep Research, which enabled us to fine-tune and improve the quality now reflected in the SimpleQA benchmark.
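To illustrate how a SimpleQA-style accuracy figure can be computed, here is a minimal sketch in Python. It assumes each model answer has already been graded as correct, incorrect, or not attempted; the grading labels and the sample grades are illustrative assumptions for this example, not Ninja's published evaluation pipeline.

```python
from collections import Counter

def simpleqa_accuracy(graded_answers):
    """Compute overall accuracy from per-question factuality grades.

    graded_answers: list of strings, each one of
    "correct", "incorrect", or "not_attempted".
    Accuracy here is simply correct answers / total questions.
    """
    counts = Counter(graded_answers)
    total = sum(counts.values())
    return counts["correct"] / total if total else 0.0

# Hypothetical grades for a tiny sample of questions.
grades = ["correct", "correct", "not_attempted", "correct", "incorrect"]
print(f"Accuracy: {simpleqa_accuracy(grades):.1%}")  # Accuracy: 60.0%
```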
GAIA (General AI Assistants) is a groundbreaking benchmark developed by researchers from Meta, HuggingFace, AutoGPT, and GenAI that significantly advances how we evaluate AI systems' research capabilities. Unlike traditional benchmarks that focus on specialized knowledge or increasingly difficult human tasks, GAIA tests fundamental abilities essential for deep research through a set of carefully crafted questions requiring reasoning, multi-modality, web browsing, and tool-use proficiency.
The benchmark is particularly relevant for measuring the accuracy of deep research systems because it evaluates how well AI can navigate real-world information environments, synthesize data from multiple sources, and produce factual, concise answers—core skills for autonomous research tools.
By focusing on questions that require autonomous planning and execution of complex research workflows—rather than specialized domain expertise—GAIA provides a comprehensive assessment framework that aligns perfectly with evaluating the accuracy and reliability of deep research systems in practical, real-world applications. Ninja Deep Research shows comparable accuracy to OpenAI Deep Research, while offering unlimited tasks for only $15/month.
Provider (Pass@1)        Level 1    Level 2    Level 3    Average
OpenAI's Deep Research   74.29      69.06      47.60      67.36
Ninja's Deep Research    69.81      56.97      46.15      57.64
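For readers who want to see how a summary like the table above can be tabulated, the sketch below computes Pass@1 per GAIA level and an overall average. The per-question outcomes are hypothetical, and the unweighted mean across levels is an assumption; a question-weighted mean would give different averages.

```python
def pass_at_1_by_level(results):
    """Compute Pass@1 (%) per level and an unweighted overall average.

    results: list of (level, passed) tuples, where `passed` is True
    if the model's single attempt was judged correct.
    Note: averaging the level scores without weighting by question
    count is an assumption made for this example.
    """
    per_level = {}
    for level, passed in results:
        per_level.setdefault(level, []).append(passed)
    scores = {lvl: 100 * sum(v) / len(v) for lvl, v in per_level.items()}
    average = sum(scores.values()) / len(scores)
    return scores, average

# Hypothetical per-question outcomes for a tiny evaluation run.
demo = [(1, True), (1, True), (1, False), (2, True), (2, False), (3, False)]
scores, avg = pass_at_1_by_level(demo)
print(scores, f"average={avg:.2f}")
```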
Humanity's Last Exam represents a significant advancement in AI evaluation, providing a comprehensive benchmark that effectively measures the accuracy of deep research across multiple domains. The benchmark uses over 3,000 questions spanning more than 100 subjects, including mathematics, science, history, literature, and numerous other areas. Its expert-level questions, designed to test frontier knowledge beyond simple retrieval, make it uniquely positioned to evaluate how well AI systems can perform accurate, specialized research at the boundaries of human knowledge.
Deep Research, developed by NinjaTech, has achieved a significant breakthrough in artificial intelligence by attaining a 17.47% accuracy score on Humanity's Last Exam. This performance is notably higher than that of several other leading AI models, including OpenAI o3-mini, o1, DeepSeek-R1, and others.
Reasoning 2.0
Reasoning 2.0 outperformed OpenAI o1 and Claude 3.7 Sonnet in competitive math on the AIME test, which assesses a model's ability to handle problems requiring logic and advanced reasoning.
Reasoning 2.0 also surpassed human PhD-level accuracy on the GPQA test, which evaluates general reasoning through complex, multi-step questions requiring factual recall, inference, and problem-solving.
Turbo 1.0 & Apex 1.0
Apex 1.0 scored the highest on the industry-standard Arena-Hard-Auto (Chat) test. It measures how well AI can handle complex, real-world conversations, focusing on its ability to navigate scenarios that require nuanced understanding and contextual awareness.
The models also excel in other benchmarks: MATH-500, AIME 2024 (reasoning), GPQA (reasoning), LiveCodeBench (coding), and LiveCodeBench-Hard (coding).
Get Started with SuperNinja
Where general AI meets real-world productivity.