Benchmarks
Comprehensive benchmarks comparing Ninja’s performance to leading models across reasoning, coding, and agent tasks.
SuperNinja
GAIA Benchmark
GAIA is a benchmark for evaluating General AI Assistants on real-world problems. SuperNinja has surpassed frontier models on Level 1.

Deep Research 2.0
Ninja’s Deep Research is rigorously tested against top AI benchmarks. These evaluations confirm its ability to analyze complex topics, adapt its approach, and deliver high-quality research efficiently.
SimpleQA is one of the best proxies for measuring a model's tendency to hallucinate. Ninja scores 91.2% accuracy on the SimpleQA benchmark: Ninja's Deep Research has demonstrated exceptional performance in accurately identifying factual information, surpassing leading models in the field. This result is based on rigorous tests using several thousand questions designed specifically to assess factuality. One reason our system outperforms others is the volume of user feedback we received when we launched the first iteration of Deep Research, which enabled us to fine-tune and improve the quality now reflected in the SimpleQA benchmark.
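To illustrate how a SimpleQA-style accuracy figure can be computed, here is a minimal sketch in Python. It assumes each model answer has already been graded as correct, incorrect, or not attempted; the grading labels and the sample grades are illustrative assumptions for this example, not Ninja's published evaluation pipeline.

```python
from collections import Counter

def simpleqa_accuracy(graded_answers):
    """Compute overall accuracy from per-question factuality grades.

    graded_answers: list of strings, each one of
    "correct", "incorrect", or "not_attempted".
    Accuracy here is simply correct answers / total questions.
    """
    counts = Counter(graded_answers)
    total = sum(counts.values())
    return counts["correct"] / total if total else 0.0

# Hypothetical grades for a tiny sample of questions.
grades = ["correct", "correct", "not_attempted", "correct", "incorrect"]
print(f"Accuracy: {simpleqa_accuracy(grades):.1%}")  # Accuracy: 60.0%
```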
GAIA (General AI Assistants) is a groundbreaking benchmark developed by researchers from Meta, HuggingFace, AutoGPT, and GenAI that significantly advances how we evaluate AI systems' research capabilities. Unlike traditional benchmarks that focus on specialized knowledge or increasingly difficult human tasks, GAIA tests fundamental abilities essential for deep research through a set of carefully crafted questions requiring reasoning, multi-modality, web browsing, and tool-use proficiency.
The benchmark is particularly relevant for measuring the accuracy of deep research systems because it evaluates how well AI can navigate real-world information environments, synthesize data from multiple sources, and produce factual, concise answers—core skills for autonomous research tools.
By focusing on questions that require autonomous planning and execution of complex research workflows—rather than specialized domain expertise—GAIA provides a comprehensive assessment framework that aligns perfectly with evaluating the accuracy and reliability of deep research systems in practical, real-world applications. Ninja Deep Research shows comparable accuracy to OpenAI Deep Research, while offering unlimited tasks for only $15/month.
Provider (Pass@1)        Level 1    Level 2    Level 3    Average
OpenAI's Deep Research   74.29      69.06      47.60      67.36
Ninja's Deep Research    69.81      56.97      46.15      57.64
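For readers who want to see how a summary like the table above can be tabulated, the sketch below computes Pass@1 per GAIA level and an overall average. The per-question outcomes are hypothetical, and the unweighted mean across levels is an assumption; a question-weighted mean would give different averages.

```python
def pass_at_1_by_level(results):
    """Compute Pass@1 (%) per level and an unweighted overall average.

    results: list of (level, passed) tuples, where `passed` is True
    if the model's single attempt was judged correct.
    Note: averaging the level scores without weighting by question
    count is an assumption made for this example.
    """
    per_level = {}
    for level, passed in results:
        per_level.setdefault(level, []).append(passed)
    scores = {lvl: 100 * sum(v) / len(v) for lvl, v in per_level.items()}
    average = sum(scores.values()) / len(scores)
    return scores, average

# Hypothetical per-question outcomes for a tiny evaluation run.
demo = [(1, True), (1, True), (1, False), (2, True), (2, False), (3, False)]
scores, avg = pass_at_1_by_level(demo)
print(scores, f"average={avg:.2f}")
```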
Humanity's Last Exam represents a significant advancement in AI evaluation, providing a comprehensive benchmark that effectively measures the accuracy of deep research across multiple domains. The benchmark uses over 3,000 questions spanning more than 100 subjects, including mathematics, science, history, literature, and numerous other areas. Its expert-level questions, designed to test frontier knowledge beyond simple retrieval, make it uniquely positioned to evaluate how well AI systems can perform accurate, specialized research at the boundaries of human knowledge.
Deep Research, developed by NinjaTech, has achieved a significant breakthrough in artificial intelligence by attaining a 17.47% accuracy score on Humanity's Last Exam. This performance is notably higher than that of several other leading AI models, including OpenAI o3-mini, o1, DeepSeek-R1, and others.
Reasoning 2.0
Reasoning 2.0 outperformed OpenAI o1 and Claude 3.7 Sonnet in competitive math on the AIME test, which assesses a model's ability to handle problems requiring logic and advanced reasoning.
Reasoning 2.0 also surpassed human PhD-level accuracy on the GPQA test, which evaluates general reasoning through complex, multi-step questions requiring factual recall, inference, and problem-solving.
Turbo 1.0 & Apex 1.0
Apex 1.0 scored the highest on the industry-standard Arena-Hard-Auto (Chat) test. It measures how well AI can handle complex, real-world conversations, focusing on its ability to navigate scenarios that require nuanced understanding and contextual awareness.
The models also excel in other benchmarks: MATH-500, AIME 2024 (reasoning), GPQA (reasoning), LiveCodeBench (coding), and LiveCodeBench-Hard (coding).
Get Started with SuperNinja
Where general AI meets real-world productivity.