At NinjaTech AI, we're constantly pushing the boundaries of what's possible with autonomous AI agents. SuperNinja, our advanced general agent platform, deploys a dedicated Cloud Computer (VM) for each task, enabling a complete cycle of Research → Build → Deploy for complex code, live dashboards, websites, and more. Our scaffold is specifically designed to leverage long-horizon tool calling, coding, and reasoning, including the multi-step information retrieval we call Deep Research.
Today, we're thrilled to share our comprehensive analysis of Anthropic's newly launched Sonnet 4.5 model as the core intelligence powering SuperNinja. After rigorous testing across our internal benchmarks and real-world customer scenarios, we can confidently say: Sonnet 4.5 is a magnificent beast that represents a significant step change for autonomous agent performance.
Key Findings at a Glance
- 12.5% higher completion rate in our internal tests compared to Sonnet 4.0
- 20% faster task completion due to fewer mistakes and better reasoning
- 18.2% cost savings through more efficient token usage
- Visibly higher quality outputs with improved instruction following
- Best-performing model we've tested to date on our benchmarks
Why This Matters for SuperNinja Users
SuperNinja's unique architecture demands exceptional performance from its underlying language model. Unlike traditional chatbots that handle simple queries, SuperNinja tackles complex, multi-stage workflows that can involve dozens or even hundreds of sequential decisions. Each task requires the model to plan strategically, execute precisely, verify results, and adapt dynamically when challenges arise.
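To make that loop concrete, here is a minimal sketch of a plan → execute → verify → adapt cycle. It is illustrative only: the class and method names are hypothetical and simplified, not SuperNinja's actual scaffold.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Step:
    description: str          # what the agent intends to do
    done: bool = False        # set once the step has been verified


@dataclass
class TaskState:
    goal: str
    plan: list[Step] = field(default_factory=list)
    artifacts: dict[str, object] = field(default_factory=dict)  # files, URLs, notes produced so far


class AgentModel(Protocol):
    """Hypothetical interface for the underlying LLM-driven agent."""
    def plan(self, goal: str) -> list[Step]: ...
    def execute(self, step: Step, state: TaskState) -> object: ...
    def verify(self, step: Step, result: object, state: TaskState) -> bool: ...
    def replan(self, state: TaskState, failed: Step, result: object) -> list[Step]: ...


def run_task(goal: str, model: AgentModel, max_steps: int = 100) -> TaskState:
    """Plan -> execute -> verify -> adapt loop for a single long-horizon task."""
    state = TaskState(goal=goal)
    state.plan = model.plan(goal)                           # 1. plan strategically
    for _ in range(max_steps):
        step = next((s for s in state.plan if not s.done), None)
        if step is None:
            break                                           # every step verified: task complete
        result = model.execute(step, state)                 # 2. execute precisely (tool calls)
        if model.verify(step, result, state):               # 3. verify the result
            step.done = True
            state.artifacts[step.description] = result
        else:
            state.plan = model.replan(state, step, result)  # 4. adapt when something fails
    return state
```

A stronger underlying model improves every stage of this loop at once, which is why gains compound over dozens or hundreds of sequential decisions.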
The improvements we're seeing with Sonnet 4.5 translate directly into tangible benefits for our users. Faster completion times mean you get results sooner. Higher completion rates mean fewer failed tasks and less frustration. Better quality outputs mean more polished, production-ready deliverables. And reduced token usage means lower costs without sacrificing capability.

Benchmark Testing
Phase 1: GAIA Smoke Test
We begin our model evaluation process with the GAIA benchmark—a challenging test of multi-step reasoning and tool use designed to measure real-world agent capabilities. Sonnet 4.5 achieved approximately 5% improvement in accuracy compared to Sonnet 3.7 and around 7% improvement compared to Sonnet 4.0 on this benchmark. This makes it the best-performing model we've tested to date on GAIA.
Phase 2: Internal Benchmark Suite
After passing the GAIA smoke test, we moved to our proprietary internal benchmark suite. Our analysis revealed that AgencyBench [1,2] closely matches the distribution of real customer queries we observe in production. Leveraging this alignment, we built our internal test suite on AgencyBench's structure and distribution, scaling it to include additional scenarios and defining multiple evaluation rubrics to capture nuanced performance dimensions. The following table shows the distribution of domains and categories in the dataset:

Sonnet 4.5 demonstrated a 12.5% higher completion rate compared to the previous state-of-the-art model (Sonnet 4.0), with outputs that were consistently more visually appealing and better aligned with user intent.
Performance improvements varied significantly by task type. In deep-research tasks—complex workflows requiring extensive information gathering and synthesis—Sonnet 4.5 achieved approximately 10% accuracy improvements over Sonnet 4.0. The gains were even more dramatic in coding agent scenarios, where accuracy increased from 80% to 96%, representing a 16 percentage point improvement.
Beyond accuracy, Sonnet 4.5 demonstrated superior efficiency. In 81% of test cases, the model required fewer or equal steps to complete tasks, indicating more direct problem-solving approaches and reduced computational overhead.
Real-World Performance: The Stock Analyzer Challenge
To demonstrate the practical impact of these improvements, we conducted a comprehensive real-world test using an identical prompt across multiple leading AI models. The task was complex and representative of the types of challenges SuperNinja users face daily:
"Build a web-based modern & professional stock analyzer for Mag7 with charts with forecasts. Give me suggestions with different risk factors on how to allocate $1M in order to double it in the next 6 months via Mag7 and provide rationale for it. Summarize the top latest news around each company and make sure all external links are working correctly. Think & add useful features to better learn & analyze for the web application. Build, test and then deploy a permanent link for it."
Comparative Results
Note: All models were tested with identical zero-shot prompts (no examples or fine-tuning). Links to view actual deployed results are provided below.
The Power of Parallel Tool Calling
One of the most exciting capabilities of Sonnet 4.5 is its support for parallel tool calling—a feature that was notably absent in previous versions. Our analysis shows that approximately 20% of SuperNinja tasks can benefit significantly from this capability. Parallel tool calling enables the model to execute multiple independent operations simultaneously rather than sequentially.
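To illustrate the pattern, here is a minimal sketch of how an agent can fan out independent tool calls concurrently. The tool names and handlers are hypothetical, and the dictionaries merely mirror the shape of tool_use and tool_result content blocks in Anthropic's Messages API; a real agent would pass the collected results back to the model in its next turn.

```python
import json
from concurrent.futures import ThreadPoolExecutor


# Hypothetical local handlers for two independent tools the model might call.
def web_search(query: str) -> str:
    return f"(search results for {query!r})"


def read_file(path: str) -> str:
    return f"(contents of {path})"


TOOL_HANDLERS = {"web_search": web_search, "read_file": read_file}


def run_tool_calls_in_parallel(tool_use_blocks: list[dict]) -> list[dict]:
    """Execute independent tool_use blocks concurrently and return tool_result blocks."""
    def run_one(block: dict) -> dict:
        handler = TOOL_HANDLERS[block["name"]]
        output = handler(**block["input"])
        return {
            "type": "tool_result",
            "tool_use_id": block["id"],
            "content": output,
        }

    with ThreadPoolExecutor(max_workers=len(tool_use_blocks) or 1) as pool:
        return list(pool.map(run_one, tool_use_blocks))


if __name__ == "__main__":
    # A single model turn that requests two independent operations at once.
    calls = [
        {"type": "tool_use", "id": "tu_1", "name": "web_search",
         "input": {"query": "AAPL latest earnings"}},
        {"type": "tool_use", "id": "tu_2", "name": "read_file",
         "input": {"path": "portfolio.csv"}},
    ]
    print(json.dumps(run_tool_calls_in_parallel(calls), indent=2))
```

When a task involves several independent lookups, builds, or file reads, executing them in one round trip instead of several is where the step-count and latency savings come from.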

Cost Efficiency: Doing More with Less
In addition to performance improvements, Sonnet 4.5 delivers meaningful cost savings. Our analysis shows approximately 15% reduction in overall costs when running SuperNinja tasks with Sonnet 4.5 compared to previous models. These savings come from multiple sources: reduced number of steps, lower error rates, and improved efficiency.
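As a rough, purely illustrative back-of-the-envelope calculation (the figures below are assumptions for the example, not measured SuperNinja data), fewer steps and leaner steps compound multiplicatively on the total token bill:

```python
# Purely illustrative figures (assumptions for this example, not measured SuperNinja data).
step_reduction = 0.10              # fewer agent steps from more direct problem solving
tokens_per_step_reduction = 0.06   # leaner steps from fewer retries and self-corrections

# Both effects apply to the same token bill, so they compound multiplicatively.
relative_cost = (1 - step_reduction) * (1 - tokens_per_step_reduction)
print(f"overall cost reduction: {1 - relative_cost:.1%}")   # -> 15.4%
```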
FAQs
Q1: What are the key performance improvements in Anthropic Sonnet 4.5 over previous models?
A: Anthropic Sonnet 4.5 achieves higher completion rates, faster and more accurate reasoning, and more efficient workflow execution compared to Sonnet 4.0 and Sonnet 3.7, as shown in SuperNinja’s benchmark testing.
Q2: How does Sonnet 4.5 enhance agentic capabilities and tool use for autonomous workflows?
A: Sonnet 4.5 introduces advanced parallel tool calling and improved context management, enabling agents to run multi-step tasks and leverage multiple tools simultaneously, resulting in better output quality and reliability in research, coding, and automation tasks.
Q3: What benchmark tests demonstrate Sonnet 4.5’s real-world advantages?
A: SuperNinja's benchmarking shows a 12.5% higher completion rate on our AgencyBench-based internal suite, our best GAIA score to date, and superior handling of deep-research workflows, with significantly fewer task failures and errors compared to competing models.
Q4: How does Sonnet 4.5 compare to other leading AI models in practical performance?
A: In side-by-side testing, Sonnet 4.5 required fewer steps, delivered higher-quality code and analysis, and was more cost-effective than models like GPT-5, Gemini 2.5 Pro, and open-source alternatives on complex tasks such as stock analyzers and agentic web applications.
Q5: What technical features and context window sizes does Sonnet 4.5 support for advanced use cases?
A: Sonnet 4.5 offers smart context window management with a context window of up to 1,000,000 tokens (in beta), persistent agent memory across sessions, extended autonomous operation, and up to 64K output tokens for complex programming and data analysis scenarios.
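For developers wiring these limits up themselves, here is a minimal sketch using the Anthropic Python SDK. The model alias and beta identifier below are assumptions and should be checked against Anthropic's current documentation.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Long-output request: max_tokens caps the completion length (64K here).
response = client.messages.create(
    model="claude-sonnet-4-5",          # assumed model alias; verify against current docs
    max_tokens=64000,
    messages=[{"role": "user", "content": "Generate the full stock-analyzer backend."}],
)
print(response.content[0].text)

# The 1M-token context window is a beta feature and requires an explicit opt-in;
# the beta identifier below is an assumption and may differ in current docs.
long_context = client.beta.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=64000,
    betas=["context-1m-2025-08-07"],
    messages=[{"role": "user", "content": "Summarize this very long research corpus."}],
)
```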
References and Further Reading
[1] AgencyBench: Benchmarking Agentic AI Systems - https://arxiv.org/abs/2509.17567
[2] AgencyBench Leaderboard - https://agencybench.opensii.ai/
[3] GAIA Benchmark - https://arxiv.org/abs/2311.12983
[4] SuperNinja Platform - https://super.myninja.ai/