At NinjaTech AI, we're constantly pushing the boundaries of what's possible with autonomous AI agents. SuperNinja, our advanced general agent platform, deploys a dedicated Cloud Computer (VM) for each task, enabling a complete cycle of Research → Build → Deploy for complex code, live dashboards, websites, and more. Our scaffold is specifically designed to leverage long-horizon tool calling, coding, and reasoning, including the multi-step information retrieval we call Deep Research.
Today, we're thrilled to share our comprehensive analysis of Anthropic's newly launched Sonnet 4.5 model as the core intelligence powering SuperNinja. After rigorous testing across our internal benchmarks and real-world customer scenarios, we can confidently say: Sonnet 4.5 is a magnificent beast that represents a significant step change for autonomous agent performance.
Key Findings at a Glance
- 12.5% higher completion rate in our internal tests compared to Sonnet 4.0
- 20% faster task completion due to fewer mistakes and better reasoning
- 18.2% cost savings through more efficient token usage
- Visibly higher quality outputs with improved instruction following
- Best-performing model we've tested to date on our benchmarks
Why This Matters for SuperNinja Users
SuperNinja's unique architecture demands exceptional performance from its underlying language model. Unlike traditional chatbots that handle simple queries, SuperNinja tackles complex, multi-stage workflows that can involve dozens or even hundreds of sequential decisions. Each task requires the model to plan strategically, execute precisely, verify results, and adapt dynamically when challenges arise.
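To make that loop concrete, here is a minimal sketch of a plan → execute → verify → adapt cycle. It is illustrative only: the class and method names are hypothetical and simplified, not SuperNinja's actual scaffold.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Step:
    description: str          # what the agent intends to do
    done: bool = False        # set once the step has been verified


@dataclass
class TaskState:
    goal: str
    plan: list[Step] = field(default_factory=list)
    artifacts: dict[str, object] = field(default_factory=dict)  # files, URLs, notes produced so far


class AgentModel(Protocol):
    """Hypothetical interface for the underlying LLM-driven agent."""
    def plan(self, goal: str) -> list[Step]: ...
    def execute(self, step: Step, state: TaskState) -> object: ...
    def verify(self, step: Step, result: object, state: TaskState) -> bool: ...
    def replan(self, state: TaskState, failed: Step, result: object) -> list[Step]: ...


def run_task(goal: str, model: AgentModel, max_steps: int = 100) -> TaskState:
    """Plan -> execute -> verify -> adapt loop for a single long-horizon task."""
    state = TaskState(goal=goal)
    state.plan = model.plan(goal)                           # 1. plan strategically
    for _ in range(max_steps):
        step = next((s for s in state.plan if not s.done), None)
        if step is None:
            break                                           # every step verified: task complete
        result = model.execute(step, state)                 # 2. execute precisely (tool calls)
        if model.verify(step, result, state):               # 3. verify the result
            step.done = True
            state.artifacts[step.description] = result
        else:
            state.plan = model.replan(state, step, result)  # 4. adapt when something fails
    return state
```

A stronger underlying model improves every stage of this loop at once, which is why gains compound over dozens or hundreds of sequential decisions.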
The improvements we're seeing with Sonnet 4.5 translate directly into tangible benefits for our users. Faster completion times mean you get results sooner. Higher completion rates mean fewer failed tasks and less frustration. Better quality outputs mean more polished, production-ready deliverables. And reduced token usage means lower costs without sacrificing capability.

Benchmark Testing
Phase 1: GAIA Smoke Test
We begin our model evaluation process with the GAIA benchmark—a challenging test of multi-step reasoning and tool use designed to measure real-world agent capabilities. Sonnet 4.5 achieved approximately 5% improvement in accuracy compared to Sonnet 3.7 and around 7% improvement compared to Sonnet 4.0 on this benchmark. This makes it the best-performing model we've tested to date on GAIA.
Phase 2: Internal Benchmark Suite
After passing the GAIA smoke test, we moved to our proprietary internal benchmark suite. Our analysis revealed that AgencyBench [1,2] closely matches the distribution of real customer queries we observe in production. Leveraging this alignment, we built our internal test suite on AgencyBench's structure and distribution, scaling it to include additional scenarios and defining multiple evaluation rubrics to capture nuanced performance dimensions. The following table shows the distribution of domains and categories in the dataset:

Sonnet 4.5 demonstrated a 12.5% higher completion rate compared to the previous state-of-the-art model (Sonnet 4.0), with outputs that were consistently more visually appealing and better aligned with user intent.
Performance improvements varied significantly by task type. In deep-research tasks—complex workflows requiring extensive information gathering and synthesis—Sonnet 4.5 achieved approximately 10% accuracy improvements over Sonnet 4.0. The gains were even more dramatic in coding agent scenarios, where accuracy increased from 80% to 96%, representing a 16 percentage point improvement.
Beyond accuracy, Sonnet 4.5 demonstrated superior efficiency. In 81% of test cases, the model required fewer or equal steps to complete tasks, indicating more direct problem-solving approaches and reduced computational overhead.
Real-World Performance: The Stock Analyzer Challenge
To demonstrate the practical impact of these improvements, we conducted a comprehensive real-world test using an identical prompt across multiple leading AI models. The task was complex and representative of the types of challenges SuperNinja users face daily:
"Build a web-based modern & professional stock analyzer for Mag7 with charts with forecasts. Give me suggestions with different risk factors on how to allocate $1M in order to double it in the next 6 months via Mag7 and provide rationale for it. Summarize the top latest news around each company and make sure all external links are working correctly. Think & add useful features to better learn & analyze for the web application. Build, test and then deploy a permanent link for it."
Comparative Results
Note: All models were tested with identical zero-shot prompts (no examples or fine-tuning). Links to view actual deployed results are provided below.
The Power of Parallel Tool Calling
One of the most exciting capabilities of Sonnet 4.5 is its support for parallel tool calling—a feature that was notably absent in previous versions. Our analysis shows that approximately 20% of SuperNinja tasks can benefit significantly from this capability. Parallel tool calling enables the model to execute multiple independent operations simultaneously rather than sequentially.
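To illustrate the pattern, here is a minimal sketch of how an agent can fan out independent tool calls concurrently. The tool names and handlers are hypothetical, and the dictionaries merely mirror the shape of tool_use and tool_result content blocks in Anthropic's Messages API; a real agent would pass the collected results back to the model in its next turn.

```python
import json
from concurrent.futures import ThreadPoolExecutor


# Hypothetical local handlers for two independent tools the model might call.
def web_search(query: str) -> str:
    return f"(search results for {query!r})"


def read_file(path: str) -> str:
    return f"(contents of {path})"


TOOL_HANDLERS = {"web_search": web_search, "read_file": read_file}


def run_tool_calls_in_parallel(tool_use_blocks: list[dict]) -> list[dict]:
    """Execute independent tool_use blocks concurrently and return tool_result blocks."""
    def run_one(block: dict) -> dict:
        handler = TOOL_HANDLERS[block["name"]]
        output = handler(**block["input"])
        return {
            "type": "tool_result",
            "tool_use_id": block["id"],
            "content": output,
        }

    with ThreadPoolExecutor(max_workers=len(tool_use_blocks) or 1) as pool:
        return list(pool.map(run_one, tool_use_blocks))


if __name__ == "__main__":
    # A single model turn that requests two independent operations at once.
    calls = [
        {"type": "tool_use", "id": "tu_1", "name": "web_search",
         "input": {"query": "AAPL latest earnings"}},
        {"type": "tool_use", "id": "tu_2", "name": "read_file",
         "input": {"path": "portfolio.csv"}},
    ]
    print(json.dumps(run_tool_calls_in_parallel(calls), indent=2))
```

When a task involves several independent lookups, builds, or file reads, executing them in one round trip instead of several is where the step-count and latency savings come from.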

Cost Efficiency: Doing More with Less
In addition to performance improvements, Sonnet 4.5 delivers meaningful cost savings. Our analysis shows approximately 15% reduction in overall costs when running SuperNinja tasks with Sonnet 4.5 compared to previous models. These savings come from multiple sources: reduced number of steps, lower error rates, and improved efficiency.
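As a rough, purely illustrative back-of-the-envelope calculation (the figures below are assumptions for the example, not measured SuperNinja data), fewer steps and leaner steps compound multiplicatively on the total token bill:

```python
# Purely illustrative figures (assumptions for this example, not measured SuperNinja data).
step_reduction = 0.10              # fewer agent steps from more direct problem solving
tokens_per_step_reduction = 0.06   # leaner steps from fewer retries and self-corrections

# Both effects apply to the same token bill, so they compound multiplicatively.
relative_cost = (1 - step_reduction) * (1 - tokens_per_step_reduction)
print(f"overall cost reduction: {1 - relative_cost:.1%}")   # -> 15.4%
```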
FAQs
Q1: What are the key performance improvements in Anthropic Sonnet 4.5 over previous models?
A: Anthropic Sonnet 4.5 achieves higher completion rates, faster and more accurate reasoning, and more efficient workflow execution compared to Sonnet 4.0 and Sonnet 3.7, as shown in SuperNinja’s benchmark testing.
Q2: How does Sonnet 4.5 enhance agentic capabilities and tool use for autonomous workflows?
A: Sonnet 4.5 introduces advanced parallel tool calling and improved context management, enabling agents to run multi-step tasks and leverage multiple tools simultaneously, resulting in better output quality and reliability in research, coding, and automation tasks.
Q3: What benchmark tests demonstrate Sonnet 4.5’s real-world advantages?
A: SuperNinja's benchmarking shows a 12.5% higher completion rate on our AgencyBench-based internal suite, our best GAIA score to date, and superior handling of deep-research workflows, with significantly fewer task failures and errors compared to competing models.
Q4: How does Sonnet 4.5 compare to other leading AI models in practical performance?
A: In side-by-side testing, Sonnet 4.5 required fewer steps, delivered higher-quality code and analysis, and was more cost-effective than models like GPT-5, Gemini 2.5 Pro, and open-source alternatives on complex tasks such as stock analyzers and agentic web applications.
Q5: What technical features and context window sizes does Sonnet 4.5 support for advanced use cases?
A: Sonnet 4.5 offers smart context window management with a context window of up to 1,000,000 tokens (in beta), persistent agent memory across sessions, extended autonomous operation, and up to 64K output tokens for complex programming and data analysis scenarios.
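For developers wiring these limits up themselves, here is a minimal sketch using the Anthropic Python SDK. The model alias and beta identifier below are assumptions and should be checked against Anthropic's current documentation.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Long-output request: max_tokens caps the completion length (64K here).
response = client.messages.create(
    model="claude-sonnet-4-5",          # assumed model alias; verify against current docs
    max_tokens=64000,
    messages=[{"role": "user", "content": "Generate the full stock-analyzer backend."}],
)
print(response.content[0].text)

# The 1M-token context window is a beta feature and requires an explicit opt-in;
# the beta identifier below is an assumption and may differ in current docs.
long_context = client.beta.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=64000,
    betas=["context-1m-2025-08-07"],
    messages=[{"role": "user", "content": "Summarize this very long research corpus."}],
)
```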
References and Further Reading
[1] AgencyBench: Benchmarking Agentic AI Systems - https://arxiv.org/abs/2509.17567
[2] AgencyBench Leaderboard - https://agencybench.opensii.ai/
[3] GAIA Benchmark - https://arxiv.org/abs/2311.12983
[4] SuperNinja Platform - https://super.myninja.ai/