How Claude Opus 4.5 Outscored Humans at Software Engineering

Anthropic launches Claude Opus 4.5, calling it the best coding AI yet, with agentic systems, safety controls and strong performance across real-world tools

Anthropic unveils its latest and most capable AI system: Claude Opus 4.5. 

Built to outperform human engineers, manage multi-agent systems and operate safely across common computing tasks, the model is described by the company as “the best model in the world for coding, agents and computer use”.

In a technical assessment usually reserved for top engineering candidates, Claude Opus 4.5 doesn’t just pass – it scores higher than any human has managed in the same test.

According to Anthropic, this benchmark places it ahead of other AI models in software engineering, showing superior performance across code generation, reasoning and tool use.

Beyond coding, Claude Opus 4.5 demonstrates strong mathematical and creative problem-solving skills, making it useful in situations that demand both precision and invention. 

It also excels at agentic AI – systems in which it coordinates multiple AI ‘sub-agents’ to carry out longer or more complex tasks.
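
As a rough illustration of that coordination pattern, the sketch below has one orchestrator hand subtasks to named sub-agents and collect their results. It is not Anthropic's implementation: the `SubAgent` class, the `orchestrate` function and the handlers are hypothetical, and plain Python functions stand in for model calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SubAgent:
    """Hypothetical sub-agent: a name plus a handler standing in for a model call."""
    name: str
    handle: Callable[[str], str]


def orchestrate(subtasks: Dict[str, str], agents: List[SubAgent]) -> Dict[str, str]:
    """Send each subtask to the sub-agent with the matching name and collect results."""
    by_name = {agent.name: agent for agent in agents}
    results = {}
    for agent_name, subtask in subtasks.items():
        if agent_name not in by_name:
            raise KeyError(f"no sub-agent registered for {agent_name!r}")
        results[agent_name] = by_name[agent_name].handle(subtask)
    return results


# Stand-in handlers; a real system would call a model here.
agents = [
    SubAgent("research", lambda task: f"notes on {task}"),
    SubAgent("writer", lambda task: f"draft of {task}"),
]

results = orchestrate(
    {"research": "API changes", "writer": "migration guide"},
    agents,
)
print(results)
```

Keeping the routing in one place means the orchestrator decides who does what, while each sub-agent only sees its own subtask.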

Claude Opus 4.5 significantly outperforms prominent AI models in software engineering tasks | Credit: Anthropic

Solving problems and working with tools

Claude Opus 4.5 is designed to work across real-world applications. 

When acting as a support agent in a test scenario, it responds to a distressed customer by finding an unintended workaround. 

Although the benchmark flags this as technically incorrect, it highlights the model’s creative ability to navigate rules in ways a human might.

This form of solution-seeking behaviour is sometimes called ‘reward hacking’ – a term in AI safety for when models optimise for a goal by bending or subverting the rules. 

But Anthropic states that this is the most “robustly aligned” model the company has released.

Claude Opus 4.5 records low scores for unsafe or concerning behaviour. This covers both the model’s own autonomous actions and misuse arising from its interactions with humans.

The model displays creative problem-solving ability by finding a loophole to help the customer. Although the benchmark flagged this as a failure because the model did not technically perform as intended, the scenario illustrates Claude Opus 4.5's capability to find innovative solutions to real-world problems | Credit: Anthropic

It also shows a strong resistance to prompt injection attacks – a technique used to manipulate AI outputs through malicious inputs.

The model’s capabilities are already being integrated into multiple products. 

Claude for Chrome is a browser-based agent that can perform tasks across tabs, such as clicking buttons, filling forms and summarising online meetings.

It operates from a user prompt, running in the background to complete simple or repetitive tasks.

Claude for Excel takes advantage of Opus 4.5’s improvements in handling structured data, letting users work more efficiently with complex spreadsheets. 

Both tools rely on Claude’s ability to interpret user requests clearly and follow through with minimal friction.

Anthropic says that Claude Opus 4.5 is the most robustly aligned model it has released to date | Credit: Anthropic

Smarter code with less trial and error

The Claude Developer Platform benefits from upgrades as well. Claude Code – the part of the platform focused on software generation – now asks questions before starting work. 

This allows users to shape the model’s understanding early, reviewing the plan before execution starts.

The aim is to reduce the number of cycles needed to reach the desired result. Claude Code builds a plan, checks in with the user, then produces output tailored to the specification. 
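
The plan, check-in and execute loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how Claude Code works internally: `make_plan` and `run_with_approval` are hypothetical names, the plan is a fixed list rather than model output, and a callback stands in for the user reviewing the plan.

```python
from typing import Callable, List


def make_plan(request: str) -> List[str]:
    """Hypothetical planner: a fixed three-step plan stands in for model output."""
    return [
        f"Clarify requirements for: {request}",
        f"Implement: {request}",
        f"Test: {request}",
    ]


def run_with_approval(request: str, approve: Callable[[List[str]], bool]) -> List[str]:
    """Build a plan, ask the user to approve it, and only then execute the steps."""
    plan = make_plan(request)
    if not approve(plan):
        return []  # Nothing runs until the plan is accepted.
    return [f"done: {step}" for step in plan]


# The approval callback stands in for a person reviewing the plan.
accepted = run_with_approval("add CSV export", approve=lambda plan: True)
rejected = run_with_approval("add CSV export", approve=lambda plan: False)
print(accepted)
print(rejected)
```

Because execution is gated on approval, a rejected plan costs only the planning step, which is the saving in cycles the approach aims for.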

Rahul Patil, CTO at Anthropic, says that Claude Opus 4.5 can bring 20% accuracy improvements in financial modelling

According to Rahul Patil, Chief Technology Officer at Anthropic, this step reduces wasted effort and improves the precision of the final code.

Rahul says: “Claude Opus 4.5 is powerful enough for Rakuten's agents to autonomously refine themselves in four iterations.

“It’s precise enough for 20% accuracy improvements in financial modelling.”

Rahul adds that the model handles front-end design more effectively and that the platform now supports agents that run for longer periods.

Claude Code is also available as a desktop application in research preview, and an updated ‘Plan Mode’ feature is being launched to help users create step-by-step instructions more easily.


Adoption and early feedback

Early adopters include development platforms such as GitHub, Cursor, Replit and Windsurf.

Mario Rodriguez, Chief Product Officer at GitHub, says Claude Opus 4.5 integrates well with GitHub Copilot and offers strong support for agent-based workflows.

He says: “Claude Opus 4.5 delivers high-quality code and excels at powering heavy-duty agentic workflows with GitHub Copilot. 

“Early testing shows it surpasses internal coding benchmarks while cutting token usage in half – and is especially well-suited for tasks like code migration and code refactoring.”

Anthropic positions Claude Opus 4.5 not just as a better AI coder, but as a practical, safe and adaptable thinking partner for a growing set of real-world applications. 

Its performance across engineering, modelling, tool use and coordination marks another step in the shift towards intelligent agents that work across systems with minimal oversight.
