How Claude Opus 4.5 Outscored Humans at Software Engineering
Anthropic unveils its latest and most capable AI system: Claude Opus 4.5.
Built to outperform human engineers, manage multi-agent systems and operate safely across common computing tasks, the model is described by the company as “the best model in the world for coding, agents and computer use”.
In a take-home exam that Anthropic gives to prospective performance engineering candidates, Claude Opus 4.5 doesn’t just pass – within the two-hour time limit, it scores higher than any human candidate has managed on the same test.
According to Anthropic, this benchmark places it ahead of other AI models in software engineering, showing superior performance across code generation, reasoning and tool use.
Beyond coding, Claude Opus 4.5 demonstrates strong mathematical and creative problem-solving skills, making it useful in situations that demand both precision and invention.
It also excels at agentic tasks, in which it coordinates multiple AI ‘sub-agents’ to carry out longer or more complex work.
Solving problems and working with tools
Claude Opus 4.5 is designed to work across real-world applications.
When acting as a support agent in a test scenario, it responds to a distressed customer by finding an unintended workaround.
Although the benchmark flags this as technically incorrect, it highlights the model’s creative ability to navigate rules in ways a human might.
This form of solution-seeking behaviour is sometimes called ‘reward hacking’ – a term in AI safety for when models optimise for a goal by bending or subverting the rules.
But Anthropic states that this is the most “robustly aligned” model the company has released.
Claude Opus 4.5 records low scores for unsafe or concerning behaviour. That covers both actions the model takes on its own initiative and its willingness to go along with misuse by human users.
It also shows strong resistance to prompt injection attacks, in which malicious instructions planted in the content a model reads are used to manipulate its output.
The model’s capabilities are already being integrated into multiple products.
Claude for Chrome is a browser-based agent that can perform tasks across tabs, such as clicking buttons, filling forms and summarising online meetings.
It operates from a user prompt, running in the background to complete simple or repetitive tasks.
Claude for Excel takes advantage of Opus 4.5’s improvements in handling structured data, letting users work more efficiently with complex spreadsheets.
Both tools rely on Claude’s ability to interpret user requests clearly and follow through with minimal friction.
Smarter code with less trial and error
The Claude Developer Platform benefits from upgrades as well. Claude Code – the part of the platform focused on software generation – now asks questions before starting work.
This allows users to shape the model’s understanding early, reviewing the plan before execution starts.
The aim is to reduce the number of cycles needed to reach the desired result. Claude Code builds a plan, checks in with the user, then produces output tailored to the specification.
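The plan, check-in and execute sequence described above can be illustrated with a minimal sketch. It is not Claude Code's implementation: the names `Plan`, `draft_plan` and `run_with_approval` are hypothetical, and the planner is a hard-coded stand-in for what a model would generate.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Plan:
    """Hypothetical plan: the task plus the ordered steps proposed for it."""
    task: str
    steps: List[str] = field(default_factory=list)


def draft_plan(task: str, answers: Dict[str, str]) -> Plan:
    """Stand-in planner; a real agent would ask a model to write these steps.

    `answers` holds the user's replies to up-front clarifying questions.
    """
    language = answers.get("language", "an unspecified language")
    return Plan(task, [
        f"Clarify requirements for: {task}",
        f"Write the implementation in {language}",
        "Run the tests and report the result",
    ])


def run_with_approval(
    task: str,
    answers: Dict[str, str],
    approve: Callable[[Plan], bool],
    execute: Callable[[str], str],
) -> List[str]:
    """Draft a plan, show it to the user, and execute only if it is approved."""
    plan = draft_plan(task, answers)
    if not approve(plan):
        return []  # nothing runs without the user's sign-off
    return [execute(step) for step in plan.steps]
```

Calling `run_with_approval("parse a CSV file", {"language": "Python"}, approve, execute)` with an `approve` callback that returns `False` leaves every step unexecuted, which is the point of the check-in: a misunderstanding is caught before any work is done rather than after.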
According to Rahul Patil, Chief Technology Officer at Anthropic, this step reduces wasted effort and improves the precision of the final code.
Rahul says: “Claude Opus 4.5 is powerful enough for Rakuten's agents to autonomously refine themselves in four iterations.
“It’s precise enough for 20% accuracy improvements in financial modelling.”
Rahul adds that the model handles front-end design more effectively and that the platform now supports agents that run for longer periods.
Claude Code is also available as a desktop application in research preview, and an updated ‘Plan Mode’ feature is being launched to help users create step-by-step instructions more easily.
Adoption and early feedback
Early adopters include development platforms such as GitHub, Cursor, Replit and Windsurf.
Mario Rodriguez, Chief Product Officer at GitHub, says Claude Opus 4.5 integrates well with GitHub Copilot and offers strong support for agent-based workflows.
He says: “Claude Opus 4.5 delivers high-quality code and excels at powering heavy-duty agentic workflows with GitHub Copilot.
“Early testing shows it surpasses internal coding benchmarks while cutting token usage in half – and is especially well-suited for tasks like code migration and code refactoring.”
Anthropic positions Claude Opus 4.5 not just as a better AI coder, but as a practical, safe and adaptable thinking partner for a growing set of real-world applications.
Its performance across engineering, modelling, tool use and coordination marks another step in the shift towards intelligent agents that work across systems with minimal oversight.

