Anthropic’s Claude Opus 4.5 Sets New Coding Benchmark

Claude Opus 4.5, which the company says is currently the best model in the world for coding, agents and computer use | Credit: Anthropic
Anthropic's Claude Opus 4.5 shows strong gains in software engineering and agentic AI, beating every human candidate in a tough internal coding test

Anthropic has released its latest artificial intelligence model, Claude Opus 4.5.

According to Anthropic, the model is its most token-efficient, safest and most robustly aligned to date, with advanced capabilities in coding and agentic AI.

It is designed to handle a wide range of everyday tasks and tools.

Anthropic reports that Claude Opus 4.5 is currently the leading model for coding, agents and computer use.

In a difficult engineering test that Anthropic gives to prospective engineering candidates, Opus 4.5 reportedly scored higher than any human candidate has.

The result suggests that AI models may now match strong human engineers on at least some software engineering tasks.

Claude Opus 4.5 significantly outperforms prominent AI models in software engineering tasks | Credit: Anthropic

Advanced coding and agentic capabilities

Beyond software engineering, the model shows proficiency in mathematical reasoning and creative problem-solving.

Claude Opus 4.5 is also skilled at managing a team of sub-agents, enabling more effective coordination within multi-agent systems.

On long-running tasks, this approach could help it meet user objectives more quickly and more accurately.

In one test scenario, the model was tasked with assisting a distressed customer as an airline service agent.

The model identified a creative loophole to help the customer. While the benchmark flagged this as a failure because the model did not perform as technically intended, this scenario highlights Claude Opus 4.5’s potential for finding innovative solutions to real-world problems.

The model displays creative problem-solving ability by finding a loophole to help the customer. Although the benchmark flagged this as a failure because the model did not technically perform as intended, the scenario shows Claude Opus 4.5's capability to find innovative solutions to real-world problems | Credit: Anthropic

Model safety and AI alignment

An AI model's ability to work around established rules could, in some contexts, be viewed as ‘reward hacking’, where an AI ‘games the rules’ to achieve a reward.

This behaviour raises questions regarding the AI alignment problem. However, Anthropic states that Claude Opus 4.5 is its most robustly aligned model to date.

Anthropic says that Claude Opus 4.5 is the most robustly aligned model it has released to date | Credit: Anthropic

The model exhibits low scores for concerning behaviours, which include both undesirable actions taken by the model and cooperation with human misuse.

According to Anthropic’s findings, Claude Opus 4.5 is also more resilient to prompt injection attacks than most other prominent AI models. This suggests the model is harder to manipulate through malicious instructions hidden in the content it processes.

Integration and developer tools

Anthropic has also introduced Claude for Chrome, a browser extension that lets Claude handle tasks across multiple browser tabs.

It operates in the background to autonomously perform user-prompted tasks such as clicking buttons, filling forms and summarising meetings.

Additionally, Claude for Excel leverages the performance of Opus 4.5 to help users work with complex spreadsheets more easily.

The Opus 4.5 release also brings upgrades to the Claude Developer Platform and improves Claude Code.

In Claude Code, the model asks clarifying questions and drafts an execution plan that users can review and edit before work begins.

Rahul Patil, CTO at Anthropic, says that Claude Opus 4.5 can bring 20% accuracy improvements in financial modelling

Rahul Patil, Chief Technology Officer at Anthropic, says he is excited to see what developers build next.

“GitHub, Cursor, Replit and Windsurf are already integrating it,” Patil explains.

“Claude Opus 4.5 is powerful enough for Rakuten's agents to autonomously refine themselves in 4 iterations and precise enough for 20% accuracy improvements in financial modelling. The model is also much better at frontend design. Our platform now supports longer-running agents. Claude Code is available on desktop (research preview), and we’re also launching an updated Plan Mode.”

Mario Rodriguez, CPO at GitHub

Mario Rodriguez, Chief Product Officer at GitHub, says: “Claude Opus 4.5 delivers high-quality code and excels at powering heavy-duty agentic workflows with GitHub Copilot. Early testing shows it surpasses internal coding benchmarks while cutting token usage in half – and is especially well-suited for tasks like code migration and code refactoring.”
