Anthropic has released an upgraded version of its AI model, Claude 3.5 Sonnet, with several significant improvements. The updated model is stronger at coding and reasoning and raises the bar on a range of tasks.

Additionally, Claude 3.5 Sonnet now has a new capability: learning to interact with computers the way a person would. This step forward in automation and task management opens up new possibilities for businesses and individuals.

Let's take a closer look at what the new Claude 3.5 Sonnet can do and how it could benefit your daily workflow.

Claude 3.5 Sonnet Upgrade

The Claude 3.5 Sonnet model has received an impressive upgrade, bringing improvements across the board. One area where it really shines is coding performance.

  • On industry benchmarks like SWE-bench Verified, Claude 3.5 Sonnet now scores 49%, outperforming all other publicly available models. This includes reasoning models like OpenAI o1-preview and specialized systems built for agentic coding.

  • On TAU-bench, a benchmark for agentic tool use, it has improved task performance from 62.6% to 69.2% in the retail domain and from 36.0% to 46.0% in the more challenging airline domain.

  • Despite these gains, the upgraded model maintains the same affordable price and fast speed as the previous version.
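
To put the TAU-bench figures above in perspective, here is a small sketch that turns the quoted before-and-after scores into percentage-point and relative gains. The `gains` helper and the dictionary layout are illustrative choices made for this example; only the four scores come from the article.

```python
from typing import Tuple

# Scores quoted in the article as (before, after), in percent.
tau_bench = {
    "retail": (62.6, 69.2),
    "airline": (36.0, 46.0),
}


def gains(before: float, after: float) -> Tuple[float, float]:
    """Return (percentage-point gain, relative gain in percent), each rounded to 1 decimal."""
    points = after - before
    relative = points / before * 100
    return round(points, 1), round(relative, 1)


for domain, (before, after) in tau_bench.items():
    points, relative = gains(before, after)
    print(f"{domain}: +{points} points ({relative}% relative)")
```

The retail domain gains 6.6 percentage points and the airline domain gains 10.0, so the harder airline tasks saw the larger improvement both in absolute and in relative terms.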

Early feedback from customers has been very positive.

  • GitLab tested the model for DevSecOps tasks and found it delivered stronger reasoning (up to 10% better across use cases) with no added wait time.

  • Cognition uses Claude 3.5 Sonnet for autonomous AI testing and saw big improvements in coding, planning, and problem-solving.

  • The Browser Company noted the model beat every other one they've tried for automating web workflows.

To ensure responsible deployment, the new Claude 3.5 Sonnet underwent joint pre-release testing by the US AI Safety Institute and the UK AI Safety Institute. Anthropic also evaluated the model for catastrophic risks and determined that the ASL-2 Standard from its Responsible Scaling Policy remains appropriate.

New Claude 3.5 Haiku Model

The new Claude 3.5 Haiku model is an impressive feat, delivering the same high performance as the much larger Claude 3 Opus at the cost and speed of the previous Claude 3 Haiku.

  • On many intelligence benchmarks, Claude 3.5 Haiku actually surpasses what Claude 3 Opus could do.

  • The model particularly excels at coding tasks. On the SWE-bench Verified test, it scores 40.6%, beating out many other state-of-the-art models like the original Claude 3.5 Sonnet and GPT-4o.

Thanks to its fast speed, low cost, and strong instruction following, Claude 3.5 Haiku is a great fit for a wide range of applications:

  • Powering user-facing products

  • Handling specialized sub-tasks as part of a larger AI system

  • Generating personalized experiences from huge datasets like purchase history, pricing, or inventory

Anthropic plans to make Claude 3.5 Haiku available later this month through several channels. The model will start out as text-only, with image input capabilities to be added later.

Comparisons to Other AI Models and Providers

When it comes to AI models, how do Anthropic's latest offerings stack up against the competition? In short, the new Claude 3.5 models are leading the pack on a variety of tests that measure different aspects of intelligence.

  • For complex reasoning at the graduate student level, Claude 3.5 Sonnet outdoes models like GPT-4o, with the Gemini models also putting in a strong showing.

  • On writing computer code, both Claude 3.5 Sonnet and Haiku outperform GPT-4o.

  • For math problem-solving, Claude 3.5 Sonnet comes out ahead of the GPT-4o models, though the Gemini models have the edge here.

  • On answering questions about images, Claude 3.5 Sonnet is neck-and-neck with GPT-4o and trails just slightly behind Gemini.

  • For using code to complete open-ended tasks, the new Claude models show major improvements over their predecessors, with Claude 3.5 Sonnet in the lead.

While it's tricky to compare Anthropic's models directly to those from OpenAI due to some fundamental differences, overall the new Claude 3.5 models are demonstrating industry-leading performance across the board.

They're showing that Anthropic is at the forefront of creating AI that can reason, analyze, and problem-solve at an extremely high level.

Teaching Claude to Use Computers Like a Person

Anthropic is taking a groundbreaking approach to expand Claude's capabilities: teaching it general computer skills, just like a person would learn, instead of building narrow tools for specific tasks.

  • Through a new API, Claude can now perceive and interact with computer interfaces.

  • Developers can integrate this API so Claude can turn high-level instructions (e.g., "fill out this form using data from my computer and the web") into a series of computer actions (opening files and browsers, navigating pages, entering data, etc.).

This opens up exciting possibilities for developers to automate repetitive workflows, build and test software, and conduct open-ended research tasks.
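
The idea of turning one high-level instruction into a series of computer actions can be pictured as a simple observe-and-act loop. The sketch below is only an illustration of that pattern: the `Action` schema, `run_agent`, `scripted_model`, and `fake_executor` are all hypothetical names invented for this example, not Anthropic's API, and the "model" simply replays a fixed plan so the code runs without any network access or real desktop.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    """One low-level computer action (hypothetical schema, not Anthropic's)."""
    kind: str          # e.g. "screenshot", "click", "type", "done"
    detail: str = ""


def run_agent(instruction: str,
              choose_action: Callable[[str, List[str]], Action],
              execute: Callable[[Action], str],
              max_steps: int = 10) -> List[Action]:
    """Ask for the next action, execute it, feed the observation back, repeat."""
    observations: List[str] = []
    history: List[Action] = []
    for _ in range(max_steps):          # hard cap so the loop always terminates
        action = choose_action(instruction, observations)
        history.append(action)
        if action.kind == "done":
            break
        observations.append(execute(action))
    return history


# Stand-ins so the sketch runs without a model or a real screen.
SCRIPT = [
    Action("screenshot"),
    Action("click", "Name field"),
    Action("type", "Ada Lovelace"),
    Action("click", "Submit"),
    Action("done"),
]


def scripted_model(instruction: str, observations: List[str]) -> Action:
    # A real model would inspect the latest screenshot; here we replay a fixed plan.
    return SCRIPT[min(len(observations), len(SCRIPT) - 1)]


def fake_executor(action: Action) -> str:
    # Pretend to perform the action and describe the resulting screen.
    return f"performed {action.kind} {action.detail}".strip()


if __name__ == "__main__":
    for step in run_agent("Fill out the form", scripted_model, fake_executor):
        print(step.kind, step.detail)
```

In a real integration, `choose_action` would be a call to the model and `execute` would drive an actual screen, keyboard, and mouse. The `max_steps` cap is a design choice of this sketch that keeps a confused agent from running indefinitely.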

Early results are promising: on the OSWorld benchmark, which tests AI models' ability to use computers the way humans do, Claude 3.5 Sonnet scored 14.9% using screenshots only, beating the next-best system's 7.8%. With more steps allowed, Claude reached 22%.

However, Claude's computer use skills are still maturing. Actions that are easy for people, like scrolling and zooming, can be tricky for Claude. Anthropic recommends starting with low-risk tasks while the technology develops.

They're also proactively implementing safety measures, like classifiers to detect potential misuse, as computer use could open up new avenues for threats like spam or fraud.

Use Cases for Claude's Computer Use Skills

Anthropic has been hard at work testing Claude's new computer use capability. To showcase the potential of this feature, the Anthropic team has created a series of demo videos highlighting three key use cases: automating operations, coding, and orchestrating tasks. Let's take a closer look at each one.

Claude's Computer Use for Automating Operations

In this demo, Sam from the Anthropic research team shows how Claude can automate the tedious task of filling out a vendor request form. With information scattered across a spreadsheet and CRM, Claude seamlessly navigates between the two, finds the relevant data, and transfers it to the form—all without human intervention. This example illustrates how Claude can take over the drudge work that eats up valuable time in many businesses.

Claude's Computer Use for Coding

Next up, Alex from developer relations demonstrates Claude's coding possibilities. Starting with a simple prompt to create a 90s-themed homepage, Claude generates the code within the claude.ai interface. But the real magic happens when Alex asks Claude to download the file, open it in VS Code, start a server, identify and fix errors, and rerun the website—all through natural language commands. While it takes a few prompts, this demo hints at a future where Claude could handle such coding tasks end-to-end.

Claude's Computer Use for Orchestrating Tasks

Finally, Pujaa from the research team showcases a more everyday use case: planning a sunrise hike with a friend. By asking Claude to find a good viewing spot, calculate travel time, look up the sunrise time, and create a calendar invite, Pujaa is able to offload all the logistical legwork to the AI assistant. This simple example illustrates the potential for Claude to serve as a personal organizer and task manager.

During the recording of these demos, Claude accidentally stopped a screen recording and even got sidetracked by photos of Yellowstone. But the Anthropic team sees these errors as part of the learning process.

They're actively inviting developer feedback to rapidly improve the speed, reliability, and usefulness of computer use, while also working closely with their safety teams to implement appropriate safeguards.

As computer use matures, Anthropic believes it will become an indispensable tool for businesses and individuals alike.

Conclusion

The introduction of Claude 3.5 Sonnet and the revolutionary computer use capability marks a significant milestone for Anthropic and the field of AI. The upgraded Claude 3.5 Sonnet and the new affordable, high-speed Claude 3.5 Haiku demonstrate remarkable performance across various benchmarks, often outpacing industry leaders.

The groundbreaking computer use feature allows Claude to perceive and interact with interfaces, opening up exciting possibilities for automation, coding, and task management. Early demos showcase the potential for businesses and individuals while also highlighting areas for improvement.

As Anthropic actively seeks developer feedback and collaborates with its safety teams, it remains committed to responsibly delivering transformative AI capabilities. With Claude 3.5 Sonnet, Anthropic is pushing the boundaries of what's possible, paving the way for AI to become an indispensable part of our daily lives and work.

FAQs

1. What is Claude's computer use capability?

Claude's computer use capability allows it to interact with computer interfaces like a human, performing tasks such as filling forms, navigating websites, and automating workflows through natural language commands.

2. How safe is Claude's computer use feature?

Claude's computer use feature includes built-in safety measures and classifiers to detect potential misuse, with Anthropic recommending starting with low-risk tasks while the technology develops.

3. Is Claude 3.5 Haiku faster than Claude 3 Opus?

Yes, Claude 3.5 Haiku delivers the same high performance as Claude 3 Opus but operates at faster speeds and lower costs, making it ideal for everyday tasks.

4. Can Claude 3.5 Sonnet write and debug code?

Yes, Claude 3.5 Sonnet can write, test, and debug code. It can generate code within its interface, identify errors, fix bugs, and manage development tasks through simple text instructions.