I use chatbots like ChatGPT and Claude almost daily to answer quick questions, help me solve problems, fix terrible code, and figure out the word that’s on the tip of my tongue. But one of the big downsides to current AI chatbots is that they’re largely limited to their conversational interface.
Claude computer use and ChatGPT Operator promise to change that.
They use a combination of the built-in language model, screenshots, and a virtual machine to mimic how humans use computers, effectively controlling your computer (with your permission). While they’re still far from fully autonomous, they’re the first real move toward accessible, general-use AI agents that can act independently.
Here’s what you need to know.
Why are Claude computer use and OpenAI’s Operator a big deal?
AI computer agents like Claude computer use and OpenAI’s Operator (which runs on the new Computer-Using Agent [CUA] model) are becoming more prominent. To see how big a deal these advances are, it helps to understand what things look like without AI agents.
Aside from the main chatbot function, almost every feature of an AI chatbot relies on APIs. These can be built by the chatbot’s developers, as is the case with stuff like ChatGPT Search, or third-party developers, using tools like custom GPTs.
For example, Kayak, a travel booking service, has a custom GPT that you can try for yourself. It’s fairly barebones. It uses ChatGPT to pull the relevant details from your prompt, send them over to Kayak using the API, and then display the results. It works, but it’s not very flexible, and I can’t ask ChatGPT to check a different flight comparison site instead—or even see what price I’d get by booking directly from the airline.
There are a couple of other downsides to AI tools relying exclusively on APIs. For a start, the site or service you’re trying to access has to have an API, and that API has to offer all the features you want. While I can view flights through Kayak’s GPT, I can’t get it to actually book a flight or change my account email address or do countless other things that I can do through the website.
Having AI computer agents that can browse any website, use any app, and work with any file would be an amazing step up. You could, say, have your AI agent search and price a trip on Kayak for three different weekends and tell you which is cheapest. It could perhaps even book the trip for you, though that goes far beyond what the current AI computer agents can be trusted to do.
How do AI computer agents work?
AI computer agents pull together a few recent advances in AI, including multimodal models that can understand more than just text and reasoning models that can solve more complicated problems.
Here’s how they work:
- They use screenshots to look at a computer screen and understand what’s happening.
- They break up complex instructions into a series of logical steps, try them out, and self-correct if things don’t work as expected.
- They’re able to use a virtual mouse and keyboard to navigate a normal user interface in a virtual machine.
This breaks down into a simple and repeatable AI workflow:
- Take a screenshot.
- Decide on the next computer action that gets closer to the goal.
- Execute the action.
- Take a screenshot.
- Decide on the next computer action that gets closer to the goal.
- Execute the action.
- Repeat until you reach the goal.
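To make that loop concrete, here’s a minimal Python sketch. It isn’t how Claude or Operator is actually implemented: `run_agent`, `Action`, `FakeDesktop`, and `scripted_decide` are all made-up names, the "screen" is just a string, and a scripted function stands in for the model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    kind: str          # hypothetical action types: "type" or "done"
    detail: str = ""   # e.g. the character to type


def run_agent(
    goal: str,
    take_screenshot: Callable[[], str],
    decide: Callable[[str, str, List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 50,
) -> List[Action]:
    """Screenshot -> decide -> execute, repeated until done or out of steps."""
    history: List[Action] = []
    for _ in range(max_steps):
        screen = take_screenshot()               # 1. take a screenshot
        action = decide(goal, screen, history)   # 2. decide on the next action
        if action.kind == "done":
            break
        execute(action)                          # 3. execute the action
        history.append(action)
    return history


class FakeDesktop:
    """Toy stand-in for a virtual machine: the 'screen' is a text field."""

    def __init__(self) -> None:
        self.text = ""

    def screenshot(self) -> str:
        return self.text

    def execute(self, action: Action) -> None:
        if action.kind == "type":
            self.text += action.detail


def scripted_decide(goal: str, screen: str, history: List[Action]) -> Action:
    """Stand-in for the model: type the next missing character of the goal."""
    if screen == goal:
        return Action("done")
    return Action("type", goal[len(screen)])


desk = FakeDesktop()
steps = run_agent("hi", desk.screenshot, scripted_decide, desk.execute)
print(desk.text, len(steps))
```

The `max_steps` cap is a design choice in this sketch, so a confused agent can’t loop forever.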
Of course, things are a lot more complicated under the hood. The AI agents had to be trained on the basics of human-computer interaction. And before any of this could work, developers needed a technique for accurately counting pixels on a screenshot so the AI would know where to move its cursor and click.
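Neither company’s pixel-counting technique is described here, so the following is only an illustration of one related problem. Assuming the model sees a downscaled screenshot (an assumption on my part), any coordinate it picks has to be mapped back to the real screen before clicking. The helper `to_screen_coords` is hypothetical.

```python
from typing import Tuple


def to_screen_coords(
    x: int,
    y: int,
    shot_size: Tuple[int, int],
    screen_size: Tuple[int, int],
) -> Tuple[int, int]:
    """Map a point on a (possibly resized) screenshot to real screen pixels."""
    shot_w, shot_h = shot_size
    screen_w, screen_h = screen_size
    # Scale each axis independently, then round to the nearest pixel.
    sx = round(x * screen_w / shot_w)
    sy = round(y * screen_h / shot_h)
    # Clamp so a point on the screenshot's edge still lands on the screen.
    sx = min(max(sx, 0), screen_w - 1)
    sy = min(max(sy, 0), screen_h - 1)
    return sx, sy


# The centre of a 1024x768 screenshot maps to the centre of a 2048x1536 screen.
print(to_screen_coords(512, 384, (1024, 768), (2048, 1536)))  # (1024, 768)
```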
The AI agents are also being trained on specific platforms like Uber, OpenTable, and DoorDash so they’ll be able to work with real-world services “while respecting established norms.” (I assume this means without ordering four Ubers at once.)
Right now, both Claude computer use and ChatGPT Operator are very much in beta. While the building blocks of AI computer agents are starting to come together, they’re far from reliable enough for major real-world use.
What can AI computer agents do?
The big breakthrough is that AI computer agents can use a computer like a human—though slower and less accurately. These aren’t the kinds of bots that scalp Taylor Swift tickets. Still, even in demos, they show a lot of promise.
Here are some of the things that Anthropic and OpenAI have shown their computer-using agents can do from a text prompt:
- Navigating Windows, Mac, and Linux systems, pulling up browsers and other apps, and navigating and searching the web.
- Filling in forms by pulling in data from spreadsheets, CRMs, and different data sources.
- Finding information about a sunrise hike on Google, working out the distance using Google Maps, and creating a Google Calendar event at the required time to leave.
- Creating projects and shopping lists in to-do apps.
- Finding a recipe on Allrecipes and adding the ingredients to an Instacart shopping cart.
- Downloading files, combining PDFs, and exporting images.
- Solving online quizzes.
- Finding specific customer information in mock eCommerce backends.
But this is just the stuff they can do right now. The exciting thing is what they could do, once they get good enough. Off the top of my head, that’s things like:
- All the boring accounting drudge work you can imagine, like sending invoices, logging hours, reconciling accounts, submitting expenses, and the like.
- Working with spreadsheets to pull data in from all kinds of sources.
- Watching out-of-stock products on online stores and placing an order when they’re available.
- Booking movie tickets or getting restaurant reservations as soon as they open.
- Scanning your spam folder to make sure there isn’t anything important you’ve missed.
- Dealing with online support agents and chatbots.
And honestly, those are only the things I thought up in 30 seconds of brainstorming. There are literally countless ways an AI computer agent could be useful.
How good are AI computer agents right now?
In its Computer-Using Agent (CUA) announcement, OpenAI claims that its model achieves a new state-of-the-art score of 38.1% on the OSWorld benchmark. Claude’s computer use attained 22% on the same benchmark in October 2024.
The catch: a regular human gets 72.4%.
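To put those numbers in perspective, here’s a quick back-of-the-envelope calculation using only the three scores quoted above. The helper name `share_of_human` is just for illustration.

```python
HUMAN_SCORE = 72.4  # OSWorld score for a regular human, as quoted above

agent_scores = {
    "OpenAI CUA": 38.1,
    "Claude computer use": 22.0,
}


def share_of_human(score: float, human: float = HUMAN_SCORE) -> float:
    """Agent score as a percentage of the human score, to one decimal place."""
    return round(100 * score / human, 1)


for name, score in agent_scores.items():
    gap = round(HUMAN_SCORE - score, 1)
    print(f"{name}: {share_of_human(score)}% of human level, {gap} points behind")
```

By this measure, CUA gets a little over half of what a human manages, and Claude computer use a little under a third.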
Similarly, in its launch announcement, Anthropic highlighted that, while they were preparing the demo videos, Claude computer use accidentally clicked stop on a long screen recording, wiping all the footage.
And things are similar when it comes to speed. Currently, computer-using agents take dozens or hundreds of steps to perform moderately simple actions like downloading a series of lectures, combining PDFs, or finding the customer with the most cancellations in an eCommerce portal. While it’s very impressive that they can perform these actions at all, existing tools (or even just doing it yourself) are almost certainly faster. It’s hands-off flexibility that’s going to make these AI agents useful, not speed.
It’s also worth noting that both Anthropic and OpenAI are making a big deal about safety, and it’s easy to understand why. Even when constrained to a chatbot interface, previous AI models have created all the wrong kinds of headlines. With full access to a web browser, there are essentially no limits to what adversarial behavior an unrestricted AI model could be made to get up to or what harm it could cause with its mistakes.
Neither of them is yet able to operate fully autonomously: when ChatGPT Operator encounters a login, CAPTCHA, or payment details, it kicks control of the virtual computer back to the user. In this situation, I feel it’s good that the developers are moving slowly.
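As a rough illustration of that handoff, here’s a toy Python check. This is not how Operator decides when to hand back control; the keyword list and the function `needs_human` are assumptions made purely for the sketch, and a real agent would presumably rely on the model’s own reading of the screen.

```python
# Hypothetical markers for screens an agent should not handle on its own.
SENSITIVE_MARKERS = ("login", "log in", "sign in", "captcha", "payment")


def needs_human(screen_description: str) -> bool:
    """Return True if the described screen looks like a login, CAPTCHA, or payment step."""
    text = screen_description.lower()
    return any(marker in text for marker in SENSITIVE_MARKERS)


for screen in ("Search results for flights", "Payment details form"):
    owner = "user" if needs_human(screen) else "agent"
    print(f"{screen!r}: control stays with the {owner}")
```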
And this is the crux of where AI computer agents are at now. They’re incredibly impressive and show a huge amount of promise, but they’re very slow and still make a lot of mistakes, especially with unfamiliar interfaces or more complex tasks. The safety concerns are also very real. It probably won’t be long before they’re legitimately useful for some low-risk tasks, but I think it will be a while before it’s sensible to give them your credit card details and let them go shopping on Amazon.
Despite all my caveats, this is the AI development I’m most excited about.
Can I try Claude computer use or ChatGPT Operator?
Both Claude computer use and ChatGPT Operator are available to the public, though testing them out isn’t quite so simple.
- Claude computer use is only available via API. If you have the technical skills, you can get it running in a dev environment and have some fun.
- ChatGPT Operator is in public preview but only for ChatGPT Pro subscribers, and that will set you back $200/month.