No hosted browser sessions. No bespoke connectors per app. Just a clean computer tool in the API — screenshot in, structured actions out.
The benchmark leap is real.
Early computer-using agent (CUA) models struggled with multi-step workflows.
GPT-5.4 now scores 75% on OSWorld, outperforming the human baseline of 72.4%.
On property tax portal evaluations: 95% success on the first attempt, 100% within three attempts.
The interface becomes the API. Point the model at a screen, describe the goal in natural language, and it figures out the clicks, types, and scrolls.
I break down the architecture, the action vocabulary, the self-correcting agent loop, and how to build your own harness around it.
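
To make the loop concrete, here is a minimal sketch of the harness idea: take a screenshot, ask the model for one structured action, execute it, repeat. Everything in it is an assumption for illustration. `Action`, `FakeScreen`, `scripted_model`, and `run_agent` are hypothetical names, the action fields are not the real API schema, and the "model" is a scripted stand-in so the example runs without network access.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Action:
    """One structured action. Kinds and fields are illustrative, not the real API schema."""
    kind: str          # "click", "type", "scroll", or "done"
    x: int = 0
    y: int = 0
    text: str = ""


@dataclass
class FakeScreen:
    """Stand-in for a real desktop or browser: tracks state instead of drawing pixels."""
    focused: bool = False
    typed: str = ""
    scroll_y: int = 0
    log: list = field(default_factory=list)

    def screenshot(self) -> str:
        # A real harness would return image bytes; a text summary keeps this runnable.
        return f"focused={self.focused} typed={self.typed!r} scroll_y={self.scroll_y}"

    def execute(self, action: Action) -> None:
        self.log.append(action.kind)
        if action.kind == "click":
            self.focused = True
        elif action.kind == "type" and self.focused:
            # Typing without focus is silently lost, as it would be on a real screen.
            self.typed += action.text
        elif action.kind == "scroll":
            self.scroll_y += action.y


def scripted_model(goal: str, screenshot: str) -> Action:
    """Hypothetical stand-in for the model call: picks the next action from what it sees."""
    if f"typed={goal!r}" in screenshot:
        return Action("done")
    if "focused=False" in screenshot:
        return Action("click", x=100, y=200)
    return Action("type", text=goal)


def run_agent(goal: str, screen: FakeScreen,
              model: Callable[[str, str], Action], max_steps: int = 10) -> bool:
    """Screenshot in, one action out, until the model says done or the step budget runs out."""
    for _ in range(max_steps):
        action = model(goal, screen.screenshot())
        if action.kind == "done":
            return True
        screen.execute(action)
    return False


if __name__ == "__main__":
    screen = FakeScreen()
    print(run_agent("parcel 42", screen, scripted_model), screen.log)
```

The self-correcting part comes from the shape of the loop: the model never assumes its last action worked, it looks at a fresh screenshot every step. The `max_steps` cap is a design choice in this sketch so a confused agent stops instead of clicking forever.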
