Once the limits of DOM-only agents become visible, the next question follows naturally:
if the browser cannot expose the right state, how else should the agent understand the interface?
That question pulled me toward vision.
Browser Use clarified the next step
One of the most important transitions for me came from Browser Use.
Its most valuable idea was not simply browser automation. It was that raw browser state is usually a poor interface for an LLM, and that a normalized, model-friendly representation of the page serves the model far better than the raw DOM.
That insight still matters.
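To make the idea concrete, here is a minimal sketch of the general shape, not Browser Use's actual format: collapse the page into a short, indexed list the model can cite by number. The input dicts are a hypothetical stand-in for whatever DOM or accessibility-tree extraction a real system uses.

```python
# Minimal sketch: turn raw element data into a compact, indexed
# representation an LLM can reference by number. The input format is
# hypothetical; real systems extract it from the DOM or the
# accessibility tree.

def normalize_elements(elements: list[dict]) -> str:
    """Render each interactive element as one numbered line."""
    lines = []
    for i, el in enumerate(elements):
        # Keep only what the model needs to pick a target: a role,
        # the visible text, and a stable index it can cite in actions.
        text = (el.get("text") or "").strip()[:80]
        lines.append(f"[{i}] <{el['tag']}> {text!r}")
    return "\n".join(lines)

if __name__ == "__main__":
    raw = [
        {"tag": "button", "text": "Submit order"},
        {"tag": "a", "text": "View cart"},
        {"tag": "input", "text": ""},
    ]
    print(normalize_elements(raw))
    # [0] <button> 'Submit order'
    # [1] <a> 'View cart'
    # [2] <input> ''
```

A few dozen lines like these are far cheaper for a model to reason over than a multi-megabyte DOM dump, and the indices give actions a stable grounding target.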
Looking at the Browser Use documentation now, the system clearly supports configurable vision behavior and screenshot-based context in the loop. That validates the broader direction: browser agents were already moving beyond text-only page snapshots.
But during the period when I was exploring it, the vision side still felt early. What I took away at the time was not that Browser Use had already solved multimodal computer use. It was that better interfaces between browser state and model reasoning were essential.
My bridge into vision was YOLO
Before mature computer-use systems became the obvious path, I explored a more manual route.
I tried using YOLO-style detection with coordinates to give the model a visual view of the interface.
The logic was straightforward (a code sketch follows the list):
- detect UI regions or components visually
- attach coordinates to those detections
- let the LLM reason over the detected layout
- ground actions back onto the screen
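In code, the bridge looked roughly like the sketch below. It assumes an ultralytics-style YOLO detector; `ui_detector.pt` is a hypothetical checkpoint fine-tuned on UI components (button, input, link, and so on), since off-the-shelf weights know nothing about interface elements.

```python
# Sketch of the YOLO-as-perception bridge. "ui_detector.pt" is a
# hypothetical UI-tuned checkpoint, not an off-the-shelf model.
from ultralytics import YOLO

model = YOLO("ui_detector.pt")

def detect_layout(screenshot_path: str) -> list[dict]:
    """Run detection and attach pixel coordinates to each UI region."""
    result = model(screenshot_path)[0]
    layout = []
    for box in result.boxes:
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        layout.append({
            "label": result.names[int(box.cls)],       # e.g. "button"
            "center": ((x1 + x2) / 2, (y1 + y2) / 2),  # click target
            "bbox": (x1, y1, x2, y2),
        })
    return layout

def layout_prompt(layout: list[dict]) -> str:
    """Serialize detections so the LLM can reason over the layout."""
    return "\n".join(
        f"[{i}] {el['label']} at center "
        f"({round(el['center'][0])}, {round(el['center'][1])})"
        for i, el in enumerate(layout)
    )
```

Feeding the `layout_prompt` output to the model and mapping its chosen index back to the stored `center` coordinates closes the loop from detection to grounded action.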
This worked better than I expected in exactly the places where DOM-first systems were weakest.
That mattered because it exposed the real shape of the problem: some browser-agent failures are not tool failures. They are perception failures.
The downside was equally clear. The moment the interface changed, or the page moved outside the assumptions of the detector, the system became expensive to adapt. It was an interesting bridge, but not a durable solution on its own.
Why multimodal computer use changed the category
Later, stronger computer-use systems made the broader direction unmistakable.
Anthropic helped define the category by making screenshot-based interaction, cursor control, typing, and scrolling feel like a coherent product capability rather than an isolated demo. Google pushed the same direction further with Gemini computer use. Qwen3-VL reinforced that GUI interaction was becoming a serious multimodal capability across model families, not a one-vendor novelty.
This changed the architecture question.
The real question was no longer whether DOM or vision would win.
The better question became:
which representation is best for the current obstacle?
That is a much more useful way to think about browser agents.
The lesson that survives the hype cycle
Vision is not a replacement for structured browser state.
It is a necessary complement to it.
DOM remains valuable because it is fast, grounded, and easier to convert into stable tests.
Vision matters because it can recover what the DOM does not express well.
That is why the strongest browser-agent systems are not DOM-only and not vision-only.
They are routed systems.
They know when visual understanding should take over.
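As a toy illustration of that routing decision, consider a sketch like the one below. The heuristics are assumptions invented for the example; a real system would use richer signals, but the shape of the decision is the point.

```python
# A minimal routing sketch, not any shipping agent's logic. The
# heuristics here are illustrative assumptions.
from enum import Enum

class Representation(Enum):
    DOM = "dom"
    VISION = "vision"

def choose_representation(dom_snapshot: str, interactive_count: int) -> Representation:
    """Prefer the cheap, grounded DOM path; fall back to vision when
    the DOM clearly under-describes what is on screen."""
    # Almost nothing interactive in the DOM usually means a canvas-,
    # WebGL-, or image-rendered interface.
    if interactive_count < 3:
        return Representation.VISION
    # Canvas-heavy pages with sparse interactive elements are another
    # sign the DOM is not telling the whole story.
    if "<canvas" in dom_snapshot and interactive_count < 10:
        return Representation.VISION
    return Representation.DOM
```

The bias toward DOM first is deliberate: it is the cheaper, more testable path, and vision is the fallback for pages the DOM under-describes.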
And once that becomes clear, the next question is no longer just how to improve perception. It is how to improve the interface the agent consumes in the first place.
References
- Browser Use agent settings: https://docs.browser-use.com/open-source/customize/agent/all-parameters
- Browser Use discussion on vision with non-vision models: https://github.com/browser-use/browser-use/discussions/1621
- Anthropic computer use docs: https://docs.anthropic.com/en/docs/build-with-claude/computer-use
- Anthropic computer use announcement: https://www.anthropic.com/news/3-5-models-and-computer-use
- Google Gemini 2.5 Computer Use announcement: https://blog.google/innovation-and-ai/models-and-research/google-deepmind/gemini-computer-use-model/
- Qwen3-VL repository: https://github.com/QwenLM/Qwen3-VL
- Comet updates: https://www.perplexity.ai/comet/whats-new/what-we-shipped-september-19th