Agents and Browsers
In recent months, both Anthropic and OpenAI have shown how an AI agent can independently use a web browser to interact with the Internet. I am optimistic about agents using the browser to enable new use cases, and I want to share some thoughts and observations from my recent learnings in this area.
The most obvious observation is the dual option of accessing websites through the browser UI or through an API. Historically, machine access leaned on the API side, where all interactions follow specific protocols and outcomes are exact. However, API support is rarely complete, leaving the browser path as the first-class citizen.
Now that AI agents can use a browser to access the web just as a human would, the complete set of information becomes accessible. Without the rigor that an API offers, there is a layer of uncertainty, e.g. from hallucination. But arguably this is not specific to AI: humans can also get lost on a website, click on a wrong link, or take a while to figure something out.
An adaptive model is likely the answer here, where APIs are used for mainstream sites and interactions, and the browser for the long tail. In this realm, we are seeing new crawlers such as Firecrawl, which can convert a website into structured, LLM-ready data, and Stainless, which can generate API client libraries automatically. So it's possible that the future will still be API-based, but dynamically generated on demand.
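The adaptive model could be sketched as a simple dispatcher: use an API client when one exists for the site, and fall back to browser automation for the long tail. Everything below (the registry, the function names, the stub responses) is hypothetical illustration, not a real library.

```python
# Hypothetical sketch of an adaptive access layer: prefer a typed API
# client when one is registered for a domain, fall back to the browser.
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AccessResult:
    source: str  # "api" or "browser"
    data: str


# Registry of sites with known (possibly auto-generated) API clients.
API_CLIENTS: dict[str, Callable[[str], str]] = {
    "api.example.com": lambda path: f"api response for {path}",  # stub
}


def browse(domain: str, path: str) -> str:
    # Placeholder for driving a real browser via an agent.
    return f"page text scraped from {domain}{path}"


def fetch(domain: str, path: str) -> AccessResult:
    client: Optional[Callable[[str], str]] = API_CLIENTS.get(domain)
    if client is not None:
        return AccessResult(source="api", data=client(path))
    return AccessResult(source="browser", data=browse(domain, path))


print(fetch("api.example.com", "/orders").source)    # mainstream site -> API
print(fetch("smallshop.example", "/orders").source)  # long tail -> browser
```

The interesting design question is who populates the registry: with on-demand API generation, an agent could promote a frequently visited long-tail site from the browser path to the API path over time.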
I also think that browser-based interactions will deal a blow to companies whose moats are their collections of API integrations. It used to be the case that maintaining APIs is a tedious job, and once you have a collection of them, it is harder for a potential competitor to enter the race. Browsers reset the race, and a library of API integrations is now worth far less than previously thought.
There are a few different ways to have an AI agent interact with a browser. The most classical is the DOM, which is generated by the browser and its JavaScript engine; this is what automated QA testing tools such as Selenium, Playwright, and Puppeteer use. The accessibility tree is a related structure, derived by the browser from the DOM, that makes websites usable with assistive technologies. Lastly, the Anthropic Computer Use demo from late 2024 uses screenshots and asks the LLM for the next action, for example typing or clicking at specific locations on the screen.
I am optimistic about the accessibility tree, but it has a few main gaps to overcome: performing actions through the accessibility tree, as I believe it is currently read-only; and a component-level mapping between the accessibility tree and the DOM, allowing fast switching between the two representations.
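To illustrate the component-level mapping I have in mind, here is a toy sketch using only the Python standard library (a real browser exposes this through its devtools protocol, not like this): it parses HTML, assigns each element a DOM id, and builds a simplified accessibility-style view of interactive elements, where each accessibility node keeps a pointer back to its DOM node so an agent could hop between the two representations.

```python
# Toy illustration of an accessibility-tree <-> DOM mapping.
# The role table and name computation are drastically simplified.
from html.parser import HTMLParser

INTERACTIVE = {"a": "link", "button": "button", "input": "textbox", "select": "combobox"}


class MappingBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.dom = []       # every element, in document order
        self.ax_nodes = []  # simplified accessibility view

    def handle_starttag(self, tag, attrs):
        dom_id = len(self.dom)
        self.dom.append({"id": dom_id, "tag": tag, "attrs": dict(attrs)})
        if tag in INTERACTIVE:
            self.ax_nodes.append({
                "role": INTERACTIVE[tag],
                "name": dict(attrs).get("aria-label", ""),
                "dom_id": dom_id,  # back-pointer enabling fast switching
            })


builder = MappingBuilder()
builder.feed('<div><button aria-label="Submit">Go</button><p>hi</p></div>')
node = builder.ax_nodes[0]
print(node["role"], "->", builder.dom[node["dom_id"]]["tag"])
```

An agent could reason over the compact accessibility view (far fewer tokens than the full DOM) and then follow the `dom_id` back-pointer only when it needs to act on or inspect the underlying element.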
It is too early for a clear winner to have emerged, but this is an interesting space to watch.
I look forward to browser-level enhancements that help AI agents navigate the browser. For example, one problem I am seeing today is that LLMs interpreting a screenshot have trouble recognizing dropdowns, scroll bars, scroll bars nested within dropdowns, and other UI components that humans have grown accustomed to. It is probably possible to fine-tune models on UI components, but I suspect there is also a long tail of custom styles (or maybe I shouldn't underestimate AI?).
Since screenshots lose too much information about UI components, we will see annotation of UI components (I believe Stagehand is trying this using CSS), or ways to take smart screenshots that are not just 2D graphics. The browser knows exactly what and where the UI components are, so it shouldn't be difficult to relay that information to the LLM.
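As a concrete sketch of what such an annotated payload might look like: the browser already knows each component's role and bounding box, so alongside the screenshot the agent could receive a numbered annotation list. The element data below is made up for illustration; a real implementation would pull roles and boxes from the browser.

```python
# Sketch: turn browser-known UI components into an LLM-friendly
# annotation list to accompany a screenshot. Bounding boxes here are
# hypothetical values a real browser would supply.
elements = [
    {"role": "combobox", "name": "Country", "box": (120, 80, 200, 28)},
    {"role": "scrollbar", "name": "", "box": (780, 0, 12, 600)},
    {"role": "button", "name": "Continue", "box": (120, 140, 96, 32)},
]


def annotate(elements):
    lines = []
    for i, el in enumerate(elements, start=1):
        x, y, w, h = el["box"]
        label = el["name"] or "(unnamed)"
        lines.append(f'[{i}] {el["role"]} "{label}" at ({x},{y}) size {w}x{h}')
    return "\n".join(lines)


print(annotate(elements))
```

The numbered labels also give the model a compact way to refer back to a component ("click [3]") instead of guessing pixel coordinates.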
Agents and Compliance
Somewhat related, there are some very good use cases in (security) compliance that are now within reach, and we've seen some companies starting to tackle them. No doubt more will follow, as well as established companies expanding into this area.
Audit support agent to collect evidence: This has historically been possible through companies like Vanta and Drata, but a browser-based approach removes auditor concerns about data integrity and, more importantly, no longer relies on vendors' API integrations. Since a security audit of a non-trivial environment involves so many screenshots and spreadsheets, there are many hours to be saved here through automation.
Crafting the best controls based on existing process/evidence: In the opposite direction, an agent can observe what is actually being done in terms of processes and configurations, and come up with the best and leanest set of controls that represents the given environment. This would help in two cases. First, many startups move too quickly to document properly, so their security practice may actually be more advanced than what is written down. Second, many complex environments carry redundant and overlapping controls accumulated through historical growth and M&A activity, to the point where people are afraid to change key configurations or processes for fear of non-compliance. Such an agent can assess how a proposed change would impact compliance controls.
Continuous cross-checking of actual state vs. written policies and procedures: This is the general case of the two use cases above, and in a way an AI-native version of what Vanta and Drata are already doing, but more suitable for complex environments. Think of a GRC agent that can continuously perform housekeeping tasks. I believe Zania is heading in this direction.
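The cross-check itself can be as simple as diffing a machine-readable rendering of the written policy against the state the agent observed; the hard part is gathering that state reliably. A toy sketch with invented control names and values:

```python
# Toy continuous-compliance check: compare written policy requirements
# against the actual state an agent observed. All control names and
# values are invented for illustration.
policy = {
    "mfa_enforced": True,
    "password_min_length": 14,
    "backup_retention_days": 90,
}

observed = {
    "mfa_enforced": True,
    "password_min_length": 10,  # drifted from the written policy
    "backup_retention_days": 90,
}


def find_drift(policy: dict, observed: dict) -> list[str]:
    findings = []
    for control, required in policy.items():
        actual = observed.get(control)
        if actual != required:
            findings.append(f"{control}: policy={required!r}, actual={actual!r}")
    return findings


for finding in find_drift(policy, observed):
    print("DRIFT:", finding)
```

In practice the LLM's job is the two translations around this diff: turning prose policies into the structured `policy` side, and turning screenshots, consoles, and exports into the `observed` side.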
An auditor that is actually AI: Most people who talk about an "AI auditor" today probably mean an audit of an AI system. I am personally most excited about replacing human auditors with AI. Setting aside auditing accreditation bodies such as AICPA for a moment, there is so much that can already be automated when reviewing audit evidence such as documents, spreadsheets, and screenshots, and the last 10-20% is also within reach with some development. In fact, there are only so many email providers, cloud providers, and code hosting providers that companies use; I am surprised we have not seen more automation and AI in evidence review. LLMs are, by definition, about language, and compliance should be the perfect use case for them within the cybersecurity sector.
I really would like to see an AI agent performing security audits (perhaps with a senior human to override and allow exceptions in rare cases) as a truly impartial entity that is also not constrained by scheduling logistics and turnaround times. You also won't need to teach a junior auditor what Kubernetes is, only for them to then audit whether you've configured Kubernetes correctly, as I have experienced. I actually think this AI auditor can send a stronger signal than human auditors because of this true independence. Ideally, for turf and commercial reasons, this AI auditor would perform assessments against a neutral framework, likely NIST if the company is based in the US. My favorite is NIST 800-171.
AI-based auditor that connects into a system and assesses it, skipping the evidence stage altogether: This is the continuous-compliance version of the use case above, but skipping all the intermediary artifacts such as screenshots. Though it sounds nice, there are practical barriers, such as the auditee not wanting an auditor watching their environment 24/7.