Give Your AI More Access
Most AI assistants today live inside a text box. They can read documents you paste in, call APIs you wired up, and run tools their host application exposes. The open web, the largest interface humans ever built, is mostly off-limits to them. Anything that requires clicking a button, accepting a cookie banner, or navigating a multi-step flow is out of reach.
Before getting into how to widen that reach, it’s worth being honest about why this is hard.
The “you’re not a bot, you’re using a tool” problem
The moment you point any automated browser at a real website, you walk into a wall of defenses:
- Cookie banners. Every site, every visit. Nothing renders until you dismiss them.
- CAPTCHAs and challenge pages. Cloudflare, hCaptcha, Turnstile. Designed to keep bots out.
- Login walls and “Continue with Google” popups. Often in iframes, often with shifting selectors.
- Bot fingerprinting. Sites sniff for headless Chrome, automation flags, WebDriver hooks, weird timing patterns.
- Rate limits and soft blocks that look like normal pages but return empty results.
One that bit me especially hard: Google sign-in. If Google detects you’re inside a bot-controlled browser, it actively blocks the OAuth flow. You hit the “Sign in with Google” button, the popup opens, and instead of the password prompt you get a “This browser or app may not be secure” page. No retry, no override. Half the modern web sits behind “Continue with Google”, so this is not a small footgun.
Most of these defenses exist for good reasons: sites are drowning in scrapers, credential stuffers, and abuse traffic. That’s fair. But the AI sitting on your laptop, opening a single tab to help you book a flight or pull a quote out of a PDF, is not the threat those defenses were built for. You are not a bot. You are a person using a tool, and your tool happens to need a browser.
The trick is building that tool so it looks and behaves like the browser you’d be driving yourself. That shaped a lot of the design decisions below.
I started building this package
I started building @peteqian/browser-agent to close that gap. It’s a TypeScript library that gives an AI a real Chrome browser to drive. An LLM decides what to click, type, scroll, and read at every step. Hand it a task and a starting URL; it works the page until the task is done.
- npm: @peteqian/browser-agent
- Repo and benchmarks: github.com/peteqian/agent-browser
- MIT licensed. Node 18+ or Bun 1.3+, plus any Chrome-based browser.
Install:
npm install @peteqian/browser-agent
# or
bun add @peteqian/browser-agent
The smallest useful program:
import { Agent, Browser } from "@peteqian/browser-agent";
const browser = new Browser();
const agent = new Agent({
  task: "Go to example.com and report the H1 text.",
  browser,
  startUrl: "https://example.com",
});
try {
  console.log((await agent.run()).summary);
} finally {
  await browser.close();
}
The default provider auto-resolves to whatever is signed in locally: Codex CLI, Claude CLI, or an API key in the environment. You can pin one explicitly:
const agent = new Agent({
  task: "Find the top Hacker News story.",
  browser,
  startUrl: "https://news.ycombinator.com",
  llm: { provider: "openai", model: "gpt-4.1-mini" },
});
That’s the surface. The rest of this post is how it works under the hood and why I built it the way I did.
Why raw Chrome DevTools Protocol
Most browser-automation libraries in the JavaScript world sit on top of Puppeteer or Playwright. Both are excellent for testing apps you control. Both also leave fingerprints, because they were designed for QA, not for blending in.
browser-agent skips that layer and talks directly to Chrome over the Chrome DevTools Protocol, the same JSON-RPC interface DevTools uses when you open the inspector. The reasons:
- Smaller surface, fewer leaks. No navigator.webdriver, no automation library shim, no driver process announcing itself. Just a WebSocket to a normal Chrome.
- Faster startup. No driver handshake, no node-side browser wrapper. Spawn Chrome, attach, go.
- Direct control. Want to set a specific user agent, reuse a real Chrome profile with your existing logins, or attach to a browser that’s already running? Easy at this layer.
The tradeoff: anything the higher-level libraries give you for free (locator strategies, auto-waiting, frame management) you write yourself. With an LLM in the loop, that’s fine. The model is the locator strategy. It looks at the page and decides what to do next.
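To make "just a WebSocket" concrete, here is a minimal sketch of the message framing CDP uses: each command carries an id, a method, and optional params; the reply echoes the id with either a result or an error; events arrive with a method and no id. The CdpRouter class and its method names are my own illustration, not the library's CDPClient, and the transport is injected so the sketch runs without a browser.

```typescript
// Sketch only: CdpRouter is a hypothetical name. The wire shapes
// (id/method/params, id/result, id/error, events without id) are CDP's.
type CdpCommand = { id: number; method: string; params?: Record<string, unknown> };
type CdpIncoming = {
  id?: number;
  method?: string;
  params?: unknown;
  result?: unknown;
  error?: { code: number; message: string };
};
type Pending = { resolve: (value: unknown) => void; reject: (reason: Error) => void };

class CdpRouter {
  private nextId = 1;
  private pending = new Map<number, Pending>();
  private sendRaw: (json: string) => void;
  private onEvent: (method: string, params: unknown) => void;

  // For a real connection, sendRaw would forward to the WebSocket's send().
  constructor(
    sendRaw: (json: string) => void,
    onEvent: (method: string, params: unknown) => void = () => {},
  ) {
    this.sendRaw = sendRaw;
    this.onEvent = onEvent;
  }

  // Send a command; the promise settles when the reply with the same id arrives.
  send(method: string, params?: Record<string, unknown>): Promise<unknown> {
    const id = this.nextId++;
    const command: CdpCommand = { id, method, params };
    return new Promise((resolve, reject) => {
      this.pending.set(id, { resolve, reject });
      this.sendRaw(JSON.stringify(command));
    });
  }

  // Feed every incoming WebSocket frame through here.
  handleMessage(json: string): void {
    const msg = JSON.parse(json) as CdpIncoming;
    if (msg.id === undefined) {
      // No id means an event, for example Page.loadEventFired.
      if (msg.method) this.onEvent(msg.method, msg.params);
      return;
    }
    const waiter = this.pending.get(msg.id);
    if (!waiter) return; // reply to a command we are not tracking
    this.pending.delete(msg.id);
    if (msg.error) waiter.reject(new Error(msg.error.message));
    else waiter.resolve(msg.result);
  }
}
```

With a router like this, navigating is `await router.send("Page.navigate", { url })`. Everything a higher-level library would add (waiting, retries, frame tracking) sits on top of that one call shape.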
The decision loop
The runtime topology is small:
CLI / SDK / MCP
       │
       ▼
     Agent ──── Browser
       │           │
       ▼           ▼
   runAgent ─ BrowserSession ─ CDPClient (WS)
       │
       ▼
   DecideFn ◄── LLM adapter (OpenAI / Anthropic / Codex / Claude)
Each iteration:
- Capture a BrowserStateSummary from the active page: DOM serialization, URL, screenshot if the provider is multimodal.
- Build a DecisionInput from that state plus a compacted history (head + tail, so the agent keeps the original task and the last few steps without paying for the middle).
- Call the model. It returns one or more actions from a fixed catalog.
- Execute those actions through the action registry against the current Page.
- Repeat until the model emits done(...), or a terminal condition fires (max steps, loop detection, timeout, abort).
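The loop above can be sketched in a few lines. This is a simplified illustration under my own assumptions, not the library's runAgent: the Step and Decision shapes, the function names, and the default head/tail sizes are all hypothetical, and the page and model are injected as plain functions so the sketch has no dependencies.

```typescript
// Hypothetical shapes for illustration; the library's real types differ.
type Step = { action: string; result: string };
type Decision = { action: string; done?: boolean };
type LoopResult = { reason: "completed" | "max_steps"; steps: number };

// Head + tail compaction: keep the first `head` and last `tail` steps,
// and replace the middle with a single marker entry.
function compactHistory(history: Step[], head = 1, tail = 3): Step[] {
  if (history.length <= head + tail) return history.slice();
  const omitted = history.length - head - tail;
  return [
    ...history.slice(0, head),
    { action: "(omitted)", result: `${omitted} earlier steps dropped` },
    ...history.slice(history.length - tail),
  ];
}

// Observe, decide, execute, repeat until done or the step budget runs out.
async function runLoop(
  observe: () => Promise<string>,
  decide: (state: string, history: Step[]) => Promise<Decision>,
  execute: (action: string) => Promise<string>,
  maxSteps = 20,
): Promise<LoopResult> {
  const history: Step[] = [];
  for (let step = 1; step <= maxSteps; step++) {
    const state = await observe();
    const decision = await decide(state, compactHistory(history));
    if (decision.done) return { reason: "completed", steps: step };
    const result = await execute(decision.action);
    history.push({ action: decision.action, result });
  }
  return { reason: "max_steps", steps: maxSteps };
}
```

The point of the compaction step is cost control: prompt size stays roughly constant however long the run gets, while the first step (which usually restates the task) is never dropped.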
The action catalog is intentionally finite: navigate, click, type, scroll, wait, send_keys, select_option, upload_file, wait_for_text, find_elements, extract_content, screenshot, done, and a handful more. Custom actions plug in via createDefaultActionRegistry() plus an ActionDefinition.
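For a feel of what a finite catalog buys you, here is a generic registry sketch. It is deliberately not the library's createDefaultActionRegistry() or ActionDefinition, whose real signatures I am not reproducing here; ActionRegistry, ActionHandler, and the string-result convention are assumptions made for illustration.

```typescript
// Hypothetical registry for illustration, not the library's API.
type ActionHandler = (args: Record<string, unknown>) => Promise<string>;

class ActionRegistry {
  private handlers = new Map<string, ActionHandler>();

  // Register a named action; duplicate names are rejected up front.
  register(name: string, handler: ActionHandler): this {
    if (this.handlers.has(name)) throw new Error(`duplicate action: ${name}`);
    this.handlers.set(name, handler);
    return this;
  }

  // The list of names is what the model would be told it may call.
  names(): string[] {
    return Array.from(this.handlers.keys());
  }

  // Unknown actions come back as an error string rather than a throw,
  // so the model can read the message and choose something else.
  async execute(name: string, args: Record<string, unknown> = {}): Promise<string> {
    const handler = this.handlers.get(name);
    if (!handler) return `error: unknown action "${name}"`;
    return handler(args);
  }
}
```

A closed set of names means the model cannot invent an operation: anything outside the catalog is rejected at one choke point instead of somewhere deep in the browser layer.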
Typed terminal output
A summary string is fine for a demo. For anything you actually want to consume downstream, the agent validates its terminal done(data=...) against a Zod schema:
import { z } from "zod";
import { Agent, Browser } from "@peteqian/browser-agent";
const Result = z.object({ heading: z.string() });
const browser = new Browser();
try {
  const result = await new Agent({
    task: "Report the page heading via done(data=...).",
    browser,
    startUrl: "https://example.com",
    outputSchema: Result,
  }).run();
  if (result.success) console.log(result.data?.heading);
} finally {
  // Close the browser even if the run throws, as in the first example.
  await browser.close();
}
If the model emits done with data that does not match the schema, the run terminates with reason === "schema_violation". No silent garbage flowing into your pipeline.
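The gate itself is simple. The sketch below shows the idea under stated assumptions: finishRun, SchemaLike, and Finished are hypothetical names, not the library's internals. The only thing borrowed from Zod is the shape of safeParse, which returns either { success: true, data } or { success: false, error }. To keep the block dependency-free, headingSchema is a hand-rolled stand-in for z.object({ heading: z.string() }).

```typescript
// Any validator exposing a Zod-style safeParse fits this interface.
type ParseResult<T> = { success: true; data: T } | { success: false; error: unknown };
interface SchemaLike<T> {
  safeParse(input: unknown): ParseResult<T>;
}

// Hypothetical terminal result; mirrors the success/reason/data idea only.
type Finished<T> =
  | { success: true; reason: "completed"; data: T }
  | { success: false; reason: "schema_violation" };

// Validate the payload the model passed to done(data=...).
function finishRun<T>(schema: SchemaLike<T>, payload: unknown): Finished<T> {
  const parsed = schema.safeParse(payload);
  if (parsed.success) {
    return { success: true, reason: "completed", data: parsed.data };
  }
  return { success: false, reason: "schema_violation" };
}

// Stand-in for z.object({ heading: z.string() }), written by hand.
const headingSchema: SchemaLike<{ heading: string }> = {
  safeParse(input) {
    const heading = (input as { heading?: unknown } | null)?.heading;
    if (typeof heading === "string") {
      return { success: true, data: { heading } };
    }
    return { success: false, error: new Error("heading must be a string") };
  },
};
```

Because the check runs at the terminal step, downstream code only ever sees data that already has the declared type.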
AgentResult.reason matters more than success
success is true only when the model called done cleanly. In production code, branch on reason:
completed · failed · max_steps · max_failures · loop_detected · aborted · stopped · step_timeout · decision_timeout · schema_violation · decide_error
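One way to branch on it is an exhaustive switch, sketched below. The Reason union is copied from the list above; whether the package exports a type by that name is an assumption, and the retry policy (which reasons count as transient) is mine, not the library's.

```typescript
// Reason values as listed above; the type name is hypothetical.
type Reason =
  | "completed" | "failed" | "max_steps" | "max_failures" | "loop_detected"
  | "aborted" | "stopped" | "step_timeout" | "decision_timeout"
  | "schema_violation" | "decide_error";

type NextStep = "use_result" | "retry" | "give_up";

// One possible policy. Treating timeouts and decide errors as transient
// is an assumption of this sketch.
function nextStep(reason: Reason): NextStep {
  switch (reason) {
    case "completed":
      return "use_result";
    case "step_timeout":
    case "decision_timeout":
    case "decide_error":
      return "retry";
    case "failed":
    case "max_steps":
    case "max_failures":
    case "loop_detected":
    case "schema_violation":
    case "aborted":
    case "stopped":
      return "give_up";
    default: {
      // Compile-time exhaustiveness check: a new reason breaks the build here.
      const unreachable: never = reason;
      throw new Error(`unhandled reason: ${String(unreachable)}`);
    }
  }
}
```

The never assignment in the default branch is the useful part: if a new reason is added to the union, the compiler points at every switch that forgot to handle it.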
Loop detection is the one I find most useful. Agents get stuck. They click the same button, get the same modal, click again. The loop detector recognizes the cycle and bails out rather than burning tokens until max_steps.
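A simple way to recognize such a cycle is to reduce each step to a signature string and check whether the tail of the history is one short block repeated several times. The sketch below is my own illustration of that idea, not the library's detector; detectLoop, the signature format, and the default thresholds are all assumptions.

```typescript
// Returns the cycle length if the most recent signatures consist of one
// block of length 1..maxCycle repeated minRepeats times, otherwise 0.
// A signature might be something like "click#submit@/checkout" (hypothetical).
function detectLoop(signatures: string[], maxCycle = 3, minRepeats = 3): number {
  for (let len = 1; len <= maxCycle; len++) {
    const needed = len * minRepeats;
    if (signatures.length < needed) continue;
    const tail = signatures.slice(signatures.length - needed);
    // Every entry must equal the entry at the same offset in the first block.
    const repeats = tail.every((sig, i) => sig === tail[i % len]);
    if (repeats) return len;
  }
  return 0;
}
```

Checking cycle lengths greater than one matters in practice: "click button, dismiss modal, click button, dismiss modal" never repeats the same action twice in a row, but it is still a loop.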
CLI and MCP
Same library, three entry points. From the terminal:
browser-agent "Find the top result on Hacker News and print its title."
browser-agent "..." --provider openai --model gpt-4.1-mini
browser-agent "..." --verbose
The third entry point is an MCP server. MCP, the Model Context Protocol, is an open standard from Anthropic for how AI assistants discover and call external tools. If you’ve ever added a “tool” or “integration” to Claude Desktop or Cursor and watched the AI use it, that’s MCP under the hood. The host application speaks MCP; the server you add exposes a list of tools the AI can invoke.
browser-agent ships an MCP server. Drop this into your claude_desktop_config.json (or the equivalent for any MCP-aware client) and the AI gets a browser:
{
  "mcpServers": {
    "browser-agent": {
      "command": "npx",
      "args": ["-y", "-p", "@peteqian/browser-agent", "browser-agent-mcp"]
    }
  }
}
Tools exposed: launch session, navigate, click, type, extract, screenshot, run agent, list artifacts, close. The AI sees them in the same tool list it sees everything else and decides when to reach for the browser.
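On the wire, a tool invocation is a JSON-RPC 2.0 request whose method is tools/call, with the tool's name and arguments in params (tools/list is the matching discovery call). The sketch below builds both requests. The envelope follows the MCP specification; the tool name "navigate" and its url argument are placeholders, since the exact tool names and schemas this server registers are not spelled out here.

```typescript
// JSON-RPC 2.0 envelopes for MCP's tools/list and tools/call methods.
type ToolsListRequest = { jsonrpc: "2.0"; id: number; method: "tools/list" };
type ToolCallRequest = {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
};

// Ask the server which tools it exposes.
function listTools(id: number): ToolsListRequest {
  return { jsonrpc: "2.0", id, method: "tools/list" };
}

// Invoke one tool by name with a JSON object of arguments.
function callTool(id: number, name: string, args: Record<string, unknown>): ToolCallRequest {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name, arguments: args } };
}

// Placeholder tool name and argument, for illustration only.
const exampleCall = callTool(2, "navigate", { url: "https://example.com" });
```

You never write these by hand when using Claude Desktop or Cursor; the host generates them. Seeing the shape mostly helps when debugging a server that is not showing up in the tool list.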
When it makes sense
This isn’t a Playwright replacement for deterministic E2E suites. For that, keep using Playwright; the auto-waiting and locator API earn their keep when you control the page and the test.
The agent shape pays off when:
- The target site changes, or you do not control it.
- The task is described in natural language and the path to completing it varies.
- You want an LLM in your stack to actually do things, not just describe them.
In that setting, the LLM is the locator strategy. The browser is the side-effect surface. CDP is the wire. Your agent gets access to the rest of the web.
Links
- Repo and benchmarks: github.com/peteqian/agent-browser
- npm: @peteqian/browser-agent
- MIT licensed.