
page-agent

alibaba · 1.1k · MIT · 1.1.0

GUI agent for web applications - add intelligent automation to any webpage with a single script

ai, automation, ui-agent, GUI-agent, browser-automation, web-agent, llm, dom-interaction, web-automation, GUI-simulation

readme

PageAgent 🤖🪄


The GUI Agent Living in Your Webpage. Control web interfaces with natural language.


👉 🚀 Demo | 📖 Documentation

✨ Features

🎯 Easy Integration
- No Python, no headless browser, no browser extension: just in-page scripts.
🔐 Client-Side Processing
🧠 DOM Extraction
💬 Natural Language Interface
🎨 Human-in-the-loop UI

And 😉

🧪 Cross-page control with an experimental Chrome extension (packages/extension)

👉 🗺️ Roadmap

🚀 Quick Start

One-line integration

The fastest way to try PageAgent is with our free demo LLM:

<script
    src="https://cdn.jsdelivr.net/npm/page-agent@1.1.0/dist/iife/page-agent.demo.js"
    crossorigin="anonymous"
></script>

⚠️ For technical evaluation only. The demo LLM has rate limits and usage restrictions, and may change without notice.

🌷 Bring your own LLM API.

Mirrors	URL
Global	https://cdn.jsdelivr.net/npm/page-agent@1.1.0/dist/iife/page-agent.demo.js
China	https://registry.npmmirror.com/page-agent/1.1.0/files/dist/iife/page-agent.demo.js
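Only the version segment differs between releases in these URLs. A minimal sketch for pinning a specific version, assuming every release keeps the `dist/iife/page-agent.demo.js` layout shown in the table (`demoScriptUrl` and `Mirror` are hypothetical names, not part of page-agent):

```typescript
// Hypothetical helper: builds the demo bundle URL for a version and mirror,
// assuming the dist/iife/page-agent.demo.js layout from the mirror table.
type Mirror = 'global' | 'china'

const DEMO_BUNDLE_PATH = 'dist/iife/page-agent.demo.js'

function demoScriptUrl(version: string, mirror: Mirror = 'global'): string {
    return mirror === 'global'
        ? `https://cdn.jsdelivr.net/npm/page-agent@${version}/${DEMO_BUNDLE_PATH}`
        : `https://registry.npmmirror.com/page-agent/${version}/files/${DEMO_BUNDLE_PATH}`
}

console.log(demoScriptUrl('1.1.0', 'china'))
```

In a browser, assign the result to the `src` of a script element. Pinning an exact version, as the snippet above does with `1.1.0`, keeps the bundle from changing underneath your page.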

NPM Installation

npm install page-agent

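import { PageAgent } from 'page-agent'

// Any OpenAI-compatible chat API can be used; DeepSeek is shown here.
const agent = new PageAgent({
    model: 'deepseek-chat',
    baseURL: 'https://api.deepseek.com',
    apiKey: 'YOUR_API_KEY', // replace with your own key
    language: 'en-US', // 'en-US' | 'zh-CN'
})

// Describe the task in natural language (top-level await requires an ES module).
await agent.execute('Click the login button')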

🏗️ Structure

PageAgent uses a simplified monorepo structure:

packages/
├── core/                # Core agent logic without UI (npm: @page-agent/core)
├── page-agent/          # Exported agent and demo (npm: page-agent)
├── llms/                # LLM client (npm: @page-agent/llms)
├── page-controller/     # DOM operations & Visual Mask (npm: @page-agent/page-controller)
├── ui/                  # Panel & i18n (npm: @page-agent/ui)
└── website/             # Demo & Documentation site

🤝 Contributing

We welcome contributions from the community! Follow our instructions in CONTRIBUTING.md for environment setup and local development.

Please read the Code of Conduct before contributing.

👏 Acknowledgments

This project builds upon the excellent work of browser-use.

PageAgent is designed for client-side web enhancement, not server-side automation.

DOM processing components and prompts are derived from browser-use:

Browser Use
Copyright (c) 2024 Gregor Zunic
Licensed under the MIT License

Original browser-use project: <https://github.com/browser-use/browser-use>

We gratefully acknowledge the browser-use project and its contributors for their
excellent work on web automation and DOM interaction patterns that helped make
this project possible.

Third-party dependencies and their licenses can be found in the package.json
file and in the node_modules directory after installation.

📄 License

MIT License

⭐ Star this repo if you find PageAgent helpful!

changelog

Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[1.0.0] - 2026-01-19

🎉 First Stable Release

PageAgent is now ready for production use. The API is stable, and any future breaking changes will follow semantic versioning.

Features

Core

PageAgent - Main entry class with built-in UI Panel
PageAgentCore - Headless agent class for custom UI or programmatic use
DOM Analysis - Text-based DOM extraction that aggressively "dehydrates" the page into a compact text representation
LLM Support - Works with OpenAI, Claude, DeepSeek, Qwen, and other OpenAI-compatible APIs
Tool System - Built-in tools for click, input, scroll, select, and more
Custom Tools - Extend agent capabilities with your own tools (experimental)
Lifecycle Hooks - Hook into agent execution (experimental)
Instructions System - System-level and page-level instructions to guide agent behavior
Data Masking - Transform page content before sending to LLM
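Data masking goes through the `transformPageContent` option in the configuration, which takes the page content as a string and returns the (possibly async) string the LLM will see. A minimal sketch, assuming regex-based redaction is acceptable for your data; `maskSensitive` is a hypothetical name and the patterns are illustrative, not a complete PII filter:

```typescript
// Hypothetical masking function matching the transformPageContent signature:
// (content: string) => string | Promise<string>
// Regex redaction is illustrative only: it will miss some formats and over-match others.
const EMAIL = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g
// 13-16 digits, optionally separated by spaces or hyphens (card-number-like).
// Note: this also matches other long digit runs, e.g. 13-digit millisecond timestamps.
const CARD_LIKE = /\b(?:\d[ -]?){12,15}\d\b/g

function maskSensitive(content: string): string {
    return content.replace(EMAIL, '[email]').replace(CARD_LIKE, '[card]')
}

console.log(maskSensitive('Contact jane.doe@example.com, card 4111 1111 1111 1111.'))
// Contact [email], card [card].
```

Pass it as `transformPageContent: maskSensitive` in the agent options. Because masking happens client-side, the raw values never reach the LLM; the trade-off is that the model cannot reason about or reproduce values it never sees.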

Page Controller

Element Interactions - Click, input text, select options, scroll
Visual Mask - Blocks user interaction during automation
DOM Tree Extraction - Efficient page structure extraction for LLM consumption
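The exact text format page-agent produces is not documented here. As a rough illustration of the general technique (which, per the acknowledgments, derives from browser-use), the sketch below flattens a tree into one line per interactive element, each with a numeric index the LLM can refer to in its actions. It uses a hypothetical `SimpleNode` shape instead of real DOM elements so it runs outside a browser; `dehydrate` and `isInteractive` are likewise illustrative names.

```typescript
// Hypothetical stand-in for a DOM element, so the sketch runs without a browser.
interface SimpleNode {
    tag: string
    text?: string
    attrs?: Record<string, string>
    children?: SimpleNode[]
}

const INTERACTIVE_TAGS = new Set(['a', 'button', 'input', 'select', 'textarea'])

// Simplified heuristic; real extractors typically also consider visibility,
// ARIA roles, and attached event listeners.
function isInteractive(node: SimpleNode): boolean {
    return INTERACTIVE_TAGS.has(node.tag) || node.attrs?.role === 'button'
}

// Flattens the tree into compact text: plain text is kept for context, and each
// interactive element gets an index. The returned array maps index -> node, so an
// LLM action such as "click 1" can be resolved back to the element.
function dehydrate(root: SimpleNode): { text: string; index: SimpleNode[] } {
    const index: SimpleNode[] = []
    const lines: string[] = []
    const walk = (node: SimpleNode): void => {
        if (isInteractive(node)) {
            const label = node.text ?? node.attrs?.placeholder ?? node.attrs?.['aria-label'] ?? ''
            lines.push(`[${index.length}]<${node.tag}>${label}</${node.tag}>`)
            index.push(node)
        } else if (node.text) {
            lines.push(node.text)
        }
        node.children?.forEach(walk)
    }
    walk(root)
    return { text: lines.join('\n'), index }
}

const demoPage: SimpleNode = {
    tag: 'div',
    children: [
        { tag: 'h1', text: 'Welcome' },
        { tag: 'input', attrs: { placeholder: 'Email' } },
        { tag: 'button', text: 'Log in' },
    ],
}

console.log(dehydrate(demoPage).text)
// Welcome
// [0]<input>Email</input>
// [1]<button>Log in</button>
```

Sending indexed text instead of raw HTML keeps prompts small, and the index map lets the controller act on the exact element the model chose.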

UI

Interactive Panel - Real-time task progress and agent thinking display
Ask User Tool - Agent can ask users for clarification
i18n Support - English and Chinese localization

Configuration

interface PageAgentConfig {
    // LLM Configuration (required)
    baseURL: string
    apiKey: string
    model: string
    temperature?: number
    maxRetries?: number
    customFetch?: typeof fetch

    // Agent Configuration
    language?: 'en-US' | 'zh-CN'
    maxSteps?: number // default: 20
    customTools?: Record<string, PageAgentTool> // experimental
    instructions?: InstructionsConfig
    transformPageContent?: (content: string) => string | Promise<string>
    experimentalScriptExecutionTool?: boolean // default: false

    // Lifecycle Hooks (experimental)
    onBeforeTask?: (agent, result) => void
    onAfterTask?: (agent, result) => void
    onBeforeStep?: (agent, stepCount) => void
    onAfterStep?: (agent, history) => void
    onDispose?: (agent, reason?) => void

    // Page Controller Configuration
    enableMask?: boolean // default: true
    viewportExpansion?: number
    interactiveBlacklist?: Element[]
    interactiveWhitelist?: Element[]
}
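The loop below is a hypothetical, simplified sketch (not page-agent's implementation) of how a step budget like `maxSteps` (default 20) bounds a task, and where step hooks in the spirit of `onBeforeStep`/`onAfterStep` could fire. `observe`, `decide`, and `act` are placeholder callbacks standing in for DOM extraction, the LLM call, and the page controller; the hooks here omit the `agent` argument for brevity.

```typescript
// Hypothetical action model; a real agent supports more tools (input, scroll, select, ...).
type Action =
    | { type: 'click'; index: number }
    | { type: 'done'; summary: string }

type ClickAction = Extract<Action, { type: 'click' }>

interface LoopOptions {
    maxSteps?: number
    observe: () => string // stands in for DOM extraction
    decide: (observation: string, history: Action[]) => Promise<Action> // stands in for the LLM call
    act: (action: ClickAction) => Promise<void> // stands in for the page controller
    onBeforeStep?: (stepCount: number) => void
    onAfterStep?: (history: Action[]) => void
}

// Observe -> decide -> act until the model reports "done" or the step budget runs out.
async function runTask(opts: LoopOptions): Promise<{ success: boolean; history: Action[] }> {
    const maxSteps = opts.maxSteps ?? 20
    const history: Action[] = []
    for (let step = 1; step <= maxSteps; step++) {
        opts.onBeforeStep?.(step)
        const action = await opts.decide(opts.observe(), history)
        history.push(action)
        if (action.type === 'done') {
            opts.onAfterStep?.(history)
            return { success: true, history }
        }
        await opts.act(action) // narrowed to ClickAction here
        opts.onAfterStep?.(history)
    }
    return { success: false, history } // budget exhausted without a "done"
}
```

A cap like this stops a confused model from looping indefinitely and running up LLM costs; raising `maxSteps` trades that safety for room to complete longer tasks.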

Packages

Package	Description
`page-agent`	Main entry with UI Panel
`@page-agent/core`	Core agent logic without UI
`@page-agent/llms`	LLM client with retry logic
`@page-agent/page-controller`	DOM operations and visual feedback
`@page-agent/ui`	Panel and i18n
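The `customFetch` option (typed `typeof fetch`) accepts any function with `fetch`'s call signature, which makes it a natural place for transport-level policies. A minimal sketch of an exponential-backoff wrapper, assuming you want retries on HTTP 429, 5xx, and network errors; `withRetries` is a hypothetical helper, not part of `@page-agent/llms`, and its interaction with the built-in `maxRetries` logic is not specified here.

```typescript
// A plain function type with fetch's call signature.
type FetchLike = (...args: Parameters<typeof fetch>) => Promise<Response>

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))

// Rate limiting (429) and server errors (5xx) are usually transient; other statuses are not retried.
const isRetryable = (status: number): boolean => status === 429 || status >= 500

// Hypothetical helper: retries up to maxRetries times with delays of
// baseDelayMs, 2 * baseDelayMs, 4 * baseDelayMs, ...
// Request bodies must be replayable (e.g. a JSON string), not a one-shot stream.
function withRetries(fetchImpl: FetchLike, maxRetries = 2, baseDelayMs = 500): FetchLike {
    return async (...args) => {
        for (let attempt = 0; ; attempt++) {
            try {
                const res = await fetchImpl(...args)
                if (!isRetryable(res.status) || attempt >= maxRetries) return res
                await res.body?.cancel() // discard the failed response before retrying
            } catch (err) {
                // Give up when out of retries, and never retry a deliberate abort.
                const aborted = (err as { name?: string } | null)?.name === 'AbortError'
                if (aborted || attempt >= maxRetries) throw err
            }
            await sleep(baseDelayMs * 2 ** attempt)
        }
    }
}
```

It could be wired in as, e.g., `customFetch: withRetries(fetch)`. Since the LLM client has its own `maxRetries` setting, enabling both may multiply the total number of attempts, so consider lowering one of them.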

Known Limitations

Works within a single page only (cannot navigate across pages)
No visual recognition (relies on DOM structure)
Limited interaction support (no hover, drag-drop, canvas operations)
See Limitations for details

Acknowledgments

This project builds upon the excellent work of browser-use. DOM processing components and prompts are adapted from browser-use (MIT License).