# browser_action

The `browser_action` tool enables web automation and interaction via a Puppeteer-controlled browser. It allows Roo to launch browsers, navigate to websites, click elements, type text, and scroll pages with visual feedback through screenshots.

## Parameters

The tool accepts these parameters:

- `action` (required): The action to perform:
  * `launch`: Start a new browser session at a URL
  * `click`: Click at specific x,y coordinates
  * `type`: Type text via the keyboard
  * `scroll_down`: Scroll down one page height
  * `scroll_up`: Scroll up one page height
  * `close`: End the browser session
- `url` (optional): The URL to navigate to when using the `launch` action
- `coordinate` (optional): The x,y coordinates for the `click` action (e.g., "450,300")
- `text` (optional): The text to type when using the `type` action

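The pairing between actions and parameters can be sketched as a small validator. This is only an illustration under stated assumptions: the type and function names (`BrowserActionParams`, `parseCoordinate`, `validateBrowserAction`) are hypothetical and not taken from Roo's source, and it assumes that `url`, `coordinate`, and `text` become mandatory for `launch`, `click`, and `type` respectively.

```typescript
// Hypothetical types and helpers for illustration; not Roo's actual implementation.
type BrowserActionName = "launch" | "click" | "type" | "scroll_down" | "scroll_up" | "close";

interface BrowserActionParams {
  action: BrowserActionName;
  url?: string;
  coordinate?: string; // "x,y", e.g. "450,300"
  text?: string;
}

// Parse an "x,y" string into numbers, or return null if it is malformed.
function parseCoordinate(raw: string): { x: number; y: number } | null {
  const match = /^\s*(\d+)\s*,\s*(\d+)\s*$/.exec(raw);
  if (match === null) return null;
  return { x: Number(match[1]), y: Number(match[2]) };
}

// Return a list of problems; an empty list means the call looks well-formed.
function validateBrowserAction(params: BrowserActionParams): string[] {
  const problems: string[] = [];
  switch (params.action) {
    case "launch":
      if (!params.url) problems.push("launch requires url");
      break;
    case "click":
      if (!params.coordinate) {
        problems.push("click requires coordinate");
      } else if (parseCoordinate(params.coordinate) === null) {
        problems.push('coordinate must look like "x,y"');
      }
      break;
    case "type":
      if (params.text === undefined) problems.push("type requires text");
      break;
    default:
      // scroll_down, scroll_up, and close take no extra parameters.
      break;
  }
  return problems;
}
```

Returning a list of problems rather than throwing keeps the sketch easy to test and lets a caller report every missing parameter at once.
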
## What It Does

This tool creates an automated browser session that Roo can control to navigate websites, interact with elements, and perform tasks that require browser automation. Each action provides a screenshot of the current state, enabling visual verification of the process.

## When is it used?

- When Roo needs to interact with web applications or websites
- When testing user interfaces or web functionality
- When capturing screenshots of web pages
- When demonstrating web workflows visually

## Key Features

- Provides visual feedback with screenshots after each action and captures console logs
- Supports complete workflows from launching to page interaction to closing
- Enables precise interactions via coordinates, keyboard input, and scrolling
- Maintains consistent browser sessions with intelligent page loading detection
- Operates in two modes: local (isolated Puppeteer instance) or remote (connects to existing Chrome)
- Handles errors gracefully with automatic session cleanup and detailed messages
- Optimizes visual output with support for various formats and quality settings
- Tracks interaction state with position indicators and action history

## Browser Modes

The tool operates in two distinct modes:

### Local Browser Mode (Default)

- Downloads and manages a local Chromium instance through Puppeteer
- Creates a fresh browser environment with each launch
- Has no access to existing user profiles, cookies, or extensions
- Behaves consistently and predictably in a sandboxed environment
- Completely closes the browser when the session ends

### Remote Browser Mode

- Connects to an existing Chrome/Chromium instance running with remote debugging enabled
- Can access existing browser state, cookies, and potentially extensions
- Starts faster because it reuses an existing browser process
- Supports connecting to browsers in Docker containers or on remote machines
- Only disconnects from the browser (doesn't close it) when the session ends
- Requires Chrome to be running with a remote debugging port open (typically port 9222)

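The choice between the two modes can be pictured as a small resolver. This is a sketch only: the settings shape (`remoteEnabled`, `remoteHost`, `remotePort`) and the function name `resolveBrowserTarget` are hypothetical, not Roo's configuration keys. The resulting `browserURL` has the form that Puppeteer's `puppeteer.connect({ browserURL })` accepts, while local mode would correspond to `puppeteer.launch()`.

```typescript
// Hypothetical settings shape; Roo's real configuration may differ.
interface BrowserSettings {
  remoteEnabled: boolean;
  remoteHost?: string;
  remotePort?: number;
}

type BrowserTarget =
  | { mode: "local" }
  | { mode: "remote"; browserURL: string };

// Decide which mode to use and, for remote mode, build the debugging endpoint.
function resolveBrowserTarget(settings: BrowserSettings): BrowserTarget {
  if (!settings.remoteEnabled) {
    // Local mode: a fresh, isolated browser is started for each session.
    return { mode: "local" };
  }
  const host = settings.remoteHost ?? "localhost";
  const port = settings.remotePort ?? 9222; // typical remote debugging port
  return { mode: "remote", browserURL: `http://${host}:${port}` };
}
```

When the session ends, a remote target would be released with `browser.disconnect()` and a local one with `browser.close()`, which matches the difference in shutdown behavior described above.
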
## Limitations

- While the browser is active, only the `browser_action` tool can be used
- Browser coordinates are viewport-relative, not page-relative
- Click actions must target visible elements within the viewport
- Browser sessions must be explicitly closed before using other tools
- Browser window has configurable dimensions (default 900x600)
- Cannot directly interact with browser DevTools
- Browser sessions are temporary and not persistent across Roo restarts
- Works only with Chrome/Chromium browsers, not Firefox or Safari
- Local mode has no access to existing cookies; remote mode requires Chrome with debugging enabled

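Because coordinates are viewport-relative, an element lower on the page has to be scrolled into view before it can be clicked. The following sketch illustrates the arithmetic; the names (`Coord`, `ViewportState`, `toViewportPoint`, `isInViewport`, `scrollStepsNeeded`) are hypothetical, and it assumes each scroll action moves the page by exactly one viewport height.

```typescript
interface Coord {
  x: number;
  y: number;
}

interface ViewportState {
  width: number;   // e.g. 900 by default
  height: number;  // e.g. 600 by default
  scrollX: number; // current horizontal scroll offset in pixels
  scrollY: number; // current vertical scroll offset in pixels
}

// Convert a page-relative point into the viewport-relative point a click would use.
function toViewportPoint(pagePoint: Coord, viewport: ViewportState): Coord {
  return { x: pagePoint.x - viewport.scrollX, y: pagePoint.y - viewport.scrollY };
}

// A click can only land on a point that lies inside the visible viewport.
function isInViewport(point: Coord, viewport: ViewportState): boolean {
  return point.x >= 0 && point.x < viewport.width && point.y >= 0 && point.y < viewport.height;
}

// Number of scroll actions needed to bring a page-relative y into view.
// Positive means scroll_down, negative means scroll_up, 0 means already visible.
function scrollStepsNeeded(pageY: number, viewport: ViewportState): number {
  return Math.floor((pageY - viewport.scrollY) / viewport.height);
}
```

For example, with the default 900x600 window and no scrolling yet, a target at page position (450, 1500) needs two `scroll_down` actions, after which it sits at viewport position (450, 300).
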
## How It Works

When the `browser_action` tool is invoked, it follows this process:

1. **Action Validation and Browser Management**:
   - Validates the required parameters for the requested action
   - For `launch`: Initializes a browser session (either local Puppeteer instance or remote Chrome)
   - For interaction actions: Uses the existing browser session
   - For `close`: Terminates or disconnects from the browser appropriately

2. **Page Interaction and Stability**:
   - Ensures pages are fully loaded using DOM stability detection via the `waitTillHTMLStable` algorithm
   - Executes requested actions (navigation, clicking, typing, scrolling) with proper timing
   - Monitors network activity after clicks and waits for navigation when necessary

3. **Visual Feedback**:
   - Captures optimized screenshots using WebP format (with PNG fallback)
   - Records browser console logs for debugging purposes
   - Tracks mouse position and maintains paginated history of actions

4. **Session Management**:
   - Maintains browser state across multiple actions
   - Handles errors and automatically cleans up resources
   - Enforces proper workflow sequence (launch → interactions → close)

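This document does not show how `waitTillHTMLStable` is implemented. One common way to detect DOM stability is to sample the size of the page's HTML at a fixed interval and treat the page as settled once the size has stayed unchanged for several consecutive samples. The sketch below captures only that decision rule; the function name `firstStableSample` and its parameters are hypothetical, and the samples are passed in as an array so the logic can run without a browser.

```typescript
// Given HTML sizes sampled at a fixed interval, return the index of the sample
// at which the page is first considered stable, or null if it never settles.
// "Stable" here means the size matched the previous sample
// `minStableIterations` times in a row (an assumption of this sketch).
function firstStableSample(sizes: number[], minStableIterations: number): number | null {
  let stableCount = 0;
  let lastSize: number | null = null;
  let index = 0;
  for (const size of sizes) {
    if (lastSize !== null && size === lastSize) {
      stableCount++;
    } else {
      // The HTML changed (or this is the first sample), so restart the count.
      stableCount = 0;
    }
    if (stableCount >= minStableIterations) return index;
    lastSize = size;
    index++;
  }
  return null;
}
```

In a real polling loop, each sample would come from the live page and the loop would stop at a timeout; a `null` result here corresponds to that timeout being reached before the page settles.
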
## Workflow Sequence

Browser interactions must follow this specific sequence:

1. **Session Initialization**: All browser workflows must start with a `launch` action
2. **Interaction Phase**: Multiple `click`, `type`, and scroll actions can be performed
3. **Session Termination**: All browser workflows must end with a `close` action
4. **Tool Switching**: After closing the browser, other tools can be used

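The sequence above behaves like a two-state machine: either a session is open or it is not. The sketch below is one way to model it; `BrowserWorkflow` and its methods are hypothetical names, and rejecting a second `launch` while a session is already open is a choice made by this sketch, not documented behavior.

```typescript
type WorkflowAction = "launch" | "click" | "type" | "scroll_down" | "scroll_up" | "close";

class BrowserWorkflow {
  private sessionOpen = false;

  // Apply one action, throwing if it violates launch → interactions → close.
  apply(action: WorkflowAction): void {
    if (action === "launch") {
      // Sketch assumption: a session must be closed before launching again.
      if (this.sessionOpen) throw new Error("browser already launched; close it first");
      this.sessionOpen = true;
      return;
    }
    if (!this.sessionOpen) {
      throw new Error(`"${action}" requires a launched browser`);
    }
    if (action === "close") {
      this.sessionOpen = false;
    }
  }

  // Other tools may only run while no browser session is open.
  canUseOtherTools(): boolean {
    return !this.sessionOpen;
  }
}
```

Walking the sequence `launch`, `click`, `type`, `close` through `apply` succeeds, while starting with `click` throws because no session exists yet.
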
## Examples When Used

- When creating a web form submission process, Roo launches a browser, navigates to the form, fills out fields with the `type` action, and clicks submit.
- When testing a responsive website, Roo navigates to the site and uses scroll actions to examine different sections.
- When capturing screenshots of a web application, Roo navigates through different pages and takes screenshots at each step.
- When demonstrating an e-commerce checkout flow, Roo simulates the entire process from product selection to payment confirmation.

## Usage Examples

Launching a browser and navigating to a website:

```
<browser_action>
<action>launch</action>
<url>https://example.com</url>
</browser_action>
```

Clicking at specific coordinates (e.g., a button):

```
<browser_action>
<action>click</action>
<coordinate>450,300</coordinate>
</browser_action>
```

Typing text into a focused input field:

```
<browser_action>
<action>type</action>
<text>Hello, World!</text>
</browser_action>
```

Scrolling down to see more content:

```
<browser_action>
<action>scroll_down</action>
</browser_action>
```

Closing the browser session:

```
<browser_action>
<action>close</action>
</browser_action>
```