SkillWeaver is a skill-centric framework enabling agents to self-improve by autonomously synthesizing reusable skills as APIs. Given a new website, the agent autonomously discovers skills, executes them for practice, and distills practice experiences into robust APIs. Iterative exploration continually expands a library of lightweight, plug-and-play APIs, significantly enhancing the agent's capabilities.
It is recommended to first create a virtual environment:
```shell
conda create -n skillweaver python=3.10
conda activate skillweaver
pip install -r requirements.txt
playwright install
```
```shell
# OpenAI API
export OPENAI_API_KEY=<your_openai_api_key>

# If you'd like to use Azure-hosted OpenAI models instead
export AZURE_OPENAI=1
export AZURE_OPENAI_gpt-4o_ENDPOINT=<endpoint>
export AZURE_OPENAI_gpt-4o_API_KEY=<endpoint API key>
```
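How these variables fit together can be sketched in Python. The helper below is illustrative only: the variable names match the ones above, but `resolve_openai_config` itself is an assumption, not part of SkillWeaver.

```python
import os

def resolve_openai_config(model: str = "gpt-4o") -> dict:
    """Illustrative helper (not SkillWeaver code): pick Azure vs. standard
    OpenAI credentials based on the environment variables described above."""
    if os.environ.get("AZURE_OPENAI") == "1":
        # Azure endpoints and keys are stored per model name.
        return {
            "endpoint": os.environ[f"AZURE_OPENAI_{model}_ENDPOINT"],
            "api_key": os.environ[f"AZURE_OPENAI_{model}_API_KEY"],
        }
    return {"api_key": os.environ["OPENAI_API_KEY"]}
```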
To attempt a task, you can use the following command:
```shell
python -m skillweaver.attempt_task <start-url> <task> [...options]
```
Arguments:
- `start-url`: The URL to start the task from. You can use `__REDDIT__`, `__MAP__`, `__SHOPPING__`, `__SHOPPING_ADMIN__`, and `__GITLAB__` as the prefix if you would like to load the address from environment variables and perform the login step (required for most WebArena tasks).
- `task`: The task to attempt. This should be a string that describes the task to be attempted.
- `--agent-lm-name [lm_name]`: The name of the LLM to use for the agent. Default: `gpt-4o`.
- `--max-steps`: The agent's time limit to complete the task, measured in generated actions. Default: 10.
- `--knowledge-base-path-prefix`: The path to the synthesized APIs (without `_code.py`). For example, `logs/explore-reddit-gpt4o/iter_79/kb_post`.
- `--headless`: Whether to attempt the task in headless mode.
For example, to try a task on the `reddit` website, you could use the following command:
```shell
python -m skillweaver.attempt_task __REDDIT__ "Post to the gaming forum to ask about the best games of the year" --knowledge-base-path-prefix skill_library/reddit/reddit_kb_post
```
To compare the performance without the knowledge base, remove the `--knowledge-base-path-prefix` argument:
```shell
python -m skillweaver.attempt_task __REDDIT__ "Post to the gaming forum to ask about the best games of the year"
```
This is an experimental version that uses the agent from Browser-Use. It converts our knowledge base into a Browser-Use `Controller` object that can be used to extend the action space of an existing agent.
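Conceptually, the conversion wraps each synthesized skill as a named action that the agent can invoke. The sketch below uses a simplified stand-in registry class; the real Browser-Use `Controller` has its own registration API, and `ActionRegistry`/`extend_action_space` are illustrative names, not part of either library.

```python
class ActionRegistry:
    """Simplified stand-in for a Browser-Use-style controller (illustrative only)."""

    def __init__(self):
        self._actions = {}

    def register(self, fn, name=None, description=""):
        # Expose each knowledge-base skill under a stable action name.
        self._actions[name or fn.__name__] = {
            "fn": fn,
            "description": description or (fn.__doc__ or ""),
        }

    def call(self, name, *args, **kwargs):
        return self._actions[name]["fn"](*args, **kwargs)


def extend_action_space(registry, knowledge_base):
    """Register every synthesized skill (a dict of name -> callable) as an action."""
    for name, fn in knowledge_base.items():
        registry.register(fn, name=name)
```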
To attempt a task, you can use the following command:
```shell
python -m skillweaver.attempt_task_browser_use <start-url> <task> [...options]
```
Arguments:
- `start-url`: The URL to start the task from. You can use `__REDDIT__`, `__MAP__`, `__SHOPPING__`, `__SHOPPING_ADMIN__`, and `__GITLAB__` as the prefix if you would like to load the address from environment variables and perform the login step (required for most WebArena tasks).
- `task`: The task to attempt. This should be a string that describes the task to be attempted.
- `--agent-lm-name [lm_name]`: The name of the LLM to use for the agent. Default: `gpt-4o`.
- `--knowledge-base-path-prefix`: The path to the synthesized APIs (without `_code.py`). For example, `logs/explore-reddit-gpt4o/iter_79/kb_post`.
- `--headless`: Whether to attempt the task in headless mode.
For example, to try a task on the `reddit` website, you could use the following command:
```shell
python -m skillweaver.attempt_task_browser_use __REDDIT__ "Post to the gaming forum to ask about the best games of the year" --knowledge-base-path-prefix skill_library/reddit/reddit_kb_post
```
To compare the performance without the knowledge base, remove the `--knowledge-base-path-prefix` argument:
```shell
python -m skillweaver.attempt_task_browser_use __REDDIT__ "Post to the gaming forum to ask about the best games of the year"
```
Once you have set up your virtual environment and created a `.env` file with the appropriate configuration, you can explore a website using the following command:
```shell
python -m skillweaver.explore [website] [out_dir] --iterations [niter] (... options ...)
```
Arguments:
- `website`: The URL or name of the website to explore. You can specify a WebArena website by passing in the name of the website (e.g., `shopping`). The available WebArena environments are `shopping`, `shopping_admin`, `map`, `reddit`, and `gitlab`.
- `out_dir`: The directory to save the exploration results. Note that if a directory already exists at the specified path, the exploration will not start.
- `--iterations [niter]`: The number of iterations to run the exploration for. Default: 10.
- `--agent-lm-name [lm_name]`: The name of the LLM to use for the agent. Default: `gpt-4o`.
- `--api-synthesis-lm-name [lm_name]`: The name of the LLM to use for API synthesis.
- `--success-check-lm-name [lm_name]`: The name of the LLM to use for success checking. Default: `gpt-4o`.
- `--explore-schedule`: How to perform exploration and testing iterations. Can be of the format `test_probability:X` to test a generated API (if possible) with probability `X`, or `explore:X,test:Y` to alternate between `X` iterations of exploration and `Y` iterations of testing.
- `--allow-recovery`: Whether to allow the agent to "patch" APIs that throw exceptions during testing. Default: `--allow-recovery`. This can be disabled with `--no-allow-recovery`.

Here is an example command:
```shell
python -m skillweaver.explore reddit logs/explore-reddit-gpt4o --agent-lm-name gpt-4o --api-synthesis-lm-name gpt-4o --iterations 160
```
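The two documented `--explore-schedule` formats can be parsed along these lines. This parser is a hedged sketch of the syntax described above, not the project's actual implementation.

```python
def parse_explore_schedule(spec: str) -> dict:
    """Parse 'test_probability:X' or 'explore:X,test:Y' (the two formats
    documented for --explore-schedule). Illustrative, not SkillWeaver code."""
    if spec.startswith("test_probability:"):
        # Test a generated API (when possible) with probability X.
        return {"mode": "probabilistic",
                "test_probability": float(spec.split(":", 1)[1])}
    parts = dict(p.split(":") for p in spec.split(","))
    if set(parts) == {"explore", "test"}:
        # Alternate between X exploration and Y testing iterations.
        return {"mode": "alternating",
                "explore": int(parts["explore"]),
                "test": int(parts["test"])}
    raise ValueError(f"Unrecognized schedule: {spec!r}")
```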
WebArena recommends using Docker containers to host the websites that are being evaluated. We recommend taking a look at their guide to download the containers. We have an automated way to run evaluations using these containers once downloaded, but you can also run the containers manually, or even specify a custom URL to evaluate with instead of using the containers.
We orchestrate multiple Docker containers to allow running experiments in parallel. The orchestrator server should run outside of Docker (e.g., in a virtualenv). It exposes REST endpoints on port 5125, which are used internally by the containers context manager.
Before running experiments, start the orchestrator:
```shell
python -m skillweaver.containerization.serve
```
To run it on a different port, set `ORCHESTRATOR_PORT`:
```shell
ORCHESTRATOR_PORT=5128 python -m skillweaver.containerization.serve
```
The containers will be routed to ports `8000`, `8001`, `8002`, etc. Ensure that these ports are accessible externally if you are using a cloud environment. Make sure the `IP` variable is set correctly in your `.env` file if using a cloud environment; otherwise, the containers may redirect you to `127.0.0.1`, which will be incorrect if you are using a remote server (e.g., AWS) to run the test.
To evaluate a single website using an existing container, set the following environment variables in your `.env` file:
```shell
SHOPPING=(hostname)
SHOPPING_ADMIN=(hostname)
REDDIT=(hostname)
GITLAB=(hostname)
MAP=(hostname)
CONTAINER_SETUP=manual
```
Use `CONTAINER_SETUP=manual` to use your existing container. If you would like to use the containerization framework, omit this line; the orchestrator server will automatically spin up containers as needed.
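The effect of the manual setup path can be sketched as follows. `resolve_site_host` is a hypothetical helper that only illustrates how the variables above might be consumed; it is not SkillWeaver code.

```python
import os

# Site variables documented above for manual container setup.
SITES = ("SHOPPING", "SHOPPING_ADMIN", "REDDIT", "GITLAB", "MAP")

def resolve_site_host(site: str) -> str:
    """Hypothetical helper: with CONTAINER_SETUP=manual, each site's
    hostname comes straight from the environment rather than from the
    orchestrator server."""
    if os.environ.get("CONTAINER_SETUP") != "manual":
        raise RuntimeError("orchestrator-managed containers would be used instead")
    key = site.upper()
    if key not in SITES:
        raise ValueError(f"unknown site: {site}")
    return os.environ[key]
```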
To run the evaluation, use the following command:
```shell
python -m skillweaver.evaluate_benchmark [website] [out_dir] (... options ...)
```
Arguments:
- `website`: The name of the website to evaluate. This can be one of `shopping`, `shopping_admin`, `reddit`, `gitlab`, or `map`.
- `out_dir`: The directory to save the evaluation results. Note that if a directory already exists at the specified path, the evaluation will not start.
- `--time-limit [time_limit]`: The agent's time limit to complete each evaluation task. Default: 10 actions.
- `--knowledge-base-path-prefix [prefix]`: The prefix of the knowledge base to use for the evaluation. Default: `None` (no knowledge base). This should be of the format `/path/to/iteration/dir/kb_post`.
- `--lm-name [lm_name]`: The name of the LLM to use for the agent. Default: `gpt-4o-2024-08-06`.
- `--pool-size [pool_size]`: The number of subprocesses for evaluation. Each subprocess gets its own Docker container. Default: 8.
- `--use-debugger-eval`: Whether to use the modified WebArena debugger, which adds additional information about why a test case failed. Default: `True`.
- `--allow-recovery`: Whether to allow the agent to "patch" APIs that throw exceptions during testing. Default: `True`.
- `--reduced-set`: Whether to use a reduced set of test cases (one test case per unique "intent template" provided in the WebArena benchmark). Default: `True`.
- `--allow-unverified-apis`: Whether to allow the agent to use APIs that have not been executed without a runtime error. Default: `False`.
- `--selected-tasks [task1,task2,...]` or `reduced_set`: A list of task indices to evaluate. If specified as `reduced_set`, this selects one of each `intent_template` from the WebArena benchmark (approximately 20-40 out of 100+ tasks). If specified as a list of integers, this selects tasks by index from the WebArena benchmark. Default: `None`, which evaluates all tasks in the benchmark for that website.
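The three accepted forms of `--selected-tasks` can be distinguished with a small parser. This is a sketch of the documented behavior, not the project's actual argument handling.

```python
def parse_selected_tasks(value):
    """Parse the --selected-tasks value described above:
    None (all tasks), 'reduced_set', or a comma-separated list of indices.
    Illustrative only, not SkillWeaver code."""
    if value is None:
        return None  # evaluate every task for the website
    if value == "reduced_set":
        return "reduced_set"  # one task per unique intent template
    return [int(v) for v in value.split(",")]  # explicit task indices
```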
The prompts have all been organized into separate `.md` files and placed under `skillweaver/templates`.
This dataset was collected and released solely for research purposes, with the goal of making the web more accessible via language technologies. The authors are strongly against any potential harmful use of the data or technology to any party.