forked from Significant-Gravitas/AutoGPT
Remove the submodule, reference OpenAI directly rather than running it on the command line, fix logging (#16)

* Removed submodule, refactor, docker on pip, async docker logging, running our own tool on the CLI rather than OpenAI's
commit 625d6e7 (1 parent: f00ced6)
Showing 12 changed files with 452 additions and 99 deletions.
`.gitignore`
@@ -127,3 +127,5 @@ dmypy.json

# Pyre type checker
.pyre/

/data
`.gitmodules`
@@ -1,3 +0,0 @@
[submodule "Auto-GPT"]
	path = auto_gpt_benchmarking/Auto-GPT
	url = https://github.com/Significant-Gravitas/Auto-GPT.git
`README.md`
@@ -1,69 +1,97 @@
# Auto-GPT-Benchmarks
A set of standardised benchmarks to assess the performance of Auto-GPTs.
A set of standardised benchmarks to assess the performance of Auto-GPT.
This currently uses the OpenAI Evals framework to run the benchmarks.

# What is next?
## Setup

- [ ] Build longer form tasks (code fix backed by testing)
- [ ] Explicitly note the common failure modes in the test harness and fix them. Most of these appear to be failure modes with the core AutoGPT project
- [ ] Switch to an Ubuntu container so it can do more things (git, bash, etc.)
- [ ] Lower priority, but put this in a webserver backend so we have a good API rather than doing container and file management for our interface between evals and our agent.
- [ ] Get token counting data from the model. Add scores to result files based on pricing associated with tokens and models used
- [ ] Think about how this can be applied to other projects besides AutoGPT so we can be THE agent evaluation framework.
- [ ] Copy the OpenAI Eval files from the tmp location they are saved to into somewhere we can track the results
- [ ] Support multi-threaded evals. OpenAI has great support for this. The docker system built here doesn't.
You must add the auto_gpt_benchmarking dir to the Python path.
Do this with a .pth file in your venv. OpenAI Evals needs to import it.

These instructions currently assume Ubuntu 22.04.
They should be fairly adaptable to the Windows/macOS equivalents. Please submit a PR if you would like to see your OS
documented.

## Understanding OpenAI Evals
Clone the repo with:

The Evals docs are here and very good: https://github.com/openai/evals/tree/main/docs
`git clone git@github.com:Significant-Gravitas/Auto-GPT-Benchmarks.git`
`cd Auto-GPT-Benchmarks`

The basic idea is this:
1. Use a completion function to point to the model you want to test (a language model, or in our case AutoGPT).
2. Register that completion function with the evals framework with a YAML file in a `completion_fns` dir.
3. Run the evals against the completion function.
Create a venv with

Then you can make more YAML-defined evals and run them against the completion function as needed.
`python3.9 -m venv venv`

### Completions Functions

See our YAML file in the `completion_fns` dir for the registration of the completion function.
See our completion function itself in CompletionFn.py.
That points to the AutoGPT model we want to test, which is spun up dynamically in a Docker container in AutoGPTAgent.py.
Activate it with

`source venv/bin/activate`

## Setup
Install the requirements with:

You must add the auto_gpt_benchmarking dir to the Python path.
Do this with a .pth file in your venv. OpenAI Evals needs to import it.
`pip install -r requirements.txt`

Create a venv with
If you haven't already, clone the AutoGPT repo somewhere else on your machine.
DO NOT CLONE IT INTO A SUBDIR OF THIS REPO.

`python3.9 -m venv venv`
`cd somewhere/else`
`git clone git@github.com:Significant-Gravitas/Auto-GPT.git`

Activate it with
You will need to update the .env file in the Auto-GPT repo to have your OpenAI API key. The file in question is at:

`Auto-GPT/.env`
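The key goes in the `OPENAI_API_KEY` variable in that file.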

`source venv/bin/activate`
Finally, we assume you have a Docker container built from the Dockerfile in the Auto-GPT repo.

Add a file to `venv/lib/python3.9/site-packages/benchmarking.pth` with the contents:
`/PATH/TO/REPO/Auto-GPT-Benchmarks-fork`
Build this with:

This is because evals tries to import it directly.
`cd Auto-GPT`
`docker build -t autogpt .`

Install the requirements with
If you want to run with Redis as your memory system, you can stand up a Redis image in the AutoGPT repo with

`docker compose up`

`pip install -r requirements.txt`
Then you will need to adjust some variables in your .env file to use the Redis memory backend.
See the AutoGPT docs on how to do that.

You must have a Docker container built corresponding to the submodule below or the `docker run` command starting the agent will fail.
Run your first eval with:

`cd` into the AutoGPT submodule and build/tag the Dockerfile so the agent can be instantiated.
`cd auto_gpt_benchmarks/Auto-GPT`
`cd Auto-GPT-Benchmarks`
`python3 -m auto_gpt_benchmarking test-match --auto-gpt-path /your/path/to/Auto-GPT`

Build the container so we can run it procedurally!
`docker build -t autogpt .`
You should only need to use the --auto-gpt-path flag the first time you run it. Afterwards, that will be saved in

## Running the tests
`auto_gpt_benchmarking/completion_fns/auto_gpt_completion_fn.yaml`.

EVALS_THREADS=1 EVALS_THREAD_TIMEOUT=600 oaieval auto_gpt_completion_fn test-match --registry_path $PWD/auto_gpt_benchmarking
To see a full list of available flags, run `python3 -m auto_gpt_benchmarking --help`.
Some of these are inherited from the OpenAI Evals framework and do not work quite as intended, as they are not applicable
to this use case.

This saves a file in `Auto-GPT-Benchmarks/data/records.jsonl`.
This location is currently the default; it is configurable with the `--record_path` flag. You will have to specify the fully
qualified path.
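For example, to send the records somewhere specific (the path below is illustrative; use one that exists on your machine):

`python3 -m auto_gpt_benchmarking test-match --record_path /your/path/to/Auto-GPT-Benchmarks/data/records.jsonl`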

## Currently Supported Benchmarks:
From OpenAI Evals:
- [x] test-match
- [x] test-fuzzy-match
- [ ] Everything else they have...

## Understanding OpenAI Evals

The Evals docs are here and very good: https://github.com/openai/evals/tree/main/docs

The basic idea is this:
1. Use a completion function to point to the model you want to test (a language model, or in our case AutoGPT).
2. Register that completion function with the evals framework with a YAML file in a `completion_fns` dir.
3. Run the evals against the completion function.

Then you can also make more YAML-defined evals and run them against the completion function as needed.

### Completions Functions

See our YAML file in the `completion_fns` dir for the registration of the completion function.
See our completion function itself in CompletionFn.py.
That points to the AutoGPT model we want to test, which is spun up dynamically in a Docker container in AutoGPTAgent.py.
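For orientation, here is a minimal sketch of the shape an Evals completion function takes, based on the `CompletionFn`/`CompletionResult` interfaces in the OpenAI Evals package (see the Evals docs linked above for the authoritative signatures). The `EchoCompletionFn`/`EchoResult` names are made up for illustration and are not the contents of CompletionFn.py; the real one hands the prompt to the AutoGPT agent and returns its answer.

```python
from evals.api import CompletionFn, CompletionResult  # interfaces provided by the OpenAI Evals package


class EchoResult(CompletionResult):
    """Wraps a single completion string so the eval can read it back."""

    def __init__(self, text: str):
        self.text = text

    def get_completions(self) -> list[str]:
        return [self.text]


class EchoCompletionFn(CompletionFn):
    """Toy stand-in: returns the prompt itself instead of asking an agent."""

    def __call__(self, prompt, **kwargs) -> CompletionResult:
        return EchoResult(str(prompt))
```

The YAML in `completion_fns` then maps a registered name such as `auto_gpt_completion_fn` to the real class so `oaieval` can find it.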

# Example final output:

@@ -79,3 +107,12 @@ EVALS_THREADS=1 EVALS_THREAD_TIMEOUT=600 oaieval auto_gpt_completion_fn test-mat
{"run_id": "230417220821DPM75QNS", "event_id": 5, "sample_id": "test-match.s1.0", "type": "match", "data": {"correct": false, "expected": "time", "picked": null, "sampled": "Once upon a time", "options": ["time"]}, "created_by": "", "created_at": "2023-04-17 22:12:04.691064+00:00"}
(venv) douglas@douglas-XPS-15-9500:~/AGI/Auto-GPT-Benchmarks-fork$

# What is next?

- [ ] Run the rest of the OpenAI Evals, especially the model-graded ones
- [ ] Build longer form tasks (code fix backed by testing)
- [ ] Explicitly note the common failure modes in the test harness and fix them. Most of these appear to be failure modes with the core AutoGPT project
- [ ] Get token counting data from the model. Add scores to result files based on pricing associated with tokens and models used
- [ ] Think about how this can be applied to other projects besides AutoGPT so we can be THE agent evaluation framework.
- [ ] Figure out how the OpenAI Evals results are saved...
- [ ] Support multi-threaded evals. OpenAI has great support for this. The docker system built here doesn't.
Submodule Auto-GPT deleted from 97d62c
@@ -1,6 +1,6 @@
ai_goals:
- Evaluate the prompt in `prompt.txt` and find the best answer in the format provided.
- Get the correct answer to the question in the fewest number of steps possible. You are scored first on if you get the correct answer, and second on how many tokens you take to get the right answer so keep your thinking and tool usage as minimal as possible while still ensuring you get the correct answer.
- Save the final answer and output to the `output.txt` file, the only file you should write to then immediately exit the program.
- Save the final answer and output to the `output.txt` file, the only file you should write to, then immediately exit the program because you are done.
ai_name: EvaluationAgent
ai_role: an ai that is tested on how effectively it can efficiently evaluate questions and answer them correctly while using as few resources as possible
@@ -0,0 +1,61 @@
"""
The evaluator class actually executes the evals.
"""
from evals.cli import oaieval
from evals.registry import Registry
from pathlib import Path
from typing import List, Optional, Tuple
import sys


class OAIRunArgs:
    def __init__(
        self,
        completion_fn: str,
        eval: str,
        extra_eval_params: str = "",
        max_samples: int = None,
        cache: bool = True,
        visible: bool = None,
        seed: int = 20220722,
        user: str = "",
        record_path: str = None,
        log_to_file: str = None,
        debug: bool = False,
        local_run: bool = True,
        dry_run: bool = False,
        dry_run_logging: bool = True,
    ):
        self.completion_fn = completion_fn
        self.eval = eval
        self.extra_eval_params = extra_eval_params
        self.max_samples = max_samples
        self.cache = cache
        self.visible = visible
        self.seed = seed
        self.user = user
        self.record_path = record_path
        self.log_to_file = log_to_file
        self.debug = debug
        self.local_run = local_run
        self.dry_run = dry_run
        self.dry_run_logging = dry_run_logging
        # create the record and logging paths if they don't exist
        Path(self.record_path).parent.mkdir(parents=True, exist_ok=True)
        # Path(self.log_to_file).parent.mkdir(parents=True, exist_ok=True)
        # Registry path should be the auto_gpt_benchmarking folder
        self.registry_path = None


class Evaluator:
    def __init__(self, oai_run_args: OAIRunArgs):
        self.oai_run_args = oai_run_args
        registry_path = Path(__file__).parent

        # add registry path to the python system path
        sys.path.append(str(registry_path))
        self.oai_run_args.registry_path = [registry_path]
        # self.registry = Registry([registry_path])

    def run(self):
        oaieval.run(self.oai_run_args)
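A quick sketch of how this class might be driven (the values are illustrative; the real CLI presumably builds an `OAIRunArgs` from its command-line flags). Note that `record_path` is passed explicitly here because the `mkdir` call in `OAIRunArgs.__init__` expects a real path rather than the `None` default.

```python
from pathlib import Path

# Hypothetical wiring, not the repo's actual entry point:
args = OAIRunArgs(
    completion_fn="auto_gpt_completion_fn",  # name registered in completion_fns/auto_gpt_completion_fn.yaml
    eval="test-match",                       # one of the currently supported OpenAI Evals
    record_path=str(Path.cwd() / "data" / "records.jsonl"),  # where eval records get written
)
Evaluator(args).run()
```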