Skip to content

An automated pipeline for evaluating LLMs for role-playing.

License

Notifications You must be signed in to change notification settings

boson-ai/RPBench-Auto

Repository files navigation

RPBench-Auto

Leaderboard | Blog

An automated pipeline for evaluating LLMs for role-playing.

RPBench Auto

Installation

pip install -r requirements.txt

Usage

First, set the environment variable OPENAI_API_KEY for the judge model and to the path of the RPBench dataset.

export OPENAI_API_KEY=<API_KEY>

Then, add the model config file for the model you want to evaluate. Currently we support OpenAI API (and compatible APIs) and Anthropic API. Edit config/api_config.yaml to add the model config.

Finally, run the pipeline.

python run_character_eval.py --model_1 <CONFIG_NAME>  # Evaluate the model on the character subset
python run_scene_eval.py --model_1 <CONFIG_NAME>  # Evaluate the model on the scene subset

Generate the leaderboard.

python generate_leaderboard.py

How to contribute

After running all commands above, you can add your model to the leaderboard by creating a pull request with the updated leaderboard files, leaderboard.csv and leaderboard_for_display.csv, plus the .jsonl files in /results/character and /results/scene. The leaderboard will be updated automatically when the PR is merged.

Acknowledgements

This benchmark is heavily inspired by ArenaHard and AlpacaEval. Some code implementations are borrowed from these repositories.