
TimeChara

This is the official repository of our ACL 2024 Findings paper:
TimeChara: Evaluating Point-in-Time Character Hallucination of Role-Playing Large Language Models

[Figure: TimeChara example]

Please cite our work if you find the resources in this repository useful:

@inproceedings{ahn2024timechara,
    title={TimeChara: Evaluating Point-in-Time Character Hallucination of Role-Playing Large Language Models},
    author={Jaewoo Ahn and Taehyun Lee and Junyoung Lim and Jin-Hwa Kim and Sangdoo Yun and Hwaran Lee and Gunhee Kim},
    booktitle={Findings of ACL},
    year=2024
}

For a brief summary of our paper, please see this webpage.

TimeChara

You can load TimeChara from the HuggingFace hub as follows:

from datasets import load_dataset

dataset = load_dataset("ahnpersie/timechara")
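
Once loaded, you can inspect the splits. A minimal sketch; the split names below ("validation" and "test") are assumptions based on the dataset details that follow:

print(dataset)                      # expected splits: validation (600) and test (10,895)
example = dataset["validation"][0]  # a single benchmark example
print(example.keys())               # inspect the available fields
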
Details on TimeChara

(1) Validation set (600 examples): randomly sampled from the test set.

(2) Test set (10,895 examples): the full dataset, including the validation set.

(3) We provide create_dataset.py to automatically construct TimeChara. Note that we currently support only the Harry Potter series; its source (en_train_set.json) can be obtained from the HPD dataset.

python create_dataset.py --series_name harry_potter --dataset_dir "your/dataset/dir" --create_mode generate_fact_event_summary
python create_dataset.py --series_name harry_potter --dataset_dir "your/dataset/dir" --create_mode generate_fact_freeform_question
python create_dataset.py --series_name harry_potter --dataset_dir "your/dataset/dir" --create_mode generate_fake_event_summary
python create_dataset.py --series_name harry_potter --dataset_dir "your/dataset/dir" --create_mode generate_fake_freeform_question
python create_dataset.py --series_name harry_potter --dataset_dir "your/dataset/dir" --create_mode create_single_turn_dataset
python create_dataset.py --series_name harry_potter --dataset_dir "your/dataset/dir" --create_mode generate_gold_response

(3-1) To use the OpenAI API for GPT-4, you need to export your OPENAI_API_KEY:

export OPENAI_API_KEY='your-openai-api-key'

Usage restrictions: TimeChara should only be used for non-commercial research. For more details, refer to the Ethics Statement in the paper.

Running evaluation on TimeChara

We recommend using Anaconda. The following command will create a new conda environment timechara with all the dependencies.

conda env create -f environment.yml

To activate the environment:

conda activate timechara

First, you can generate a response given a question by running the following command:

python generate.py --model_name gpt-4o-2024-05-13 --method_name zero-shot
# python generate.py --model_name gpt-4o-2024-05-13 --method_name zero-shot-cot
# python generate.py --model_name gpt-4o-2024-05-13 --method_name few-shot
# python generate.py --model_name gpt-4o-2024-05-13 --method_name self-refine
# python generate.py --model_name gpt-4o-2024-05-13 --method_name rag-cutoff
# python generate.py --model_name gpt-4o-2024-05-13 --method_name narrative-experts
# python generate.py --model_name gpt-4o-2024-05-13 --method_name narrative-experts-rag-cutoff
Details on Generation

(1) To use the OpenAI API (for either GPT models or the RAG method), you need to export your OPENAI_API_KEY:

export OPENAI_API_KEY='your-openai-api-key'

(2) To use RAG, manually download the Chroma DB files by clicking this link, then unzip and move them:

unzip chroma_db_files.zip
mv text-embedding-ada-002 methods/rag
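
If you want to sanity-check the download, the unzipped directory should open as a Chroma persistent store. A minimal sketch, assuming the layout above; this path and check are not part of the official scripts:

import chromadb

# Assumption: the unzipped directory is a Chroma persistent store.
client = chromadb.PersistentClient(path="methods/rag/text-embedding-ada-002")
print(client.list_collections())  # lists the collection(s) the RAG method retrieves from
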

Finally, you can evaluate a response by running the following command:

python evaluate.py --eval_model_name gpt-4-1106-preview --model_name gpt-4o-2024-05-13 --method_name zero-shot
Details on Evaluation

(1) Since we don't support AlignScore directly, use the independent AlignScore GitHub repository to evaluate generated responses via AlignScore instead of GPT-4 judges:

from alignscore import AlignScore

# Score each generated response (claim) against its gold response (context);
# both are lists of strings of equal length.
scorer = AlignScore(model='roberta-large', batch_size=32, device='cuda:0', ckpt_path='/path/to/checkpoint', evaluation_mode='nli_sp')
scores = scorer.score(contexts=gold_responses, claims=generated_responses)
scores = [x * 100 for x in scores]  # rescale from [0, 1] to [0, 100]
print(f"avg. AlignScore (# {len(scores)}) = {sum(scores)/len(scores)}")

All generation & evaluation results will be saved under outputs.

🏆 Leaderboard

We present spatiotemporal consistency results for newer models on the validation set, ranked by Average score.

| Model | Average [%] | Future [%] | Past-absence [%] | Past-presence [%] | Past-only [%] |
|---|---|---|---|---|---|
| o1-preview-2024-09-12 (zero-shot) | 80.5 | 82.5 | 83.0 | 88.0 | 73.5 |
| GPT-4o-2024-05-13 (zero-shot) | 64.5 | 46.0 | 74.0 | 90.0 | 65.5 |
| GPT-4-turbo-1106-preview (zero-shot) | 62.7 | 46.5 | 75.0 | 90.0 | 59.0 |
| Mistral-7b-instruct-v0.2 (zero-shot) | 46.8 | 44.5 | 53.0 | 63.0 | 38.0 |
| GPT-3.5-turbo-1106 (zero-shot) | 44.2 | 29.0 | 33.0 | 91.0 | 41.5 |

Have any questions?

Please contact Jaewoo Ahn at jaewoo.ahn at vision.snu.ac.kr

License

This repository is MIT licensed. See the LICENSE file for details.
