Update RecEval with vLLM and chat_template (#70)
* .

* update and edit Chinese

* del some files and fix chinese

* del some files and fix chinese

* del some files and fix chinese

* del some files and fix chinese

* del some files and fix chinese

* update requirements.txt

---------

Co-authored-by: Cecch <[email protected]>
Co-authored-by: YuxuanLei <[email protected]>
3 people committed Sep 25, 2024
1 parent 757e4ca commit 2fe6d63
Showing 15 changed files with 667 additions and 235 deletions.
9 changes: 1 addition & 8 deletions RecLM-eval/.gitignore
@@ -3,14 +3,7 @@
**/__pycache__/
output/
data/
!data/chatbot/question.jsonl
# !data/steam/conversation.jsonl
# !data/steam/explanation.jsonl
# !data/steam/metadata.json
# !data/steam/negative_samples.txt
# !data/steam/ranking.jsonl
# !data/steam/retrieval.jsonl
# !data/steam/sequential_data.txt

eval_models/*
!eval_models
*.log
63 changes: 45 additions & 18 deletions RecLM-eval/README.md
@@ -2,24 +2,55 @@
This is a project to evaluate how various LLMs perform on recommendation tasks, including retrieval, ranking, explanation, conversation, and chatbot ability. The overall workflow is depicted below:
![Figure Caption](evaluation_framework.jpg)

# Quick start

1. **Download Steam Data**:
[Download link](https://drive.google.com/file/d/1745XoSvkSG2C_1WOFM6PV6DjezrlXa8z/view?usp=drive_link)
Unzip it to the `./data/` folder.

```bash
unzip path_to_downloaded_file.zip -d ./data/
```

2. **Navigate to Project**:
```bash
cd RecLM-eval
```

3. **Configure API**:
Edit `openai_api_config.yaml` and add your API key:

```yaml
API_BASE: "if-you-have-different-api-url"
API_KEY: "your-api-key"
```

4. **Run**:
```bash
bash main.sh
```

# Usage

## Environment
```bash
conda create -n receval python==3.8
conda create -n receval python==3.9
conda activate receval
pip install -r requirements
pip install -r requirements.txt
```

## Set OpenAI API Environment
If you want to use OpenAI API, you need to fill the content in `openai_api_config.yaml`.

* If you want to use the OpenAI API, you need to fill in your API key in the `openai_api_config.yaml` file.
* If you are using models not pre-defined in the project, add their cost information to the `api_cost.jsonl` file (see the example entry below).
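
For illustration, each line of `api_cost.jsonl` is a standalone JSON object that maps a model name to its input/output prices (the entry below is copied from the file added in this commit; the units are assumed to be USD per million tokens):

```json
{"gpt-4o-mini": {"input": 0.15, "output": 0.6}}
```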
## Prepare your test data
For data preparation details, please refer to [[preprocess]](preprocess/data-preparation.md).
For you convenience, there is a toy example dataset derived from the Steam dataset (A simple combination of https://cseweb.ucsd.edu/~jmcauley/datasets.html#steam_data, https://github.com/kang205/SASRec/blob/master/data/Steam.txt and https://www.kaggle.com/datasets/trolukovich/steam-games-complete-dataset). Please download it from (https://drive.google.com/file/d/1oliigNX_ACRZupf1maFEkJh_uzl2ZUKm/view?usp=sharing) and unzip it to the ./data/ folder.
* For data preparation details, please refer to [[preprocess]](preprocess/data-preparation.md).
* For your convenience, there is a toy example dataset derived from the Steam dataset (a simple combination of https://cseweb.ucsd.edu/~jmcauley/datasets.html#steam_data, https://github.com/kang205/SASRec/blob/master/data/Steam.txt and https://www.kaggle.com/datasets/trolukovich/steam-games-complete-dataset).
* Please download it from https://drive.google.com/file/d/1745XoSvkSG2C_1WOFM6PV6DjezrlXa8z/view?usp=drive_link and unzip it to the `./data/` folder.

## Evaluate
You can specify the evaluation tasks through the `task-names` parameter. These values are avaliable: `ranking`, `retrieval`, `explanation`, `conversation`, `embedding_ranking`, `embedding_retrieval`, `chatbot`.
* You can specify the evaluation tasks through the `task-names` parameter.
* The available values are: `ranking`, `retrieval`, `explanation`, `conversation`, `embedding_ranking`, `embedding_retrieval`, `chatbot`.


### Ranking/Retrieval
Parameters:
@@ -30,13 +61,9 @@ example:
```bash
python eval.py --task-names ranking retrieval \
--bench-name steam \
--model_path_or_name facebook/opt-1.3b
--model_path_or_name Qwen/Qwen2.5-7B-Instruct
```
optional parameters (only for huggingface model):
- `--nodes NODES`: The number of nodes for distributed inference
- `--gpus GPUS`: The number gpus per node.
- `--nr NR`: Then ranking within the nodes.
- `--master_port MASTER_PORT`: The port of the master node.
optional parameters (only for vLLM models; see the example below):
- `--max_new_tokens MAX_NEW_TOKENS`: The maximum number of tokens to generate; the prompt length plus `max_new_tokens` should be less than your model's maximum length.
- `--batch_size BATCH_SIZE`: The batch size during inference.
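
For example, a run combining these flags with the ranking command above might look like the sketch below (the flag values are illustrative, not defaults):

```bash
python eval.py --task-names ranking retrieval \
    --bench-name steam \
    --model_path_or_name Qwen/Qwen2.5-7B-Instruct \
    --max_new_tokens 512 \
    --batch_size 16
```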
@@ -51,13 +78,13 @@ Parameters:
Example:
```bash
python eval.py --task-names embedding_ranking embedding_retrieval \
--model_path_or_name text-embedding-ada-002 \
--model_path_or_name text-embedding-3-small \
--bench-name steam \
--user_emb_type title \
--item_emb_type title
python eval.py --task-names embedding_ranking embedding_retrieval \
--model_path_or_name text-embedding-ada-002 \
--model_path_or_name text-embedding-3-small \
--bench-name steam \
--user_emb_type summary \
--summary-model gpt-3.5-turbo \
@@ -73,7 +100,7 @@ Parameters:
example:
```bash
python eval.py --task-names chatbot \
--model_path_or_name facebook/opt-1.3b \
--model_path_or_name Qwen/Qwen2.5-7B-Instruct \
--judge-model gpt-3.5-turbo \
--baseline-model gpt-3.5-turbo
```
@@ -87,7 +114,7 @@ Parameters:
```bash
python eval.py --task-names explanation \
--bench-name steam \
--model_path_or_name facebook/opt-1.3b \
--model_path_or_name Qwen/Qwen2.5-7B-Instruct \
--judge-model gpt-3.5-turbo \
--baseline-model gpt-3.5-turbo
```
@@ -102,7 +129,7 @@ example:
```bash
python eval.py --task-names conversation \
--bench-name steam \
--model_path_or_name facebook/opt-1.3b \
--model_path_or_name Qwen/Qwen2.5-7B-Instruct \
--simulator-model gpt-3.5-turbo \
--max_turn 5
```
14 changes: 14 additions & 0 deletions RecLM-eval/api_cost.jsonl
@@ -0,0 +1,14 @@
{"chatgpt-4o-latest": {"input": 5.0, "output": 15.0}}
{"gpt-4-turbo": {"input": 10.0, "output": 30.0}}
{"gpt-4": {"input": 30.0, "output": 60.0}}
{"gpt-4-vision-preview": {"input": 10.0, "output": 30.0}}
{"gpt-35-turbo": {"input": 0.5, "output": 1.5}}
{"davinci-002": {"input": 2.0, "output": 2.0}}
{"babbage-002": {"input": 0.4, "output": 0.4}}
{"gpt-4o": {"input": 5.0, "output": 15.0}}
{"gpt-4o-mini": {"input": 0.15, "output": 0.6}}
{"o1-preview": {"input": 15.0, "output": 60.0}}
{"text-embedding-ada-002": {"input": 0.1, "output": 0}}
{"text-embedding-3-small": {"input": 0.02, "output": 0}}
{"text-embedding-3-large": {"input": 0.13, "output": 0}}
{"ada v2": {"input": 0.1, "output": 0}}
128 changes: 15 additions & 113 deletions RecLM-eval/call_models/huggingface_models.py
@@ -14,44 +14,6 @@
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, AutoModel

DEFAULT_SYSTEM_PROMPT = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions."
class ChatDataset(Dataset):
def __init__(self, test_dataset, tokenizer, max_seq_len, world_size, rank, system_prompt) -> None:
super().__init__()
self.test_dataset = test_dataset
self.tokenizer = tokenizer
self.max_seq_len = max_seq_len
self.world_size = world_size
self.rank = rank
if system_prompt:
self.system_prompt = system_prompt
else:
self.system_prompt = DEFAULT_SYSTEM_PROMPT

def __len__(self):
length = (len(self.test_dataset)+self.world_size-1) // self.world_size
return length

def __getitem__(self, idx):
if idx * self.world_size + self.rank < len(self.test_dataset):
data = self.test_dataset[idx * self.world_size + self.rank]
else:
data = self.test_dataset[-1]

if isinstance(data["prompt"], str):
inputs = f"USER: {data['prompt'].strip()} "
else:
inputs = ""
for text in data["prompt"]:
if text["role"] == "assistant":
inputs += "ASSISTANT: " + text["content"] + ' '
else:
inputs += "USER: " + text["content"] + ' '
tokens = self.tokenizer.tokenize(f"{self.system_prompt} {inputs}")

tokens = tokens[:self.max_seq_len-len(self.tokenizer.tokenize("ASSISTANT:"))]
truncated_prompt = self.tokenizer.convert_tokens_to_string(tokens) + "ASSISTANT:"

return truncated_prompt.strip()

class EmbDataset(Dataset):
def __init__(self, test_dataset, tokenizer, max_seq_len, world_size, rank) -> None:
@@ -75,100 +37,48 @@ def __getitem__(self, idx):
tokens = self.tokenizer.tokenize(data['prompt'].strip())[:self.max_seq_len]
truncated_prompt = self.tokenizer.convert_tokens_to_string(tokens)

return truncated_prompt.strip()
return truncated_prompt

def run_chat(local_gpu_rank, model_path_or_name, question_file, answer_file, args, system_prompt):
args.rank = args.nr * args.gpus + local_gpu_rank
args.device = torch.device("cuda", local_gpu_rank)
torch.cuda.set_device(args.device)
dist.init_process_group(backend='nccl', init_method="env://", world_size=args.world_size, rank=args.rank)

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
model_path_or_name, fast_tokenizer=True, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
# load model
model_config = AutoConfig.from_pretrained(args.model_path_or_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
args.model_path_or_name,
from_tf=bool(".ckpt" in args.model_path_or_name),
config=model_config,
low_cpu_mem_usage=True,
torch_dtype=torch.float16,
trust_remote_code=True,
# use_safetensors=False
)

model.config.end_token_id = tokenizer.eos_token_id
model.config.pad_token_id = model.config.eos_token_id

generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device=args.device)
greedy_config = GenerationConfig(return_full_text=False, max_new_tokens=args.max_new_tokens)

# load test dataset
test_data = []
for line in open(question_file):
test_data.append(json.loads(line))
test_dataset = ChatDataset(test_data, tokenizer, model_config.max_position_embeddings - args.max_new_tokens - 100, args.world_size, args.rank, system_prompt) # 1700 < 2048 - max new tokens

if args.rank == 0:
pbar = tqdm(total=len(test_dataset), desc="inferencing")
result_lists = []

with torch.no_grad():
for raw_result in generator(test_dataset, generation_config=greedy_config, batch_size=args.batch_size):
result = raw_result[0]['generated_text'].split("ASSISTANT:")[-1].strip()
if "<|endoftext|>" in result:
result = result.split('<|endoftext|>')[0].strip()
if "</s>" in result:
result = result.split('</s>')[0].strip()
result = (args.rank, result)
gather_data = [None for _ in range(args.world_size)]
dist.all_gather_object(gather_data, result)
if args.rank == 0:
gather_data = sorted(gather_data, key=lambda x:x[0])
for result in gather_data:
result_lists.append(result[1])
pbar.update(1)

if args.rank == 0:
os.makedirs(os.path.dirname(answer_file), exist_ok=True)
fd = open(answer_file, "w", encoding='utf-8')
for data, result in zip(test_data, result_lists):
data["answer"] = result
fd.write(json.dumps(data, ensure_ascii=False) + '\n')

def run_embedding(local_gpu_rank, model_path_or_name, question_file, answer_file, args):
args.rank = args.nr * args.gpus + local_gpu_rank
args.device = torch.device("cuda", local_gpu_rank)
torch.cuda.set_device(args.device)
dist.init_process_group(backend='nccl', init_method="env://", world_size=args.world_size, rank=args.rank)

model_config = AutoConfig.from_pretrained(args.model_path_or_name)
model_config = AutoConfig.from_pretrained(args.model_path_or_name, trust_remote_code=True)
if "CausalLM" in model_config.architectures[0]:
# load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path_or_name, use_fast=False, padding_side='left')
tokenizer = AutoTokenizer.from_pretrained(model_path_or_name, use_fast=False, padding_side='left', trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
# load model
model = AutoModelForCausalLM.from_pretrained(
args.model_path_or_name,
from_tf=bool(".ckpt" in args.model_path_or_name),
config=model_config,
low_cpu_mem_usage=True
low_cpu_mem_usage=True,
trust_remote_code=True
).to(args.device)
model.config.end_token_id = tokenizer.eos_token_id
model.config.pad_token_id = model.config.eos_token_id
model_type = "CausalLM"
else:
tokenizer = AutoTokenizer.from_pretrained(model_path_or_name)
model = AutoModel.from_pretrained(model_path_or_name).to(args.device)
tokenizer = AutoTokenizer.from_pretrained(model_path_or_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path_or_name, trust_remote_code=True).to(args.device)
model_type = "Other"

# load test dataset
test_data = []
for line in open(question_file):
test_data.append(json.loads(line))
test_dataset = EmbDataset(test_data, tokenizer, model_config.max_position_embeddings - 100, args.world_size, args.rank) # 1700 < 2048 - max new tokens
try:
max_seq_length = model_config.max_position_embeddings
except:
try:
max_seq_length = model_config.seq_length
except:
max_seq_length = 4096
test_dataset = EmbDataset(test_data, tokenizer, max_seq_length - 100, args.world_size, args.rank) # 1700 < 2048 - max new tokens
dataloader = DataLoader(test_dataset, batch_size=args.batch_size)

if args.rank == 0:
Expand Down Expand Up @@ -206,14 +116,6 @@ def run_embedding(local_gpu_rank, model_path_or_name, question_file, answer_file
data["answer"] = result
fd.write(json.dumps(data, ensure_ascii=False) + '\n')

def gen_model_chat_answer(model_path_or_name, question_file, answer_file, args, system_prompt):
if args.gpus < 0:
args.gpus = torch.cuda.device_count()
args.world_size = args.nodes * args.gpus
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = args.master_port
mp.spawn(run_chat, nprocs=args.gpus, args=(model_path_or_name, question_file, answer_file, args, system_prompt))

def gen_model_embedding_answer(model_path_or_name, question_file, answer_file, args):
if args.gpus < 0:
args.gpus = torch.cuda.device_count()
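The hand-rolled `USER:`/`ASSISTANT:` chat path deleted above is, per the commit title, superseded by vLLM plus the tokenizer's chat template in files not shown in this excerpt. A minimal sketch of that style of generation, assuming the `vllm` package and an instruction-tuned model (names and parameters are illustrative, not the commit's actual code):

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_path = "Qwen/Qwen2.5-7B-Instruct"  # illustrative model
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
llm = LLM(model=model_path, trust_remote_code=True)

# Build the prompt with the model's own chat template instead of
# hand-formatted "USER:"/"ASSISTANT:" strings.
messages = [
    {"role": "system", "content": "A chat between a curious user and an artificial intelligence assistant."},
    {"role": "user", "content": "Rank the candidate items for this user: ..."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([prompt], SamplingParams(temperature=0.0, max_tokens=512))
print(outputs[0].outputs[0].text)
```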
