Update RecEval with vLLM and chat_template (#70)
* .

* update and edit Chinese

* del some files and fix chinese

* del some files and fix chinese

* del some files and fix chinese

* del some files and fix chinese

* del some files and fix chinese

* update requirements.txt

---------

Co-authored-by: Cecch <[email protected]>
Co-authored-by: YuxuanLei <[email protected]>
3 people committed Sep 25, 2024
1 parent 757e4ca commit 2fe6d63
Showing 15 changed files with 667 additions and 235 deletions.
9 changes: 1 addition & 8 deletions RecLM-eval/.gitignore
@@ -3,14 +3,7 @@
**/__pycache__/
output/
data/
!data/chatbot/question.jsonl
# !data/steam/conversation.jsonl
# !data/steam/explanation.jsonl
# !data/steam/metadata.json
# !data/steam/negative_samples.txt
# !data/steam/ranking.jsonl
# !data/steam/retrieval.jsonl
# !data/steam/sequential_data.txt

eval_models/*
!eval_models
*.log
63 changes: 45 additions & 18 deletions RecLM-eval/README.md
@@ -2,24 +2,55 @@
This is a project to evaluate how various LLMs perform on recommendation tasks, including retrieval, ranking, explanation, conversation, and chatbot ability. The overall workflow is depicted below:
![Figure Caption](evaluation_framework.jpg)

# Quick start

1. **Download Steam Data**:
[Download link](https://drive.google.com/file/d/1745XoSvkSG2C_1WOFM6PV6DjezrlXa8z/view?usp=drive_link)
Unzip it to the `./data/` folder.

```bash
unzip path_to_downloaded_file.zip -d ./data/
```

2. **Navigate to Project**:
```bash
cd RecLM-eval
```

3. **Configure API**:
Edit `openai_api_config.yaml` and add your API key:

```yaml
API_BASE: "if-you-have-different-api-url"
API_KEY: "your-api-key"
```

4. **Run**:
```bash
bash main.sh
```

# Usage

## Environment
```bash
conda create -n receval python==3.8
conda create -n receval python==3.9
conda activate receval
pip install -r requirements
pip install -r requirements.txt
```

## Set OpenAI API Environment
If you want to use OpenAI API, you need to fill the content in `openai_api_config.yaml`.

* If you want to use the OpenAI API, you need to fill in your API key in the `openai_api_config.yaml` file.
* If you are using models not pre-defined in the project, add their cost information to the `api_cost.jsonl` file (see the example entry below).
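
For illustration, each line of `api_cost.jsonl` is a standalone JSON object that maps a model name to its input/output prices (the entry below is copied from the file added in this commit; the units are assumed to be USD per million tokens):

```json
{"gpt-4o-mini": {"input": 0.15, "output": 0.6}}
```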
## Prepare your test data
For data preparation details, please refer to [[preprocess]](preprocess/data-preparation.md).
For you convenience, there is a toy example dataset derived from the Steam dataset (A simple combination of https://cseweb.ucsd.edu/~jmcauley/datasets.html#steam_data, https://github.com/kang205/SASRec/blob/master/data/Steam.txt and https://www.kaggle.com/datasets/trolukovich/steam-games-complete-dataset). Please download it from (https://drive.google.com/file/d/1oliigNX_ACRZupf1maFEkJh_uzl2ZUKm/view?usp=sharing) and unzip it to the ./data/ folder.
* For data preparation details, please refer to [[preprocess]](preprocess/data-preparation.md).
* For your convenience, there is a toy example dataset derived from the Steam dataset (a simple combination of https://cseweb.ucsd.edu/~jmcauley/datasets.html#steam_data, https://github.com/kang205/SASRec/blob/master/data/Steam.txt and https://www.kaggle.com/datasets/trolukovich/steam-games-complete-dataset).
* Please download it from https://drive.google.com/file/d/1745XoSvkSG2C_1WOFM6PV6DjezrlXa8z/view?usp=drive_link and unzip it to the `./data/` folder.

## Evaluate
You can specify the evaluation tasks through the `task-names` parameter. These values are avaliable: `ranking`, `retrieval`, `explanation`, `conversation`, `embedding_ranking`, `embedding_retrieval`, `chatbot`.
* You can specify the evaluation tasks through the `task-names` parameter.
* The available values are: `ranking`, `retrieval`, `explanation`, `conversation`, `embedding_ranking`, `embedding_retrieval`, `chatbot`.


### Ranking/Retrieval
Parameters:
@@ -30,13 +61,9 @@ example:
```bash
python eval.py --task-names ranking retrieval \
--bench-name steam \
--model_path_or_name facebook/opt-1.3b
--model_path_or_name Qwen/Qwen2.5-7B-Instruct
```
optional parameters (only for huggingface model):
- `--nodes NODES`: The number of nodes for distributed inference
- `--gpus GPUS`: The number gpus per node.
- `--nr NR`: Then ranking within the nodes.
- `--master_port MASTER_PORT`: The port of the master node.
optional parameters (only for vLLM models; see the example below):
- `--max_new_tokens MAX_NEW_TOKENS`: The maximum number of tokens to generate; the prompt length plus `max_new_tokens` should be less than your model's maximum length.
- `--batch_size BATCH_SIZE`: The batch size during inference.
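
For example, a run combining these flags with the ranking command above might look like the sketch below (the flag values are illustrative, not defaults):

```bash
python eval.py --task-names ranking retrieval \
    --bench-name steam \
    --model_path_or_name Qwen/Qwen2.5-7B-Instruct \
    --max_new_tokens 512 \
    --batch_size 16
```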
@@ -51,13 +78,13 @@ Parameters:
Example:
```bash
python eval.py --task-names embedding_ranking embedding_retrieval \
--model_path_or_name text-embedding-ada-002 \
--model_path_or_name text-embedding-3-small \
--bench-name steam \
--user_emb_type title \
--item_emb_type title
python eval.py --task-names embedding_ranking embedding_retrieval \
--model_path_or_name text-embedding-ada-002 \
--model_path_or_name text-embedding-3-small \
--bench-name steam \
--user_emb_type summary \
--summary-model gpt-3.5-turbo \
@@ -73,7 +100,7 @@ Parameters:
example:
```bash
python eval.py --task-names chatbot \
--model_path_or_name facebook/opt-1.3b \
--model_path_or_name Qwen/Qwen2.5-7B-Instruct \
--judge-model gpt-3.5-turbo \
--baseline-model gpt-3.5-turbo
```
@@ -87,7 +114,7 @@ Parameters:
```bash
python eval.py --task-names explanation \
--bench-name steam \
--model_path_or_name facebook/opt-1.3b \
--model_path_or_name Qwen/Qwen2.5-7B-Instruct \
--judge-model gpt-3.5-turbo \
--baseline-model gpt-3.5-turbo
```
@@ -102,7 +129,7 @@ example:
```bash
python eval.py --task-names conversation \
--bench-name steam \
--model_path_or_name facebook/opt-1.3b \
--model_path_or_name Qwen/Qwen2.5-7B-Instruct \
--simulator-model gpt-3.5-turbo \
--max_turn 5
```
14 changes: 14 additions & 0 deletions RecLM-eval/api_cost.jsonl
@@ -0,0 +1,14 @@
{"chatgpt-4o-latest": {"input": 5.0, "output": 15.0}}
{"gpt-4-turbo": {"input": 10.0, "output": 30.0}}
{"gpt-4": {"input": 30.0, "output": 60.0}}
{"gpt-4-vision-preview": {"input": 10.0, "output": 30.0}}
{"gpt-35-turbo": {"input": 0.5, "output": 1.5}}
{"davinci-002": {"input": 2.0, "output": 2.0}}
{"babbage-002": {"input": 0.4, "output": 0.4}}
{"gpt-4o": {"input": 5.0, "output": 15.0}}
{"gpt-4o-mini": {"input": 0.15, "output": 0.6}}
{"o1-preview": {"input": 15.0, "output": 60.0}}
{"text-embedding-ada-002": {"input": 0.1, "output": 0}}
{"text-embedding-3-small": {"input": 0.02, "output": 0}}
{"text-embedding-3-large": {"input": 0.13, "output": 0}}
{"ada v2": {"input": 0.1, "output": 0}}
128 changes: 15 additions & 113 deletions RecLM-eval/call_models/huggingface_models.py
@@ -14,44 +14,6 @@
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, AutoModel

DEFAULT_SYSTEM_PROMPT = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions."
class ChatDataset(Dataset):
def __init__(self, test_dataset, tokenizer, max_seq_len, world_size, rank, system_prompt) -> None:
super().__init__()
self.test_dataset = test_dataset
self.tokenizer = tokenizer
self.max_seq_len = max_seq_len
self.world_size = world_size
self.rank = rank
if system_prompt:
self.system_prompt = system_prompt
else:
self.system_prompt = DEFAULT_SYSTEM_PROMPT

def __len__(self):
length = (len(self.test_dataset)+self.world_size-1) // self.world_size
return length

def __getitem__(self, idx):
if idx * self.world_size + self.rank < len(self.test_dataset):
data = self.test_dataset[idx * self.world_size + self.rank]
else:
data = self.test_dataset[-1]

if isinstance(data["prompt"], str):
inputs = f"USER: {data['prompt'].strip()} "
else:
inputs = ""
for text in data["prompt"]:
if text["role"] == "assistant":
inputs += "ASSISTANT: " + text["content"] + ' '
else:
inputs += "USER: " + text["content"] + ' '
tokens = self.tokenizer.tokenize(f"{self.system_prompt} {inputs}")

tokens = tokens[:self.max_seq_len-len(self.tokenizer.tokenize("ASSISTANT:"))]
truncated_prompt = self.tokenizer.convert_tokens_to_string(tokens) + "ASSISTANT:"

return truncated_prompt.strip()

class EmbDataset(Dataset):
def __init__(self, test_dataset, tokenizer, max_seq_len, world_size, rank) -> None:
@@ -75,100 +37,48 @@ def __getitem__(self, idx):
tokens = self.tokenizer.tokenize(data['prompt'].strip())[:self.max_seq_len]
truncated_prompt = self.tokenizer.convert_tokens_to_string(tokens)

return truncated_prompt.strip()
return truncated_prompt

def run_chat(local_gpu_rank, model_path_or_name, question_file, answer_file, args, system_prompt):
args.rank = args.nr * args.gpus + local_gpu_rank
args.device = torch.device("cuda", local_gpu_rank)
torch.cuda.set_device(args.device)
dist.init_process_group(backend='nccl', init_method="env://", world_size=args.world_size, rank=args.rank)

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
model_path_or_name, fast_tokenizer=True, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
# load model
model_config = AutoConfig.from_pretrained(args.model_path_or_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
args.model_path_or_name,
from_tf=bool(".ckpt" in args.model_path_or_name),
config=model_config,
low_cpu_mem_usage=True,
torch_dtype=torch.float16,
trust_remote_code=True,
# use_safetensors=False
)

model.config.end_token_id = tokenizer.eos_token_id
model.config.pad_token_id = model.config.eos_token_id

generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device=args.device)
greedy_config = GenerationConfig(return_full_text=False, max_new_tokens=args.max_new_tokens)

# load test dataset
test_data = []
for line in open(question_file):
test_data.append(json.loads(line))
test_dataset = ChatDataset(test_data, tokenizer, model_config.max_position_embeddings - args.max_new_tokens - 100, args.world_size, args.rank, system_prompt) # 1700 < 2048 - max new tokens

if args.rank == 0:
pbar = tqdm(total=len(test_dataset), desc="inferencing")
result_lists = []

with torch.no_grad():
for raw_result in generator(test_dataset, generation_config=greedy_config, batch_size=args.batch_size):
result = raw_result[0]['generated_text'].split("ASSISTANT:")[-1].strip()
if "<|endoftext|>" in result:
result = result.split('<|endoftext|>')[0].strip()
if "</s>" in result:
result = result.split('</s>')[0].strip()
result = (args.rank, result)
gather_data = [None for _ in range(args.world_size)]
dist.all_gather_object(gather_data, result)
if args.rank == 0:
gather_data = sorted(gather_data, key=lambda x:x[0])
for result in gather_data:
result_lists.append(result[1])
pbar.update(1)

if args.rank == 0:
os.makedirs(os.path.dirname(answer_file), exist_ok=True)
fd = open(answer_file, "w", encoding='utf-8')
for data, result in zip(test_data, result_lists):
data["answer"] = result
fd.write(json.dumps(data, ensure_ascii=False) + '\n')

def run_embedding(local_gpu_rank, model_path_or_name, question_file, answer_file, args):
args.rank = args.nr * args.gpus + local_gpu_rank
args.device = torch.device("cuda", local_gpu_rank)
torch.cuda.set_device(args.device)
dist.init_process_group(backend='nccl', init_method="env://", world_size=args.world_size, rank=args.rank)

model_config = AutoConfig.from_pretrained(args.model_path_or_name)
model_config = AutoConfig.from_pretrained(args.model_path_or_name, trust_remote_code=True)
if "CausalLM" in model_config.architectures[0]:
# load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path_or_name, use_fast=False, padding_side='left')
tokenizer = AutoTokenizer.from_pretrained(model_path_or_name, use_fast=False, padding_side='left', trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
# load model
model = AutoModelForCausalLM.from_pretrained(
args.model_path_or_name,
from_tf=bool(".ckpt" in args.model_path_or_name),
config=model_config,
low_cpu_mem_usage=True
low_cpu_mem_usage=True,
trust_remote_code=True
).to(args.device)
model.config.end_token_id = tokenizer.eos_token_id
model.config.pad_token_id = model.config.eos_token_id
model_type = "CausalLM"
else:
tokenizer = AutoTokenizer.from_pretrained(model_path_or_name)
model = AutoModel.from_pretrained(model_path_or_name).to(args.device)
tokenizer = AutoTokenizer.from_pretrained(model_path_or_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path_or_name, trust_remote_code=True).to(args.device)
model_type = "Other"

# load test dataset
test_data = []
for line in open(question_file):
test_data.append(json.loads(line))
test_dataset = EmbDataset(test_data, tokenizer, model_config.max_position_embeddings - 100, args.world_size, args.rank) # 1700 < 2048 - max new tokens
try:
max_seq_length = model_config.max_position_embeddings
except:
try:
max_seq_length = model_config.seq_length
except:
max_seq_length = 4096
test_dataset = EmbDataset(test_data, tokenizer, max_seq_length - 100, args.world_size, args.rank) # 1700 < 2048 - max new tokens
dataloader = DataLoader(test_dataset, batch_size=args.batch_size)

if args.rank == 0:
Expand Down Expand Up @@ -206,14 +116,6 @@ def run_embedding(local_gpu_rank, model_path_or_name, question_file, answer_file
data["answer"] = result
fd.write(json.dumps(data, ensure_ascii=False) + '\n')

def gen_model_chat_answer(model_path_or_name, question_file, answer_file, args, system_prompt):
if args.gpus < 0:
args.gpus = torch.cuda.device_count()
args.world_size = args.nodes * args.gpus
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = args.master_port
mp.spawn(run_chat, nprocs=args.gpus, args=(model_path_or_name, question_file, answer_file, args, system_prompt))

def gen_model_embedding_answer(model_path_or_name, question_file, answer_file, args):
if args.gpus < 0:
args.gpus = torch.cuda.device_count()
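The hand-rolled `USER:`/`ASSISTANT:` chat path deleted above is, per the commit title, superseded by vLLM plus the tokenizer's chat template in files not shown in this excerpt. A minimal sketch of that style of generation, assuming the `vllm` package and an instruction-tuned model (names and parameters are illustrative, not the commit's actual code):

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_path = "Qwen/Qwen2.5-7B-Instruct"  # illustrative model
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
llm = LLM(model=model_path, trust_remote_code=True)

# Build the prompt with the model's own chat template instead of
# hand-formatted "USER:"/"ASSISTANT:" strings.
messages = [
    {"role": "system", "content": "A chat between a curious user and an artificial intelligence assistant."},
    {"role": "user", "content": "Rank the candidate items for this user: ..."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([prompt], SamplingParams(temperature=0.0, max_tokens=512))
print(outputs[0].outputs[0].text)
```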
