
MMSearch 🔥🔍: Benchmarking the Potential of Large Models as Multi-modal Search Engines


Official repository for "MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines".

🌟 For more details, please refer to the project page with dataset exploration and visualization tools.

[🌐 Webpage] [📖 Paper] [🤗 Huggingface Dataset] [🏆 Leaderboard] [🔍 Visualization]

💥 News

  • [2024.09.30] 🌏 We add the MMSearch-Engine (for any new query) command line demo here!
  • [2024.09.25] 🌟 MMSearch now supports evaluation in lmms-eval! Details are here.
  • [2024.09.25] 🌟 The evaluation code now supports directly using models implemented in VLMEvalKit!
  • [2024.09.22] 🔥 We release the evaluation code; you only need to add an inference API for your LMM!
  • [2024.09.20] 🚀 We release the arXiv paper and all MMSearch data samples in the Hugging Face dataset.

📌 ToDo

  • Coming soon: MMSearch-Engine demo

👀 About MMSearch

The capabilities of Large Multi-modal Models (LMMs) in multimodal search remain insufficiently explored and evaluated. To fill this gap, we first design a pipeline, MMSearch-Engine, that enables any LMM to function as a multimodal AI search engine.


To further evaluate the potential of LMMs in the multimodal search domain, we introduce MMSearch, an all-around benchmark designed for assessing multimodal search performance. The benchmark contains 300 manually collected instances spanning 14 subfields, with no overlap with the training data of current LMMs, ensuring that the correct answers can only be obtained through search.


An overview of MMSearch.

In addition, we propose a step-wise evaluation strategy to better understand LMMs' searching capability. The models are evaluated on three individual tasks (requery, rerank, and summarization) and one challenging end-to-end task covering the complete searching process. The final score is a weighted combination of the scores of the four tasks.
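As a rough sketch of how the four task scores combine (the actual weights are specified in the paper and in scripts/run_get_final_score.sh, not here):

$$S_{\text{final}} = w_{\text{req}}\,S_{\text{req}} + w_{\text{rerank}}\,S_{\text{rerank}} + w_{\text{sum}}\,S_{\text{sum}} + w_{\text{e2e}}\,S_{\text{e2e}}, \qquad \sum_i w_i = 1$$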


Outline of Evaluation Tasks, Inputs, and Outputs.

πŸ” An example of LMM input, output, and ground truth for four evaluation tasks


Evaluation

📈 Evaluation by yourself

Setup Environment

The environment is mainly for interacting with the search engine and crawling websites:

pip install -r requirements.txt
playwright install

Get your LMMs ready

(a). ✨ Evaluation with models implemented in VLMEvalKit

We now support directly using the models implemented in VLMEvalKit. You first need to install VLMEvalKit with the following commands, or follow the guidance in its repo:

git clone https://github.com/open-compass/VLMEvalKit.git
cd VLMEvalKit
pip install -e .

Then, you can directly use the models implemented in VLMEvalKit; the list of available model names is here.

To use a model, simply add the prefix vlmevalkit_ in front of the model name in the list. For example, to use llava_onevision_qwen2_7b_ov, your input model_type should be vlmevalkit_llava_onevision_qwen2_7b_ov. We provide an example of the requery task in scripts/run_requery_vlmevalkit.sh.

Note that several models in VLMEvalKit do not support text-only inference, so they may not support the end2end task (some queries in round 1 have no image input).

(b). 💪 Evaluation with custom LMMs

Here, we support evaluating any custom LMM with very little effort. To evaluate your LMM, you only need to provide an infer function, which takes image files and text instructions as input and outputs the model response.

We implement LLaVA-OneVision in models/llava_model.py. Adding a model takes only two steps:

  1. Implement a class for the model. The model class must implement the infer function, which takes image files and text instructions as input. Please refer to models/llava_model.py for the input variable types; a minimal sketch also follows this list.
  2. Add the model type in models/load.py. Then you can specify the model_type in your bash file and use your model!
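
For illustration, here is a minimal sketch of such a model class. The class name, constructor argument, and the exact infer signature are assumptions for this sketch; models/llava_model.py is the authoritative reference.

from typing import List

class MyCustomLMM:  # hypothetical name; register it in models/load.py
    def __init__(self, model_path: str):
        # Load your model and processor/tokenizer here.
        self.model_path = model_path

    def infer(self, image_files: List[str], instruction: str) -> str:
        # image_files: paths of the input images (may be empty for text-only rounds)
        # instruction: the text prompt constructed by the MMSearch pipeline
        # Return your LMM's text response as a plain string.
        raise NotImplementedError("replace with your LMM's inference call")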

Begin evaluation!

Note that there are four tasks for computing the final score of MMSearch: end2end, requery, rerank, and summarization.

The requery task is automatically evaluated when conducting the end2end task. Therefore, to evaluate all the tasks in MMSearch, you only need to run the end2end, rerank, and summarization tasks. The evaluation commands are as follows:

# end2end task
bash scripts/run_end2end.sh
# rerank task
bash scripts/run_rerank.sh
# summarization task
bash scripts/run_summarization.sh

After the three scripts complete, run the following command to get the final score:

bash scripts/run_get_final_score.sh

Here are some important notes:

  1. How to set the parameters?

    • We provide example input args in the bash files mentioned above.
    • The end2end task needs to interact with the Internet and the search engine. Please adjust the website-loading timeout in constants.py according to your network conditions.
  2. Evaluation time and multi-GPU inference

    Typically, the end2end task takes the longest time since it conducts three rounds sequentially and needs to interact with the Internet. We provide a very basic mechanism for inference with multiple GPUs; an example is in scripts/run_rerank_parallel.sh. However, we do not recommend running the end2end task with too many GPUs, since this will hit the rate limit of the search engine API and requests will be refused. Normally, the end2end task takes 3-5 hours on a single GPU.

Evaluation with lmms-eval

You also need to set up the environment specified above. Then you can simply run the evaluation with lmms-eval commands. Note that lmms-eval currently only supports evaluating MMSearch with LLaVA-OneVision. More models will be supported very soon!

Demo

We provide a command line demo of MMSearch-Engine for any new query.

Prepare query

We provide query examples in demo/query_cli.json. For queries with an image, you need to specify the path to the query_image and a URL of the query_image, since Google Lens here only supports URL input. An easy way to get a URL for an image is to upload it to any public GitHub repository and then substitute blob with raw in the image URL (a minimal sketch of this substitution follows the example below):

{
    "query": "When is the US release date for this movie?",
    "query_image": "demo/demo.png",
    "query_image_url": "https://github.com/CaraJ7/MMSearch/raw/main/demo/demo.png"
}
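
The blob-to-raw substitution described above is just a string replacement on the GitHub file URL; a minimal sketch in Python:

blob_url = "https://github.com/CaraJ7/MMSearch/blob/main/demo/demo.png"
raw_url = blob_url.replace("/blob/", "/raw/", 1)
print(raw_url)  # https://github.com/CaraJ7/MMSearch/raw/main/demo/demo.png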

For queries without an image, you only need to specify the query and set query_image to null:

{
    "query": "When is the US release date for Venom: The Last Dance?",
    "query_image": null
}

Get the search result

To successfully search for the image with Google Lens, make sure the search engine page that playwright opens is in English; otherwise, it will throw an error. To get the search results for your queries, simply run the following command. The parameters have the same meaning as those in the end2end evaluation task script.

bash demo/run_demo_cli.sh

πŸ† Leaderboard

Contributing to the Leaderboard

🚨 The leaderboard is continuously being updated, and we welcome contributions of your excellent LMMs!

Data Usage

We release the MMSearch data for benchmarking on the leaderboard, which contains 300 queries and the intermediate results for step-wise evaluation.

You can download the dataset from 🤗 Hugging Face with the following code (make sure that you have installed the related packages):

from datasets import load_dataset

dataset = load_dataset("CaraJ/MMSearch")
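
As an optional sanity check (the split and field names are defined by the dataset card, so the indexing below is an assumption), you can continue the snippet above with:

# List the available splits and peek at one sample's fields.
print(dataset)
first_split = next(iter(dataset.keys()))
print(dataset[first_split][0])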

✅ Citation

If you find MMSearch useful for your research and applications, please kindly cite using this BibTeX:

@article{jiang2024mmsearch,
  title={MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines},
  author={Jiang, Dongzhi and Zhang, Renrui and Guo, Ziyu and Wu, Yanmin and Lei, Jiayi and Qiu, Pengshuo and Lu, Pan and Chen, Zehui and Song, Guanglu and Gao, Peng and others},
  journal={arXiv preprint arXiv:2409.12959},
  year={2024}
}

🧠 Related Work

Explore our additional research on Vision-Language Large Models:
