This repo contains data and evaluation code for MC-Bench, a new benchmark for multi-context visual grounding.
Abstract: While multimodal large language models (MLLMs) have demonstrated extraordinary vision-language understanding capabilities, their abilities to solve instance-level visual-language problems beyond a single image warrant further exploration. To assess these unproven abilities of MLLMs, this paper proposes a new visual grounding task called multi-context visual grounding, which aims to localize instances of interest across multiple images based on open-ended text prompts. In order to facilitate this research, we construct a new dataset MC-Bench that features 2K high-quality and manually annotated samples. Each sample consists of an instance-level labeled image pair and a corresponding text prompt that indicates the target instances in the images. These text prompts are highly open-ended and grouped into three distinct styles, covering 20 practical skills. We benchmark over 20 state-of-the-art MLLMs and foundation models with potential multi-context visual grounding capabilities, as well as a simple yet effective stepwise baseline and a finetuned baseline by multi-context instruction tuning. Our evaluation reveals a non-trivial performance gap between existing MLLMs and humans, along with some interesting observations that suggest potential future directions. We hope MC-Bench and our empirical findings can encourage the research community to further explore and enhance the untapped potentials of MLLMs in instance-level tasks, particularly in multi-image contexts.
```bash
# install COCO API
pip install 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'

# install SciPy
pip install scipy

# download MC-Bench evaluation code
git clone https://github.com/XuYunqiu/MC-Bench.git
```
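If the installation succeeded, a quick import check like the one below should run without errors (this is just a sanity check, not part of the MC-Bench code):

```python
# quick check that the evaluation dependencies are importable
from pycocotools.coco import COCO
import scipy

print("scipy:", scipy.__version__)
print("pycocotools COCO API available:", COCO is not None)
```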
- Download our dataset from Google Drive | Baidu Pan
- Unzip the `MC-Bench_images.zip` file
- The MC-Bench dataset directory should have the following structure:
```
MC-Bench/
├── mc-bench_v0.2_val.json
└── mc-bench_images/
    ├── COCO/
    ├── D3/
    ├── winogavil_images/
    ├── BLINK_val_multiview/
    ├── Mantis-Eval_val/
    └── ...
```
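A quick check that the unzipped data matches this layout can look like the sketch below; the paths follow the tree above, and the annotation file name may differ for other dataset versions:

```python
import os

# expected layout, matching the directory tree above
root = "MC-Bench"
ann_file = os.path.join(root, "mc-bench_v0.2_val.json")
image_root = os.path.join(root, "mc-bench_images")

assert os.path.isfile(ann_file), f"missing annotation file: {ann_file}"
assert os.path.isdir(image_root), f"missing image directory: {image_root}"

# list the image source subfolders (COCO, D3, ...)
print(sorted(os.listdir(image_root)))
```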
The annotation format of MC-Bench is similar to COCO's. All annotations are stored in a single JSON file, and the MC-Bench API can be used to access and manipulate them. The JSON file has the following format:
```
{
    "info"         : info,
    "images"       : [image],
    "annotations"  : [annotation],
    "descriptions" : [description],
    "categories"   : categories,
}

description{
    "id"              : int,
    "images_id"       : [int],
    "text"            : str,
    "positive_sample" : bool,
    "text_style"      : str,
}

image{
    "id"           : int,
    "text_id"      : int,
    "inter_img_id" : int,
    "file_name"    : str,
    "height"       : int,
    "width"        : int,
}

annotation{
    "id"          : int,
    "image_id"    : int,
    "text_id"     : int,
    "category_id" : int,
    "area"        : int,
    "bbox"        : [x,y,w,h],
    "iscrowd"     : 0 or 1,
}
```
Note: box coordinates are floats measured from the top left image corner (and are 0-indexed).
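Because the format follows COCO conventions, the annotation file can be loaded with the standard library. The sketch below groups annotations by text prompt; the file path is assumed from the directory tree above, and the grouping logic is illustrative rather than part of the released MC-Bench API:

```python
import json
from collections import defaultdict

# load the MC-Bench annotation file (path assumed from the directory tree above)
with open("MC-Bench/mc-bench_v0.2_val.json") as f:
    data = json.load(f)

# index images by id and group annotations by their text prompt
images = {img["id"]: img for img in data["images"]}
anns_by_text = defaultdict(list)
for ann in data["annotations"]:
    anns_by_text[ann["text_id"]].append(ann)

# print a few text prompts with their target boxes
for desc in data["descriptions"][:3]:
    print(desc["text"], "|", desc["text_style"])
    for ann in anns_by_text[desc["id"]]:
        img = images[ann["image_id"]]
        print("  ", img["file_name"], ann["bbox"])  # bbox is [x, y, w, h]
```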
The results format is similar to the ground-truth annotation format:
```
[{
    "image_id"    : int,
    "category_id" : int,
    "bbox"        : [x,y,w,h],
    "score"       : float,
}]
```
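A minimal sketch for writing predictions in this format is shown below; the box values are toy placeholders and the output file name is arbitrary:

```python
import json

# toy predictions; real boxes and scores come from your grounding model
results = [
    {"image_id": 1, "category_id": 1, "bbox": [10.0, 20.0, 50.0, 80.0], "score": 0.92},
    {"image_id": 2, "category_id": 1, "bbox": [5.5, 7.0, 30.0, 40.0], "score": 0.71},
]

# save as a single JSON list, matching the results format above
with open("results_file.json", "w") as f:
    json.dump(results, f)
```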
To calculate the metrics used in MC-Bench, run:

```bash
python eval_mc_bench.py \
    --gt_json_path ../MC-Bench/MC-Bench_coco_format.json.yaml \
    --dt_json_path /path/to/results_file \
    --eval_type 'all'
```
Set `--eval_type` to `instance`, `image`, or `all` to evaluate instance-level metrics, image-level metrics, or both.
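Before running the evaluation, it can help to verify that every predicted `image_id` exists in the ground truth; the small helper below is a hedged sketch (not part of the released evaluation code) that catches mismatched files early:

```python
import json

def check_results(gt_json_path, dt_json_path):
    """Report predictions whose image_id does not appear in the ground truth."""
    with open(gt_json_path) as f:
        gt = json.load(f)
    with open(dt_json_path) as f:
        dets = json.load(f)

    gt_image_ids = {img["id"] for img in gt["images"]}
    unknown = [d for d in dets if d["image_id"] not in gt_image_ids]
    print(f"{len(dets)} predictions, {len(unknown)} with unknown image_id")
    return unknown

# example usage, with the same paths passed to eval_mc_bench.py
# check_results("/path/to/gt_json", "/path/to/results_file")
```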
- Refactor evaluation code based on COCO API
- Support evaluation of more than 2 images
- MC-Bench v0.5
If you find our data or code useful in your research, please use the following BibTeX entry.
```BibTeX
@article{xu2024mc,
  title={MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs},
  author={Xu, Yunqiu and Zhu, Linchao and Yang, Yi},
  journal={arXiv preprint arXiv:2410.12332},
  year={2024}
}
```
If you have any questions, please drop me an email: imyunqiuxu@gmail.com