VisualPuzzles is a multimodal benchmark specifically designed to evaluate the reasoning abilities of large models while deliberately minimizing reliance on domain-specific knowledge.
Key features:
- 1168 diverse puzzles
- 5 reasoning categories: Algorithmic, Analogical, Deductive, Inductive, Spatial
- Difficulty labels: Easy, Medium, Hard
- Less knowledge-intensive and more reasoning-complex than existing benchmarks (e.g., MMMU)

Key findings:
- All evaluated models perform worse than humans; most fail to surpass even 5th-percentile human performance.
- Strong performance on knowledge-heavy benchmarks does not transfer well to VisualPuzzles.
- Larger models and structured "thinking modes" do not guarantee better reasoning performance.
The dataset is available on HuggingFace 🤗.
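For quick inspection, the puzzles can be loaded with the 🤗 datasets library. The snippet below is a minimal sketch, assuming the dataset is hosted under the ID neulab/VisualPuzzles with a train split; check the Hub page for the exact ID and split names.

```python
# Minimal sketch: load VisualPuzzles from the HuggingFace Hub.
# Assumption: the dataset ID is "neulab/VisualPuzzles" with a "train" split.
from datasets import load_dataset

dataset = load_dataset("neulab/VisualPuzzles", split="train")
print(len(dataset))       # expected: 1168 puzzles
print(dataset[0].keys())  # inspect the available fields (image, question, options, ...)
```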
Outputs of all models we evaluated are available on Zeno.
We gratefully use the lmms-eval package to evaluate VisualPuzzles.
To reproduce experimental results on VisualPuzzles, run the following commands:
Installation:
```bash
git clone https://github.com/neulab/VisualPuzzles.git
cd VisualPuzzles/lmms-eval
pip install -e .
```
Experiments:
```bash
# Replace model_type with the lmms-eval model type (e.g., llava) and model_name
# with the pretrained checkpoint (e.g., "liuhaotian/llava-v1.5-7b").
# Use --tasks VisualPuzzles_cot to evaluate CoT performance, or
# --tasks VisualPuzzles_direct otherwise.
python3 -m accelerate.commands.launch \
    --num_processes=8 \
    -m lmms_eval \
    --model model_type \
    --model_args pretrained=model_name \
    --tasks VisualPuzzles_cot \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix VisualPuzzles \
    --output_path ./logs/
```
This experiment investigates
- the extent to which solving problems in VisualPuzzles relies on domain-specific knowledge, compared to the widely used MMMU benchmark; and
- whether models already possess the knowledge required to solve VisualPuzzles, as compared to MMMU.
We prompted GPT-4o to generate "knowledge concept checklists" for 50 randomly selected questions from each of MMMU and VisualPuzzles.
The knowledge concept checklists generated for MMMU and VisualPuzzles can be found in knowledge/mmmu_questions.json and knowledge/puzzle_questions.json, respectively.
Run the following command to reproduce this experiment.
```bash
python get_knowledge_checklists.py
```
Note that the generated checklists were manually validated, as discussed in the paper.
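For illustration, the checklist-generation step might look roughly like the sketch below; the prompt wording and output handling are assumptions, not the exact implementation in get_knowledge_checklists.py.

```python
# Illustrative sketch of generating a "knowledge concept checklist" with GPT-4o.
# Assumptions: the OpenAI Python client (openai>=1.0) is installed and OPENAI_API_KEY is set;
# the actual prompt and output format in get_knowledge_checklists.py may differ.
from openai import OpenAI

client = OpenAI()

def generate_knowledge_checklist(question_text: str) -> str:
    prompt = (
        "List the domain-specific knowledge concepts (facts, formulas, terminology) "
        "a solver must already know to answer the question below. Phrase each "
        "concept as a standalone question that can be answered without the image.\n\n"
        f"Question: {question_text}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```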
We measured models' knowledge accuracy (their ability to answer the knowledge checklist questions correctly) on both benchmarks, using LLM-as-a-judge with GPT-4o to decide whether each checklist question was answered correctly. Model outputs and judge outputs can be found in knowledge/knowledge_eval_output.
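The sketch below illustrates what such an LLM-as-a-judge check can look like; the judge prompt and verdict parsing are assumptions rather than the exact implementation.

```python
# Illustrative LLM-as-a-judge check with GPT-4o: grade a model's answer to a
# knowledge checklist question against a reference answer.
# Assumptions: OpenAI Python client; the real judge prompt and parsing may differ.
from openai import OpenAI

client = OpenAI()

def judge_knowledge_answer(question: str, reference: str, model_answer: str) -> bool:
    prompt = (
        "You are grading a model's answer to a knowledge question.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {model_answer}\n"
        "Reply with exactly CORRECT or INCORRECT."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict.startswith("CORRECT")
```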
After generating model responses to the knowledge checklist questions in knowledge/mmmu_questions.json and knowledge/puzzle_questions.json, run the following commands to reproduce the knowledge accuracy results.
```bash
cd knowledge
python get_knowledge_scores.py
```
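Conceptually, the script turns per-question judge verdicts into a knowledge-accuracy score for each benchmark. The snippet below is a simplified sketch; the assumed JSON layout (a list of records with a boolean correct field) may differ from the actual files in knowledge/knowledge_eval_output.

```python
# Simplified sketch: aggregate judge verdicts into a knowledge-accuracy score.
# Assumption: each judge output file is a JSON list of records with a boolean
# "correct" field; the real files in knowledge/knowledge_eval_output may differ.
import json

def knowledge_accuracy(judge_output_path: str) -> float:
    with open(judge_output_path) as f:
        records = json.load(f)
    return sum(r["correct"] for r in records) / len(records)
```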
```bibtex
@misc{song2025visualpuzzlesdecouplingmultimodalreasoning,
      title={VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain Knowledge},
      author={Yueqi Song and Tianyue Ou and Yibo Kong and Zecheng Li and Graham Neubig and Xiang Yue},
      year={2025},
      eprint={2504.10342},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2504.10342}
}
```
This project was supported in part by a grant from DSTA Singapore and the Carnegie Bosch Institute. The authors would like to thank CMU NeuLab colleagues for their constructive comments. The authors would also like to thank all volunteers who participated in the human evaluation.