Migician

Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models

1Beijing Jiaotong University, Beijing, China
2Huazhong University of Science and Technology, Wuhan, China
3Tsinghua University, Beijing, China

Introduction

The recent advancement of Multimodal Large Language Models (MLLMs) has significantly improved their fine-grained perception of single images and general comprehension across multiple images. However, existing MLLMs still face challenges in achieving precise grounding in complex multi-image scenarios. To address this, we first explore a Chain-of-Thought (CoT) framework that integrates single-image grounding with multi-image comprehension. While partially effective, it remains unstable and struggles to capture abstract visual information due to its non-end-to-end nature. Therefore, we introduce Migician, the first multi-image grounding model capable of performing free-form and accurate grounding across multiple images. To support this, we present the MGrounding-630k dataset, which comprises data for several multi-image grounding tasks derived from existing datasets, along with newly generated free-form grounding instruction-following data. Furthermore, we propose MIG-Bench, a comprehensive benchmark specifically designed for evaluating multi-image grounding capabilities. Experimental results demonstrate that our model achieves significantly superior multi-image grounding capabilities, outperforming the best existing MLLMs by 21.61% and even surpassing much larger 70B models.
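To give a concrete sense of the task interface, the sketch below shows one plausible way to query a Migician-style model with two images and a free-form textual reference, assuming the released checkpoint follows the Qwen2-VL interface it is built on. The checkpoint path is a placeholder, and the exact textual format of the predicted bounding box depends on the released model.

# Minimal multi-image grounding query, assuming a Qwen2-VL-style checkpoint.
# "path/to/migician-checkpoint" is a placeholder, not an official model ID.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # helper shipped with the Qwen2-VL examples

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "path/to/migician-checkpoint", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("path/to/migician-checkpoint")

# Two images plus a free-form query that references content across both of them.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "examples/scene_1.jpg"},
        {"type": "image", "image": "examples/scene_2.jpg"},
        {"type": "text", "text": "Locate in the second image the object that "
                                 "the person in the first image is holding."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, _ = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, padding=True,
                   return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)  # expected to contain a bounding box for the referred object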

MIG-Bench Tasks Overview

As shown above, based on whether the task involves an explicit reference to the target, multi-image grounding tasks can be further categorized into two types: Spontaneous Grounding and Referential Grounding. Spontaneous Grounding refers to recognizing and grounding the target object in the corresponding images without it being explicitly pointed out. Unlike the conventional Referring Expression Comprehension task, which explicitly refers to the target object, Spontaneous Grounding typically relies on the relationships between multiple images as contextual cues to autonomously identify and localize the objects to be grounded (e.g., finding and locating differences between images). Referential Grounding, on the other hand, requires an explicit reference to the target object. As mentioned earlier, such references can take the form of arbitrary combinations of images and textual descriptions.
Notably, our proposed multi-image grounding paradigm could serve as a general formulation unifying diverse tasks traditionally dominated by specialized expert models, such as object tracking, vehicle/person re-identification, and partial graph matching, facilitating more general and versatile machine intelligence. Moreover, its innate multi-image nature circumvents the need for additional visual modules dedicated to processing visual input queries (as in VisionLLM v2 and Griffon v2), yielding a more general architecture.
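To illustrate the two task types, the made-up instances below contrast a spontaneous query, where the target is implied only by the relationship between the images, with a referential query, where the target is named through a combination of an image and text. The field names and file paths are hypothetical and do not reflect the MGrounding-630k schema.

# Illustrative (made-up) instances of the two multi-image grounding task types.
# Field names and paths are hypothetical, not the MGrounding-630k schema.
spontaneous_example = {
    "task": "static_difference",
    "images": ["room_before.jpg", "room_after.jpg"],
    # No explicit reference: the model must infer the target (the changed object)
    # from the relationship between the two images and box it in the second one.
    "instruction": "Find the object that differs between the two images and "
                   "give its bounding box in the second image.",
}

referential_example = {
    "task": "referring_grounding",
    "images": ["query_crop.jpg", "street_scene.jpg"],
    # Explicit reference: an image combined with a textual constraint.
    "instruction": "Image 1 shows a backpack. Locate the person wearing this "
                   "backpack in Image 2.",
}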

Data Statistics

Chain-of-Thought Framework

Although some existing MLLMs demonstrate strong multi-image understanding and single-image grounding capabilities, directly prompting them to perform multi-image grounding (MIG) tasks often leads to significant performance degradation, as illustrated above in (a). Our proposed Chain-of-Thought (CoT) framework enables the model to effectively leverage and combine these existing abilities when executing MIG.
However, the CoT framework suffers from several inherent limitations. On one hand, the multi-step process introduces error propagation and reduces inference efficiency. On the other hand, many scenarios require grounding through abstract visual semantics across multi-image contexts (as shown above in (c)), making an intermediate textual referring expression impractical. More failure patterns of the CoT framework are illustrated below. This highlights the need for an end-to-end model capable of directly performing the MIG task.
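For concreteness, the sketch below captures the shape of this two-step CoT baseline under the assumption of a generic chat-style MLLM wrapper (mllm_generate is a hypothetical helper, not an API from the paper or any specific library): multi-image comprehension first produces a textual referring expression, which is then passed to a single-image grounding call, so any error or inexpressible abstract cue in the first step carries over to the second.

# Hedged sketch of the two-step CoT baseline for multi-image grounding (MIG).
# mllm_generate(images, prompt) -> str is a hypothetical wrapper around any
# MLLM with multi-image understanding and single-image grounding abilities.

def cot_multi_image_grounding(images, query, target_index, mllm_generate):
    # Step 1: multi-image comprehension -> textual referring expression.
    # Errors made here propagate directly into the grounding step.
    referring_expression = mllm_generate(
        images,
        f"{query}\nDescribe the target object in image {target_index + 1} "
        "with a short, unambiguous referring expression. Answer with the "
        "expression only.",
    )

    # Step 2: single-image grounding conditioned on the intermediate text.
    # If the target is defined by abstract cross-image cues (e.g. 'the object
    # that appears in both images'), no faithful textual expression may exist.
    box_text = mllm_generate(
        [images[target_index]],
        f"Locate '{referring_expression}' in this image and output its "
        "bounding box coordinates.",
    )
    return referring_expression, box_text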

Leaderboard

| Rank | Model | AVE | Static Diff | Robust Diff | Common Object | Object Tracking | Multi-view | Region Locating | Referring Grounding | Group Grounding | Reasoning | CoRe |
|------|-------|-----|-------------|-------------|---------------|-----------------|------------|-----------------|---------------------|-----------------|-----------|------|
| 1 | Migician | 63.82 | 65.15 | 46.81 | 84.19 | 70.73 | 60.07 | 74.31 | 76.77 | 66.53 | 59.41 | 34.19 |
| 2 | Qwen2-VL-72B | 38.88 | 46.12 | 46.18 | 64.46 | 26.73 | 22.57 | 18.62 | 33.33 | 62.53 | 50.50 | 17.09 |
| 3 | Qwen2-VL-7B | 28.62 | 27.84 | 38.30 | 19.36 | 20.73 | 11.81 | 25.95 | 23.23 | 58.52 | 48.51 | 11.97 |
| 4 | InternVL2-76B | 26.72 | 15.91 | 10.64 | 36.40 | 30.73 | 20.83 | 5.74 | 46.46 | 41.28 | 32.67 | 26.50 |
| 5 | InternVL2-8B | 15.96 | 6.92 | 7.45 | 25.49 | 20.73 | 9.72 | 3.49 | 28.28 | 30.26 | 17.82 | 9.40 |
| 6 | LLaVA-OV-72B | 13.65 | 13.26 | 5.34 | 26.84 | 12.91 | 7.64 | 2.14 | 17.83 | 21.60 | 11.88 | 8.55 |
| 7 | mPLUG-Owl3 | 12.35 | 18.56 | 6.34 | 34.93 | 8.55 | 7.64 | 2.41 | 7.07 | 22.85 | 9.09 | 5.98 |
| 8 | MiniCPM-V_2.6 | 7.55 | 14.58 | 2.13 | 14.34 | 9.82 | 6.25 | 1.75 | 11.11 | 10.02 | 2.97 | 2.56 |
| 9 | LLaVA-OV-7B | 4.73 | 6.06 | 3.19 | 3.43 | 0.18 | 1.04 | 1.08 | 9.09 | 15.43 | 6.93 | 0.85 |
| 10 | Mantis | 3.20 | 1.52 | 0.00 | 3.31 | 12.18 | 2.08 | 1.00 | 1.01 | 10.02 | 0.00 | 0.85 |
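Benchmark scores of this kind are commonly computed as grounding accuracy: the fraction of predictions whose box overlaps the ground truth with IoU above a threshold. The helper below sketches that metric; the 0.5 threshold is the conventional choice and is assumed here rather than quoted from the MIG-Bench specification.

# Sketch of the accuracy@IoU>0.5 metric commonly used for grounding benchmarks.
# Boxes are (x1, y1, x2, y2); the 0.5 threshold is assumed, not quoted.

def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1)
             - inter)
    return inter / union if union > 0 else 0.0

def grounding_accuracy(predictions, ground_truths, threshold=0.5):
    """Percentage of predicted boxes matching the ground truth above the IoU threshold."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(predictions, ground_truths))
    return 100.0 * hits / len(ground_truths)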

Free-Form Grounding Cases

Citation


@misc{li2025migicianrevealingmagicfreeform,
  title={Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models},
  author={You Li and Heyu Huang and Chi Chen and Kaiyu Huang and Chao Huang and Zonghao Guo and Zhiyuan Liu and Jinan Xu and Yuhua Li and Ruixuan Li and Maosong Sun},
  year={2025},
  url={https://arxiv.org/abs/2501.05767},
}