Multimodal Large Language Models (MLLMs) suffer from a fundamental “modality gap”: they contradict themselves when given visual versus textual views of the same content. This paper argues that this inconsistency is not a failure but a powerful resource for self-rewarded multimodal learning. Instead of relying on flawed voting mechanisms that amplify systematic errors when the majority is wrong, we introduce cross-modal cycle-consistency rewards (C3R) to improve multimodal reasoning. C3R performs backward inference from an answer to a query, switches modalities, and then performs forward inference to verify that the cycle returns the original answer. This cycle serves as a dense, label-free reward that guides the model to resolve its own internal conflicts while avoiding the majority-is-wrong failures of standard voting methods. On standard benchmarks, C3R mitigates modality-specific biases and improves reasoning accuracy by up to 7.6 points. Our results show that robust reasoning emerges not just from scaling data, but from achieving a bidirectional understanding of the multimodal world.
Cross-modal disagreements are common: the same input yields different answers from the screenshot and the HTML view. Simple majority voting over these inconsistent predictions can reinforce the wrong answer instead of correcting it.
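As a toy illustration of this failure mode (with hypothetical answers, not the paper's data), majority voting simply picks whichever answer is repeated most often, so a systematically biased view can outvote the correct one:

```python
from collections import Counter

# Toy example (hypothetical predictions): five sampled answers to the same
# question, three from the HTML view and two from the screenshot view.
predictions = ["B", "B", "B",   # HTML view repeats a systematically biased answer
               "A", "A"]        # screenshot view reads off the correct answer

majority_answer, _ = Counter(predictions).most_common(1)[0]
print(majority_answer)  # -> "B": the systematic error wins the vote
```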
C3R turns cross-modal contradictions into rewards. From a candidate answer, the model performs backward reasoning to synthesize queries and then runs forward inference across text and image views, checking whether the cycle returns to the original answer.
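The sketch below shows one minimal way such a cycle-consistency reward could be computed. It is an illustration under stated assumptions, not the paper's implementation: the `backward_infer`, `forward_infer`, and `answers_match` callables, the view names, and the string-matching rule are all placeholders.

```python
def cycle_consistency_reward(views, answer, backward_infer, forward_infer,
                             answers_match=lambda a, b: a.strip().lower() == b.strip().lower()):
    """Score a candidate answer by how often cross-modal cycles reproduce it.

    views          : dict mapping a view name (e.g. "text", "image") to its input
    answer         : the candidate answer being scored
    backward_infer : callable (view_input, answer) -> reconstructed query
    forward_infer  : callable (view_input, query)  -> answer in that view
    Returns the fraction of cross-modal cycles that return the original answer.
    """
    consistent, total = 0, 0
    for src_view, src_input in views.items():
        # Backward step: infer the query this answer would have to be correct for.
        query = backward_infer(src_input, answer)
        for tgt_view, tgt_input in views.items():
            if tgt_view == src_view:
                continue  # only count cycles that switch modalities
            # Forward step: re-answer the reconstructed query in the other view.
            roundtrip = forward_infer(tgt_input, query)
            consistent += int(answers_match(roundtrip, answer))
            total += 1
    return consistent / total if total else 0.0
```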
Backward inference asks the model to justify its own answer: “for this answer to be correct, what query must have been asked?” C3R applies this backward step in both the text and image views; the visualizations below show it alongside our reconstructed VisualWebArena multiple-choice dataset.
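A minimal sketch of how such backward-inference prompts could be built for the two views follows. The prompt wording and the `build_backward_prompt` helper are hypothetical illustrations, not the paper's exact templates.

```python
# Illustrative backward-inference prompt templates (wording is hypothetical).
# Given a candidate answer and one view of the page, the model is asked to
# reconstruct the query that the answer would be correct for.

BACKWARD_PROMPT_TEXT = (
    "Below is the HTML of a web page and a candidate answer.\n"
    "HTML:\n{html}\n\nAnswer: {answer}\n\n"
    "For this answer to be correct, what multiple-choice question "
    "must have been asked? State the question and its options."
)

BACKWARD_PROMPT_IMAGE = (
    "You are shown a screenshot of a web page and a candidate answer.\n"
    "Answer: {answer}\n\n"
    "For this answer to be correct, what multiple-choice question "
    "must have been asked? State the question and its options."
)

def build_backward_prompt(view: str, answer: str, html: str = "") -> str:
    """Return the backward-inference prompt for the chosen view.

    For the image view, the screenshot itself would be passed to the model
    as a separate image input alongside this text prompt.
    """
    if view == "text":
        return BACKWARD_PROMPT_TEXT.format(html=html, answer=answer)
    return BACKWARD_PROMPT_IMAGE.format(answer=answer)
```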
We evaluate C3R on six multimodal reasoning benchmarks and observe consistent gains in both accuracy and cross-modal consistency. The carousel below summarizes quantitative improvements and representative qualitative case studies.
C3R supports many combinations of backward and forward modalities. Ablations reveal which paths contribute most to accuracy and self-consistency.
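A small sketch of how the backward/forward modality paths might be enumerated for such an ablation; the view names, the `modality_paths` helper, and the `cross_modal_only` switch are assumptions for illustration.

```python
from itertools import product

# Hypothetical modality labels: "text" = HTML view, "image" = screenshot view.
MODALITIES = ("text", "image")

def modality_paths(cross_modal_only: bool = True):
    """Yield (backward_view, forward_view) pairs to ablate individually."""
    for backward_view, forward_view in product(MODALITIES, repeat=2):
        if cross_modal_only and backward_view == forward_view:
            continue  # same-modality cycles can serve as a baseline, not a cross-modal path
        yield backward_view, forward_view

print(list(modality_paths()))       # cross-modal paths only
print(list(modality_paths(False)))  # include same-modality baselines
```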
Not all training examples are equally informative. Samples where image and text views strongly disagree turn out to be the most valuable for improving both accuracy and consistency.
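A minimal sketch of disagreement-based sample selection, assuming disagreement is measured as the fraction of sampled answer pairs on which the two views differ; the scoring rule, the `text_answers`/`image_answers` field names, and the threshold are illustrative, not the paper's exact criterion.

```python
def disagreement_score(text_answers, image_answers):
    """Fraction of sampled answer pairs where the two views disagree.

    text_answers / image_answers: lists of answers sampled from the same model
    on the HTML view and the screenshot view of one example.
    """
    pairs = list(zip(text_answers, image_answers))
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)

def select_training_samples(dataset, threshold=0.5):
    """Keep examples whose cross-modal disagreement exceeds the threshold."""
    return [ex for ex in dataset
            if disagreement_score(ex["text_answers"], ex["image_answers"]) >= threshold]
```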
@article{c3r2025cross,
title = {C3R: Cross-Modal Cycle Consistency Rewards Improve Multimodal Reasoning},
author = {Zirui Zhang and Haoyu Dong and Kexin Pei and Chengzhi Mao},
journal = {arXiv preprint},
year = {2025}
}
This work used Purdue Anvil GPU through allocation 250774 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by the U.S. National Science Foundation under grants 2138259, 2138286, 2138307, 2137603, and 2138296. We thank Guangxing Han for the insightful discussion.