Multimodal large language models have shown powerful abilities to understand and reason across text and images, but their ...