MULTIMODAL REASONING IN VISION-LANGUAGE MODELS: BENCHMARKS AND LIMITATIONS

ID: 1874

Abstract: Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and reasoning across visual and textual modalities. However, their performance on complex multimodal reasoning tasks remains inconsistent and poorly understood. This paper presents a comprehensive evaluation of state-of-the-art VLMs, including GPT-4V, Gemini Pro Vision, and Claude-3, across diverse reasoning benchmarks. We introduce a novel evaluation framework that systematically assesses spatial reasoning, temporal understanding, causal inference, and compositional reasoning capabilities. Our analysis reveals significant limitations in current models: GPT-4V achieves 67.3% accuracy on spatial reasoning tasks but drops to 42.1% on complex compositional scenarios, while Gemini Pro Vision shows superior performance in temporal reasoning (71.8%) but struggles with abstract visual concepts (38.4%). Through extensive error analysis, we identify key failure modes, including hallucination of visual details, inconsistent reasoning chains, and brittleness to prompt variations. We propose targeted improvements, including multi-step reasoning protocols and uncertainty quantification methods.

Keywords: vision-language models, multimodal reasoning, benchmark evaluation, GPT-4V, Gemini Pro Vision, Claude-3, artificial intelligence, compositional reasoning
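The paper itself does not publish code on this page. As an illustrative aid only, the sketch below shows how a per-category evaluation harness of the kind described in the abstract might tally accuracy for each reasoning category (spatial, temporal, causal, compositional). All identifiers and the toy data are hypothetical assumptions for illustration, not the authors' implementation.

    # Minimal sketch (hypothetical): tally per-category accuracy from scored items.
    from collections import defaultdict

    def category_accuracy(results):
        """results: iterable of (category, is_correct) pairs.

        Returns a dict mapping each reasoning category to its accuracy.
        """
        correct = defaultdict(int)
        total = defaultdict(int)
        for category, is_correct in results:
            total[category] += 1
            correct[category] += int(is_correct)
        return {c: correct[c] / total[c] for c in total}

    if __name__ == "__main__":
        # Toy scored outputs; in a real harness these would come from
        # model predictions compared against benchmark ground truth.
        sample = [
            ("spatial", True), ("spatial", False), ("spatial", True),
            ("temporal", True), ("temporal", True),
            ("compositional", False), ("causal", True),
        ]
        for cat, acc in sorted(category_accuracy(sample).items()):
            print(f"{cat}: {acc:.1%}")

Reporting accuracy per category rather than as a single aggregate is what makes the gaps the abstract cites visible (e.g., strong spatial performance alongside weak compositional performance).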
Published: 15-12-2025
Issue: Vol. 25 No. 12 (2025)
Page Nos: 230-241
Section: Articles
License: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
How to Cite: Sushil Khairnar, "Multimodal Reasoning in Vision-Language Models: Benchmarks and Limitations", 2025, International Journal of Engineering Sciences and Advanced Technology, 25(12), Page 230-241, ISSN No: 2250-3676.