ISSN No:2250-3676 ----- Crossref DOI Prefix: 10.64771
   Email: ijesatj@gmail.com

Scholarly Peer Reviewed and Fully Refereed Open Access Multidisciplinary Monthly Research Journal


    MULTIMODAL REASONING IN VISION-LANGUAGE MODELS: BENCHMARKS AND LIMITATIONS

    Sushil Khairnar

    Author

    ID: 1874

    DOI: https://doi.org/10.64771/ijesat.2025.v25.i12.pp230-241

    Abstract:

    Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and reasoning across visual and textual modalities. However, their performance on complex multimodal reasoning tasks remains inconsistent and poorly understood. This paper presents a comprehensive evaluation of state-of-the-art VLMs, including GPT-4V, Gemini Pro Vision, and Claude-3, across diverse reasoning benchmarks. We introduce a novel evaluation framework that systematically assesses spatial reasoning, temporal understanding, causal inference, and compositional reasoning capabilities. Our analysis reveals significant limitations in current models: GPT-4V achieves 67.3% accuracy on spatial reasoning tasks but drops to 42.1% on complex compositional scenarios, while Gemini Pro Vision shows superior performance in temporal reasoning (71.8%) but struggles with abstract visual concepts (38.4%). Through extensive error analysis, we identify key failure modes, including hallucination of visual details, inconsistent reasoning chains, and brittleness to prompt variations. We propose targeted improvements, including multi-step reasoning protocols and uncertainty quantification methods.

    Keywords: vision-language models, multimodal reasoning, benchmark evaluation, GPT-4V, Gemini Pro Vision, Claude-3, artificial intelligence, compositional reasoning

    Published:

    15-12-2025

    Issue:

    Vol. 25 No. 12 (2025)


    Page Nos:

    230-241


    Section:

    Articles

    License:

    This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

    How to Cite

    Sushil Khairnar, MULTIMODAL REASONING IN VISION-LANGUAGE MODELS: BENCHMARKS AND LIMITATIONS, 2025, International Journal of Engineering Sciences and Advanced Technology, 25(12), Page 230-241, ISSN No: 2250-3676.

    DOI: https://doi.org/10.64771/ijesat.2025.v25.i12.pp230-241