NTSEBench: Cognitive Reasoning Benchmark for Vision Language Models

* Equal Contribution
1 Indian Institute of Technology-Guwahati,
2 University of Utah, 3 University of Pennsylvania, 4 Arizona State University
NAACL 2025 Findings

Examples from the NTSEBench dataset
Three samples of textual, direction, and spatial reasoning questions from the proposed dataset. Solutions to these questions are not included here but are provided in the NTSEBench dataset.

Summary and Abstract

Cognitive textual and visual reasoning tasks, including puzzles, series, and analogies, demand the ability to quickly reason, decipher, and evaluate patterns both textually and spatially. Due to extensive training on vast amounts of human-curated data, large language models (LLMs) and vision language models (VLMs) excel in common-sense reasoning tasks, but still struggle with more complex reasoning that demands deeper cognitive understanding. We introduce NTSEBench, a new dataset designed to evaluate the cognitive multimodal reasoning and problem-solving skills of large models. The dataset contains 2,728 multiple-choice questions, accompanied by a total of 4,642 images, categorized into 26 different types. These questions are drawn from the nationwide NTSE examination in India and feature a mix of visual and textual general aptitude challenges, designed to assess intelligence and critical thinking skills beyond mere rote learning. We establish baselines on the dataset using state-of-the-art LLMs and VLMs. To facilitate a comparison between open-source and proprietary models, we propose four distinct modeling strategies to handle the different modalities (text and images) in the dataset instances.

NTSEBench Dataset

Overview

The National Talent Search Examination (NTSE), administered by the National Council of Educational Research and Training (NCERT) in India since 1963, is a nationwide exam for secondary-grade students. The exam consists of two sections designed to assess a wide range of analytical skills: the Mental Ability Test (MAT) and the Scholastic Aptitude Test (SAT). The MAT section evaluates students’ general aptitude, critical thinking, logical and spatial reasoning, and analytical problem-solving skills (for both textual and visual problems). In contrast, the SAT assesses their domain-specific knowledge in science and mathematics. All questions in the NTSE are multiple-choice (MCQ) with one correct option. Questions and options can be text, images, or a combination of both, i.e., multi-modal. We aim to create a dataset focused on cognitive reasoning abilities (MAT-type questions).

Cognitive Reasoning. Cognitive understanding in the context of NTSEBench refers to the ability to process information, recognize patterns, draw inferences, and solve problems using critical, logical, and analytical reasoning. This aligns with fundamental concepts in cognitive science, such as problem-solving, pattern recognition, and inferential reasoning.

Category                  | # Samples | Category                 | # Samples
Series                    | 256       | Non-Verbal Series        | 95
Alphabet Test             | 94        | Missing Character        | 127
Odd one out               | 170       | Embedded Figure          | 96
Analogy                   | 151       | Non-Verbal odd one out   | 70
Coding-Decoding           | 149       | Non-Verbal Analogy       | 100
Number and Ranking        | 139       | Paper Folding & Cutting  | 96
Blood Relation            | 126       | Incomplete Figure        | 94
Mathematical Operations   | 99        | Figure Partition         | 71
Puzzle Test               | 95        | Cube and Dice            | 23
Syllogisms                | 44        | Dot problem              | 23
Statement & Conclusions   | 143       | Direction Sense          | 36
Data Sufficiency          | 90        | Time and Clock           | 51
Mirror, Water and Images  | 50        | Venn diagrams            | 111

Strategies Proposed for Analysis

Evaluating the reasoning abilities of large language models (LLMs) on text-only questions is straightforward. For vision-language models (VLMs), however, evaluating vision-text questions is less straightforward because models differ in the inputs they accept: some API-based models, like GPT-4o and Gemini, support multi-image inputs, while many others do not (open-source models like LLaVA-OneVision and Ovis are emerging with this capability). To address these task-specific and input-related dependencies, we propose four strategies to fairly evaluate the reasoning abilities of both open-source and API-based models.

Standard QA

For instances where the question type (J) of the question (Q), options (O), and solution (S) is text (T), we use a standard text-based QA model such as GPT-3.5-Turbo or Llama-3-70B.
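
Below is a minimal sketch of how such a text-only instance can be turned into a single prompt string for an LLM. The prompt wording, the option labels, and the example question are illustrative assumptions, not the exact prompts used in the paper.

    def build_text_mcq_prompt(question: str, options: list[str]) -> str:
        """Format a text-only MCQ instance as one prompt string for an LLM."""
        lines = [
            "Answer the following multiple-choice question.",
            "Reply with only the letter of the correct option.",
            "",
            f"Question: {question}",
        ]
        for label, option in zip("ABCD", options):
            lines.append(f"{label}. {option}")
        return "\n".join(lines)

    # Example (hypothetical Series-type question):
    prompt = build_text_mcq_prompt(
        "Find the next term in the series: 2, 6, 12, 20, 30, ?",
        ["36", "40", "42", "48"],
    )
    # `prompt` can then be sent to GPT-3.5-Turbo, Llama-3-70B, or any other text-only LLM.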

Image-Only

We propose a modeling approach where the question and all the options are presented to the model as a single image. This image consolidates all relevant textual and visual content exactly as it appears in the examination paper, effectively capturing the entire question, including both its textual and visual elements. This strategy relies on the OCR capabilities of VLMs to interpret and analyze the content, enabling them to process both text and visual elements within the same input.
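
As a rough illustration, the sketch below builds a single-image prompt in the widely used OpenAI-style chat format (one text instruction plus one base64-encoded image). The field names and instruction text are assumptions for illustration; other VLM APIs may expect a different layout.

    import base64

    def encode_image(path: str) -> str:
        """Read an image file and return it as a base64 string."""
        with open(path, "rb") as f:
            return base64.b64encode(f.read()).decode("utf-8")

    def build_image_only_message(image_path: str) -> list[dict]:
        """One user message: a short instruction plus the full question as one image."""
        instruction = (
            "The image contains a multiple-choice question and its options. "
            "Read the question from the image and reply with only the letter "
            "of the correct option."
        )
        return [{
            "role": "user",
            "content": [
                {"type": "text", "text": instruction},
                {"type": "image_url",
                 "image_url": {"url": "data:image/png;base64," + encode_image(image_path)}},
            ],
        }]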

Interleaved Model

In this approach, we integrate text with multiple images to create an interwoven context. This method involves placing related textual and visual elements in proximity, enhancing the model’s ability to draw connections between them.
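
A hedged sketch of one way to build such an interleaved prompt is shown below, using the same OpenAI-style content format as in the Image-Only sketch. The ordering of segments and the (label, text, image) representation of options are assumptions chosen for illustration.

    import base64

    def _image_segment(path: str) -> dict:
        """Wrap one image file as an OpenAI-style image content segment."""
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode("utf-8")
        return {"type": "image_url", "image_url": {"url": "data:image/png;base64," + b64}}

    def build_interleaved_content(question_text, question_images, options):
        """options: list of (label, option_text_or_None, option_image_path_or_None)."""
        content = [{"type": "text", "text": f"Question: {question_text}"}]
        for path in question_images:            # figures referenced by the question text
            content.append(_image_segment(path))
        for label, text, image_path in options:  # each option's text sits next to its image
            content.append({"type": "text", "text": f"Option {label}: {text or ''}".strip()})
            if image_path is not None:
                content.append(_image_segment(image_path))
        content.append({"type": "text",
                        "text": "Reply with only the letter of the correct option."})
        return [{"role": "user", "content": content}]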

Standard VQA

Open-source models typically lack the capability to integrate text and images within a single prompt. To enable fair comparisons, we propose an alternative modeling strategy where the question and option images are combined into a single composite image, labeled as Figure 1, Figure 2, etc. This composite image is accompanied by a structured textual prompt that describes different parts of the image, directing the model’s attention to relevant visual details. The composite image and prompt are then used to evaluate the model’s performance, testing its ability to interpret and respond to questions based on the integrated visual and textual information.
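
The sketch below shows one plausible way to build such a composite with the Pillow library: panels are stacked vertically, each prefixed with a "Figure i" label, and a structured prompt refers back to those labels. The layout constants and the assumption that Figure 1 is the question panel are illustrative choices, not the paper's exact pipeline.

    from PIL import Image, ImageDraw

    LABEL_H, MARGIN = 30, 10  # arbitrary layout constants for this sketch

    def make_composite(image_paths: list[str]) -> tuple[Image.Image, str]:
        """Stack question/option images vertically and return (composite, prompt)."""
        images = [Image.open(p).convert("RGB") for p in image_paths]
        width = max(img.width for img in images) + 2 * MARGIN
        height = sum(img.height + LABEL_H + MARGIN for img in images) + MARGIN

        canvas = Image.new("RGB", (width, height), "white")
        draw = ImageDraw.Draw(canvas)

        y, captions = MARGIN, []
        for i, img in enumerate(images, start=1):
            draw.text((MARGIN, y), f"Figure {i}", fill="black")  # panel label
            canvas.paste(img, (MARGIN, y + LABEL_H))
            y += img.height + LABEL_H + MARGIN
            captions.append(f"Figure {i}")

        prompt = (
            f"The composite image contains {len(images)} figures "
            f"({', '.join(captions)}). Figure 1 shows the question; the remaining "
            "figures show the answer options in order. Reply with only the letter "
            "of the correct option."
        )
        return canvas, prompt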

Experiment Results

The Analysis section of the paper presents a comprehensive evaluation of the impact of the various strategies on the NTSEBench dataset. The findings indicate that:

  • Interleaving text and images performs better than the Standard VQA and Image-Only strategies for most categories.
  • Multi-modal reasoning is significantly harder than text-only reasoning.
  • While LLMs generally perform better on the text-only subset of NTSEBench, the high standard deviation (11.04) indicates significant variability in model performance across different question types. This variability may stem from some overlap between NTSEBench and other open-source datasets, suggesting that there are still areas where LLMs exhibit limitations in reasoning capabilities.
  • NTSEBench presents a challenging task for all state-of-the-art LLMs and VLMs. None of the open-source models achieves accuracy exceeding 50% on text-only questions or 35% on multimodal questions, while proprietary models achieve 62% and 42% accuracy, respectively.

We manually conducted an error analysis of 260 questions (10 from each question category) for Gemini 1.5 Pro and identified distinct patterns in reasoning and error categorization. We categorized errors based on the cognitive dimensions outlined in Section 2. The Sankey diagram in Figure 2 illustrates how errors across various question categories correspond to specific error types. A key observation is that many errors arise from Pattern Recognition failures, especially in categories like Alphabet Tests, Non-Verbal Analogy, and Series questions, where the model struggled with recurring patterns and sequence shifts, highlighting challenges in complex pattern-based reasoning. We also noted frequent errors in Spatial Reasoning and Logical Deduction tasks, particularly in spatial or diagrammatic questions like Cube and Dice, Embedded Figure, and Paper Folding & Cutting. These questions often require pattern recognition, shape manipulation, or deducing logical relations from limited visual data. The figure shows that errors in Quantitative Analysis were common in numerical questions like Time and Clock and Mathematical Operations, indicating that the model excels at simpler tasks but struggles with complex number sequences and operations. The error distribution reveals key insights into the model’s strengths and weaknesses, guiding future improvements.

BibTeX

@misc{pandya2025ntsebenchcognitivereasoningbenchmark,
    title={NTSEBENCH: Cognitive Reasoning Benchmark for Vision Language Models},
    author={Pranshu Pandya and Vatsal Gupta and Agney S Talwarr and Tushar Kataria and Dan Roth and Vivek Gupta},
    year={2025},
    eprint={2407.10380},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2407.10380},
}