AgilePruner: An Empirical Study of Attention and Diversity for Adaptive Visual Token Pruning in Large Vision-Language Models

ICLR 2026
Changwoo Baek1*, Jouwon Song2*, Sohyeon Kim1*, Kyeongbo Kong1†
1Pusan National University 2LG Electronics
*Equal contribution †Corresponding author

Abstract

Large Vision-Language Models (LVLMs) have adopted visual token pruning strategies to mitigate the substantial computational overhead incurred by extensive visual token sequences. While prior works primarily focus on either attention-based or diversity-based pruning methods, an in-depth analysis of these approaches' characteristics and limitations remains largely unexplored. In this work, we conduct a thorough empirical analysis using effective rank (erank) as a measure of feature diversity and attention score entropy to investigate visual token processing mechanisms and analyze the strengths and weaknesses of each approach. Our analysis reveals two insights: (1) Our erank-based quantitative analysis shows that many diversity-oriented pruning methods preserve substantially less feature diversity than intended; moreover, analysis using the CHAIR dataset reveals that the diversity they do retain is closely tied to increased hallucination frequency compared to attention-based pruning. (2) We further observe that attention-based approaches are more effective on simple images where visual evidence is concentrated, while diversity-based methods better handle complex images with distributed features. Building on these empirical insights, we show that incorporating image-aware adjustments into existing hybrid pruning strategies consistently improves their performance. We also provide a minimal instantiation of our empirical findings through a simple adaptive pruning mechanism, which achieves strong and reliable performance across standard benchmarks as well as hallucination-specific evaluations.

Pruning Methods and Hallucination

Object hallucination is a critical issue undermining the reliability of LVLMs. Our analysis on the CHAIR dataset reveals a distinct trade-off between attention-based and diversity-based pruning methods:

  • Diversity-Based Pruning: These methods achieve higher recall by capturing a wider range of objects. However, this comes at the cost of a higher tendency for hallucination, producing more descriptive but less reliable captions.
  • Attention-Based Pruning: These methods are more conservative and reliable. They focus on tokens with high attention scores, which significantly reduces hallucination rates, though sometimes at the expense of lower recall.
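The recall-vs-hallucination trade-off above can be made concrete with a toy sketch of CHAIR-style per-caption metrics (CHAIR_i = fraction of mentioned objects that are hallucinated; recall = fraction of ground-truth objects that are mentioned). The object sets and captions below are illustrative, not from the actual evaluation.

```python
# Toy sketch of CHAIR-style metrics; object sets are illustrative.

def chair_metrics(mentioned, ground_truth):
    """Per-caption hallucination rate (CHAIR_i) and object recall.

    mentioned:    set of object names extracted from the generated caption
    ground_truth: set of objects actually present in the image
    """
    hallucinated = mentioned - ground_truth
    chair_i = len(hallucinated) / len(mentioned) if mentioned else 0.0
    recall = len(mentioned & ground_truth) / len(ground_truth) if ground_truth else 0.0
    return chair_i, recall

# A diversity-style caption mentions more objects (higher recall) but
# hallucinates one; an attention-style caption is conservative.
gt = {"dog", "ball", "tree", "bench"}
diversity_caption = {"dog", "ball", "tree", "bench", "frisbee"}  # "frisbee" hallucinated
attention_caption = {"dog", "ball"}

print(chair_metrics(diversity_caption, gt))  # (0.2, 1.0): full recall, some hallucination
print(chair_metrics(attention_caption, gt))  # (0.0, 0.5): no hallucination, lower recall
```

This mirrors the observed pattern: the wider-coverage caption scores higher recall but a nonzero CHAIR_i, while the conservative caption scores zero CHAIR_i at the cost of recall.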


Response patterns of DivPrune (diversity-based) vs. FasterVLM (attention-based).


Effect of attention scores on object hallucination.

Impact of Image Complexity on Token Selection

Our analysis reveals that the ideal token selection strategy is dictated by image complexity. We found a clear divergence in performance based on the characteristics of the image:

  • Simple Images: Images with concentrated information (e.g., OCR tasks) exhibit low attention entropy and feature diversity (erank). For these, attention-based methods are more effective as they can easily select the few most important tokens.
  • Complex Images: Scenes with multiple objects and varied backgrounds show higher attention entropy and erank, indicating dispersed information. In these cases, diversity-based methods perform better by capturing a broader range of features and preventing the loss of crucial context.
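The two diagnostics used in this analysis can be sketched as follows, assuming image features form an (N_tokens, D) matrix and attention scores form a length-N vector; erank follows the standard definition as the exponential of the Shannon entropy of the normalized singular values. Shapes and test inputs are illustrative.

```python
# Minimal sketch of the erank and attention-entropy diagnostics.
import numpy as np

def effective_rank(features: np.ndarray) -> float:
    """erank = exp(Shannon entropy of the normalized singular values)."""
    s = np.linalg.svd(features, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]                      # drop zeros to avoid log(0)
    return float(np.exp(-(p * np.log(p)).sum()))

def attention_entropy(attn: np.ndarray) -> float:
    """Shannon entropy of an attention distribution over visual tokens."""
    p = attn / attn.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(0)
# One dominant direction (simple image analogue) vs. full diversity (complex).
concentrated = rng.normal(size=(64, 32)) @ np.diag([10.0] + [0.1] * 31)
diffuse = rng.normal(size=(64, 32))
print(effective_rank(concentrated), effective_rank(diffuse))  # first is much smaller
```

Low erank and low attention entropy signal concentrated information (simple images); high values of both signal dispersed information (complex images).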

Analysis-Driven Enhancement of Pruning Methods

To prove that our findings are not tied to a specific algorithm, we applied the erank-guided linear mapping to existing fixed-ratio hybrid methods (such as BAT and VisPruner) and heterogeneous mixtures (FasterVLM + DivPrune).

Applying to existing fixed-ratio hybrid methods (BAT and VisPruner)

  • Adaptive Rule (Ours): our erank-guided adaptive rule consistently improves performance.
  • Inverse Rule: the inverse rule, which contradicts the discovered image-complexity trend, consistently degrades performance.
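One way to realize such an erank-guided linear mapping on a fixed-ratio hybrid is to let the fraction of the token budget assigned to the diversity branch grow linearly with normalized image complexity. The function names, `base_ratio`, `slope`, and clipping range below are hypothetical illustrations, not the exact mapping used in the paper.

```python
# Hypothetical erank-guided linear mapping for a fixed-ratio hybrid pruner.

def adaptive_diversity_ratio(erank_input: float, erank_avg: float,
                             base_ratio: float = 0.5, slope: float = 0.5) -> float:
    """Diversity share of the kept tokens grows with normalized complexity."""
    complexity = erank_input / erank_avg             # > 1 for complex images
    ratio = base_ratio + slope * (complexity - 1.0)
    return min(max(ratio, 0.0), 1.0)                 # clip to a valid fraction

def inverse_diversity_ratio(erank_input: float, erank_avg: float,
                            base_ratio: float = 0.5, slope: float = 0.5) -> float:
    """Control rule: flips the sign, contradicting the observed trend."""
    complexity = erank_input / erank_avg
    ratio = base_ratio - slope * (complexity - 1.0)
    return min(max(ratio, 0.0), 1.0)

# Complex image (erank above average) gets a larger diversity share under
# the adaptive rule, and a smaller one under the inverse control.
print(adaptive_diversity_ratio(120.0, 100.0))  # 0.6
print(inverse_diversity_ratio(120.0, 100.0))   # 0.4
```

The inverse rule serves purely as an ablation control: since it allocates budget against the image-complexity trend, its consistent performance drop supports the causal reading of the adaptive rule's gains.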

Applying to heterogeneous mixtures (FasterVLM + DivPrune)

  • This combination also benefited from the erank-guided rule, showing that the principle is model-agnostic and reflects general LVLM pruning behavior.

Towards Adaptive Token Similarity Thresholding

Our method follows a simple pruning procedure: high-attention tokens are selected first, and any candidate token whose cosine distance to an already-selected token falls below a similarity threshold (τ) is removed. The threshold τ directly controls the diversity of the final token set.

Building on our empirical analysis, we design a statistics-driven adaptive strategy that automatically adjusts to image complexity using the effective rank (erank) of image features. For each token i, we define a dynamic threshold based on the normalized image complexity:

$$ \tau_i = \min\left(\text{order}_i \cdot \frac{\text{erank}_{\text{input}}}{\text{erank}_{\text{avg}}} \cdot 0.01,\ \tau_{\max}\right) $$

  • Complex images (\(\text{erank}_{\text{input}} > \text{erank}_{\text{avg}}\)): A larger scaling factor increases \(\tau_i\), enabling stronger pruning and promoting token diversity.
  • Simple images (\(\text{erank}_{\text{input}} < \text{erank}_{\text{avg}}\)): A smaller scaling factor keeps \(\tau_i\) low, preserving fine-grained, high-attention tokens.

Contribution Summary

Our work introduces an adaptive pruning framework that dynamically adjusts to image complexity. We make the following key contributions:

  • We provide the first erank-based characterization of how existing pruning methods preserve feature diversity, and show how this retained diversity relates to hallucination behavior.
  • We reveal a consistent image-complexity–dependent preference between attention-based and diversity-based pruning, explaining when each paradigm succeeds or fails.
  • We show that these empirical principles are actionable by improving existing pruning methods and by presenting a minimal adaptive instantiation that achieves strong, consistent performance across benchmarks.

Results

We evaluate our proposed adaptive pruning method on various Large Vision-Language Models to demonstrate its effectiveness and robustness across different benchmarks. Our method performs robustly both on benchmarks dominated by simple images, such as ScienceQA, and on those featuring complex scenes, such as POPE, effectively reducing redundancy while preserving essential information. This makes it a stable and reliable alternative to fixed or non-adaptive pruning methods.


LLaVA-v1.5-7B

Placeholder for LLaVA-v1.5-7B Results

LLaVA-v1.5-13B

Placeholder for LLaVA-v1.5-13B Results

LLaVA-Next-7B

Placeholder for LLaVA-Next-7B Results

Qwen2.5-VL-7B

Placeholder for Qwen2.5-VL-7B Results

LLaVA-1.5-7B Hallucination (CHAIR Dataset)

Our analysis on the CHAIR dataset highlights the trade-off between pruning strategies: diversity-based methods capture more objects but increase hallucinations, while attention-based methods reduce hallucination but lose diversity. By adaptively balancing the two, our method achieves results close to using the full set of visual tokens, demonstrating both low hallucination and strong recall.


Placeholder for Hallucination (CHAIR) Dataset Results