Focus Matters: Phase-Aware Suppression for Hallucination in Vision-Language Models

Sohyeon Kim1, Sang Yeon Yoon2, Kyeongbo Kong1†

Pusan National University1        Pukyong National University2

†Corresponding author

Abstract

Large Vision-Language Models (LVLMs) have achieved impressive progress in multimodal reasoning, yet they remain prone to object hallucinations, generating descriptions of objects that are not present in the input image. Recent approaches attempt to mitigate hallucinations by suppressing unreliable visual signals in the vision encoder, but many rely on iterative optimization for each input, resulting in substantial inference latency. In this work, we investigate the internal attention dynamics of vision encoders in LVLMs and identify a consistent three-phase structure of visual information processing: diffusion, focus, and rediffusion. Our analysis reveals that hallucination behavior is particularly sensitive to tokens receiving low attention during the focus phase. Motivated by this observation, we propose a lightweight inference-time intervention that selectively suppresses such tokens during the focus phase. The method operates in a training-free manner using statistics from a single forward pass and employs a Determinantal Point Process (DPP) to preserve diverse visual cues while filtering redundant tokens. Extensive experiments across multiple LVLM backbones and decoding strategies demonstrate that the proposed approach consistently reduces hallucination metrics while maintaining competitive caption quality. Moreover, compared to adversarial uncertainty estimation methods, our approach achieves comparable hallucination mitigation with negligible additional inference latency.

Analysis

We analyze the visual attention dynamics of LVLMs from several perspectives.


01 · Attention Dynamics

Layer-wise Attention Dynamics in Vision Encoders

Analysis 1 visualization

In this study, we explore the internal attention dynamics of vision encoders. Through layer-wise analysis across multiple backbones, we reveal a consistent three-phase structure in how visual information is processed, independent of the model's architecture and scale.

To quantify attention concentration, we introduce a per-layer metric \(R^{(l)}\), defined as the ratio of the maximum attention score to the attention entropy at layer \(l\).

Using this metric, we identified three distinct phases:

  • Phase 1 - Diffusion: In the early layers, attention remains broadly distributed across many visual tokens.
  • Phase 2 - Focus: Entering the intermediate layers, attention becomes highly concentrated on a small subset of specific tokens.
  • Phase 3 - Rediffusion: In the later layers, the previously concentrated attention distribution spreads out again.
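The concentration metric \(R^{(l)}\) described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's exact implementation; the function name and the toy layer-wise attention maps below are our own, chosen to mimic the diffusion–focus–rediffusion pattern.

```python
import numpy as np

def attention_concentration(attn, eps=1e-12):
    """R = (max attention score) / (attention entropy) for one attention map.

    attn: 1-D array of non-negative attention weights over visual tokens.
    A diffuse (near-uniform) map yields a small R; a sharply focused
    map yields a large R.
    """
    attn = attn / attn.sum()                          # normalize to a distribution
    entropy = -np.sum(attn * np.log(attn + eps))      # Shannon entropy
    return attn.max() / (entropy + eps)

# Toy layer-wise maps mimicking the three phases over 16 visual tokens
layers = [
    np.ones(16) / 16,                      # Phase 1 - diffusion: uniform
    np.array([0.9] + [0.1 / 15] * 15),     # Phase 2 - focus: one dominant token
    np.ones(16) / 16,                      # Phase 3 - rediffusion: uniform again
]
scores = [attention_concentration(a) for a in layers]
```

On these toy maps, the focused middle "layer" produces a much larger \(R\) than the two diffuse ones, which is the signature used to delimit the focus phase.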

Results

Quantitative evaluation of our phase-aware suppression method.

Experimental Setup

DPP Masking

Simply discarding low-attention tokens by a strict ranking (e.g., Top-K) risks removing potentially useful visual cues, which can degrade the model's overall object recognition capabilities.

To overcome this, we employ a Determinantal Point Process (DPP)-based masking technique. By jointly modeling token importance (derived from attention scores) and token diversity (based on semantic similarity), the DPP approach allows us to selectively suppress redundant, noise-inducing tokens while effectively preserving a diverse set of essential visual features. For more details on our DPP masking method, please refer to the main paper.
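To give a rough sense of the idea, the sketch below builds a quality-weighted DPP kernel from per-token importance and feature similarity, then runs greedy MAP selection to keep a small, diverse subset; the complement would be the candidates for suppression. This is an assumption-laden illustration (the kernel form, jitter, and function name are ours), not the exact formulation from the paper.

```python
import numpy as np

def dpp_greedy_select(quality, features, k):
    """Greedy MAP selection under a DPP with kernel L_ij = q_i * q_j * <f_i, f_j>.

    quality:  (n,) per-token importance, e.g. focus-phase attention scores.
    features: (n, d) token features; rows should be unit-normalized so the
              Gram matrix acts as cosine similarity.
    Returns the indices of k tokens that jointly maximize (approximately)
    det(L_S), trading off importance against redundancy.
    """
    n = len(quality)
    S = features @ features.T                               # similarity matrix
    L = quality[:, None] * S * quality[None, :]             # quality-weighted kernel
    selected, remaining = [], list(range(n))
    for _ in range(k):
        best, best_logdet = None, -np.inf
        for i in remaining:
            idx = selected + [i]
            sub = L[np.ix_(idx, idx)] + 1e-9 * np.eye(len(idx))  # jitter for stability
            _, logdet = np.linalg.slogdet(sub)
            if logdet > best_logdet:                        # pick the candidate that
                best, best_logdet = i, logdet               # grows det(L_S) the most
        selected.append(best)
        remaining.remove(best)
    return selected
```

Because det(L_S) shrinks when two selected tokens are near-duplicates, a high-quality token whose feature direction is already covered loses out to a slightly weaker but complementary one, which is exactly the importance-plus-diversity trade-off the DPP masking relies on.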

All subsequent experiments were conducted using this DPP masking technique.

Quantitative

Hallucination Mitigation

We compared our approach with the original baseline and the AUE method across various decoding strategies on the CHAIR and POPE benchmarks.

  • Consistent Hallucination Reduction: Effectively lowers \(CHAIR_S\) and \(CHAIR_I\) metrics, matching or exceeding the computationally expensive AUE baseline.
  • Preserved Visual Recognition: Maintains or improves F1 and POPE scores, proving that essential visual information remains intact.
  • Broad Compatibility: Seamlessly integrates with existing decoder-level methods to further drive down hallucinations.
CHAIR and POPE benchmark results
Quantitative

Inference Efficiency

We compared the inference latency of our approach against both the original baseline and the AUE method to evaluate its practical efficiency.

  • Single Forward Pass Efficiency: By operating entirely in a single forward pass, our method bypasses the computationally heavy iterative optimization of AUE, maintaining generation speeds almost identical to the original baseline.
Inference efficiency comparison

Qualitative Results: CHAIR

Qualitative results on the CHAIR benchmark, shown per model.

CHAIR qualitative result

Qualitative Results: POPE

Qualitative results on the POPE benchmark, shown per model.

POPE qualitative result

BibTeX

@article{YourPaperKey2024,
  title={Your Paper Title Here},
  author={First Author and Second Author and Third Author},
  journal={Conference/Journal Name},
  year={2024},
  url={https://your-domain.com/your-project-page}
}