Simply discarding low-attention tokens based on a strict ranking (e.g. Top-K) risks removing potentially useful visual cues, which can degrade the model's overall object recognition capabilities.
To overcome this, we employ a Determinantal Point Process (DPP)-based masking technique. By jointly modeling token importance (derived from attention scores) and token diversity (based on semantic similarity), the DPP approach allows us to selectively suppress redundant, noise-inducing tokens while effectively preserving a diverse set of essential visual features. For more details on our DPP masking method, please refer to the main paper.
All subsequent experimental results were obtained using this DPP masking technique.
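
To make the contrast with Top-K concrete, here is a minimal sketch of importance-plus-diversity selection. It is not the implementation from the main paper. It assumes a quality-diversity kernel L = diag(q) S diag(q), where q holds per-token attention scores and S is the cosine similarity between token features, and it approximates the most likely subset with a simple greedy log-determinant search. The function name `greedy_dpp_select`, the toy inputs, and the choice to keep the selected tokens and mask the rest are all illustrative assumptions.

```python
import numpy as np


def greedy_dpp_select(quality, features, k):
    """Greedily pick up to k token indices that maximize det(L_S).

    Assumed kernel: L = diag(q) S diag(q), with S the cosine similarity
    between token features. This is an illustrative sketch, not the
    paper's implementation.
    """
    # Diversity term: cosine similarity between L2-normalized features.
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    similarity = unit @ unit.T
    # Combine importance (quality) and diversity (similarity) in one kernel.
    kernel = quality[:, None] * similarity * quality[None, :]

    selected = []
    remaining = list(range(len(quality)))
    for _ in range(k):
        best_idx, best_logdet = None, -np.inf
        for i in remaining:
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            # Skip candidates that make the submatrix singular.
            if sign > 0 and logdet > best_logdet:
                best_idx, best_logdet = i, logdet
        if best_idx is None:
            break  # no candidate adds any diversity
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected


# Toy example (hypothetical values): tokens 0 and 1 are near-duplicates
# with high attention; token 2 is distinct with moderate attention.
quality = np.array([0.9, 0.8, 0.5, 0.1])
features = np.array([
    [1.0, 0.0, 0.0],
    [1.0, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

top_k = np.argsort(-quality)[:2].tolist()        # [0, 1]: redundant pair
kept = greedy_dpp_select(quality, features, 2)   # [0, 2]: diverse pair

# Boolean mask over tokens: True = keep, False = suppress.
keep_mask = np.zeros(len(quality), dtype=bool)
keep_mask[kept] = True
```

In this toy case, Top-K keeps the two near-duplicate tokens, whereas the determinant-based selection trades the second duplicate for the distinct, moderately attended token. The greedy loop recomputes a determinant for every candidate, so it is only suitable for small token counts. A practical implementation would likely use an incremental Cholesky update.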