3DZip: Spatial-Aware Feature Diversity-Guided Token Compression for 3D Question Answering

Abstract

Recent 3D vision-language models (3D VLMs) construct geometry-aware tokens by projecting 2D visual features into world coordinates, enabling spatial reasoning for tasks such as 3D question answering. However, this design generates thousands of tokens per scene, resulting in substantial computational and memory overhead. While token compression has been extensively studied in 2D VLMs, existing approaches rely on semantic relevance or attention-based selection, which overlooks the structured spatial nature of 3D tokens. Moreover, redundancy in 3D representations cannot be resolved by spatial proximity alone, as object-level token imbalance persists even after spatial aggregation. To address this, we propose 3DZip, a three-stage token compression framework that first applies coarse voxelization to remove point-level redundancy, then selects anchor tokens based on feature-space diversity via a Determinantal Point Process, and finally merges the remaining tokens under spatial constraints to preserve geometric coherence. Experiments on three 3D question answering benchmarks demonstrate that 3DZip consistently outperforms existing compression methods, retaining 94.7% of the original performance with only 128 tokens while achieving 1.92× faster inference.

A three-stage compression pipeline

Voxelization → diversity-aware anchor selection (DPP) → spatially-constrained merging.

Overview of the proposed three-stage token compression pipeline. Given geometry-aware 3D tokens 𝒱, we first apply coarse voxelization, then perform diversity-aware anchor selection via a Determinantal Point Process, and finally apply spatially-constrained token merging, which aggregates nearby non-anchor tokens into anchors to produce the compressed set.
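The first stage can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `voxelize_tokens`, the `voxel_size` value, and the choice to average positions and features within a voxel are all assumptions made for illustration.

```python
import numpy as np

def voxelize_tokens(xyz, feats, voxel_size=0.2):
    """Collapse tokens that share a voxel into one averaged token.

    xyz:   (N, 3) token positions in world coordinates
    feats: (N, D) token features
    voxel_size is a hypothetical value, not taken from the paper.
    """
    xyz = np.asarray(xyz, dtype=float)
    feats = np.asarray(feats, dtype=float)

    # Integer voxel coordinates for every token.
    voxel_idx = np.floor(xyz / voxel_size).astype(np.int64)

    # Map each occupied voxel to a compact id in [0, n_vox).
    _, inverse = np.unique(voxel_idx, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    n_vox = int(inverse.max()) + 1

    # Sum positions and features per voxel, then divide by the token count.
    counts = np.bincount(inverse, minlength=n_vox).astype(float)
    out_xyz = np.zeros((n_vox, 3))
    out_feats = np.zeros((n_vox, feats.shape[1]))
    np.add.at(out_xyz, inverse, xyz)
    np.add.at(out_feats, inverse, feats)
    return out_xyz / counts[:, None], out_feats / counts[:, None]
```

Averaging is only one possible reduction; the point of the sketch is that tokens observing the same surface from several views land in the same cell and are reduced to a single token.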

Projection-based multi-view aggregation introduces two distinct forms of redundancy. Point-level redundancy arises when identical physical surfaces are observed across multiple viewpoints, producing overlapping tokens. Object-level redundancy occurs when a single object is represented by tokens from multiple surface regions, which spatial aggregation alone cannot collapse. 3DZip explicitly decomposes redundancy into point-level density, object-level duplication, and geometric consistency, addressing each with a dedicated stage.
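The second and third stages can be sketched in the same spirit. The code below uses a standard greedy MAP approximation for a DPP with a cosine-similarity kernel, followed by a simple merging rule; the kernel choice, the `radius` threshold, the nearest-anchor assignment, and the decision to drop tokens with no anchor in range are assumptions for illustration, not details confirmed by the source.

```python
import numpy as np

def greedy_dpp(feats, k, eps=1e-10):
    """Greedy MAP selection of up to k diverse tokens (cosine kernel assumed)."""
    f = np.asarray(feats, dtype=float)
    f = f / np.linalg.norm(f, axis=1, keepdims=True)
    L = f @ f.T                       # DPP kernel over features
    n = L.shape[0]
    c = np.zeros((k, n))              # incremental Cholesky-style factors
    d2 = np.diag(L).copy()            # marginal gain of each candidate
    selected = []
    j = int(np.argmax(d2))
    for t in range(k):
        selected.append(j)
        if t == k - 1:
            break
        # Update the gains of all candidates given the newly selected token j.
        e = (L[j] - c[:t].T @ c[:t, j]) / np.sqrt(d2[j])
        c[t] = e
        d2 = d2 - e ** 2
        d2[selected] = -np.inf        # never pick the same token twice
        j = int(np.argmax(d2))
        if d2[j] <= eps:              # nothing diverse is left
            break
    return selected

def merge_into_anchors(xyz, feats, anchors, radius=0.5):
    """Average each non-anchor token into its nearest anchor within radius."""
    xyz = np.asarray(xyz, dtype=float)
    feats = np.asarray(feats, dtype=float)
    anchors = np.asarray(anchors, dtype=int)
    k = len(anchors)

    # Distance from every token to every anchor, shape (N, K).
    dist = np.linalg.norm(xyz[:, None, :] - xyz[anchors][None, :, :], axis=-1)
    nearest = dist.argmin(axis=1)
    keep = dist[np.arange(len(xyz)), nearest] <= radius

    # Anchors always contribute to themselves.
    nearest[anchors] = np.arange(k)
    keep[anchors] = True

    merged = np.zeros((k, feats.shape[1]))
    counts = np.zeros(k)
    np.add.at(merged, nearest[keep], feats[keep])
    np.add.at(counts, nearest[keep], 1.0)
    return merged / counts[:, None]
```

Selecting anchors in feature space addresses object-level duplication, while the spatial radius keeps each merged token tied to a local region of the scene. In this sketch, tokens farther than `radius` from every anchor are simply discarded.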

Why feature diversity matters

Object-level token allocation across selection strategies on the full SQA3D test scenes. Per-object token counts for 1,729 foreground object instances across 67 ScanNet scenes. The voxel-only distribution exhibits a pronounced long-tail pattern; spatial sampling (XYZ-DPP) still concentrates on a limited subset of objects (47% coverage), whereas Feature-DPP distributes tokens more evenly, improving object coverage to 64%.
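One plausible way to compute such a coverage figure is sketched below. It assumes an object counts as covered when at least one selected token carries its instance id; the source does not state the exact definition, and the function name and `background_id` convention are hypothetical.

```python
import numpy as np

def object_coverage(token_obj_ids, selected, background_id=-1):
    """Return (coverage, per-object selected-token counts).

    token_obj_ids: (N,) instance id of each token, background_id for non-objects
    selected:      indices of the tokens kept by a selection strategy
    """
    ids = np.asarray(token_obj_ids)
    objects = np.unique(ids[ids != background_id])

    # Instance ids of the selected tokens, ignoring background.
    sel_ids = ids[np.asarray(selected, dtype=int)]
    sel_ids = sel_ids[sel_ids != background_id]

    # Number of selected tokens that fall on each foreground object.
    counts = {int(o): int(np.sum(sel_ids == o)) for o in objects}
    covered = sum(1 for n in counts.values() if n > 0)
    return covered / len(objects), counts
```

Under this definition, a strategy that spends its whole budget on a few large objects has low coverage even if every selected token is on a foreground surface.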

State-of-the-art under every token budget

Main results on 3D question answering benchmarks

Main results on three 3D question answering benchmarks (ScanQA, SQA3D, OpenEQA). Under matched token budgets (128 / 64 / 32), 3DZip consistently outperforms prior 2D token compression methods. With only 128 tokens, 3DZip retains 94.7% of the full-token (1410) performance.

What 3DZip keeps

Retained 3D tokens produced by 3DZip across different indoor scenes.