3DZip: Spatial-Aware Feature Diversity-Guided Token Compression for 3D Question Answering

ECCV 2026
Pusan National University
Paper coming soon
TL;DR

Token pruning has so far been explored only for 2D VLMs. We analyze the redundancy of projection-based 3D VLM tokens and propose 3DZip, the first training-free, geometry-aware token compression method for 3D VLMs.

Abstract

Recent 3D vision-language models (3D VLMs) construct geometry-aware tokens by projecting 2D visual features into world coordinates, enabling spatial reasoning for tasks such as 3D question answering. However, this design generates thousands of tokens per scene, resulting in substantial computational and memory overhead. While token compression has been extensively studied in 2D VLMs, existing approaches rely on semantic relevance or attention-based selection and overlook the structured spatial nature of 3D tokens. Moreover, redundancy in 3D representations cannot be resolved by spatial proximity alone, as object-level token imbalance persists even after spatial aggregation. To address this, we propose 3DZip, a three-stage token compression framework that first applies coarse voxelization to remove point-level redundancy, then selects anchor tokens based on feature-space diversity via a Determinantal Point Process, and finally merges the remaining tokens under spatial constraints to preserve geometric coherence. Experiments on three 3D question answering benchmarks demonstrate that 3DZip consistently outperforms existing compression methods, retaining 94.7% of the original performance with only 128 tokens while achieving 1.92× faster inference.
Preliminary

Projection-based 3D VLMs

Geometry-aware 3D token construction. Multi-view RGB-D inputs are projected into world coordinates to form geometry-aware 3D tokens via 3D positional embedding. The large token cardinality N = M × N_2D motivates an efficient token compression strategy.
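To make the projection step and the resulting token count concrete, here is a minimal NumPy sketch. It assumes a pinhole camera with intrinsics `K`, a 4×4 camera-to-world pose, and one token per depth pixel at token resolution; the function name `backproject_depth` and all toy values are illustrative assumptions, not taken from the paper or any specific 3D VLM codebase.

```python
import numpy as np

def backproject_depth(depth, K, cam_to_world):
    """Lift every pixel of a depth map to world coordinates.

    depth: (H, W) metric depth, K: (3, 3) pinhole intrinsics,
    cam_to_world: (4, 4) camera-to-world pose. Returns (H*W, 3).
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))   # pixel grid, both (H, W)
    z = depth.reshape(-1)
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]      # inverse pinhole model
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)  # homogeneous, (H*W, 4)
    return (pts_cam @ cam_to_world.T)[:, :3]

# Toy setup (hypothetical values): M views, each with an h x w grid of 2D tokens.
M, h, w = 4, 3, 5
K = np.array([[2.0, 0.0, 2.0], [0.0, 2.0, 1.0], [0.0, 0.0, 1.0]])
pose = np.eye(4)
pose[:3, 3] = [1.0, 0.0, 0.0]                        # camera shifted 1 unit along x
views = [backproject_depth(np.full((h, w), 2.0), K, pose) for _ in range(M)]
xyz = np.concatenate(views, axis=0)
N = xyz.shape[0]                                     # N = M * N_2D = 4 * 15
```

Because every view contributes its full grid of 2D tokens, the count grows linearly with the number of views, which is the cardinality the caption refers to.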
Method

A three-stage compression pipeline

Voxelization → diversity-aware anchor selection (DPP) → spatially-constrained merging.

3DZip three-stage pipeline
Overview of the proposed three-stage token compression pipeline. Given geometry-aware 3D tokens 𝒱, we first apply coarse voxelization, then perform diversity-aware anchor selection via a Determinantal Point Process, and finally apply spatially-constrained token merging, which aggregates nearby non-anchor tokens into anchors to produce the compressed set.
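The first stage can be sketched as a simple voxel pooling step. The sketch below assumes that tokens falling into the same voxel are mean-pooled in both position and feature; the page does not state the pooling rule, and `voxel_pool` and `voxel_size` are hypothetical names used only for illustration.

```python
import numpy as np

def voxel_pool(xyz, feats, voxel_size):
    """Average all tokens that fall into the same voxel (assumed pooling rule)."""
    keys = np.floor(xyz / voxel_size).astype(np.int64)        # (N, 3) integer voxel indices
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)     # some NumPy versions return a 2D inverse here
    n_vox = int(inverse.max()) + 1
    counts = np.bincount(inverse, minlength=n_vox).astype(float)
    pooled_xyz = np.zeros((n_vox, 3))
    pooled_feats = np.zeros((n_vox, feats.shape[1]))
    np.add.at(pooled_xyz, inverse, xyz)                       # scatter-add positions
    np.add.at(pooled_feats, inverse, feats)                   # scatter-add features
    return pooled_xyz / counts[:, None], pooled_feats / counts[:, None]

# Toy example: the first two tokens share a voxel, the third does not.
xyz = np.array([[0.1, 0.1, 0.1], [0.2, 0.3, 0.1], [1.2, 0.1, 0.1]])
feats = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]])
vox_xyz, vox_feats = voxel_pool(xyz, feats, voxel_size=0.5)
```

A coarser `voxel_size` removes more overlapping multi-view tokens but also blurs fine geometry, so the value would need to be tuned per backbone.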
Projection-based multi-view aggregation introduces two distinct forms of redundancy. Point-level redundancy arises when identical physical surfaces are observed across multiple viewpoints, producing overlapping tokens. Object-level redundancy occurs when a single object is represented by tokens from multiple surface regions, which spatial aggregation alone cannot collapse. 3DZip explicitly decomposes redundancy into point-level density, object-level duplication, and geometric consistency, addressing each with a dedicated stage.
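The second and third stages can be illustrated as follows. This is a sketch under stated assumptions rather than the authors' implementation: the DPP kernel is taken to be cosine similarity between features, MAP inference uses the standard greedy algorithm with incremental Cholesky updates, and a non-anchor token is merged into its most feature-similar anchor within a fixed `radius` (tokens with no anchor in range are dropped). The names `greedy_dpp`, `merge_into_anchors`, and `radius` are hypothetical.

```python
import numpy as np

def greedy_dpp(L, k, eps=1e-10):
    """Greedy MAP inference for a DPP with PSD kernel L (n x n); returns up to k indices."""
    n = L.shape[0]
    c = np.zeros((k, n))                    # incremental Cholesky rows
    d2 = np.diag(L).astype(float).copy()    # marginal gain of adding each item
    selected = []
    for m in range(k):
        j = int(np.argmax(d2))
        if d2[j] < eps:                     # no remaining item adds volume
            break
        selected.append(j)
        e = (L[j] - c[:m].T @ c[:m, j]) / np.sqrt(d2[j])
        c[m] = e
        d2 = d2 - e ** 2                    # Schur-complement update of the gains
        d2[selected] = -np.inf              # never pick the same token twice
    return selected

def merge_into_anchors(xyz, feats, anchors, radius):
    """Average each token into its most similar anchor within `radius` (assumed rule)."""
    anchors = np.asarray(anchors)
    K = len(anchors)
    sim = feats @ feats[anchors].T                                       # (N, K)
    dist = np.linalg.norm(xyz[:, None] - xyz[anchors][None], axis=-1)    # (N, K)
    sim = np.where(dist <= radius, sim, -np.inf)                         # spatial constraint
    owner = sim.argmax(axis=1)
    valid = np.isfinite(sim.max(axis=1))    # tokens with at least one anchor in range
    owner[anchors] = np.arange(K)           # an anchor always owns itself
    merged = np.zeros((K, feats.shape[1]))
    counts = np.zeros(K)
    np.add.at(merged, owner[valid], feats[valid])
    np.add.at(counts, owner[valid], 1.0)
    return merged / counts[:, None], counts

# Toy scene: tokens 0-2 are near-duplicates on one object, tokens 3 and 4 are distinct.
feats = np.array([[1.0, 0.0, 0.0], [0.99, 0.14, 0.0], [0.98, 0.0, 0.2],
                  [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
xyz = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.2, 0.0, 0.0],
                [2.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
anchors = greedy_dpp(feats @ feats.T, k=3)
merged, counts = merge_into_anchors(xyz, feats, anchors, radius=0.5)
```

In this toy scene the diversity criterion spends only one anchor on the over-represented object and keeps both distinct tokens, while merging folds the two leftover near-duplicates into their nearby anchor instead of discarding them.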
Analysis

Why feature diversity matters

Object-level token allocation across selection strategies on the full SQA3D test scenes. Per-object token counts for 1,729 foreground object instances across 67 ScanNet scenes. The voxel-only distribution exhibits a pronounced long-tail pattern; spatial sampling (XYZ-DPP) still concentrates on a limited subset of objects (47% coverage), whereas Feature-DPP distributes tokens more evenly, improving object coverage to 64%.
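One way to compute this kind of statistic is sketched below. It assumes that each token carries the instance ID of the object it lies on (with -1 for background) and that coverage is the fraction of foreground objects that receive at least one selected token; the caption does not spell out the exact definition, so treat `object_coverage` and the toy IDs as illustrative only.

```python
import numpy as np

def object_coverage(token_obj_ids, selected, num_objects):
    """Per-object token counts and the fraction of objects with at least one token."""
    ids = np.asarray(token_obj_ids)[np.asarray(selected)]
    ids = ids[ids >= 0]                               # ignore background tokens
    per_obj = np.bincount(ids, minlength=num_objects)
    return per_obj, float((per_obj > 0).mean())

# Toy scene with 4 objects; object 0 owns far more tokens than the others.
token_obj_ids = [0, 0, 0, 0, 1, 1, 2, -1, 3]
concentrated = [0, 1, 2, 4]        # a selection that piles onto object 0
spread_out = [0, 4, 6, 8]          # a selection that spreads across objects
counts_a, cov_a = object_coverage(token_obj_ids, concentrated, num_objects=4)
counts_b, cov_b = object_coverage(token_obj_ids, spread_out, num_objects=4)
```

Under the same budget of four tokens, the concentrated selection covers half of the objects and the spread-out one covers all of them, which mirrors the qualitative gap the figure reports between spatial and feature-based selection.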
Results

State-of-the-art under every token budget

Main results on three 3D question answering benchmarks (ScanQA, SQA3D, OpenEQA). Under matched token budgets (128 / 64 / 32), 3DZip consistently outperforms prior 2D token compression methods. With only 128 tokens, 3DZip retains 94.7% of the performance obtained with the full 1,410 tokens.
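For a sense of scale, the token budgets can be expressed as a fraction of the full 1,410 tokens. The sketch below also shows one plausible way to compute a retention figure, as the mean of per-metric ratios between compressed and full-token scores; that definition is an assumption, and the scores passed to `mean_retention` are made-up placeholders, not results from the paper.

```python
FULL_TOKENS = 1410

def keep_ratio(budget, full=FULL_TOKENS):
    """Fraction of the original tokens that survive compression."""
    return budget / full

def mean_retention(compressed, full):
    """Mean of per-metric score ratios (assumed definition of retained performance)."""
    return sum(compressed[m] / full[m] for m in full) / len(full)

for budget in (128, 64, 32):
    print(f"{budget} tokens -> {keep_ratio(budget):.1%} of the full token set")
# 128 tokens -> 9.1% of the full token set
# 64 tokens -> 4.5% of the full token set
# 32 tokens -> 2.3% of the full token set

# Placeholder scores for two hypothetical metrics, for illustration only.
retention = mean_retention({"metric_a": 45.0, "metric_b": 76.0},
                           {"metric_a": 50.0, "metric_b": 80.0})
```

Read this way, the 128-token setting keeps roughly one token in eleven.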
Generalization

Works across different 3D VLM backbones

Generalization across 3D VLMs: Video-3D-LLM and SR-3D
Generalization across projection-based 3D VLMs. 3DZip is applied after 3D token construction without modifying the backbone, and consistently outperforms all baselines on Video-3D-LLM and SR-3D under every token budget.
Qualitative

What 3DZip keeps

Retained 3D tokens produced by 3DZip across different indoor scenes.

Qualitative examples on SQA3D
Qualitative examples on OpenEQA
Qualitative examples on ScanQA