|
Volume Transformer: Revisiting Vanilla Transformers for 3D Scene Understanding
Kadir Yilmaz*,
Adrian Kruse*,
Tristan Höfer,
Daan de Geus,
Bastian Leibe
arXiv preprint, 2026
Volume Transformer (Volt) is a vanilla Transformer for 3D scene understanding: it partitions the scene into volumetric patch tokens and applies global self-attention with 3D rotary positional embeddings. With data-efficient training and multi-dataset scaling, Volt achieves state-of-the-art results on semantic and instance segmentation.
|