Publication

StructSAM: Structure- and Spectrum-Preserving Token Merging for Segment Anything Models

Duy M. H. Nguyen; Tuan A. Tran; Duong Nguyen; Siwei Xie; Trung Nguyen; Mai T. N. Truong; Daniel Palenicek; An T. Le; Michael Barz; TrungTin Nguyen; Tuan Quang Dam; Hong Anh Le; Minh Vu; Khoa D. Doan; Ngo Anh Vien; Pengtao Xie; James Zou; Daniel Sonntag; Jan Peters; Mathias Niepert
In: Computing Research Repository eprint Journal (CoRR), Vol. abs/2603.07307, Pages 1-17, arXiv, 2026.

Abstract

Recent token merging techniques for Vision Transformers (ViTs) provide substantial speedups by reducing the number of tokens processed by self-attention, often without retraining. However, their direct application to the Segment Anything Model (SAM) family is nontrivial: SAM’s image encoder mixes windowed and global attention, and its mask decoder relies on dense, prompt-conditioned features for precise boundary prediction. We systematically evaluate representative token-merging methods on SAM and Medical SAM in a strict off-the-shelf setting, and find that existing destination-selection heuristics can erode boundaries and leak prompt information as merge rates increase. We propose StructSAM, a resolution-preserving merge–unmerge framework tailored to SAM. StructSAM computes a lightweight token-energy score from first-order feature gradients, uses grid-based flatness screening to protect boundary/prompt regions, and merges tokens within flat areas toward low-energy destinations with explicit token recovery. We further provide a spectral graph coarsening view showing that score-guided merging yields bounded Laplacian spectral distortion compared to random or window-restricted baselines.
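To make the merge–unmerge idea concrete, here is a minimal sketch of energy-scored token merging on a feature grid. It is an illustrative toy, not the paper's implementation: the energy score, the pairwise merge rule, and the `flat_frac` threshold are all simplifying assumptions standing in for the paper's gradient-based scoring, grid-based flatness screening, and token recovery.

```python
import numpy as np

def token_energy(tokens, grid_h, grid_w):
    """Hypothetical per-token energy from first-order spatial feature
    gradients; high energy marks likely boundary/prompt regions."""
    feats = tokens.reshape(grid_h, grid_w, -1)
    gx = np.abs(np.diff(feats, axis=1, prepend=feats[:, :1]))  # horizontal gradient
    gy = np.abs(np.diff(feats, axis=0, prepend=feats[:1]))     # vertical gradient
    return (gx.sum(-1) + gy.sum(-1)).reshape(-1)               # one score per token

def merge_unmerge(tokens, grid_h, grid_w, flat_frac=0.5):
    """Merge the flattest (lowest-energy) tokens pairwise, then 'unmerge'
    by writing each merged feature back to both source slots, so the output
    keeps the original token resolution. High-energy tokens are untouched."""
    energy = token_energy(tokens, grid_h, grid_w)
    flat = np.argsort(energy)[: int(len(energy) * flat_frac)]  # merge candidates
    out = tokens.copy()
    for a, b in zip(flat[0::2], flat[1::2]):
        merged = 0.5 * (tokens[a] + tokens[b])  # toy merge: average the pair
        out[a] = out[b] = merged                # explicit recovery to both slots
    return out
```

In this toy form, resolution preservation follows directly from the unmerge step: the output has one feature per original grid position, and tokens above the flatness threshold pass through unchanged.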