Publication
StructSAM: Structure- and Spectrum-Preserving Token Merging for Segment Anything Models
Duy M. H. Nguyen; Tuan A. Tran; Duong Nguyen; Siwei Xie; Trung Nguyen; Mai T. N. Truong; Daniel Palenicek; An T. Le; Michael Barz; TrungTin Nguyen; Tuan Quang Dam; Hong Anh Le; Minh Vu; Khoa D. Doan; Ngo Anh Vien; Pengtao Xie; James Zou; Daniel Sonntag; Jan Peters; Mathias Niepert
In: Computing Research Repository eprint Journal (CoRR), Vol. abs/2603.07307, Pages 1-17, arXiv, 2026.
Abstract
Recent token merging techniques for Vision Transformers (ViTs) provide substantial speedups by reducing the number of tokens processed by self-attention, often without retraining. However, their direct application to the Segment Anything Model (SAM) family is nontrivial: SAM's image encoder mixes windowed and global attention, and its mask decoder relies on dense, prompt-conditioned features for precise boundary prediction. We systematically evaluate representative token-merging methods on SAM and Medical SAM in a strict off-the-shelf setting, and find that existing destination-selection heuristics can erode boundaries and leak prompt information as merge rates increase. We propose StructSAM, a resolution-preserving merge–unmerge framework tailored to SAM. StructSAM computes a lightweight token-energy score from first-order feature gradients, uses grid-based flatness screening to protect boundary/prompt regions, and merges tokens within flat areas toward low-energy destinations with explicit token recovery. We further provide a spectral graph coarsening view showing that score-guided merging yields bounded Laplacian spectral distortion compared to random or window-restricted baselines.
