Publication
METok: Multi-Stage Event-based Token Compression for Efficient Long Video Understanding
Mengyue Wang; Shuo Chen; Kristian Kersting; Volker Tresp; Yunpu Ma
In: Computing Research Repository (CoRR), Vol. abs/2506.02850, Pages 1-15, 2025.
Abstract
Recent advances in Vision Large Language Models (VLLMs) have significantly enhanced their ability to understand video content. Nonetheless, processing long videos remains challenging due to high computational demands and the redundancy present in the visual data. In this work, we propose METok, a training-free, Multi-stage Event-based Token compression framework designed to accelerate VLLMs' inference while preserving accuracy. METok progressively eliminates redundant visual tokens across three critical stages: (1) event-aware compression during vision encoding, (2) hierarchical token pruning in the prefilling stage based on semantic alignment and event importance, and (3) a decoding-stage KV Cache optimization that further reduces memory consumption. Our experiments on diverse video benchmarks demonstrate that METok achieves an optimal trade-off between efficiency and accuracy by dynamically selecting informative visual tokens. For instance, equipping LongVA-7B with METok realizes an 80.6% FLOPs reduction and 93.5% KV Cache memory savings, all while maintaining comparable or even superior accuracy. The code is available here.
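To illustrate the kind of event-based token pruning the abstract describes, the sketch below selects the visual tokens most aligned with a text query within each event segment. It is a minimal, hypothetical example: the scoring function (cosine similarity), the per-event keep ratio, and all names are assumptions for illustration, not the scoring or thresholds defined by METok.

```python
# Illustrative sketch only: prunes visual tokens by text-visual cosine similarity,
# keeping a fixed fraction per event segment. Names, scoring, and ratios are
# hypothetical and not taken from the METok paper.
import torch


def prune_visual_tokens(visual_tokens, text_embedding, event_ids, keep_ratio=0.2):
    """
    visual_tokens: (N, D) tensor of visual token embeddings
    text_embedding: (D,) pooled text/query embedding
    event_ids: (N,) long tensor assigning each visual token to an event segment
    keep_ratio: fraction of tokens retained within each event
    """
    # Cosine similarity between every visual token and the text embedding.
    scores = torch.nn.functional.cosine_similarity(
        visual_tokens, text_embedding.unsqueeze(0), dim=-1
    )

    kept_indices = []
    for event in event_ids.unique():
        idx = (event_ids == event).nonzero(as_tuple=True)[0]
        k = max(1, int(keep_ratio * idx.numel()))
        # Keep the top-k most text-aligned tokens from this event.
        top = scores[idx].topk(k).indices
        kept_indices.append(idx[top])

    kept = torch.cat(kept_indices).sort().values  # preserve temporal order
    return visual_tokens[kept], kept


# Usage example with random data.
if __name__ == "__main__":
    N, D = 1024, 256
    vis = torch.randn(N, D)
    txt = torch.randn(D)
    events = torch.randint(0, 8, (N,))
    pruned, idx = prune_visual_tokens(vis, txt, events, keep_ratio=0.2)
    print(pruned.shape)  # roughly (0.2 * N, D)
```

Keeping a fixed fraction per event (rather than a global top-k) reflects the abstract's emphasis on event importance: each segment retains its most query-relevant tokens instead of letting one visually busy event dominate the budget.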
