
Publication

METok: Multi-Stage Event-based Token Compression for Efficient Long Video Understanding

Mengyue Wang; Shuo Chen; Kristian Kersting; Volker Tresp; Yunpu Ma
In: Computing Research Repository (CoRR), Vol. abs/2506.02850, pp. 1-15, 2025.

Abstract

Recent advances in Vision Large Language Models (VLLMs) have significantly enhanced their ability to understand video content. Nonetheless, processing long videos remains challenging due to high computational demands and the redundancy present in the visual data. In this work, we propose METok, a training-free, Multi-stage Event-based Token compression framework designed to accelerate VLLMs' inference while preserving accuracy. METok progressively eliminates redundant visual tokens across three critical stages: (1) event-aware compression during vision encoding, (2) hierarchical token pruning in the prefilling stage based on semantic alignment and event importance, and (3) a decoding-stage KV Cache optimization that further reduces memory consumption. Our experiments on diverse video benchmarks demonstrate that METok achieves an optimal trade-off between efficiency and accuracy by dynamically selecting informative visual tokens. For instance, equipping LongVA-7B with METok realizes an 80.6% FLOPs reduction and 93.5% KV Cache memory savings, all while maintaining comparable or even superior accuracy. The code is available here.
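The abstract describes a pipeline that progressively discards low-relevance visual tokens across multiple stages. The following toy sketch illustrates that general idea only; it is not the paper's actual algorithm, and the stage-wise keep ratios, the scoring function, and all names (`prune_stage`, `multi_stage_compress`) are hypothetical placeholders.

```python
def prune_stage(tokens, scores, keep_ratio):
    """Keep the top keep_ratio fraction of tokens by score, preserving order."""
    k = max(1, int(len(tokens) * keep_ratio))
    # indices of the k highest-scoring tokens, re-sorted into original order
    top = sorted(sorted(range(len(tokens)),
                        key=lambda i: scores[i], reverse=True)[:k])
    return [tokens[i] for i in top], [scores[i] for i in top]


def multi_stage_compress(tokens, scores, ratios=(0.5, 0.5, 0.25)):
    """Apply successive pruning stages; the ratios here are arbitrary."""
    for r in ratios:
        tokens, scores = prune_stage(tokens, scores, r)
    return tokens


if __name__ == "__main__":
    toks = [f"t{i}" for i in range(16)]
    scs = [i % 7 for i in range(16)]  # dummy relevance scores
    kept = multi_stage_compress(toks, scs)
    # 16 tokens -> 8 -> 4 -> 1 after the three stages
    print(len(kept), kept)
```

In the actual METok framework the retention decisions are tied to event boundaries, semantic alignment with the query, and KV Cache state, rather than a single static score as in this sketch.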
