Publication
Divergent Token Metrics: Measuring degradation to prune away LLM components - and optimize quantization
Björn Deiseroth; Max Meuer; Nikolas Gritsch; Constantin Eichenberg; Patrick Schramowski; Matthias Aßenmacher; Kristian Kersting
In: Kevin Duh; Helena Gómez-Adorno; Steven Bethard (eds.). Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), NAACL 2024, Mexico City, Mexico, June 16-21, 2024. Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), Pages 6764-6783, Association for Computational Linguistics, 2024.
Abstract
Large Language Models (LLMs) have reshaped natural language processing with their impressive capabilities. However, their ever-increasing size has raised concerns about their effective deployment and the need for LLM compression. This study introduces the Divergent Token Metrics (DTMs), a novel approach to assessing compressed LLMs, addressing the limitations of traditional perplexity or accuracy measures that fail to accurately reflect text generation quality. DTMs measure token divergences that allow deeper insights into the subtleties of model compression, in particular when evaluating components' impacts individually. Utilizing the First Divergent Token Metric (FDTM) in model sparsification reveals that 25% of all attention components can be pruned beyond 90% on the Llama-2 model family while still keeping SOTA performance. For quantization, FDTM suggests that more than 80% of the parameters can be naively transformed to int8 without special outlier management. These evaluations indicate the necessity of choosing appropriate compressions for parameters individually, and that FDTM can identify those, while standard metrics result in deteriorated outcomes.
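The core idea behind the FDTM can be pictured as follows. This is a minimal illustrative sketch, not the paper's exact formulation: it assumes both the reference model and its compressed counterpart decode greedily from the same prompt, and the helper name `first_divergent_token` is ours, chosen for clarity.

```python
def first_divergent_token(reference_tokens, compressed_tokens):
    """Return the index of the first position where the greedy
    generation of a compressed model diverges from that of the
    reference model. If the overlapping prefix is identical, the
    length of the shorter sequence is returned, i.e. a later
    divergence indicates less degradation from compression."""
    for i, (ref, comp) in enumerate(zip(reference_tokens, compressed_tokens)):
        if ref != comp:
            return i
    return min(len(reference_tokens), len(compressed_tokens))


# Example: the compressed model first deviates at position 2.
reference = [17, 42, 99, 5, 11]
compressed = [17, 42, 7, 5, 11]
print(first_divergent_token(reference, compressed))  # → 2
```

Scoring each component (e.g. an attention head or a weight matrix under a candidate quantization) by how early its removal causes the first divergence gives the per-component ranking the abstract alludes to.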
