
Publication

Divergent Token Metrics: Measuring degradation to prune away LLM components - and optimize quantization

Björn Deiseroth; Max Meuer; Nikolas Gritsch; Constantin Eichenberg; Patrick Schramowski; Matthias Aßenmacher; Kristian Kersting
In: Kevin Duh; Helena Gómez-Adorno; Steven Bethard (Eds.). Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), NAACL 2024, Mexico City, Mexico, June 16-21, 2024. Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), Pages 6764-6783, Association for Computational Linguistics, 2024.

Abstract

Large Language Models (LLMs) have reshaped natural language processing with their impressive capabilities. However, their ever-increasing size has raised concerns about their effective deployment and the need for LLM compression. This study introduces the Divergent Token Metrics (DTMs), a novel approach to assessing compressed LLMs, addressing the limitations of traditional perplexity or accuracy measures that fail to accurately reflect text generation quality. DTMs measure token divergences that allow deeper insights into the subtleties of model compression, in particular when evaluating components' impacts individually. Utilizing the First Divergent Token Metric (FDTM) in model sparsification reveals that 25% of all attention components can be pruned beyond 90% on the Llama-2 model family, still keeping SOTA performance. For quantization, FDTM suggests that more than 80% of the parameters can be naively transformed to int8 without special outlier management. These evaluations indicate the necessity of choosing appropriate compressions for parameters individually, and that FDTM can identify those, while standard metrics result in deteriorated outcomes.
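To illustrate the core idea behind the FDTM, here is a minimal sketch: given greedy-decoded token sequences from the original and the compressed model, it reports the index of the first token at which they diverge. The function names and the length normalization are illustrative assumptions, not taken from the paper's implementation.

```python
def first_divergent_token(tokens_ref, tokens_comp):
    """Index of the first token where the compressed model's greedy
    output diverges from the reference model's output.

    Returns the length of the common prefix, i.e. len(shorter sequence)
    if one sequence is a prefix of the other.
    """
    n = min(len(tokens_ref), len(tokens_comp))
    for i in range(n):
        if tokens_ref[i] != tokens_comp[i]:
            return i
    return n


def fdtm_score(tokens_ref, tokens_comp):
    # Hypothetical normalization to [0, 1]: 1.0 means no divergence
    # within the generated length; higher is better.
    max_len = max(len(tokens_ref), len(tokens_comp))
    return first_divergent_token(tokens_ref, tokens_comp) / max_len


# Toy example with made-up token ids: divergence at position 2.
ref = [12, 55, 7, 91, 3]
comp = [12, 55, 8, 91, 3]
print(first_divergent_token(ref, comp))  # → 2
print(fdtm_score(ref, comp))             # → 0.4
```

In practice one would compare generations from the full and the compressed model on a shared set of prompts and aggregate the per-prompt scores, e.g. to rank individual components by how little their compression perturbs generation.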
