Publication
Adaptive Knowledge Distillation for Efficient Domain-Specific Language Models.
Prajvi Saxena; Sabine Janzen; Wolfgang Maaß
In: 19th Women in Machine Learning Workshop (WiML 2024) at NeurIPS. Neural Information Processing Systems (NeurIPS-2024), December 10-15, Vancouver, British Columbia, Canada, 2024.
Abstract
Large Language Models (LLMs) such as GPT, LLaMA, and Mistral have demonstrated remarkable capabilities across a wide range of tasks. However, customizing these pre-trained models for domain-specific applications demands significant computation and memory, making them impractical to deploy in resource-constrained environments. Existing techniques, such as Knowledge Distillation (KD) [1, 5], Parameter-Efficient Fine-Tuning (PEFT) [2], and model parallelism, address these issues by reducing the model size and the number of trainable parameters. KD compresses the knowledge of a larger, more complex model (the teacher) into a smaller, more efficient model (the student) while attempting to preserve accuracy. In contrast, PEFT tunes only a small subset of parameters in a large pre-trained model and freezes the rest to reduce computational overhead. Key PEFT methods include Adapters, BitFit, LoRA, Compacter, and Soft Prompts, which differ in how they integrate and optimize a small set of parameters within the pre-trained model. Despite their advantages, these methods often fail to maintain the performance required in specialized domains. Moreover, current approaches to KD for LLMs typically rely on black-box distillation, which uses hard labels and fixed architectures, limiting the flexibility and effectiveness of knowledge transfer. We introduce AKD, an Adaptive Knowledge Distillation framework that addresses these limitations by integrating adapters with white-box distillation. AKD uses a soft-target cross-entropy loss to transfer knowledge, exposing the student model to the teacher's output distribution and its internal representations and thus preserving information critical for domain-specific tasks, an improvement over black-box distillation of LLMs. By integrating adapters (e.g., QLoRA [3]) into the student model, AKD focuses distillation on these newly added adapters while freezing the rest of the parameters.
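The soft-target cross-entropy loss mentioned above can be illustrated with a minimal sketch in plain Python. The abstract does not give the exact formulation used in AKD, so the function names, the temperature parameter, and its default value are assumptions chosen for illustration; this follows the common formulation in which the student's distribution is scored against the teacher's temperature-softened distribution.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def soft_target_cross_entropy(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy of the student's distribution against the teacher's.

    Both distributions are softened with the same temperature (an assumed
    hyperparameter, not taken from the abstract). Unlike a hard-label loss,
    every class contributes in proportion to the teacher's probability.
    """
    teacher_probs = [math.exp(lp) for lp in log_softmax(teacher_logits, temperature)]
    student_log_probs = log_softmax(student_logits, temperature)
    return -sum(p * lq for p, lq in zip(teacher_probs, student_log_probs))


if __name__ == "__main__":
    teacher = [2.0, 0.5, -1.0]       # hypothetical teacher logits for one token
    good_student = [1.8, 0.6, -0.9]  # close to the teacher
    poor_student = [-1.0, 0.5, 2.0]  # ranks the classes in reverse
    print(soft_target_cross_entropy(teacher, good_student))
    print(soft_target_cross_entropy(teacher, poor_student))
```

In an adapter-based setup such as the one described, a loss of this kind would be backpropagated only into the adapter parameters while the base weights stay frozen; that wiring depends on the training framework and is not shown here.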
Additionally, our pipeline employs an adaptive prompt-optimization mechanism motivated by PromptAid [4], which supports exploring, perturbing, testing, and iterating over prompts in order to prompt a language model more effectively. These prompts, refined through few-shot learning techniques, will enable the teacher model to produce more accurate and context-aware outputs, improving the quality of knowledge transfer to the student. AKD also features a progressive distillation strategy, in which knowledge is transferred in phases from simpler to more complex tasks. This incremental approach ensures that the student model captures both high-level abstractions and domain-specific representations from the teacher. Overall, AKD's hybrid approach not only addresses the limitations of black-box distillation but also improves computational efficiency, achieving performance comparable to traditional KD methods while reducing training overhead.
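The progressive distillation strategy, with phases ordered from simpler to more complex tasks, can be sketched as a simple curriculum split. The abstract does not say how task complexity is measured or how many phases are used, so the per-example "difficulty" score, the function name build_phases, and the even split are hypothetical choices made only to illustrate the idea.

```python
def build_phases(examples, num_phases):
    """Split examples into phases ordered from easiest to hardest.

    Each example is a dict with a numeric "difficulty" key (a hypothetical
    score; the abstract does not define one). Phases are as equal in size
    as possible, with earlier phases taking any remainder.
    """
    if num_phases < 1:
        raise ValueError("num_phases must be at least 1")
    ordered = sorted(examples, key=lambda ex: ex["difficulty"])
    base, extra = divmod(len(ordered), num_phases)
    phases = []
    start = 0
    for i in range(num_phases):
        size = base + (1 if i < extra else 0)
        phases.append(ordered[start:start + size])
        start += size
    return phases


if __name__ == "__main__":
    tasks = [
        {"id": "multi-step-reasoning", "difficulty": 3},
        {"id": "term-definition", "difficulty": 1},
        {"id": "short-qa", "difficulty": 2},
        {"id": "report-synthesis", "difficulty": 4},
    ]
    for number, phase in enumerate(build_phases(tasks, 2), start=1):
        # A distillation loop would train the student on each phase in turn.
        print(number, [ex["id"] for ex in phase])
```

A training loop would then run distillation on each phase in order, carrying the student's adapter weights from one phase into the next.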