Minimal Distillation Schedule for Extreme Language Model Compression

Zhang, Chen; Yang, Yang; Wang, Qifan; Liu, Jiahao; Wang, Jingang; Wu, Wei; Song, Dawei (2024). Minimal Distillation Schedule for Extreme Language Model Compression. In: 18th Conference of the European Chapter of the Association for Computational Linguistics (EACL), 17-22 Mar 2024, Malta.


Recent studies have revealed that language model distillation becomes less effective when there is a significant capacity gap between the teacher and the student models. To bridge this gap, teacher assistant-based distillation has been introduced, in which the selection of the teacher assistant plays a crucial role in transferring knowledge from the teacher to the student. However, existing approaches for teacher assistant-based distillation require numerous trials to find the optimal teacher assistant. In this paper, we propose a novel approach called Minimal Distillation Schedule (MINIDISC), which schedules an optimal teacher assistant in just one trial for extreme model compression (e.g., to 5% scale). In particular, we empirically show that the performance of the student is positively correlated with the scale-performance trade-off of the teacher assistant. We then introduce a new λ-tradeoff metric that quantifies the optimality of the teacher assistant without the need for trial distillation to the student. By employing a sandwich framework, MINIDISC can select the optimal teacher assistant with the best λ-tradeoff. We extensively evaluate MINIDISC through a series of experiments on the GLUE benchmark. The results demonstrate that our approach achieves improved efficiency compared to various state-of-the-art baselines. Furthermore, we showcase the scalability of MINIDISC by applying it to a language model with billions of parameters.
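The selection procedure the abstract describes can be sketched as scoring each candidate teacher assistant by a scale-performance trade-off and picking the best one in a single pass, with no trial distillation into the student. The scoring function below (performance minus a λ-weighted scale penalty), the candidate names, and all numeric values are illustrative assumptions, not the paper's exact λ-tradeoff definition:

```python
# Hypothetical sketch of teacher-assistant selection by a scale-performance
# trade-off, in the spirit of MINIDISC's lambda-tradeoff. The scoring
# function and all numbers here are assumptions for illustration only.

def lambda_tradeoff(performance: float, scale: float, lam: float = 1.0) -> float:
    """Score a teacher-assistant candidate: reward task performance,
    penalize model scale (fraction of teacher parameters retained)."""
    return performance - lam * scale

# Candidate teacher assistants: (fraction of teacher parameters, dev accuracy).
# These values are made up for the sake of the example.
candidates = {
    "ta_50%": (0.50, 0.86),
    "ta_25%": (0.25, 0.84),
    "ta_10%": (0.10, 0.78),
}

# Select the assistant with the best trade-off in one pass -- under this
# scheme, no trial distillation into the student is needed.
best = max(
    candidates,
    key=lambda name: lambda_tradeoff(candidates[name][1], candidates[name][0]),
)
print(best)
```

With λ = 1 the scale penalty dominates and the smallest candidate wins; a smaller λ would shift the choice toward higher-accuracy, larger assistants.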
