Natural Language Processing Faculty Publications

On the effect of dropping layers of pre-trained transformer models

Hassan Sajjad, Faculty of Computer Science, Dalhousie University, Canada
Fahim Dalvi, Qatar Computing Research Institute, Hamad Bin Khalifa University, Qatar
Nadir Durrani, Qatar Computing Research Institute, Hamad Bin Khalifa University, Qatar
Preslav Nakov, Mohamed bin Zayed University of Artificial Intelligence

Document Type

Article

Publication Title

Computer Speech and Language

Abstract

Transformer-based NLP models are trained using hundreds of millions or even billions of parameters, limiting their applicability in computationally constrained environments. While the number of parameters generally correlates with performance, it is not clear whether the entire network is required for a downstream task. Motivated by recent work on pruning and distilling pre-trained models, we explore strategies to drop layers in pre-trained models, and observe the effect of pruning on downstream GLUE tasks. We were able to prune BERT, RoBERTa and XLNet models by up to 40%, while maintaining up to 98% of their original performance. Additionally, we show that our pruned models are on par with those built using knowledge distillation, both in terms of size and performance. Our experiments yield interesting observations such as: (i) the lower layers are most critical to maintaining downstream task performance, (ii) some tasks, such as paraphrase detection and sentence similarity, are more robust to the dropping of layers, and (iii) models trained using different objective functions exhibit different learning patterns with respect to layer dropping. © 2022 Elsevier Ltd
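
The abstract does not spell out the individual layer-dropping strategies, so the following is only an illustrative sketch of one plausible variant: removing the topmost layers and keeping the lower ones, in line with observation (i) that the lower layers matter most. The function names, the 12-layer stack, and the choice to drop 4 layers are assumptions made for this example and are not taken from the paper.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def layers_to_keep(num_layers: int, num_to_drop: int) -> List[int]:
    """Return indices of the layers retained when the top layers are dropped.

    Hypothetical helper: index 0 is the lowest layer (closest to the
    embeddings), and the highest-indexed layers are the ones removed.
    """
    if not 0 <= num_to_drop < num_layers:
        raise ValueError("num_to_drop must be in [0, num_layers)")
    return list(range(num_layers - num_to_drop))


def drop_top_layers(layers: Sequence[T], num_to_drop: int) -> List[T]:
    """Return a new layer list without the top `num_to_drop` layers."""
    keep = layers_to_keep(len(layers), num_to_drop)
    return [layers[i] for i in keep]


# Stand-in for a 12-layer encoder stack (assumed size, for illustration only).
layers = [f"layer_{i}" for i in range(12)]
pruned = drop_top_layers(layers, 4)
print(len(pruned), pruned[-1])
```

In a real setting the list elements would be the model's transformer blocks, and the pruned model would then be fine-tuned on the downstream task; those steps are outside the scope of this sketch.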

DOI

10.1016/j.csl.2022.101429

Publication Date

1-2023

Keywords

Efficient transfer learning, Interpretation and analysis, Pre-trained transformer models, Downstream, Objective functions, Performance, Sentence similarity, Task performance, Transfer learning, Transformer modeling

Comments

IR Deposit conditions:

OA version (pathway b) Accepted version

24 months embargo

License: CC BY NC-ND

Must link to publisher version with DOI

Recommended Citation

H. Sajjad, F. Dalvi, N. Durrani, and P. Nakov, "On the effect of dropping layers of pre-trained transformer models", Computer Speech and Language, vol. 77, no. 101429, Jan 2023, doi: 10.1016/j.csl.2022.101429

Additional Links

https://doi.org/10.1016/j.csl.2022.101429
