New paper on sparsity and quantization of attention in transformer networks

Our new paper, led by PhD student Tianchu Ji, will be published in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. The work will also be presented at the RepL4NLP workshop.

Tianchu Ji, Shraddhan Jain, Michael Ferdman, Peter Milder, H. Andrew Schwartz, and Niranjan Balasubramanian. “On the Distribution, Sparsity, and Inference-time Quantization of Attention Values in Transformers.” Accepted to appear in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021.

You can read the paper on arXiv, and you can find our data and code on GitHub.

Here is Tianchu’s three-minute overview video:

Abstract: How much information do NLP tasks really need from a transformer’s attention mechanism at application-time (inference)? From recent work, we know that there is sparsity in transformers and that the floating-point values within their computation can be discretized to fewer values with minimal loss to task accuracies. However, this requires retraining or even creating entirely new models, both of which can be expensive and carbon-emitting. Focused on optimizations that do not require training, we systematically study the full range of typical attention values necessary. This informs the design of an inference-time quantization technique using both pruning and log-scaled mapping, which produces only a few (e.g., 2³) unique values. Over the tasks of question answering and sentiment analysis, we find nearly 80% of attention values can be pruned to zeros with minimal (< 1.0%) relative loss in accuracy. We use this pruning technique in conjunction with quantizing the attention values to only a 3-bit format, without retraining, resulting in only a 0.8% accuracy reduction on question answering with a fine-tuned RoBERTa.
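
To make the two steps in the abstract concrete, here is a minimal PyTorch sketch of inference-time pruning plus log-scaled quantization of post-softmax attention values. It is not the paper’s implementation (that is in the GitHub repository linked above): the function name is hypothetical, and the fixed per-row top-k pruning and per-tensor log grid are illustrative assumptions; the paper’s exact thresholding and level placement may differ.

```python
import torch

def prune_and_quantize_attention(attn: torch.Tensor,
                                 prune_frac: float = 0.8,
                                 n_bits: int = 3) -> torch.Tensor:
    """Zero out the smallest attention values, then snap the survivors onto
    2**n_bits log-spaced levels. `attn` holds post-softmax attention weights
    with the key dimension last. (Illustrative sketch, not the paper's code.)"""
    # --- Pruning: keep only the largest (1 - prune_frac) values in each row. ---
    k = max(1, int(attn.shape[-1] * (1 - prune_frac)))
    top_vals, top_idx = attn.topk(k, dim=-1)
    pruned = torch.zeros_like(attn).scatter(-1, top_idx, top_vals)

    # --- Log-scaled quantization of the surviving nonzero values. ---
    levels = 2 ** n_bits                       # e.g. 8 unique values for a 3-bit format
    nonzero = pruned > 0
    logs = torch.log2(pruned[nonzero])         # quantize in log space
    lo, hi = logs.min(), logs.max()
    step = ((hi - lo) / (levels - 1)).clamp(min=1e-12)
    snapped = lo + torch.round((logs - lo) / step) * step
    out = pruned.clone()
    out[nonzero] = 2.0 ** snapped
    return out

# Example: quantize the attention of a 12-head model over a small batch.
attn = torch.softmax(torch.randn(2, 12, 128, 128), dim=-1)    # (batch, heads, query, key)
q_attn = prune_and_quantize_attention(attn)
print(q_attn.count_nonzero() / q_attn.numel())                # roughly 20% of values survive
print(torch.unique(q_attn[q_attn > 0]).numel())               # at most 2**3 unique nonzero values
```

Both steps operate only on the attention values at inference time, which is why no retraining is required.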


This entry was posted on June 25, 2021.