DFRot: Achieving Outlier-Free and Massive Activation-Free for Rotated LLMs with Refined Rotation

Jingyang Xiang, Sai Qian Zhang
New York University
COLM 2025
TL;DR: For 4-bit activation quantization, we explain why randomized Hadamard transforms can achieve significantly higher accuracy than randomized orthogonal transforms.

Comparison of 4-bit activation quantization error E(·) for each token with no rotation (NR), randomized orthogonal transforms (RO) and randomized Hadamard transforms (RH) for (a) LLaMA-7B, (b) LLaMA2-7B, (c) LLaMA2-13B and (d) LLaMA3-8B. The tokens are taken from model.layers.6.post_attention_layernorm.
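
To make this comparison concrete, below is a minimal sketch (not the paper's code) of how the per-token 4-bit quantization error E(·) can be measured under NR, RO and RH. The hidden size, the injected outlier pattern, and the symmetric round-to-nearest per-token quantizer are illustrative assumptions.

import numpy as np
from scipy.linalg import hadamard
from scipy.stats import ortho_group

def quantize_per_token(x, bits=4):
    # Symmetric per-token quantization: each token (row) gets its own scale.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    return np.round(x / scale).clip(-qmax - 1, qmax) * scale

def quant_error(x):
    # E(x) = ||x - Q(x)||_2 per token.
    return np.linalg.norm(x - quantize_per_token(x), axis=-1)

d = 128                                   # assumed hidden size (power of 2 for Hadamard)
rng = np.random.default_rng(0)
x = rng.normal(size=(16, d))
x[:, :4] *= 50.0                          # a few outlier channels shared by all tokens
x[0] *= 100.0                             # one token with a "massive activation"

# RH: H * diag(random signs) / sqrt(d); RO: a Haar-random orthogonal matrix.
RH = hadamard(d) @ np.diag(rng.choice([-1.0, 1.0], size=d)) / np.sqrt(d)
RO = ortho_group.rvs(d, random_state=0)

for name, R in [("NR", np.eye(d)), ("RO", RO), ("RH", RH)]:
    print(name, quant_error(x @ R).mean())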


Comparison of 2D 4-bit quantization errors for tokens with NR, RO, RH and DFRot for LLaMA3-8B.


Comparison of 4-bit quantization error for the token with massive activations under NR, RO, RH and DFRot for LLaMA3-8B.

Abstract

Rotating the activation and weight matrices to reduce the influence of outliers in large language models (LLMs) has recently attracted significant attention, particularly in the context of model quantization. Prior studies have shown that in low-precision quantization scenarios, such as 4-bit weights and 4-bit activations (W4A4), randomized Hadamard transforms can achieve significantly higher accuracy than randomized orthogonal transforms. Notably, the reason behind this phenomenon remains unknown. In this paper, we find that both transformations substantially reduce outliers for common tokens and yield similar quantization error. The primary reason for the accuracy difference is that randomized Hadamard transforms can slightly reduce the quantization error for tokens with massive activations, whereas randomized orthogonal transforms increase it. Because these tokens are extremely rare yet have a critical impact on model accuracy, we treat this as a long-tail optimization problem and address it with a simple yet effective weighted loss function. Additionally, we propose an optimization strategy for the rotation matrix that alternates between optimizing the quantization parameters and refining the rotation matrix with orthogonal Procrustes transforms. This makes the distribution of the rotated activation values more conducive to quantization, especially for tokens with massive activations. Our method makes rotated LLMs doubly free, Outlier-Free and Massive Activation-Free, and is accordingly dubbed DFRot. Extensive experiments demonstrate the effectiveness and efficiency of DFRot. By tuning the rotation matrix using just a single sample, DFRot achieves a perplexity improvement of 0.98 and 0.95 on W4A4KV4 and W4A4KV16, respectively, for LLaMA3-70B, a model known for its quantization challenges. Code is available on GitHub.
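
The refinement step described above can be sketched as follows. This is a simplified illustration under our own assumptions, not the released DFRot implementation: it alternates between quantizing the rotated activations and updating the rotation with the closed-form orthogonal Procrustes solution, while a heuristic per-token weight (controlled by an assumed parameter alpha) upweights rare tokens with massive activations in the reconstruction loss.

import numpy as np

def quantize_per_token(x, bits=4):
    # Symmetric round-to-nearest per-token quantization (an assumption, for illustration).
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    return np.round(x / scale).clip(-qmax - 1, qmax) * scale

def refine_rotation(X, R, steps=10, alpha=2.0):
    # Alternate: (1) quantize X @ R with the current rotation;
    # (2) solve min_R ||diag(w) (X R - Y)||_F over orthogonal R.
    for _ in range(steps):
        Y = quantize_per_token(X @ R)
        # Heuristic token weights: emphasize tokens with large norms (massive activations).
        norms = np.linalg.norm(X, axis=1)
        w = 1.0 + alpha * norms / norms.max()
        # Orthogonal Procrustes: with M = X^T diag(w)^2 Y = U S V^T, the minimizer is R = U V^T.
        M = (X * w[:, None]).T @ (Y * w[:, None])
        U, _, Vt = np.linalg.svd(M)
        R = U @ Vt
    return R

Because the Procrustes solution is itself orthogonal, this update keeps R orthogonal at every step of the alternation.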


BibTeX

@article{xiang2024dfrot,
  title={DFRot: Achieving Outlier-Free and Massive Activation-Free for Rotated LLMs with Refined Rotation},
  author={Xiang, Jingyang and Zhang, Sai Qian},
  journal={arXiv preprint arXiv:2412.00648},
  year={2024}
}