Efficient implementation of the HIGGS (Hadamard Incoherence with Gaussian MSE-optimal GridS) non-uniform quantization approach for Large Language Models, based on the Linearity Theorem.
This implementation is based on the paper:
- "Pushing the Limits of Large Language Model Quantization via the Linearity Theorem"
- arXiv: https://arxiv.org/abs/2411.17525
- Published at NAACL 2025
HIGGS provides a principled approach to non-uniform quantization of LLMs by:
- Noise Injection: Simulating quantization effects by injecting calibrated Gaussian noise
- Linearity Theorem: Establishing a linear relationship between layer-wise reconstruction error and perplexity increase
- Alpha Calibration: Computing layer-wise sensitivity coefficients (α_ℓ) via linear regression
- Optimal Assignment: Solving a knapsack problem to find the optimal per-layer bitwidth allocation
- ✅ Efficient: Leverages HuggingFace Transformers and Accelerate for fast inference
- ✅ Flexible: Supports any LLM architecture (dense and MoE models)
- ✅ Principled: Based on theoretical foundations from the Linearity Theorem
- ✅ Practical: Outputs actionable bitwidth assignments
- ✅ MoE Support: Handles Mixture-of-Experts models with grouped expert quantization
git clone https://github.com/yourusername/higgs-quantization.git
cd higgs-quantization
pip install -e .

- Python >= 3.8
- PyTorch >= 2.0.0
- Transformers >= 4.35.0
- CUDA-capable GPU (recommended)
Run calibration to compute layer-wise sensitivity coefficients:
python scripts/calibrate_alphas.py \
--model_name meta-llama/Llama-2-7b-hf \
--num_samples 100 \
--bits_range 3 4 5 6 8 \
--output_dir ./outputs/llama2-7b

This will:
- Load the model
- Extract all linear layers
- Load calibration data from Fineweb
- For each layer and bitwidth:
- Inject noise simulating quantization
- Measure PPL impact
- Record noise norms and layer norms
- Solve for α_ℓ values using linear regression
- Save results to ./outputs/llama2-7b/alpha_values.json
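As a quick sanity check, you can inspect the saved coefficients directly. A minimal sketch (assuming the alpha_values.json layout documented in the output-format section below) that lists the most sensitive layers:

```python
import json

# Load the calibrated sensitivity coefficients (layout shown in the output-format section below)
with open("./outputs/llama2-7b/alpha_values.json") as f:
    alpha_values = json.load(f)["alpha_values"]

# The largest alphas mark the layers whose quantization noise hurts perplexity the most
for name, alpha in sorted(alpha_values.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print(f"{name}: {alpha:.3e}")
```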
Use the calibrated alpha values to find optimal bitwidth allocation:
python scripts/solve_assignment.py \
--alpha_file ./outputs/llama2-7b/alpha_values.json \
--metadata_file ./outputs/llama2-7b/layer_metadata.json \
--target_avg_bits 4.0 \
--method dp \
--output_file ./outputs/llama2-7b/assignment_4bit.json

This will:
- Load alpha values and layer metadata
- Solve the knapsack problem using dynamic programming
- Find the optimal per-layer bitwidth assignment
- Minimize expected PPL increase subject to the bit budget
- Save the assignment to JSON
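To inspect the result before quantizing, load the saved assignment and summarize it; a minimal sketch assuming the assignment JSON layout documented in the output-format section below:

```python
import json
from collections import Counter

# Load the saved assignment (layout shown in the output-format section below)
with open("./outputs/llama2-7b/assignment_4bit.json") as f:
    result = json.load(f)

assignment = result["assignment"]  # layer name -> bitwidth
print("Bitwidth distribution:", Counter(assignment.values()))
print("Reported average bits:", result["metrics"]["avg_bits"])
```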
from higgs_quantization import (
    ModelHandler,
    CalibrationDataset,
    PerplexityEvaluator,
    NoiseInjector,
    AlphaSolver
)
# 1. Load model
handler = ModelHandler("meta-llama/Llama-2-7b-hf")
model = handler.load_model()
linear_layers = handler.extract_linear_layers()
# 2. Load calibration data
calib = CalibrationDataset(
    tokenizer_name="meta-llama/Llama-2-7b-hf",
    num_samples=100
)
calib.load_tokenizer()
samples = calib.load_dataset()
# 3. Measure baseline PPL
evaluator = PerplexityEvaluator(model, calib.tokenizer)
baseline_ppl = evaluator.evaluate_batched(samples)
# 4. Calibrate alpha values
injector = NoiseInjector(model)
solver = AlphaSolver()
for layer_name, module in linear_layers.items():
    for bits in [3, 4, 5, 6, 8]:
        # Inject noise
        noise_norm_sq, layer_norm_sq = injector.inject_noise(
            layer_name, module, bits
        )

        # Measure PPL
        noisy_ppl = evaluator.evaluate_batched(samples)
        ppl_increase = noisy_ppl - baseline_ppl

        # Record measurement
        solver.add_measurement(
            layer_name, bits, noise_norm_sq, layer_norm_sq,
            ppl_increase, baseline_ppl, noisy_ppl
        )

        # Remove noise
        injector.remove_noise(layer_name)

# 5. Solve for alphas
alpha_values = solver.solve_alpha_all_layers()
solver.save_alpha_values("alpha_values.json")

from higgs_quantization import KnapsackSolver
import json
# Load alpha values and metadata
with open("alpha_values.json") as f:
    alpha_data = json.load(f)
alpha_values = alpha_data['alpha_values']

with open("layer_metadata.json") as f:
    metadata = json.load(f)
layer_sizes = {
    name: meta['num_parameters']
    for name, meta in metadata.items()
}
# Create solver
solver = KnapsackSolver(
    alpha_values=alpha_values,
    layer_sizes=layer_sizes,
    bits_choices=[3, 4, 5, 6, 8]
)
# Solve for 4-bit average
assignment = solver.solve(target_avg_bits=4.0)
# Evaluate
metrics = solver.evaluate_assignment(assignment)
print(f"Average bits: {metrics['avg_bits']:.2f}")
print(f"Expected cost: {metrics['total_cost']:.4e}")
# Save
solver.save_assignment(assignment, "assignment_4bit.json")

The package automatically handles MoE models:
handler = ModelHandler("mistralai/Mixtral-8x7B-v0.1")
model = handler.load_model()
# Extract layers (MoE experts are automatically identified)
linear_layers = handler.extract_linear_layers()
# Group MoE experts
moe_groups = handler.group_moe_experts()
# All experts in the same MoE layer will be assigned the same bitwidth
# This is handled automatically during calibration

The file alpha_values.json contains the calibrated sensitivity coefficients:
{
  "alpha_values": {
    "model.layers.0.self_attn.q_proj": 1.234e-5,
    "model.layers.0.self_attn.k_proj": 8.765e-6,
    ...
  },
  "num_measurements": 500,
  "num_layers": 100
}

The assignment file (e.g. assignment_4bit.json) contains the optimal bitwidth assignment:
{
  "assignment": {
    "model.layers.0.self_attn.q_proj": 5,
    "model.layers.0.self_attn.k_proj": 4,
    "model.layers.0.mlp.down_proj": 3,
    ...
  },
  "metrics": {
    "total_cost": 0.123,
    "avg_bits": 4.02,
    "bitwidth_distribution": {
      "3": 20,
      "4": 50,
      "5": 25,
      "6": 5
    }
  }
}

The core insight is that the perplexity increase is approximately linear in the layer-wise reconstruction error:
ΔPPL ≈ Σ_ℓ (α_ℓ × ||noise_ℓ||²)
where:
- ΔPPL is the perplexity increase
- α_ℓ is the sensitivity coefficient for layer ℓ
- ||noise_ℓ||² is the squared L2 norm of the quantization noise in layer ℓ
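In practice this means the PPL impact of any candidate configuration can be predicted from the calibrated alphas alone, without re-running the model; a toy sketch with made-up numbers:

```python
# Hypothetical calibrated alphas and squared noise norms for one candidate configuration
alpha = {"q_proj": 1.2e-5, "k_proj": 8.8e-6, "down_proj": 2.5e-5}
noise_norm_sq = {"q_proj": 310.0, "k_proj": 290.0, "down_proj": 870.0}

# Linearity Theorem: the predicted PPL increase is the alpha-weighted sum of squared noise norms
delta_ppl = sum(alpha[name] * noise_norm_sq[name] for name in alpha)
print(f"Predicted PPL increase: {delta_ppl:.4f}")
```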
To simulate quantization without actually quantizing:
- Compute expected quantization noise std for target bitwidth
- Generate Gaussian noise with that std
- Add noise to layer weights
- Measure PPL impact
- Revert to original weights
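Below is a standalone sketch of these five steps on a plain nn.Linear. It is not the package's NoiseInjector: the noise std here uses a crude 2^-bits proxy, whereas the actual implementation derives it from the expected MSE of the Gaussian-optimal grid at the target bitwidth.

```python
import torch
import torch.nn as nn

def inject_gaussian_noise(module: nn.Linear, bits: int):
    """Add Gaussian noise simulating `bits`-bit quantization.
    Returns (||noise||^2, ||W||^2, original W) so the caller can record norms and revert."""
    weight = module.weight.data
    noise_std = weight.std() * 2.0 ** (-bits)   # crude proxy; the package uses the grid's expected MSE
    noise = torch.randn_like(weight) * noise_std
    original = weight.clone()
    module.weight.data = weight + noise
    return noise.pow(2).sum().item(), original.pow(2).sum().item(), original

# Toy usage: inject, measure PPL here, then revert to the original weights
layer = nn.Linear(16, 16)
noise_norm_sq, layer_norm_sq, original = inject_gaussian_noise(layer, bits=4)
layer.weight.data = original
```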
For each layer, we collect multiple measurements at different bitwidths and perform linear regression:
y = α × x
where:
- y = PPL increase
- x = ||noise||² (squared noise norm)
- α = sensitivity coefficient (slope)
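Because the model has no intercept, each α is the least-squares slope through the origin, α = Σᵢ xᵢyᵢ / Σᵢ xᵢ². A sketch with hypothetical measurements for a single layer:

```python
import numpy as np

# Hypothetical measurements for one layer: squared noise norms (x) and PPL increases (y)
x = np.array([120.0, 480.0, 1900.0, 7600.0])   # ||noise||^2 at 8, 6, 5, 4 bits (illustrative values)
y = np.array([0.002, 0.007, 0.031, 0.118])     # corresponding PPL increases over the baseline

# Least-squares fit of y = alpha * x (regression through the origin)
alpha = float(x @ y / (x @ x))
print(f"alpha = {alpha:.3e}")
```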
Find the optimal bitwidth assignment that minimizes:
minimize: Σ_ℓ (α_ℓ × noise_ℓ²(bits_ℓ))
subject to: Σ_ℓ (params_ℓ × bits_ℓ) ≤ budget
Solved using dynamic programming with O(n × budget) complexity.
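For intuition, here is a simplified greedy sketch of this allocation problem. It is not the package's solver, and its noise model (expected squared noise ∝ params_ℓ × 4^(-bits)) is an assumption for illustration only:

```python
def greedy_assign(alpha, sizes, bits_choices, target_avg_bits):
    """Greedy bit allocation sketch: start every layer at the widest bitwidth, then repeatedly
    downgrade the layer that costs the least extra PPL per bit saved, until under budget."""
    cost = lambda name, b: alpha[name] * sizes[name] * 4.0 ** (-b)   # assumed noise model
    assignment = {name: max(bits_choices) for name in alpha}
    budget = target_avg_bits * sum(sizes.values())
    while sum(sizes[n] * b for n, b in assignment.items()) > budget:
        candidates = []
        for name, current in assignment.items():
            lower = [b for b in bits_choices if b < current]
            if lower:
                b = max(lower)
                delta_cost = cost(name, b) - cost(name, current)
                bits_saved = (current - b) * sizes[name]
                candidates.append((delta_cost / bits_saved, name, b))
        _, name, b = min(candidates)
        assignment[name] = b
    return assignment

alpha = {"q_proj": 1.2e-5, "down_proj": 2.5e-5}           # hypothetical calibrated alphas
sizes = {"q_proj": 16_777_216, "down_proj": 45_088_768}   # illustrative parameter counts
print(greedy_assign(alpha, sizes, bits_choices=[3, 4, 5, 6, 8], target_avg_bits=4.0))
```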
usage: calibrate_alphas.py [-h] --model_name MODEL_NAME
[--device_map DEVICE_MAP]
[--torch_dtype {float16,float32,bfloat16}]
[--dataset_name DATASET_NAME]
[--num_samples NUM_SAMPLES]
[--max_length MAX_LENGTH]
[--bits_range BITS_RANGE [BITS_RANGE ...]]
[--output_dir OUTPUT_DIR]
[--save_measurements]
[--batch_size BATCH_SIZE]
Required arguments:
--model_name MODEL_NAME HuggingFace model name or path
Optional arguments:
--device_map DEVICE_MAP Device map (default: auto)
--num_samples NUM_SAMPLES Calibration samples (default: 100)
--bits_range BITS Bitwidths to test (default: 3 4 5 6 8)
--output_dir OUTPUT_DIR Output directory (default: ./higgs_outputs)
usage: solve_assignment.py [-h] --alpha_file ALPHA_FILE
--metadata_file METADATA_FILE
[--target_avg_bits TARGET_AVG_BITS]
[--bits_choices BITS_CHOICES [BITS_CHOICES ...]]
[--method {dp,greedy}]
[--output_file OUTPUT_FILE]
Required arguments:
--alpha_file ALPHA_FILE Path to alpha values JSON
--metadata_file METADATA_FILE Path to layer metadata JSON
Optional arguments:
--target_avg_bits AVG_BITS Target average bits (default: 4.0)
--method {dp,greedy} Solving method (default: dp)
--bits_choices BITS Available bitwidths (default: 3 4 5 6 8)
Include/exclude specific layer types:
python scripts/calibrate_alphas.py \
--model_name meta-llama/Llama-2-7b-hf \
--include_patterns q_proj k_proj v_proj o_proj \
--exclude_patterns lm_head embed

For very large models, use the greedy solver for faster (but potentially suboptimal) solutions:
python scripts/solve_assignment.py \
--alpha_file alpha_values.json \
--metadata_file layer_metadata.json \
--target_avg_bits 4.0 \
--method greedy

For large models where the single-layer PPL impact is small, calibrate multiple layers jointly:
python scripts/calibrate_alphas.py \
--model_name meta-llama/Llama-2-70b-hf \
--layers_per_step 4

This implementation supports:
- ✅ Dense Models: Llama, Mistral, GPT, etc.
- ✅ MoE Models: Mixtral, DeepSeek-MoE, etc.
- ✅ All Linear Layers: Attention (Q/K/V/O), MLP (gate/up/down), etc.
- ✅ Custom Architectures: Any model with linear layers
- Use FP16/BF16: Reduces memory usage and speeds up inference
- Batch Evaluation: Use larger batch sizes for PPL measurement
- Fewer Samples: 100 calibration samples is usually sufficient
- GPU: CUDA-capable GPU highly recommended
- Parallel Layers: For large models, test multiple layers simultaneously
- Requires calibration data (but only ~100 samples)
- Alpha calibration can be time-consuming for very large models
- Assumes Gaussian-like weight distributions (after Hadamard preprocessing)
- Does not include actual quantization kernels (only finds optimal assignment)
Contributions are welcome! Please open an issue or submit a pull request.
MIT License
If you use this implementation, please cite the original HIGGS paper:
@inproceedings{higgs2025,
title={Pushing the Limits of Large Language Model Quantization via the Linearity Theorem},
author={[Authors]},
booktitle={NAACL},
year={2025}
}

This implementation is based on the HIGGS quantization method and the Linearity Theorem described in the paper. We thank the authors for their groundbreaking work.