Efficient implementation of the HIGGS (Hadamard Incoherence with Gaussian MSE-optimal GridS) non-uniform quantization approach for Large Language Models, based on the Linearity Theorem.
This implementation is based on the paper:
- "Pushing the Limits of Large Language Model Quantization via the Linearity Theorem"
- arXiv: https://arxiv.org/abs/2411.17525
- Published at NAACL 2025
HIGGS provides a principled approach to non-uniform quantization of LLMs by:
- Noise Injection: Simulating quantization effects by injecting calibrated Gaussian noise
- Linearity Theorem: Establishing a linear relationship between layer-wise reconstruction error and perplexity increase
- Alpha Calibration: Computing layer-wise sensitivity coefficients (α_ℓ) via linear regression
- Optimal Assignment: Solving a knapsack problem to find the optimal per-layer bitwidth allocation
- ✅ Efficient: Leverages HuggingFace Transformers and Accelerate for fast inference
- ✅ Flexible: Supports any LLM architecture (dense and MoE models)
- ✅ Principled: Based on theoretical foundations from the Linearity Theorem
- ✅ Practical: Outputs actionable bitwidth assignments
- ✅ MoE Support: Handles Mixture-of-Experts models with grouped expert quantization
git clone https://github.com/yourusername/higgs-quantization.git
cd higgs-quantization
pip install -e .

- Python >= 3.8
- PyTorch >= 2.0.0
- Transformers >= 4.35.0
- CUDA-capable GPU (recommended)
Run calibration to compute layer-wise sensitivity coefficients:
python scripts/calibrate_alphas.py \
--model_name meta-llama/Llama-2-7b-hf \
--num_samples 100 \
--bits_range 3 4 5 6 8 \
--output_dir ./outputs/llama2-7b

This will:
- Load the model
- Extract all linear layers
- Load calibration data from Fineweb
- For each layer and bitwidth:
- Inject noise simulating quantization
- Measure PPL impact
- Record noise norms and layer norms
- Solve for α_ℓ values using linear regression
- Save results to ./outputs/llama2-7b/alpha_values.json
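As a quick sanity check, you can inspect the saved coefficients directly. A minimal sketch (assuming the alpha_values.json layout documented in the output-format section below) that lists the most sensitive layers:

```python
import json

# Load the calibrated sensitivity coefficients (layout shown in the output-format section below)
with open("./outputs/llama2-7b/alpha_values.json") as f:
    alpha_values = json.load(f)["alpha_values"]

# The largest alphas mark the layers whose quantization noise hurts perplexity the most
for name, alpha in sorted(alpha_values.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print(f"{name}: {alpha:.3e}")
```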
Use the calibrated alpha values to find optimal bitwidth allocation:
python scripts/solve_assignment.py \
--alpha_file ./outputs/llama2-7b/alpha_values.json \
--metadata_file ./outputs/llama2-7b/layer_metadata.json \
--target_avg_bits 4.0 \
--method dp \
--output_file ./outputs/llama2-7b/assignment_4bit.json

This will:
- Load alpha values and layer metadata
- Solve the knapsack problem using dynamic programming
- Find the optimal per-layer bitwidth assignment
- Minimize expected PPL increase subject to the bit budget
- Save the assignment to JSON
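To inspect the result before quantizing, load the saved assignment and summarize it; a minimal sketch assuming the assignment JSON layout documented in the output-format section below:

```python
import json
from collections import Counter

# Load the saved assignment (layout shown in the output-format section below)
with open("./outputs/llama2-7b/assignment_4bit.json") as f:
    result = json.load(f)

assignment = result["assignment"]  # layer name -> bitwidth
print("Bitwidth distribution:", Counter(assignment.values()))
print("Reported average bits:", result["metrics"]["avg_bits"])
```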
from higgs_quantization import (
    ModelHandler,
    CalibrationDataset,
    PerplexityEvaluator,
    NoiseInjector,
    AlphaSolver
)
# 1. Load model
handler = ModelHandler("meta-llama/Llama-2-7b-hf")
model = handler.load_model()
linear_layers = handler.extract_linear_layers()
# 2. Load calibration data
calib = CalibrationDataset(
    tokenizer_name="meta-llama/Llama-2-7b-hf",
    num_samples=100
)
calib.load_tokenizer()
samples = calib.load_dataset()
# 3. Measure baseline PPL
evaluator = PerplexityEvaluator(model, calib.tokenizer)
baseline_ppl = evaluator.evaluate_batched(samples)
# 4. Calibrate alpha values
injector = NoiseInjector(model)
solver = AlphaSolver()
for layer_name, module in linear_layers.items():
    for bits in [3, 4, 5, 6, 8]:
        # Inject noise
        noise_norm_sq, layer_norm_sq = injector.inject_noise(
            layer_name, module, bits
        )

        # Measure PPL
        noisy_ppl = evaluator.evaluate_batched(samples)
        ppl_increase = noisy_ppl - baseline_ppl

        # Record measurement
        solver.add_measurement(
            layer_name, bits, noise_norm_sq, layer_norm_sq,
            ppl_increase, baseline_ppl, noisy_ppl
        )

        # Remove noise
        injector.remove_noise(layer_name)

# 5. Solve for alphas
alpha_values = solver.solve_alpha_all_layers()
solver.save_alpha_values("alpha_values.json")

from higgs_quantization import KnapsackSolver
import json
# Load alpha values and metadata
with open("alpha_values.json") as f:
    alpha_data = json.load(f)
alpha_values = alpha_data['alpha_values']

with open("layer_metadata.json") as f:
    metadata = json.load(f)
layer_sizes = {
    name: meta['num_parameters']
    for name, meta in metadata.items()
}
# Create solver
solver = KnapsackSolver(
    alpha_values=alpha_values,
    layer_sizes=layer_sizes,
    bits_choices=[3, 4, 5, 6, 8]
)
# Solve for 4-bit average
assignment = solver.solve(target_avg_bits=4.0)
# Evaluate
metrics = solver.evaluate_assignment(assignment)
print(f"Average bits: {metrics['avg_bits']:.2f}")
print(f"Expected cost: {metrics['total_cost']:.4e}")
# Save
solver.save_assignment(assignment, "assignment_4bit.json")

The package automatically handles MoE models:
handler = ModelHandler("mistralai/Mixtral-8x7B-v0.1")
model = handler.load_model()
# Extract layers (MoE experts are automatically identified)
linear_layers = handler.extract_linear_layers()
# Group MoE experts
moe_groups = handler.group_moe_experts()
# All experts in the same MoE layer will be assigned the same bitwidth
# This is handled automatically during calibration

The file alpha_values.json contains the calibrated sensitivity coefficients:
{
  "alpha_values": {
    "model.layers.0.self_attn.q_proj": 1.234e-5,
    "model.layers.0.self_attn.k_proj": 8.765e-6,
    ...
  },
  "num_measurements": 500,
  "num_layers": 100
}

The assignment file (e.g. assignment_4bit.json) contains the optimal bitwidth assignment:
{
  "assignment": {
    "model.layers.0.self_attn.q_proj": 5,
    "model.layers.0.self_attn.k_proj": 4,
    "model.layers.0.mlp.down_proj": 3,
    ...
  },
  "metrics": {
    "total_cost": 0.123,
    "avg_bits": 4.02,
    "bitwidth_distribution": {
      "3": 20,
      "4": 50,
      "5": 25,
      "6": 5
    }
  }
}

The core insight is that the perplexity increase is approximately linear in the layer-wise reconstruction error:
ΔPPL ≈ Σ_ℓ (α_ℓ × ||noise_ℓ||²)
where:
- ΔPPL is the perplexity increase
- α_ℓ is the sensitivity coefficient for layer ℓ
- ||noise_ℓ||² is the squared L2 norm of the quantization noise in layer ℓ
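In practice this means the PPL impact of any candidate configuration can be predicted from the calibrated alphas alone, without re-running the model; a toy sketch with made-up numbers:

```python
# Hypothetical calibrated alphas and squared noise norms for one candidate configuration
alpha = {"q_proj": 1.2e-5, "k_proj": 8.8e-6, "down_proj": 2.5e-5}
noise_norm_sq = {"q_proj": 310.0, "k_proj": 290.0, "down_proj": 870.0}

# Linearity Theorem: the predicted PPL increase is the alpha-weighted sum of squared noise norms
delta_ppl = sum(alpha[name] * noise_norm_sq[name] for name in alpha)
print(f"Predicted PPL increase: {delta_ppl:.4f}")
```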
To simulate quantization without actually quantizing:
- Compute expected quantization noise std for target bitwidth
- Generate Gaussian noise with that std
- Add noise to layer weights
- Measure PPL impact
- Revert to original weights
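Below is a standalone sketch of these five steps on a plain nn.Linear. It is not the package's NoiseInjector: the noise std here uses a crude 2^-bits proxy, whereas the actual implementation derives it from the expected MSE of the Gaussian-optimal grid at the target bitwidth.

```python
import torch
import torch.nn as nn

def inject_gaussian_noise(module: nn.Linear, bits: int):
    """Add Gaussian noise simulating `bits`-bit quantization.
    Returns (||noise||^2, ||W||^2, original W) so the caller can record norms and revert."""
    weight = module.weight.data
    noise_std = weight.std() * 2.0 ** (-bits)   # crude proxy; the package uses the grid's expected MSE
    noise = torch.randn_like(weight) * noise_std
    original = weight.clone()
    module.weight.data = weight + noise
    return noise.pow(2).sum().item(), original.pow(2).sum().item(), original

# Toy usage: inject, measure PPL here, then revert to the original weights
layer = nn.Linear(16, 16)
noise_norm_sq, layer_norm_sq, original = inject_gaussian_noise(layer, bits=4)
layer.weight.data = original
```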
For each layer, we collect multiple measurements at different bitwidths and perform linear regression:
y = α × x
where:
- y = PPL increase
- x = ||noise||² (squared noise norm)
- α = sensitivity coefficient (slope)
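Because the model has no intercept, each α is the least-squares slope through the origin, α = Σᵢ xᵢyᵢ / Σᵢ xᵢ². A sketch with hypothetical measurements for a single layer:

```python
import numpy as np

# Hypothetical measurements for one layer: squared noise norms (x) and PPL increases (y)
x = np.array([120.0, 480.0, 1900.0, 7600.0])   # ||noise||^2 at 8, 6, 5, 4 bits (illustrative values)
y = np.array([0.002, 0.007, 0.031, 0.118])     # corresponding PPL increases over the baseline

# Least-squares fit of y = alpha * x (regression through the origin)
alpha = float(x @ y / (x @ x))
print(f"alpha = {alpha:.3e}")
```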
Find the optimal bitwidth assignment that minimizes:
minimize: Σ_ℓ (α_ℓ × noise_ℓ²(bits_ℓ))
subject to: Σ_ℓ (params_ℓ × bits_ℓ) ≤ budget
Solved using dynamic programming with O(n × budget) complexity.
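For intuition, here is a simplified greedy sketch of this allocation problem. It is not the package's solver, and its noise model (expected squared noise ∝ params_ℓ × 4^(-bits)) is an assumption for illustration only:

```python
def greedy_assign(alpha, sizes, bits_choices, target_avg_bits):
    """Greedy bit allocation sketch: start every layer at the widest bitwidth, then repeatedly
    downgrade the layer that costs the least extra PPL per bit saved, until under budget."""
    cost = lambda name, b: alpha[name] * sizes[name] * 4.0 ** (-b)   # assumed noise model
    assignment = {name: max(bits_choices) for name in alpha}
    budget = target_avg_bits * sum(sizes.values())
    while sum(sizes[n] * b for n, b in assignment.items()) > budget:
        candidates = []
        for name, current in assignment.items():
            lower = [b for b in bits_choices if b < current]
            if lower:
                b = max(lower)
                delta_cost = cost(name, b) - cost(name, current)
                bits_saved = (current - b) * sizes[name]
                candidates.append((delta_cost / bits_saved, name, b))
        _, name, b = min(candidates)
        assignment[name] = b
    return assignment

alpha = {"q_proj": 1.2e-5, "down_proj": 2.5e-5}           # hypothetical calibrated alphas
sizes = {"q_proj": 16_777_216, "down_proj": 45_088_768}   # illustrative parameter counts
print(greedy_assign(alpha, sizes, bits_choices=[3, 4, 5, 6, 8], target_avg_bits=4.0))
```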
usage: calibrate_alphas.py [-h] --model_name MODEL_NAME
[--device_map DEVICE_MAP]
[--torch_dtype {float16,float32,bfloat16}]
[--dataset_name DATASET_NAME]
[--num_samples NUM_SAMPLES]
[--max_length MAX_LENGTH]
[--bits_range BITS_RANGE [BITS_RANGE ...]]
[--output_dir OUTPUT_DIR]
[--save_measurements]
[--batch_size BATCH_SIZE]
Required arguments:
--model_name MODEL_NAME HuggingFace model name or path
Optional arguments:
--device_map DEVICE_MAP Device map (default: auto)
--num_samples NUM_SAMPLES Calibration samples (default: 100)
--bits_range BITS Bitwidths to test (default: 3 4 5 6 8)
--output_dir OUTPUT_DIR Output directory (default: ./higgs_outputs)
usage: solve_assignment.py [-h] --alpha_file ALPHA_FILE
--metadata_file METADATA_FILE
[--target_avg_bits TARGET_AVG_BITS]
[--bits_choices BITS_CHOICES [BITS_CHOICES ...]]
[--method {dp,greedy}]
[--output_file OUTPUT_FILE]
Required arguments:
--alpha_file ALPHA_FILE Path to alpha values JSON
--metadata_file METADATA_FILE Path to layer metadata JSON
Optional arguments:
--target_avg_bits AVG_BITS Target average bits (default: 4.0)
--method {dp,greedy} Solving method (default: dp)
--bits_choices BITS Available bitwidths (default: 3 4 5 6 8)
Include/exclude specific layer types:
python scripts/calibrate_alphas.py \
--model_name meta-llama/Llama-2-7b-hf \
--include_patterns q_proj k_proj v_proj o_proj \
--exclude_patterns lm_head embed

For very large models, use the greedy solver for faster (but potentially suboptimal) solutions:
python scripts/solve_assignment.py \
--alpha_file alpha_values.json \
--metadata_file layer_metadata.json \
--target_avg_bits 4.0 \
--method greedy

For large models where the single-layer PPL impact is small, calibrate multiple layers jointly:
python scripts/calibrate_alphas.py \
--model_name meta-llama/Llama-2-70b-hf \
--layers_per_step 4

This implementation supports:
- ✅ Dense Models: Llama, Mistral, GPT, etc.
- ✅ MoE Models: Mixtral, DeepSeek-MoE, etc.
- ✅ All Linear Layers: Attention (Q/K/V/O), MLP (gate/up/down), etc.
- ✅ Custom Architectures: Any model with linear layers
- Use FP16/BF16: Reduces memory usage and speeds up inference
- Batch Evaluation: Use larger batch sizes for PPL measurement
- Fewer Samples: 100 calibration samples is usually sufficient
- GPU: CUDA-capable GPU highly recommended
- Parallel Layers: For large models, test multiple layers simultaneously
- Requires calibration data (but only ~100 samples)
- Alpha calibration can be time-consuming for very large models
- Assumes Gaussian-like weight distributions (after Hadamard preprocessing)
- Does not include actual quantization kernels (only finds optimal assignment)
Contributions are welcome! Please open an issue or submit a pull request.
MIT License
If you use this implementation, please cite the original HIGGS paper:
@inproceedings{higgs2025,
title={Pushing the Limits of Large Language Model Quantization via the Linearity Theorem},
author={[Authors]},
booktitle={NAACL},
year={2025}
}

This implementation is based on the HIGGS quantization method and the Linearity Theorem described in the paper. We thank the authors for their groundbreaking work.