HIGGS Non-Uniform Quantization

Efficient implementation of the HIGGS (Hadamard Incoherence with Gaussian MSE-optimal GridS) non-uniform quantization approach for Large Language Models, based on the Linearity Theorem.

Reference

This implementation is based on the paper "Pushing the Limits of Large Language Model Quantization via the Linearity Theorem" (NAACL 2025); see the Citation section below for the BibTeX entry.

Overview

HIGGS provides a principled approach to non-uniform quantization of LLMs by:

  1. Noise Injection: Simulating quantization effects by injecting calibrated Gaussian noise
  2. Linearity Theorem: Establishing a linear relationship between layer-wise reconstruction error and perplexity increase
  3. Alpha Calibration: Computing layer-wise sensitivity coefficients (α_ℓ) via linear regression
  4. Optimal Assignment: Solving a knapsack problem to find the optimal per-layer bitwidth allocation

Key Features

  • Efficient: Leverages HuggingFace Transformers and Accelerate for fast inference
  • Flexible: Supports any LLM architecture (dense and MoE models)
  • Principled: Based on theoretical foundations from the Linearity Theorem
  • Practical: Outputs actionable bitwidth assignments
  • MoE Support: Handles Mixture-of-Experts models with grouped expert quantization

Installation

From Source

git clone https://github.com/yourusername/higgs-quantization.git
cd higgs-quantization
pip install -e .

Requirements

  • Python >= 3.8
  • PyTorch >= 2.0.0
  • Transformers >= 4.35.0
  • CUDA-capable GPU (recommended)

Quick Start

Step 1: Calibrate Alpha Values

Run calibration to compute layer-wise sensitivity coefficients:

python scripts/calibrate_alphas.py \
    --model_name meta-llama/Llama-2-7b-hf \
    --num_samples 100 \
    --bits_range 3 4 5 6 8 \
    --output_dir ./outputs/llama2-7b

This will:

  • Load the model
  • Extract all linear layers
  • Load calibration data from Fineweb
  • For each layer and bitwidth:
    • Inject noise simulating quantization
    • Measure PPL impact
    • Record noise norms and layer norms
  • Solve for α_ℓ values using linear regression
  • Save results to ./outputs/llama2-7b/alpha_values.json

Step 2: Solve for Optimal Bitwidth Assignment

Use the calibrated alpha values to find optimal bitwidth allocation:

python scripts/solve_assignment.py \
    --alpha_file ./outputs/llama2-7b/alpha_values.json \
    --metadata_file ./outputs/llama2-7b/layer_metadata.json \
    --target_avg_bits 4.0 \
    --method dp \
    --output_file ./outputs/llama2-7b/assignment_4bit.json

This will:

  • Load alpha values and layer metadata
  • Solve the knapsack problem using dynamic programming
  • Find the optimal per-layer bitwidth assignment
  • Minimize expected PPL increase subject to the bit budget
  • Save the assignment to JSON

Usage Examples

Example 1: Full Calibration Pipeline

from higgs_quantization import (
    ModelHandler,
    CalibrationDataset,
    PerplexityEvaluator,
    NoiseInjector,
    AlphaSolver
)

# 1. Load model
handler = ModelHandler("meta-llama/Llama-2-7b-hf")
model = handler.load_model()
linear_layers = handler.extract_linear_layers()

# 2. Load calibration data
calib = CalibrationDataset(
    tokenizer_name="meta-llama/Llama-2-7b-hf",
    num_samples=100
)
calib.load_tokenizer()
samples = calib.load_dataset()

# 3. Measure baseline PPL
evaluator = PerplexityEvaluator(model, calib.tokenizer)
baseline_ppl = evaluator.evaluate_batched(samples)

# 4. Calibrate alpha values
injector = NoiseInjector(model)
solver = AlphaSolver()

for layer_name, module in linear_layers.items():
    for bits in [3, 4, 5, 6, 8]:
        # Inject noise
        noise_norm_sq, layer_norm_sq = injector.inject_noise(
            layer_name, module, bits
        )

        # Measure PPL
        noisy_ppl = evaluator.evaluate_batched(samples)
        ppl_increase = noisy_ppl - baseline_ppl

        # Record measurement
        solver.add_measurement(
            layer_name, bits, noise_norm_sq, layer_norm_sq,
            ppl_increase, baseline_ppl, noisy_ppl
        )

        # Remove noise
        injector.remove_noise(layer_name)

# 5. Solve for alphas
alpha_values = solver.solve_alpha_all_layers()
solver.save_alpha_values("alpha_values.json")

Example 2: Bitwidth Assignment

from higgs_quantization import KnapsackSolver
import json

# Load alpha values and metadata
with open("alpha_values.json") as f:
    alpha_data = json.load(f)
    alpha_values = alpha_data['alpha_values']

with open("layer_metadata.json") as f:
    metadata = json.load(f)
    layer_sizes = {
        name: meta['num_parameters']
        for name, meta in metadata.items()
    }

# Create solver
solver = KnapsackSolver(
    alpha_values=alpha_values,
    layer_sizes=layer_sizes,
    bits_choices=[3, 4, 5, 6, 8]
)

# Solve for 4-bit average
assignment = solver.solve(target_avg_bits=4.0)

# Evaluate
metrics = solver.evaluate_assignment(assignment)
print(f"Average bits: {metrics['avg_bits']:.2f}")
print(f"Expected cost: {metrics['total_cost']:.4e}")

# Save
solver.save_assignment(assignment, "assignment_4bit.json")

Example 3: MoE Model Support

The package automatically handles MoE models:

handler = ModelHandler("mistralai/Mixtral-8x7B-v0.1")
model = handler.load_model()

# Extract layers (MoE experts are automatically identified)
linear_layers = handler.extract_linear_layers()

# Group MoE experts
moe_groups = handler.group_moe_experts()

# All experts in the same MoE layer will be assigned the same bitwidth
# This is handled automatically during calibration
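
Grouping changes what the solver sees: one decision per expert group instead of one per expert. A rough illustration of the idea (not the package's internal code), assuming moe_groups maps a group name to its member layer names and that per-layer costs α_ℓ × noise_ℓ²(bits) have already been tabulated:

# Illustrative aggregation only: with a shared bitwidth per group, the group's
# cost at a given bitwidth is the sum of its members' per-layer costs, and its
# size is the sum of their parameter counts. `layer_cost` is an assumed
# precomputed table standing in for alpha_l * noise_l^2(bits).
def group_cost(members, bits):
    return sum(layer_cost[name][bits] for name in members)

group_sizes = {
    group: sum(layer_sizes[name] for name in members)
    for group, members in moe_groups.items()
}
# These grouped quantities can then drive the bitwidth assignment, giving every
# expert in a group the same number of bits.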

Output Files

alpha_values.json

Contains the calibrated sensitivity coefficients:

{
  "alpha_values": {
    "model.layers.0.self_attn.q_proj": 1.234e-5,
    "model.layers.0.self_attn.k_proj": 8.765e-6,
    ...
  },
  "num_measurements": 500,
  "num_layers": 100
}

assignment_4bit.json

Contains the optimal bitwidth assignment:

{
  "assignment": {
    "model.layers.0.self_attn.q_proj": 5,
    "model.layers.0.self_attn.k_proj": 4,
    "model.layers.0.mlp.down_proj": 3,
    ...
  },
  "metrics": {
    "total_cost": 0.123,
    "avg_bits": 4.02,
    "bitwidth_distribution": {
      "3": 20,
      "4": 50,
      "5": 25,
      "6": 5
    }
  }
}
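
Because the package stops at the assignment (no quantization kernels are included), a common follow-up is to sanity-check the file before handing it to a separate quantization backend. A small illustrative check that recomputes the parameter-weighted average bitwidth from the layer metadata:

import json

# Recompute the average bitwidth of an assignment against the layer metadata.
with open("assignment_4bit.json") as f:
    assignment = json.load(f)["assignment"]
with open("layer_metadata.json") as f:
    metadata = json.load(f)

total_params = sum(metadata[name]["num_parameters"] for name in assignment)
total_bits = sum(metadata[name]["num_parameters"] * bits
                 for name, bits in assignment.items())
print(f"Average bits: {total_bits / total_params:.2f}")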

Algorithm Details

The Linearity Theorem

The core insight is that perplexity increase is approximately linear in layer-wise reconstruction error:

ΔPPL ≈ Σ_ℓ (α_ℓ × ||noise_ℓ||²)

where:

  • ΔPPL is the perplexity increase
  • α_ℓ is the sensitivity coefficient for layer ℓ
  • ||noise_ℓ||² is the squared L2 norm of quantization noise
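
In code, the theorem lets you estimate the perplexity impact of a candidate configuration without re-running the model. A minimal sketch, assuming alpha_values and noise_norm_sq are per-layer dictionaries produced by a prior calibration run (the values shown are made up for illustration):

# Predict the total PPL increase from calibrated sensitivities and
# measured squared noise norms (illustrative values only).
alpha_values = {
    "model.layers.0.self_attn.q_proj": 1.2e-5,
    "model.layers.0.mlp.down_proj": 3.4e-6,
}
noise_norm_sq = {
    "model.layers.0.self_attn.q_proj": 52.0,
    "model.layers.0.mlp.down_proj": 310.0,
}

predicted_ppl_increase = sum(
    alpha_values[name] * noise_norm_sq[name] for name in alpha_values
)
print(f"Predicted ΔPPL: {predicted_ppl_increase:.4f}")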

Noise Injection

To simulate quantization without actually quantizing:

  1. Compute expected quantization noise std for target bitwidth
  2. Generate Gaussian noise with that std
  3. Add noise to layer weights
  4. Measure PPL impact
  5. Revert to original weights
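
The sketch below walks through steps 1–5 with a deliberately simplified error model: per-element noise variance is taken to be a fraction of the weight variance that shrinks as 2^(-2·bits). The actual standard deviation used by NoiseInjector comes from the MSE-optimal grid, so treat this only as an assumption-laden illustration:

import torch

def inject_gaussian_noise(module: torch.nn.Linear, bits: int, rel_err_1bit: float = 0.5):
    """Perturb a linear layer with Gaussian noise emulating b-bit quantization.

    Assumed error model: noise variance = rel_err_1bit * 2^(-2*(bits-1)) of the
    weight variance (not the package's grid-based std). Returns the original
    weights plus the squared norms the calibration needs.
    """
    weight = module.weight.data
    rel_mse = rel_err_1bit * 2.0 ** (-2 * (bits - 1))      # assumed error model
    noise = torch.randn_like(weight) * weight.std() * rel_mse ** 0.5

    original = weight.clone()            # keep a copy so the weights can be reverted
    module.weight.data = weight + noise  # step 3: add noise to the layer weights

    noise_norm_sq = noise.pow(2).sum().item()
    layer_norm_sq = original.pow(2).sum().item()
    return original, noise_norm_sq, layer_norm_sq

# ... measure PPL on the perturbed model (step 4), then revert (step 5):
# module.weight.data = original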

Alpha Computation

For each layer, we collect multiple measurements at different bitwidths and perform linear regression:

y = α × x

where:

  • y = PPL increase
  • x = noise_norm²
  • α = sensitivity coefficient (slope)
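
Because the regression has no intercept, the closed-form least-squares slope suffices. A minimal sketch for one layer, with made-up measurement values standing in for the recorded (noise_norm², PPL increase) pairs:

import numpy as np

# Illustrative measurements for a single layer: squared noise norms (x) and
# the corresponding PPL increases (y), one pair per tested bitwidth.
x = np.array([310.0, 78.0, 19.0, 4.8, 0.3])       # ||noise||^2 at bits = 3, 4, 5, 6, 8
y = np.array([0.041, 0.010, 0.0026, 0.0007, 4e-5])

# Zero-intercept least squares: alpha = sum(x * y) / sum(x * x)
alpha = float(np.dot(x, y) / np.dot(x, x))
print(f"alpha = {alpha:.3e}")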

Knapsack Formulation

Find the optimal bitwidth assignment that minimizes:

minimize:   Σ_ℓ (α_ℓ × noise_ℓ²(bits_ℓ))
subject to: Σ_ℓ (params_ℓ × bits_ℓ) ≤ budget

Solved using dynamic programming in O(n × budget) time, where n is the number of layers and budget is the total bit budget.
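
For intuition, the sketch below implements the greedy alternative (the idea behind --method greedy, not the package's actual code): start every layer at the lowest bitwidth and repeatedly apply the single upgrade with the best cost reduction per extra parameter-bit that still fits the budget. The cost(name, bits) callable is an assumed stand-in for the precomputed α_ℓ × noise_ℓ²(bits) table:

def greedy_assignment(layer_sizes, cost, bits_choices, target_avg_bits):
    """Greedy bit allocation sketch.

    layer_sizes: dict mapping layer name -> parameter count
    cost: callable (name, bits) -> expected PPL cost at that bitwidth (assumed precomputed)
    """
    bits_choices = sorted(bits_choices)
    budget = target_avg_bits * sum(layer_sizes.values())          # total parameter-bits
    assignment = {name: bits_choices[0] for name in layer_sizes}  # start everyone at the minimum
    used = sum(size * bits_choices[0] for size in layer_sizes.values())

    while True:
        best = None
        for name, size in layer_sizes.items():
            idx = bits_choices.index(assignment[name])
            if idx + 1 == len(bits_choices):
                continue                                   # already at the highest bitwidth
            nxt = bits_choices[idx + 1]
            extra = size * (nxt - assignment[name])
            if used + extra > budget:
                continue                                   # upgrade does not fit the budget
            gain = (cost(name, assignment[name]) - cost(name, nxt)) / extra
            if best is None or gain > best[0]:
                best = (gain, name, nxt, extra)
        if best is None:
            break                                          # no affordable upgrade remains
        _, name, nxt, extra = best
        assignment[name] = nxt
        used += extra
    return assignment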

Command-Line Reference

calibrate_alphas.py

usage: calibrate_alphas.py [-h] --model_name MODEL_NAME
                          [--device_map DEVICE_MAP]
                          [--torch_dtype {float16,float32,bfloat16}]
                          [--dataset_name DATASET_NAME]
                          [--num_samples NUM_SAMPLES]
                          [--max_length MAX_LENGTH]
                          [--bits_range BITS_RANGE [BITS_RANGE ...]]
                          [--output_dir OUTPUT_DIR]
                          [--save_measurements]
                          [--batch_size BATCH_SIZE]

Required arguments:
  --model_name MODEL_NAME    HuggingFace model name or path

Optional arguments:
  --device_map DEVICE_MAP    Device map (default: auto)
  --num_samples NUM_SAMPLES  Calibration samples (default: 100)
  --bits_range BITS          Bitwidths to test (default: 3 4 5 6 8)
  --output_dir OUTPUT_DIR    Output directory (default: ./higgs_outputs)

solve_assignment.py

usage: solve_assignment.py [-h] --alpha_file ALPHA_FILE
                          --metadata_file METADATA_FILE
                          [--target_avg_bits TARGET_AVG_BITS]
                          [--bits_choices BITS_CHOICES [BITS_CHOICES ...]]
                          [--method {dp,greedy}]
                          [--output_file OUTPUT_FILE]

Required arguments:
  --alpha_file ALPHA_FILE        Path to alpha values JSON
  --metadata_file METADATA_FILE  Path to layer metadata JSON

Optional arguments:
  --target_avg_bits AVG_BITS    Target average bits (default: 4.0)
  --method {dp,greedy}          Solving method (default: dp)
  --bits_choices BITS           Available bitwidths (default: 3 4 5 6 8)

Advanced Usage

Custom Layer Filtering

Include/exclude specific layer types:

python scripts/calibrate_alphas.py \
    --model_name meta-llama/Llama-2-7b-hf \
    --include_patterns q_proj k_proj v_proj o_proj \
    --exclude_patterns lm_head embed

Greedy vs Dynamic Programming

For very large models, use the greedy solver for faster (but potentially suboptimal) solutions:

python scripts/solve_assignment.py \
    --alpha_file alpha_values.json \
    --metadata_file layer_metadata.json \
    --target_avg_bits 4.0 \
    --method greedy

Joint Layer Calibration

For large models where single-layer PPL impact is small, calibrate multiple layers jointly:

python scripts/calibrate_alphas.py \
    --model_name meta-llama/Llama-2-70b-hf \
    --layers_per_step 4

Architecture Support

This implementation supports:

  • Dense Models: Llama, Mistral, GPT, etc.
  • MoE Models: Mixtral, DeepSeek-MoE, etc.
  • All Linear Layers: Attention (Q/K/V/O), MLP (gate/up/down), etc.
  • Custom Architectures: Any model with linear layers

Performance Tips

  1. Use FP16/BF16: Reduces memory usage and speeds up inference
  2. Batch Evaluation: Use larger batch sizes for PPL measurement
  3. Fewer Samples: 100 calibration samples is usually sufficient
  4. GPU: CUDA-capable GPU highly recommended
  5. Parallel Layers: For large models, test multiple layers simultaneously

Limitations

  • Requires calibration data (but only ~100 samples)
  • Alpha calibration can be time-consuming for very large models
  • Assumes Gaussian-like weight distributions (after Hadamard preprocessing)
  • Does not include actual quantization kernels (only finds optimal assignment)

Contributing

Contributions are welcome! Please open an issue or submit a pull request.

License

MIT License

Citation

If you use this implementation, please cite the original HIGGS paper:

@inproceedings{higgs2025,
  title={Pushing the Limits of Large Language Model Quantization via the Linearity Theorem},
  author={[Authors]},
  booktitle={NAACL},
  year={2025}
}

Acknowledgments

This implementation is based on the HIGGS quantization method and the Linearity Theorem described in the paper. We thank the authors for their groundbreaking work.
