The Ultra-Fast LLM Quantization & Export Library
Load → Quantize → Fine-tune → Export: All in One Line
Quick Start • Features • Export Formats • Examples • Documentation
# 50+ lines of configuration...
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
import torch
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
    # ... more config
)
# Then llama.cpp compilation for GGUF...
# Then manual tensor conversion...
from quantllm import turbo
# One line does everything
model = turbo("meta-llama/Llama-3-8B")
# Generate
print(model.generate("Hello!"))
# Fine-tune
model.finetune(dataset, epochs=3)
# Export to any format
model.export("gguf", quantization="Q4_K_M") |
# Recommended installation
pip install git+https://github.com/codewithdark-git/QuantLLM.git
# With all export formats
pip install "quantllm[full] @ git+https://github.com/codewithdark-git/QuantLLM.git"from quantllm import turbo
# Load any model with automatic optimization
model = turbo("meta-llama/Llama-3.2-3B")
# Generate text
response = model.generate("Explain quantum computing simply")
print(response)
# Export to GGUF for Ollama/llama.cpp
model.export("gguf", "model.Q4_K_M.gguf", quantization="Q4_K_M")QuantLLM automatically:
- ✅ Detects your GPU and available memory
- ✅ Applies optimal 4-bit quantization
- ✅ Enables Flash Attention 2 when available
- ✅ Configures memory management
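The sketch below is not QuantLLM's internal code; it is a minimal illustration, in plain PyTorch, of the kind of hardware probe such auto-configuration depends on (the compute-capability cutoff reflects Flash Attention 2's Ampere-or-newer requirement).

# Illustrative sketch only -- not QuantLLM internals.
import torch

def probe_hardware():
    """Report the device name, total VRAM, and whether Flash Attention 2 is usable."""
    if not torch.cuda.is_available():
        return {"device": "cpu", "vram_gb": 0.0, "flash_attn2": False}
    props = torch.cuda.get_device_properties(0)
    return {
        "device": props.name,
        "vram_gb": props.total_memory / 1024**3,
        # Flash Attention 2 needs compute capability >= 8.0 (Ampere or newer).
        "flash_attn2": props.major >= 8,
    }

print(probe_hardware())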
# One unified API for everything
model = turbo("mistralai/Mistral-7B")
model.generate("Hello!")
model.finetune(data, epochs=3)
model.export("gguf", quantization="Q4_K_M")
model.push("user/repo", format="gguf") |
|
|
Llama 2/3, Mistral, Mixtral, Qwen 1/2, Phi 1/2/3, Gemma, Falcon, DeepSeek, Yi, StarCoder, ChatGLM, InternLM, Baichuan, StableLM, BLOOM, OPT, MPT, GPT-NeoX... |
|
|
# Auto-generates model cards with:
# - YAML frontmatter
# - Usage examples
# - "Use this model" button
model.push("user/my-model", format="gguf") |
Export to any deployment target with a single line:
from quantllm import turbo
model = turbo("microsoft/phi-3-mini")
# GGUF → For llama.cpp, Ollama, LM Studio
model.export("gguf", "model.Q4_K_M.gguf", quantization="Q4_K_M")
# ONNX → For ONNX Runtime, TensorRT
model.export("onnx", "./model-onnx/")
# MLX → For Apple Silicon Macs
model.export("mlx", "./model-mlx/", quantization="4bit")
# SafeTensors → For HuggingFace
model.export("safetensors", "./model-hf/")| Type | Bits | Quality | Use Case |
|---|---|---|---|
Q2_K |
2-bit | Low | Minimum size |
Q3_K_M |
3-bit | Fair | Very constrained |
Q4_K_M |
4-bit | Good | Recommended โญ |
Q5_K_M |
5-bit | High | Quality-focused |
Q6_K |
6-bit | Very High | Near-original |
Q8_0 |
8-bit | Excellent | Best quality |
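As a rough guide to the size side of this trade-off, the on-disk footprint can be estimated from the nominal bit width above; this is a back-of-the-envelope sketch, and real GGUF files come out somewhat larger because K-quant blocks also store scales and metadata.

# Back-of-the-envelope estimate only; actual GGUF files are slightly larger.
def approx_gguf_size_gib(params_billion: float, bits: int) -> float:
    return params_billion * 1e9 * bits / 8 / 1024**3

for name, bits in [("Q2_K", 2), ("Q4_K_M", 4), ("Q8_0", 8)]:
    print(f"7B at {name}: ~{approx_gguf_size_gib(7, bits):.1f} GiB")
# 7B at Q2_K: ~1.6 GiB
# 7B at Q4_K_M: ~3.3 GiB
# 7B at Q8_0: ~6.5 GiB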
from quantllm import turbo
model = turbo("meta-llama/Llama-3.2-3B")
# Simple generation
response = model.generate(
    "Write a Python function for fibonacci",
    max_new_tokens=200,
    temperature=0.7,
)
print(response)
# Chat format
messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "How do I read a file in Python?"},
]
response = model.chat(messages)
print(response)

from quantllm import TurboModel

# Load any GGUF model directly
model = TurboModel.from_gguf(
    "TheBloke/Llama-2-7B-Chat-GGUF",
    filename="llama-2-7b-chat.Q4_K_M.gguf"
)
print(model.generate("Hello!"))

from quantllm import turbo
model = turbo("mistralai/Mistral-7B")
# Simple: everything auto-configured
model.finetune("training_data.json", epochs=3)
# Advanced: full control
model.finetune(
    "training_data.json",
    epochs=5,
    learning_rate=2e-4,
    lora_r=32,
    lora_alpha=64,
    batch_size=4,
)

Supported data formats (see the usage sketch after the examples):
[
  {"instruction": "What is Python?", "output": "Python is..."},
  {"text": "Full text for language modeling"},
  {"prompt": "Question", "completion": "Answer"}
]
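For instance, a file in the first (instruction/output) format can be produced with the standard json module and passed straight to finetune(), which, as the example above shows, accepts a path to a JSON file; the records below are placeholders.

# Placeholder records in the instruction/output format shown above.
import json

records = [
    {"instruction": "What is Python?", "output": "Python is a high-level programming language."},
    {"instruction": "Name a Python web framework.", "output": "Flask is a popular choice."},
]
with open("training_data.json", "w") as f:
    json.dump(records, f, indent=2)

model.finetune("training_data.json", epochs=3)  # `model` from the fine-tuning example above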
model = turbo("meta-llama/Llama-3.2-3B")
# Push with auto-generated model card
model.push(
    "your-username/my-model",
    format="gguf",
    quantization="Q4_K_M",
    license="apache-2.0"
)

The model card includes (a sample of the frontmatter follows the list):
- ✅ Proper YAML frontmatter (`library_name`, `tags`, `base_model`)
- ✅ Format-specific usage examples
- ✅ "Use this model" button compatibility
- ✅ Quantization details
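For illustration only, the frontmatter block might look roughly like the string below; the field values are assumptions, and QuantLLM generates the real thing for you on push().

# Hypothetical example of generated model-card frontmatter; values are
# illustrative assumptions, not QuantLLM output.
frontmatter = """\
---
library_name: transformers
tags:
  - quantized
  - gguf
base_model: meta-llama/Llama-3.2-3B
license: apache-2.0
---
"""
print(frontmatter)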
| Configuration | GPU VRAM | Models |
|---|---|---|
| 🟢 Entry | 6-8 GB | 1-7B (4-bit) |
| 🟡 Mid-Range | 12-24 GB | 7-30B (4-bit) |
| 🔴 High-End | 24-80 GB | 70B+ |
Tested GPUs: RTX 3060/3070/3080/3090/4070/4080/4090, A100, H100, Apple M1/M2/M3/M4
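To sanity-check this table against your own card, a rough heuristic (not how QuantLLM decides) is to estimate the quantized weight footprint and leave headroom for the KV cache and activations:

# Rough heuristic: 4-bit weights take ~0.5 GB per billion parameters;
# the 1.2x factor leaves headroom for KV cache, activations, and overhead.
def fits_in_vram(params_billion: float, vram_gb: float, bits: int = 4,
                 headroom: float = 1.2) -> bool:
    weights_gb = params_billion * bits / 8
    return weights_gb * headroom <= vram_gb

print(fits_in_vram(7, 8))    # 7B at 4-bit on an 8 GB card  -> True
print(fits_in_vram(70, 24))  # 70B at 4-bit on a 24 GB card -> False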
# Basic installation
pip install git+https://github.com/codewithdark-git/QuantLLM.git
# With specific features
pip install "quantllm[gguf]" # GGUF export
pip install "quantllm[onnx]" # ONNX export
pip install "quantllm[mlx]" # MLX export (Apple Silicon)
pip install "quantllm[triton]" # Triton kernels
pip install "quantllm[full]" # Everythingquantllm/
โโโ core/ # Core functionality
โ โโโ turbo_model.py # TurboModel unified API
โ โโโ smart_config.py # Auto-configuration
โ โโโ export.py # Universal exporter
โโโ quant/ # Quantization
โ โโโ llama_cpp.py # GGUF conversion
โโโ hub/ # HuggingFace integration
โ โโโ hub_manager.py # Push/pull models
โ โโโ model_card.py # Auto model cards
โโโ kernels/ # Custom kernels
โ โโโ triton/ # Fused operations
โโโ utils/ # Utilities
โโโ progress.py # Beautiful UI
git clone https://github.com/codewithdark-git/QuantLLM.git
cd QuantLLM
pip install -e ".[dev]"
pytest

Areas for contribution:
- 🚀 New model architectures
- 🔧 Performance optimizations
- 📚 Documentation
- 🐛 Bug fixes
MIT License. See LICENSE for details.
Made with 🧡 by Dark Coder
⭐ Star on GitHub • 🐛 Report Bug • 💖 Sponsor
Happy Quantizing!