The Ultra-Fast LLM Quantization & Export Library
Load → Quantize → Fine-tune → Export: All in One Line
Quick Start • Features • Export Formats • Examples • Documentation
# 50+ lines of configuration...
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
import torch
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
    # ... more config
)
# Then llama.cpp compilation for GGUF...
# Then manual tensor conversion...
from quantllm import turbo
# One line does everything
model = turbo("meta-llama/Llama-3-8B")
# Generate
print(model.generate("Hello!"))
# Fine-tune
model.finetune(dataset, epochs=3)
# Export to any format
model.export("gguf", quantization="Q4_K_M") |
# Recommended installation
pip install git+https://github.com/codewithdark-git/QuantLLM.git
# With all export formats
pip install "quantllm[full] @ git+https://github.com/codewithdark-git/QuantLLM.git"from quantllm import turbo
# Load any model with automatic optimization
model = turbo("meta-llama/Llama-3.2-3B")
# Generate text
response = model.generate("Explain quantum computing simply")
print(response)
# Export to GGUF for Ollama/llama.cpp
model.export("gguf", "model.Q4_K_M.gguf", quantization="Q4_K_M")QuantLLM automatically:
- ✅ Detects your GPU and available memory
- ✅ Applies optimal 4-bit quantization
- ✅ Enables Flash Attention 2 when available
- ✅ Configures memory management
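The sketch below is not QuantLLM's internal code; it is a minimal illustration, in plain PyTorch, of the kind of hardware probe such auto-configuration depends on (the compute-capability cutoff reflects Flash Attention 2's Ampere-or-newer requirement).

# Illustrative sketch only -- not QuantLLM internals.
import torch

def probe_hardware():
    """Report the device name, total VRAM, and whether Flash Attention 2 is usable."""
    if not torch.cuda.is_available():
        return {"device": "cpu", "vram_gb": 0.0, "flash_attn2": False}
    props = torch.cuda.get_device_properties(0)
    return {
        "device": props.name,
        "vram_gb": props.total_memory / 1024**3,
        # Flash Attention 2 needs compute capability >= 8.0 (Ampere or newer).
        "flash_attn2": props.major >= 8,
    }

print(probe_hardware())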
# One unified API for everything
model = turbo("mistralai/Mistral-7B")
model.generate("Hello!")
model.finetune(data, epochs=3)
model.export("gguf", quantization="Q4_K_M")
model.push("user/repo", format="gguf") |
|
|
Llama 2/3, Mistral, Mixtral, Qwen 1/2, Phi 1/2/3, Gemma, Falcon, DeepSeek, Yi, StarCoder, ChatGLM, InternLM, Baichuan, StableLM, BLOOM, OPT, MPT, GPT-NeoX... |
|
|
# Auto-generates model cards with:
# - YAML frontmatter
# - Usage examples
# - "Use this model" button
model.push("user/my-model", format="gguf") |
Export to any deployment target with a single line:
from quantllm import turbo
model = turbo("microsoft/phi-3-mini")
# GGUF → For llama.cpp, Ollama, LM Studio
model.export("gguf", "model.Q4_K_M.gguf", quantization="Q4_K_M")
# ONNX → For ONNX Runtime, TensorRT
model.export("onnx", "./model-onnx/")
# MLX → For Apple Silicon Macs
model.export("mlx", "./model-mlx/", quantization="4bit")
# SafeTensors → For HuggingFace
model.export("safetensors", "./model-hf/")| Type | Bits | Quality | Use Case |
|---|---|---|---|
Q2_K |
2-bit | Low | Minimum size |
Q3_K_M |
3-bit | Fair | Very constrained |
Q4_K_M |
4-bit | Good | Recommended โญ |
Q5_K_M |
5-bit | High | Quality-focused |
Q6_K |
6-bit | Very High | Near-original |
Q8_0 |
8-bit | Excellent | Best quality |
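As a rough guide to the size side of this trade-off, the on-disk footprint can be estimated from the nominal bit width above; this is a back-of-the-envelope sketch, and real GGUF files come out somewhat larger because K-quant blocks also store scales and metadata.

# Back-of-the-envelope estimate only; actual GGUF files are slightly larger.
def approx_gguf_size_gib(params_billion: float, bits: int) -> float:
    return params_billion * 1e9 * bits / 8 / 1024**3

for name, bits in [("Q2_K", 2), ("Q4_K_M", 4), ("Q8_0", 8)]:
    print(f"7B at {name}: ~{approx_gguf_size_gib(7, bits):.1f} GiB")
# 7B at Q2_K: ~1.6 GiB
# 7B at Q4_K_M: ~3.3 GiB
# 7B at Q8_0: ~6.5 GiB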
from quantllm import turbo
model = turbo("meta-llama/Llama-3.2-3B")
# Simple generation
response = model.generate(
    "Write a Python function for fibonacci",
    max_new_tokens=200,
    temperature=0.7,
)
print(response)
# Chat format
messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "How do I read a file in Python?"},
]
response = model.chat(messages)
print(response)

from quantllm import TurboModel

# Load any GGUF model directly
model = TurboModel.from_gguf(
    "TheBloke/Llama-2-7B-Chat-GGUF",
    filename="llama-2-7b-chat.Q4_K_M.gguf"
)
print(model.generate("Hello!"))

from quantllm import turbo
model = turbo("mistralai/Mistral-7B")
# Simple: everything auto-configured
model.finetune("training_data.json", epochs=3)
# Advanced: full control
model.finetune(
    "training_data.json",
    epochs=5,
    learning_rate=2e-4,
    lora_r=32,
    lora_alpha=64,
    batch_size=4,
)

Supported data formats (see the usage sketch after the examples):
[
  {"instruction": "What is Python?", "output": "Python is..."},
  {"text": "Full text for language modeling"},
  {"prompt": "Question", "completion": "Answer"}
]
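For instance, a file in the first (instruction/output) format can be produced with the standard json module and passed straight to finetune(), which, as the example above shows, accepts a path to a JSON file; the records below are placeholders.

# Placeholder records in the instruction/output format shown above.
import json

records = [
    {"instruction": "What is Python?", "output": "Python is a high-level programming language."},
    {"instruction": "Name a Python web framework.", "output": "Flask is a popular choice."},
]
with open("training_data.json", "w") as f:
    json.dump(records, f, indent=2)

model.finetune("training_data.json", epochs=3)  # `model` from the fine-tuning example above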
model = turbo("meta-llama/Llama-3.2-3B")
# Push with auto-generated model card
model.push(
    "your-username/my-model",
    format="gguf",
    quantization="Q4_K_M",
    license="apache-2.0"
)

The model card includes (a sample of the frontmatter follows the list):
- ✅ Proper YAML frontmatter (`library_name`, `tags`, `base_model`)
- ✅ Format-specific usage examples
- ✅ "Use this model" button compatibility
- ✅ Quantization details
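For illustration only, the frontmatter block might look roughly like the string below; the field values are assumptions, and QuantLLM generates the real thing for you on push().

# Hypothetical example of generated model-card frontmatter; values are
# illustrative assumptions, not QuantLLM output.
frontmatter = """\
---
library_name: transformers
tags:
  - quantized
  - gguf
base_model: meta-llama/Llama-3.2-3B
license: apache-2.0
---
"""
print(frontmatter)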
| Configuration | GPU VRAM | Models |
|---|---|---|
| 🟢 Entry | 6-8 GB | 1-7B (4-bit) |
| 🟡 Mid-Range | 12-24 GB | 7-30B (4-bit) |
| 🔴 High-End | 24-80 GB | 70B+ |
Tested GPUs: RTX 3060/3070/3080/3090/4070/4080/4090, A100, H100, Apple M1/M2/M3/M4
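To sanity-check this table against your own card, a rough heuristic (not how QuantLLM decides) is to estimate the quantized weight footprint and leave headroom for the KV cache and activations:

# Rough heuristic: 4-bit weights take ~0.5 GB per billion parameters;
# the 1.2x factor leaves headroom for KV cache, activations, and overhead.
def fits_in_vram(params_billion: float, vram_gb: float, bits: int = 4,
                 headroom: float = 1.2) -> bool:
    weights_gb = params_billion * bits / 8
    return weights_gb * headroom <= vram_gb

print(fits_in_vram(7, 8))    # 7B at 4-bit on an 8 GB card  -> True
print(fits_in_vram(70, 24))  # 70B at 4-bit on a 24 GB card -> False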
# Basic installation
pip install git+https://github.com/codewithdark-git/QuantLLM.git
# With specific features
pip install "quantllm[gguf]" # GGUF export
pip install "quantllm[onnx]" # ONNX export
pip install "quantllm[mlx]" # MLX export (Apple Silicon)
pip install "quantllm[triton]" # Triton kernels
pip install "quantllm[full]" # Everythingquantllm/
โโโ core/ # Core functionality
โ โโโ turbo_model.py # TurboModel unified API
โ โโโ smart_config.py # Auto-configuration
โ โโโ export.py # Universal exporter
โโโ quant/ # Quantization
โ โโโ llama_cpp.py # GGUF conversion
โโโ hub/ # HuggingFace integration
โ โโโ hub_manager.py # Push/pull models
โ โโโ model_card.py # Auto model cards
โโโ kernels/ # Custom kernels
โ โโโ triton/ # Fused operations
โโโ utils/ # Utilities
โโโ progress.py # Beautiful UI
git clone https://github.com/codewithdark-git/QuantLLM.git
cd QuantLLM
pip install -e ".[dev]"
pytest

Areas for contribution:
- 🚀 New model architectures
- 🔧 Performance optimizations
- 📚 Documentation
- 🐛 Bug fixes
MIT License. See LICENSE for details.
Made with 🧡 by Dark Coder
⭐ Star on GitHub • 🐛 Report Bug • 💖 Sponsor
Happy Quantizing!