Document preprocessing service for RAG (Retrieval-Augmented Generation)
We recommend using conda to isolate the RAG environment, then installing dependencies via pip:
```bash
conda create -y --name RAG python=3.10
```

Ensure that you are in the RAG env, then install PyTorch and the required dependencies:

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
```

You may need to install Tesseract for OCR capability; see the doc for an installation guide:
```bash
# for ubuntu/debian
apt-get update && apt-get install -y --no-install-recommends \
    tesseract-ocr tesseract-ocr-eng tesseract-ocr-chi-sim libgl1-mesa-glx
```

For LLMChunker, you can use Ollama (set it as the default backend) to run the model locally; see the doc for an installation guide.
```python
from prep.chunker import LLMChunker

chunker = LLMChunker(
    backend='ollama',
    ollama_api_host='http://127.0.0.1:11434',
    model_name='qwen3:4b'
)
```

You can also use the api backend to call any OpenAI-compatible API; for example, you can deploy a model on your own server with vLLM.
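As a sketch, the api backend setup might look like the following. The keyword arguments `api_base` and `api_key` are assumptions (only `backend` and `model_name` are confirmed above); check the LLMChunker signature for the actual parameter names.

```python
from prep.chunker import LLMChunker

# Hypothetical parameters for an OpenAI-compatible endpoint,
# e.g. a model served via vLLM's OpenAI-compatible server.
chunker = LLMChunker(
    backend='api',
    api_base='http://your-server:8000/v1',  # assumed name for the endpoint base URL
    api_key='sk-your-api-key',              # assumed name for the API key field
    model_name='Qwen/Qwen3-4B'              # whatever model your server exposes
)
```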
For contributors, install the git hooks before you commit:

```bash
pre-commit install
```

We provide CPU and GPU Docker images to suit different deployment needs. See the Docker Build Guide for detailed documentation.
We suggest using the CPU version: it not only gives a smaller image but also delivers almost the same performance (YOLO parsing and embedding will not be the bottleneck).
Copy the example configuration and edit the required fields:

```bash
cp config.yaml.example config.yaml
# Edit config.yaml with your settings
```

See the Configuration Guide for a detailed configuration reference.
API Server Mode (synchronous):

```yaml
# config.yaml
app:
  enable_message_queue: false
```

```bash
# Start server
python run.py
```

Access the API documentation at http://localhost:8000/docs.
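As a quick smoke test, you can call the server from Python. The route and payload below are hypothetical; consult http://localhost:8000/docs for the actual endpoints and request schemas.

```python
import requests

# Hypothetical route and payload; the real endpoints are listed
# in the interactive docs at http://localhost:8000/docs.
with open('paper.pdf', 'rb') as f:
    resp = requests.post(
        'http://localhost:8000/process',  # assumed document-processing route
        files={'file': ('paper.pdf', f, 'application/pdf')},
    )
print(resp.status_code, resp.json())
```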
Task Consumer Mode (asynchronous with RabbitMQ):

```yaml
# config.yaml
app:
  enable_message_queue: true
```

```bash
# Start consumer
python run.py
```

The service uses YAML-based configuration with environment variable override support.
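For illustration, a producer could enqueue a task with pika as in the sketch below; the queue name and message schema are assumptions, so check the message-queue settings in the Configuration Guide for the real values.

```python
import json
import pika

# Minimal sketch of publishing a processing task to RabbitMQ.
# Queue name and payload schema are assumptions, not the service's
# actual contract; see the Configuration Guide for the real settings.
connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.queue_declare(queue='rag_preprocess_tasks', durable=True)  # hypothetical queue
channel.basic_publish(
    exchange='',
    routing_key='rag_preprocess_tasks',
    body=json.dumps({'file_path': '/data/paper.pdf'}),  # hypothetical payload
    properties=pika.BasicProperties(delivery_mode=2),   # persist across restarts
)
connection.close()
```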
📖 Full Configuration Guide - Detailed reference with all available options
Quick Setup:

```bash
# 1. Copy example configuration
cp config.yaml.example config.yaml

# 2. Edit required fields (see CONFIGURATION.md)
vim config.yaml

# 3. Or use environment variables
export MILVUS_HOST=localhost
export CHUNK_API_KEY=sk-your-api-key
```

Configuration Priority: Environment Variables > config.yaml > Defaults
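To make that precedence concrete, here is an illustrative sketch of how a single setting could resolve. This is not the service's actual loader, and the key names are assumptions:

```python
import os
import yaml

# Illustrative only: an environment variable beats config.yaml,
# which beats the built-in default. Key names are assumptions.
DEFAULTS = {'milvus_host': '127.0.0.1'}

with open('config.yaml') as f:
    file_cfg = yaml.safe_load(f) or {}

milvus_host = (
    os.environ.get('MILVUS_HOST')       # 1. environment variable
    or file_cfg.get('milvus_host')      # 2. config.yaml
    or DEFAULTS['milvus_host']          # 3. default
)
print(milvus_host)
```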
- YOLO image recognition for PDF parsing
- LLM-based text chunking plus rule-based (semantic) chunking
- `paraphrase-multilingual-mpnet-base-v2` as the embedding model
- PaddleOCR support
- batch processing for PDF parsing
- YOLO model / chunking LLM upgrades