
Predoc

Document preprocessing service for RAG (Retrieval-Augmented Generation)

Usage

Dev Environment Setup

We recommend using conda to isolate the RAG environment, then installing dependencies via pip:

conda create -y --name RAG python=3.10

Ensure that you are in the RAG env (conda activate RAG), then install PyTorch and the required dependencies:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
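
To verify that the CUDA build of PyTorch was installed correctly, you can run a quick check (an optional sanity check, not part of the service):

import torch

print(torch.__version__)          # should report a +cu118 build
print(torch.cuda.is_available())  # True if the GPU and driver are visible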

You may need to install Tesseract for OCR capability; see the Tesseract docs for an installation guide:

# for ubuntu/debian
apt-get update && apt-get install -y --no-install-recommends \
    tesseract-ocr tesseract-ocr-eng tesseract-ocr-chi-sim libgl1-mesa-glx
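
To confirm that Tesseract is reachable from Python, a quick check with pytesseract (assuming that wrapper is installed in your environment) looks like:

import pytesseract

# Prints the Tesseract binary version if the installation is on PATH
print(pytesseract.get_tesseract_version())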

For LLMChunker, you can use Ollama (set it as the default backend) to run the model locally; see the Ollama docs for an installation guide.

from prep.chunker import LLMChunker

chunker = LLMChunker(
    backend='ollama',
    ollama_api_host='http://127.0.0.1:11434',
    model_name='qwen3:4b'
)

You can also use the api backend to call any OpenAI-compatible API; for example, you can use vLLM to deploy a model on your own server.
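
For illustration, using the api backend might look like the sketch below; the keyword names (api_base, api_key) are assumptions, so check the actual LLMChunker signature in prep.chunker:

from prep.chunker import LLMChunker

# Hypothetical parameter names -- the real keywords may differ.
chunker = LLMChunker(
    backend='api',
    api_base='http://your-server:8000/v1',  # any OpenAI-compatible endpoint, e.g. vLLM
    api_key='sk-your-api-key',
    model_name='your-model-name'
)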

For contributors, install the git hooks before you commit:

pre-commit install

Install with Docker

We provide CPU/GPU Docker images to suit different deployment needs. See Docker Build Guide for detailed documentation.

We suggest using the CPU version: it is not only a smaller image but also offers nearly the same performance (YOLO parsing and embedding will not be the bottleneck).

Getting Started

1. Configure the Service

Copy the example configuration and edit the required fields:

cp config.yaml.example config.yaml
# Edit config.yaml with your settings

See Configuration Guide for detailed configuration reference.

2. Choose Operation Mode

API Server Mode (synchronous):

# config.yaml
app:
  enable_message_queue: false

# Start server
python run.py

Access API documentation at http://localhost:8000/docs
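
Once the server is up, the interactive docs list the available routes. As a hedged illustration, a request from Python could look like the following; the endpoint path and payload are hypothetical, so take the real schema from http://localhost:8000/docs:

import requests

# Hypothetical endpoint and payload -- consult /docs for the actual API.
with open('example.pdf', 'rb') as f:
    resp = requests.post(
        'http://localhost:8000/preprocess',
        files={'file': ('example.pdf', f, 'application/pdf')},
    )
print(resp.status_code, resp.json())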

Task Consumer Mode (asynchronous with RabbitMQ):

# config.yaml
app:
  enable_message_queue: true

# Start consumer
python run.py
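
In this mode the service consumes tasks from RabbitMQ. A minimal sketch of publishing a task with pika is shown below; the queue name and message schema are assumptions, so align them with your RabbitMQ settings in config.yaml:

import json
import pika

# Hypothetical queue name and payload -- match your config.yaml.
connection = pika.BlockingConnection(pika.ConnectionParameters(host='localhost'))
channel = connection.channel()
channel.queue_declare(queue='predoc_tasks', durable=True)
channel.basic_publish(
    exchange='',
    routing_key='predoc_tasks',
    body=json.dumps({'file_path': '/data/example.pdf'}),
)
connection.close()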

Configuration

The service uses YAML-based configuration with environment variable override support.

📖 Full Configuration Guide - Detailed reference with all available options

Quick Setup:

# 1. Copy example configuration
cp config.yaml.example config.yaml

# 2. Edit required fields (see CONFIGURATION.md)
vim config.yaml

# 3. Or use environment variables
export MILVUS_HOST=localhost
export CHUNK_API_KEY=sk-your-api-key

Configuration Priority: Environment Variables > config.yaml > Defaults
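
To make the priority order concrete, here is a minimal sketch of how such an override chain can be resolved in Python; it illustrates the rule above and is not the service's actual config loader (the milvus.host key is an assumption):

import os
import yaml

DEFAULTS = {'milvus_host': '127.0.0.1'}

with open('config.yaml') as f:
    file_cfg = yaml.safe_load(f) or {}

# Environment variables win over config.yaml, which wins over defaults.
milvus_host = (
    os.environ.get('MILVUS_HOST')
    or file_cfg.get('milvus', {}).get('host')
    or DEFAULTS['milvus_host']
)
print(milvus_host)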

Supported features

  • YOLO image recognition for PDF parsing
  • LLM-based text chunking and rule-based (semantic) chunking
  • paraphrase-multilingual-mpnet-base-v2 as the embedding model (see the sketch below)
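
The embedding step can be reproduced with sentence-transformers; this is a minimal sketch loading the same model, independent of the service's internal wrapper:

from sentence_transformers import SentenceTransformer

# Same multilingual model the service uses for embeddings.
model = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')
embeddings = model.encode(['A chunk of text', '一段中文文本'])
print(embeddings.shape)  # (2, 768)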

TODOs

  • PaddleOCR support
  • Batch processing for PDF parsing
  • YOLO model / chunking LLM upgrades
