A lightweight sketch-language model built on top of the LLaVA codebase.
Create and activate the conda environment:
conda create -n o3slm python=3.10 -y
conda activate o3slm
pip install --upgrade pip # Enable PEP 660 support
pip install -e .

Install training dependencies:
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
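As an optional sanity check of the environment (assuming the editable install exposes the upstream llava package, as in the LLaVA codebase this repo builds on):

# Optional: confirm the install and GPU visibility (llava namespace is assumed from upstream LLaVA)
python -c "import llava, torch; print('llava OK, CUDA available:', torch.cuda.is_available())"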
Download the pretrained O3SLM model checkpoints from: <link>

Place the downloaded checkpoints in a directory accessible to your training/evaluation scripts.
LLaVA: https://huggingface.co/liuhaotian/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5
MM_Projector:
- 7b: https://huggingface.co/liuhaotian/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5
- 13b: https://huggingface.co/liuhaotian/llava-v1.5-mlp2x-336px-pretrain-vicuna-13b-v1.5
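One way to fetch the projector checkpoints is with huggingface-cli (installed alongside huggingface_hub); the checkpoints/ target directories below are only examples, not required paths:

# Example only: pull the 7b and 13b MM projector checkpoints into a local checkpoints/ folder
huggingface-cli download liuhaotian/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5 --local-dir checkpoints/mm_projector_7b
huggingface-cli download liuhaotian/llava-v1.5-mlp2x-336px-pretrain-vicuna-13b-v1.5 --local-dir checkpoints/mm_projector_13b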
Place your conversation JSON files for training in the data_jsons/ directory.
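A quick way to sanity-check these files before training is to confirm that every JSON in data_jsons/ parses:

# Verify each conversation JSON is valid JSON before launching training
for f in data_jsons/*.json; do
  python -c "import json, sys; json.load(open(sys.argv[1])); print('ok:', sys.argv[1])" "$f"
done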
Organize your datasets under a single data_root directory with the following structure:
data_root/
├── pretrain_data/
│   ├── images/
│   │   ├── O365/
│   │   └── OI/
│   └── sketches/
│       ├── SketchVCL-OI/
│       │   ├── 1/
│       │   ├── ...
│       │   └── 601/
│       └── SketchVCL-O365/
│           ├── 0/
│           ├── ...
│           └── 364/
├── finetune_data/
│   ├── images/
│   │   ├── coco/
│   │   ├── pixmo_count/
│   │   └── sketchy/
│   └── sketches/
│       └── SketchMIX/
│           ├── 0/
│           ├── ...
│           └── 364/
└── eval_data/
    ├── images/
    │   ├── coco/
    │   ├── pixmo_count/
    │   └── sketchy/
    └── sketches/
        ├── SketchVCL-C/
        ├── QuickDraw/
        ├── Sketchy/
        └── TU_Berlin/
Ensure your training and evaluation scripts point to the correct data_root path and that the machine has read access to these directories.
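If you are assembling data_root from scratch, the top-level skeleton shown above can be created in one pass; the numbered class subfolders (e.g., 0/ through 364/) come from the released sketch datasets themselves:

# Create the expected directory skeleton (class-indexed subfolders are populated by the datasets)
mkdir -p data_root/pretrain_data/images/{O365,OI}
mkdir -p data_root/pretrain_data/sketches/{SketchVCL-OI,SketchVCL-O365}
mkdir -p data_root/finetune_data/images/{coco,pixmo_count,sketchy}
mkdir -p data_root/finetune_data/sketches/SketchMIX
mkdir -p data_root/eval_data/images/{coco,pixmo_count,sketchy}
mkdir -p data_root/eval_data/sketches/{SketchVCL-C,QuickDraw,Sketchy,TU_Berlin}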
- Download pretrained model checkpoints (see Checkpoints section)
- Prepare your data (see Data Preparation section)
- Ensure conversation JSONs are in data_jsons/
conda activate o3slm
# Add your training command here
# Example: python train.py --config configs/train.yaml

Evaluation is performed using Evaluation/run_eval.sh. The script supports both local execution and Slurm cluster submission.
Before running evaluation, configure the following placeholders in run_eval.sh:
Slurm Parameters:
- <JOB_NAME>: Job name for Slurm
- <SLURM_OUTPUT_PATH>: Output log path (e.g., run_output/qd_Det.out)
- <SLURM_PARTITION>: Cluster partition (e.g., ada)
- <NTASKS>: Number of tasks (e.g., 1)
- <CPUS_PER_TASK>: CPUs per task (e.g., 8)
- <MEMORY>: Memory allocation (e.g., 32G)
- <GRES>: GPU resources (e.g., gpu:1)
- <TIME_LIMIT>: Time limit (e.g., 24:00:00)
Evaluation Parameters:
- <ENV_NAME>: Conda environment name (e.g., o3slm)
- <WORKDIR>: Project working directory
- <RUN_NAME>: Experiment run name (e.g., Molmo_qd_detect)
- <SKETCH_PATH>: Path to sketch dataset
- <DATASET_PATH>: Path to image dataset
- <MODEL_NAME>: Model to evaluate
- <DATASET_NAME>: Dataset identifier
- <TASK1>, <TASK2>: Tasks to run (typically count and detection)
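For illustration only (not the shipped script), a filled-in header using the example values above might look like the following; the exact variable names used inside run_eval.sh may differ from these.

#!/bin/bash
#SBATCH --job-name=qd_detect               # <JOB_NAME>
#SBATCH --output=run_output/qd_Det.out     # <SLURM_OUTPUT_PATH>
#SBATCH --partition=ada                    # <SLURM_PARTITION>
#SBATCH --ntasks=1                         # <NTASKS>
#SBATCH --cpus-per-task=8                  # <CPUS_PER_TASK>
#SBATCH --mem=32G                          # <MEMORY>
#SBATCH --gres=gpu:1                       # <GRES>
#SBATCH --time=24:00:00                    # <TIME_LIMIT>

# Example evaluation settings (variable names here are illustrative)
ENV_NAME=o3slm                             # <ENV_NAME>
WORKDIR=/path/to/O3SLM                     # <WORKDIR>
RUN_NAME=Molmo_qd_detect                   # <RUN_NAME>
SKETCH_PATH=/path/to/eval_data/sketches/QuickDraw/   # <SKETCH_PATH>
DATASET_PATH=/path/to/eval_data/images/pixmo_count   # <DATASET_PATH>
MODEL_NAME=Molmo                           # <MODEL_NAME>
DATASET_NAME=qd                            # <DATASET_NAME>
TASK1=count                                # <TASK1>
TASK2=detection                            # <TASK2>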
Models:
Molmo, LLaVA-OneVision, Pixtral, Qwen, O3SLM, GPT, Gemini
Datasets:
qd (QuickDraw), sketchy (Sketchy), tub (TU-Berlin), coco (COCO)
Tasks:
detection, count
Sketch Paths:
- eval_data/sketches/Sketchy/tx_000100000000/
- eval_data/sketches/QuickDraw/
- eval_data/sketches/TU_Berlin/
- eval_data/sketches/coco_sketches/
Run evaluation locally without Slurm:
conda activate o3slm
cd Evaluation
python count.py \
--name Molmo_qd_detect \
--sketch_path /path/to/eval_data/sketches/QuickDraw/ \
--dataset /path/to/eval_data/images/pixmo_count \
--model Molmo \
--dataset_name qd \
--task count
python detections.py \
--name Molmo_qd_detect \
--sketch_path /path/to/eval_data/sketches/QuickDraw/ \
--dataset /path/to/eval_data/images/pixmo_count \
--model Molmo \
--dataset_name qd \
--task detection
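To run both tasks back to back for the same model/dataset pair, a small shell loop over the two scripts works; this is a sketch that mirrors the two commands above (the run-name pattern is only an example):

# Sketch: run count.py and detections.py for the same run configuration
for task in count detection; do
  script=count.py
  [ "$task" = "detection" ] && script=detections.py
  python "$script" \
    --name "Molmo_qd_${task}" \
    --sketch_path /path/to/eval_data/sketches/QuickDraw/ \
    --dataset /path/to/eval_data/images/pixmo_count \
    --model Molmo \
    --dataset_name qd \
    --task "$task"
done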
To submit the evaluation job to a Slurm cluster:

sbatch Evaluation/run_eval.sh

@inproceedings{O3SLM2025,
  title={O3SLM: Sketch-Language Modeling},
  author={...},
  booktitle={...},
  year={2025}
}

- Built on the LLaVA codebase
- Additional acknowledgements to be added from the final paper