From 021e8a7fadba7953786c21b23217243ea938da9f Mon Sep 17 00:00:00 2001
From: Ramraj Nagar
Date: Wed, 28 Jan 2026 10:56:55 +0530
Subject: [PATCH] Add documentation for sparse data support

---
 README.md | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/README.md b/README.md
index 1030de3..f041f80 100644
--- a/README.md
+++ b/README.md
@@ -131,3 +131,33 @@ When training with leave-one-out validation, make sure to specify the drug index
 * `best.W`, `best.alpha`, `best.eps`: model parameters snapshot for each training stage
 * `best.test_hat`: Prediction on test set, using the best model for each stage
 * `.ckpt` files are the final models in tensorflow compatible format.
+
+# Working with Sparse Data (scRNAseq)
+
+CellBox supports training on sparse data formats (e.g., scRNAseq count matrices) to improve memory efficiency and performance.
+
+### 1. Data Preparation
+Convert your expression and perturbation matrices to `scipy.sparse` Compressed Sparse Row (CSR) format and save them as `.npz` files.
+
+```python
+import scipy.sparse
+import numpy as np
+
+# Save your sparse matrices
+scipy.sparse.save_npz('data/expr.npz', expr_matrix_csr)
+scipy.sparse.save_npz('data/pert.npz', pert_matrix_csr)
+```
+
+### 2. Configuration Update
+In your experiment configuration JSON file (e.g., `configs/MyExperiment.json`), set `sparse_data` to `true` and point to the `.npz` files:
+
+```json
+{
+  "sparse_data": true,
+  "expr_file": "expr.npz",
+  "pert_file": "pert.npz",
+  ...
+}
+```
+
+**Note**: When `sparse_data` is enabled, the Gaussian noise augmentation (`add_noise_level`) is currently not supported.
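
The data preparation step in this patch assumes `expr_matrix_csr` and `pert_matrix_csr` already exist. A minimal sketch of how they might be produced from dense NumPy arrays and checked after saving is shown below. The helper name `save_as_csr`, the toy shapes, and the temporary output directory are illustrative assumptions and not part of CellBox. The SciPy calls used are `scipy.sparse.csr_matrix`, `save_npz`, and `load_npz`.

```python
import os
import tempfile

import numpy as np
import scipy.sparse


def save_as_csr(dense, path):
    """Convert a dense 2-D array to CSR format and write it as a .npz file."""
    csr = scipy.sparse.csr_matrix(dense)
    scipy.sparse.save_npz(path, csr)
    return csr


# Toy stand-ins for real data (hypothetical shape: 4 samples x 5 nodes).
rng = np.random.default_rng(0)
expr_dense = rng.poisson(0.3, size=(4, 5)).astype(np.float32)
pert_dense = np.zeros((4, 5), dtype=np.float32)
pert_dense[np.arange(4), np.arange(4)] = 1.0  # one perturbed node per sample

# A temporary directory stands in for the project's data directory.
out_dir = tempfile.mkdtemp()
expr_path = os.path.join(out_dir, "expr.npz")
pert_path = os.path.join(out_dir, "pert.npz")
save_as_csr(expr_dense, expr_path)
save_as_csr(pert_dense, pert_path)

# Reload to confirm the files round-trip and are still in CSR format.
expr_loaded = scipy.sparse.load_npz(expr_path)
pert_loaded = scipy.sparse.load_npz(pert_path)
print(expr_loaded.format, expr_loaded.shape, pert_loaded.nnz)
```

The patch saves to `data/` while the config lists bare file names, which suggests file names are resolved relative to a data directory. That is an inference from the patch, not something it states.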