72 changes: 60 additions & 12 deletions docs/deconvolution/deconvoluting_spatial_data_with_panpipes.md
@@ -1,6 +1,6 @@
# Deconvoluting spatial data with Panpipes

The `deconvolution_spatial` workflow runs deconvolution for spatial data. Multiple slides can be deconvoluted with the same reference in one run. For each spatial slide, the workflow expects one `MuData` object, with the spatial data saved in `mudata.mod['spatial']`. For the reference scRNA-Seq data, it expects a `MuData`, with the gene expression data saved in `mudata.mod['rna']`. The steps of the workflow are explained in greater detail [here](https://panpipes-pipelines.readthedocs.io/en/latest/workflows/deconvolute_spatial.html).
The `deconvolution_spatial` workflow runs deconvolution for spatial data. Multiple slides can be deconvoluted with the same reference in one run. For each spatial slide, the workflow expects one `SpatialData` object. For the reference scRNA-Seq data, it expects a `MuData`, with the gene expression data saved in `mudata.mod['rna']`. The steps of the workflow are explained in greater detail [here](https://panpipes-pipelines.readthedocs.io/en/latest/workflows/deconvolute_spatial.html).
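
As a quick orientation, the sketch below (not part of the workflow itself) shows how such inputs could be opened and checked; the file paths are the ones used later in this tutorial:

```
import muon as mu
import spatialdata as sd

# One SpatialData object per spatial slide:
sdata = sd.read_zarr("./data/spatial_data/Human_Heart.zarr")
print(sdata.tables)  # the expression table(s) of the slide

# One MuData object for the scRNA-Seq reference:
mdata = mu.read("./data/Human_Heart_reference.h5mu")
print(mdata.mod["rna"])  # the gene expression modality the workflow expects
```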

For all the tutorials, we will append the `--local` flag, which ensures that the pipeline runs on the computing node you're currently on, namely your local machine or an interactive session on a computing node of a cluster.

@@ -17,10 +17,11 @@ cd deconvolution
```

In this tutorial, we will use human heart spatial transcriptomics and scRNA-Seq data published by [Kuppe et al](https://www.nature.com/articles/s41586-022-05060-x).
You can use the following python code to download and save the data as `MuData` files:
You can use the following python code to download and save the data:

```
import scanpy as sc
import spatialdata as sd
import muon as mu

# Loading the spatial data:
@@ -33,7 +34,8 @@ adata_st = adata_st[adata_st.obs["patient"] == "P1"]
del adata_st.uns["spatial"]["control_P17"]
del adata_st.uns["spatial"]["control_P7"]
del adata_st.uns["spatial"]["control_P8"]
mu.MuData({"spatial": adata_st}).write_h5mu("./data/spatial_data/Human_Heart.h5mu")
sdata_st = sd.SpatialData(tables=adata_st)
sdata_st.write("./data/spatial_data/Human_Heart.zarr")

# Loading the single-cell reference
adata_sc = sc.read(
@@ -58,26 +60,25 @@ deconvolution
└── data
|── Human_Heart_reference.h5mu
└── spatial_data
└── Human_Heart.h5mu
└── Human_Heart.zarr
```


**Note, that the `MuData` object of the reference data should not be saved into the same folder as the spatial data!**
*When running the workflow, the workflow expects all spatial `MuData` objects to be saved in the same folder and reads in all `MuData` objects of that directory. When saving the reference `MuData` into the same folder as the spatial, the workflow would treat the reference `MuData` as spatial and would try to run deconvolution on it.*

*The workflow expects all spatial `SpatialData` objects to be saved in the same folder and reads in every `SpatialData` object of that directory.*
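
*Conceptually, the input discovery behaves like this minimal sketch (for illustration only, not the workflow's actual code):*

```
from pathlib import Path

import spatialdata as sd

# Every .zarr store in the spatial input folder is treated as one slide:
for zarr_path in Path("./data/spatial_data").glob("*.zarr"):
    sdata = sd.read_zarr(zarr_path)
    print(f"Slide that will be deconvoluted: {zarr_path.stem}")
```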


### Create yaml file

In `spatial/deconvolution`, create the pipeline.yml and pipeline.log files by running `panpipes deconvolution_spatial config` in the command line (you potentially need to activate the conda environment with `conda activate pipeline_env` first!).


## Cell2Location

### Edit yaml file

In `spatial/deconvolution`, create the pipeline.yml and pipeline.log files by running `panpipes deconvolution_spatial config` in the command line (you potentially need to activate the conda environment with `conda activate pipeline_env` first!).
Modify the yaml file, or simply use the [pipeline.yml](pipeline_yml.md) that we provide (you potentially need to add the path of the conda environment in the yaml).



### Run Panpipes

Run the full workflow with `panpipes deconvolution_spatial make full --local`.
@@ -90,11 +91,11 @@ deconvolution
│   └── Human_Heart
│       ├── Cell2Loc_inf_aver.csv
│       ├── Cell2Loc_screference_output.h5mu
│       └── Cell2Loc_spatial_output.h5mu
│       └── Cell2Loc_spatial_output.zarr
├── data
│   ├── Human_Heart_reference.h5mu
│   └── spatial_data
│       └── Human_Heart.h5mu
│       └── Human_Heart.zarr
├── figures
│   └── Cell2Location
│       └── Human_Heart
@@ -114,7 +115,8 @@ deconvolution


In the folder `./cell2location.output`, a folder for each slide will be created, containing the following outputs:
* `MuData` containing the spatial data together with the posterior of the spatial mapping model
* `SpatialData` containing the spatial data together with the posterior of the spatial mapping model
* If `export_gene_by_spot = True`, a gene-by-spot matrix for each cell type, stored in a layer
* `MuData` containing the reference data together with the posterior of the reference model
* A csv-file `Cell2Loc_inf_aver.csv` containing the estimated expression of every gene in every cell type
* If `save_models = True`, the reference model and the spatial mapping model
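
The saved outputs can be inspected directly; here is a minimal sketch (the `"table"` element name and the `q05_cell_abundance_w_sf` key follow common SpatialData and Cell2Location conventions and are assumptions, not guaranteed by the workflow):

```
import muon as mu
import spatialdata as sd

sdata = sd.read_zarr("./cell2location.output/Human_Heart/Cell2Loc_spatial_output.zarr")
adata = sdata["table"]  # assumption: the annotated table is stored under "table"

# Cell2Location conventionally stores the posterior cell type abundances
# in .obsm, e.g. under "q05_cell_abundance_w_sf":
print(adata.obsm_keys())

mdata_ref = mu.read("./cell2location.output/Human_Heart/Cell2Loc_screference_output.h5mu")
print(mdata_ref)
```
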
@@ -137,5 +139,51 @@ Also in `./figures/Cell2Location`, a folder for each slide will be created. Each
* Spatial plot where the spots of the slide are colored by the estimated cell type abundances
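
To recreate such a plot manually from the saved output, a minimal sketch (again assuming the Cell2Location key names from above):

```
import matplotlib.pyplot as plt
import spatialdata as sd

sdata = sd.read_zarr("./cell2location.output/Human_Heart/Cell2Loc_spatial_output.zarr")
adata = sdata["table"]  # assumption: table stored under "table"

abundances = adata.obsm["q05_cell_abundance_w_sf"]  # assumed Cell2Location key
coords = adata.obsm["spatial"]                      # spot coordinates

# Color the spots of the slide by the estimated abundance of the first cell type:
plt.scatter(coords[:, 0], coords[:, 1], c=abundances.iloc[:, 0], s=8)
plt.gca().invert_yaxis()  # image coordinates: y axis points down
plt.colorbar(label=abundances.columns[0])
plt.show()
```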


## Tangram

### Edit yaml file

Modify the yaml file, or simply use the [pipeline.yml](../deconvolution_tangram/pipeline_yml.md) that we provide (you potentially need to add the path of the conda environment in the yaml).

### Run Panpipes

Run the full workflow with `panpipes deconvolution_spatial make full --local`.

Once `Panpipes` has finished, the `spatial/deconvolution` directory will have the following structure:

```
deconvolution
├── data
│   ├── Human_Heart_reference.h5mu
│   └── spatial_data
│       └── Human_Heart.zarr
├── figures
│   └── Tangram
│       └── Human_Heart
│           ├── rank_genes_groups_cell_type_original.png
│           └── show_tangram_ct_pred.png
├── logs
│   └── 2_Tangram_Human_Heart.log
├── pipeline.log
├── pipeline.yml
└── tangram.output
    └── Human_Heart
        ├── Tangram_screference_output.h5mu
        ├── Tangram_spatial_output.zarr
        └── rank_genes_groups.csv
```

In the folder `./tangram.output`, a folder for each slide will be created, containing the following outputs:
* `SpatialData` containing the spatial data with the projected cell annotations
* `MuData` containing the reference data
* If gene selection via rank genes is performed: a csv-file `rank_genes_groups.csv` containing the ranked genes
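
A minimal sketch for inspecting these outputs (the `"table"` element name and the `tangram_ct_pred` key are SpatialData and Tangram conventions and are assumptions here):

```
import spatialdata as sd

sdata = sd.read_zarr("./tangram.output/Human_Heart/Tangram_spatial_output.zarr")
adata = sdata["table"]  # assumption: table stored under "table"

# Tangram conventionally stores the projected cell type probabilities per spot
# in .obsm["tangram_ct_pred"], with one column per cell type:
print(adata.obsm["tangram_ct_pred"].head())
```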


Also in `./figures/Tangram`, a folder for each slide will be created. Each folder will contain the following plots:
* If gene selection via rank genes is performed: a plot of the rank genes groups ([sc.pl.rank_genes_groups](https://scanpy.readthedocs.io/en/1.10.x/api/generated/scanpy.pl.rank_genes_groups.html))
* Spatial plot where the spots of the slide are colored by the predicted cell type probabilities




*Note: We find that keeping the suggested directory structure (one main directory per project with all the individual steps in separate folders) is useful for project management. You can of course customize your directories as you prefer, and change the paths accordingly in the `pipeline.yml` config files!*
6 changes: 4 additions & 2 deletions docs/deconvolution/pipeline.yml
@@ -68,7 +68,8 @@ Cell2Location:
categorical_covariate_keys: # Comma-separated without spaces; default None; keys in adata.obs that correspond to categorical data. These covariates can be added in addition to the batch covariate and are also treated as nuisance factors (i.e., the model tries to minimize their effects on the latent space)
continuous_covariate_keys: # Comma-separated without spaces; default None; keys in adata.obs that correspond to continuous data. These covariates can be added in addition to the batch covariate and are also treated as nuisance factors (i.e., the model tries to minimize their effects on the latent space)
max_epochs: 400 # Default np.min([round((20000 / n_cells) * 400), 400])
use_gpu: # Default True; whether to use GPU for training
#use_gpu: # Default True; whether to use GPU for training
accelerator: auto

# -------------------------------
# Spatial mapping model parameters
@@ -85,7 +86,8 @@ Cell2Location:
detection_alpha: 20 # Regularization of within-experiment variation in RNA detection sensitivity

max_epochs: 400 # Default np.min([round((20000 / n_cells) * 400), 400])
use_gpu: # Default True; whether to use GPU for training
#use_gpu: # Default True; whether to use GPU for training
accelerator: auto

# -------------------------------
save_models: False # Default False; whether to save the reference and spatial mapping models
127 changes: 127 additions & 0 deletions docs/deconvolution_tangram/pipeline.yml
@@ -0,0 +1,127 @@
# =============================================================================================
# Deconvolution, spatial transcriptomics workflow Panpipes (pipeline_deconvolution_spatial.py)
# =============================================================================================
# Written by Sarah Ouologuem


# ---------------------------
# 0. Compute resource options
# ---------------------------
resources:
# Number of threads used for parallel jobs
threads_high: 1 # Number of threads used for high-intensity computing tasks
threads_medium: 1 # Must have enough memory to load your mudata and do computationally light tasks
threads_low: 1 # Must have enough memory to load text files and do plotting; requires much less memory

condaenv: # Path to conda env, leave blank if running native or your cluster automatically inherits the login node environment



# ----------------------
# 1. Specify input
# ----------------------

# One or multiple slides can be deconvoluted with the same reference in one run. For that, one SpatialData object for each slide is expected.
input:
spatial: ./data/spatial_data # Path to folder containing one or multiple SpatialData objects of spatial data. The pipeline reads in all SpatialData objects in that folder and assumes that they are spatial slides.
# For all SpatialData objects in that folder, deconvolution is run by the pipeline.
# For each spatial SpatialData object, deconvolution is run with "singlecell" (see below) as a reference
singlecell: ./data/Human_Heart_reference.h5mu # Path to the MuData file of the reference single-cell data, reference data expected to be saved in mudata.mod["rna"]



# ----------------------
## 2. Cell2Location
# ----------------------

Cell2Location:
run: False # Whether to run Cell2Location

# -------------------------------
# Feature selection parameters
# -------------------------------
feature_selection:
# Reduced feature set can either be given a) via a csv file of genes or b) feature selection will be performed à la Cell2Location, i.e. via the function: cell2location.utils.filtering.filter_genes()
# If no file is given in a), b) will be run, i.e. feature selection is not optional.

# a) Path to a csv file containing a reduced feature set
gene_list: # A header in the csv is expected in the first row

# b) Parameters for Cell2Location's feature selection, leave empty to use defaults
# Whether to remove mitochondrial genes before feature selection
remove_mt: # Default True
# All genes detected in fewer than cell_count_cutoff cells will be excluded.
cell_count_cutoff: # Default 15, parameter of function cell2location.utils.filtering.filter_genes()
# All genes detected in at least this percentage of cells will be included.
cell_percentage_cutoff2: # Default 0.05, parameter of function cell2location.utils.filtering.filter_genes()
# Genes detected in the number of cells between the above-mentioned cutoffs are selected only when their average expression in non-zero cells is above this cutoff
nonz_mean_cutoff: # Default 1.12, parameter of function cell2location.utils.filtering.filter_genes()

# -------------------------------
# Reference model parameters
# Leave empty to use defaults
# -------------------------------
reference:
labels_key: cell_type_original # Default None, key in adata.obs for label (cell type) information
batch_key: # Default None, key in adata.obs for batch information
layer: # Default None (if None, X will be used), Layer of the raw (!) counts
categorical_covariate_keys: # Comma-separated without spaces; default None; keys in adata.obs that correspond to categorical data. These covariates can be added in addition to the batch covariate and are also treated as nuisance factors (i.e., the model tries to minimize their effects on the latent space)
continuous_covariate_keys: # Comma-separated without spaces; default None; keys in adata.obs that correspond to continuous data. These covariates can be added in addition to the batch covariate and are also treated as nuisance factors (i.e., the model tries to minimize their effects on the latent space)
max_epochs: 400 # Default np.min([round((20000 / n_cells) * 400), 400])
#use_gpu: # Default True; whether to use GPU for training
accelerator: auto

# -------------------------------
# Spatial mapping model parameters
# Leave empty to use defaults
# -------------------------------
spatial:
batch_key: # Default None, key in adata.obs for batch information
layer: # Default None (if None, X will be used), Layer of the raw (!) counts
categorical_covariate_keys: # Comma-separated without spaces; default None; keys in adata.obs that correspond to categorical data. These covariates can be added in addition to the batch covariate and are also treated as nuisance factors (i.e., the model tries to minimize their effects on the latent space)
continuous_covariate_keys: # Comma-separated without spaces; default None; keys in adata.obs that correspond to continuous data. These covariates can be added in addition to the batch covariate and are also treated as nuisance factors (i.e., the model tries to minimize their effects on the latent space)

# The following two parameters must be specified (cannot leave empty), otherwise an error will be thrown:
N_cells_per_location: 8 # Expected cell abundance per voxel
detection_alpha: 20 # Regularization of within-experiment variation in RNA detection sensitivity

max_epochs: 400 # Default np.min([round((20000 / n_cells) * 400), 400])
#use_gpu: # Default True; whether to use GPU for training
accelerator: auto

# -------------------------------
save_models: False # Default False; whether to save the reference and spatial mapping models



# -------------
## 3. Tangram
# -------------

Tangram:
run: True # Whether to run Tangram

# -------------------------------
# Feature selection parameters
# -------------------------------
# Reduced feature set can either be given a) via a csv file of genes or b) sc.tl.rank_genes_groups() is run and the top n markers of each group are selected
# If no file is given in a), b) will be run, i.e. feature selection is not optional.
feature_selection:
# a) Path to a csv file containing a reduced feature set
gene_list: # A header in the csv is expected in the first row

# b) Parameters for sc.tl.rank_genes_groups() gene selection.
rank_genes:
labels_key: cell_type_original # Which column in .obs of the reference to use for the 'groupby' parameter of sc.tl.rank_genes_groups()
layer: Null # Default None, which layer of the reference to use for sc.tl.rank_genes_groups(). If Null (i.e. None), uses .X
n_genes: 100 # Default 100, how many top genes to select from each 'groupby' group
test_method: wilcoxon # Default t-test_overestim_var, which test method to use. One of ['logreg', 't-test', 'wilcoxon', 't-test_overestim_var']
correction_method: benjamini-hochberg # Default benjamini-hochberg, which p-value correction method to use. One of ['benjamini-hochberg', 'bonferroni']. Used only for 't-test', 't-test_overestim_var', and 'wilcoxon'

model:
labels_key: cell_type_original # Default None, cell type key in the reference .obs
num_epochs: 100 # Default 1000. Number of epochs for tangram.mapping_utils.map_cells_to_space()
device: cpu # Default cpu. Device to use for deconvolution
kwargs: # Parameters for tangram.mapping_utils.map_cells_to_space(), feel free to add or remove parameters below




7 changes: 7 additions & 0 deletions docs/deconvolution_tangram/pipeline_yml.md
@@ -0,0 +1,7 @@
Pipeline deconvolution yml
-----------------------------

Download this [file](pipeline.yml)

```{literalinclude} pipeline.yml
```
@@ -29,7 +29,7 @@ spatial
│ └── datasets_mouse_brain_map_BrainReceptorShowcase_Slice1_Replicate1_images_micron_to_mosaic_pixel_transform.csv
```

**Please note, that the data folder structure needs to be structured as expected by the [squidpy.read.vizgen](https://squidpy.readthedocs.io/en/stable/api/squidpy.read.vizgen.html) function.**
**Please note that the data folder needs to be structured as expected by the [spatialdata_io.merscope](https://spatialdata.scverse.org/projects/io/en/latest/generated/spatialdata_io.merscope.html) function.**
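
To verify the folder layout before running the workflow, you can try reading the data yourself; a minimal sketch, assuming `spatialdata-io` is installed (the `./data` path is a placeholder for the folder described above):

```
from spatialdata_io import merscope

# merscope() expects the MERSCOPE output folder layout described above:
sdata = merscope("./data")
print(sdata)
```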


## Edit submission and yaml file
@@ -60,20 +60,20 @@ ingestion
│   ├── violin_obs_total_counts_sample_id.mouse_brain.png
│   └── violin_var_total_counts.mouse_brain.png
├── logs
│   ├── make_mudatas_mouse_brain.log
│   ├── make_spatialdatas_mouse_brain.log
│   ├── qcplot.mouse_brain.log
│   └── spatialQC_mouse_brain.log
├── qc.data # MuData with QC metrics
│   └── mouse_brain_unfilt.h5mu
├── tmp # MuData without QC metrics
│   └── mouse_brain_raw.h5mu
├── qc.data # SpatialData with QC metrics
│   └── mouse_brain_unfilt.zarr
├── tmp # SpatialData without QC metrics
│   └── mouse_brain_raw.zarr
├── pipeline.log
├── pipeline.yml
├── sample_file_qc_spatial.txt
└── mouse_brain_cell_metadata.tsv # Metadata, i.e. .obs
```

In the `qc.data` folder, the final `MuData` object with computed QC metrics is stored. The `MuData` object without QC metrics is also available and stored in the `tmp` folder. The metadata of the final `Mudata` object is additionally extracted and saved as tsv file, `mouse_brain_cell_metadata.tsv`.
In the `qc.data` folder, the final `SpatialData` object with computed QC metrics is stored. The `SpatialData` object without QC metrics is stored in the `tmp` folder. The metadata of the final `SpatialData` object is additionally extracted and saved as a tsv file, `mouse_brain_cell_metadata.tsv`.
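
A minimal sketch for opening these outputs (assuming the table element is stored under `"table"`):

```
import pandas as pd
import spatialdata as sd

sdata = sd.read_zarr("./qc.data/mouse_brain_unfilt.zarr")
print(sdata["table"].obs.head())  # the computed QC metrics live in .obs

metadata = pd.read_csv("./mouse_brain_cell_metadata.tsv", sep="\t")
print(metadata.head())
```
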
Using the [provided example yaml file](pipeline_yml.md), the first rows and columns of the `mouse_brain_cell_metadata` tsv file look as follows:

| | spatial:fov | spatial:volume | spatial:min_x | spatial:max_x | spatial:min_y | spatial:max_y
2 changes: 1 addition & 1 deletion docs/ingesting_merfish_data/pipeline.yml
@@ -32,7 +32,7 @@ condaenv: # Path to conda env, leave blank if running native or your cluster au

project: test
# Submission_file format (example at resources/sample_file_qc_spatial.txt) :
submission_file: sample_file_qc_spatial.txt # Path to submission file
submission_file: sample_file_qc_merfish.txt # Path to submission file



2 changes: 2 additions & 0 deletions docs/ingesting_merfish_data/sample_file_qc_merfish.txt
@@ -0,0 +1,2 @@
sample_id spatial_path spatial_filetype vpt_cell_by_gene vpt_cell_metadata vpt_cell_boundaries
mouse_brain data vizgen
3 changes: 0 additions & 3 deletions docs/ingesting_merfish_data/sample_file_qc_spatial.txt

This file was deleted.
