72 changes: 60 additions & 12 deletions docs/deconvolution/deconvoluting_spatial_data_with_panpipes.md
@@ -1,6 +1,6 @@
# Deconvoluting spatial data with Panpipes

The `deconvolution_spatial` workflow runs deconvolution for spatial data. Multiple slides can be deconvoluted with the same reference in one run. For each spatial slide, the workflow expects one `MuData` object, with the spatial data saved in `mudata.mod['spatial']`. For the reference scRNA-Seq data, it expects a `MuData`, with the gene expression data saved in `mudata.mod['rna']`. The steps of the workflow are explained in greater detail [here](https://panpipes-pipelines.readthedocs.io/en/latest/workflows/deconvolute_spatial.html).
The `deconvolution_spatial` workflow runs deconvolution for spatial data. Multiple slides can be deconvoluted with the same reference in one run. For each spatial slide, the workflow expects one `SpatialData` object. For the reference scRNA-Seq data, it expects a `MuData`, with the gene expression data saved in `mudata.mod['rna']`. The steps of the workflow are explained in greater detail [here](https://panpipes-pipelines.readthedocs.io/en/latest/workflows/deconvolute_spatial.html).
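
As a quick orientation, the sketch below (not part of the workflow itself) shows how such inputs could be opened and checked; the file paths are the ones used later in this tutorial:

```
import muon as mu
import spatialdata as sd

# One SpatialData object per spatial slide:
sdata = sd.read_zarr("./data/spatial_data/Human_Heart.zarr")
print(sdata.tables)  # the expression table(s) of the slide

# One MuData object for the scRNA-Seq reference:
mdata = mu.read("./data/Human_Heart_reference.h5mu")
print(mdata.mod["rna"])  # the gene expression modality the workflow expects
```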

For all the tutorials, we will append the `--local` flag, which ensures that the pipeline runs on the computing node you're currently on, namely your local machine or an interactive session on a computing node of a cluster.

@@ -17,10 +17,11 @@ cd deconvolution
```

In this tutorial, we will use human heart spatial transcriptomics and scRNA-Seq data published by [Kuppe et al](https://www.nature.com/articles/s41586-022-05060-x).
You can use the following python code to download and save the data as `MuData` files:
You can use the following python code to download and save the data:

```
import scanpy as sc
import spatialdata as sd
import muon as mu

# Loading the spatial data:
@@ -33,7 +34,8 @@ adata_st = adata_st[adata_st.obs["patient"] == "P1"]
del adata_st.uns["spatial"]["control_P17"]
del adata_st.uns["spatial"]["control_P7"]
del adata_st.uns["spatial"]["control_P8"]
mu.MuData({"spatial": adata_st}).write_h5mu("./data/spatial_data/Human_Heart.h5mu")
sdata_st = sd.SpatialData(tables=adata_st)
sdata_st.write("./data/spatial_data/Human_Heart.zarr")

# Loading the single-cell reference
adata_sc = sc.read(
@@ -58,26 +60,25 @@ deconvolution
└── data
|── Human_Heart_reference.h5mu
└── spatial_data
└── Human_Heart.h5mu
└── Human_Heart.zarr
```


**Note, that the `MuData` object of the reference data should not be saved into the same folder as the spatial data!**
*When running the workflow, the workflow expects all spatial `MuData` objects to be saved in the same folder and reads in all `MuData` objects of that directory. When saving the reference `MuData` into the same folder as the spatial, the workflow would treat the reference `MuData` as spatial and would try to run deconvolution on it.*

*The workflow expects all spatial `SpatialData` objects to be saved in the same folder and reads in every `SpatialData` object of that directory.*
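
*Conceptually, the input discovery behaves like this minimal sketch (for illustration only, not the workflow's actual code):*

```
from pathlib import Path

import spatialdata as sd

# Every .zarr store in the spatial input folder is treated as one slide:
for zarr_path in Path("./data/spatial_data").glob("*.zarr"):
    sdata = sd.read_zarr(zarr_path)
    print(f"Slide that will be deconvoluted: {zarr_path.stem}")
```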


### Create yaml file

In `spatial/deconvolution`, create the pipeline.yml and pipeline.log files by running `panpipes deconvolution_spatial config` in the command line (you potentially need to activate the conda environment with `conda activate pipeline_env` first!).


## Cell2Location

### Edit yaml file

In `spatial/deconvolution`, create the pipeline.yml and pipeline.log files by running `panpipes deconvolution_spatial config` in the command line (you potentially need to activate the conda environment with `conda activate pipeline_env` first!).
Modify the yaml file, or simply use the [pipeline.yml](pipeline_yml.md) that we provide (you potentially need to add the path of the conda environment in the yaml).



### Run Panpipes

Run the full workflow with `panpipes deconvolution_spatial make full --local`.
@@ -90,11 +91,11 @@ deconvolution
│   └── Human_Heart
│       ├── Cell2Loc_inf_aver.csv
│       ├── Cell2Loc_screference_output.h5mu
│       └── Cell2Loc_spatial_output.h5mu
│       └── Cell2Loc_spatial_output.zarr
├── data
│   ├── Human_Heart_reference.h5mu
│   └── spatial_data
│       └── Human_Heart.h5mu
│       └── Human_Heart.zarr
├── figures
│   └── Cell2Location
│       └── Human_Heart
@@ -114,7 +115,8 @@ deconvolution


In the folder `./cell2location.output`, a folder for each slide will be created, containing the following outputs:
* `MuData` containing the spatial data together with the posterior of the spatial mapping model
* `SpatialData` containing the spatial data together with the posterior of the spatial mapping model
* If `export_gene_by_spot = True`, a gene-by-spot matrix for each cell type, stored in a layer
* `MuData` containing the reference data together with the posterior of the reference model
* A csv-file `Cell2Loc_inf_aver.csv` containing the estimated expression of every gene in every cell type
* If `save_models = True`, the reference model and the spatial mapping model
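
The saved outputs can be inspected directly; here is a minimal sketch (the `"table"` element name and the `q05_cell_abundance_w_sf` key follow common SpatialData and Cell2Location conventions and are assumptions, not guaranteed by the workflow):

```
import muon as mu
import spatialdata as sd

sdata = sd.read_zarr("./cell2location.output/Human_Heart/Cell2Loc_spatial_output.zarr")
adata = sdata["table"]  # assumption: the annotated table is stored under "table"

# Cell2Location conventionally stores the posterior cell type abundances
# in .obsm, e.g. under "q05_cell_abundance_w_sf":
print(adata.obsm_keys())

mdata_ref = mu.read("./cell2location.output/Human_Heart/Cell2Loc_screference_output.h5mu")
print(mdata_ref)
```
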
@@ -137,5 +139,51 @@ Also in `./figures/Cell2Location`, a folder for each slide will be created. Each
* Spatial plot where the spots of the slide are colored by the estimated cell type abundances
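
To recreate such a plot manually from the saved output, a minimal sketch (again assuming the Cell2Location key names from above):

```
import matplotlib.pyplot as plt
import spatialdata as sd

sdata = sd.read_zarr("./cell2location.output/Human_Heart/Cell2Loc_spatial_output.zarr")
adata = sdata["table"]  # assumption: table stored under "table"

abundances = adata.obsm["q05_cell_abundance_w_sf"]  # assumed Cell2Location key
coords = adata.obsm["spatial"]                      # spot coordinates

# Color the spots of the slide by the estimated abundance of the first cell type:
plt.scatter(coords[:, 0], coords[:, 1], c=abundances.iloc[:, 0], s=8)
plt.gca().invert_yaxis()  # image coordinates: y axis points down
plt.colorbar(label=abundances.columns[0])
plt.show()
```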


## Tangram

### Edit yaml file

Modify the yaml file, or simply use the [pipeline.yml](../deconvolution_tangram/pipeline_yml.md) that we provide (you potentially need to add the path of the conda environment in the yaml).

### Run Panpipes

Run the full workflow with `panpipes deconvolution_spatial make full --local`.

Once `Panpipes` has finished, the `spatial/deconvolution` directory will have the following structure:

```
deconvolution
├── data
│   ├── Human_Heart_reference.h5mu
│   └── spatial_data
│       └── Human_Heart.zarr
├── figures
│   └── Tangram
│       └── Human_Heart
│           ├── rank_genes_groups_cell_type_original.png
│           └── show_tangram_ct_pred.png
├── logs
│   └── 2_Tangram_Human_Heart.log
├── pipeline.log
├── pipeline.yml
└── tangram.output
    └── Human_Heart
        ├── Tangram_screference_output.h5mu
        ├── Tangram_spatial_output.zarr
        └── rank_genes_groups.csv
```

In the folder `./tangram.output`, a folder for each slide will be created, containing the following outputs:
* `SpatialData` containing the spatial data with the projected cell annotations
* `MuData` containing the reference data
* If gene selection via rank genes is performed: a csv-file `rank_genes_groups.csv` containing the ranked genes
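
A minimal sketch for inspecting these outputs (the `"table"` element name and the `tangram_ct_pred` key are SpatialData and Tangram conventions and are assumptions here):

```
import spatialdata as sd

sdata = sd.read_zarr("./tangram.output/Human_Heart/Tangram_spatial_output.zarr")
adata = sdata["table"]  # assumption: table stored under "table"

# Tangram conventionally stores the projected cell type probabilities per spot
# in .obsm["tangram_ct_pred"], with one column per cell type:
print(adata.obsm["tangram_ct_pred"].head())
```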


Also in `./figures/Tangram`, a folder for each slide will be created. Each folder will contain the following plots:
* If gene selection via rank genes is performed: a plot of the rank genes groups ([sc.pl.rank_genes_groups](https://scanpy.readthedocs.io/en/1.10.x/api/generated/scanpy.pl.rank_genes_groups.html))
* Spatial plot where the spots of the slide are colored by the predicted cell type probabilities




*Note: We find that keeping the suggested directory structure (one main directory per project with all the individual steps in separate folders) is useful for project management. You can of course customize your directories as you prefer, and change the paths accordingly in the `pipeline.yml` config files!*
6 changes: 4 additions & 2 deletions docs/deconvolution/pipeline.yml
@@ -68,7 +68,8 @@ Cell2Location:
categorical_covariate_keys: # Comma-separated without spaces; default None; keys in adata.obs that correspond to categorical data. These covariates can be added in addition to the batch covariate and are also treated as nuisance factors (i.e., the model tries to minimize their effects on the latent space)
continuous_covariate_keys: # Comma-separated without spaces; default None; keys in adata.obs that correspond to continuous data. These covariates can be added in addition to the batch covariate and are also treated as nuisance factors (i.e., the model tries to minimize their effects on the latent space)
max_epochs: 400 # Default np.min([round((20000 / n_cells) * 400), 400])
use_gpu: # Default True; whether to use GPU for training
#use_gpu: # Default True; whether to use GPU for training
accelerator: auto

# -------------------------------
# Spatial mapping model parameters
@@ -85,7 +86,8 @@ Cell2Location:
detection_alpha: 20 # Regularization of within-experiment variation in RNA detection sensitivity

max_epochs: 400 # Default np.min([round((20000 / n_cells) * 400), 400])
use_gpu: # Default True; whether to use GPU for training
#use_gpu: # Default True; whether to use GPU for training
accelerator: auto

# -------------------------------
save_models: False # Default False; whether to save the reference and spatial mapping models
127 changes: 127 additions & 0 deletions docs/deconvolution_tangram/pipeline.yml
@@ -0,0 +1,127 @@
# =============================================================================================
# Deconvolution, spatial transcriptomics workflow Panpipes (pipeline_deconvolution_spatial.py)
# =============================================================================================
# Written by Sarah Ouologuem


# ---------------------------
# 0. Compute resource options
# ---------------------------
resources:
# Number of threads used for parallel jobs
threads_high: 1 # Number of threads used for high-intensity computing tasks
threads_medium: 1 # Must have enough memory to load your mudata and do computationally light tasks
threads_low: 1 # Must have enough memory to load text files and do plotting; requires much less memory

condaenv: # Path to conda env, leave blank if running native or your cluster automatically inherits the login node environment



# ----------------------
# 1. Specify input
# ----------------------

# One or multiple slides can be deconvoluted with the same reference in one run. For that, one SpatialData object for each slide is expected.
input:
spatial: ./data/spatial_data # Path to folder containing one or multiple SpatialData objects of spatial data. The pipeline reads in all SpatialData objects in that folder and assumes that they are spatial slides.
# For all SpatialData objects in that folder, deconvolution is run by the pipeline.
# For each spatial SpatialData object, deconvolution is run with "singlecell" (see below) as a reference
singlecell: ./data/Human_Heart_reference.h5mu # Path to the MuData file of the reference single-cell data, reference data expected to be saved in mudata.mod["rna"]



# ----------------------
## 2. Cell2Location
# ----------------------

Cell2Location:
run: False # Whether to run Cell2Location

# -------------------------------
# Feature selection parameters
# -------------------------------
feature_selection:
# Reduced feature set can either be given a) via a csv file of genes or b) feature selection will be performed à la Cell2Location, i.e. via the function: cell2location.utils.filtering.filter_genes()
# If no file is given in a), b) will be run, i.e. feature selection is not optional.

# a) Path to a csv file containing a reduced feature set
gene_list: # A header in the csv is expected in the first row

# b) Parameters for Cell2Location's feature selection, leave empty to use defaults
# Whether to remove mitochondrial genes before feature selection
remove_mt: # Default True
# All genes detected in fewer than cell_count_cutoff cells will be excluded.
cell_count_cutoff: # Default 15, parameter of function cell2location.utils.filtering.filter_genes()
# All genes detected in at least this percentage of cells will be included.
cell_percentage_cutoff2: # Default 0.05, parameter of function cell2location.utils.filtering.filter_genes()
# Genes detected in the number of cells between the above-mentioned cutoffs are selected only when their average expression in non-zero cells is above this cutoff
nonz_mean_cutoff: # Default 1.12, parameter of function cell2location.utils.filtering.filter_genes()

# -------------------------------
# Reference model parameters
# Leave empty to use defaults
# -------------------------------
reference:
labels_key: cell_type_original # Default None, key in adata.obs for label (cell type) information
batch_key: # Default None, key in adata.obs for batch information
layer: # Default None (if None, X will be used), Layer of the raw (!) counts
categorical_covariate_keys: # Comma-separated without spaces; default None; keys in adata.obs that correspond to categorical data. These covariates can be added in addition to the batch covariate and are also treated as nuisance factors (i.e., the model tries to minimize their effects on the latent space)
continuous_covariate_keys: # Comma-separated without spaces; default None; keys in adata.obs that correspond to continuous data. These covariates can be added in addition to the batch covariate and are also treated as nuisance factors (i.e., the model tries to minimize their effects on the latent space)
max_epochs: 400 # Default np.min([round((20000 / n_cells) * 400), 400])
#use_gpu: # Default True; whether to use GPU for training
accelerator: auto

# -------------------------------
# Spatial mapping model parameters
# Leave empty to use defaults
# -------------------------------
spatial:
batch_key: # Default None, key in adata.obs for batch information
layer: # Default None (if None, X will be used), Layer of the raw (!) counts
categorical_covariate_keys: # Comma-separated without spaces; default None; keys in adata.obs that correspond to categorical data. These covariates can be added in addition to the batch covariate and are also treated as nuisance factors (i.e., the model tries to minimize their effects on the latent space)
continuous_covariate_keys: # Comma-separated without spaces; default None; keys in adata.obs that correspond to continuous data. These covariates can be added in addition to the batch covariate and are also treated as nuisance factors (i.e., the model tries to minimize their effects on the latent space)

# The following two parameters must be specified (cannot leave empty), otherwise an error will be thrown:
N_cells_per_location: 8 # Expected cell abundance per voxel
detection_alpha: 20 # Regularization of within-experiment variation in RNA detection sensitivity

max_epochs: 400 # Default np.min([round((20000 / n_cells) * 400), 400])
#use_gpu: # Default True; whether to use GPU for training
accelerator: auto

# -------------------------------
save_models: False # Default False; whether to save the reference and spatial mapping models



# -------------
## 3. Tangram
# -------------

Tangram:
run: True # Whether to run Tangram

# -------------------------------
# Feature selection parameters
# -------------------------------
# Reduced feature set can either be given a) via a csv file of genes or b) sc.tl.rank_genes_groups() is run and the top n markers of each group are selected
# If no file is given in a), b) will be run, i.e. feature selection is not optional.
feature_selection:
# a) Path to a csv file containing a reduced feature set
gene_list: # A header in the csv is expected in the first row

# b) Parameters for sc.tl.rank_genes_groups() gene selection.
rank_genes:
labels_key: cell_type_original # Which column in .obs of the reference to use for the 'groupby' parameter of sc.tl.rank_genes_groups()
layer: Null # Default None, which layer of the reference to use for sc.tl.rank_genes_groups(). If Null (i.e. None), uses .X
n_genes: 100 # Default 100, how many top genes to select from each 'groupby' group
test_method: wilcoxon # Default t-test_overestim_var, which test method to use. One of ['logreg', 't-test', 'wilcoxon', 't-test_overestim_var']
correction_method: benjamini-hochberg # Default benjamini-hochberg, which p-value correction method to use. One of ['benjamini-hochberg', 'bonferroni']. Used only for 't-test', 't-test_overestim_var', and 'wilcoxon'

model:
labels_key: cell_type_original # Default None, cell type key in the reference .obs
num_epochs: 100 # Default 1000. Number of epochs for tangram.mapping_utils.map_cells_to_space()
device: cpu # Default cpu. Device to use for deconvolution
kwargs: # Parameters for tangram.mapping_utils.map_cells_to_space(), feel free to add or remove parameters below




7 changes: 7 additions & 0 deletions docs/deconvolution_tangram/pipeline_yml.md
@@ -0,0 +1,7 @@
Pipeline deconvolution yml
-----------------------------

Download this [file](pipeline.yml)

```{literalinclude} pipeline.yml
```
@@ -29,7 +29,7 @@ spatial
│ └── datasets_mouse_brain_map_BrainReceptorShowcase_Slice1_Replicate1_images_micron_to_mosaic_pixel_transform.csv
```

**Please note, that the data folder structure needs to be structured as expected by the [squidpy.read.vizgen](https://squidpy.readthedocs.io/en/stable/api/squidpy.read.vizgen.html) function.**
**Please note that the data folder needs to be structured as expected by the [spatialdata_io.merscope](https://spatialdata.scverse.org/projects/io/en/latest/generated/spatialdata_io.merscope.html) function.**
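
To verify the folder layout before running the workflow, you can try reading the data yourself; a minimal sketch, assuming `spatialdata-io` is installed (the `./data` path is a placeholder for the folder described above):

```
from spatialdata_io import merscope

# merscope() expects the MERSCOPE output folder layout described above:
sdata = merscope("./data")
print(sdata)
```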


## Edit submission and yaml file
@@ -60,20 +60,20 @@ ingestion
│   ├── violin_obs_total_counts_sample_id.mouse_brain.png
│   └── violin_var_total_counts.mouse_brain.png
├── logs
│   ├── make_mudatas_mouse_brain.log
│   ├── make_spatialdatas_mouse_brain.log
│   ├── qcplot.mouse_brain.log
│   └── spatialQC_mouse_brain.log
├── qc.data # MuData with QC metrics
│   └── mouse_brain_unfilt.h5mu
├── tmp # MuData without QC metrics
│   └── mouse_brain_raw.h5mu
├── qc.data # SpatialData with QC metrics
│   └── mouse_brain_unfilt.zarr
├── tmp # SpatialData without QC metrics
│   └── mouse_brain_raw.zarr
├── pipeline.log
├── pipeline.yml
├── sample_file_qc_spatial.txt
└── mouse_brain_cell_metadata.tsv # Metadata, i.e. .obs
```

In the `qc.data` folder, the final `MuData` object with computed QC metrics is stored. The `MuData` object without QC metrics is also available and stored in the `tmp` folder. The metadata of the final `Mudata` object is additionally extracted and saved as tsv file, `mouse_brain_cell_metadata.tsv`.
In the `qc.data` folder, the final `SpatialData` object with computed QC metrics is stored. The `SpatialData` object without QC metrics is stored in the `tmp` folder. The metadata of the final `SpatialData` object is additionally extracted and saved as a tsv file, `mouse_brain_cell_metadata.tsv`.
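
A minimal sketch for opening these outputs (assuming the table element is stored under `"table"`):

```
import pandas as pd
import spatialdata as sd

sdata = sd.read_zarr("./qc.data/mouse_brain_unfilt.zarr")
print(sdata["table"].obs.head())  # the computed QC metrics live in .obs

metadata = pd.read_csv("./mouse_brain_cell_metadata.tsv", sep="\t")
print(metadata.head())
```
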
Using the [provided example yaml file](pipeline_yml.md), the first rows and columns of the `mouse_brain_cell_metadata` tsv file look as follows:

| | spatial:fov | spatial:volume | spatial:min_x | spatial:max_x | spatial:min_y | spatial:max_y
2 changes: 1 addition & 1 deletion docs/ingesting_merfish_data/pipeline.yml
@@ -32,7 +32,7 @@ condaenv: # Path to conda env, leave blank if running native or your cluster au

project: test
# Submission_file format (example at resources/sample_file_qc_spatial.txt) :
submission_file: sample_file_qc_spatial.txt # Path to submission file
submission_file: sample_file_qc_merfish.txt # Path to submission file



2 changes: 2 additions & 0 deletions docs/ingesting_merfish_data/sample_file_qc_merfish.txt
@@ -0,0 +1,2 @@
sample_id spatial_path spatial_filetype vpt_cell_by_gene vpt_cell_metadata vpt_cell_boundaries
mouse_brain data vizgen
3 changes: 0 additions & 3 deletions docs/ingesting_merfish_data/sample_file_qc_spatial.txt

This file was deleted.
