diff --git a/docs/deconvolution/deconvoluting_spatial_data_with_panpipes.md b/docs/deconvolution/deconvoluting_spatial_data_with_panpipes.md index dedadbd..8a0a0ff 100644 --- a/docs/deconvolution/deconvoluting_spatial_data_with_panpipes.md +++ b/docs/deconvolution/deconvoluting_spatial_data_with_panpipes.md @@ -1,6 +1,6 @@ # Deconvoluting spatial data with Panpipes -The `deconvolution_spatial` workflow runs deconvolution for spatial data. Multiple slides can be deconvoluted with the same reference in one run. For each spatial slide, the workflow expects one `MuData` object, with the spatial data saved in `mudata.mod['spatial']`. For the reference scRNA-Seq data, it expects a `MuData`, with the gene expression data saved in `mudata.mod['rna']`. The steps of the workflow are explained in greater detail [here](https://panpipes-pipelines.readthedocs.io/en/latest/workflows/deconvolute_spatial.html). +The `deconvolution_spatial` workflow runs deconvolution for spatial data. Multiple slides can be deconvoluted with the same reference in one run. For each spatial slide, the workflow expects one `SpatialData` object. For the reference scRNA-Seq data, it expects a `MuData`, with the gene expression data saved in `mudata.mod['rna']`. The steps of the workflow are explained in greater detail [here](https://panpipes-pipelines.readthedocs.io/en/latest/workflows/deconvolute_spatial.html). For all the tutorials, we will append the `--local` command which ensures that the pipeline runs on the computing node you're currently on, namely your local machine or an interactive session on a computing node on a cluster. @@ -17,10 +17,11 @@ cd deconvolution ``` In this tutorial, we will use human heart spatial transcriptomics and scRNA-Seq data published by [Kuppe et al](https://www.nature.com/articles/s41586-022-05060-x). -You can use the following python code to download and save the data as `MuData` files: +You can use the following python code to download and save the data: ``` import scanpy as sc +import spatialdata as sd import muon as mu # Loading the spatial data: @@ -33,7 +34,8 @@ adata_st = adata_st[adata_st.obs["patient"] == "P1"] del adata_st.uns["spatial"]["control_P17"] del adata_st.uns["spatial"]["control_P7"] del adata_st.uns["spatial"]["control_P8"] -mu.MuData({"spatial": adata_st}).write_h5mu("./data/spatial_data/Human_Heart.h5mu") +sdata_st = sd.SpatialData(tables=adata_st) +sdata_st.write("./data/spatial_data/Human_Heart.zarr") # Loading the single-cell reference adata_sc = sc.read( @@ -58,26 +60,25 @@ deconvolution └── data |── Human_Heart_reference.h5mu └── spatial_data - └── Human_Heart.h5mu + └── Human_Heart.zarr ``` -**Note, that the `MuData` object of the reference data should not be saved into the same folder as the spatial data!** -*When running the workflow, the workflow expects all spatial `MuData` objects to be saved in the same folder and reads in all `MuData` objects of that directory. 
When saving the reference `MuData` into the same folder as the spatial, the workflow would treat the reference `MuData` as spatial and would try to run deconvolution on it.*
+*The workflow expects all spatial `SpatialData` objects to be saved in the same folder and reads in all `SpatialData` objects of that directory.*

+### Create yaml file
+
+In `spatial/deconvolution`, create the pipeline.yml and pipeline.log files by running `panpipes deconvolution_spatial config` in the command line (you potentially need to activate the conda environment with `conda activate pipeline_env` first!).

## Cell2Location

### Edit yaml file

-In `spatial/deconvolution`, create the pipeline.yml and pipeline.log files by running `panpipes deconvolution_spatial config` in the command line (you potentially need to activate the conda environment with `conda activate pipeline_env` first!).
Modify the yaml file, or simply use the [pipeline.yml](pipeline_yml.md) that we provide (you potentially need to add the path of the conda environment in the yaml).
-
-
### Run Panpipes

Run the full workflow with `panpipes deconvolution_spatial make full --local`.
@@ -90,11 +91,11 @@ deconvolution
│   └── Human_Heart
│       ├── Cell2Loc_inf_aver.csv
│       ├── Cell2Loc_screference_output.h5mu
-│       └── Cell2Loc_spatial_output.h5mu
+│       └── Cell2Loc_spatial_output.zarr
├── data
│   ├── Human_Heart_reference.h5mu
│   └── spatial_data
-│       └── Human_Heart.h5mu
+│       └── Human_Heart.zarr
├── figures
│   └── Cell2Location
│       └── Human_Heart
@@ -114,7 +115,8 @@ deconvolution

In the folder `./cell2location.output`, a folder for each slide will be created, containing the following outputs:
-* `MuData` containing the spatial data together with the posterior of the spatial mapping model
+* `SpatialData` containing the spatial data together with the posterior of the spatial mapping model
+  * If `export_gene_by_spot = True`, a gene-by-spot matrix for each cell type, stored in a layer
* `MuData` containing the reference data together with the posterior of the reference model
-* A csv-file `Cell2Loc_inf_anver.csv` containing the estimated expression of every gene in every cell type
+* A csv-file `Cell2Loc_inf_aver.csv` containing the estimated expression of every gene in every cell type
* If `save_models = True`, the reference model and the spatial mapping model
@@ -137,5 +139,51 @@ Also in `./figures/Cell2Location`, a folder for each slide will be created. Each
* Spatial plot where the spots of the slide are colored by the estimated cell type abundances

+## Tangram
+
+### Edit yaml file
+
+Modify the yaml file, or simply use the [pipeline.yml](../deconvolution_tangram/pipeline_yml.md) that we provide (you potentially need to add the path of the conda environment in the yaml).
+
+### Run Panpipes
+
+Run the full workflow with `panpipes deconvolution_spatial make full --local`.
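+
+For orientation, the Tangram step corresponds roughly to the following calls (a minimal sketch, not the pipeline's exact code; `adata_sc` and `adata_st` are hypothetical handles to the reference `rna` modality and the slide's expression table, and the parameter values mirror the yaml above):
+
+```
+import pandas as pd
+import scanpy as sc
+import tangram as tg
+
+# Gene selection via rank_genes_groups, as configured under feature_selection:
+sc.tl.rank_genes_groups(adata_sc, groupby="cell_type_original", method="wilcoxon")
+markers_df = pd.DataFrame(adata_sc.uns["rank_genes_groups"]["names"]).head(100)
+markers = list(set(markers_df.melt().value.values))
+
+tg.pp_adatas(adata_sc, adata_st, genes=markers)  # restrict both objects to shared training genes
+ad_map = tg.map_cells_to_space(adata_sc, adata_st, device="cpu", num_epochs=100)
+tg.project_cell_annotations(ad_map, adata_st, annotation="cell_type_original")  # fills adata_st.obsm["tangram_ct_pred"]
+```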
+
+Once `Panpipes` has finished, the `spatial/deconvolution` directory will have the following structure:
+
+```
+deconvolution
+├── data
+│   ├── Human_Heart_reference.h5mu
+│   └── spatial_data
+│       └── Human_Heart.zarr
+├── figures
+│   └── Tangram
+│       └── Human_Heart
+│           ├── rank_genes_groups_cell_type_original.png
+│           └── show_tangram_ct_pred.png
+├── logs
+│   └── 2_Tangram_Human_Heart.log
+├── pipeline.log
+├── pipeline.yml
+└── tangram.output
+    └── Human_Heart
+        ├── Tangram_screference_output.h5mu
+        ├── Tangram_spatial_output.zarr
+        └── rank_genes_groups.csv
+```
+
+In the folder `./tangram.output`, a folder for each slide will be created, containing the following outputs:
+* `SpatialData` containing the spatial data with the projected cell annotations
+* `MuData` containing the reference data
+* If gene selection via rank genes is performed: a csv-file `rank_genes_groups.csv` containing the ranked genes
+
+
+Also in `./figures/Tangram`, a folder for each slide will be created. Each folder will contain the following plots:
+* If gene selection via rank genes is performed: a plot of the rank genes groups ([sc.pl.rank_genes_groups](https://scanpy.readthedocs.io/en/1.10.x/api/generated/scanpy.pl.rank_genes_groups.html))
+* Spatial plot where the spots of the slide are colored by the predicted cell type probabilities
+
+
+
*Note: We find that keeping the suggested directory structure (one main directory by project with all the individual steps in separate folders) is useful for project management. You can of course customize your directories as you prefer, and change the paths accordingly in the `pipeline.yml` config files!*
diff --git a/docs/deconvolution/pipeline.yml b/docs/deconvolution/pipeline.yml
index c84a248..47af525 100644
--- a/docs/deconvolution/pipeline.yml
+++ b/docs/deconvolution/pipeline.yml
@@ -68,7 +68,8 @@ Cell2Location:
    categorical_covariate_keys: # Comma-separated without spaces; default None; keys in adata.obs that correspond to categorical data. These covariates can be added in addition to the batch covariate and are also treated as nuisance factors (i.e., the model tries to minimize their effects on the latent space)
    continuous_covariate_keys: # Comma-separated without spaces; default None; keys in adata.obs that correspond to continuous data. These covariates can be added in addition to the batch covariate and are also treated as nuisance factors (i.e., the model tries to minimize their effects on the latent space)
    max_epochs: 400 # Default np.min([round((20000 / n_cells) * 400), 400])
-    use_gpu: # Default True; whether to use GPU for training
+    #use_gpu: # Default True; whether to use GPU for training
+    accelerator: auto

  # -------------------------------
-  # Spatial mapping model paramaters
+  # Spatial mapping model parameters
@@ -85,7 +86,8 @@ Cell2Location:
-    detection_alpha: 20 # Regularization of with-in experiment variation in RNA detection sensitivity
+    detection_alpha: 20 # Regularization of within-experiment variation in RNA detection sensitivity

    max_epochs: 400 # Default np.min([round((20000 / n_cells) * 400), 400])
-    use_gpu: # Default True; whether to use GPU for training
+    #use_gpu: # Default True; whether to use GPU for training
+    accelerator: auto

  # -------------------------------
  save_models: False # Default False; whether to save the reference and spatial mapping models
diff --git a/docs/deconvolution_tangram/pipeline.yml b/docs/deconvolution_tangram/pipeline.yml
new file mode 100644
index 0000000..d45a21b
--- /dev/null
+++ b/docs/deconvolution_tangram/pipeline.yml
@@ -0,0 +1,127 @@
+# =============================================================================================
+# Deconvolution, spatial transcriptomics workflow Panpipes (pipeline_deconvolution_spatial.py)
+# =============================================================================================
+# Written by Sarah Ouologuem
+
+
+# ---------------------------
+# 0. Compute resource options
+# ---------------------------
+resources:
+  # Number of threads used for parallel jobs
+  threads_high: 1 # Number of threads used for high intensity computing tasks
+  threads_medium: 1 # Must be enough memory to load your mudata and do computationally light tasks
+  threads_low: 1 # Must be enough memory to load text files and do plotting, requires much less memory
+
+condaenv: # Path to conda env, leave blank if running native or your cluster automatically inherits the login node environment
+
+
+
+# ----------------------
+# 1. Specify input
+# ----------------------
+
+# One or multiple slides can be deconvoluted with the same reference in one run. For that, one SpatialData object for each slide is expected.
+input:
+  spatial: ./data/spatial_data # Path to folder containing one or multiple SpatialDatas of spatial data. The pipeline reads in all SpatialData (zarr) files in that folder and assumes that they are SpatialDatas of spatial slides.
+# For all SpatialData files in that folder, deconvolution is run by the pipeline.
+# For each spatial SpatialData, deconvolution is run with "singlecell" (see below) as a reference
+  singlecell: ./data/Human_Heart_reference.h5mu # Path to the MuData file of the reference single-cell data, reference data expected to be saved in mudata.mod["rna"]
+
+
+
+# ----------------------
+## 2. Cell2Location
+# ----------------------
+
+Cell2Location:
+  run: False # Whether to run Cell2Location
+
+  # -------------------------------
+  # Feature selection parameters
+  # -------------------------------
+  feature_selection:
+    # Reduced feature set can either be given a) via a csv file of genes or b) feature selection will be performed à la Cell2Location, i.e. via the function: cell2location.utils.filtering.filter_genes()
+    # If no file is given in a), b) will be run, i.e. feature selection is not optional.
+
+    # a) Path to a csv file containing a reduced feature set
+    gene_list: # A header in the csv is expected in the first row
+
+    # b) Parameters for Cell2Location's feature selection, leave empty to use defaults
+    # Whether to remove mitochondrial genes before feature selection
+    remove_mt: # Default True
+    # All genes detected in less than cell_count_cutoff cells will be excluded.
+    cell_count_cutoff: # Default 15, parameter of function cell2location.utils.filtering.filter_genes()
+    # All genes detected in at least this percentage of cells will be included.
+    cell_percentage_cutoff2: # Default 0.05, parameter of function cell2location.utils.filtering.filter_genes()
+    # Genes detected in the number of cells between the above-mentioned cutoffs are selected only when their average expression in non-zero cells is above this cutoff
+    nonz_mean_cutoff: # Default 1.12, parameter of function cell2location.utils.filtering.filter_genes()
+
+  # -------------------------------
+  # Reference model parameters
+  # Leave empty to use defaults
+  # -------------------------------
+  reference:
+    labels_key: cell_type_original # Default None, key in adata.obs for label (cell type) information
+    batch_key: # Default None, key in adata.obs for batch information
+    layer: # Default None (if None, X will be used), Layer of the raw (!) counts
+    categorical_covariate_keys: # Comma-separated without spaces; default None; keys in adata.obs that correspond to categorical data. These covariates can be added in addition to the batch covariate and are also treated as nuisance factors (i.e., the model tries to minimize their effects on the latent space)
+    continuous_covariate_keys: # Comma-separated without spaces; default None; keys in adata.obs that correspond to continuous data. These covariates can be added in addition to the batch covariate and are also treated as nuisance factors (i.e., the model tries to minimize their effects on the latent space)
+
+    max_epochs: 400 # Default np.min([round((20000 / n_cells) * 400), 400])
+    #use_gpu: # Default True; whether to use GPU for training
+    accelerator: auto
+
+  # -------------------------------
+  # Spatial mapping model parameters
+  # Leave empty to use defaults
+  # -------------------------------
+  spatial:
+    batch_key: # Default None, key in adata.obs for batch information
+    layer: # Default None (if None, X will be used), Layer of the raw (!) counts
+    categorical_covariate_keys: # Comma-separated without spaces; default None; keys in adata.obs that correspond to categorical data. These covariates can be added in addition to the batch covariate and are also treated as nuisance factors (i.e., the model tries to minimize their effects on the latent space)
+    continuous_covariate_keys: # Comma-separated without spaces; default None; keys in adata.obs that correspond to continuous data. These covariates can be added in addition to the batch covariate and are also treated as nuisance factors (i.e., the model tries to minimize their effects on the latent space)
+
+    # The following two parameters must be specified (cannot leave empty), otherwise an error will be thrown:
+    N_cells_per_location: 8 # Expected cell abundance per voxel
+    detection_alpha: 20 # Regularization of within-experiment variation in RNA detection sensitivity
+
+    max_epochs: 400 # Default np.min([round((20000 / n_cells) * 400), 400])
+    #use_gpu: # Default True; whether to use GPU for training
+    accelerator: auto
+
+  # -------------------------------
+  save_models: False # Default False; whether to save the reference and spatial mapping models
+
+
+
+# -------------
+## 3. Tangram
+# -------------
+
+Tangram:
+  run: True # Whether to run Tangram
+
+  # -------------------------------
+  # Feature selection parameters
+  # -------------------------------
+  # Reduced feature set can either be given a) via a csv file of genes or b) sc.tl.rank_genes_groups() is run and the top n markers of each group are selected
+  # If no file is given in a), b) will be run, i.e. feature selection is not optional.
+  feature_selection:
+    # a) Path to a csv file containing a reduced feature set
+    gene_list: # A header in the csv is expected in the first row
+
+    # b) Parameters for sc.tl.rank_genes_groups() gene selection.
+    rank_genes:
+      labels_key: cell_type_original # Which column in .obs of the reference to use for the 'groupby' parameter of sc.tl.rank_genes_groups()
+      layer: Null # Default None, which layer of the reference to use for sc.tl.rank_genes_groups(). If Null (i.e. None), uses .X
+      n_genes: 100 # Default 100, how many top genes to select of each 'groupby' group
+      test_method: wilcoxon # Default t-test_overestim_var, which test method to use. One of ['logreg', 't-test', 'wilcoxon', 't-test_overestim_var']
+      correction_method: benjamini-hochberg # Default benjamini-hochberg, which p-value correction method to use. One of ['benjamini-hochberg', 'bonferroni']. Used only for 't-test', 't-test_overestim_var', and 'wilcoxon'
+
+  model:
+    labels_key: cell_type_original # Default None, cell type key in the reference .obs
+    num_epochs: 100 # Default 1000. Number of epochs for tangram.mapping_utils.map_cells_to_space()
+    device: cpu # Default cpu. Device to use for deconvolution
+    kwargs: # Parameters for tangram.mapping_utils.map_cells_to_space(), feel free to add or remove parameters below
+
+
+
+
diff --git a/docs/deconvolution_tangram/pipeline_yml.md b/docs/deconvolution_tangram/pipeline_yml.md
new file mode 100644
index 0000000..5d9b26d
--- /dev/null
+++ b/docs/deconvolution_tangram/pipeline_yml.md
@@ -0,0 +1,7 @@
+Pipeline deconvolution yml
+-----------------------------
+
+Download this [file](pipeline.yml).
+
+```{literalinclude} pipeline.yml
+```
\ No newline at end of file
diff --git a/docs/ingesting_merfish_data/Ingesting_merfish_data_with_panpipes.md b/docs/ingesting_merfish_data/Ingesting_merfish_data_with_panpipes.md
index c04eee8..6acc8e7 100644
--- a/docs/ingesting_merfish_data/Ingesting_merfish_data_with_panpipes.md
+++ b/docs/ingesting_merfish_data/Ingesting_merfish_data_with_panpipes.md
@@ -29,7 +29,7 @@ spatial
│   └── datasets_mouse_brain_map_BrainReceptorShowcase_Slice1_Replicate1_images_micron_to_mosaic_pixel_transform.csv
```

-**Please note, that the data folder structure needs to be structured as expected by the [squidpy.read.vizgen](https://squidpy.readthedocs.io/en/stable/api/squidpy.read.vizgen.html) function.**
+**Please note that the data folder needs to be structured as expected by the [spatialdata_io.merscope](https://spatialdata.scverse.org/projects/io/en/latest/generated/spatialdata_io.merscope.html) function.**

## Edit submission and yaml file
@@ -60,20 +60,20 @@ ingestion
│   ├── violin_obs_total_counts_sample_id.mouse_brain.png
│   └── violin_var_total_counts.mouse_brain.png
├── logs
-│   ├── make_mudatas_mouse_brain.log
+│   ├── make_spatialdatas_mouse_brain.log
│   ├── qcplot.mouse_brain.log
│   └── spatialQC_mouse_brain.log
-├── qc.data # MuData with QC metrics
-│   └── mouse_brain_unfilt.h5mu
-├── tmp # MuData without QC metrics
-│   └── mouse_brain_raw.h5mu
+├── qc.data # SpatialData with QC metrics
+│   └── mouse_brain_unfilt.zarr
+├── tmp # SpatialData without QC metrics
+│   └── mouse_brain_raw.zarr
├── pipeline.log
├── pipeline.yml
-├── sample_file_qc_spatial.txt
+├── sample_file_qc_merfish.txt
└── mouse_brain_cell_metadata.tsv # Metadata, i.e. .obs
```

-In the `qc.data` folder, the final `MuData` object with computed QC metrics is stored. The `MuData` object without QC metrics is also available and stored in the `tmp` folder. The metadata of the final `Mudata` object is additionally extracted and saved as tsv file, `mouse_brain_cell_metadata.tsv`.
+In the `qc.data` folder, the final `SpatialData` object with computed QC metrics is stored. The `SpatialData` object without QC metrics is also available and stored in the `tmp` folder. The metadata of the final `SpatialData` object is additionally extracted and saved as a tsv file, `mouse_brain_cell_metadata.tsv`.
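+
+You can inspect the stored object directly in Python, for example (a minimal sketch; the key of the table inside the `SpatialData` object, assumed here to be `"table"`, may differ depending on your `spatialdata` version):
+
+```
+import spatialdata as sd
+
+sdata = sd.read_zarr("./qc.data/mouse_brain_unfilt.zarr")
+print(sdata)  # overview of the stored elements (images, shapes, tables, ...)
+
+adata = sdata.tables["table"]  # the AnnData table holding the computed QC metrics
+print(adata.obs[["total_counts", "n_genes_by_counts"]].head())
+```
+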
Using the [provided example yaml file](pipeline_yml.md), the first rows and columns of the `mouse_brain_cell_metadata` tsv file look as follows: | | spatial:fov | spatial:volume | spatial:min_x | spatial:max_x | spatial:min_y | spatial:max_y diff --git a/docs/ingesting_merfish_data/pipeline.yml b/docs/ingesting_merfish_data/pipeline.yml index 0009287..18cfa77 100644 --- a/docs/ingesting_merfish_data/pipeline.yml +++ b/docs/ingesting_merfish_data/pipeline.yml @@ -32,7 +32,7 @@ condaenv: # Path to conda env, leave blank if running native or your cluster au project: test # Submission_file format (example at resources/sample_file_qc_spatial.txt) : -submission_file: sample_file_qc_spatial.txt # Path to submission file +submission_file: sample_file_qc_merfish.txt # Path to submission file diff --git a/docs/ingesting_merfish_data/sample_file_qc_merfish.txt b/docs/ingesting_merfish_data/sample_file_qc_merfish.txt new file mode 100644 index 0000000..f560c83 --- /dev/null +++ b/docs/ingesting_merfish_data/sample_file_qc_merfish.txt @@ -0,0 +1,2 @@ +sample_id spatial_path spatial_filetype vpt_cell_by_gene vpt_cell_metadata vpt_cell_boundaries +mouse_brain data vizgen \ No newline at end of file diff --git a/docs/ingesting_merfish_data/sample_file_qc_spatial.txt b/docs/ingesting_merfish_data/sample_file_qc_spatial.txt deleted file mode 100644 index cbe60d4..0000000 --- a/docs/ingesting_merfish_data/sample_file_qc_spatial.txt +++ /dev/null @@ -1,3 +0,0 @@ -sample_id spatial_path spatial_filetype spatial_counts spatial_metadata spatial_transformation -mouse_brain data vizgen datasets_mouse_brain_map_BrainReceptorShowcase_Slice1_Replicate1_cell_by_gene_S1R1.csv datasets_mouse_brain_map_BrainReceptorShowcase_Slice1_Replicate1_cell_metadata_S1R1.csv datasets_mouse_brain_map_BrainReceptorShowcase_Slice1_Replicate1_images_micron_to_mosaic_pixel_transform.csv - diff --git a/docs/ingesting_merscope_data/pipeline.yml b/docs/ingesting_merscope_data/pipeline.yml new file mode 100644 index 0000000..82b32c6 --- /dev/null +++ b/docs/ingesting_merscope_data/pipeline.yml @@ -0,0 +1,80 @@ +# ----------------------- # +# Spatial QC pipeline +# ----------------------- # +# Written by Fabiola Curion, Sarah Ouologuem + +# This pipeline needs a sample metadata file (see resources for an example) +# Will run : +# scanpy QC +# summary QC plots + +# Followed by preprocess pipeline. +# The qc_spatial pipeline does not perform any filtering of cells or genes. +# Filtering happens as the first step in the preprocess pipeline. See pipeline_preprocess_spatial for details + + +# --------------------------- +# 0. Compute resource options +# --------------------------- +resources: + # Number of threads used for parallel jobs + threads_high: 1 # Must be enough memory to load all input files and create MuDatas + threads_medium: 1 # Must be enough memory to load your mudata and do computationally light tasks + threads_low: 1 # Must be enough memory to load text files and do plotting, requires much less memory + +condaenv: # Path to conda env, leave blank if running native or your cluster automatically inherits the login node environment + + + +# ------------------------------------------------------------------------------------------------ +# 1. 
Loading options
+# ------------------------------------------------------------------------------------------------
+
+project: test
+# Submission_file format (example at resources/sample_file_qc_spatial.txt):
+submission_file: sample_file_qc_merscope.txt # Path to submission file
+
+
+
+# ------------------------------------------------------------------------------------------------
+# 2. QC options
+# ------------------------------------------------------------------------------------------------
+
+# -------------------------
+# Calculate QC metrics
+# -------------------------
+# This part of the pipeline allows you to generate the QC metrics that will be used to evaluate inclusion/exclusion criteria.
+# Filtering of cells/genes happens in the pipeline pipeline_preprocess_spatial.py.
+# Basic QC metrics are always calculated, using scanpy.pp.calculate_qc_metrics().
+# Additional optional scores can be calculated for 1. cell cycle genes and 2. custom genes:
+
+# 1. Cell cycle scores (optional)
+# Leave options blank to avoid running
+# Default file: panpipes/resources/cell_cycle_genes.tsv
+ccgenes: # "default" (uses panpipes/resources/cell_cycle_genes.tsv) or path to tsv file with columns "cc_phase" and "gene_name". "cc_phase" can either be "s" or "g2m". Information in tsv file used to run the function scanpy.tl.score_genes_cell_cycle()
+
+
+# 2. Custom genes actions (optional)
+# It's often practical to rely on known gene lists for a series of tasks, like evaluating the proportion of mitochondrial genes, ribosomal genes, or excluding IGG genes from HVG selection.
+# We collect useful gene lists in: panpipes/resources/custom_gene_lists_v1.tsv and define "actions" on them as follows:
+# calc_proportions: calculate proportion of reads mapping to X genes over total number of reads, per cell
+# score_genes: score gene sets using the scanpy.tl.score_genes() function
+
+# Leave options blank to avoid running
+# Default file: panpipes/resources/qc_genelist_1.0.csv
+custom_genes_file: # "default" (uses panpipes/resources/qc_genelist_1.0.csv) or path to csv file with columns "group" and "feature".
+
+calc_proportions: # Comma-separated without spaces, which groups in the custom_genes_file to calculate percentages for
+score_genes: # Comma-separated without spaces, which groups in the custom_genes_file to run scanpy.tl.score_genes() for
+
+
+# ---------------
+# Plot QC metrics
+# ---------------
+
+# All metrics and grouping variables should be inputted as a comma separated string without spaces, e.g. a,b,c
+
+plotqc:
+  grouping_var: sample_id # sample_id is the name of each spatial slide (specified in the submission file) and will always be used for plotting.
+  spatial_metrics: total_counts,n_genes_by_counts # If a metric is present in both .obs and .var, both will be plotted
+
diff --git a/docs/ingesting_merscope_data/sample_file_qc_merscope.txt b/docs/ingesting_merscope_data/sample_file_qc_merscope.txt
new file mode 100644
index 0000000..6f8492a
--- /dev/null
+++ b/docs/ingesting_merscope_data/sample_file_qc_merscope.txt
@@ -0,0 +1,2 @@
+sample_id spatial_path spatial_filetype vpt_cell_by_gene vpt_cell_metadata vpt_cell_boundaries
+merscope_test data vizgen data/cell_by_gene.csv data/cell_metadata.csv data/cellpose_micron_space.parquet
diff --git a/docs/ingesting_visium_data/Ingesting_visium_data_with_panpipes.md b/docs/ingesting_visium_data/Ingesting_visium_data_with_panpipes.md
index a821100..72fda76 100644
--- a/docs/ingesting_visium_data/Ingesting_visium_data_with_panpipes.md
+++ b/docs/ingesting_visium_data/Ingesting_visium_data_with_panpipes.md
@@ -1,6 +1,6 @@
 # Ingesting 10X Visium data with Panpipes

-Let's run through an example of reading `10X Visium` data into `MuData` objects and computing QC metrics using `Panpipes`. The [workflow](https://panpipes-pipelines.readthedocs.io/en/latest/workflows/ingest_spatial.html) describes the steps run by the pipeline in detail.
+Let's run through an example of reading `10X Visium` data into `SpatialData` objects and computing QC metrics using `Panpipes`. The [workflow](https://panpipes-pipelines.readthedocs.io/en/latest/workflows/ingest_spatial.html) describes the steps run by the pipeline in detail.

 For all the tutorials, we will append the `--local` command which ensures that the pipeline runs on the computing node you're currently on, namely your local machine or an interactive session on a computing node on a cluster.
@@ -62,18 +62,18 @@ data
    └── tissue_positions_list.csv
```

-**Please note, that the data folder structure needs to be structured as expected by the [squidpy.read.visium](https://squidpy.readthedocs.io/en/stable/api/squidpy.read.visium.html) function.**
+**Please note that the data folder needs to be structured as expected by the [spatialdata_io.visium](https://spatialdata.scverse.org/projects/io/en/latest/generated/spatialdata_io.visium.html) function.**

## Edit submission and yaml file

-In `spatial/ingestion`, create a submission file like [the one we provide](sample_file_qc_spatial.txt). For this tutorial, you can use the provided.
+In `spatial/ingestion`, create a submission file like [the one we provide](sample_file_qc_visium.txt). For this tutorial, you can use the provided one.

-In general, the spatial submission file expects the following columns:
+In general, the spatial submission file for `Visium` datasets expects the following columns:

-`sample_id` `spatial_path` `spatial_filetype` `spatial_counts` `spatial_metadata` `spatial_transformation`
+`sample_id` `spatial_path` `spatial_filetype` `visium_feature_bc_matrix` `visium_fullres_image_file` `visium_tissue_positions_file` `visium_scalefactors_file`

-For `10X Visium datasets`, only the first four columns need to be specified. With `Panpipes` you can ingest multiple spatial slides by creating one line for each in the submission file. For each slide, one `MuData` will be created by the pipeline. Detailed information about the submission file is provided in the [usage guidelines](https://panpipes-pipelines.readthedocs.io/en/latest/usage/setup_for_spatial_workflows.html)
+For `10X Visium datasets`, only the first three columns need to be specified; the other four columns are optional. With `Panpipes` you can ingest multiple spatial slides by creating one line for each in the submission file. For each slide, one `SpatialData` will be created by the pipeline. Detailed information about the submission file is provided in the [usage guidelines](https://panpipes-pipelines.readthedocs.io/en/latest/usage/setup_for_spatial_workflows.html).

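+If you build your own submission file, a quick sanity check in Python can catch missing columns early (a minimal sketch, assuming the file is tab-separated as in the provided example):
+
+```
+import pandas as pd
+
+sub = pd.read_csv("sample_file_qc_visium.txt", sep="\t")
+required = {"sample_id", "spatial_path", "spatial_filetype"}
+missing = required - set(sub.columns)
+assert not missing, f"missing required columns: {missing}"
+print(sub)
+```
+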
Next, in `spatial/ingestion` call `panpipes qc_spatial config` (you potentially need to activate the conda environment with `conda activate pipeline_env` first!). This will generate a `pipeline.log` and a `pipeline.yml` file.
@@ -98,24 +98,24 @@ ingestion
│   ├── violin_var_total_counts.V1_Human_Heart.png
│   └── violin_var_total_counts.V1_Human_Lymph_Node.png
├── logs
-│   ├── make_mudatas_V1_Human_Heart.log
+│   ├── make_spatialdatas_V1_Human_Heart.log
│   ├── qcplot.V1_Human_Lymph_Node.log
-│   ├── make_mudatas_V1_Human_Lymph_Node.log
+│   ├── make_spatialdatas_V1_Human_Lymph_Node.log
│   ├── spatialQC_V1_Human_Heart.log
│   └── qcplot.V1_Human_Heart.log
-├── qc.data # MuDatas with QC metrics
-│   ├── V1_Human_Heart_unfilt.h5mu
-│   └── V1_Human_Lymph_Node_unfilt.h5mu
-├── tmp # MuDatas without QC metrics
-│   ├── V1_Human_Heart_raw.h5mu
-│   └── V1_Human_Lymph_Node_raw.h5mu
+├── qc.data # SpatialDatas with QC metrics
+│   ├── V1_Human_Heart_unfilt.zarr
+│   └── V1_Human_Lymph_Node_unfilt.zarr
+├── tmp # SpatialDatas without QC metrics
+│   ├── V1_Human_Heart_raw.zarr
+│   └── V1_Human_Lymph_Node_raw.zarr
├── pipeline.log
├── pipeline.yml
-├── sample_file_qc_spatial.txt
+├── sample_file_qc_visium.txt
├── V1_Human_Heart_cell_metadata.tsv # Metadata, i.e. .obs
└── V1_Human_Lymph_Node_cell_metadata.tsv # Metadata, i.e. .obs
```

-In the `qc.data` folder, the final `MuData` objects with computed QC metrics are stored. `MuData` objects without QC metrics are also available and stored in the `tmp` folder. The metadata of the final `Mudata` objects is additionally extracted and saved as tsv files, `V1_Human_Heart_cell_metadata.tsv` `V1_Human_Lymph_Node_cell_metadata.tsv`.
+In the `qc.data` folder, the final `SpatialData` objects with computed QC metrics are stored. `SpatialData` objects without QC metrics are also available and stored in the `tmp` folder. The metadata of the final `SpatialData` objects is additionally extracted and saved as tsv files, `V1_Human_Heart_cell_metadata.tsv` and `V1_Human_Lymph_Node_cell_metadata.tsv`.

Using the [provided example yaml file](pipeline_yml.md), the first rows and columns of the `V1_Human_Heart_cell_metadata` tsv file should look as follows:

| | spatial:in_tissue | spatial:array_row | spatial:array_col | spatial:sample_id | spatial:MarkersNeutro_score | spatial:n_genes_by_counts
@@ -133,7 +133,7 @@ With the plots in `spatial/ingestion/figures/spatial` you can now decide on cuto
### [Next: filtering and preprocessing using `panpipes preprocess_spatial`](../preprocess_spatial_data/preprocess_spatial_data_with_panpipes.md)

-*Note: In this workflow, we have decided to process individual ST sections instead of concatenating them at the beginning, as you saw for cell-suspension datasets. This is because the workflows for processing multiple spatial transcriptomics slides (especially concerning normalization, dimensionality reduction, and batch correction) are still experimental. With panpipes, you can group multiple samples and process them one by one with the same choice of parameters. In the future we will implement the advanced functionalities of [SpatialData](https://spatialdata.scverse.org/en/latest/tutorials/notebooks/notebooks.html) to deal with multi-sample ST datasets.*
+*Note: In this workflow, we have decided to process individual ST sections instead of concatenating them at the beginning, as you saw for cell-suspension datasets.
This is because the workflows for processing multiple spatial transcriptomics slides (especially concerning normalization, dimensionality reduction, and batch correction) are still experimental. With panpipes, you can group multiple samples and process them one by one with the same choice of parameters.* *Note: We find that keeping the suggested directory structure (one main directory by project with all the individual steps in separate folders) is useful for project management. You can of course customize your directories as you prefer, and change the paths accordingly in the `pipeline.yml` config files!* diff --git a/docs/ingesting_visium_data/pipeline.yml b/docs/ingesting_visium_data/pipeline.yml index 22fca7b..d27cdb1 100644 --- a/docs/ingesting_visium_data/pipeline.yml +++ b/docs/ingesting_visium_data/pipeline.yml @@ -32,7 +32,7 @@ condaenv: # Path to conda env, leave blank if running native or your cluster au project: test # Submission_file format (example at resources/sample_file_qc_spatial.txt) : -submission_file: sample_file_qc_spatial.txt # Path to submission file +submission_file: sample_file_qc_visium.txt # Path to submission file diff --git a/docs/ingesting_visium_data/sample_file_qc_spatial.txt b/docs/ingesting_visium_data/sample_file_qc_spatial.txt deleted file mode 100644 index 4190fba..0000000 --- a/docs/ingesting_visium_data/sample_file_qc_spatial.txt +++ /dev/null @@ -1,3 +0,0 @@ -sample_id spatial_path spatial_filetype spatial_counts spatial_metadata spatial_transformation -V1_Human_Heart data/V1_Human_Heart visium V1_Human_Heart_filtered_feature_bc_matrix.h5 -V1_Human_Lymph_Node data/V1_Human_Lymph_Node visium V1_Human_Lymph_Node_filtered_feature_bc_matrix.h5 diff --git a/docs/ingesting_visium_data/sample_file_qc_visium.txt b/docs/ingesting_visium_data/sample_file_qc_visium.txt new file mode 100644 index 0000000..ce7e088 --- /dev/null +++ b/docs/ingesting_visium_data/sample_file_qc_visium.txt @@ -0,0 +1,3 @@ +sample_id spatial_path spatial_filetype visium_feature_bc_matrix visium_fullres_image_file visium_tissue_positions_file visium_scalefactors_file +V1_Human_Heart data/V1_Human_Heart visium V1_Human_Heart_filtered_feature_bc_matrix.h5 spatial/tissue_hires_image.png spatial/tissue_positions_list.csv spatial/scalefactors_json.json +V1_Human_Lymph_Node data/V1_Human_Lymph_Node visium V1_Human_Lymph_Node_filtered_feature_bc_matrix.h5 spatial/tissue_hires_image.png spatial/tissue_positions_list.csv spatial/scalefactors_json.json \ No newline at end of file diff --git a/docs/ingesting_xenium_data/pipeline.yml b/docs/ingesting_xenium_data/pipeline.yml new file mode 100644 index 0000000..743bfe8 --- /dev/null +++ b/docs/ingesting_xenium_data/pipeline.yml @@ -0,0 +1,81 @@ +# ============================================================================= +# Ingestion, spatial transcriptomics workflow Panpipes (pipeline_qc_spatial.py) +# ============================================================================= +# Written by Fabiola Curion and Sarah Ouologuem + + +# This pipeline needs a sample metadata file (see resources for an example) +# Will run : +# scanpy QC +# summary QC plots + +# Followed by preprocess pipeline. +# The qc_spatial pipeline does not perform any filtering of cells or genes. +# Filtering happens as the first step in the preprocess pipeline. See pipeline_preprocess_spatial for details + + +# --------------------------- +# 0. 
Compute resource options
+# ---------------------------
+resources:
+  # Number of threads used for parallel jobs
+  threads_high: 1 # Must be enough memory to load all input files and create MuDatas
+  threads_medium: 1 # Must be enough memory to load your mudata and do computationally light tasks
+  threads_low: 1 # Must be enough memory to load text files and do plotting, requires much less memory
+
+condaenv: # Path to conda env, leave blank if running native or your cluster automatically inherits the login node environment
+
+
+
+# ------------------------------------------------------------------------------------------------
+# 1. Loading options
+# ------------------------------------------------------------------------------------------------
+
+project: test
+# Submission_file format (example at resources/sample_file_qc_spatial.txt):
+submission_file: sample_file_qc_xenium.txt # Path to submission file
+
+
+
+# ------------------------------------------------------------------------------------------------
+# 2. QC options
+# ------------------------------------------------------------------------------------------------
+
+# -------------------------
+# Calculate QC metrics
+# -------------------------
+# This part of the pipeline allows you to generate the QC metrics that will be used to evaluate inclusion/exclusion criteria.
+# Filtering of cells/genes happens in the pipeline pipeline_preprocess_spatial.py.
+# Basic QC metrics are always calculated, using scanpy.pp.calculate_qc_metrics().
+# Additional optional scores can be calculated for 1. cell cycle genes and 2. custom genes:
+
+# 1. Cell cycle scores (optional)
+# Leave options blank to avoid running
+# Default file: panpipes/resources/cell_cycle_genes.tsv
+ccgenes: # "default" (uses panpipes/resources/cell_cycle_genes.tsv) or path to tsv file with columns "cc_phase" and "gene_name". "cc_phase" can either be "s" or "g2m". Information in tsv file used to run the function scanpy.tl.score_genes_cell_cycle()
+
+
+# 2. Custom genes actions (optional)
+# It's often practical to rely on known gene lists for a series of tasks, like evaluating the proportion of mitochondrial genes, ribosomal genes, or excluding IGG genes from HVG selection.
+# We collect useful gene lists in: panpipes/resources/custom_gene_lists_v1.tsv and define "actions" on them as follows:
+# calc_proportions: calculate proportion of reads mapping to X genes over total number of reads, per cell
+# score_genes: score gene sets using the scanpy.tl.score_genes() function
+
+# Leave options blank to avoid running
+# Default file: panpipes/resources/qc_genelist_1.0.csv
+custom_genes_file: # "default" (uses panpipes/resources/qc_genelist_1.0.csv) or path to csv file with columns "group" and "feature".
+
+calc_proportions: # Comma-separated without spaces, which groups in the custom_genes_file to calculate percentages for
+score_genes: # Comma-separated without spaces, which groups in the custom_genes_file to run scanpy.tl.score_genes() for
+
+
+# ---------------
+# Plot QC metrics
+# ---------------
+
+# All metrics and grouping variables should be inputted as a comma separated string without spaces, e.g. a,b,c
+
+plotqc:
+  grouping_var: sample_id # sample_id is the name of each spatial slide (specified in the submission file)
+  spatial_metrics: total_counts,n_genes_by_counts # If a metric is present in both .obs and .var, both will be plotted
+
diff --git a/docs/ingesting_xenium_data/sample_file_qc_xenium.txt b/docs/ingesting_xenium_data/sample_file_qc_xenium.txt
new file mode 100644
index 0000000..b939fd9
--- /dev/null
+++ b/docs/ingesting_xenium_data/sample_file_qc_xenium.txt
@@ -0,0 +1,2 @@
+sample_id spatial_path spatial_filetype
+xenium_test data xenium
diff --git a/docs/preprocess_spatial_data/pipeline.yml b/docs/preprocess_spatial_data/pipeline.yml
index bf42987..0ae6880 100644
--- a/docs/preprocess_spatial_data/pipeline.yml
+++ b/docs/preprocess_spatial_data/pipeline.yml
@@ -125,6 +125,11 @@ spatial:
-  n_pcs: 50 # Leave blank to use default (50); how many PCs to compute with sc.pp.PCA()
+  n_pcs: 50 # Leave blank to use default (50); how many PCs to compute with sc.pp.pca()
+
+# -----------------------------------------
+# 5. Concatenation
+# -----------------------------------------
+
+concat: False # Leave blank to use default (False); whether to concatenate all preprocessed SpatialDatas in ./filtered_data
diff --git a/docs/preprocess_spatial_data/pipeline_concat.yml b/docs/preprocess_spatial_data/pipeline_concat.yml
new file mode 100644
index 0000000..f2a2b45
--- /dev/null
+++ b/docs/preprocess_spatial_data/pipeline_concat.yml
@@ -0,0 +1,135 @@
+# ==============================================
+# Preprocessing, spatial transcriptomics
+# ==============================================
+
+
+# ---------------------------
+# 0. Compute resource options
+# ---------------------------
+
+resources:
+  # Number of threads used for parallel jobs
+  threads_high: 1 # Must be enough memory to load all input files and create MuDatas
+  threads_medium: 1 # Must be enough memory to load your mudata and do computationally light tasks
+  threads_low: 1 # Must be enough memory to load text files and do plotting, requires much less memory
+
+condaenv: # Path to conda env, leave blank if running native or your cluster automatically inherits the login node environment
+
+
+# ------------------
+## 1. Specify input
+# ------------------
+
+# With the preprocess_spatial workflow, one or multiple SpatialData objects can be preprocessed in one run.
+# For the preprocessing, it reads in all '.zarr' objects of a directory, for example:
+# ./qc.data
+# ├── Human_LN_alt_unfilt.zarr
+# ├── Human_LN_unfilt.zarr
+# └── Human_heart_unfilt.zarr
+
+# The workflow then runs the preprocessing for each SpatialData object separately with the same parameters that are specified in this yaml file.
+# Note that the SpatialData objects in the directory need to be of the same assay (Visium, Vizgen or Xenium).
+
+input_dir: ../ingestion/qc.data/
+assay: visium # Specify the spatial transcriptomics assay, one of "visium", "vizgen" or "xenium". If left blank, will use default "visium"
+
+
+# --------------
+## 2. Filtering
+# --------------
+
+# The filtering is fully customisable to any columns in the .obs or .var. You are not restricted by the columns given as default. When specifying a column name, make sure it exactly matches the column name in the SpatialData object.
+# General structure:
+# spatial:
+  # obs: <- spot/cell level filtering
+    # min: <-- Any column for which you want to run a minimum filter
+      # n_genes_by_counts: 500 <--- i.e. each spot/cell must have a minimum of 500 in the n_genes_by_counts column
+    # max: <-- Any column for which you want to run a maximum filter
+      # pct_counts_mt: 20 <-- i.e. each spot/cell may have a maximum of 20 in the pct_counts_mt column
+    # bool:
+      # highly_variable: True <--- boolean columns you want to filter on, in this case any obs['highly_variable'] that are True will be retained in the dataset.
+  # var: <- gene level filtering
+    # min:
+    # max:
+    # bool
+
+filtering:
+  run: True
+  keep_barcodes: # A file containing only barcodes you want to keep, leave blank if not applicable
+
+  #------------------------------------------------------
+  spatial:
+  #------------------------------------------------------
+    ## Obs filtering: spot/cell level filtering here
+    obs:
+      min:
+        total_counts:
+      max:
+        total_counts: 30000
+        n_genes_by_counts:
+        pct_counts_mt: # Percent filtering: this should be a value between 0 and 100
+        pct_counts_rp:
+      bool:
+    ## Var filtering: gene level filtering here
+    var:
+      min:
+        n_cells_by_counts:
+      max:
+        total_counts:
+        n_cells_by_counts:
+
+
+
+# -------------------------
+## 3. Post-filter plotting
+# -------------------------
+
+# All metrics should be comma-separated without spaces, e.g. a,b,c
+# If the metric is in both .obs and .var, then both are plotted
+plotqc:
+  grouping_var: # Categorical variables to plot by. Not mandatory, can be left empty
+  spatial_metrics: total_counts
+
+
+
+# -----------------------------------------
+# 4. Spatial transcriptomics preprocessing
+# Normalization + HVG selection + PCA
+# -----------------------------------------
+
+spatial:
+  # How to perform normalization and HVG selection:
+  norm_hvg_flavour: seurat # ['seurat', 'squidpy']
+  # 'seurat': HVG selection and normalization by analytic Pearson residuals using sc.experimental.pp.normalize_pearson_residuals() and sc.experimental.pp.highly_variable_genes()
+  # 'squidpy': HVG selection and normalization using the standard scanpy functions (sc.pp.normalize_total(), sc.pp.log1p(), sc.pp.highly_variable_genes())
+
+  # The following parameters are only for the case norm_hvg_flavour == squidpy:
+  # Parameters for the HVG selection using sc.pp.highly_variable_genes():
+  squidpy_hvg_flavour: #seurat #['seurat','cellranger','seurat_v3'], leave blank to use default ('seurat'), flavour for the sc.pp.highly_variable_genes() function
+  min_mean: # Leave blank to use default (0.05), parameter in sc.pp.highly_variable_genes()
+  max_mean: # Leave blank to use default (1.5), parameter in sc.pp.highly_variable_genes()
+  min_disp: # Leave blank to use default (0.5), parameter in sc.pp.highly_variable_genes()
+
+  # The following parameters are only for the case norm_hvg_flavour == seurat:
+  theta: # Leave blank to use default (100), the negative binomial overdispersion parameter for Pearson residuals
+  clip: # Leave blank to use default (None)
+  # 'clip' can be specified as:
+  #    None: residuals are clipped to the interval [-sqrt(n_obs), sqrt(n_obs)]
+  #    a float value: if float c specified, clipped to the interval [-c, c]
+  #    np.Inf: no clipping
+
+  # Parameters for both cases norm_hvg_flavour = "seurat" and "squidpy":
+  n_top_genes: 2000 # Default is None; mandatory for norm_hvg_flavour = "seurat" and squidpy_hvg_flavour: "seurat_v3"
+  filter_by_hvg: False # Leave blank to use default (False); if True, subset the data to highly-variable genes after finding them
+  hvg_batch_key: # Leave blank to use default (None); if specified, highly-variable genes are selected within each batch separately and merged
+
+  n_pcs: 50 # Leave blank to use default (50); how many PCs to compute with sc.pp.pca()
+
+# -----------------------------------------
+# 5. Concatenation
+# -----------------------------------------
+
+concat: True # Leave blank to use default (False); whether to concatenate all preprocessed SpatialDatas in ./filtered_data
+
+
+
diff --git a/docs/preprocess_spatial_data/preprocess_spatial_data_with_panpipes.md b/docs/preprocess_spatial_data/preprocess_spatial_data_with_panpipes.md
index 6abd720..21f4481 100644
--- a/docs/preprocess_spatial_data/preprocess_spatial_data_with_panpipes.md
+++ b/docs/preprocess_spatial_data/preprocess_spatial_data_with_panpipes.md
@@ -1,6 +1,6 @@
 # Preprocessing spatial data with Panpipes

-The `preprocess_spatial` workflow expects one or multiple `MuData` objects as input, each with a `spatial` slot. The workflow filters the data, followed by normalization, HVG selection, and PCA computation. The steps of the workflow are explained in greater detail [here](https://panpipes-pipelines.readthedocs.io/en/latest/workflows/preprocess_spatial.html).
+The `preprocess_spatial` workflow expects one or multiple `SpatialData` objects as input. The workflow filters the data, followed by normalization, HVG selection, and PCA computation. The steps of the workflow are explained in greater detail [here](https://panpipes-pipelines.readthedocs.io/en/latest/workflows/preprocess_spatial.html).

 For all the tutorials, we will append the `--local` command which ensures that the pipeline runs on the computing node you're currently on, namely your local machine or an interactive session on a computing node on a cluster.
@@ -16,7 +16,7 @@
mkdir preprocess
cd preprocess
```

-In this tutorial, we will use the output `Mudata` objects of the [Visium ingestion tutorial](../ingesting_visium_data/Ingesting_visium_data_with_panpipes.md). Namely, the `MuData` files saved in `spatial/ingestion/qc.data/`:
+In this tutorial, we will use the output `SpatialData` objects of the [Visium ingestion tutorial](../ingesting_visium_data/Ingesting_visium_data_with_panpipes.md). Namely, the `SpatialData` files saved in `spatial/ingestion/qc.data/`:

```
@@ -27,8 +27,8 @@ spatial
├── figures
├── logs
-├── qc.data # MuDatas with QC metrics
-│   ├── V1_Human_Heart_unfilt.h5mu
-│   └── V1_Human_Lymph_Node_unfilt.h5mu
+├── qc.data # SpatialDatas with QC metrics
+│   ├── V1_Human_Heart_unfilt.zarr
+│   └── V1_Human_Lymph_Node_unfilt.zarr
├── tmp
├── pipeline.log
├── pipeline.yml
@@ -37,7 +37,7 @@
└── V1_Human_Lymph_Node_cell_metadata.tsv
```

-The `preprocess_spatial` workflow allows you to preprocess one or multiple `MuData` objects **of the same assay, i.e. Visium or Vizgen,** in one run. For that, the workflow reads in all `.h5mu` files of the input directory. The `MuData` objects of the input directory are then **preprocessed with the same specified parameters**.
+The `preprocess_spatial` workflow allows you to preprocess one or multiple `SpatialData` objects **of the same assay, i.e. Visium, Vizgen, or Xenium,** in one run. For that, the workflow reads in all `.zarr` files of the input directory. The `SpatialData` objects of the input directory are then **preprocessed with the same specified parameters**.
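+
+The input discovery is equivalent to this minimal sketch (assuming the tutorial's directory layout; the workflow itself handles this for you):
+
+```
+from pathlib import Path
+import spatialdata as sd
+
+input_dir = Path("../ingestion/qc.data/")
+for zarr_path in sorted(input_dir.glob("*.zarr")):
+    sdata = sd.read_zarr(zarr_path)  # each slide is read and preprocessed separately,
+    print(zarr_path.name, sdata)     # but with the same parameters from pipeline.yml
+```
+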
## Edit yaml file
@@ -68,8 +68,8 @@ preprocess
│   ├── violin_var_total_counts.V1_Human_Heart.png
│   └── violin_var_total_counts.V1_Human_Lymph_Node.png
├── filtered.data
-│   ├──V1_Human_Heart_filtered.h5mu
-│   └── V1_Human_Lymph_Node_filtered.h5mu
+│   ├── V1_Human_Heart_filtered.zarr
+│   └── V1_Human_Lymph_Node_filtered.zarr
├── logs
│   ├── filtering.V1_Human_Heart_.log
│   ├── filtering.V1_Human_Lymph_Node_.log
@@ -86,7 +86,7 @@ preprocess
│   └── V1_Human_Lymph_Node_filtered_filtered_cell_metadata.tsv
```

-You can find the final `MuData` objects in the `spatial/preprocess/filtered.data` folder. Additionally, the metadata of the filtered `Mudata` objects is saved as tsv files in the `spatial/preprocess/tables` directory, together with csv-files containing the number of spots/cells after filtering.
+You can find the final `SpatialData` objects in the `spatial/preprocess/filtered.data` folder. Additionally, the metadata of the filtered `SpatialData` objects is saved as tsv files in the `spatial/preprocess/tables` directory, together with csv-files containing the number of spots/cells after filtering.

Post-filter plots are stored in `spatial/preprocess/figures/spatial`. The plots include visualizations of the spatial embeddings, as well as violin plots: