Preprocess Reference Expression data for Web App

This vignette demonstrates how to preprocess your own perturb-seq pilot data for use with the PerturbPlan web application and power analysis functions.

library(perturbplan)

Overview

This preprocessing workflow consists of two main steps:

reference_data_preprocessing_10x(): Loading raw data from multiple SRR runs of raw Cell Ranger outputs and processing them into a list containing:
- Gene-by-cell expression matrix
- UMI-level molecule information dataframe
- Proportion of reads confidently mapped to the whole transcriptome among all reads
reference_data_processing(): Fitting statistical models with the output from Step 1 and obtaining expression data, library and sequencing parameters required for power analysis. These parameters include:
- Baseline expression statistics: Baseline gene expression data: response_id, relative_expression and expression_size
- Library parameters: Reads-UMIs saturation curve parameters: UMI_per_cell and variation. These two parameters characterize library size and PCR amplification bias, given the cell type and assay of interest
- Mapping efficiency: Proportion of reads that are confidently mapped to the transcripts of genes we’re interested in.

The output after the two-step preprocessing can be uploaded to the Web App as custom reference data.

Step 1: Process Cell Ranger Outputs

The reference_data_preprocessing_10x() function aggregates Cell Ranger outputs from multiple sequencing runs (SRRs) into a single data structure.

Input Requirements

Your data should be organized with Cell Ranger output directories under a top-level folder:

path_to_top_level_output/
├── SRR_run_1/
│   ├── outs/
│   │   ├── filtered_feature_bc_matrix/
│   │   │   ├── barcodes.tsv.gz
│   │   │   ├── features.tsv.gz
│   │   │   └── matrix.mtx.gz
│   │   ├── molecule_info.h5
│   │   ├── filtered_feature_bc_matrix.h5
│   │   └── metrics_summary.csv
├── SRR_run_2/
│   └── ...
└── SRR_run_3/
    └── ...

The SRR directories should be generated by a recent version of Cell Ranger configured for the perturbation (CRISPR or Perturb-seq) workflow. For more information about input format and required fields, see ?obtain_qc_response_data, ?obtain_qc_read_umi_table, and ?obtain_mapping_efficiency.

Note:

In some cases, the subfolder filtered_feature_bc_matrix may need to be created by unzipping the filtered_feature_bc_matrix.tar.gz file.
Each metrics_summary.csv file must include a column named “Number of Reads”, which is required to estimate mapping efficiency. This column may need to be added or edited manually when Cell Ranger is run with multiple libraries or multiple samples.

Function Usage

There’s an example SRR folder called cellranger_tiny under the folder extdata in our package, and we’ll use that for demonstration:

# Point to directory containing example Cell Ranger outputs
extdata_path <- system.file("extdata", package = "perturbplan")

# Aggregate data from all SRR runs
raw_data <- reference_data_preprocessing_10x(
  path_to_top_level_output = extdata_path,
  path_to_run_level_output = "cellranger_tiny",  # Only read subfolder cellranger_tiny 
  h5_rough = TRUE,  # Use first SRR for QC data (faster)
  skip_mapping_efficiency = FALSE  # Estimate mapping efficiency
)

Arguments:

path_to_top_level_output: Path to directory containing Cell Ranger run-level subdirectories
path_to_run_level_output: Optional character vector specifying subset of run directories to process
h5_rough: If TRUE (default), the function will extract the UMI-level molecule information dataframe from first SRR only for speed. If FALSE, combines UMI-level molecule information from all SRRs
skip_mapping_efficiency: If TRUE, skips estimation of mapping efficiency. If FALSE (default), calculates naive mapping efficiency using data from a properly formatted metrics_summary.csv file as described above.

Function Output

# Inspect structure
str(raw_data)
#> List of 3
#>  $ response_matrix   :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
#>   .. ..@ i       : int [1:11] 0 1 2 3 4 1 2 1 4 1 ...
#>   .. ..@ p       : int [1:9] 0 1 2 3 4 5 7 9 11
#>   .. ..@ Dim     : int [1:2] 5 8
#>   .. ..@ Dimnames:List of 2
#>   .. .. ..$ : chr [1:5] "ENSG00000243485" "ENSG00000238009" "ENSG00000239945" "ENSG00000241860" ...
#>   .. .. ..$ : chr [1:8] "AGCAGCCGTCCAAGTT-1" "AAACGGGTCAGCTCGG-1" "AAAGTAGCATCCCACT-1" "AAACCTGGTATATGAG-1" ...
#>   .. ..@ x       : num [1:11] 1 1 1 1 1 1 1 1 1 1 ...
#>   .. ..@ factors : list()
#>  $ read_umi_table    :'data.frame':  11 obs. of  5 variables:
#>   ..$ num_reads  : int [1:11] 2 1 1 2 1 1 1 1 1 1 ...
#>   ..$ UMI_id     : num [1:11] 139105 723247 998389 622094 584568 ...
#>   ..$ cell_id    : chr [1:11] "AAACCTGGTATATGAG-1" "AAACGGGTCAGCTCGG-1" "AAAGTAGCATCCCACT-1" "AAAGTAGTCCAAATGC-1" ...
#>   ..$ response_id: chr [1:11] "ENSG00000241860" "ENSG00000238009" "ENSG00000239945" "ENSG00000286448" ...
#>   ..$ srr_idx    : chr [1:11] "cellranger_tiny" "cellranger_tiny" "cellranger_tiny" "cellranger_tiny" ...
#>  $ mapping_efficiency: num 1

Format of each output:

response_matrix: a sparse gene-by-cell expression matrix (genes as rows, cells as columns) with row and column names
read_umi_table: a dataframe with UMI-level molecule information, including columns:
- num_reads: Number of reads supporting this UMI-cell combination
- UMI_id: UMI index of this UMI-cell combination
- cell_id: Cell barcode of this UMI-cell combination
- response_id: Gene identifier (e.g., Ensembl ID)
- srr_idx: SRR run identifier of the UMI
mapping_efficiency: A numeric value between 0 and 1 if skip_mapping_efficiency = FALSE, and NULL if skip_mapping_efficiency = TRUE

Step 2: Extract Reference Data

The reference_data_processing() function fits statistical models to extract gene expression data, library parameters and more sophisticated mapping efficiency required by PerturbPlan. We postpone the details of these two models to the end of this document.

Input Requirements

The input format of the function is the same as the output format of the function reference_data_preprocessing_10x().

Function Usage

We demonstrate two use cases of reference_data_processing() with the aggregated data from Step 1 (raw_data): one for perturb-seq experimental design using all genes, and one for TAP-seq experimental design using a targeted gene list.

# Set seed for reproducibility
set.seed(123)

# Process into final pilot data format with all genes, suitable for perturb-seq experimental design
pilot_data_perturbseq <- reference_data_processing(
  response_matrix = raw_data$response_matrix,
  read_umi_table = raw_data$read_umi_table,
  mapping_efficiency = raw_data$mapping_efficiency,
  gene_list = NULL,     # Use all genes
  TPM_thres = 0.1,      # Default expression threshold for filtering
  downsample_ratio = 0.6,  # Downsampling for sequencing
  D2_rough = 0.4,       # prior for variation parameter
  h5_only = FALSE,      # Set TRUE to skip expression model fitting
  n_threads = NULL      # No parallel processing
)
#> Starting pilot data preprocessing @ 2025-10-13 22:56:33.52766
#> Step 1: Computing gene expression information...
#> Start relative expression calculation @ 2025-10-13 22:56:33.552606
#> Finish relative expression calculation @ 2025-10-13 22:56:33.555532
#> Number of genes passing TPM threshold: 5
#> Start dispersion estimation (8 thread(s)) @ 2025-10-13 22:56:33.556631
#> [theta_batch_cpp] Starting computation
#>  - G (genes) = 5
#>  - C (cells) = 8
#> [gene 0] rel_expr = 0.0909091
#> [gene 0] t_0 = 0.0747664
#> [gene 0] theta_mle_row result = 401.014
#> [gene 1] rel_expr = 0.363636
#> [gene 1] t_0 = 1.06889
#> [gene 1] theta_mle_row result = 484.176
#> [gene 2] rel_expr = 0.181818
#> [gene 2] t_0 = 0.272921
#> [gene 2] theta_mle_row result = 468.498
#> [gene 3] rel_expr = 0.181818
#> [gene 3] t_0 = 0.272921
#> [gene 3] theta_mle_row result = 468.498
#> [gene 4] rel_expr = 0.181818
#> [gene 4] t_0 = 0.272921
#> [gene 4] theta_mle_row result = 468.498
#> [theta_batch_cpp] Done.
#> Finish dispersion estimation @ 2025-10-13 22:56:33.561745
#> Step 2: Estimating library parameters...
#> Completed pilot data preprocessing @ 2025-10-13 22:56:33.578881
#> Processed 5 genes
#> Library parameters: UMI_per_cell = 6, variation = 0.371
#> Mapping efficiency = 1

# Process into pilot data format with targetd genes only, suitable for TAP-seq experimental design
gene_list <- c("ENSG00000241860", "ENSG00000238009", "ENSG00000239945")
pilot_data_tapseq <- reference_data_processing(
  response_matrix = raw_data$response_matrix,
  read_umi_table = raw_data$read_umi_table,
  mapping_efficiency = raw_data$mapping_efficiency,
  gene_list = gene_list, # Restrict to specific genes
  TPM_thres = 0          # No expression threshold for filtering
)
#> Starting pilot data preprocessing @ 2025-10-13 22:56:33.582692
#> Step 1: Computing gene expression information...
#> Start relative expression calculation @ 2025-10-13 22:56:33.597784
#> Finish relative expression calculation @ 2025-10-13 22:56:33.59869
#> Number of genes passing TPM threshold: 3
#> Start dispersion estimation (8 thread(s)) @ 2025-10-13 22:56:33.599555
#> [theta_batch_cpp] Starting computation
#>  - G (genes) = 3
#>  - C (cells) = 8
#> [gene 0] rel_expr = 0.5
#> [gene 0] t_0 = 1.33333
#> [gene 0] theta_mle_row result = 486.513
#> [gene 1] rel_expr = 0.25
#> [gene 1] t_0 = 0.5
#> [gene 1] theta_mle_row result = 475.754
#> [gene 2] rel_expr = 0.25
#> [gene 2] t_0 = 0.5
#> [gene 2] theta_mle_row result = 475.754
#> [theta_batch_cpp] Done.
#> Finish dispersion estimation @ 2025-10-13 22:56:33.601596
#> Step 2: Estimating library parameters...
#> Completed pilot data preprocessing @ 2025-10-13 22:56:33.611949
#> Processed 3 genes
#> Library parameters: UMI_per_cell = 7, variation = 0.366
#> Mapping efficiency = 0.727

Arguments:

response_matrix: Gene-by-cell expression matrix from Step 1 (or NULL if h5_only = TRUE)
read_umi_table: QC dataframe from Step 1
mapping_efficiency: Mapping efficiency from Step 1
gene_list: Optional character vector to restrict analysis to specific genes
TPM_thres: Threshold for filtering lowly expressed genes (default: 0.1)
downsample_ratio: Proportion of downsampling when learning read-UMI model (default: 0.7)
D2_rough: Rough prior value for library variation parameter (default for perturbseq experiment: 0.3)
h5_only: If TRUE, the function will skip the construction of baseline expression dataframe to save time (default: FALSE)
n_threads: Number of parallel processing threads (default: NULL for single-threaded)

Note: In practice, our function demonstrates robustness to downsampling with different seeds and moderate misspecification of the prior for the variation parameter (D2_rough).

Function Output

# Inspect structure of perturb-seq pilot data
str(pilot_data_perturbseq)
#> List of 3
#>  $ baseline_expression_stats:'data.frame':   5 obs. of  3 variables:
#>   ..$ response_id        : chr [1:5] "ENSG00000243485" "ENSG00000238009" "ENSG00000239945" "ENSG00000241860" ...
#>   ..$ relative_expression: num [1:5] 0.0909 0.3636 0.1818 0.1818 0.1818
#>   ..$ expression_size    : num [1:5] 401 484 468 468 468
#>  $ library_parameters       :List of 2
#>   ..$ UMI_per_cell: num 6.12
#>   ..$ variation   : num 0.371
#>  $ mapping_efficiency       : num 1

# Inspect structure of TAP-seq pilot data
str(pilot_data_tapseq)
#> List of 3
#>  $ baseline_expression_stats:'data.frame':   3 obs. of  3 variables:
#>   ..$ response_id        : chr [1:3] "ENSG00000238009" "ENSG00000239945" "ENSG00000241860"
#>   ..$ relative_expression: num [1:3] 0.5 0.25 0.25
#>   ..$ expression_size    : num [1:3] 487 476 476
#>  $ library_parameters       :List of 2
#>   ..$ UMI_per_cell: num 6.82
#>   ..$ variation   : num 0.366
#>  $ mapping_efficiency       : num 0.727

Format of each output:

baseline_expression_stats: Dataframe with columns:
- response_id: Ensembl gene identifier
- relative_expression: Estimated relative expression proportions for each gene, normalized to sum to 1 across all genes of interest
- expression_size: Estimated dispersion representing gene-specific expression variability
library_parameters: List with:
- UMI_per_cell: Estimated maximum UMI count per cell
- variation: Estimated PCR amplification variation parameter
mapping_efficiency: Adjusted mapping efficiency, accounting for the fraction of reads mapped to the genes of interest

Note: The relative expression is higher and the mapping efficiency is lower in the TAP-seq example because of significantly smaller target gene panel compared to Perturb-seq.

Mathematical details of the models used in Step 2

We consider two models, one for modeling gene expression and one for modeling read-UMI relationship.

Negative binomial expression model

For each gene, the function fits a negative binomial (NB) model to characterize the distribution of gene expression levels across cells:

\[\text{gene_expression} \sim \text{NB}(\text{mean} = \text{library_size} \times \text{relative_expression}, \text{size} = \text{expression_size})\]

where

gene_expression: Number of observed UMIs for the gene in each cell
library_size: Number of observed UMIs per cell

are the data to use, and

relative_expression: Relative expression level of the gene
expression_size: Reciprocal of dispersion

are the estimated values.

Read-UMI model

The function fits a saturation curve that relates mapped reads per cell to observed UMIs per cell (library_size):

\[\text{library_size} = \text{UMI_per_cell} \times \left(1 - \exp\left(-\frac{\text{mapped_reads_per_cell}}{\text{UMI_per_cell}}\right) \times \left(1 + \text{variation} \times \frac{\text{mapped_reads_per_cell}^2}{2 \times \text{UMI_per_cell}^2}\right)\right)\]

where

mapped_reads_per_cell: Number of mapped reads per cell
library_size: Number of observed UMIs per cell

are the data used to fit the model, and

UMI_per_cell: Total UMI per cell obtained at sequencing saturation
variation: Variation parameter characterizing PCR amplification bias (between 0 and 1)

are the fitted parameters.

Overview

Step 1: Process Cell Ranger Outputs

Input Requirements

Function Usage

Function Output

Step 2: Extract Reference Data

Input Requirements

Function Usage

Function Output

Mathematical details of the models used in Step 2

Negative binomial expression model

Read-UMI model

See Also