
Preprocess Reference Expression data for Web App
preprocess-reference.Rmd
This vignette demonstrates how to preprocess your own perturb-seq pilot data for use with the PerturbPlan web application and power analysis functions.
Overview
This preprocessing workflow consists of two main steps:
-
reference_data_preprocessing_10x()
: Loading raw data from multiple SRR runs of raw Cell Ranger outputs and processing them into a list containing:- Gene-by-cell expression matrix
- UMI-level molecule information dataframe
- Proportion of reads confidently mapped to the whole transcriptome among all reads
-
reference_data_processing()
: Fitting statistical models with the output from Step 1 and obtaining expression data, library and sequencing parameters required for power analysis. These parameters include:-
Baseline expression statistics: Baseline gene
expression data:
response_id
,relative_expression
andexpression_size
-
Library parameters: Reads-UMIs saturation curve
parameters:
UMI_per_cell
andvariation
. These two parameters characterize library size and PCR amplification bias, given the cell type and assay of interest - Mapping efficiency: Proportion of reads that are confidently mapped to the transcripts of genes we’re interested in.
-
Baseline expression statistics: Baseline gene
expression data:
The output after the two-step preprocessing can be uploaded to the Web App as custom reference data.
Step 1: Process Cell Ranger Outputs
The reference_data_preprocessing_10x()
function
aggregates Cell Ranger outputs from multiple sequencing runs (SRRs) into
a single data structure.
Input Requirements
Your data should be organized with Cell Ranger output directories under a top-level folder:
path_to_top_level_output/
├── SRR_run_1/
│ ├── outs/
│ │ ├── filtered_feature_bc_matrix/
│ │ │ ├── barcodes.tsv.gz
│ │ │ ├── features.tsv.gz
│ │ │ └── matrix.mtx.gz
│ │ ├── molecule_info.h5
│ │ ├── filtered_feature_bc_matrix.h5
│ │ └── metrics_summary.csv
├── SRR_run_2/
│ └── ...
└── SRR_run_3/
└── ...
The SRR directories should be generated by a recent version of Cell
Ranger configured for the perturbation (CRISPR or Perturb-seq) workflow.
For more information about input format and required fields, see
?obtain_qc_response_data
,
?obtain_qc_read_umi_table
, and
?obtain_mapping_efficiency
.
Note:
- In some cases, the subfolder
filtered_feature_bc_matrix
may need to be created by unzipping thefiltered_feature_bc_matrix.tar.gz
file. - Each
metrics_summary.csv
file must include a column named “Number of Reads”, which is required to estimate mapping efficiency. This column may need to be added or edited manually when Cell Ranger is run with multiple libraries or multiple samples.
Function Usage
There’s an example SRR folder called cellranger_tiny
under the folder extdata
in our package, and we’ll use that
for demonstration:
# Point to directory containing example Cell Ranger outputs
extdata_path <- system.file("extdata", package = "perturbplan")
# Aggregate data from all SRR runs
raw_data <- reference_data_preprocessing_10x(
path_to_top_level_output = extdata_path,
path_to_run_level_output = "cellranger_tiny", # Only read subfolder cellranger_tiny
h5_rough = TRUE, # Use first SRR for QC data (faster)
skip_mapping_efficiency = FALSE # Estimate mapping efficiency
)
Arguments:
-
path_to_top_level_output
: Path to directory containing Cell Ranger run-level subdirectories -
path_to_run_level_output
: Optional character vector specifying subset of run directories to process -
h5_rough
: IfTRUE
(default), the function will extract the UMI-level molecule information dataframe from first SRR only for speed. IfFALSE
, combines UMI-level molecule information from all SRRs -
skip_mapping_efficiency
: IfTRUE
, skips estimation of mapping efficiency. IfFALSE
(default), calculates naive mapping efficiency using data from a properly formattedmetrics_summary.csv
file as described above.
Function Output
# Inspect structure
str(raw_data)
#> List of 3
#> $ response_matrix :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
#> .. ..@ i : int [1:11] 0 1 2 3 4 1 2 1 4 1 ...
#> .. ..@ p : int [1:9] 0 1 2 3 4 5 7 9 11
#> .. ..@ Dim : int [1:2] 5 8
#> .. ..@ Dimnames:List of 2
#> .. .. ..$ : chr [1:5] "ENSG00000243485" "ENSG00000238009" "ENSG00000239945" "ENSG00000241860" ...
#> .. .. ..$ : chr [1:8] "AGCAGCCGTCCAAGTT-1" "AAACGGGTCAGCTCGG-1" "AAAGTAGCATCCCACT-1" "AAACCTGGTATATGAG-1" ...
#> .. ..@ x : num [1:11] 1 1 1 1 1 1 1 1 1 1 ...
#> .. ..@ factors : list()
#> $ read_umi_table :'data.frame': 11 obs. of 5 variables:
#> ..$ num_reads : int [1:11] 2 1 1 2 1 1 1 1 1 1 ...
#> ..$ UMI_id : num [1:11] 139105 723247 998389 622094 584568 ...
#> ..$ cell_id : chr [1:11] "AAACCTGGTATATGAG-1" "AAACGGGTCAGCTCGG-1" "AAAGTAGCATCCCACT-1" "AAAGTAGTCCAAATGC-1" ...
#> ..$ response_id: chr [1:11] "ENSG00000241860" "ENSG00000238009" "ENSG00000239945" "ENSG00000286448" ...
#> ..$ srr_idx : chr [1:11] "cellranger_tiny" "cellranger_tiny" "cellranger_tiny" "cellranger_tiny" ...
#> $ mapping_efficiency: num 1
Format of each output:
-
response_matrix
: a sparse gene-by-cell expression matrix (genes as rows, cells as columns) with row and column names -
read_umi_table
: a dataframe with UMI-level molecule information, including columns:-
num_reads
: Number of reads supporting this UMI-cell combination -
UMI_id
: UMI index of this UMI-cell combination -
cell_id
: Cell barcode of this UMI-cell combination -
response_id
: Gene identifier (e.g., Ensembl ID) -
srr_idx
: SRR run identifier of the UMI
-
-
mapping_efficiency
: A numeric value between 0 and 1 ifskip_mapping_efficiency = FALSE
, and NULL ifskip_mapping_efficiency = TRUE
Step 2: Extract Reference Data
The reference_data_processing()
function fits
statistical models to extract gene expression data, library parameters
and more sophisticated mapping efficiency required by PerturbPlan. We
postpone the details of these two models to the end of this
document.
Input Requirements
The input format of the function is the same as the output format of
the function reference_data_preprocessing_10x()
.
Function Usage
We demonstrate two use cases of
reference_data_processing()
with the aggregated data from
Step 1 (raw_data
): one for perturb-seq experimental design
using all genes, and one for TAP-seq experimental design using a
targeted gene list.
# Set seed for reproducibility
set.seed(123)
# Process into final pilot data format with all genes, suitable for perturb-seq experimental design
pilot_data_perturbseq <- reference_data_processing(
response_matrix = raw_data$response_matrix,
read_umi_table = raw_data$read_umi_table,
mapping_efficiency = raw_data$mapping_efficiency,
gene_list = NULL, # Use all genes
TPM_thres = 0.1, # Default expression threshold for filtering
downsample_ratio = 0.6, # Downsampling for sequencing
D2_rough = 0.4, # prior for variation parameter
h5_only = FALSE, # Set TRUE to skip expression model fitting
n_threads = NULL # No parallel processing
)
#> Starting pilot data preprocessing @ 2025-10-13 22:56:33.52766
#> Step 1: Computing gene expression information...
#> Start relative expression calculation @ 2025-10-13 22:56:33.552606
#> Finish relative expression calculation @ 2025-10-13 22:56:33.555532
#> Number of genes passing TPM threshold: 5
#> Start dispersion estimation (8 thread(s)) @ 2025-10-13 22:56:33.556631
#> [theta_batch_cpp] Starting computation
#> - G (genes) = 5
#> - C (cells) = 8
#> [gene 0] rel_expr = 0.0909091
#> [gene 0] t_0 = 0.0747664
#> [gene 0] theta_mle_row result = 401.014
#> [gene 1] rel_expr = 0.363636
#> [gene 1] t_0 = 1.06889
#> [gene 1] theta_mle_row result = 484.176
#> [gene 2] rel_expr = 0.181818
#> [gene 2] t_0 = 0.272921
#> [gene 2] theta_mle_row result = 468.498
#> [gene 3] rel_expr = 0.181818
#> [gene 3] t_0 = 0.272921
#> [gene 3] theta_mle_row result = 468.498
#> [gene 4] rel_expr = 0.181818
#> [gene 4] t_0 = 0.272921
#> [gene 4] theta_mle_row result = 468.498
#> [theta_batch_cpp] Done.
#> Finish dispersion estimation @ 2025-10-13 22:56:33.561745
#> Step 2: Estimating library parameters...
#> Completed pilot data preprocessing @ 2025-10-13 22:56:33.578881
#> Processed 5 genes
#> Library parameters: UMI_per_cell = 6, variation = 0.371
#> Mapping efficiency = 1
# Process into pilot data format with targetd genes only, suitable for TAP-seq experimental design
gene_list <- c("ENSG00000241860", "ENSG00000238009", "ENSG00000239945")
pilot_data_tapseq <- reference_data_processing(
response_matrix = raw_data$response_matrix,
read_umi_table = raw_data$read_umi_table,
mapping_efficiency = raw_data$mapping_efficiency,
gene_list = gene_list, # Restrict to specific genes
TPM_thres = 0 # No expression threshold for filtering
)
#> Starting pilot data preprocessing @ 2025-10-13 22:56:33.582692
#> Step 1: Computing gene expression information...
#> Start relative expression calculation @ 2025-10-13 22:56:33.597784
#> Finish relative expression calculation @ 2025-10-13 22:56:33.59869
#> Number of genes passing TPM threshold: 3
#> Start dispersion estimation (8 thread(s)) @ 2025-10-13 22:56:33.599555
#> [theta_batch_cpp] Starting computation
#> - G (genes) = 3
#> - C (cells) = 8
#> [gene 0] rel_expr = 0.5
#> [gene 0] t_0 = 1.33333
#> [gene 0] theta_mle_row result = 486.513
#> [gene 1] rel_expr = 0.25
#> [gene 1] t_0 = 0.5
#> [gene 1] theta_mle_row result = 475.754
#> [gene 2] rel_expr = 0.25
#> [gene 2] t_0 = 0.5
#> [gene 2] theta_mle_row result = 475.754
#> [theta_batch_cpp] Done.
#> Finish dispersion estimation @ 2025-10-13 22:56:33.601596
#> Step 2: Estimating library parameters...
#> Completed pilot data preprocessing @ 2025-10-13 22:56:33.611949
#> Processed 3 genes
#> Library parameters: UMI_per_cell = 7, variation = 0.366
#> Mapping efficiency = 0.727
Arguments:
-
response_matrix
: Gene-by-cell expression matrix from Step 1 (or NULL ifh5_only = TRUE
) -
read_umi_table
: QC dataframe from Step 1 -
mapping_efficiency
: Mapping efficiency from Step 1 -
gene_list
: Optional character vector to restrict analysis to specific genes -
TPM_thres
: Threshold for filtering lowly expressed genes (default: 0.1) -
downsample_ratio
: Proportion of downsampling when learning read-UMI model (default: 0.7) -
D2_rough
: Rough prior value for library variation parameter (default for perturbseq experiment: 0.3) -
h5_only
: IfTRUE
, the function will skip the construction of baseline expression dataframe to save time (default: FALSE) -
n_threads
: Number of parallel processing threads (default: NULL for single-threaded)
Note: In practice, our function demonstrates
robustness to downsampling with different seeds and moderate
misspecification of the prior for the variation parameter
(D2_rough
).
Function Output
# Inspect structure of perturb-seq pilot data
str(pilot_data_perturbseq)
#> List of 3
#> $ baseline_expression_stats:'data.frame': 5 obs. of 3 variables:
#> ..$ response_id : chr [1:5] "ENSG00000243485" "ENSG00000238009" "ENSG00000239945" "ENSG00000241860" ...
#> ..$ relative_expression: num [1:5] 0.0909 0.3636 0.1818 0.1818 0.1818
#> ..$ expression_size : num [1:5] 401 484 468 468 468
#> $ library_parameters :List of 2
#> ..$ UMI_per_cell: num 6.12
#> ..$ variation : num 0.371
#> $ mapping_efficiency : num 1
# Inspect structure of TAP-seq pilot data
str(pilot_data_tapseq)
#> List of 3
#> $ baseline_expression_stats:'data.frame': 3 obs. of 3 variables:
#> ..$ response_id : chr [1:3] "ENSG00000238009" "ENSG00000239945" "ENSG00000241860"
#> ..$ relative_expression: num [1:3] 0.5 0.25 0.25
#> ..$ expression_size : num [1:3] 487 476 476
#> $ library_parameters :List of 2
#> ..$ UMI_per_cell: num 6.82
#> ..$ variation : num 0.366
#> $ mapping_efficiency : num 0.727
Format of each output:
-
baseline_expression_stats
: Dataframe with columns:-
response_id
: Ensembl gene identifier -
relative_expression
: Estimated relative expression proportions for each gene, normalized to sum to 1 across all genes of interest -
expression_size
: Estimated dispersion representing gene-specific expression variability
-
-
library_parameters
: List with:-
UMI_per_cell
: Estimated maximum UMI count per cell -
variation
: Estimated PCR amplification variation parameter
-
-
mapping_efficiency
: Adjusted mapping efficiency, accounting for the fraction of reads mapped to the genes of interest
Note: The relative expression is higher and the mapping efficiency is lower in the TAP-seq example because of significantly smaller target gene panel compared to Perturb-seq.
Mathematical details of the models used in Step 2
We consider two models, one for modeling gene expression and one for modeling read-UMI relationship.
Negative binomial expression model
For each gene, the function fits a negative binomial (NB) model to characterize the distribution of gene expression levels across cells:
\[\text{gene_expression} \sim \text{NB}(\text{mean} = \text{library_size} \times \text{relative_expression}, \text{size} = \text{expression_size})\]
where
-
gene_expression
: Number of observed UMIs for the gene in each cell -
library_size
: Number of observed UMIs per cell
are the data to use, and
-
relative_expression
: Relative expression level of the gene -
expression_size
: Reciprocal of dispersion
are the estimated values.
Read-UMI model
The function fits a saturation curve that relates mapped reads per
cell to observed UMIs per cell (library_size
):
\[\text{library_size} = \text{UMI_per_cell} \times \left(1 - \exp\left(-\frac{\text{mapped_reads_per_cell}}{\text{UMI_per_cell}}\right) \times \left(1 + \text{variation} \times \frac{\text{mapped_reads_per_cell}^2}{2 \times \text{UMI_per_cell}^2}\right)\right)\]
where
-
mapped_reads_per_cell
: Number of mapped reads per cell -
library_size
: Number of observed UMIs per cell
are the data used to fit the model, and
-
UMI_per_cell
: Total UMI per cell obtained at sequencing saturation -
variation
: Variation parameter characterizing PCR amplification bias (between 0 and 1)
are the fitted parameters.
See Also
-
?reference_data_preprocessing_10x
: Detailed documentation ofreference_data_preprocessing_10x()
. -
?reference_data_processing
: Detailed documentation ofreference_data_processing()
. -
?obtain_qc_response_data
,?obtain_qc_read_umi_table
, and?obtain_mapping_efficiency
: Documentation on obtaining each component of the output inreference_data_preprocessing_10x()
. -
?obtain_expression_information
: Documentation on fitting the negative binomial expression model inreference_data_processing()
. -
?library_computation
: Documentation on fitting the read-UMI model inreference_data_processing()
.