
Preprocess Custom Reference Expression Data for Web App
preprocess-reference.RmdThis vignette shows how to create custom reference data for PerturbPlan. The process has three steps:
- Preprocess your data
- Extract summary statistics from preprocessed data
- Save the result
Step 1: Preprocess Your Data
The goal is to preprocess your data into the format expected by
reference_data_processing(). There are two cases.
Case 1: You Have Cell Ranger Output
If your data is organized as Cell Ranger run directories, use
reference_data_preprocessing_10x() to extract the inputs
needed by reference_data_processing().
Required Layout
The convenience helper expects Cell Ranger count output arranged under a top-level directory:
path_to_top_level_output/
├── SRR_run_1/
│ └── outs/
│ ├── filtered_feature_bc_matrix/
│ │ ├── barcodes.tsv.gz
│ │ ├── features.tsv.gz
│ │ └── matrix.mtx.gz
│ ├── molecule_info.h5
│ ├── filtered_feature_bc_matrix.h5
│ └── metrics_summary.csv
├── SRR_run_2/
│ └── ...
└── SRR_run_3/
└── ...
reference_data_preprocessing_10x() is currently designed
for Cell Ranger count, not Cell Ranger
multi.
Run reference_data_preprocessing_10x()
We illustrate this function with a tiny example directory,
cellranger_tiny.
extdata_path <- system.file("extdata", package = "perturbplan")
preprocessed_data <- reference_data_preprocessing_10x(
path_to_top_level_output = extdata_path,
path_to_run_level_output = "cellranger_tiny",
h5_rough = TRUE
)
names(preprocessed_data)
#> [1] "response_matrix" "read_umi_table" "mapping_efficiency"
str(preprocessed_data, max.level = 1)
#> List of 3
#> $ response_matrix :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
#> $ read_umi_table :'data.frame': 11 obs. of 5 variables:
#> $ mapping_efficiency: num 1After this step, preprocessed_data contains the three
objects consumed by reference_data_processing():
response_matrixread_umi_tablemapping_efficiency
Case 2: You Do Not Have Cell Ranger Output
If your data does not come from Cell Ranger, you can still use
PerturbPlan as long as you can provide compatible inputs to
reference_data_processing().
At minimum:
-
response_matrixshould be a gene-by-cell matrix with genes in rows and cells in columns -
read_umi_tableshould contain UMI-level molecule information in the format expected by the library-fitting step -
mapping_efficiencyis optional but recommended when available
In this case, manipulate your data yourself into the required format,
checking ?reference_data_processing for the expected
input-format details. Once those objects are ready, proceed to Step
2.
Step 2: Extract Summary Statistics From Preprocessed Data
The examples below continue using the preprocessed_data
object created in Step 1, Case 1. If you prepared compatible inputs
yourself in Step 1, Case 2, the same call pattern applies to your own
response_matrix, read_umi_table, and optional
mapping_efficiency.
There are two common cases. In both cases, the summary statistics are
extracted from the preprocessed data via
reference_data_processing().
Case 1: Perturb-seq
Use gene_list = NULL to keep all genes. If desired,
select a TPM threshold for genes to include in processing. Here, we
illustrate using a threshold of 0.1.
set.seed(123)
pilot_data_perturbseq <- reference_data_processing(
response_matrix = preprocessed_data$response_matrix,
read_umi_table = preprocessed_data$read_umi_table,
mapping_efficiency = preprocessed_data$mapping_efficiency,
gene_list = NULL,
TPM_thres = 0.1
)
#> Starting pilot data preprocessing @ 2026-04-13 14:11:15.493232
#> Step 1: Computing gene expression information...
#> Start relative expression calculation @ 2026-04-13 14:11:15.51799
#> Finish relative expression calculation @ 2026-04-13 14:11:15.519783
#> Number of genes passing TPM threshold: 5
#> Start dispersion estimation (8 thread(s)) @ 2026-04-13 14:11:15.52056
#> [theta_batch_cpp] Starting computation
#> - G (genes) = 5
#> - C (cells) = 8
#> [gene 0] rel_expr = 0.0909091
#> [gene 0] t_0 = 0.0747664
#> [gene 0] theta_mle_row result = 401.014
#> [gene 1] rel_expr = 0.363636
#> [gene 1] t_0 = 1.06889
#> [gene 1] theta_mle_row result = 484.176
#> [gene 2] rel_expr = 0.181818
#> [gene 2] t_0 = 0.272921
#> [gene 2] theta_mle_row result = 468.498
#> [gene 3] rel_expr = 0.181818
#> [gene 3] t_0 = 0.272921
#> [gene 3] theta_mle_row result = 468.498
#> [gene 4] rel_expr = 0.181818
#> [gene 4] t_0 = 0.272921
#> [gene 4] theta_mle_row result = 468.498
#> [theta_batch_cpp] Done.
#> Finish dispersion estimation @ 2026-04-13 14:11:15.523117
#> Step 2: Estimating library parameters...
#> Completed pilot data preprocessing @ 2026-04-13 14:11:15.531704
#> Processed 5 genes
#> UMI_per_cell_at_saturation = 5, Model = ZTNB
#> Mapping efficiency = 1
names(pilot_data_perturbseq)
#> [1] "baseline_expression_stats" "library_parameters"
#> [3] "mapping_efficiency"Case 2: TAP-seq
If your reference expression data contain more genes than your
planned TAP-seq experiment, use gene_list to restrict the
analysis to a targeted panel. In that setting,
TPM_thres = 0 is often a reasonable choice because the
target panel is already preselected.
gene_list <- c("ENSG00000241860", "ENSG00000238009", "ENSG00000239945")
pilot_data_targeted <- reference_data_processing(
response_matrix = preprocessed_data$response_matrix,
read_umi_table = preprocessed_data$read_umi_table,
mapping_efficiency = preprocessed_data$mapping_efficiency,
gene_list = gene_list,
TPM_thres = 0
)
#> Starting pilot data preprocessing @ 2026-04-13 14:11:15.601852
#> Step 1: Computing gene expression information...
#> Start relative expression calculation @ 2026-04-13 14:11:15.613483
#> Finish relative expression calculation @ 2026-04-13 14:11:15.614167
#> Number of genes passing TPM threshold: 3
#> Start dispersion estimation (8 thread(s)) @ 2026-04-13 14:11:15.614839
#> [theta_batch_cpp] Starting computation
#> - G (genes) = 3
#> - C (cells) = 8
#> [gene 0] rel_expr = 0.5
#> [gene 0] t_0 = 1.33333
#> [gene 0] theta_mle_row result = 486.513
#> [gene 1] rel_expr = 0.25
#> [gene 1] t_0 = 0.5
#> [gene 1] theta_mle_row result = 475.754
#> [gene 2] rel_expr = 0.25
#> [gene 2] t_0 = 0.5
#> [gene 2] theta_mle_row result = 475.754
#> [theta_batch_cpp] Done.
#> Finish dispersion estimation @ 2026-04-13 14:11:15.615836
#> Step 2: Estimating library parameters...
#> Completed pilot data preprocessing @ 2026-04-13 14:11:15.620721
#> Processed 3 genes
#> UMI_per_cell_at_saturation = 6, Model = ZTNB
#> Mapping efficiency = 0.727
names(pilot_data_targeted)
#> [1] "baseline_expression_stats" "library_parameters"
#> [3] "mapping_efficiency"Step 3: Save the Result
The output of reference_data_processing() is your custom
reference dataset. Save that object as an RDS file:
saveRDS(pilot_data_perturbseq, "my_perturbplan_reference.rds")This RDS file is the file you will upload to the PerturbPlan web app as your custom reference dataset.