Skip to contents

This vignette shows how to create custom reference data for PerturbPlan. The process has three steps:

  1. Preprocess your data
  2. Extract summary statistics from preprocessed data
  3. Save the result

Step 1: Preprocess Your Data

The goal is to preprocess your data into the format expected by reference_data_processing(). There are two cases.

Case 1: You Have Cell Ranger Output

If your data is organized as Cell Ranger run directories, use reference_data_preprocessing_10x() to extract the inputs needed by reference_data_processing().

Required Layout

The convenience helper expects Cell Ranger count output arranged under a top-level directory:

path_to_top_level_output/
├── SRR_run_1/
│   └── outs/
│       ├── filtered_feature_bc_matrix/
│       │   ├── barcodes.tsv.gz
│       │   ├── features.tsv.gz
│       │   └── matrix.mtx.gz
│       ├── molecule_info.h5
│       ├── filtered_feature_bc_matrix.h5
│       └── metrics_summary.csv
├── SRR_run_2/
│   └── ...
└── SRR_run_3/
    └── ...

reference_data_preprocessing_10x() is currently designed for Cell Ranger count, not Cell Ranger multi.

Run reference_data_preprocessing_10x()

We illustrate this function with a tiny example directory, cellranger_tiny.

extdata_path <- system.file("extdata", package = "perturbplan")

preprocessed_data <- reference_data_preprocessing_10x(
  path_to_top_level_output = extdata_path,
  path_to_run_level_output = "cellranger_tiny",
  h5_rough = TRUE
)

names(preprocessed_data)
#> [1] "response_matrix"    "read_umi_table"     "mapping_efficiency"
str(preprocessed_data, max.level = 1)
#> List of 3
#>  $ response_matrix   :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
#>  $ read_umi_table    :'data.frame':  11 obs. of  5 variables:
#>  $ mapping_efficiency: num 1

After this step, preprocessed_data contains the three objects consumed by reference_data_processing():

  • response_matrix
  • read_umi_table
  • mapping_efficiency

Case 2: You Do Not Have Cell Ranger Output

If your data does not come from Cell Ranger, you can still use PerturbPlan as long as you can provide compatible inputs to reference_data_processing().

At minimum:

  • response_matrix should be a gene-by-cell matrix with genes in rows and cells in columns
  • read_umi_table should contain UMI-level molecule information in the format expected by the library-fitting step
  • mapping_efficiency is optional but recommended when available

In this case, manipulate your data yourself into the required format, checking ?reference_data_processing for the expected input-format details. Once those objects are ready, proceed to Step 2.

Step 2: Extract Summary Statistics From Preprocessed Data

The examples below continue using the preprocessed_data object created in Step 1, Case 1. If you prepared compatible inputs yourself in Step 1, Case 2, the same call pattern applies to your own response_matrix, read_umi_table, and optional mapping_efficiency.

There are two common cases. In both cases, the summary statistics are extracted from the preprocessed data via reference_data_processing().

Case 1: Perturb-seq

Use gene_list = NULL to keep all genes. If desired, select a TPM threshold for genes to include in processing. Here, we illustrate using a threshold of 0.1.

set.seed(123)

pilot_data_perturbseq <- reference_data_processing(
  response_matrix = preprocessed_data$response_matrix,
  read_umi_table = preprocessed_data$read_umi_table,
  mapping_efficiency = preprocessed_data$mapping_efficiency,
  gene_list = NULL,
  TPM_thres = 0.1
)
#> Starting pilot data preprocessing @ 2026-04-13 14:11:15.493232
#> Step 1: Computing gene expression information...
#> Start relative expression calculation @ 2026-04-13 14:11:15.51799
#> Finish relative expression calculation @ 2026-04-13 14:11:15.519783
#> Number of genes passing TPM threshold: 5
#> Start dispersion estimation (8 thread(s)) @ 2026-04-13 14:11:15.52056
#> [theta_batch_cpp] Starting computation
#>  - G (genes) = 5
#>  - C (cells) = 8
#> [gene 0] rel_expr = 0.0909091
#> [gene 0] t_0 = 0.0747664
#> [gene 0] theta_mle_row result = 401.014
#> [gene 1] rel_expr = 0.363636
#> [gene 1] t_0 = 1.06889
#> [gene 1] theta_mle_row result = 484.176
#> [gene 2] rel_expr = 0.181818
#> [gene 2] t_0 = 0.272921
#> [gene 2] theta_mle_row result = 468.498
#> [gene 3] rel_expr = 0.181818
#> [gene 3] t_0 = 0.272921
#> [gene 3] theta_mle_row result = 468.498
#> [gene 4] rel_expr = 0.181818
#> [gene 4] t_0 = 0.272921
#> [gene 4] theta_mle_row result = 468.498
#> [theta_batch_cpp] Done.
#> Finish dispersion estimation @ 2026-04-13 14:11:15.523117
#> Step 2: Estimating library parameters...
#> Completed pilot data preprocessing @ 2026-04-13 14:11:15.531704
#> Processed 5 genes
#> UMI_per_cell_at_saturation = 5, Model = ZTNB
#> Mapping efficiency = 1

names(pilot_data_perturbseq)
#> [1] "baseline_expression_stats" "library_parameters"       
#> [3] "mapping_efficiency"

Case 2: TAP-seq

If your reference expression data contain more genes than your planned TAP-seq experiment, use gene_list to restrict the analysis to a targeted panel. In that setting, TPM_thres = 0 is often a reasonable choice because the target panel is already preselected.

gene_list <- c("ENSG00000241860", "ENSG00000238009", "ENSG00000239945")

pilot_data_targeted <- reference_data_processing(
  response_matrix = preprocessed_data$response_matrix,
  read_umi_table = preprocessed_data$read_umi_table,
  mapping_efficiency = preprocessed_data$mapping_efficiency,
  gene_list = gene_list,
  TPM_thres = 0
)
#> Starting pilot data preprocessing @ 2026-04-13 14:11:15.601852
#> Step 1: Computing gene expression information...
#> Start relative expression calculation @ 2026-04-13 14:11:15.613483
#> Finish relative expression calculation @ 2026-04-13 14:11:15.614167
#> Number of genes passing TPM threshold: 3
#> Start dispersion estimation (8 thread(s)) @ 2026-04-13 14:11:15.614839
#> [theta_batch_cpp] Starting computation
#>  - G (genes) = 3
#>  - C (cells) = 8
#> [gene 0] rel_expr = 0.5
#> [gene 0] t_0 = 1.33333
#> [gene 0] theta_mle_row result = 486.513
#> [gene 1] rel_expr = 0.25
#> [gene 1] t_0 = 0.5
#> [gene 1] theta_mle_row result = 475.754
#> [gene 2] rel_expr = 0.25
#> [gene 2] t_0 = 0.5
#> [gene 2] theta_mle_row result = 475.754
#> [theta_batch_cpp] Done.
#> Finish dispersion estimation @ 2026-04-13 14:11:15.615836
#> Step 2: Estimating library parameters...
#> Completed pilot data preprocessing @ 2026-04-13 14:11:15.620721
#> Processed 3 genes
#> UMI_per_cell_at_saturation = 6, Model = ZTNB
#> Mapping efficiency = 0.727

names(pilot_data_targeted)
#> [1] "baseline_expression_stats" "library_parameters"       
#> [3] "mapping_efficiency"

Step 3: Save the Result

The output of reference_data_processing() is your custom reference dataset. Save that object as an RDS file:

saveRDS(pilot_data_perturbseq, "my_perturbplan_reference.rds")

This RDS file is the file you will upload to the PerturbPlan web app as your custom reference dataset.