
Extract Statistical Parameters from Pilot Data for Power Analysis
reference_data_processing.RdFits statistical models to extract gene-level expression parameters, library saturation parameters, and adjusted mapping efficiency required by PerturbPlan for power analysis. This is Step 2 of the pilot data preprocessing workflow. Outputs are compatible with built-in pilot data examples and the PerturbPlan Shiny application.
Usage
reference_data_processing(
response_matrix = NULL,
read_umi_table,
mapping_efficiency = NULL,
gene_list = NULL,
TPM_thres = 0.1,
h5_only = FALSE,
n_threads = NULL
)Arguments
- response_matrix
Matrix or NULL. Gene-by-cell expression matrix, typically from
reference_data_preprocessing_10x. Required unlessh5_only = TRUE.- read_umi_table
Data frame. UMI-level molecule information from
obtain_qc_read_umi_table, typically obtained viareference_data_preprocessing_10x.- mapping_efficiency
Numeric or NULL. Naive mapping efficiency estimate from
obtain_mapping_efficiency. Will be adjusted based ongene_listif provided.- gene_list
Optional character vector of gene IDs (e.g., Ensembl IDs) to restrict analysis to specific genes. Used for TAP-seq experimental design. If NULL, all genes are used (suitable for perturb-seq).
- TPM_thres
Numeric. Threshold (in TPM) for filtering low-expression genes in gene expression model. Default: 0.1.
- h5_only
Logical. If TRUE, skips baseline expression step to save time. Useful for faster processing when only library parameters are needed. Default: FALSE.
- n_threads
Integer or NULL. Number of parallel processing threads. If NULL, uses single-threaded execution. Default: NULL.
Value
A list containing:
- baseline_expression_stats
Data frame with columns:
response_id: Ensembl gene identifierrelative_expression: Estimated relative expression proportions, normalized to sum to 1 across all genes of interestexpression_size: Estimated dispersion (size) parameter representing gene-specific expression variability
- library_parameters
List with saturation curve parameters from preseqR:
method_used: Selected method ("ZTNB" or "RFA")UMI_per_cell_at_saturation: Estimated maximum UMI count per cell at infinite sequencing depthreads_norm: Normalization constant (reads per cell in pilot data)n_cells: Number of cells in pilot dataAdditional method-specific parameters (e.g., L, size, mu for ZTNB; coefs and poles for RFA)
- mapping_efficiency
Numeric. Adjusted mapping efficiency accounting for fraction of reads mapped to genes of interest.
Details
Statistical Models
Gene Expression Model (Negative Binomial):
For each gene, the function fits a negative binomial (NB) model to characterize the distribution of gene expression levels across cells:
gene_expression ~ NB(mean = library_size * relative_expression, expression_size = expression_size)
where:
gene_expression: Number of observed UMIs for the gene in each cell (data)library_size: Total UMI count per cell (data)relative_expression: Relative expression level of the gene (fitted parameter)expression_size: Dispersion parameter; small values indicate high biological variability (fitted parameter)
Sequencing Saturation Model (preseqR):
The function uses the preseqR package to fit a saturation curve relating mapped reads per cell to observed UMIs per cell. The preseqR approach:
Data-adaptively selects between ZTNB (Zero-Truncated Negative Binomial) and RFA (Rational Function Approximation) methods based on the estimated shape parameter
ZTNB method (shape > 1): Uses closed-form negative binomial predictions
RFA method (shape ≤ 1): Uses rational function approximation for better extrapolation in low-complexity libraries
Edge case: If RFA estimator is invalid, returns constant predictions (rarely occurs in practice)
Key output:
UMI_per_cell_at_saturation- Maximum UMI per cell at infinite sequencing depth, plus method-specific parameters for curve prediction
Processing Steps
The function executes these steps:
If
gene_listprovided: adjusts mapping efficiency and filters genesFits negative binomial model using
obtain_expression_informationto estimate gene-level expression parametersFits saturation model using
library_estimationto estimate library parameters using preseqR's data-adaptive method (ZTNB or RFA)Returns structured output compatible with power analysis functions
See also
reference_data_preprocessing_10x for Step 1 of the preprocessing workflow.
obtain_expression_information for NB model fitting details.
library_estimation for S-M curve fitting details.
See the vignette "Preprocess Reference Expression data for Web App" for the
complete preprocessing workflow: vignette("preprocess-reference", package = "perturbplan")
Examples
# Set seed for reproducibility
set.seed(123)
# First, get aggregated raw data using reference_data_preprocessing_10x
extdata_path <- system.file("extdata", package = "perturbplan")
raw_data <- reference_data_preprocessing_10x(
path_to_top_level_output = extdata_path,
path_to_run_level_output = "cellranger_tiny",
h5_rough = TRUE
)
# Example 1: Process for perturb-seq experimental design (all genes)
pilot_data_perturbseq <- reference_data_processing(
response_matrix = raw_data$response_matrix,
read_umi_table = raw_data$read_umi_table,
mapping_efficiency = raw_data$mapping_efficiency,
gene_list = NULL, # Use all genes
TPM_thres = 0.1, # Default expression threshold for filtering
h5_only = FALSE, # Fit expression model
n_threads = NULL # No parallel processing
)
#> Starting pilot data preprocessing @ 2026-01-26 12:39:34.337678
#> Step 1: Computing gene expression information...
#> Start relative expression calculation @ 2026-01-26 12:39:34.343392
#> Finish relative expression calculation @ 2026-01-26 12:39:34.343607
#> Number of genes passing TPM threshold: 5
#> Start dispersion estimation (12 thread(s)) @ 2026-01-26 12:39:34.343829
#> [theta_batch_cpp] Starting computation
#> - G (genes) = 5
#> - C (cells) = 8
#> [gene 0] rel_expr = 0.0909091
#> [gene 0] t_0 = 0.0747664
#> [gene 0] theta_mle_row result = 401.014
#> [gene 1] rel_expr = 0.363636
#> [gene 1] t_0 = 1.06889
#> [gene 1] theta_mle_row result = 484.176
#> [gene 2] rel_expr = 0.181818
#> [gene 2] t_0 = 0.272921
#> [gene 2] theta_mle_row result = 468.498
#> [gene 3] rel_expr = 0.181818
#> [gene 3] t_0 = 0.272921
#> [gene 3] theta_mle_row result = 468.498
#> [gene 4] rel_expr = 0.181818
#> [gene 4] t_0 = 0.272921
#> [gene 4] theta_mle_row result = 468.498
#> [theta_batch_cpp] Done.
#> Finish dispersion estimation @ 2026-01-26 12:39:34.344555
#> Step 2: Estimating library parameters...
#> Completed pilot data preprocessing @ 2026-01-26 12:39:34.347221
#> Processed 5 genes
#> UMI_per_cell_at_saturation = 5, Model = ZTNB
#> Mapping efficiency = 1
# Inspect structure
str(pilot_data_perturbseq)
#> List of 3
#> $ baseline_expression_stats:'data.frame': 5 obs. of 3 variables:
#> ..$ response_id : chr [1:5] "ENSG00000243485" "ENSG00000238009" "ENSG00000239945" "ENSG00000241860" ...
#> ..$ relative_expression: num [1:5] 0.0909 0.3636 0.1818 0.1818 0.1818
#> ..$ expression_size : num [1:5] 401 484 468 468 468
#> $ library_parameters :List of 7
#> ..$ method_used : chr "ZTNB"
#> ..$ L : num 37.8
#> ..$ size : num 10000
#> ..$ mu : num 0.344
#> ..$ reads_norm : num 1.62
#> ..$ n_cells : num 8
#> ..$ UMI_per_cell_at_saturation: num 4.72
#> $ mapping_efficiency : num 1
# Example 2: Process for TAP-seq experimental design (targeted genes only)
gene_list <- c("ENSG00000241860", "ENSG00000238009", "ENSG00000239945")
pilot_data_tapseq <- reference_data_processing(
response_matrix = raw_data$response_matrix,
read_umi_table = raw_data$read_umi_table,
mapping_efficiency = raw_data$mapping_efficiency,
gene_list = gene_list, # Restrict to specific genes
TPM_thres = 0 # No expression threshold for filtering
)
#> Starting pilot data preprocessing @ 2026-01-26 12:39:34.350193
#> Step 1: Computing gene expression information...
#> Start relative expression calculation @ 2026-01-26 12:39:34.355016
#> Finish relative expression calculation @ 2026-01-26 12:39:34.355171
#> Number of genes passing TPM threshold: 3
#> Start dispersion estimation (12 thread(s)) @ 2026-01-26 12:39:34.355361
#> [theta_batch_cpp] Starting computation
#> - G (genes) = 3
#> - C (cells) = 8
#> [gene 0] rel_expr = 0.5
#> [gene 0] t_0 = 1.33333
#> [gene 0] theta_mle_row result = 486.513
#> [gene 1] rel_expr = 0.25
#> [gene 1] t_0 = 0.5
#> [gene 1] theta_mle_row result = 475.754
#> [gene 2] rel_expr = 0.25
#> [gene 2] t_0 = 0.5
#> [gene 2] theta_mle_row result = 475.754
#> [theta_batch_cpp] Done.
#> Finish dispersion estimation @ 2026-01-26 12:39:34.355869
#> Step 2: Estimating library parameters...
#> Completed pilot data preprocessing @ 2026-01-26 12:39:34.357909
#> Processed 3 genes
#> UMI_per_cell_at_saturation = 6, Model = ZTNB
#> Mapping efficiency = 0.727
# Inspect structure
str(pilot_data_tapseq)
#> List of 3
#> $ baseline_expression_stats:'data.frame': 3 obs. of 3 variables:
#> ..$ response_id : chr [1:3] "ENSG00000238009" "ENSG00000239945" "ENSG00000241860"
#> ..$ relative_expression: num [1:3] 0.5 0.25 0.25
#> ..$ expression_size : num [1:3] 487 476 476
#> $ library_parameters :List of 7
#> ..$ method_used : chr "ZTNB"
#> ..$ L : num 36.2
#> ..$ size : num 10000
#> ..$ mu : num 0.25
#> ..$ reads_norm : num 1.5
#> ..$ n_cells : num 6
#> ..$ UMI_per_cell_at_saturation: num 6.03
#> $ mapping_efficiency : num 0.727