Extract Statistical Parameters from Pilot Data for Power Analysis

Fits statistical models to extract gene-level expression parameters, library saturation parameters, and adjusted mapping efficiency required by PerturbPlan for power analysis. This is Step 2 of the pilot data preprocessing workflow. Outputs are compatible with built-in pilot data examples and the PerturbPlan Shiny application.

Usage

reference_data_processing(
  response_matrix = NULL,
  read_umi_table,
  mapping_efficiency = NULL,
  gene_list = NULL,
  TPM_thres = 0.1,
  h5_only = FALSE,
  n_threads = NULL
)

Arguments

response_matrix: Matrix or NULL. Gene-by-cell expression matrix, typically from reference_data_preprocessing_10x. Required unless h5_only = TRUE.
read_umi_table: Data frame. UMI-level molecule information from obtain_qc_read_umi_table, typically obtained via reference_data_preprocessing_10x.
mapping_efficiency: Numeric or NULL. Naive mapping efficiency estimate from obtain_mapping_efficiency. Will be adjusted based on gene_list if provided.
gene_list: Optional character vector of gene IDs (e.g., Ensembl IDs) to restrict analysis to specific genes. Used for TAP-seq experimental design. If NULL, all genes are used (suitable for perturb-seq).
TPM_thres: Numeric. Threshold (in TPM) for filtering low-expression genes in gene expression model. Default: 0.1.
h5_only: Logical. If TRUE, skips baseline expression step to save time. Useful for faster processing when only library parameters are needed. Default: FALSE.
n_threads: Integer or NULL. Number of parallel processing threads. If NULL, uses single-threaded execution. Default: NULL.

Value

A list containing:

baseline_expression_stats

Data frame with columns:

response_id: Ensembl gene identifier
relative_expression: Estimated relative expression proportions, normalized to sum to 1 across all genes of interest
expression_size: Estimated dispersion (size) parameter representing gene-specific expression variability

library_parameters

List with saturation curve parameters from preseqR:

method_used: Selected method ("ZTNB" or "RFA")
UMI_per_cell_at_saturation: Estimated maximum UMI count per cell at infinite sequencing depth
reads_norm: Normalization constant (reads per cell in pilot data)
n_cells: Number of cells in pilot data
Additional method-specific parameters (e.g., L, size, mu for ZTNB; coefs and poles for RFA)

mapping_efficiency

Numeric. Adjusted mapping efficiency accounting for fraction of reads mapped to genes of interest.

Details

Statistical Models

Gene Expression Model (Negative Binomial):

For each gene, the function fits a negative binomial (NB) model to characterize the distribution of gene expression levels across cells:

gene_expression ~ NB(mean = library_size * relative_expression, expression_size = expression_size)

where:

gene_expression: Number of observed UMIs for the gene in each cell (data)
library_size: Total UMI count per cell (data)
relative_expression: Relative expression level of the gene (fitted parameter)
expression_size: Dispersion parameter; small values indicate high biological variability (fitted parameter)

Sequencing Saturation Model (preseqR):

The function uses the preseqR package to fit a saturation curve relating mapped reads per cell to observed UMIs per cell. The preseqR approach:

Data-adaptively selects between ZTNB (Zero-Truncated Negative Binomial) and RFA (Rational Function Approximation) methods based on the estimated shape parameter
ZTNB method (shape > 1): Uses closed-form negative binomial predictions
RFA method (shape ≤ 1): Uses rational function approximation for better extrapolation in low-complexity libraries
Edge case: If RFA estimator is invalid, returns constant predictions (rarely occurs in practice)
Key output: UMI_per_cell_at_saturation - Maximum UMI per cell at infinite sequencing depth, plus method-specific parameters for curve prediction

Processing Steps

The function executes these steps:

If gene_list provided: adjusts mapping efficiency and filters genes
Fits negative binomial model using obtain_expression_information to estimate gene-level expression parameters
Fits saturation model using library_estimation to estimate library parameters using preseqR's data-adaptive method (ZTNB or RFA)
Returns structured output compatible with power analysis functions

Use Cases

Perturb-seq: Use all genes (gene_list = NULL)
TAP-seq: Provide targeted gene list, set TPM_thres = 0

Examples

# Set seed for reproducibility
set.seed(123)

# First, get aggregated raw data using reference_data_preprocessing_10x
extdata_path <- system.file("extdata", package = "perturbplan")
raw_data <- reference_data_preprocessing_10x(
  path_to_top_level_output = extdata_path,
  path_to_run_level_output = "cellranger_tiny",
  h5_rough = TRUE
)

# Example 1: Process for perturb-seq experimental design (all genes)
pilot_data_perturbseq <- reference_data_processing(
  response_matrix = raw_data$response_matrix,
  read_umi_table = raw_data$read_umi_table,
  mapping_efficiency = raw_data$mapping_efficiency,
  gene_list = NULL,     # Use all genes
  TPM_thres = 0.1,      # Default expression threshold for filtering
  h5_only = FALSE,      # Fit expression model
  n_threads = NULL      # No parallel processing
)
#> Starting pilot data preprocessing @ 2026-01-26 12:39:34.337678
#> Step 1: Computing gene expression information...
#> Start relative expression calculation @ 2026-01-26 12:39:34.343392
#> Finish relative expression calculation @ 2026-01-26 12:39:34.343607
#> Number of genes passing TPM threshold: 5
#> Start dispersion estimation (12 thread(s)) @ 2026-01-26 12:39:34.343829
#> [theta_batch_cpp] Starting computation
#>  - G (genes) = 5
#>  - C (cells) = 8
#> [gene 0] rel_expr = 0.0909091
#> [gene 0] t_0 = 0.0747664
#> [gene 0] theta_mle_row result = 401.014
#> [gene 1] rel_expr = 0.363636
#> [gene 1] t_0 = 1.06889
#> [gene 1] theta_mle_row result = 484.176
#> [gene 2] rel_expr = 0.181818
#> [gene 2] t_0 = 0.272921
#> [gene 2] theta_mle_row result = 468.498
#> [gene 3] rel_expr = 0.181818
#> [gene 3] t_0 = 0.272921
#> [gene 3] theta_mle_row result = 468.498
#> [gene 4] rel_expr = 0.181818
#> [gene 4] t_0 = 0.272921
#> [gene 4] theta_mle_row result = 468.498
#> [theta_batch_cpp] Done.
#> Finish dispersion estimation @ 2026-01-26 12:39:34.344555
#> Step 2: Estimating library parameters...
#> Completed pilot data preprocessing @ 2026-01-26 12:39:34.347221
#> Processed 5 genes
#> UMI_per_cell_at_saturation = 5, Model = ZTNB
#> Mapping efficiency = 1

# Inspect structure
str(pilot_data_perturbseq)
#> List of 3
#>  $ baseline_expression_stats:'data.frame':	5 obs. of  3 variables:
#>   ..$ response_id        : chr [1:5] "ENSG00000243485" "ENSG00000238009" "ENSG00000239945" "ENSG00000241860" ...
#>   ..$ relative_expression: num [1:5] 0.0909 0.3636 0.1818 0.1818 0.1818
#>   ..$ expression_size    : num [1:5] 401 484 468 468 468
#>  $ library_parameters       :List of 7
#>   ..$ method_used               : chr "ZTNB"
#>   ..$ L                         : num 37.8
#>   ..$ size                      : num 10000
#>   ..$ mu                        : num 0.344
#>   ..$ reads_norm                : num 1.62
#>   ..$ n_cells                   : num 8
#>   ..$ UMI_per_cell_at_saturation: num 4.72
#>  $ mapping_efficiency       : num 1

# Example 2: Process for TAP-seq experimental design (targeted genes only)
gene_list <- c("ENSG00000241860", "ENSG00000238009", "ENSG00000239945")
pilot_data_tapseq <- reference_data_processing(
  response_matrix = raw_data$response_matrix,
  read_umi_table = raw_data$read_umi_table,
  mapping_efficiency = raw_data$mapping_efficiency,
  gene_list = gene_list, # Restrict to specific genes
  TPM_thres = 0          # No expression threshold for filtering
)
#> Starting pilot data preprocessing @ 2026-01-26 12:39:34.350193
#> Step 1: Computing gene expression information...
#> Start relative expression calculation @ 2026-01-26 12:39:34.355016
#> Finish relative expression calculation @ 2026-01-26 12:39:34.355171
#> Number of genes passing TPM threshold: 3
#> Start dispersion estimation (12 thread(s)) @ 2026-01-26 12:39:34.355361
#> [theta_batch_cpp] Starting computation
#>  - G (genes) = 3
#>  - C (cells) = 8
#> [gene 0] rel_expr = 0.5
#> [gene 0] t_0 = 1.33333
#> [gene 0] theta_mle_row result = 486.513
#> [gene 1] rel_expr = 0.25
#> [gene 1] t_0 = 0.5
#> [gene 1] theta_mle_row result = 475.754
#> [gene 2] rel_expr = 0.25
#> [gene 2] t_0 = 0.5
#> [gene 2] theta_mle_row result = 475.754
#> [theta_batch_cpp] Done.
#> Finish dispersion estimation @ 2026-01-26 12:39:34.355869
#> Step 2: Estimating library parameters...
#> Completed pilot data preprocessing @ 2026-01-26 12:39:34.357909
#> Processed 3 genes
#> UMI_per_cell_at_saturation = 6, Model = ZTNB
#> Mapping efficiency = 0.727

# Inspect structure
str(pilot_data_tapseq)
#> List of 3
#>  $ baseline_expression_stats:'data.frame':	3 obs. of  3 variables:
#>   ..$ response_id        : chr [1:3] "ENSG00000238009" "ENSG00000239945" "ENSG00000241860"
#>   ..$ relative_expression: num [1:3] 0.5 0.25 0.25
#>   ..$ expression_size    : num [1:3] 487 476 476
#>  $ library_parameters       :List of 7
#>   ..$ method_used               : chr "ZTNB"
#>   ..$ L                         : num 36.2
#>   ..$ size                      : num 10000
#>   ..$ mu                        : num 0.25
#>   ..$ reads_norm                : num 1.5
#>   ..$ n_cells                   : num 6
#>   ..$ UMI_per_cell_at_saturation: num 6.03
#>  $ mapping_efficiency       : num 0.727