
Extract Statistical Parameters from Pilot Data for Power Analysis
reference_data_processing.Rd
Fits statistical models to extract gene-level expression parameters, library saturation parameters, and adjusted mapping efficiency required by PerturbPlan for power analysis. This is Step 2 of the pilot data preprocessing workflow. Outputs are compatible with built-in pilot data examples and the PerturbPlan Shiny application.
Usage
reference_data_processing(
response_matrix = NULL,
read_umi_table,
mapping_efficiency = NULL,
gene_list = NULL,
TPM_thres = 0.1,
downsample_ratio = 0.7,
D2_rough = 0.3,
h5_only = FALSE,
n_threads = NULL
)
Arguments
- response_matrix
Matrix or NULL. Gene-by-cell expression matrix, typically from
reference_data_preprocessing_10x
. Required unlessh5_only = TRUE
.- read_umi_table
Data frame. UMI-level molecule information from
obtain_qc_read_umi_table
, typically obtained viareference_data_preprocessing_10x
.- mapping_efficiency
Numeric or NULL. Naive mapping efficiency estimate from
obtain_mapping_efficiency
. Will be adjusted based ongene_list
if provided.- gene_list
Optional character vector of gene IDs (e.g., Ensembl IDs) to restrict analysis to specific genes. Used for TAP-seq experimental design. If NULL, all genes are used (suitable for perturb-seq).
- TPM_thres
Numeric. Threshold (in TPM) for filtering low-expression genes in gene expression model. Default: 0.1.
- downsample_ratio
Numeric. Proportion of downsampling in library saturation model fitting. Default: 0.7.
- D2_rough
Numeric. Rough prior value for library variation parameter. Typically higher in TAP-seq experiments (e.g., 0.8) than perturb-seq experiments (0.3). Default: 0.3.
- h5_only
Logical. If TRUE, skips baseline expression step to save time. Useful when tuning hyperparameters for library model fitting. Default: FALSE.
- n_threads
Integer or NULL. Number of parallel processing threads. If NULL, uses single-threaded execution. Default: NULL.
Value
A list containing:
- baseline_expression_stats
Data frame with columns:
response_id
: Ensembl gene identifierrelative_expression
: Estimated relative expression proportions, normalized to sum to 1 across all genes of interestexpression_size
: Estimated dispersion (size) parameter representing gene-specific expression variability
- library_parameters
List with:
UMI_per_cell
: Estimated maximum UMI count per cell at saturationvariation
: Estimated PCR amplification variation parameter (0 to 1)
- mapping_efficiency
Numeric. Adjusted mapping efficiency accounting for fraction of reads mapped to genes of interest.
Details
Statistical Models
Gene Expression Model (Negative Binomial):
For each gene, the function fits a negative binomial (NB) model to characterize the distribution of gene expression levels across cells:
gene_expression ~ NB(mean = library_size * relative_expression, expression_size = expression_size)
where:
gene_expression
: Number of observed UMIs for the gene in each cell (data)library_size
: Total UMI count per cell (data)relative_expression
: Relative expression level of the gene (fitted parameter)expression_size
: Dispersion parameter; small values indicate high biological variability (fitted parameter)
Sequencing Saturation Model (S-M Curve):
The function fits a saturation (S-M) curve that relates mapped reads per cell to observed UMIs per cell:
UMI = total_UMIs * (1 - exp(-reads/total_UMIs) * (1 + variation * reads^2/(2*total_UMIs^2)))
where:
reads
: Number of mapped reads per cell (data)UMI
: Number of observed UMIs per cell (data)total_UMIs
(UMI_per_cell): Maximum UMI per cell parameter at saturation (fitted parameter)variation
: Variation parameter characterizing PCR amplification bias, between 0 and 1 (fitted parameter)
Processing Steps
The function executes these steps:
If
gene_list
provided: adjusts mapping efficiency and filters genesFits negative binomial model using
obtain_expression_information
to estimate gene-level expression parametersFits saturation model using
library_computation
to estimate library parametersReturns structured output compatible with power analysis functions
See also
reference_data_preprocessing_10x
for Step 1 of the preprocessing workflow.
obtain_expression_information
for NB model fitting details.
library_computation
for S-M curve fitting details.
See the vignette "Preprocess Reference Expression data for Web App" for the
complete preprocessing workflow: vignette("preprocess-reference", package = "perturbplan")
Examples
# Set seed for reproducibility
set.seed(123)
# First, get aggregated raw data using reference_data_preprocessing_10x
extdata_path <- system.file("extdata", package = "perturbplan")
raw_data <- reference_data_preprocessing_10x(
path_to_top_level_output = extdata_path,
path_to_run_level_output = "cellranger_tiny",
h5_rough = TRUE
)
# Example 1: Process for perturb-seq experimental design (all genes)
pilot_data_perturbseq <- reference_data_processing(
response_matrix = raw_data$response_matrix,
read_umi_table = raw_data$read_umi_table,
mapping_efficiency = raw_data$mapping_efficiency,
gene_list = NULL, # Use all genes
TPM_thres = 0.1, # Default expression threshold for filtering
downsample_ratio = 0.6, # Downsampling for sequencing
D2_rough = 0.4, # Prior for variation parameter
h5_only = FALSE, # Fit expression model
n_threads = NULL # No parallel processing
)
#> Starting pilot data preprocessing @ 2025-10-13 22:56:19.203227
#> Step 1: Computing gene expression information...
#> Start relative expression calculation @ 2025-10-13 22:56:19.218907
#> Finish relative expression calculation @ 2025-10-13 22:56:19.220162
#> Number of genes passing TPM threshold: 5
#> Start dispersion estimation (8 thread(s)) @ 2025-10-13 22:56:19.221114
#> [theta_batch_cpp] Starting computation
#> - G (genes) = 5
#> - C (cells) = 8
#> [gene 0] rel_expr = 0.0909091
#> [gene 0] t_0 = 0.0747664
#> [gene 0] theta_mle_row result = 401.014
#> [gene 1] rel_expr = 0.363636
#> [gene 1] t_0 = 1.06889
#> [gene 1] theta_mle_row result = 484.176
#> [gene 2] rel_expr = 0.181818
#> [gene 2] t_0 = 0.272921
#> [gene 2] theta_mle_row result = 468.498
#> [gene 3] rel_expr = 0.181818
#> [gene 3] t_0 = 0.272921
#> [gene 3] theta_mle_row result = 468.498
#> [gene 4] rel_expr = 0.181818
#> [gene 4] t_0 = 0.272921
#> [gene 4] theta_mle_row result = 468.498
#> [theta_batch_cpp] Done.
#> Finish dispersion estimation @ 2025-10-13 22:56:19.223851
#> Step 2: Estimating library parameters...
#> Completed pilot data preprocessing @ 2025-10-13 22:56:19.233203
#> Processed 5 genes
#> Library parameters: UMI_per_cell = 6, variation = 0.371
#> Mapping efficiency = 1
# Inspect structure
str(pilot_data_perturbseq)
#> List of 3
#> $ baseline_expression_stats:'data.frame': 5 obs. of 3 variables:
#> ..$ response_id : chr [1:5] "ENSG00000243485" "ENSG00000238009" "ENSG00000239945" "ENSG00000241860" ...
#> ..$ relative_expression: num [1:5] 0.0909 0.3636 0.1818 0.1818 0.1818
#> ..$ expression_size : num [1:5] 401 484 468 468 468
#> $ library_parameters :List of 2
#> ..$ UMI_per_cell: num 6.12
#> ..$ variation : num 0.371
#> $ mapping_efficiency : num 1
# Example 2: Process for TAP-seq experimental design (targeted genes only)
gene_list <- c("ENSG00000241860", "ENSG00000238009", "ENSG00000239945")
pilot_data_tapseq <- reference_data_processing(
response_matrix = raw_data$response_matrix,
read_umi_table = raw_data$read_umi_table,
mapping_efficiency = raw_data$mapping_efficiency,
gene_list = gene_list, # Restrict to specific genes
TPM_thres = 0 # No expression threshold for filtering
)
#> Starting pilot data preprocessing @ 2025-10-13 22:56:19.243017
#> Step 1: Computing gene expression information...
#> Start relative expression calculation @ 2025-10-13 22:56:19.307621
#> Finish relative expression calculation @ 2025-10-13 22:56:19.308618
#> Number of genes passing TPM threshold: 3
#> Start dispersion estimation (8 thread(s)) @ 2025-10-13 22:56:19.309567
#> [theta_batch_cpp] Starting computation
#> - G (genes) = 3
#> - C (cells) = 8
#> [gene 0] rel_expr = 0.5
#> [gene 0] t_0 = 1.33333
#> [gene 0] theta_mle_row result = 486.513
#> [gene 1] rel_expr = 0.25
#> [gene 1] t_0 = 0.5
#> [gene 1] theta_mle_row result = 475.754
#> [gene 2] rel_expr = 0.25
#> [gene 2] t_0 = 0.5
#> [gene 2] theta_mle_row result = 475.754
#> [theta_batch_cpp] Done.
#> Finish dispersion estimation @ 2025-10-13 22:56:19.311625
#> Step 2: Estimating library parameters...
#> Completed pilot data preprocessing @ 2025-10-13 22:56:19.324222
#> Processed 3 genes
#> Library parameters: UMI_per_cell = 7, variation = 0.366
#> Mapping efficiency = 0.727
# Inspect structure
str(pilot_data_tapseq)
#> List of 3
#> $ baseline_expression_stats:'data.frame': 3 obs. of 3 variables:
#> ..$ response_id : chr [1:3] "ENSG00000238009" "ENSG00000239945" "ENSG00000241860"
#> ..$ relative_expression: num [1:3] 0.5 0.25 0.25
#> ..$ expression_size : num [1:3] 487 476 476
#> $ library_parameters :List of 2
#> ..$ UMI_per_cell: num 6.82
#> ..$ variation : num 0.366
#> $ mapping_efficiency : num 0.727