Skip to contents

Fits statistical models to extract gene-level expression parameters, library saturation parameters, and adjusted mapping efficiency required by PerturbPlan for power analysis. This is Step 2 of the pilot data preprocessing workflow. Outputs are compatible with built-in pilot data examples and the PerturbPlan Shiny application.

Usage

reference_data_processing(
  response_matrix = NULL,
  read_umi_table,
  mapping_efficiency = NULL,
  gene_list = NULL,
  TPM_thres = 0.1,
  downsample_ratio = 0.7,
  D2_rough = 0.3,
  h5_only = FALSE,
  n_threads = NULL
)

Arguments

response_matrix

Matrix or NULL. Gene-by-cell expression matrix, typically from reference_data_preprocessing_10x. Required unless h5_only = TRUE.

read_umi_table

Data frame. UMI-level molecule information from obtain_qc_read_umi_table, typically obtained via reference_data_preprocessing_10x.

mapping_efficiency

Numeric or NULL. Naive mapping efficiency estimate from obtain_mapping_efficiency. Will be adjusted based on gene_list if provided.

gene_list

Optional character vector of gene IDs (e.g., Ensembl IDs) to restrict analysis to specific genes. Used for TAP-seq experimental design. If NULL, all genes are used (suitable for perturb-seq).

TPM_thres

Numeric. Threshold (in TPM) for filtering low-expression genes in gene expression model. Default: 0.1.

downsample_ratio

Numeric. Proportion of downsampling in library saturation model fitting. Default: 0.7.

D2_rough

Numeric. Rough prior value for library variation parameter. Typically higher in TAP-seq experiments (e.g., 0.8) than perturb-seq experiments (0.3). Default: 0.3.

h5_only

Logical. If TRUE, skips baseline expression step to save time. Useful when tuning hyperparameters for library model fitting. Default: FALSE.

n_threads

Integer or NULL. Number of parallel processing threads. If NULL, uses single-threaded execution. Default: NULL.

Value

A list containing:

baseline_expression_stats

Data frame with columns:

  • response_id: Ensembl gene identifier

  • relative_expression: Estimated relative expression proportions, normalized to sum to 1 across all genes of interest

  • expression_size: Estimated dispersion (size) parameter representing gene-specific expression variability

library_parameters

List with:

  • UMI_per_cell: Estimated maximum UMI count per cell at saturation

  • variation: Estimated PCR amplification variation parameter (0 to 1)

mapping_efficiency

Numeric. Adjusted mapping efficiency accounting for fraction of reads mapped to genes of interest.

Details

Statistical Models

Gene Expression Model (Negative Binomial):

For each gene, the function fits a negative binomial (NB) model to characterize the distribution of gene expression levels across cells:

gene_expression ~ NB(mean = library_size * relative_expression, expression_size = expression_size)

where:

  • gene_expression: Number of observed UMIs for the gene in each cell (data)

  • library_size: Total UMI count per cell (data)

  • relative_expression: Relative expression level of the gene (fitted parameter)

  • expression_size: Dispersion parameter; small values indicate high biological variability (fitted parameter)

Sequencing Saturation Model (S-M Curve):

The function fits a saturation (S-M) curve that relates mapped reads per cell to observed UMIs per cell:

UMI = total_UMIs * (1 - exp(-reads/total_UMIs) * (1 + variation * reads^2/(2*total_UMIs^2)))

where:

  • reads: Number of mapped reads per cell (data)

  • UMI: Number of observed UMIs per cell (data)

  • total_UMIs (UMI_per_cell): Maximum UMI per cell parameter at saturation (fitted parameter)

  • variation: Variation parameter characterizing PCR amplification bias, between 0 and 1 (fitted parameter)

Processing Steps

The function executes these steps:

  1. If gene_list provided: adjusts mapping efficiency and filters genes

  2. Fits negative binomial model using obtain_expression_information to estimate gene-level expression parameters

  3. Fits saturation model using library_computation to estimate library parameters

  4. Returns structured output compatible with power analysis functions

Use Cases

  • Perturb-seq: Use all genes (gene_list = NULL), default D2_rough = 0.3

  • TAP-seq: Provide targeted gene list, higher D2_rough (e.g., 0.8), set TPM_thres = 0

See also

reference_data_preprocessing_10x for Step 1 of the preprocessing workflow.

obtain_expression_information for NB model fitting details.

library_computation for S-M curve fitting details.

See the vignette "Preprocess Reference Expression data for Web App" for the complete preprocessing workflow: vignette("preprocess-reference", package = "perturbplan")

Examples

# Set seed for reproducibility
set.seed(123)

# First, get aggregated raw data using reference_data_preprocessing_10x
extdata_path <- system.file("extdata", package = "perturbplan")
raw_data <- reference_data_preprocessing_10x(
  path_to_top_level_output = extdata_path,
  path_to_run_level_output = "cellranger_tiny",
  h5_rough = TRUE
)

# Example 1: Process for perturb-seq experimental design (all genes)
pilot_data_perturbseq <- reference_data_processing(
  response_matrix = raw_data$response_matrix,
  read_umi_table = raw_data$read_umi_table,
  mapping_efficiency = raw_data$mapping_efficiency,
  gene_list = NULL,     # Use all genes
  TPM_thres = 0.1,      # Default expression threshold for filtering
  downsample_ratio = 0.6,  # Downsampling for sequencing
  D2_rough = 0.4,       # Prior for variation parameter
  h5_only = FALSE,      # Fit expression model
  n_threads = NULL      # No parallel processing
)
#> Starting pilot data preprocessing @ 2025-10-13 22:56:19.203227
#> Step 1: Computing gene expression information...
#> Start relative expression calculation @ 2025-10-13 22:56:19.218907
#> Finish relative expression calculation @ 2025-10-13 22:56:19.220162
#> Number of genes passing TPM threshold: 5
#> Start dispersion estimation (8 thread(s)) @ 2025-10-13 22:56:19.221114
#> [theta_batch_cpp] Starting computation
#>  - G (genes) = 5
#>  - C (cells) = 8
#> [gene 0] rel_expr = 0.0909091
#> [gene 0] t_0 = 0.0747664
#> [gene 0] theta_mle_row result = 401.014
#> [gene 1] rel_expr = 0.363636
#> [gene 1] t_0 = 1.06889
#> [gene 1] theta_mle_row result = 484.176
#> [gene 2] rel_expr = 0.181818
#> [gene 2] t_0 = 0.272921
#> [gene 2] theta_mle_row result = 468.498
#> [gene 3] rel_expr = 0.181818
#> [gene 3] t_0 = 0.272921
#> [gene 3] theta_mle_row result = 468.498
#> [gene 4] rel_expr = 0.181818
#> [gene 4] t_0 = 0.272921
#> [gene 4] theta_mle_row result = 468.498
#> [theta_batch_cpp] Done.
#> Finish dispersion estimation @ 2025-10-13 22:56:19.223851
#> Step 2: Estimating library parameters...
#> Completed pilot data preprocessing @ 2025-10-13 22:56:19.233203
#> Processed 5 genes
#> Library parameters: UMI_per_cell = 6, variation = 0.371
#> Mapping efficiency = 1

# Inspect structure
str(pilot_data_perturbseq)
#> List of 3
#>  $ baseline_expression_stats:'data.frame':	5 obs. of  3 variables:
#>   ..$ response_id        : chr [1:5] "ENSG00000243485" "ENSG00000238009" "ENSG00000239945" "ENSG00000241860" ...
#>   ..$ relative_expression: num [1:5] 0.0909 0.3636 0.1818 0.1818 0.1818
#>   ..$ expression_size    : num [1:5] 401 484 468 468 468
#>  $ library_parameters       :List of 2
#>   ..$ UMI_per_cell: num 6.12
#>   ..$ variation   : num 0.371
#>  $ mapping_efficiency       : num 1

# Example 2: Process for TAP-seq experimental design (targeted genes only)
gene_list <- c("ENSG00000241860", "ENSG00000238009", "ENSG00000239945")
pilot_data_tapseq <- reference_data_processing(
  response_matrix = raw_data$response_matrix,
  read_umi_table = raw_data$read_umi_table,
  mapping_efficiency = raw_data$mapping_efficiency,
  gene_list = gene_list, # Restrict to specific genes
  TPM_thres = 0          # No expression threshold for filtering
)
#> Starting pilot data preprocessing @ 2025-10-13 22:56:19.243017
#> Step 1: Computing gene expression information...
#> Start relative expression calculation @ 2025-10-13 22:56:19.307621
#> Finish relative expression calculation @ 2025-10-13 22:56:19.308618
#> Number of genes passing TPM threshold: 3
#> Start dispersion estimation (8 thread(s)) @ 2025-10-13 22:56:19.309567
#> [theta_batch_cpp] Starting computation
#>  - G (genes) = 3
#>  - C (cells) = 8
#> [gene 0] rel_expr = 0.5
#> [gene 0] t_0 = 1.33333
#> [gene 0] theta_mle_row result = 486.513
#> [gene 1] rel_expr = 0.25
#> [gene 1] t_0 = 0.5
#> [gene 1] theta_mle_row result = 475.754
#> [gene 2] rel_expr = 0.25
#> [gene 2] t_0 = 0.5
#> [gene 2] theta_mle_row result = 475.754
#> [theta_batch_cpp] Done.
#> Finish dispersion estimation @ 2025-10-13 22:56:19.311625
#> Step 2: Estimating library parameters...
#> Completed pilot data preprocessing @ 2025-10-13 22:56:19.324222
#> Processed 3 genes
#> Library parameters: UMI_per_cell = 7, variation = 0.366
#> Mapping efficiency = 0.727

# Inspect structure
str(pilot_data_tapseq)
#> List of 3
#>  $ baseline_expression_stats:'data.frame':	3 obs. of  3 variables:
#>   ..$ response_id        : chr [1:3] "ENSG00000238009" "ENSG00000239945" "ENSG00000241860"
#>   ..$ relative_expression: num [1:3] 0.5 0.25 0.25
#>   ..$ expression_size    : num [1:3] 487 476 476
#>  $ library_parameters       :List of 2
#>   ..$ UMI_per_cell: num 6.82
#>   ..$ variation   : num 0.366
#>  $ mapping_efficiency       : num 0.727