Fit Negative Binomial Model to Estimate Gene Expression Parameters

Fits a negative binomial model to gene expression data to estimate relative expression levels and dispersion parameters for each gene. This function is used internally by reference_data_processing.

Usage

obtain_expression_information(
  response_matrix,
  TPM_thres = 0.1,
  rough = FALSE,
  n_threads = NULL
)

Arguments

response_matrix

Sparse matrix (genes as rows, cells as columns). Typically a CsparseMatrix from obtain_qc_response_data.

TPM_thres

Numeric. Expression threshold in TPM (Transcripts Per Million) for filtering low-expression genes. Genes with TPM below this threshold are excluded. Default: 0.1.

rough

Logical. If TRUE, uses fast C++ estimator for dispersion. If FALSE, uses refined maximum likelihood estimation. Default: FALSE.

n_threads

Integer or NULL controlling parallelism:

NULL – auto-detect (prefer NSLOTS environment variable, else use parallel::detectCores())
NA – force use of NSLOTS only
positive integer – user-specified thread count

Value

Data frame with three columns:

response_id: Gene identifier (e.g., Ensembl ID) for genes passing TPM threshold
relative_expression: Estimated relative expression proportion (sums to 1 across all genes)
expression_size: Estimated dispersion parameter $\theta$ from negative binomial model. Small values indicate high biological variability.

Details

Negative Binomial Model

For each gene, the model is:

$$\text{gene_expression} \sim \text{NB}(\text{mean} = \text{library_size} \times \text{relative_expression}, \text{size} = \text{expression_size})$$

where library_size is the total UMI count per cell and relative_expression and expression_size are the fitted parameters.

Processing Steps

Calculates library sizes (total UMIs per cell)
Computes relative expression (gene counts / total counts)
Converts to TPM scale and filters genes below threshold
Estimates dispersion parameters using C++ implementation
Returns data frame with fitted parameters

Examples

# Get response matrix from Cell Ranger output
cellranger_path <- system.file("extdata/cellranger_tiny", package = "perturbplan")
response_matrix <- obtain_qc_response_data(cellranger_path)

# Extract expression information
expr_info <- obtain_expression_information(
  response_matrix = response_matrix,
  TPM_thres = 0.1,
  rough = TRUE,
  n_threads = 1
)
#> Start relative expression calculation @ 2026-01-26 12:39:33.423449
#> Finish relative expression calculation @ 2026-01-26 12:39:33.424016
#> Number of genes passing TPM threshold: 5
#> Start dispersion estimation (1 thread(s)) @ 2026-01-26 12:39:33.424226
#> [theta_batch_cpp] Starting computation
#>  - G (genes) = 5
#>  - C (cells) = 8
#> [gene 0] rel_expr = 0.0909091
#> [gene 0] t_0 = 0.0747664
#> [gene 0] theta_refined_row result = 0.220983
#> [gene 1] rel_expr = 0.363636
#> [gene 1] t_0 = 1.06889
#> [gene 1] theta_refined_row result = 251.069
#> [gene 2] rel_expr = 0.181818
#> [gene 2] t_0 = 0.272921
#> [gene 2] theta_refined_row result = 242.683
#> [gene 3] rel_expr = 0.181818
#> [gene 3] t_0 = 0.272921
#> [gene 3] theta_refined_row result = 242.683
#> [gene 4] rel_expr = 0.181818
#> [gene 4] t_0 = 0.272921
#> [gene 4] theta_refined_row result = 242.683
#> [theta_batch_cpp] Done.
#> Finish dispersion estimation @ 2026-01-26 12:39:33.425204

# Examine results
head(expr_info)
#>                     response_id relative_expression expression_size
#> ENSG00000243485 ENSG00000243485          0.09090909       0.2209832
#> ENSG00000238009 ENSG00000238009          0.36363636     251.0688935
#> ENSG00000239945 ENSG00000239945          0.18181818     242.6834302
#> ENSG00000241860 ENSG00000241860          0.18181818     242.6834302
#> ENSG00000286448 ENSG00000286448          0.18181818     242.6834302
dim(expr_info)
#> [1] 5 3
summary(expr_info$expression_size)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   0.221 242.683 242.683 195.868 242.683 251.069