
Fit Negative Binomial Model to Estimate Gene Expression Parameters
obtain_expression_information.RdFits a negative binomial model to gene expression data to estimate relative expression
levels and dispersion parameters for each gene. This function is used internally by
reference_data_processing.
Usage
obtain_expression_information(
response_matrix,
TPM_thres = 0.1,
rough = FALSE,
n_threads = NULL
)Arguments
- response_matrix
Sparse matrix (genes as rows, cells as columns). Typically a
CsparseMatrixfromobtain_qc_response_data.- TPM_thres
Numeric. Expression threshold in TPM (Transcripts Per Million) for filtering low-expression genes. Genes with TPM below this threshold are excluded. Default: 0.1.
- rough
Logical. If TRUE, uses fast C++ estimator for dispersion. If FALSE, uses refined maximum likelihood estimation. Default: FALSE.
- n_threads
Integer or NULL controlling parallelism:
NULL– auto-detect (preferNSLOTSenvironment variable, else useparallel::detectCores())NA– force use ofNSLOTSonlypositive integer – user-specified thread count
Value
Data frame with three columns:
- response_id
Gene identifier (e.g., Ensembl ID) for genes passing TPM threshold
- relative_expression
Estimated relative expression proportion (sums to 1 across all genes)
- expression_size
Estimated dispersion parameter \(\theta\) from negative binomial model. Small values indicate high biological variability.
Details
Negative Binomial Model
For each gene, the model is:
$$\text{gene_expression} \sim \text{NB}(\text{mean} = \text{library_size} \times \text{relative_expression}, \text{size} = \text{expression_size})$$
where library_size is the total UMI count per cell and relative_expression
and expression_size are the fitted parameters.
See also
reference_data_processing for the complete pilot data
preprocessing workflow
Examples
# Get response matrix from Cell Ranger output
cellranger_path <- system.file("extdata/cellranger_tiny", package = "perturbplan")
response_matrix <- obtain_qc_response_data(cellranger_path)
# Extract expression information
expr_info <- obtain_expression_information(
response_matrix = response_matrix,
TPM_thres = 0.1,
rough = TRUE,
n_threads = 1
)
#> Start relative expression calculation @ 2025-10-13 22:56:15.739316
#> Finish relative expression calculation @ 2025-10-13 22:56:15.742363
#> Number of genes passing TPM threshold: 5
#> Start dispersion estimation (1 thread(s)) @ 2025-10-13 22:56:15.743378
#> [theta_batch_cpp] Starting computation
#> - G (genes) = 5
#> - C (cells) = 8
#> [gene 0] rel_expr = 0.0909091
#> [gene 0] t_0 = 0.0747664
#> [gene 0] theta_refined_row result = 0.220983
#> [gene 1] rel_expr = 0.363636
#> [gene 1] t_0 = 1.06889
#> [gene 1] theta_refined_row result = 251.069
#> [gene 2] rel_expr = 0.181818
#> [gene 2] t_0 = 0.272921
#> [gene 2] theta_refined_row result = 242.683
#> [gene 3] rel_expr = 0.181818
#> [gene 3] t_0 = 0.272921
#> [gene 3] theta_refined_row result = 242.683
#> [gene 4] rel_expr = 0.181818
#> [gene 4] t_0 = 0.272921
#> [gene 4] theta_refined_row result = 242.683
#> [theta_batch_cpp] Done.
#> Finish dispersion estimation @ 2025-10-13 22:56:15.747636
# Examine results
head(expr_info)
#> response_id relative_expression expression_size
#> ENSG00000243485 ENSG00000243485 0.09090909 0.2209832
#> ENSG00000238009 ENSG00000238009 0.36363636 251.0688935
#> ENSG00000239945 ENSG00000239945 0.18181818 242.6834302
#> ENSG00000241860 ENSG00000241860 0.18181818 242.6834302
#> ENSG00000286448 ENSG00000286448 0.18181818 242.6834302
dim(expr_info)
#> [1] 5 3
summary(expr_info$expression_size)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.221 242.683 242.683 195.868 242.683 251.069