Aggregate Expression and QC Data from Multiple SRR-Level Cell Ranger Outputs

Aggregates Cell Ranger outputs from multiple sequencing runs (SRRs) into a single data structure containing gene expression matrices, UMI-level molecule information, and naive mapping efficiency estimates. This is Step 1 of the pilot data preprocessing workflow for PerturbPlan power analysis.

Usage

reference_data_preprocessing_10x(
  path_to_top_level_output,
  path_to_run_level_output = NULL,
  h5_rough = TRUE,
  skip_mapping_efficiency = FALSE
)

Arguments

path_to_top_level_output: Character. Path to the top-level directory containing Cell Ranger run-level subdirectories. Each subdirectory should contain Cell Ranger output in the standard structure with outs/ folders.
path_to_run_level_output: Optional character vector. Names of specific run-level directories to process (not full paths). Should match the basename of folders inside path_to_top_level_output. If NULL, all subdirectories are processed. Unmatched entries will trigger a warning.
h5_rough: Logical. If TRUE (default), extracts UMI-level molecule information from first SRR only for speed. If FALSE, combines UMI-level molecule information from all SRRs (slower but more comprehensive).
skip_mapping_efficiency: Logical. If TRUE, skips estimation of mapping efficiency. If FALSE (default), calculates naive mapping efficiency. Set to TRUE when metrics_summary.csv is unavailable or lacks "Number of Reads" column.

Value

A list with three elements:

response_matrix

Sparse gene-by-cell expression matrix (genes as rows, cells as columns) with row and column names. Contains only genes common across all processed SRR runs, combined across all runs.

read_umi_table

Data frame with UMI-level molecule information, including columns:

num_reads: Number of reads supporting this UMI-cell combination
UMI_id: UMI index
cell_id: Cell barcode
response_id: Gene identifier (e.g., Ensembl ID)
srr_idx: SRR run identifier

mapping_efficiency

Numeric value between 0 and 1 representing the proportion of reads that are confidently mapped to the transcriptome and assigned to a true cell barcode (i.e., a cell barcode that appears in the filtered expression matrix) among all raw reads recorded in metrics_summary.csv if skip_mapping_efficiency = FALSE, or NULL if skip_mapping_efficiency = TRUE.

Details

Input Requirements

Your data should be organized with Cell Ranger output directories under a top-level folder:


path_to_top_level_output/
├── SRR_run_1/
│   ├── outs/
│   │   ├── filtered_feature_bc_matrix/
│   │   │   ├── barcodes.tsv.gz
│   │   │   ├── features.tsv.gz
│   │   │   └── matrix.mtx.gz
│   │   ├── molecule_info.h5
│   │   ├── filtered_feature_bc_matrix.h5
│   │   └── metrics_summary.csv
├── SRR_run_2/
│   └── ...
└── SRR_run_3/
    └── ...

## Processing Steps

The function performs the following operations:

Lists all SRR directories under the given top-level folder
Optionally filters to a subset of run-level names
Reads gene expression matrices using obtain_qc_response_data
Identifies and retains only genes shared across all SRRs
Reads UMI-level molecule information using obtain_qc_read_umi_table
Calculates naive mapping efficiency using obtain_mapping_efficiency

Important Notes

SRR directories should be generated by a recent version of Cell Ranger count (not Cell Ranger multi) configured for the perturbation (CRISPR or Perturb-seq) workflow using standard output format
Cell Ranger multi is not currently supported because: (1) its output directory structure places files under outs/per_sample_outs/{sample_id}/count/ instead of directly under outs/, and (2) its metrics_summary.csv uses a row-based format with "Library Type" and "Metric Name" columns, while Cell Ranger count uses a column-based format where metric names are column headers
In some cases, filtered_feature_bc_matrix/ may need to be produced by unzipping filtered_feature_bc_matrix.tar.gz
For mapping efficiency calculation, metrics_summary.csv must include a column named "Number of Reads" (Cell Ranger count format)
When h5_rough = FALSE, multiple mapping efficiencies are computed and the median value is returned

Examples

# Point to directory containing example Cell Ranger outputs
extdata_path <- system.file("extdata", package = "perturbplan")

# Aggregate data from all SRR runs
raw_data <- reference_data_preprocessing_10x(
  path_to_top_level_output = extdata_path,
  path_to_run_level_output = "cellranger_tiny",  # Only read subfolder cellranger_tiny
  h5_rough = TRUE,  # Use first SRR for QC data (faster)
  skip_mapping_efficiency = FALSE  # Estimate mapping efficiency
)

# Inspect structure
str(raw_data)
#> List of 3
#>  $ response_matrix   :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
#>   .. ..@ i       : int [1:11] 0 1 2 3 4 1 2 1 4 1 ...
#>   .. ..@ p       : int [1:9] 0 1 2 3 4 5 7 9 11
#>   .. ..@ Dim     : int [1:2] 5 8
#>   .. ..@ Dimnames:List of 2
#>   .. .. ..$ : chr [1:5] "ENSG00000243485" "ENSG00000238009" "ENSG00000239945" "ENSG00000241860" ...
#>   .. .. ..$ : chr [1:8] "AGCAGCCGTCCAAGTT-1" "AAACGGGTCAGCTCGG-1" "AAAGTAGCATCCCACT-1" "AAACCTGGTATATGAG-1" ...
#>   .. ..@ x       : num [1:11] 1 1 1 1 1 1 1 1 1 1 ...
#>   .. ..@ factors : list()
#>  $ read_umi_table    :'data.frame':	11 obs. of  5 variables:
#>   ..$ num_reads  : int [1:11] 2 1 1 2 1 1 1 1 1 1 ...
#>   ..$ UMI_id     : num [1:11] 139105 723247 998389 622094 584568 ...
#>   ..$ cell_id    : chr [1:11] "AAACCTGGTATATGAG-1" "AAACGGGTCAGCTCGG-1" "AAAGTAGCATCCCACT-1" "AAAGTAGTCCAAATGC-1" ...
#>   ..$ response_id: chr [1:11] "ENSG00000241860" "ENSG00000238009" "ENSG00000239945" "ENSG00000286448" ...
#>   ..$ srr_idx    : chr [1:11] "cellranger_tiny" "cellranger_tiny" "cellranger_tiny" "cellranger_tiny" ...
#>  $ mapping_efficiency: num 1

# Access components
head(raw_data$read_umi_table)
#>   num_reads UMI_id            cell_id     response_id         srr_idx
#> 1         2 139105 AAACCTGGTATATGAG-1 ENSG00000241860 cellranger_tiny
#> 2         1 723247 AAACGGGTCAGCTCGG-1 ENSG00000238009 cellranger_tiny
#> 3         1 998389 AAAGTAGCATCCCACT-1 ENSG00000239945 cellranger_tiny
#> 4         2 622094 AAAGTAGTCCAAATGC-1 ENSG00000286448 cellranger_tiny
#> 5         1 584568 AGCAGCCGTCCAAGTT-1 ENSG00000243485 cellranger_tiny
#> 6         1 956290 AGCGGTCCATTCCTGC-1 ENSG00000238009 cellranger_tiny
dim(raw_data$response_matrix)
#> [1] 5 8