Skip to contents

Aggregates Cell Ranger outputs from multiple sequencing runs (SRRs) into a single data structure containing gene expression matrices, UMI-level molecule information, and naive mapping efficiency estimates. This is Step 1 of the pilot data preprocessing workflow for PerturbPlan power analysis.

Usage

reference_data_preprocessing_10x(
  path_to_top_level_output,
  path_to_run_level_output = NULL,
  h5_rough = TRUE,
  skip_mapping_efficiency = FALSE
)

Arguments

path_to_top_level_output

Character. Path to the top-level directory containing Cell Ranger run-level subdirectories. Each subdirectory should contain Cell Ranger output in the standard structure with outs/ folders.

path_to_run_level_output

Optional character vector. Names of specific run-level directories to process (not full paths). Should match the basename of folders inside path_to_top_level_output. If NULL, all subdirectories are processed. Unmatched entries will trigger a warning.

h5_rough

Logical. If TRUE (default), extracts UMI-level molecule information from first SRR only for speed. If FALSE, combines UMI-level molecule information from all SRRs (slower but more comprehensive).

skip_mapping_efficiency

Logical. If TRUE, skips estimation of mapping efficiency. If FALSE (default), calculates naive mapping efficiency. Set to TRUE when metrics_summary.csv is unavailable or lacks "Number of Reads" column.

Value

A list with three elements:

response_matrix

Sparse gene-by-cell expression matrix (genes as rows, cells as columns) with row and column names. Contains only genes common across all processed SRR runs, combined across all runs.

read_umi_table

Data frame with UMI-level molecule information, including columns:

  • num_reads: Number of reads supporting this UMI-cell combination

  • UMI_id: UMI index

  • cell_id: Cell barcode

  • response_id: Gene identifier (e.g., Ensembl ID)

  • srr_idx: SRR run identifier

mapping_efficiency

Numeric value between 0 and 1 representing proportion of reads mapped to transcriptome if skip_mapping_efficiency = FALSE, or NULL if skip_mapping_efficiency = TRUE.

Details

Input Requirements

Your data should be organized with Cell Ranger output directories under a top-level folder:


path_to_top_level_output/
├── SRR_run_1/
│   ├── outs/
│   │   ├── filtered_feature_bc_matrix/
│   │   │   ├── barcodes.tsv.gz
│   │   │   ├── features.tsv.gz
│   │   │   └── matrix.mtx.gz
│   │   ├── molecule_info.h5
│   │   ├── filtered_feature_bc_matrix.h5
│   │   └── metrics_summary.csv
├── SRR_run_2/
│   └── ...
└── SRR_run_3/
    └── ...

## Processing Steps

The function performs the following operations:

  1. Lists all SRR directories under the given top-level folder

  2. Optionally filters to a subset of run-level names

  3. Reads gene expression matrices using obtain_qc_response_data

  4. Identifies and retains only genes shared across all SRRs

  5. Reads UMI-level molecule information using obtain_qc_read_umi_table

  6. Calculates naive mapping efficiency using obtain_mapping_efficiency

Important Notes

  • SRR directories should be generated by a recent version of Cell Ranger configured for the perturbation (CRISPR or Perturb-seq) workflow

  • In some cases, filtered_feature_bc_matrix/ may need to be produced by unzipping filtered_feature_bc_matrix.tar.gz

  • For mapping efficiency calculation, metrics_summary.csv must include a column named "Number of Reads". This column may need to be added or edited manually when Cell Ranger is run with multiple libraries or samples.

  • When h5_rough = FALSE, multiple mapping efficiencies are computed and the median value is returned

See also

reference_data_processing for Step 2 of the preprocessing workflow.

obtain_qc_response_data, obtain_qc_read_umi_table, and obtain_mapping_efficiency for details on data extraction.

See the vignette "Preprocess Reference Expression data for Web App" for the complete preprocessing workflow: vignette("preprocess-reference", package = "perturbplan")

Examples

# Point to directory containing example Cell Ranger outputs
extdata_path <- system.file("extdata", package = "perturbplan")

# Aggregate data from all SRR runs
raw_data <- reference_data_preprocessing_10x(
  path_to_top_level_output = extdata_path,
  path_to_run_level_output = "cellranger_tiny",  # Only read subfolder cellranger_tiny
  h5_rough = TRUE,  # Use first SRR for QC data (faster)
  skip_mapping_efficiency = FALSE  # Estimate mapping efficiency
)

# Inspect structure
str(raw_data)
#> List of 3
#>  $ response_matrix   :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
#>   .. ..@ i       : int [1:11] 0 1 2 3 4 1 2 1 4 1 ...
#>   .. ..@ p       : int [1:9] 0 1 2 3 4 5 7 9 11
#>   .. ..@ Dim     : int [1:2] 5 8
#>   .. ..@ Dimnames:List of 2
#>   .. .. ..$ : chr [1:5] "ENSG00000243485" "ENSG00000238009" "ENSG00000239945" "ENSG00000241860" ...
#>   .. .. ..$ : chr [1:8] "AGCAGCCGTCCAAGTT-1" "AAACGGGTCAGCTCGG-1" "AAAGTAGCATCCCACT-1" "AAACCTGGTATATGAG-1" ...
#>   .. ..@ x       : num [1:11] 1 1 1 1 1 1 1 1 1 1 ...
#>   .. ..@ factors : list()
#>  $ read_umi_table    :'data.frame':	11 obs. of  5 variables:
#>   ..$ num_reads  : int [1:11] 2 1 1 2 1 1 1 1 1 1 ...
#>   ..$ UMI_id     : num [1:11] 139105 723247 998389 622094 584568 ...
#>   ..$ cell_id    : chr [1:11] "AAACCTGGTATATGAG-1" "AAACGGGTCAGCTCGG-1" "AAAGTAGCATCCCACT-1" "AAAGTAGTCCAAATGC-1" ...
#>   ..$ response_id: chr [1:11] "ENSG00000241860" "ENSG00000238009" "ENSG00000239945" "ENSG00000286448" ...
#>   ..$ srr_idx    : chr [1:11] "cellranger_tiny" "cellranger_tiny" "cellranger_tiny" "cellranger_tiny" ...
#>  $ mapping_efficiency: num 1

# Access components
head(raw_data$read_umi_table)
#>   num_reads UMI_id            cell_id     response_id         srr_idx
#> 1         2 139105 AAACCTGGTATATGAG-1 ENSG00000241860 cellranger_tiny
#> 2         1 723247 AAACGGGTCAGCTCGG-1 ENSG00000238009 cellranger_tiny
#> 3         1 998389 AAAGTAGCATCCCACT-1 ENSG00000239945 cellranger_tiny
#> 4         2 622094 AAAGTAGTCCAAATGC-1 ENSG00000286448 cellranger_tiny
#> 5         1 584568 AGCAGCCGTCCAAGTT-1 ENSG00000243485 cellranger_tiny
#> 6         1 956290 AGCGGTCCATTCCTGC-1 ENSG00000238009 cellranger_tiny
dim(raw_data$response_matrix)
#> [1] 5 8