
Aggregate Expression and QC Data from Multiple SRR-Level Cell Ranger Outputs
reference_data_preprocessing_10x.Rd
Aggregates Cell Ranger outputs from multiple sequencing runs (SRRs) into a single data structure containing gene expression matrices, UMI-level molecule information, and naive mapping efficiency estimates. This is Step 1 of the pilot data preprocessing workflow for PerturbPlan power analysis.
Usage
reference_data_preprocessing_10x(
path_to_top_level_output,
path_to_run_level_output = NULL,
h5_rough = TRUE,
skip_mapping_efficiency = FALSE
)
Arguments
- path_to_top_level_output
Character. Path to the top-level directory containing Cell Ranger run-level subdirectories. Each subdirectory should contain Cell Ranger output in the standard structure with
outs/
folders.- path_to_run_level_output
Optional character vector. Names of specific run-level directories to process (not full paths). Should match the basename of folders inside
path_to_top_level_output
. If NULL, all subdirectories are processed. Unmatched entries will trigger a warning.- h5_rough
Logical. If TRUE (default), extracts UMI-level molecule information from first SRR only for speed. If FALSE, combines UMI-level molecule information from all SRRs (slower but more comprehensive).
- skip_mapping_efficiency
Logical. If TRUE, skips estimation of mapping efficiency. If FALSE (default), calculates naive mapping efficiency. Set to TRUE when
metrics_summary.csv
is unavailable or lacks "Number of Reads" column.
Value
A list with three elements:
- response_matrix
Sparse gene-by-cell expression matrix (genes as rows, cells as columns) with row and column names. Contains only genes common across all processed SRR runs, combined across all runs.
- read_umi_table
Data frame with UMI-level molecule information, including columns:
num_reads
: Number of reads supporting this UMI-cell combinationUMI_id
: UMI indexcell_id
: Cell barcoderesponse_id
: Gene identifier (e.g., Ensembl ID)srr_idx
: SRR run identifier
- mapping_efficiency
Numeric value between 0 and 1 representing proportion of reads mapped to transcriptome if
skip_mapping_efficiency = FALSE
, or NULL ifskip_mapping_efficiency = TRUE
.
Details
Input Requirements
Your data should be organized with Cell Ranger output directories under a top-level folder:
path_to_top_level_output/
├── SRR_run_1/
│ ├── outs/
│ │ ├── filtered_feature_bc_matrix/
│ │ │ ├── barcodes.tsv.gz
│ │ │ ├── features.tsv.gz
│ │ │ └── matrix.mtx.gz
│ │ ├── molecule_info.h5
│ │ ├── filtered_feature_bc_matrix.h5
│ │ └── metrics_summary.csv
├── SRR_run_2/
│ └── ...
└── SRR_run_3/
└── ...
## Processing Steps
The function performs the following operations:
Lists all SRR directories under the given top-level folder
Optionally filters to a subset of run-level names
Reads gene expression matrices using
obtain_qc_response_data
Identifies and retains only genes shared across all SRRs
Reads UMI-level molecule information using
obtain_qc_read_umi_table
Calculates naive mapping efficiency using
obtain_mapping_efficiency
Important Notes
SRR directories should be generated by a recent version of Cell Ranger configured for the perturbation (CRISPR or Perturb-seq) workflow
In some cases,
filtered_feature_bc_matrix/
may need to be produced by unzippingfiltered_feature_bc_matrix.tar.gz
For mapping efficiency calculation,
metrics_summary.csv
must include a column named "Number of Reads". This column may need to be added or edited manually when Cell Ranger is run with multiple libraries or samples.When
h5_rough = FALSE
, multiple mapping efficiencies are computed and the median value is returned
See also
reference_data_processing
for Step 2 of the preprocessing workflow.
obtain_qc_response_data
, obtain_qc_read_umi_table
,
and obtain_mapping_efficiency
for details on data extraction.
See the vignette "Preprocess Reference Expression data for Web App" for the
complete preprocessing workflow: vignette("preprocess-reference", package = "perturbplan")
Examples
# Point to directory containing example Cell Ranger outputs
extdata_path <- system.file("extdata", package = "perturbplan")
# Aggregate data from all SRR runs
raw_data <- reference_data_preprocessing_10x(
path_to_top_level_output = extdata_path,
path_to_run_level_output = "cellranger_tiny", # Only read subfolder cellranger_tiny
h5_rough = TRUE, # Use first SRR for QC data (faster)
skip_mapping_efficiency = FALSE # Estimate mapping efficiency
)
# Inspect structure
str(raw_data)
#> List of 3
#> $ response_matrix :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
#> .. ..@ i : int [1:11] 0 1 2 3 4 1 2 1 4 1 ...
#> .. ..@ p : int [1:9] 0 1 2 3 4 5 7 9 11
#> .. ..@ Dim : int [1:2] 5 8
#> .. ..@ Dimnames:List of 2
#> .. .. ..$ : chr [1:5] "ENSG00000243485" "ENSG00000238009" "ENSG00000239945" "ENSG00000241860" ...
#> .. .. ..$ : chr [1:8] "AGCAGCCGTCCAAGTT-1" "AAACGGGTCAGCTCGG-1" "AAAGTAGCATCCCACT-1" "AAACCTGGTATATGAG-1" ...
#> .. ..@ x : num [1:11] 1 1 1 1 1 1 1 1 1 1 ...
#> .. ..@ factors : list()
#> $ read_umi_table :'data.frame': 11 obs. of 5 variables:
#> ..$ num_reads : int [1:11] 2 1 1 2 1 1 1 1 1 1 ...
#> ..$ UMI_id : num [1:11] 139105 723247 998389 622094 584568 ...
#> ..$ cell_id : chr [1:11] "AAACCTGGTATATGAG-1" "AAACGGGTCAGCTCGG-1" "AAAGTAGCATCCCACT-1" "AAAGTAGTCCAAATGC-1" ...
#> ..$ response_id: chr [1:11] "ENSG00000241860" "ENSG00000238009" "ENSG00000239945" "ENSG00000286448" ...
#> ..$ srr_idx : chr [1:11] "cellranger_tiny" "cellranger_tiny" "cellranger_tiny" "cellranger_tiny" ...
#> $ mapping_efficiency: num 1
# Access components
head(raw_data$read_umi_table)
#> num_reads UMI_id cell_id response_id srr_idx
#> 1 2 139105 AAACCTGGTATATGAG-1 ENSG00000241860 cellranger_tiny
#> 2 1 723247 AAACGGGTCAGCTCGG-1 ENSG00000238009 cellranger_tiny
#> 3 1 998389 AAAGTAGCATCCCACT-1 ENSG00000239945 cellranger_tiny
#> 4 2 622094 AAAGTAGTCCAAATGC-1 ENSG00000286448 cellranger_tiny
#> 5 1 584568 AGCAGCCGTCCAAGTT-1 ENSG00000243485 cellranger_tiny
#> 6 1 956290 AGCGGTCCATTCCTGC-1 ENSG00000238009 cellranger_tiny
dim(raw_data$response_matrix)
#> [1] 5 8