
Extract UMI-Level Molecule Information from Cell Ranger HDF5 Files
obtain_qc_read_umi_table() extracts QC-filtered UMI-level molecule information from Cell Ranger HDF5 files.
Only molecules with feature_type == "Gene Expression" are retained; other
feature types (e.g., "Antibody Capture", "CRISPR Guide Capture") are filtered out.
This function is used internally by reference_data_preprocessing_10x.
Arguments
- path_to_cellranger_output
Character. Path to Cell Ranger run folder containing:
- outs/molecule_info.h5 – Raw molecule information with required datasets:
  - count: Number of reads per molecule
  - umi: UMI indices
  - barcodes: Cell barcodes (without GEM group suffix)
  - barcode_idx: Barcode indices (0-based)
  - gem_group: GEM group identifiers
  - feature_idx: Feature indices (0-based)
  - features/id: Gene identifiers
  - features/feature_type: Feature types (e.g., "Gene Expression")
- outs/filtered_feature_bc_matrix.h5 – QC-filtered cell barcodes with:
  - matrix/barcodes: Cell barcodes passing QC filters
Value
Data frame with UMI-level molecule information containing columns:
- num_reads
Number of reads supporting this UMI-cell combination
- UMI_id
UMI index (1-based)
- cell_id
Cell barcode with GEM group suffix (e.g., "ACGTACGT-1")
- response_id
Gene identifier (e.g., Ensembl ID)
Details
The function:
- Reads raw molecule information from molecule_info.h5
- Reads QC-filtered cell barcodes from filtered_feature_bc_matrix.h5
- Filters molecule data to retain only QC-passed cells
- Filters to retain only molecules with feature_type == "Gene Expression"
- Constructs cell IDs with GEM group suffixes
- Returns a data frame with read counts per UMI per cell for Gene Expression features only
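The steps above can be sketched on toy data in base R. This is a minimal illustration, not the package's implementation: the vectors below are made-up stand-ins for the HDF5 datasets, and the sketch assumes that the QC-filtered barcodes already carry the GEM group suffix. It also passes the toy UMI values through unchanged rather than reproducing how UMI_id is derived.

```r
# Toy stand-ins for datasets in molecule_info.h5 (all values are made up)
barcodes     <- c("AAAC", "CCCG", "GGGT")      # barcodes, no GEM group suffix
feature_id   <- c("ENSG01", "ENSG02", "AB01")  # features/id
feature_type <- c("Gene Expression", "Gene Expression", "Antibody Capture")

mol <- data.frame(
  count       = c(2L, 1L, 3L, 1L, 1L),
  umi         = c(10L, 11L, 12L, 13L, 14L),
  barcode_idx = c(0L, 0L, 1L, 2L, 2L),  # 0-based
  feature_idx = c(0L, 2L, 1L, 0L, 1L),  # 0-based
  gem_group   = c(1L, 1L, 1L, 1L, 1L)
)

# Toy stand-in for matrix/barcodes in filtered_feature_bc_matrix.h5
# (assumed here to include the "-1" GEM group suffix)
qc_barcodes <- c("AAAC-1", "GGGT-1")

# Shift the 0-based indices to R's 1-based indexing and build cell IDs
cell_id <- paste0(barcodes[mol$barcode_idx + 1L], "-", mol$gem_group)
ftype   <- feature_type[mol$feature_idx + 1L]

# Keep molecules from QC-passed cells that map to Gene Expression features
keep <- cell_id %in% qc_barcodes & ftype == "Gene Expression"

toy_table <- data.frame(
  num_reads   = mol$count[keep],
  UMI_id      = mol$umi[keep],  # passed through as-is in this sketch
  cell_id     = cell_id[keep],
  response_id = feature_id[mol$feature_idx[keep] + 1L],
  stringsAsFactors = FALSE
)
```

Of the five toy molecules, one is dropped because it maps to an "Antibody Capture" feature and one because its barcode is not in the QC-filtered set, leaving three rows.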
This data is used for fitting the library saturation (S-M) curve in
library_estimation.
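As a rough illustration of why read counts per UMI are informative about saturation, the sketch below computes a few simple summaries from a table with the same num_reads and cell_id columns. The data are made up, and the duplicate-read fraction shown is a common summary (one minus unique UMIs over total reads), not necessarily the quantity that library_estimation fits.

```r
# Toy table with the same column names as the returned data frame
# (values are made up; each row is treated as one observed molecule)
toy_table <- data.frame(
  num_reads = c(2L, 1L, 1L, 2L, 1L),
  cell_id   = c("AAAC-1", "AAAC-1", "GGGT-1", "GGGT-1", "GGGT-1"),
  stringsAsFactors = FALSE
)

total_reads <- sum(toy_table$num_reads)  # reads across all molecules
total_umis  <- nrow(toy_table)           # one row per observed molecule

# Fraction of reads that re-sequence an already observed molecule
duplicate_fraction <- 1 - total_umis / total_reads

# Per-cell totals of reads and molecules
reads_per_cell <- tapply(toy_table$num_reads, toy_table$cell_id, sum)
umis_per_cell  <- tapply(toy_table$num_reads, toy_table$cell_id, length)
```

A duplicate fraction near zero suggests that most reads still discover new molecules, while a value near one suggests the library is close to saturated.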
See also
reference_data_preprocessing_10x for aggregating data from multiple runs.
library_estimation for fitting saturation curves using this data.
Examples
# Extract read/UMI information from Cell Ranger output
cellranger_path <- system.file("extdata/cellranger_tiny", package = "perturbplan")
qc_table <- obtain_qc_read_umi_table(cellranger_path)
# Examine the data
head(qc_table)
#>   num_reads UMI_id            cell_id     response_id
#> 1         2 139105 AAACCTGGTATATGAG-1 ENSG00000241860
#> 2         1 723247 AAACGGGTCAGCTCGG-1 ENSG00000238009
#> 3         1 998389 AAAGTAGCATCCCACT-1 ENSG00000239945
#> 4         2 622094 AAAGTAGTCCAAATGC-1 ENSG00000286448
#> 5         1 584568 AGCAGCCGTCCAAGTT-1 ENSG00000243485
#> 6         1 956290 AGCGGTCCATTCCTGC-1 ENSG00000238009
dim(qc_table)
#> [1] 11 4
summary(qc_table$num_reads)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>   1.000   1.000   1.000   1.182   1.000   2.000