Extract UMI-Level Molecule Information from Cell Ranger HDF5 Files

Extracts QC-filtered UMI-level molecule information from Cell Ranger HDF5 files. Only molecules with feature_type == "Gene Expression" are retained; other feature types (e.g., "Antibody Capture", "CRISPR Guide Capture") are filtered out. This function is used internally by reference_data_preprocessing_10x.

Usage

obtain_qc_read_umi_table(path_to_cellranger_output)

Arguments

path_to_cellranger_output

Character. Path to Cell Ranger run folder containing:

outs/molecule_info.h5 – Raw molecule information with required datasets:
- count: Number of reads per molecule
- umi: UMI indices
- barcodes: Cell barcodes (without GEM group suffix)
- barcode_idx: Barcode indices (0-based)
- gem_group: GEM group identifiers
- feature_idx: Feature indices (0-based)
- features/id: Gene identifiers
- features/feature_type: Feature types (e.g., "Gene Expression")
outs/filtered_feature_bc_matrix.h5 – QC-filtered cell barcodes with:
- matrix/barcodes: Cell barcodes passing QC filters

Value

Data frame with UMI-level molecule information containing columns:

num_reads: Number of reads supporting this UMI-cell combination
UMI_id: UMI index (1-based)
cell_id: Cell barcode with GEM group suffix (e.g., "ACGTACGT-1")
response_id: Gene identifier (e.g., Ensembl ID)

Details

The function:

1. Reads raw molecule information from molecule_info.h5
2. Reads QC-filtered cell barcodes from filtered_feature_bc_matrix.h5
3. Filters molecule data to retain only QC-passed cells
4. Filters to retain only molecules with feature_type == "Gene Expression"
5. Constructs cell IDs with GEM group suffixes
6. Returns data frame with read counts per UMI per cell for Gene Expression features only

The resulting table is used to fit the library saturation (S-M) curve in library_estimation.
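
The filtering and ID-construction steps above can be sketched on in-memory toy data. This is a minimal illustration and not the package's implementation: the toy vectors stand in for the HDF5 datasets listed under Arguments, all values are invented, and passing umi through unchanged as UMI_id is an assumption (the actual function may re-index it).

```r
# Toy stand-ins for datasets in molecule_info.h5 (all values are made up)
barcodes     <- c("AAACCTGG", "AAACGGGT", "AAAGTAGC")  # no GEM group suffix
feature_ids  <- c("ENSG_A", "ENSG_B", "ABC_1")
feature_type <- c("Gene Expression", "Gene Expression", "Antibody Capture")

molecules <- data.frame(
  count       = c(2L, 1L, 1L, 3L, 1L),
  umi         = c(10L, 11L, 12L, 13L, 14L),
  barcode_idx = c(0L, 0L, 1L, 2L, 2L),  # 0-based
  gem_group   = c(1L, 1L, 1L, 1L, 1L),
  feature_idx = c(0L, 2L, 1L, 1L, 0L)   # 0-based
)

# Toy stand-in for matrix/barcodes in filtered_feature_bc_matrix.h5
qc_barcodes <- c("AAACCTGG-1", "AAAGTAGC-1")

# Build cell IDs as barcode + "-" + GEM group; R vectors are 1-based,
# so the 0-based indices are shifted by one before lookup
cell_id <- paste0(barcodes[molecules$barcode_idx + 1L], "-", molecules$gem_group)

# Keep molecules from QC-passed cells that map to Gene Expression features
keep <- cell_id %in% qc_barcodes &
  feature_type[molecules$feature_idx + 1L] == "Gene Expression"

qc_table <- data.frame(
  num_reads   = molecules$count[keep],
  UMI_id      = molecules$umi[keep],  # assumption: passed through unchanged
  cell_id     = cell_id[keep],
  response_id = feature_ids[molecules$feature_idx[keep] + 1L],
  stringsAsFactors = FALSE
)
qc_table
```

In this toy input, one molecule is dropped because its cell is not in the QC-passed list and another because it maps to an "Antibody Capture" feature, leaving three rows.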

Examples

# Extract read/UMI information from Cell Ranger output
cellranger_path <- system.file("extdata/cellranger_tiny", package = "perturbplan")
qc_table <- obtain_qc_read_umi_table(cellranger_path)

# Examine the data
head(qc_table)
#>   num_reads UMI_id            cell_id     response_id
#> 1         2 139105 AAACCTGGTATATGAG-1 ENSG00000241860
#> 2         1 723247 AAACGGGTCAGCTCGG-1 ENSG00000238009
#> 3         1 998389 AAAGTAGCATCCCACT-1 ENSG00000239945
#> 4         2 622094 AAAGTAGTCCAAATGC-1 ENSG00000286448
#> 5         1 584568 AGCAGCCGTCCAAGTT-1 ENSG00000243485
#> 6         1 956290 AGCGGTCCATTCCTGC-1 ENSG00000238009
dim(qc_table)
#> [1] 11  4
summary(qc_table$num_reads)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   1.000   1.000   1.000   1.182   1.000   2.000
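
# A further, hedged sketch: simple summaries of a table with the same four
# columns. The toy data frame below is invented for illustration and is not
# output of obtain_qc_read_umi_table; it assumes each row is one observed
# molecule (one UMI in one cell for one gene).

```r
# Hypothetical table with the documented column layout
toy_table <- data.frame(
  num_reads   = c(2L, 1L, 1L, 2L, 1L),
  UMI_id      = c(101L, 102L, 103L, 104L, 105L),
  cell_id     = c("AAAC-1", "AAAC-1", "AAAG-1", "AAAG-1", "AGCA-1"),
  response_id = c("G1", "G2", "G1", "G3", "G2"),
  stringsAsFactors = FALSE
)

# Total reads and number of observed molecules (rows)
total_reads <- sum(toy_table$num_reads)
total_umis  <- nrow(toy_table)

# Average number of reads supporting each molecule
reads_per_umi <- total_reads / total_umis

# Reads per cell, summed over all molecules in that cell
reads_per_cell <- aggregate(num_reads ~ cell_id, data = toy_table, FUN = sum)
reads_per_cell
```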
