Extracts QC-filtered UMI-level molecule information from Cell Ranger HDF5 files. Only molecules with feature_type == "Gene Expression" are retained; other feature types (e.g., "Antibody Capture", "CRISPR Guide Capture") are filtered out. This function is used internally by reference_data_preprocessing_10x.

Usage

obtain_qc_read_umi_table(path_to_cellranger_output)

Arguments

path_to_cellranger_output

Character. Path to Cell Ranger run folder containing:

  • outs/molecule_info.h5 – Raw molecule information with required datasets:

    • count: Number of reads per molecule

    • umi: UMI indices

    • barcodes: Cell barcodes (without GEM group suffix)

    • barcode_idx: Barcode indices (0-based)

    • gem_group: GEM group identifiers

    • feature_idx: Feature indices (0-based)

    • features/id: Gene identifiers

    • features/feature_type: Feature types (e.g., "Gene Expression")

  • outs/filtered_feature_bc_matrix.h5 – QC-filtered cell barcodes with:

    • matrix/barcodes: Cell barcodes passing QC filters
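Before calling the function, it can help to confirm that the run folder has the layout listed above. The following is a minimal sketch in base R; `check_cellranger_layout` is a hypothetical helper written for illustration, not a function provided by the package, and it checks only that the two files exist, not that the HDF5 datasets inside them are present.

```r
# Hypothetical helper (not part of the package): verifies that the two HDF5
# files listed above exist under a Cell Ranger run folder.
check_cellranger_layout <- function(path_to_cellranger_output) {
  required <- c(
    molecule_info   = file.path(path_to_cellranger_output, "outs",
                                "molecule_info.h5"),
    filtered_matrix = file.path(path_to_cellranger_output, "outs",
                                "filtered_feature_bc_matrix.h5")
  )
  present <- file.exists(required)
  names(present) <- names(required)
  if (!all(present)) {
    stop("Missing required file(s): ",
         paste(required[!present], collapse = ", "), call. = FALSE)
  }
  # Return the resolved paths invisibly so they can be reused by the caller.
  invisible(required)
}
```

Calling `check_cellranger_layout(path)` stops with an error naming the missing files, or returns the two resolved paths invisibly when both are found.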

Value

A data frame of UMI-level molecule information with the following columns:

num_reads

Number of reads supporting this UMI-cell combination

UMI_id

UMI index (1-based)

cell_id

Cell barcode with GEM group suffix (e.g., "ACGTACGT-1")

response_id

Gene identifier (e.g., Ensembl ID)

Details

The function:

  1. Reads raw molecule information from molecule_info.h5

  2. Reads QC-filtered cell barcodes from filtered_feature_bc_matrix.h5

  3. Filters molecule data to retain only QC-passed cells

  4. Filters to retain only molecules with feature_type == "Gene Expression"

  5. Constructs cell IDs with GEM group suffixes

  6. Returns a data frame with read counts per UMI per cell for Gene Expression features only
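The steps above can be sketched in base R on toy data. This is an illustration under stated assumptions, not the package's implementation: the vectors stand in for datasets that would normally be read from the two HDF5 files, all values are made up, QC barcodes are assumed to carry the GEM group suffix, and the `umi` values are carried over unchanged.

```r
# Toy stand-ins for datasets in molecule_info.h5 (all values are made up).
barcodes     <- c("AAAC", "CCCG", "GGGT")       # barcodes, no GEM suffix
feature_id   <- c("ENSG01", "ENSG02", "Ab1")    # features/id
feature_type <- c("Gene Expression", "Gene Expression", "Antibody Capture")

molecules <- data.frame(
  count       = c(2L, 1L, 3L, 1L, 1L),
  umi         = c(11L, 12L, 13L, 14L, 15L),
  barcode_idx = c(0L, 0L, 1L, 2L, 2L),          # 0-based
  gem_group   = c(1L, 1L, 1L, 1L, 1L),
  feature_idx = c(0L, 2L, 1L, 1L, 0L)           # 0-based
)

# Toy stand-in for matrix/barcodes in filtered_feature_bc_matrix.h5
# (assumed here to include the GEM group suffix).
qc_barcodes <- c("AAAC-1", "GGGT-1")

# Step 5, done first so cell IDs can be matched against qc_barcodes.
# The +1L converts 0-based HDF5 indices to R's 1-based indexing.
cell_id <- paste0(barcodes[molecules$barcode_idx + 1L], "-",
                  molecules$gem_group)

# Steps 3 and 4: keep QC-passed cells and Gene Expression features only.
keep <- cell_id %in% qc_barcodes &
  feature_type[molecules$feature_idx + 1L] == "Gene Expression"

# Step 6: assemble the output columns.
qc_table <- data.frame(
  num_reads   = molecules$count[keep],
  UMI_id      = molecules$umi[keep],
  cell_id     = cell_id[keep],
  response_id = feature_id[molecules$feature_idx[keep] + 1L],
  stringsAsFactors = FALSE
)
```

In this toy input, the molecule assigned to the "Antibody Capture" feature and the molecule from the barcode absent from `qc_barcodes` are both dropped, leaving three rows.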

The resulting table is the input for fitting the library saturation (S-M) curve in library_estimation.
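To give a sense of why per-UMI read counts carry saturation information, the sketch below computes a simple duplication-based saturation value and the expected number of distinct UMIs at several sequencing depths. It is not the S-M model fitted by library_estimation: it assumes each read is retained independently with probability `p` (binomial thinning), and `expected_umis` is a hypothetical helper. The `num_reads` values are toy numbers chosen to be consistent with the summary shown in the Examples section (11 UMIs, mean 1.182).

```r
# Toy read counts per UMI: 11 UMIs, 13 reads in total.
num_reads <- c(2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1)

total_reads <- sum(num_reads)
total_umis  <- length(num_reads)

# Fraction of reads that are duplicates of an already-observed UMI.
saturation <- 1 - total_umis / total_reads

# Hypothetical helper: expected number of distinct UMIs if each read is kept
# independently with probability p. A UMI with k reads survives with
# probability 1 - (1 - p)^k.
expected_umis <- function(p, k) sum(1 - (1 - p)^k)

fractions <- c(0.25, 0.5, 0.75, 1)
sat_curve <- data.frame(
  reads = fractions * total_reads,
  umis  = vapply(fractions, expected_umis, numeric(1), k = num_reads)
)
```

At `p = 1` the expected UMI count equals the observed 11, and the curve flattens as UMIs with multiple reads become more common, which is the behavior a saturation model is meant to capture.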

See also

reference_data_preprocessing_10x for aggregating data from multiple runs.

library_estimation for fitting saturation curves using this data.

Examples

# Extract read/UMI information from Cell Ranger output
cellranger_path <- system.file("extdata/cellranger_tiny", package = "perturbplan")
qc_table <- obtain_qc_read_umi_table(cellranger_path)

# Examine the data
head(qc_table)
#>   num_reads UMI_id            cell_id     response_id
#> 1         2 139105 AAACCTGGTATATGAG-1 ENSG00000241860
#> 2         1 723247 AAACGGGTCAGCTCGG-1 ENSG00000238009
#> 3         1 998389 AAAGTAGCATCCCACT-1 ENSG00000239945
#> 4         2 622094 AAAGTAGTCCAAATGC-1 ENSG00000286448
#> 5         1 584568 AGCAGCCGTCCAAGTT-1 ENSG00000243485
#> 6         1 956290 AGCGGTCCATTCCTGC-1 ENSG00000238009
dim(qc_table)
#> [1] 11  4
summary(qc_table$num_reads)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   1.000   1.000   1.000   1.182   1.000   2.000