@mfansler
Created August 26, 2022 17:02
Read anndata dataframes with pure R
library(rhdf5)
library(tidyverse)

## Read a dataframe (e.g. /obs or /var) stored in an anndata h5ad file
read_ad_df <- function (file, name) {
  x_attrs <- h5readAttributes(file, name)

  ## check requested entry is a dataframe
  ## TODO: do we need to check encoding-version?
  stopifnot(x_attrs[['encoding-type']] == "dataframe")

  ## rownames and columns in order
  idx_cols <- unlist(x_attrs[c("_index", "column-order")], use.names=FALSE)

  ## load the factor levels
  x_levels <- h5read(file, str_c(name, "/__categories"))

  ## load dataframe
  h5read(file, name)[idx_cols] %>% as_tibble() %>%
    ## replace categorical columns with proper factors; codes are zero-based,
    ## negative codes (missing values) become NA, and level order is preserved
    mutate(across(any_of(names(x_levels)), ~ {
      lv <- as.vector(x_levels[[cur_column()]])
      factor(lv[replace(.x, .x < 0, NA) + 1L], levels = lv)
    }))
}
@mfansler
Author

This can be used on h5ad files from anndata to load what were originally Pandas DataFrame objects. For example, for single-cell data the equivalent of colData() on a SingleCellExperiment object is found at /obs in the HDF5 file:

df_colData <- read_ad_df("scrna.h5ad", "/obs")

and the equivalent of rowData() is found at /var:

df_rowData <- read_ad_df("scrna.h5ad", "/var")
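
The categorical step in read_ad_df can be hard to picture without a file at hand, so here is a minimal sketch of the same idea in base R. It assumes the column is stored as zero-based integer codes alongside a vector of category labels, and that negative codes (pandas uses -1) mark missing values. The helper name decode_codes and the example data are made up for illustration and are not part of the gist or of any package.

```r
# Hypothetical helper (not part of the gist): turn zero-based integer codes
# plus a vector of category labels into an R factor.
decode_codes <- function(codes, categories) {
  # Assumption: negative codes (pandas uses -1) mark missing values
  codes[codes < 0] <- NA
  # Shift to R's one-based indexing; keep the stored category order as levels
  factor(categories[codes + 1L], levels = categories)
}

# Made-up example data
categories <- c("T cell", "B cell", "NK cell")
codes <- c(1L, 0L, 2L, -1L, 0L)

cell_type <- decode_codes(codes, categories)
print(cell_type)
```

Passing levels explicitly keeps the category order as stored in the file, rather than letting factor() sort the labels alphabetically.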

Note on Motivation

There is an anndata R package, so why not use that? Because anndata is a wrapper for calling the Python anndata package, rather than a pure R package. For a well-defined specification, it seems unnecessary to load an entire Python subprocess for what can be done directly in R.
