@kokitsuyuzaki
Last active January 15, 2022 17:42
Saving the HDF5 file of 10X Genomics as CSV format

Converting the HDF5 file of 10X Genomics as CSV format

In this document, we explain how to extract the gene × cell matrix from the HDF5 file provided by 10X Genomics and save the data in CSV format.

Step.1 : Download the HDF5 file from the website of 10X Genomics

First, we download the HDF5 file from the 10X Genomics site. The data is stored on Amazon AWS and can easily be downloaded with the wget command, as below.

wget https://s3-us-west-2.amazonaws.com/10x.files/samples/cell/1M_neurons/1M_neurons_filtered_gene_bc_matrices_h5.h5

This file contains 1,306,127 (1.3 M) mouse cells. Despite the huge number of cells, the file is only about 4 GB, because the data is stored in a sparse matrix format. However, this format is not easy to use directly for data analysis, so here we convert the data to a dense matrix. 10X Genomics provides two ways to process the HDF5 file: cellrangerRkit (an R package) and cellranger (Python command-line tools). For the 1.3 M data, the R package could not load the HDF5 file properly. This may be because the H5Fopen function of the rhdf5 package does not handle 64-bit integer data.

# This code does not work with the 1.3M data...
source("http://s3-us-west-2.amazonaws.com/10x.files/code/rkit-install-1.1.0.R")
library(cellrangerRkit)
neuron <- get_matrix_from_h5("1M_neurons/1M_neurons_filtered_gene_bc_matrices_h5.h5")
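To illustrate the sparse storage mentioned above, here is a minimal pure-Python sketch of a compressed sparse column (CSC) style layout, in which only the non-zero counts are stored. The function name `csc_to_dense` and the toy numbers are assumptions made for illustration; they are not part of cellranger or of the 10X file.

```python
def csc_to_dense(data, indices, indptr, shape):
    """Expand a CSC-encoded matrix into a dense list of rows."""
    n_rows, n_cols = shape
    dense = [[0] * n_cols for _ in range(n_rows)]
    for col in range(n_cols):
        # Non-zero entries of column `col` live in data[indptr[col]:indptr[col + 1]]
        for k in range(indptr[col], indptr[col + 1]):
            dense[indices[k]][col] = data[k]
    return dense


# Toy 3 genes x 4 cells matrix with only 4 non-zero counts
data = [5, 1, 2, 7]        # non-zero values
indices = [0, 2, 1, 0]     # row (gene) index of each value
indptr = [0, 2, 2, 3, 4]   # where each column (cell) starts in `data`
dense = csc_to_dense(data, indices, indptr, (3, 4))
```

In this toy case nothing is saved, but when almost all entries are zero, as in single-cell count data, the three short arrays are far smaller than the dense table.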

Hereafter, the examples are performed with cellranger.

Step.2 : Download and Install the cellranger

Next, we download Cell Ranger with wget. Cell Ranger 2.1 was the latest version as of 2018/4/28, but the command and paths below use version 1.3.0.

wget --no-check-certificate -O cellranger-1.3.0.tar.gz "https://s3-us-west-2.amazonaws.com/10x.downloads/cellranger-1.3.0.tar.gz?AWSAccessKeyId=AKIAJAZONYDS6QUPQVBA&Expires=1487446357&Signature=Yt%2BqSTuJdJ8zqdAXzoV8fisZFXo%3D"

After extracting the archive in the current directory, we add the paths of the cellranger Python libraries to PYTHONPATH.

export PYTHONPATH=./cellranger-1.3.0/cellranger-cs/1.3.0/lib/python:$PYTHONPATH
export PYTHONPATH=./cellranger-1.3.0/cellranger-cs/1.3.0/tenkit/lib/python:$PYTHONPATH
export PYTHONPATH=./cellranger-1.3.0/anaconda-cr-cs/2.2.0-anaconda-cr-cs-c7/lib/python2.7/site-packages/:$PYTHONPATH
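The same search paths can also be set from inside Python before importing cellranger. The following is only a sketch: the helper name `cellranger_paths` is hypothetical, and the directory layout is copied from the export lines above, so it assumes the 1.3.0 archive was extracted in the current directory.

```python
import os
import sys


def cellranger_paths(root="./cellranger-1.3.0", version="1.3.0",
                     conda="anaconda-cr-cs/2.2.0-anaconda-cr-cs-c7"):
    """Return the three directories that the export lines add to PYTHONPATH."""
    return [
        os.path.join(root, "cellranger-cs", version, "lib", "python"),
        os.path.join(root, "cellranger-cs", version, "tenkit", "lib", "python"),
        os.path.join(root, conda, "lib", "python2.7", "site-packages"),
    ]


# Insert in the same order as the export lines, so the last one ends up first
for path in cellranger_paths():
    if path not in sys.path:
        sys.path.insert(0, path)
```

For another Cell Ranger release, the `root`, `version` and `conda` arguments would have to be adapted to the directory names of that release.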

Step.3 : Run the python script

Finally, we start the Python REPL and execute the script below. In addition to cellranger, we install other Python packages such as h5py, numpy, scipy and scikit-learn with the pip command (subprocess and os are part of the standard library). Because of the data size, we process the matrix in chunks of 100 rows (step = 100) and incrementally save the data in append mode.

# Python Version : 2.7
# coding:utf-8
import cellranger.matrix as cr_matrix
import h5py
import numpy
import subprocess
import os
from sklearn import preprocessing
from scipy.sparse import csr_matrix

# Setting
step=100
orgname="mm10"
hdf5file="1M_neurons/1M_neurons_filtered_gene_bc_matrices_h5.h5"

# Data Loading from HDF5
matdata = cr_matrix.GeneBCMatrices.load_h5(hdf5file)
matdata = matdata.get_matrix(orgname)

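# Remove ERCC spikein
# (spike-in names start with "ERCC-"; matching 'Ercc' would also hit mouse genes such as Ercc1)
erccpos = []
for i in range(matdata.m.shape[0]):
	genename = matdata.genes[i][1]
	if genename.startswith('ERCC-'):
		erccpos.append(i)

# sorted() keeps the original gene order (a set has no guaranteed order)
target = sorted(set(range(matdata.m.shape[0])) - set(erccpos))
matdata.m = matdata.m[target, ]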

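# Remove Variance zero genes
# Row variance as E[x^2] - (E[x])^2; mean() of a sparse matrix returns a numpy.matrix,
# which has no multiply() method, so the means are converted to 1-D arrays first
term1 = numpy.asarray(matdata.m.multiply(matdata.m).mean(axis=1)).ravel()
term2 = numpy.asarray(matdata.m.mean(axis=1)).ravel() ** 2
rowvar = term1 - term2

zvpos = []
for i in range(matdata.m.shape[0]):
	if rowvar[i] == 0:
		zvpos.append(i)

# sorted() keeps the original gene order (a set has no guaranteed order)
target = sorted(set(range(matdata.m.shape[0])) - set(zvpos))
matdata.m = matdata.m[target, ]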

# Data Saving as CSV
csvfile="1M_neurons/Data.csv"
nrow = matdata.m.shape[0]
with open(csvfile, "a") as f_handle:
	for start in range(0, nrow, step):
		print(start // step)
		# Each chunk covers rows start .. end-1; the last chunk may be shorter than step
		end = min(start + step, nrow)
		tmp = csr_matrix(matdata.m[start:end], dtype=numpy.int64).todense()
		numpy.savetxt(f_handle, tmp, fmt="%i", delimiter=",")
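
The chunked, append-mode writing used in the script can be tried on a toy table without cellranger or numpy. This is only a sketch under stated assumptions: it is written for Python 3, `chunk_bounds` and `write_in_chunks` are hypothetical helper names, and a list of lists stands in for the gene × cell matrix.

```python
import csv
import os
import tempfile


def chunk_bounds(n_rows, step):
    """Yield (start, end) pairs that cover range(n_rows) in blocks of `step` rows."""
    for start in range(0, n_rows, step):
        # The last block is shorter when n_rows is not a multiple of step
        yield start, min(start + step, n_rows)


def write_in_chunks(rows, path, step):
    """Append `rows` to the CSV file at `path`, `step` rows at a time."""
    for start, end in chunk_bounds(len(rows), step):
        with open(path, "a", newline="") as handle:
            csv.writer(handle).writerows(rows[start:end])


rows = [[i, i * 2] for i in range(7)]
path = os.path.join(tempfile.mkdtemp(), "toy.csv")
write_in_chunks(rows, path, step=3)

# Read the file back to confirm that no row was lost at a chunk boundary
with open(path, newline="") as handle:
    written = [[int(x) for x in row] for row in csv.reader(handle)]
```

Using half-open (start, end) bounds means that consecutive chunks neither overlap nor skip a row.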

We can confirm that the corresponding CSV file has been generated.

ls -lth 1M_neurons/Data.csv
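
Beyond listing the file, a quick check is to count its rows and columns: the number of rows should equal the number of genes kept after filtering, and the number of columns the number of cells. The helper `csv_shape` below is a hypothetical name for a minimal Python 3 sketch that reads the file one line at a time, so it does not load the whole matrix into memory.

```python
import csv
import os
import tempfile


def csv_shape(path):
    """Return (n_rows, n_cols) of a CSV file, reading one line at a time."""
    n_rows, n_cols = 0, 0
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            if n_rows == 0:
                n_cols = len(row)
            elif len(row) != n_cols:
                raise ValueError("row %d has %d fields, expected %d"
                                 % (n_rows, len(row), n_cols))
            n_rows += 1
    return n_rows, n_cols


# Toy file standing in for 1M_neurons/Data.csv
toy = os.path.join(tempfile.mkdtemp(), "toy.csv")
with open(toy, "w", newline="") as handle:
    csv.writer(handle).writerows([[1, 0, 3], [0, 0, 2]])
shape = csv_shape(toy)
```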

We also generated log-transformed, scaled (library size normalized and/or centered), and transposed versions of the matrix.

# Example settings for a single call of tenxh52csv() (defined below)
libsize = True
cper = 1E4
log = True
center = True
transpose = True

def tenxh52csv(matdata, csvfile, step, libsize, cper, log, center, transpose, verbose):
	# Chunks are appended below, so start from an empty file
	if os.path.exists(csvfile):
		os.remove(csvfile)
	if libsize:
		# Library size (total count) of each cell; cells are the columns of matdata.m
		sumvec = numpy.asarray(matdata.m.sum(axis=0), dtype=numpy.float64).ravel()
	if transpose:
		mat = matdata.m.T
	else:
		mat = matdata.m
	N = mat.shape[0]
	# Raw counts are written as integers; every transformation yields floats
	fmt = "%.3e" if (libsize or log or center) else "%i"
	with open(csvfile, "a") as f:
		for start in range(0, N, step):
			if verbose:
				print(start // step)
			# Each chunk covers rows start .. end-1; the last chunk may be shorter than step
			end = min(start + step, N)
			tmp = mat[start:end].toarray().astype(numpy.float64)
			if libsize and not transpose:
				# Rows are genes and columns are cells: sumvec broadcasts along the columns
				tmp = tmp / sumvec * cper
			if libsize and transpose:
				# Rows are cells: divide each row by the library size of its own cell
				tmp = tmp / sumvec[start:end, None] * cper
			if log:
				tmp = numpy.log10(tmp + 1)
			if center and not transpose:
				# Centers each column (cell) over the genes of the current chunk only
				tmp = preprocessing.scale(tmp, axis=0, with_mean=True, with_std=False)
			if center and transpose:
				# Centers each row (cell) over all genes
				tmp = preprocessing.scale(tmp, axis=1, with_mean=True, with_std=False)
			numpy.savetxt(f, tmp, fmt=fmt, delimiter=",")

# Median library size, used as the scaling factor of the CPMED matrices
med = numpy.median(numpy.asarray(matdata.m.sum(axis=0)))

# Scaling factor of each normalization (None: counts without library size normalization)
norms = [("Data", None), ("CPM", 1E6), ("CP10K", 1E4), ("CPMED", med)]

# Data Saving as CSV
# Writes {,Log}{Data,CPM,CP10K,CPMED}.csv, their Centered versions,
# and the transposed (t_) versions into 1M_neurons/
for prefix, center, transpose in [("", False, False), ("Centered", True, False), ("t_", False, True)]:
	for name, cper in norms:
		for log in [False, True]:
			if prefix == "" and name == "Data" and not log:
				# Data.csv was already written above
				continue
			csvfile = "1M_neurons/" + prefix + ("Log" if log else "") + name + ".csv"
			tenxh52csv(matdata, csvfile, step, cper is not None, cper, log, center, transpose, True)
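
As a small worked example of the transformations applied above, the following pure-Python sketch normalizes a toy gene × cell count table by library size, rescales it to counts per 10,000, and takes log10(x + 1). The function name `normalize_counts` and the toy counts are assumptions for illustration, and the code is written for Python 3; the real script operates on the sparse matrix in chunks.

```python
import math


def normalize_counts(counts, scale=1e4, log=True):
    """Library-size normalize a genes x cells table given as a list of rows."""
    n_cells = len(counts[0])
    # Library size: total count of each cell, i.e. each column sum
    libsize = [sum(row[j] for row in counts) for j in range(n_cells)]
    result = []
    for row in counts:
        scaled = [row[j] / libsize[j] * scale for j in range(n_cells)]
        if log:
            scaled = [math.log10(x + 1) for x in scaled]
        result.append(scaled)
    return result


# 3 genes x 2 cells; both cells have a library size of 10
counts = [[1, 0],
          [3, 5],
          [6, 5]]
cp10k = normalize_counts(counts, log=False)
logcp10k = normalize_counts(counts)
```

After normalization every column of `cp10k` sums to the scaling factor, which makes cells with different sequencing depths comparable; passing `scale=1e6` or the median library size instead would correspond to the CPM and CPMED variants.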

Author

Koki Tsuyuzaki <koki.tsuyuzaki [at] gmail.com>

Last modified

2019/10/1

@krademaker

krademaker commented Apr 1, 2019

Hello, this approach would be useful for my research, so I am attempting to replicate it on my local system.

Have you tried this approach with Cellranger 3.0.2? Step 2 worked for me, only the specific paths had to be changed:
export PYTHONPATH=./cellranger-3.0.2/cellranger-cs/3.0.2/lib/python/:$PYTHONPATH
export PYTHONPATH=./cellranger-3.0.2/cellranger-cs/3.0.2/tenkit/lib/python/:$PYTHONPATH
export PYTHONPATH=./cellranger-3.0.2/miniconda-cr-cs/4.3.21-miniconda-cr-cs-c10/lib/python2.7/site-packages/:$PYTHONPATH

Afterwards, Cellranger could be imported to Python, while importing matrix resulted in this error:

Python 2.7.15rc1 (default, Nov 12 2018, 14:31:15) 
[GCC 7.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import cellranger
>>> import cellranger.matrix
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/cellranger-3.0.2/cellranger-cs/3.0.2/lib/python/cellranger/matrix.py", line 22, in <module>
    import cellranger.utils as cr_utils
  File "/opt/cellranger-3.0.2/cellranger-cs/3.0.2/lib/python/cellranger/utils.py", line 14, in <module>
    import tenkit.bam as tk_bam
  File "/opt/cellranger-3.0.2/cellranger-cs/3.0.2/tenkit/lib/python/tenkit/bam.py", line 8, in <module>
    import pysam
  File "/opt/cellranger-3.0.2/miniconda-cr-cs/4.3.21-miniconda-cr-cs-c10/lib/python2.7/site-packages/pysam/__init__.py", line 5, in <module>
    from pysam.libchtslib import *
ImportError: libhts.so.2: cannot open shared object file: No such file or directory

Edit: Spelling

@kokitsuyuzaki
Author

kokitsuyuzaki commented May 17, 2019

Hi,

I'm sorry for the late reply.
I just noticed your comment now.
The source code used above is based on Python2.7 and very old now.
Why don't you use Python 3?
