kokitsuyuzaki/10xh52csv.md

## 10xh52csv.md

      
Converting the 10X Genomics HDF5 file to CSV format

In this document, we explain how to extract the gene × cell matrix from the HDF5 file provided by 10X Genomics and save the data in CSV format.
Step.1 : Download the HDF5 file from the website of 10X Genomics

First, we download the HDF5 file from the 10X Genomics site.
The data is hosted on Amazon AWS and can easily be downloaded with the wget command, as shown below.
wget https://s3-us-west-2.amazonaws.com/10x.files/samples/cell/1M_neurons/1M_neurons_filtered_gene_bc_matrices_h5.h5
This file contains 1306127 (1.3 M) mouse cells.
Despite the huge number of cells, the file is compact, at about 4 GB.
This is because the data is stored in a sparse matrix format.
However, this format is not easy to use directly for data analysis.
Hence, we convert the data to a dense matrix here.
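To get a feel for how much larger the dense form is, the following back-of-the-envelope sketch estimates the in-memory size of the full dense matrix. The cell count is taken from the text above; the gene count (`n_genes`) and the 8-byte element size are assumptions chosen for illustration only, and the helper name `dense_size_gib` is hypothetical.

```python
# Rough size estimate for a dense gene x cell matrix held in memory.
n_cells = 1306127      # from the text above
n_genes = 27998        # assumption: approximate gene count, for illustration only
bytes_per_value = 8    # assumption: 64-bit values (e.g. int64 or float64)


def dense_size_gib(rows, cols, itemsize):
    """Return the size of a dense rows x cols array in GiB."""
    return rows * cols * itemsize / float(1024 ** 3)


full = dense_size_gib(n_genes, n_cells, bytes_per_value)
print("full dense matrix: about %.0f GiB" % full)
print("ratio to the 4 GB HDF5 file: about %.0fx" % (full / 4.0))
```

Under these assumptions the dense matrix is far too large to hold in memory at once, which is why it is worth converting it piece by piece.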
10X Genomics provides two ways to process the HDF5 file: cellrangerRkit (an R package) and cellranger (Python command-line tools).
For the 1.3 M data, the R package could not load the HDF5 file properly.
This may be because the H5Fopen function of the rhdf5 package does not handle 64-bit integer data.
# This code does not work on the 1.3M data...
source("http://s3-us-west-2.amazonaws.com/10x.files/code/rkit-install-1.1.0.R")
library(cellrangerRkit)
neuron <- get_matrix_from_h5("1M_neurons/1M_neurons_filtered_gene_bc_matrices_h5.h5")
Hereafter, the examples use cellranger.
Step.2 : Download and Install the cellranger

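Next, we download Cell Ranger with wget. The latest version as of 2018/4/28 is 2.1, but the command and paths below use version 1.3.0.
The signed URL carries an Expires parameter, so a fresh link from the 10X Genomics download page may be needed.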
wget --no-check-certificate -O cellranger-1.3.0.tar.gz "https://s3-us-west-2.amazonaws.com/10x.downloads/cellranger-1.3.0.tar.gz?AWSAccessKeyId=AKIAJAZONYDS6QUPQVBA&Expires=1487446357&Signature=Yt%2BqSTuJdJ8zqdAXzoV8fisZFXo%3D"
After extracting the archive, we add the paths of the cellranger Python libraries to PYTHONPATH.
export PYTHONPATH=./cellranger-1.3.0/cellranger-cs/1.3.0/lib/python:$PYTHONPATH
export PYTHONPATH=./cellranger-1.3.0/cellranger-cs/1.3.0/tenkit/lib/python:$PYTHONPATH
export PYTHONPATH=./cellranger-1.3.0/anaconda-cr-cs/2.2.0-anaconda-cr-cs-c7/lib/python2.7/site-packages/:$PYTHONPATH
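Relative entries in PYTHONPATH are resolved against the working directory of the Python process, so the exports above only work when Python is started from the directory that contains `cellranger-1.3.0`. As an alternative sketch, the same directories can be prepended to `sys.path` as absolute paths from inside Python. The helper name `prepend_to_sys_path` is hypothetical; the directory names are copied from the export lines.

```python
import os
import sys

# Library directories from the export lines above, in the same order.
CELLRANGER_DIRS = [
    "./cellranger-1.3.0/cellranger-cs/1.3.0/lib/python",
    "./cellranger-1.3.0/cellranger-cs/1.3.0/tenkit/lib/python",
    "./cellranger-1.3.0/anaconda-cr-cs/2.2.0-anaconda-cr-cs-c7/lib/python2.7/site-packages/",
]


def prepend_to_sys_path(directories):
    """Prepend each directory, as an absolute path, to sys.path."""
    added = []
    for d in directories:
        path = os.path.abspath(d)
        if path not in sys.path:
            sys.path.insert(0, path)
        added.append(path)
    return added


added = prepend_to_sys_path(CELLRANGER_DIRS)
for path in added:
    # The directories only exist once the archive has been extracted.
    print("%s exists: %s" % (path, os.path.isdir(path)))
```

As with the exports, the directory added last ends up first on the search path.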

Step.3 : Run the python script

Finally, we start the Python REPL and execute the script below.
In addition to cellranger, we install other Python packages such as h5py, numpy, scipy, and scikit-learn with the pip command (subprocess is part of the standard library and needs no installation).
Because of the data size, we process the matrix in chunks of 100 rows (step = 100) and save them incrementally by opening the CSV file in append mode.
# Python Version : 2.7
# coding:utf-8
import cellranger.matrix as cr_matrix
import h5py
import numpy
import subprocess
import os
from sklearn import preprocessing
from scipy.sparse import csr_matrix

# Setting
step=100
orgname="mm10"
hdf5file="1M_neurons/1M_neurons_filtered_gene_bc_matrices_h5.h5"

# Data Loading from HDF5
matdata = cr_matrix.GeneBCMatrices.load_h5(hdf5file)
matdata = matdata.get_matrix(orgname)

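# Remove ERCC spike-ins
# NOTE: the substring 'Ercc' also matches endogenous mouse genes such as Ercc1,
# whereas ERCC spike-in IDs usually start with 'ERCC-'; adjust the pattern if needed
erccpos = []
for i in range(matdata.m.shape[0]):
	genename = matdata.genes[i][1]
	if 'Ercc' in genename:
		erccpos.append(i)

# sorted() keeps the remaining rows in their original order
target = sorted(set(range(matdata.m.shape[0])) - set(erccpos))
matdata.m = matdata.m[target, ]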

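# Remove genes with zero variance (row-wise Var[x] = E[x^2] - E[x]^2)
zvpos = []
term1 = (matdata.m.multiply(matdata.m)).mean(axis=1)
term2 = matdata.m.mean(axis=1)
# mean() returns a dense numpy.matrix, which has no multiply() method
term2 = numpy.multiply(term2, term2)
rowvar = term1 - term2

for i in range(matdata.m.shape[0]):
	rv = rowvar[i, 0]
	if rv == 0:
		zvpos.append(i)

# sorted() keeps the remaining rows in their original order
target = sorted(set(range(matdata.m.shape[0])) - set(zvpos))
matdata.m = matdata.m[target, ]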
# Data Saving as CSV
csvfile="1M_neurons/Data.csv"
nrow = matdata.m.shape[0]
for i in range(0, (nrow + step - 1) // step):
	print(i)
	start=i*step
	# range() excludes its end, so end is one past the last row of the chunk
	end=min((i+1)*step, nrow)
	idx = list(range(start,end))
	with open(csvfile, "a") as f_handle:
		tmp = csr_matrix(matdata.m[idx], dtype=numpy.int64).todense()
		numpy.savetxt(f_handle, tmp, fmt="%i", delimiter=",")
The corresponding CSV file has now been generated.
ls -lth 1M_neurons/Data.csv
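The chunked export can be checked on a toy matrix. The sketch below is not part of the original script: it is written for Python 3, uses a small random sparse matrix and an in-memory buffer (`io.StringIO`) in place of the real data and the CSV file, and the helper name `write_in_chunks` is hypothetical. It illustrates that, because a slice or `range(start, end)` excludes `end`, the chunk end has to be `min(start + step, n_rows)` for every row to be written exactly once.

```python
import io

import numpy
from scipy.sparse import csr_matrix


def write_in_chunks(mat, handle, step):
    """Densify `mat` step rows at a time and append each chunk as CSV."""
    n_rows = mat.shape[0]
    for start in range(0, n_rows, step):
        end = min(start + step, n_rows)    # exclusive end of this chunk
        chunk = mat[start:end].toarray()   # only `step` rows are dense at once
        numpy.savetxt(handle, chunk, fmt="%i", delimiter=",")


# Toy data: 7 genes x 4 cells, mostly zeros (hypothetical values).
rng = numpy.random.RandomState(0)
dense = rng.poisson(0.5, size=(7, 4))
toy = csr_matrix(dense)

buf = io.StringIO()
write_in_chunks(toy, buf, step=3)          # chunks of 3, 3 and 1 rows

# Read the text back and confirm that nothing was lost or duplicated.
restored = numpy.loadtxt(io.StringIO(buf.getvalue()), delimiter=",", dtype=int)
print(restored.shape)
print((restored == dense).all())
```

Comparing the row count of the written file with the row count of the matrix is a cheap sanity check for the real data as well.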
We also generate log-transformed, normalized, centered, and transposed versions of the matrix.
libsize = True
cper = 1E4
log = True
center = True
transpose = True

def tenxh52csv(matdata, csvfile, step, libsize, cper, log, center, transpose, verbose):
	# Start from an empty file because the chunks are appended below
	if os.path.exists(csvfile):
		os.remove(csvfile)
	if transpose:
		mat = matdata.m.T
	else:
		mat = matdata.m
	N = mat.shape[0]
	if libsize:
		# Library size of each cell: row sums (cell x gene) or column sums (gene x cell)
		if transpose:
			sumvec = numpy.sum(mat, axis=1)
		else:
			sumvec = numpy.sum(mat, axis=0)
	# Raw counts are integers; any transformation produces floats
	fmt = "%.3e" if (libsize or log or center) else "%i"
	for i in range(0, (N + step - 1) // step):
		if verbose:
			print(i)
		start = i*step
		end = min((i+1)*step, N)
		idx = list(range(start, end))
		with open(csvfile, "a") as f:
			tmp = csr_matrix(mat[idx, ], dtype=numpy.float64).todense()
			if libsize and not transpose:
				# The 1 x cells row vector broadcasts over the gene rows
				tmp = (tmp / sumvec) * cper
			if libsize and transpose:
				# Use the library sizes of the cells in this chunk (column vector)
				tmp = (tmp / sumvec[idx]) * cper
			if log:
				tmp = numpy.log10(tmp + 1)
			if center and not transpose:
				tmp = preprocessing.scale(tmp, axis=0, with_mean=True, with_std=False)
			if center and transpose:
				tmp = preprocessing.scale(tmp, axis=1, with_mean=True, with_std=False)
			numpy.savetxt(f, tmp, fmt=fmt, delimiter=",")
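Whether the division by the library size broadcasts as intended depends on the orientation of the matrix. The toy example below (hypothetical values, plain numpy arrays rather than the `numpy.matrix` objects that scipy returns) sketches both cases: in the gene × cell orientation the library sizes form one value per column, and in the cell × gene orientation one value per row, which has to be kept as a column vector.

```python
import numpy

# Toy counts: 3 genes (rows) x 2 cells (columns), hypothetical values.
counts = numpy.array([[1., 0.],
                      [3., 5.],
                      [0., 5.]])
cper = 1E4

# Gene x cell orientation: one library size per column.
libsize = counts.sum(axis=0)                       # shape (2,)
cp10k = counts / libsize * cper                    # (3, 2) / (2,) broadcasts over rows

# Cell x gene orientation: one library size per row.
t_counts = counts.T
t_libsize = t_counts.sum(axis=1, keepdims=True)    # shape (2, 1)
t_cp10k = t_counts / t_libsize * cper              # (2, 3) / (2, 1) broadcasts over columns

print(cp10k.sum(axis=0))                           # each cell sums to cper
print(numpy.allclose(t_cp10k, cp10k.T))            # both orientations agree
```

Dividing the transposed matrix by the untransposed library-size vector would either fail on a shape mismatch or, for a square chunk, silently normalize by the wrong axis.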
# Data Saving as CSV
csvfile="1M_neurons/LogData.csv"
nrow = matdata.m.shape[0]
for i in range(0, (nrow + step - 1) // step):
	print(i)
	start=i*step
	end=min((i+1)*step, nrow)
	idx = list(range(start,end))
	with open(csvfile, "a") as f_handle:
		tmp = csr_matrix(matdata.m[idx], dtype=numpy.int64).todense()
		tmp = numpy.log10(tmp + 1)
		# log10 values are not integers, so write them as floats
		numpy.savetxt(f_handle, tmp, fmt="%.3e", delimiter=",")
# Data Saving as CSV
sumvec = numpy.sum(matdata.m, axis=0)

csvfile="1M_neurons/CPM.csv"
nrow = matdata.m.shape[0]
for i in range(0, (nrow + step - 1) // step):
	print(i)
	start=i*step
	end=min((i+1)*step, nrow)
	idx = list(range(start,end))
	with open(csvfile, "a") as f_handle:
		tmp = (1.0 * csr_matrix(matdata.m[idx], dtype=numpy.float64).todense() / sumvec) * 1E6
		numpy.savetxt(f_handle, tmp, fmt="%.3e", delimiter=",")
# Data Saving as CSV
csvfile="1M_neurons/LogCPM.csv"
nrow = matdata.m.shape[0]
for i in range(0, (nrow + step - 1) // step):
	print(i)
	start=i*step
	end=min((i+1)*step, nrow)
	idx = list(range(start,end))
	with open(csvfile, "a") as f_handle:
		tmp = (1.0 * csr_matrix(matdata.m[idx], dtype=numpy.float64).todense() / sumvec) * 1E6
		tmp = numpy.log10(tmp + 1)
		numpy.savetxt(f_handle, tmp, fmt="%.3e", delimiter=",")
# Data Saving as CSV
csvfile="1M_neurons/CP10K.csv"
nrow = matdata.m.shape[0]
for i in range(0, (nrow + step - 1) // step):
	print(i)
	start=i*step
	end=min((i+1)*step, nrow)
	idx = list(range(start,end))
	with open(csvfile, "a") as f_handle:
		tmp = (1.0 * csr_matrix(matdata.m[idx], dtype=numpy.float64).todense() / sumvec) * 1E4
		numpy.savetxt(f_handle, tmp, fmt="%.3e", delimiter=",")
# Data Saving as CSV
csvfile="1M_neurons/LogCP10K.csv"
nrow = matdata.m.shape[0]
for i in range(0, (nrow + step - 1) // step):
	print(i)
	start=i*step
	end=min((i+1)*step, nrow)
	idx = list(range(start,end))
	with open(csvfile, "a") as f_handle:
		tmp = (1.0 * csr_matrix(matdata.m[idx], dtype=numpy.float64).todense() / sumvec) * 1E4
		tmp = numpy.log10(tmp + 1)
		numpy.savetxt(f_handle, tmp, fmt="%.3e", delimiter=",")
# Data Saving as CSV
med = numpy.median(numpy.asarray(sumvec))
csvfile="1M_neurons/CPMED.csv"
nrow = matdata.m.shape[0]
for i in range(0, (nrow + step - 1) // step):
	print(i)
	start=i*step
	end=min((i+1)*step, nrow)
	idx = list(range(start,end))
	with open(csvfile, "a") as f_handle:
		tmp = (1.0 * csr_matrix(matdata.m[idx], dtype=numpy.float64).todense() / sumvec) * med
		numpy.savetxt(f_handle, tmp, fmt="%.3e", delimiter=",")
# Data Saving as CSV
csvfile="1M_neurons/LogCPMED.csv"
nrow = matdata.m.shape[0]
for i in range(0, (nrow + step - 1) // step):
	print(i)
	start=i*step
	end=min((i+1)*step, nrow)
	idx = list(range(start,end))
	with open(csvfile, "a") as f_handle:
		tmp = (1.0 * csr_matrix(matdata.m[idx], dtype=numpy.float64).todense() / sumvec) * med
		tmp = numpy.log10(tmp + 1)
		numpy.savetxt(f_handle, tmp, fmt="%.3e", delimiter=",")
# Data Saving as CSV
csvfile="1M_neurons/CenteredData.csv"
nrow = matdata.m.shape[0]
for i in range(0, (nrow + step - 1) // step):
	print(i)
	start=i*step
	end=min((i+1)*step, nrow)
	idx = list(range(start,end))
	with open(csvfile, "a") as f_handle:
		tmp = csr_matrix(matdata.m[idx], dtype=numpy.int64).todense()
		# NOTE: axis=0 centers each column (cell) over the rows of this chunk only
		tmp = preprocessing.scale(tmp, axis=0, with_mean=True, with_std=False)
		# Centered values are not integers, so write them as floats
		numpy.savetxt(f_handle, tmp, fmt="%.3e", delimiter=",")
# Data Saving as CSV
csvfile="1M_neurons/CenteredLogData.csv"
nrow = matdata.m.shape[0]
for i in range(0, (nrow + step - 1) // step):
	print(i)
	start=i*step
	end=min((i+1)*step, nrow)
	idx = list(range(start,end))
	with open(csvfile, "a") as f_handle:
		tmp = csr_matrix(matdata.m[idx], dtype=numpy.int64).todense()
		tmp = numpy.log10(tmp + 1)
		tmp = preprocessing.scale(tmp, axis=0, with_mean=True, with_std=False)
		# Log-transformed, centered values are not integers, so write them as floats
		numpy.savetxt(f_handle, tmp, fmt="%.3e", delimiter=",")
# Data Saving as CSV
sumvec = numpy.sum(matdata.m, axis=0)

csvfile="1M_neurons/CenteredCPM.csv"
nrow = matdata.m.shape[0]
for i in range(0, (nrow + step - 1) // step):
	print(i)
	start=i*step
	end=min((i+1)*step, nrow)
	idx = list(range(start,end))
	with open(csvfile, "a") as f_handle:
		tmp = (1.0 * csr_matrix(matdata.m[idx], dtype=numpy.float64).todense() / sumvec) * 1E6
		tmp = preprocessing.scale(tmp, axis=0, with_mean=True, with_std=False)
		numpy.savetxt(f_handle, tmp, fmt="%.3e", delimiter=",")
# Data Saving as CSV
csvfile="1M_neurons/CenteredLogCPM.csv"
nrow = matdata.m.shape[0]
for i in range(0, (nrow + step - 1) // step):
	print(i)
	start=i*step
	end=min((i+1)*step, nrow)
	idx = list(range(start,end))
	with open(csvfile, "a") as f_handle:
		tmp = (1.0 * csr_matrix(matdata.m[idx], dtype=numpy.float64).todense() / sumvec) * 1E6
		tmp = numpy.log10(tmp + 1)
		tmp = preprocessing.scale(tmp, axis=0, with_mean=True, with_std=False)
		numpy.savetxt(f_handle, tmp, fmt="%.3e", delimiter=",")
# Data Saving as CSV
csvfile="1M_neurons/CenteredCP10K.csv"
nrow = matdata.m.shape[0]
for i in range(0, (nrow + step - 1) // step):
	print(i)
	start=i*step
	end=min((i+1)*step, nrow)
	idx = list(range(start,end))
	with open(csvfile, "a") as f_handle:
		tmp = (1.0 * csr_matrix(matdata.m[idx], dtype=numpy.float64).todense() / sumvec) * 1E4
		tmp = preprocessing.scale(tmp, axis=0, with_mean=True, with_std=False)
		numpy.savetxt(f_handle, tmp, fmt="%.3e", delimiter=",")
# Data Saving as CSV
csvfile="1M_neurons/CenteredLogCP10K.csv"
nrow = matdata.m.shape[0]
for i in range(0, (nrow + step - 1) // step):
	print(i)
	start=i*step
	end=min((i+1)*step, nrow)
	idx = list(range(start,end))
	with open(csvfile, "a") as f_handle:
		tmp = (1.0 * csr_matrix(matdata.m[idx], dtype=numpy.float64).todense() / sumvec) * 1E4
		tmp = numpy.log10(tmp + 1)
		tmp = preprocessing.scale(tmp, axis=0, with_mean=True, with_std=False)
		numpy.savetxt(f_handle, tmp, fmt="%.3e", delimiter=",")
# Data Saving as CSV
med = numpy.median(numpy.asarray(sumvec))
csvfile="1M_neurons/CenteredCPMED.csv"
nrow = matdata.m.shape[0]
for i in range(0, (nrow + step - 1) // step):
	print(i)
	start=i*step
	end=min((i+1)*step, nrow)
	idx = list(range(start,end))
	with open(csvfile, "a") as f_handle:
		tmp = (1.0 * csr_matrix(matdata.m[idx], dtype=numpy.float64).todense() / sumvec) * med
		tmp = preprocessing.scale(tmp, axis=0, with_mean=True, with_std=False)
		numpy.savetxt(f_handle, tmp, fmt="%.3e", delimiter=",")
# Data Saving as CSV
csvfile="1M_neurons/CenteredLogCPMED.csv"
nrow = matdata.m.shape[0]
for i in range(0, (nrow + step - 1) // step):
	print(i)
	start=i*step
	end=min((i+1)*step, nrow)
	idx = list(range(start,end))
	with open(csvfile, "a") as f_handle:
		tmp = (1.0 * csr_matrix(matdata.m[idx], dtype=numpy.float64).todense() / sumvec) * med
		tmp = numpy.log10(tmp + 1)
		tmp = preprocessing.scale(tmp, axis=0, with_mean=True, with_std=False)
		numpy.savetxt(f_handle, tmp, fmt="%.3e", delimiter=",")
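One caveat of centering inside the loop: `preprocessing.scale(tmp, axis=0, ...)` subtracts from each column the mean of the rows in the current chunk only, so the centered values depend on `step`. If the intent is to center each cell over all genes, one option is to compute the column means once and subtract them chunk by chunk. The following is a minimal sketch with plain numpy and hypothetical toy values, not the method used above.

```python
import numpy

# Toy matrix: 6 genes x 3 cells (hypothetical values).
x = numpy.arange(18, dtype=float).reshape(6, 3) ** 2
step = 4

# Reference: center every column over all rows at once.
reference = x - x.mean(axis=0)

# Chunk-wise centering with per-chunk means (the effect of scale(axis=0) per chunk).
per_chunk = numpy.vstack([
    x[s:s + step] - x[s:s + step].mean(axis=0)
    for s in range(0, x.shape[0], step)
])

# Chunk-wise centering with column means computed once, up front.
col_mean = x.mean(axis=0)
global_mean = numpy.vstack([
    x[s:s + step] - col_mean
    for s in range(0, x.shape[0], step)
])

print(numpy.allclose(global_mean, reference))
print(numpy.allclose(per_chunk, reference))
```

For the real data, the column means of the untransformed counts can be taken directly from the sparse matrix; the means of log-transformed values would need a separate pass.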
# Transposed matrix
t_matdata = matdata.m.T
# Data Saving as CSV
csvfile="1M_neurons/t_Data.csv"
# t_matdata is a plain scipy sparse matrix, so it has no .m attribute
ncell = t_matdata.shape[0]
for i in range(0, (ncell + step - 1) // step):
	print(i)
	start=i*step
	end=min((i+1)*step, ncell)
	idx = list(range(start,end))
	with open(csvfile, "a") as f_handle:
		tmp = csr_matrix(t_matdata[idx], dtype=numpy.int64).todense()
		numpy.savetxt(f_handle, tmp, fmt="%i", delimiter=",")
# Data Saving as CSV
csvfile="1M_neurons/t_LogData.csv"
ncell = t_matdata.shape[0]
for i in range(0, (ncell + step - 1) // step):
	print(i)
	start=i*step
	end=min((i+1)*step, ncell)
	idx = list(range(start,end))
	with open(csvfile, "a") as f_handle:
		tmp = csr_matrix(t_matdata[idx], dtype=numpy.int64).todense()
		tmp = numpy.log10(tmp + 1)
		# log10 values are not integers, so write them as floats
		numpy.savetxt(f_handle, tmp, fmt="%.3e", delimiter=",")
# Data Saving as CSV
# Library size of each cell: row sums, because rows are cells after transposition
sumvec = numpy.sum(t_matdata, axis=1)

csvfile="1M_neurons/t_CPM.csv"
ncell = t_matdata.shape[0]
for i in range(0, (ncell + step - 1) // step):
	print(i)
	start=i*step
	end=min((i+1)*step, ncell)
	idx = list(range(start,end))
	with open(csvfile, "a") as f_handle:
		tmp = csr_matrix(t_matdata[idx], dtype=numpy.float64).todense()
		# Each row holds all genes of one cell, so the row sum is that cell's library size
		tmp = (tmp / tmp.sum(axis=1)) * 1E6
		numpy.savetxt(f_handle, tmp, fmt="%.3e", delimiter=",")
# Data Saving as CSV
csvfile="1M_neurons/t_LogCPM.csv"
ncell = t_matdata.shape[0]
for i in range(0, (ncell + step - 1) // step):
	print(i)
	start=i*step
	end=min((i+1)*step, ncell)
	idx = list(range(start,end))
	with open(csvfile, "a") as f_handle:
		tmp = csr_matrix(t_matdata[idx], dtype=numpy.float64).todense()
		# Row sums are the library sizes of the cells in this chunk
		tmp = (tmp / tmp.sum(axis=1)) * 1E6
		tmp = numpy.log10(tmp + 1)
		numpy.savetxt(f_handle, tmp, fmt="%.3e", delimiter=",")
# Data Saving as CSV
csvfile="1M_neurons/t_CP10K.csv"
ncell = t_matdata.shape[0]
for i in range(0, (ncell + step - 1) // step):
	print(i)
	start=i*step
	end=min((i+1)*step, ncell)
	idx = list(range(start,end))
	with open(csvfile, "a") as f_handle:
		tmp = csr_matrix(t_matdata[idx], dtype=numpy.float64).todense()
		# Row sums are the library sizes of the cells in this chunk
		tmp = (tmp / tmp.sum(axis=1)) * 1E4
		numpy.savetxt(f_handle, tmp, fmt="%.3e", delimiter=",")
# Data Saving as CSV
csvfile="1M_neurons/t_LogCP10K.csv"
ncell = t_matdata.shape[0]
for i in range(0, (ncell + step - 1) // step):
	print(i)
	start=i*step
	end=min((i+1)*step, ncell)
	idx = list(range(start,end))
	with open(csvfile, "a") as f_handle:
		tmp = csr_matrix(t_matdata[idx], dtype=numpy.float64).todense()
		# Row sums are the library sizes of the cells in this chunk
		tmp = (tmp / tmp.sum(axis=1)) * 1E4
		tmp = numpy.log10(tmp + 1)
		numpy.savetxt(f_handle, tmp, fmt="%.3e", delimiter=",")
# Data Saving as CSV
# Median library size over all cells (row sums of the cell x gene matrix)
med = numpy.median(numpy.asarray(numpy.sum(t_matdata, axis=1)))
csvfile="1M_neurons/t_CPMED.csv"
ncell = t_matdata.shape[0]
for i in range(0, (ncell + step - 1) // step):
	print(i)
	start=i*step
	end=min((i+1)*step, ncell)
	idx = list(range(start,end))
	with open(csvfile, "a") as f_handle:
		tmp = csr_matrix(t_matdata[idx], dtype=numpy.float64).todense()
		# Row sums are the library sizes of the cells in this chunk
		tmp = (tmp / tmp.sum(axis=1)) * med
		numpy.savetxt(f_handle, tmp, fmt="%.3e", delimiter=",")
# Data Saving as CSV
csvfile="1M_neurons/t_LogCPMED.csv"
ncell = t_matdata.shape[0]
for i in range(0, (ncell + step - 1) // step):
	print(i)
	start=i*step
	end=min((i+1)*step, ncell)
	idx = list(range(start,end))
	with open(csvfile, "a") as f_handle:
		tmp = csr_matrix(t_matdata[idx], dtype=numpy.float64).todense()
		# Row sums are the library sizes of the cells in this chunk
		tmp = (tmp / tmp.sum(axis=1)) * med
		tmp = numpy.log10(tmp + 1)
		numpy.savetxt(f_handle, tmp, fmt="%.3e", delimiter=",")
Reference


https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.3.0/1M_neurons
https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger
https://support.10xgenomics.com/single-cell/software/pipelines/latest/rkit
https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/python
https://stackoverflow.com/questions/12169611/how-do-i-compute-the-variance-of-a-column-of-a-sparse-matrix-in-scipy

Author

Koki Tsuyuzaki <koki.tsuyuzaki [at] gmail.com>
Last modified

2019/10/1