kokitsuyuzaki/bestpractice10x.md

## bestpractice10x.md

      
    Raw
  

              bestpractice10x.md
            
          
    Best Practices for OnlinePCA.jl against 1.3M Mouse Brain Data

In this manuscript, we will explain how to perform OnlinePCA.jl against 1.3 million (1.3M) single cell dataset of ( https://community.10xgenomics.com/t5/10x-Blog/Our-1-3-million-single-cell-dataset-is-ready-to-download/ba-p/276 ), which is the largest single-cell RNA-Seq (scRNA-Seq) dataset at this time.
Step.1 : Prepare the dataset

Current version of OnlinePCA.jl assumes the input data to be CSV format for universal application to wide variety of research region.
Since the 1.3M data is saved as a HDF5 format which is 10X Genomics defined, we will firstly convert the HDF5 to CSV (c.f. Saving the HDF5 file of 10X Genomics as CSV format).
We know there is some attempt to unify such ultra-large scRNA-Seq data such as beachmat, Loom (LoomExperiment, Loompy), TENxGenomics, scanpy, Seurat, and 10X-HDF5, ...etc.
According to user's needs, we will implement the preprocess command to directly convert such specific binary file to julia binary file using below.
Step.2 : Confirm the version of julia language and dependency packages

In this manusctipt, users are assumed to install properly some following tools into the user's machine.

Julia : v-1.0.0 or higher

OnlinePCA.jl (Julia Package)


R : v-3.5.0 or higher

RColorBrewer : v-1.1-2 or higher (R Package)
scales : v-1.0.0 or higher (R Package)


Python : v-3.6.4 or higher

umap-learn : v-0.3.6 or higher (Python Package)
sklearn : v-0.20.0 or higher (Python Package)
numpy : v-1.15.2 or higher (Python Package)
matplotlib : v-3.0.0 or higher (Python Package)


Python : v-2.7.9

bhtsne : v-0.1.9


For the details of the installation of the OnlinePCA.jl package, see README.md of OnlinePCA.jl.
After the installation, the package is stored in the directory like below.
ls ~/.julia/packages
# ArgParse                    Mux
# Arpack                      NamedTuples
# AssetRegistry               Nullables
# BinDeps                     Observables
# BinaryProvider              OnlinePCA
# ...

Hereafter, we will use the latest julia (v-1.0.1, 2018/10/18):
alias julia="~/julia-1.0.1/bin/julia"

, and the CSV file of 1.3M data is assumed to be saved at the directory 1M_neurons.
ls 1M_neurons/*.csv
# ... 58G ... 1M_neurons/1M_neurons_filtered_gene_bc_matrices_h5.csv

Step.3 : Binalize the CSV file

After converting the 10X-HDF5 to CSV, we binalize the data for data compression and acceleration of data I/O speed.
csv2bin command converts the CSV to Zstandard (Zstd), which is a compression format implemented by Facebook and known as its high I/O speed (c.f. https://facebook.github.io/zstd/).
julia ~/.julia/packages/OnlinePCA/*/bin/csv2bin \
--csvfile 1M_neurons/1M_neurons_filtered_gene_bc_matrices_h5.csv \
--binfile 1M_neurons/1M_neurons_filtered_gene_bc_matrices_h5.zst

ls -lth 1M_neurons/1M_neurons_filtered_gene_bc_matrices_h5.*
# ... 58G ... 1M_neurons/1M_neurons_filtered_gene_bc_matrices_h5.csv
# ... 3.2G ... 1M_neurons/1M_neurons_filtered_gene_bc_matrices_h5.zst

After the data compression, large CSV file (58GB) have been compressed to Zstd format efficiently (3.2GB).
Step.4 : Calculate Gene/Cell-wise Summary Statistics

After the binalization, sumr command extracts the row/column-wise summary statistics of the data matrix.
julia ~/.julia/packages/OnlinePCA/*/bin/sumr \
--binfile 1M_neurons/1M_neurons_filtered_gene_bc_matrices_h5.zst \
--outdir 1M_neurons

ls -lth 1M_neurons/{Feature,Sample}*
# ... 119K ... 1M_neurons/Feature_NoZeros.csv : Row-wise number of zeros vector
# ... 427K ... 1M_neurons/Feature_CV2s.csv : Row-wise CV2 value vector
# ... 478K ... 1M_neurons/Feature_FTTVars.csv : Row-wise variance vector (with FTT)
# ... 493K ... 1M_neurons/Feature_LogVars.csv : Row-wise variance vector  (with log10)
# ... 479K ... 1M_neurons/Feature_Vars.csv : Row-wise variance vector
# ... 434K ... 1M_neurons/Feature_FTTMeans.csv : Row-wise mean vector (with FTT)
# ... 488K ... 1M_neurons/Feature_LogMeans.csv : Row-wise mean vector (with log10)
# ... 481K ... 1M_neurons/Feature_Means.csv : Row-wise mean vector
# ... 6.3M ... 1M_neurons/Sample_NoCounts.csv : Column-wise number of counts vector

Step.5 : Extract Highly Variable Genes

hvg command finds highly variable genes (HVG), which is known as a feature selection method for scRNA-Seq datasets (c.f. Identifying highly variable genes).
julia ~/.julia/packages/OnlinePCA/*/bin/hvg \
--binfile 1M_neurons/1M_neurons_filtered_gene_bc_matrices_h5.zst \
--rowmeanlist 1M_neurons/Feature_Means.csv \
--rowvarlist 1M_neurons/Feature_Vars.csv \
--rowcv2list 1M_neurons/Feature_CV2s.csv \
--outdir 1M_neurons

ls -lth 1M_neurons/HVG_*
# ..  19B .. 1M_neurons/HVG_a0.csv : Parameters of HVG
# ..  19B .. 1M_neurons/HVG_a1.csv : Parameters of HVG
# .. 430K .. 1M_neurons/HVG_afit.csv : Parameters of HVG
# ..   8B .. 1M_neurons/HVG_df.csv : Parameters of HVG
# ..  97K .. 1M_neurons/HVG_pval.csv : P-value of HVG
# ..  65K .. 1M_neurons/HVG_useForFit.csv : Parameters of HVG
# .. 438K .. 1M_neurons/HVG_varFitRatio.csv : Parameters of HVG

Step.6 : Filter Low-Quality Genes (row) and Cells (column)

The results of sumr and hvg are useful for quality control (QC) of the data matrix, that is, low-quality genes and cells can be trimmed by the values.
This step will avoid mistakenly regarding the false positive artefact as biologicaly meaningful signal.
Since the summary statistics and p-values of HVG are small size vectors, any other programming language such as R and Python can load them.
Here, we confirm the values by R.
# Data Loading
m <- unlist(read.csv("1M_neurons/Feature_Means.csv", header=FALSE))
v <- unlist(read.csv("1M_neurons/Feature_Vars.csv", header=FALSE))
c <- unlist(read.csv("1M_neurons/Feature_CV2s.csv", header=FALSE))
p <- unlist(read.csv("1M_neurons/HVG_Pvals.csv", header=FALSE))
s <- unlist(read.csv("1M_neurons/Sample_NoCounts.csv", header=FALSE))

# Gene-wise QC
highcv <- rep("black", length(p))
highcv[which(p <= 1E-5)] <- rgb(1,0,0,0.5)
par(mar=c(5,5,4,2))
par(ps=30)
png(file="1M_neurons/GeneQC.png")
plot(log10(m), log10(c), col=highcv, xlab="log10(Mean)", ylab="log10(CV2)", pch=16)
dev.off()

# Cell-wise QC
quantile(log10(s))
par(mar=c(5,5,4,2))
par(ps=30)
png(file="1M_neurons/CellQC.png")
plot(1:length(s), log10(s), type="l", xlab="Cell", ylab="log10(# Counts)")
dev.off()


Here, we extract only 1841 HVGs, which have lower 1E-5 p-values.
filtering command extract corresponding row-vectors and save as different ZSTD.
mkdir -p 1M_neurons/filtered
/usr/bin/time -v julia ~/.julia/packages/OnlinePCA/*/bin/filtering \
-i 1M_neurons/1M_neurons_filtered_gene_bc_matrices_h5.zst \
--featurelist 1M_neurons/HVG_Pvals.csv \
--thr1 0.00001 \
--direct1 - \
-o 1M_neurons/filtered/filtered.zst

# Summary
julia ~/.julia/packages/OnlinePCA/*/bin/sumr \
--binfile 1M_neurons/filtered/filtered.zst \
--outdir 1M_neurons/filtered

ls -lth 1M_neurons/filtered/{Feature,Sample}*
# ..  33K .. Feature_CV2s.csv
# ..  34K .. Feature_FTTMeans.csv
# ..  36K .. Feature_FTTVars.csv
# ..  38K .. Feature_LogMeans.csv
# ..  38K .. Feature_LogVars.csv
# ..  37K .. Feature_Means.csv
# ..  14K .. Feature_NoZeros.csv
# ..  36K .. Feature_Vars.csv
# .. 5.0M .. Sample_NoCounts.csv
# .. 250M .. filtered.zst : Small binary data containing only HVGs

Step.7 : Perform Online PCA

Finally, we perform the online PCA by oja command.
The algorithm is based on stochastic gradient descent (SGD) and its convergence is susceptible to the value of learning rate (step size) parameter. To confirm the convergence, we set multiple stepsize options. Using Sun Grid Engine (SGE), this step can be easily performed in parallel.
# Array job of SGE
qsub ./ArrayJob.sh

We write the source code of ArrayJob.sh as follows.
#!/bin/bash

#$ -l nc=4
#$ -t 1-6
#$ -q node.q

i=$(expr $SGE_TASK_ID - 1)
stepsizes=(0.1 1 10 100 1000 10000)
stepsize=${stepsizes[$i]}

mkdir -p 1M_neurons/filtered/Step${stepsizes[$i]}
mkdir -p 1M_neurons/filtered/log/Step${stepsizes[$i]}

alias julia="~/julia-1.0.1/bin/julia"

julia ~/.julia/packages/OnlinePCA/*/bin/oja \
--scale ftt \
--input 1M_neurons/filtered/filtered.zst \
--outdir 1M_neurons/filtered/Step${stepsizes[$i]} \
--numepoch 5 \
--rowmeanlist 1M_neurons/filtered/Feature_FTTMeans.csv \
--dim 12 \
--evalfreq 10000000000 \
--stepsize $stepsize \
--logdir 1M_neurons/filtered/log/Step${stepsizes[$i]}

The log file in each stepsize and each epoch is stored in the directory 1M_neurons/filtered/log/Step${stepsizes[$i]} and the explainced variance (%) can be confirmed as follows:


Stepsize
Epoch1
Epoch2
Epoch3
Epoch4
Epoch5


0.1
7.24e-4
8.14e-4
8.76e-4
9.25e-4
9.67e-4


1
24.04
32.22
38.7
42.6
44.4


10
39.86
55.46
59.9
61.1
61.5


100
38.82
58.53
60.4
61.5
62.3


1000
30.12
39.93
45.3
48.4
50.9


10000
18.61
15.62
22.6
22.0
25.0


In this data, the calculation of stepsize 100 seems to be converged.
Next, we confirm the cellular distribution by pair plot of principal components (PCs) calculated by online PCA.
In the R-console, we type following source code:
# Package Loading
library("RColorBrewer")
library("scales")

# cellular label
label <- read.csv("1M_neurons/analysis/clustering/graphclust/clusters.csv", header=TRUE)[,2]
names(label) <- label
label[] <- c(brewer.pal(11, "Spectral"),
	brewer.pal(12, "Set3"), brewer.pal(8, "Dark2"),
	brewer.pal(8, "Set2"), brewer.pal(9, "Set1"),
	brewer.pal(8, "Pastel1"), brewer.pal(8, "Accent"))[label]

# Loading
PCs <- read.delim("1M_neurons/filtered/Step100/Eigen_vectors.csv", sep=",", header=FALSE)
colnames(PCs) <- NULL

png(file="1M_neurons/PairsPCA.png", width=2000, height=2000)
pairs(PCs, labels=paste0("PC", 1:ncol(PCs)), col=alpha(label, 0.5), pch=16, cex=0.5)
dev.off()


The higher PCs such as PC9-12 seem to be influenced by the outlier cells.
We also perform the permutation-based method, which randomly shuffles the data matrix,  calculates only the first eigenvalue and constructs null eigenvalues distribution.
In the same way as ArrayJob.sh, we also write ArrayJob2.sh to perform the permutation-based method 5000 times.
# Array job of SGE
qsub ./ArrayJob2.sh

We write the source code of ArrayJob2.sh as follows.
#!/bin/bash

#$ -l nc=4
#$ -t 1-1000
#$ -q node.q

mkdir -p 1M_neurons/filtered/Permutation
mkdir -p 1M_neurons/filtered/Permutation/$SGE_TASK_ID

alias julia="~/julia-1.0.1/bin/julia"

julia ~/.julia/packages/OnlinePCA/*/bin/oja \
    --perm true \
	--scale ftt \
	--input 1M_neurons/filtered/filtered.zst \
	--outdir 1M_neurons/filtered/Permutation/$SGE_TASK_ID \
	--numepoch 1 \
	--rowmeanlist 1M_neurons/filtered/Feature_FTTMeans.csv \
	--dim 1 \
	--evalfreq 10000000000 \
	--stepsize 100 \
	--logdir 1M_neurons/filtered/Permutation/$SGE_TASK_ID

Next, we compare the actual eigenvalues calculated by the real data and null eigenvalues distribution from the permutation-based method.
cat 1M_neurons/filtered/Permutation/*/Eigen_values.csv > 1M_neurons/filtered/Permutation/Permutaion.csv

permval <- unlist(read.delim("1M_neurons/filtered/Permutation/Permutaion.csv", header=FALSE))
realval <- unlist(read.delim("1M_neurons/filtered/Step100/Eigen_values.csv", header=FALSE))

mx <- max(permval, realval)
mi <- min(permval, realval)

# Plot
png(file="1M_neurons/Permutation.png", width=2500, height=1000)
# Setting
par(mar=c(4.5,4.5,3,10))
par(ps = 23)

# Histgram (Permutation)
hist(log10(permval), breaks=100, xlim=log10(c(mi, mx)),
	xlab="log10(Eigenvalues)", ylab="", main="")
abline(v=log10(min(permval)), col="black", lty=2)
abline(v=log10(max(permval)), col="black", lty=2)

# Histgram (Real Data)
hist(rep(log10(realval), 10), breaks=100, xlim=log10(c(mi, mx)),
	col="blue", add=TRUE)
dev.off()
}


This figure suggests that at least the eigenvalue of PC12 is very small and hard to be distinguished from random noise.
Finally, we save only PC1 to PC11 for next analysis.
# Loading
write.csv(PCs[, 1:8], file="1M_neurons/PC1_11.csv", row.names=FALSE)

Step.8 : Perform secondary analysis of the result of online PCA

Using the result of online PCA, we perform UMAP (Uniform Manifold Approximation and Projection), which is a sophisticated non-linear dimensional reduction technique.
# Package Loading
import umap
import numpy as np
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from matplotlib.colors import LinearSegmentedColormap

# original function
def generate_cmap(colors):
	values = range(len(colors))
	vmax = np.ceil(np.max(values))
	color_list = []
	for v, c in zip(values, colors):
		color_list.append( ( v/ vmax, c) )
	return LinearSegmentedColormap.from_list('custom_cmap', color_list)

# Original Color map
cm = generate_cmap(["#9E0142", "#D53E4F", "#F46D43", "#FDAE61", "#FEE08B",
"#FFFFBF", "#E6F598", "#ABDDA4", "#66C2A5", "#3288BD",
"#5E4FA2", "#8DD3C7", "#FFFFB3", "#BEBADA", "#FB8072",
"#80B1D3", "#FDB462", "#B3DE69", "#FCCDE5", "#D9D9D9",
"#BC80BD", "#CCEBC5", "#FFED6F", "#1B9E77", "#D95F02",
"#7570B3", "#E7298A", "#66A61E", "#E6AB02", "#A6761D",
"#666666", "#66C2A5", "#FC8D62", "#8DA0CB", "#E78AC3",
"#A6D854", "#FFD92F", "#E5C494", "#B3B3B3", "#E41A1C",
"#377EB8", "#4DAF4A", "#984EA3", "#FF7F00", "#FFFF33",
"#A65628", "#F781BF", "#999999", "#FBB4AE", "#B3CDE3",
"#CCEBC5", "#DECBE4", "#FED9A6", "#FFFFCC", "#E5D8BD",
"#FDDAEC", "#7FC97F", "#BEAED4", "#FDC086", "#FFFF99"])

# Data Loading
pcs = np.loadtxt("1M_neurons/PC1_11.csv", delimiter=",")
label = np.loadtxt("1M_neurons/analysis/clustering/graphclust/clusters.csv", delimiter=",", skiprows=1, usecols=1)

# UMAP
embedding = umap.UMAP(metric='euclidean', local_connectivity=10.0, random_state=1234, verbose=True).fit_transform(pcs)

# Plot
plt.scatter(embedding[:,0], embedding[:,1], c=label, cmap=cm, s=1)
plt.colorbar()
plt.savefig('1M_neurons/umap.png')

ls -lth 1M_neurons/umap.png

UMAP plot is surely generated.

To compare the result of UMAP, we also performed t-SNE, which is a very popular non-linear dimensional reduction method in scRNA-Seq analysis.
Note that the following source code is performed by python v-2.7.9.
# Package Loading
import bhtsne
import numpy as np
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from matplotlib.colors import LinearSegmentedColormap

# original function
def generate_cmap(colors):
	values = range(len(colors))
	vmax = np.ceil(np.max(values))
	color_list = []
	for v, c in zip(values, colors):
		color_list.append( ( v/ vmax, c) )
	return LinearSegmentedColormap.from_list('custom_cmap', color_list)

# original color map
cm = generate_cmap(["#9E0142", "#D53E4F", "#F46D43", "#FDAE61", "#FEE08B",
"#FFFFBF", "#E6F598", "#ABDDA4", "#66C2A5", "#3288BD",
"#5E4FA2", "#8DD3C7", "#FFFFB3", "#BEBADA", "#FB8072",
"#80B1D3", "#FDB462", "#B3DE69", "#FCCDE5", "#D9D9D9",
"#BC80BD", "#CCEBC5", "#FFED6F", "#1B9E77", "#D95F02",
"#7570B3", "#E7298A", "#66A61E", "#E6AB02", "#A6761D",
"#666666", "#66C2A5", "#FC8D62", "#8DA0CB", "#E78AC3",
"#A6D854", "#FFD92F", "#E5C494", "#B3B3B3", "#E41A1C",
"#377EB8", "#4DAF4A", "#984EA3", "#FF7F00", "#FFFF33",
"#A65628", "#F781BF", "#999999", "#FBB4AE", "#B3CDE3",
"#CCEBC5", "#DECBE4", "#FED9A6", "#FFFFCC", "#E5D8BD",
"#FDDAEC", "#7FC97F", "#BEAED4", "#FDC086", "#FFFF99"])

# Data Loading
pcs = np.loadtxt("1M_neurons/PC1_11.csv", delimiter=",")
label = np.loadtxt("1M_neurons/analysis/clustering/graphclust/clusters.csv", delimiter=",", skiprows=1, usecols=1)

# t-SNE
tsne = bhtsne.tsne(pcs, perplexity=50, rand_seed=123)

# Plot
plt.scatter(tsne[:,0], tsne[:,1], c=label, cmap=cm, s=1)
plt.colorbar()
plt.savefig('1M_neurons/tsne.png')

ls -lth 1M_neurons/tsne.png

t-SNE plot is also surely generated.

Although both methods captured some cluster structures, the calculation time of UMAP and t-SNE are about 46 mins and 42 hours, respectively, which means UMAP is much faster.
Elapsed time and memory usage in this manusctipt

Owing to the OnlinePCA.jl, large data matrix is compressed as small size PC vector without high-spec machines and the compressed form of data accerelates the down-stream analyses. In our computational enviromenet, total calculation time and maximum memory usage were 9.5 hours and 15.8 GB, respectively.


Command
Elapsed (wall clock) time (h:mm:ss or m:ss)
Maximum resident set size (kbytes)


csv2bin
2:05:56
376844


sumr (before filtering)
24:28.68
342320


hvg
0:39.99
352156


filtering
2:23.98
399196


sumr (after filtering)
2:12.70
340800


oja (5epoch, average)
49:33.33
661292


oja (Permutation, 1epoch, average)
3:06.20
427104


Pair plot (R)
17:04.09
1804880


UMAP and plot (Python v-3.6.4)
45:55.12
15896664


t-SNE and plot (Python v-2.7.9)
41:46:05
5950504


Machine Spec


OS : CentOS Linux release 7.2.1511 (Core)
CPU : Intel Xeon E5-2670 v3 (2.30GHz)
Memory : 96 GB
HDD : 1.9TB

Author

Koki Tsuyuzaki <koki.tsuyuzaki [at] gmail.com>
Last modified

2018/10/30
Stepsize	Epoch1	Epoch2	Epoch3	Epoch4	Epoch5
0.1	7.24e-4	8.14e-4	8.76e-4	9.25e-4	9.67e-4
1	24.04	32.22	38.7	42.6	44.4
10	39.86	55.46	59.9	61.1	61.5
100	38.82	58.53	60.4	61.5	62.3
1000	30.12	39.93	45.3	48.4	50.9
10000	18.61	15.62	22.6	22.0	25.0
Command	Elapsed (wall clock) time (h:mm:ss or m:ss)	Maximum resident set size (kbytes)
csv2bin	2:05:56	376844
sumr (before filtering)	24:28.68	342320
hvg	0:39.99	352156
filtering	2:23.98	399196
sumr (after filtering)	2:12.70	340800
oja (5epoch, average)	49:33.33	661292
oja (Permutation, 1epoch, average)	3:06.20	427104
Pair plot (R)	17:04.09	1804880
UMAP and plot (Python v-3.6.4)	45:55.12	15896664
t-SNE and plot (Python v-2.7.9)	41:46:05	5950504