KestindotC/0.DataVisualization.md

## 0.DataVisualization.md

      
    Raw
  

              0.DataVisualization.md
            
          
    Data Visualization in Bioinformatics

生物資訊上上有許多種Data Visualization的方式，不外乎就是在報告時能完整呈現自己的研究結果。另外在發Paper的時候也會要求高質量的圖檔，在平常訓練如何把資料漂亮、簡單、清楚的呈現出來會讓老闆很高興之外也可以讓自己的東西曝光率跟願意被聽的機率變高。所以我順手整理出了幾種在可能用到的視覺化Package，主要還是在Bioinformatics上的應用，當然如果有需要或者是很想一起研究如何畫出其他領域（新聞/財經等等）的圖也歡迎找我討論。
以下幾種是目前我覺得可能比較常使用到的ㄧ些圖，以R為主去介紹，Python或者是D3.js我會漸漸補上來。
一樣，如果有特別想要看到比較不是這麼主流的圖，也歡迎隨時找我討論。

Heatmap
Barplot
Chord Plot
Survival Curve
[Dendrogram/Phylogenetics]
[Network]
Other


## Bar.md

      
    Raw
  

              Bar.md
            
          
    Data Visualization in Bioinformatics

Bar plot的使用上滿普遍的，本篇沒什麼其特點，用任何工具甚至是excel都可以達成，因人喜好而已。但就自己的經驗其實Bar plot是一個最為直觀的圖，讓人能夠一眼清楚的知道想要表達的數量，有時候會比各種花俏的圖來的更簡單有力的說服讀者。最近正好需要處理到Mutation Gene的profile，通常以binary的形式去紀錄gene的突變狀況，需要粗略的統計到有多少手邊的樣本產生突變，因此以一個表格統計gene以及mutation人數的話會像下面這張表：


gene
SNP_samplenum
type


ABCD1
12
normal


KRAS
11
onco


BRCA1
8
onco


BRCA2
2
onco


TP53
18
onco


CD3
1
normal


type是手動調整的，由於有些基因在某些疾病中比較容易突變，而且被report過多次
你的老闆或者是研讀資料的人可能會想要知道他們長在哪些位置
所以把某些gene標註成oncogene使得ggplot自己去調整顏色

ggplot(snp,aes(x=reorder(gene,SNPsummary),y=SNPsummary,fill=common))+
	geom_bar(stat="identity")+
	theme_classic()+  # Personal require
	theme(axis.text.x = element_text(angle = 90, hjust = 1))+  # Rotate the axis label
	scale_fill_manual(values=c("#9999CC","#CC6666"))

圖的範例為自己研究所需，當然有些參數已經過調整，所以非真實Data


## Chord.md

      
    Raw
  

              Chord.md
            
          
    Data Visualization in Bioinformatics

Chord Plot
Chord plot在生物資訊學的應用上雖然沒有heatmap這麼廣泛，但基本上我看過適用的地方就包含了gene fusion以及copy number variation(CNV)的呈現。而下面的範例則是我之前的研究，關於T cell receptor模型的建立。
T細胞受體在基因體層次上包含了 Variable, Diversity 以及 Joint Region的小塊exon分佈。
這三種gene exon之間目前尚未被發現有無特殊的連接狀況，因此藉由定序資料分析得到他們的組合機率分佈，資料前處理完之後就可以用Chord Plot呈現。
目前R package - Circlize可以做到把很多二維呈現的資料拉成圓形可以看他的範例
library('circlize')
head(data)
        V_seg     J_seg
1  TRAV1-1*00 TRAJ23*00
2  TRAV1-2*00 TRAJ23*00
3  TRAV1-2*00 TRAJ28*00
4  TRAV1-2*00 TRAJ33*00
5   TRAV10*00 TRAJ20*00
6 TRAV12-1*00 TRAJ36*00

# Pair的組合表列在dataframe格式中，我這次的範例有2236個結果
# 只取出現次數大於4次的資料，方便示範而已

df <- as.data.frame(table(data))
df <- df[df$Freq>4,]
chordDiagram(df)

# VDJ的 segments combination 的統計不太需要 track 的刻度
chordDiagram(df, annotationTrack = c('name','grid'))


可以看到在text labeling的調整上可能還是要透過base package(circos)

建議需要產生高解析的圖的話還是用illustrator重篇編輯文字或已legend的方式呈現

缺點：當類別項目太複雜的時候，這樣的呈現需要考量到最重要的pair們怎麼Hightlight

結論：其實只要是有pair特性的資料，然後資料量不要太過複雜，以chord plot來呈現是一件很美好的事情。

  
## Clustergram.md

      
    Raw
  

              Clustergram.md
            
          
    Data Visualization in Bioinformatics

In genetic profiling, we use clustergram to present the gene expression frequently.
Besides gene expression value, the distance between samples and genes were also concerned.
So, here I list few common packages in our lab to analyze gene(gene set) expression values.
Following packages were recommended, and few of them were shown in examples.

pheatmap : R-package
heatmap.2: R-package # Lastest version of heatmap.3
ggplot2  : R-package # Base package
seaborn  : Python module

Example data set were genetic profiling with 31 genes and 600 samples approximately.
pheatmap

library(pheatmap)
ggheat <- pheatmap(data,
                   color = , col=redgreen(75)
                   border_color = "white",cellwidth = 0.6, cellheight = 9, scale = "row",
                   cluster_rows = TRUE,cluster_cols = TRUE,
                   clustering_distance_rows = "euclidean",clustering_distance_cols = "euclidean",
                   clustering_method = "average",fontsize = 10,
                   cutree_rows = NA, cutree_cols = NA,  	# Do it latter.
                   legend = TRUE, legend_breaks = NA,
                   legend_labels = NA, annotation_row = NA, annotation_col = NA,
                   annotation = NA, annotation_colors = NA, annotation_legend = TRUE,
                   annotation_names_row = TRUE, annotation_names_col = TRUE,
                   drop_levels = TRUE, show_rownames = F, show_colnames = F, main = NA,
                   fontsize_row = 10, fontsize_col = 10,
                   display_numbers = F, number_format = "%.2f", number_color = "black",
                   fontsize_number = 0.8 * fontsize, gaps_row = NULL, gaps_col = NULL,
                   labels_row = NULL, labels_col = NULL, filename = NA, width = NA,
                   height = NA, silent = FALSE)

heatmap.2

library(gplots)
heatmap.2(as.matrix(data), col=redgreen, scale="row", key=T, keysize=1.3,breaks=c(seq(-3, 3,0.01)),density.info="none", trace="none",labCol = F,margins = c(3,12),cexRow = 0.8)


Heatmap.2 有幾個比較需要自己調整的問題

像是常見的包含在RStudio上會出現的下面的error可藉由調整 margin 以及 par參數來除錯
Error in plot.new() : figure margins too large

Error in par(op) : invalid value specified for graphical parameter "pin"

ggplot2


ggplot2是最根本的方式，也是其他package內涵的基本方法.

由於筆者趕著畢業，所以不使用這方式，待更新...

seaborn

import seaborn as sb
import pandas as pd
import matplotlib.pyplot as plt

# Please import your data as pandas dataframe first.
# The default clustering parameters in R were (complete and euclidean)
fig = sb.clustermap(data,standard_scale=0,method='complete', metric='euclidean')
plt.setp(fig.ax_heatmap.get_yticklabels(), rotation=0)
plt.setp(fig.ax_heatmap.set_xticklabels([]))     # X labels ommited
fig.savefig("heatmap_result.png")


## Genomic_coor.md

      
    Raw
  

              Genomic_coor.md
            
          
    Data Visualization in Bioinformatics

DNA拷貝變異數的呈現通常會需要整個基因組的位置資訊，這時候在繪圖的時候其中一軸會很常使用到多塊小圖合併的原理。
以R-ggplot的名詞來說的話就是facet_grid，而我們今天拿到的資料範例如下表：


Sample
Chromosome
Start
End
Segment_Mean
length


S1
1
61735
258978
-0.1625
197243


S1
1
239324
259407
-1.8405
83


S1
1
259422
1892483
-0.1967
1633061


S1
1
1892631
1892631
-0.2738
1774063


S1
1
3668655
3893197
-0.1861
224542


這種資料格式是由常見的Copy Number Variation Detection演算法(ex.CBS or HMM)計算之後report。

Sample的名字以及Chromosome染色體位置
在哪些區間裡面他的拷貝變異數轉換數值Segment_Mean，以及我額外加上去的區間長度length。

當樣本數只有少數，可以單純對每一個樣本畫出 CNV genomic profile，或者是每個CNV區間出現的樣本比例。

不是做生物資訊的人也或許會用到，在某個時間序列或空間序列當中，觀察數值呈現正負的擺動，藉由這種圖表來呈現
以ggplot2的指令做如下圖的範例，概念上是將每個Chromosome區分成不同facet然後facet之間不要有空隙就可以了
ggplot(data=data) + 
  facet_grid(~Chromosome, space="free_x", scales="free_x") + 
  theme(axis.text.x=element_blank(), axis.ticks=element_blank(), axis.title.x=element_blank(),
        panel.spacing = unit(0, "lines")) +
  geom_bar(data = data[data$Segment_Mean>0,],aes(x=Start, y=Segment_Mean, width=length, fill = "gain"), stat="identity",position="identity") +
  geom_bar(data = data[data$Segment_Mean<0,],aes(x=Start, y=Segment_Mean, width=length, fill = "loss"), stat="identity",position="identity") +
  geom_hline(yintercept=0.2016, colour="grey", size=0.8, linetype="dashed") +
  geom_hline(yintercept=-0.2344, colour="grey", size=0.8, linetype="dashed") +
  ylim(c(-1.5,1.5))+
  scale_fill_manual(values = c("loss" = "darkblue", "gain" = "red"))


以圖來呈現非常廣泛的Genomic Range會造成當距離不是很長的資訓，會呈現不出Bar圖，以上圖來說10Mb的區間才呈現出一條細細的線
(畢竟3Gb的genome被壓成這樣很難看出有什麼區域是真的有gain或loss)

除此之外，事實上每一條Chromosome的極限尾端並不一定有被CNV caller report的前提下，根本不存在於圖上。   你說有沒有解決方法？目前建議的方式就是把有興趣的region抓出來看或者是每一條Chromosome抽出來個別視覺化

而個別的Sample看完了，想要進一步看所有Sample並且比較他們的Whole Genome Scale的分佈狀況就可以使用前面提過的Heatmap，差別在於其中一軸會變成Genomic Scale，另一軸維持Sample，並且利用facet_grid	跟geom_tile結合讓圖像是Heatmap的樣子
# 這裡的data是如同上面的資料格式，然後Sample有很多個，並且對於Sample2的Segment_Mean做了小幅度的調整

ggplot(data=data) + 
  facet_grid(~Chromosome, space="free", scales="free") +theme_minimal() +
  geom_tile(aes(x=Start,y=Sample,fill=Segment_Mean,width=length),stat="identity") +
  theme(axis.text.x=element_blank(),axis.ticks=element_blank(), 
        axis.title.x=element_blank(),axis.text.y=element_blank(),
        panel.spacing = unit(0, "lines"),legend.position="top") +
  scale_fill_gradient2(low = 'steelblue',mid='white',high = 'maroon',limits=c(-2,2),na.value = "white")


## Survival.md

      
    Raw
  

              Survival.md
            
          
    Data Visualization in Bioinformatics

在流行病學研究上很容易看到 Kaplan Meier Plot，它的作用是為了將我們給定病人條件的分類下視覺化病人群之間存活率是否有差異，
R package有非常多survival專用的package，surv就是一個計算時用到最大宗的套件，被包含在許多視覺套件內，我自己使用的是survminer，那麼病人的存活資料準備好之後就沒什麼好擔心的了。


Sample
Censor
Time
TIL_level


TCGA-G4-6294
1
858
0


TCGA-AA-3814
0
0
1


TCGA-AA-3672
0
28
1


TCGA-D5-6541
0
474
1


TCGA-AA-3861
0
914
1


TCGA-CM-6165
0
488
1


# data 的資料如上表
# Censor: 1 for dead 0 for alive
fit <- survfit(Surv(Time/30,event=Censor) ~ TIL_level, data = data)
ggsurvplot(fit,data = data,
           size=0.7,
           font.main = 10, font.x=8,font.y=8,
           palette=c('#E8B800','#2E9FDF'),
           xlab = "Time (Month)",
           pval = T, # Log-rank P value
           legend.labs = c('Low','High'),ggtheme = theme_minimal())


如果你需要更多功能像是加入Confidence interval或是Risk table的話可以參考官方網站特別整理出的Cheat sheet

  
## z_barplot.png

      
    Raw
  

              z_barplot.png
            
          
## z_chordplot_R.png

      
    Raw
  

              z_chordplot_R.png
            
          
## z_CNV_first_gainloss.png

      
    Raw
  

              z_CNV_first_gainloss.png
            
          
## z_CNV_heatmap.png

      
    Raw
  

              z_CNV_heatmap.png
            
          
## z_gsheatmap_R.png

      
    Raw
  

              z_gsheatmap_R.png
            
          
## z_surv.png

      
    Raw
  

              z_surv.png
gene	SNP_samplenum	type
ABCD1	12	normal
KRAS	11	onco
BRCA1	8	onco
BRCA2	2	onco
TP53	18	onco
CD3	1	normal
Sample	Chromosome	Start	End	Segment_Mean	length
S1	1	61735	258978	-0.1625	197243
S1	1	239324	259407	-1.8405	83
S1	1	259422	1892483	-0.1967	1633061
S1	1	1892631	1892631	-0.2738	1774063
S1	1	3668655	3893197	-0.1861	224542
Sample	Censor	Time	TIL_level
TCGA-G4-6294	1	858	0
TCGA-AA-3814	0	0	1
TCGA-AA-3672	0	28	1
TCGA-D5-6541	0	474	1
TCGA-AA-3861	0	914	1
TCGA-CM-6165	0	488	1