Skip to content

Instantly share code, notes, and snippets.

@ag1805x
Created September 16, 2019 06:41
Show Gist options
  • Save ag1805x/89cbc34ca9b20b877a3f070455289906 to your computer and use it in GitHub Desktop.
Save ag1805x/89cbc34ca9b20b877a3f070455289906 to your computer and use it in GitHub Desktop.
Using hclust() to cluster RNA-seq samples and visualize dendrogram
library(dendextend)
countsPC <- read.table("CountsPC_Gene_MultimapOverlap.tsv", header=T, row.names=c(1))
factors <- as.data.frame(read.table("factors.txt", sep="\t", header = TRUE, row.names=c(1)))
dist <- dist(t(countsPC), method="euclidean")
cluster <- hclust(dist)
# main, sub, xlab, ylab
dend <- as.dendrogram(cluster)
#Add colors to label
label_colors_to_use <- as.numeric(factors$SampleType) #Input groups of sample type for label color
label_colors_to_use <- label_colors_to_use[order.dendrogram(dend)] #Order group annotation as per in dendrogram
#Use to add custom label color
for(i in 1:length(label_colors_to_use)){
if(label_colors_to_use[i] == 1){
label_colors_to_use[i] = "blue"
}else if(label_colors_to_use[i] == 2){
label_colors_to_use[i] = "red"
}
}
#Add colored bars showing study accession
StudyAccession_colors_to_use <- as.numeric(factors$StudyAccession) #Input groups of sample type for label color
StudyAccession_colors_to_use <- StudyAccession_colors_to_use[order.dendrogram(dend)] #Order group annotation as per in dendrogram
dend <- set(dend, "labels_colors", label_colors_to_use) #Apply colors to labels
dend <- hang.dendrogram(dend, hang=-1) #Leaves at same height
dend <- set(dend, "branches_lwd", 2) #Thick branches
dend <- set(dend, "labels_cex", 1) #label font size
tiff(filename="Sample_hclust.tiff", height=10, width=20, units='in', res=300)
par(mar = c(10,5,3,1))
plot(dend, main = "Sample Clustering\ndist.method=Eucleadian", ylab = "Height") #Plot the dendrogram
colored_bars(colors = StudyAccession_colors_to_use, dend = dend, rowLabels = "Project", sort_by_labels_order = FALSE)
dev.off()
@ag1805x
Copy link
Author

ag1805x commented Nov 23, 2020

  1. The factors.txt file contains sample information. Each row for each sample. Columns for covariate/condition (eg. test or control)etc. As you mentioned, the factors is used for colours here. The same factors file can be used in other steps of analysis including DE testing.

  2. This is an issue even I have observed often. May be by reading the original paper more carefully about how they created the plots will help. You can also try the process you mentioned to consider top expressed genes or most variable genes. Instead of using all genes it may be better to filter genes with low/zero counts. eg. rowSums(cpm(counts_pc)>1) >1

@dartagnan323
Copy link

Thanks for a quick reply, I will try that. for 2) not sure yet how to use that "rowSums", but will search and come back if I'm really stuck.

@ag1805x
Copy link
Author

ag1805x commented Nov 23, 2020

Just use this on your raw count matrix:

keep <- rowSums(cpm(counts)>1) >1
counts_filt <- counts[keep, ]

After this continue as is with the filtered counts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment