Skip to content

Instantly share code, notes, and snippets.

@ag1805x
Created September 16, 2019 06:41
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save ag1805x/89cbc34ca9b20b877a3f070455289906 to your computer and use it in GitHub Desktop.
Save ag1805x/89cbc34ca9b20b877a3f070455289906 to your computer and use it in GitHub Desktop.
Using hclust() to cluster RNA-seq samples and visualize dendrogram
library(dendextend)
countsPC <- read.table("CountsPC_Gene_MultimapOverlap.tsv", header=T, row.names=c(1))
factors <- as.data.frame(read.table("factors.txt", sep="\t", header = TRUE, row.names=c(1)))
dist <- dist(t(countsPC), method="euclidean")
cluster <- hclust(dist)
# main, sub, xlab, ylab
dend <- as.dendrogram(cluster)
#Add colors to label
label_colors_to_use <- as.numeric(factors$SampleType) #Input groups of sample type for label color
label_colors_to_use <- label_colors_to_use[order.dendrogram(dend)] #Order group annotation as per in dendrogram
#Use to add custom label color
for(i in 1:length(label_colors_to_use)){
if(label_colors_to_use[i] == 1){
label_colors_to_use[i] = "blue"
}else if(label_colors_to_use[i] == 2){
label_colors_to_use[i] = "red"
}
}
#Add colored bars showing study accession
StudyAccession_colors_to_use <- as.numeric(factors$StudyAccession) #Input groups of sample type for label color
StudyAccession_colors_to_use <- StudyAccession_colors_to_use[order.dendrogram(dend)] #Order group annotation as per in dendrogram
dend <- set(dend, "labels_colors", label_colors_to_use) #Apply colors to labels
dend <- hang.dendrogram(dend, hang=-1) #Leaves at same height
dend <- set(dend, "branches_lwd", 2) #Thick branches
dend <- set(dend, "labels_cex", 1) #label font size
tiff(filename="Sample_hclust.tiff", height=10, width=20, units='in', res=300)
par(mar = c(10,5,3,1))
plot(dend, main = "Sample Clustering\ndist.method=Eucleadian", ylab = "Height") #Plot the dendrogram
colored_bars(colors = StudyAccession_colors_to_use, dend = dend, rowLabels = "Project", sort_by_labels_order = FALSE)
dev.off()
@dartagnan323
Copy link

Hi,
thanks for sharing your code.
I just tried it (just the main lines) and obtained a dendogram at the end as expected. I have 15 RNA-seq from repository 3 duplicates and 3 triplicates and I made a file of the featureCounts output with first colum gene names, and all other columns the 15 RNA-seq counts.
I'd like to ask a few questions if I may:

  1. what is the line for? "factors <- as.data.frame(read.table("factors.txt", sep="\t", header = TRUE, row.names=c(1)))" it seems to be used later to add colors. Should it include a list of all conditions? Could you clarify what's the format?
  2. Also I feel like, similar to the PCA I tried to do, I have some inconsistencies. Some replicates seem to be too far, while in the original papers they seemed close. But my deseq2 analysis seems good so it's not like I did something entirely wrong. However I used all genes for these analyses, maybe I should use only the 1000 or 3000 most expressed genes? Could you give me some info on how to go about this? Is using the whole featureCounts as input, like I did, correct?
    thanks a lot or any help,
    best regards

@ag1805x
Copy link
Author

ag1805x commented Nov 23, 2020

  1. The factors.txt file contains sample information. Each row for each sample. Columns for covariate/condition (eg. test or control)etc. As you mentioned, the factors is used for colours here. The same factors file can be used in other steps of analysis including DE testing.

  2. This is an issue even I have observed often. May be by reading the original paper more carefully about how they created the plots will help. You can also try the process you mentioned to consider top expressed genes or most variable genes. Instead of using all genes it may be better to filter genes with low/zero counts. eg. rowSums(cpm(counts_pc)>1) >1

@dartagnan323
Copy link

Thanks for a quick reply, I will try that. for 2) not sure yet how to use that "rowSums", but will search and come back if I'm really stuck.

@ag1805x
Copy link
Author

ag1805x commented Nov 23, 2020

Just use this on your raw count matrix:

keep <- rowSums(cpm(counts)>1) >1
counts_filt <- counts[keep, ]

After this continue as is with the filtered counts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment