debangana234/ChIP-seq2.md

## ChIP-seq2.md

      
    Raw
  

              ChIP-seq2.md
            
          
  title
  author
  date
  theme
  highlight
  output
  tags
  
  
  ChIP-seq2
  Fadel Berakdar, Debangana Chakraborty
  
  united
  tango
  
  
  html_document
  
  
  toc
  toc_depth
  number_sections
  toc_float
  
  
  true
  3
  true
  
  
  collapsed
  smooth_scroll
  
  
  true
  false
  
  
  chipseq
  
  
General Tasks

Get to know the system


Run sinfo -lN on the master node. How many nodes do you see? How many CPUs and how much RAM does each node have?
There are 12 cores and 120000 MB of memory in each node.
$ sinfo -lN
Mon Aug  9 14:58:58 2021
NODELIST                             NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON              
bibigrid-worker-1-1-etqcgckjrdsvsxe      1    debug*        idle   12   12:1:1 121000        0      1   (null) none                
bibigrid-worker-1-2-etqcgckjrdsvsxe      1    debug*        idle   12   12:1:1 121000        0      1   (null) none                
bibigrid-worker-1-3-etqcgckjrdsvsxe      1    debug*        idle   12   12:1:1 121000        0      1   (null) none                
bibigrid-worker-1-4-etqcgckjrdsvsxe      1    debug*        idle   12   12:1:1 121000        0      1   (null) none


Run df -h /vol/COMPEPIWS/. What is the volume’s size?
The volume size is 3.9T
$ df -h /vol/COMPEPIWS/
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdb        3.9T  1.4T  2.4T  37% /vol/COMPEPIWS


Check the temp folder location by: echo $TMPDIR
$ echo $TMPDIR
/vol/COMPEPIWS/tmp


Your group working directory is /vol/COMPEPIWS/groups/
$ ls /vol/COMPEPIWS/groups/chipseq2/
conda  data  tasks


Each of this directories has the following structure:
/vol/COMPEPIWS/groups/
|__ data (where you will make symbolic links for the raw data)
|__ conda (for some of conda tasks)
|__ tasks (where you execute the pipelines and the downstream tasks,
try to keep yourself organized inside this folder)


The raw data (fastq files) that you are going to use to run the pipelines is available under this folder: /vol/COMPEPIWS/data/reduced/ and each file ID has the following scheme: <cell_type><assay/mark><biological_replicate>.fastq.gz
e.g. liver_14.5_RNAseq_1.fastq.gz
There is an exception for ATAC data that has an additional info for Read1 (R1) and Read2 (R2) since this data is paired end.
e.g. liver_14.5_ATAC_1{R1,R2}.fastq.gz
$ ls /vol/COMPEPIWS/data/reduced/ChIP-seq/
kidney_14.5_H3K27ac_1.fastq.gz   kidney_14.5_H3K9me3_2.fastq.gz   kidney_15.5_H3K9ac_1.fastq.gz   liver_14.5_H3K4me1_2.fastq.gz   liver_15.5_H3K36me3_1.fastq.gz
kidney_14.5_H3K27ac_2.fastq.gz   kidney_14.5_control_1.fastq.gz   kidney_15.5_H3K9ac_2.fastq.gz   liver_14.5_H3K4me3_1.fastq.gz   liver_15.5_H3K36me3_2.fastq.gz
kidney_14.5_H3K27me3_1.fastq.gz  kidney_14.5_control_2.fastq.gz   kidney_15.5_H3K9me3_1.fastq.gz  liver_14.5_H3K4me3_2.fastq.gz   liver_15.5_H3K4me1_1.fastq.gz
kidney_14.5_H3K27me3_2.fastq.gz  kidney_15.5_H3K27ac_1.fastq.gz   kidney_15.5_H3K9me3_2.fastq.gz  liver_14.5_H3K9ac_1.fastq.gz    liver_15.5_H3K4me1_2.fastq.gz
kidney_14.5_H3K36me3_1.fastq.gz  kidney_15.5_H3K27ac_2.fastq.gz   kidney_15.5_control_1.fastq.gz  liver_14.5_H3K9ac_2.fastq.gz    liver_15.5_H3K4me3_1.fastq.gz
kidney_14.5_H3K36me3_2.fastq.gz  kidney_15.5_H3K27me3_1.fastq.gz  kidney_15.5_control_2.fastq.gz  liver_14.5_H3K9me3_1.fastq.gz   liver_15.5_H3K4me3_2.fastq.gz
kidney_14.5_H3K4me1_1.fastq.gz   kidney_15.5_H3K27me3_2.fastq.gz  liver_14.5_H3K27ac_1.fastq.gz   liver_14.5_H3K9me3_2.fastq.gz   liver_15.5_H3K9ac_1.fastq.gz
kidney_14.5_H3K4me1_2.fastq.gz   kidney_15.5_H3K36me3_1.fastq.gz  liver_14.5_H3K27ac_2.fastq.gz   liver_14.5_control_1.fastq.gz   liver_15.5_H3K9ac_2.fastq.gz
kidney_14.5_H3K4me3_1.fastq.gz   kidney_15.5_H3K36me3_2.fastq.gz  liver_14.5_H3K27me3_1.fastq.gz  liver_14.5_control_2.fastq.gz   liver_15.5_H3K9me3_1.fastq.gz
kidney_14.5_H3K4me3_2.fastq.gz   kidney_15.5_H3K4me1_1.fastq.gz   liver_14.5_H3K27me3_2.fastq.gz  liver_15.5_H3K27ac_1.fastq.gz   liver_15.5_H3K9me3_2.fastq.gz
kidney_14.5_H3K9ac_1.fastq.gz    kidney_15.5_H3K4me1_2.fastq.gz   liver_14.5_H3K36me3_1.fastq.gz  liver_15.5_H3K27ac_2.fastq.gz   liver_15.5_control_1.fastq.gz
kidney_14.5_H3K9ac_2.fastq.gz    kidney_15.5_H3K4me3_1.fastq.gz   liver_14.5_H3K36me3_2.fastq.gz  liver_15.5_H3K27me3_1.fastq.gz  liver_15.5_control_2.fastq.gz
kidney_14.5_H3K9me3_1.fastq.gz   kidney_15.5_H3K4me3_2.fastq.gz   liver_14.5_H3K4me1_1.fastq.gz   liver_15.5_H3K27me3_2.fastq.gz


Tips and Tricks

Working with tables on the command line, or awk

We will explore and manipulate a GTF file that contains gene annotation


Do this task on one of the worker nodes


Run: head /vol/COMPEPIWS/pipelines/references/genome_genes.gtf to explore the first few lines of this file.
$ head /vol/COMPEPIWS/pipelines/references/genome_genes.gtf
chr1	unknown	exon	3214482	3216968	.	-	.	gene_id "Xkr4"; gene_name "Xkr4"; p_id "P15391"; transcript_id "NM_001011874"; tss_id "TSS27105";
chr1	unknown	stop_codon	3216022	3216024	.	-	.	gene_id "Xkr4"; gene_name "Xkr4"; p_id "P15391"; transcript_id "NM_001011874"; tss_id "TSS27105";
chr1	unknown	CDS	3216025	3216968	.	-	2	gene_id "Xkr4"; gene_name "Xkr4"; p_id "P15391"; transcript_id "NM_001011874"; tss_id "TSS27105";
chr1	unknown	CDS	3421702	3421901	.	-	1	gene_id "Xkr4"; gene_name "Xkr4"; p_id "P15391"; transcript_id "NM_001011874"; tss_id "TSS27105";
chr1	unknown	exon	3421702	3421901	.	-	.	gene_id "Xkr4"; gene_name "Xkr4"; p_id "P15391"; transcript_id "NM_001011874"; tss_id "TSS27105";
chr1	unknown	CDS	3670552	3671348	.	-	0	gene_id "Xkr4"; gene_name "Xkr4"; p_id "P15391"; transcript_id "NM_001011874"; tss_id "TSS27105";
chr1	unknown	exon	3670552	3671498	.	-	.	gene_id "Xkr4"; gene_name "Xkr4"; p_id "P15391"; transcript_id "NM_001011874"; tss_id "TSS27105";
chr1	unknown	start_codon	3671346	3671348	.	-	.	gene_id "Xkr4"; gene_name "Xkr4"; p_id "P15391"; transcript_id "NM_001011874"; tss_id "TSS27105";
chr1	unknown	exon	4290846	4293012	.	-	.	gene_id "Rp1"; gene_name "Rp1"; p_id "P17361"; transcript_id "NM_001195662"; tss_id "TSS6138";
chr1	unknown	stop_codon	4292981	4292983	.	-	.	gene_id "Rp1"; gene_name "Rp1"; p_id "P17361"; transcript_id "NM_001195662"; tss_id "TSS6138";


Count the number of lines in this file
$ wc -l /vol/COMPEPIWS/pipelines/references/genome_genes.gtf
671462 /vol/COMPEPIWS/pipelines/references/genome_genes.gtf


Count the number of "exon" in this file (hint: grep, awk)
$ awk '{ if ($3=="exon") {print $0} }' /vol/COMPEPIWS/pipelines/references/genome_genes.gtf | wc -l
324748

$ awk '{ if ($3=="exon") {count++}} END { print count }' /vol/COMPEPIWS/pipelines/references/genome_genes.gtf
324748


Count the number of "exon" that are longer than 1000bp considering cloumns 4 and 5 as coordinates of the exon
$ awk '{ if ($3=="exon" && $5-$4 + 1 > 1000) {count++}} END { print count }' /vol/COMPEPIWS/pipelines/references/genome_genes.gtf
19601


Note:  GTF format is 1-based


Count the number of "exon" that has Sox17 as gene_id
awk '{ if ($10=="\"Sox17\";" && $3=="exon") {count++}} END { print count }' /vol/COMPEPIWS/pipelines/references/genome_genes.gtf
20

$ awk -vFS="\t" '{ if ($3=="exon") {split($9, a, ";"); if (a[1]=="gene_id \"Sox17\"") {count++} }} END { print count }' /vol/COMPEPIWS/pipelines/references/genome_genes.gtf 
20


Note:  Unexpected field separation using awk

$ cat -A /vol/COMPEPIWS/pipelines/references/genome_genes.gtf | head -1
chr1^Iunknown^Iexon^I3214482^I3216968^I.^I-^I.^Igene_id "Xkr4"; gene_name "Xkr4"; p_id "P15391"; transcript_id "NM_001011874"; tss_id "TSS27105";$


Count the number of "exon" in chr2
$ awk '{ if ($1== "chr2" && $3=="exon") {count++}} END { print count }' /vol/COMPEPIWS/pipelines/references/genome_genes.gtf
29373


Count the frequency of each feature in column 3 (hint: uniq)
$ cut -f3 /vol/COMPEPIWS/pipelines/references/genome_genes.gtf | sort | uniq -c 
 286783 CDS
 324748 exon
  29978 start_codon
  29953 stop_codon


Sort the file according to column1 and then column3. redirect the output into a new file.
$ sort -k1,1 -k3,3 /vol/COMPEPIWS/pipelines/references/genome_genes.gtf > /vol/COMPEPIWS/groups/chipseq2/tasks/genome_genes_sorted_1_then_3.gtf


Conda and Bioconda


Load the “base” environment if it is not loaded
$ source /vol/COMPEPIWS/conda/miniconda3/bin/activate


create a new conda environment under your group conda directory called as your group’s name (hint: create with -p option)
(base) ~$ conda create -p /vol/COMPEPIWS/groups/chipseq2/conda/env_chipSeq2 
Collecting package metadata (current_repodata.json): done
Solving environment: done


==> WARNING: A newer version of conda exists. <==
  current version: 4.9.2
  latest version: 4.10.3

Please update conda by running

    $ conda update -n base -c defaults conda


## Package Plan ##

  environment location: /vol/COMPEPIWS/groups/chipseq2/conda/env_chipSeq2


Proceed ([y]/n)? y

Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate /vol/COMPEPIWS/groups/chipseq2/conda/env_chipSeq2
#
# To deactivate an active environment, use
#
#     $ conda deactivate


activate the created environment
(base) $ source activate /vol/COMPEPIWS/groups/chipseq2/conda/env_chipSeq2
(env_chipSeq2) $ conda env list
# conda environments:
#
base                     /vol/COMPEPIWS/conda/miniconda3
core                     /vol/COMPEPIWS/conda/miniconda3/envs/core
gembs                    /vol/COMPEPIWS/conda/miniconda3/envs/gembs
                      *  /vol/COMPEPIWS/groups/chipseq2/conda/env_chipSeq2
                      *  


install the following tools using conda: ggplot2 (r-package), fastqc and the specific version of bedtools (version 2.22). (hint: search for the package here. The first hit is most probably what you need. Click on it and see the exact command for installing the package)
(env_chipSeq2) $ conda install -c conda-forge r-ggplot2
(env_chipSeq2) $ conda install -c bioconda fastqc bedtools=2.22


make sure the tools have been installed properly by: tool_name --help. For “ggplot2” start R in the terminal and load the library with library(ggplot2)
(env_chipSeq2) $ which R
/vol/COMPEPIWS/groups/chipseq2/conda/env_chipSeq2/bin/R

(env_chipSeq2) $ R

R version 4.0.5 (2021-03-31) -- "Shake and Throw"
Copyright (C) 2021 The R Foundation for Statistical Computing
Platform: x86_64-conda-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library("ggplot2")
Keep up to date with changes at https://www.tidyverse.org/blog/


update bedtools to the latest version, check the new version number.
(env_chipSeq2) $ conda update -c bioconda bedtools


deactivate the environment
(env_chipSeq2) $ conda deactivate
(base) $ conda deactivate


Basics in R

In this section we will learn some basic operations in R and GenomicRanges and learn how to visualize data using “ggplot2”. We will use two data sets. Both represent counts matrix in genomic bins of mouse genome.
Data frames


activate the core conda environment and start R session
source /vol/COMPEPIWS/conda/miniconda3/bin/activate /vol/COMPEPIWS/conda/miniconda3/envs/core


Load the following libraries: “ggplot2”, “reshape2” and “GenomicRanges”
library("reshape2")
library("GenomicRanges")
library("ggplot2")


Read into R the following two tables: /vol/COMPEPIWS/groups/shared/*txt. Consider the first row as a header.
path_kidney <- "/Users/zingo/Dropbox/CompEpi/data/groups/shared/kidney_14.5_mouse_1_2kbW_bed_counts.txt"
path_liver <- "/Users/zingo/Dropbox/CompEpi/data/groups/shared/liver_14.5_mouse_1_2kbW_bed_counts.txt"

df_kidney <- read.table(path_kidney, header=T)
df_liver  <- read.table(path_liver, header=T)


What is the dimension of each table?
dim(df_kidney)
[1] 1362728       6
dim(df_liver)
[1] 1362728       6


What are the column names of each table?
colnames(df_kidney)
[1] "Chr"      "Start"    "End"      "H3K27me3" "H3K36me3" "H3K9me3" 
colnames(df_liver)
[1] "Chr"      "Start"    "End"      "H3K27me3" "H3K36me3" "H3K9me3"


Calculate the genome length from each data set
kidney_genome_size <- sum(df_kidney$End) - sum(df_kidney$Start)
kidney_genome_size
[1] 2725456000
liver_genome_size <- sum(df_liver$End) - sum(df_liver$Start)
liver_genome_size
[1] 2725456000


Create a data frame by concatenating vertically the two data sets. Add a new column called (cell_type) to annotate the rows of liver table with "liver" and the kidney’s rows with "kidney"
df_kidney$cell_type <- "kidney"
df_liver$cell_type <- "liver"
df_kidney_liver <- rbind(df_kidney, df_liver)


What is the dimension of the new data frame?
dim(df_kidney_liver)
[1] 2725456       7


Reshape the data frame (excluding chr, start, end) from wide into long format using "cell_type" as variable id. What are the dimensions of this dataframe? (hint: melt function)
df_marks <- df_kidney_liver %>% 
            select(c("H3K27me3", "H3K36me3", "H3K9me3", "cell_type"))  %>% 
            melt(id.vars = "cell_type", variable.name = "mark", value.name = "enrichment")
            
dim(df_marks)
[1] 8176368       3


Plot the density of each variable considering the cell_type in two different panels. Give different trans- parent color for each variable (hint: ggplot, geom_density, facet_wrap). Save the plot as a pdf (hint: ggsave).
p_density <- ggplot(df_marks, aes(x=enrichment, fill=mark)) + 
         geom_density(alpha=0.3) +
         facet_grid(cell_type~mark) +
         labs(x="Enrichment", y="Density")

ggsave(filename="/Users/zingo/Dropbox/CompEpi/density.png", plot=p_density, device="png")


The plots look very skewed, try to set the limit of X axis to 100 and re-plot. Save the plot as a pdf.
p_scaled_density <- ggplot(df_marks, aes(x=enrichment, fill=mark)) + 
                geom_density(alpha=0.3) +
                facet_grid(cell_type~mark) +
                labs(x="Enrichment", y="Density") +
                xlim(0, 100)

ggsave(filename="/Users/zingo/Dropbox/CompEpi/scaled_density.png", plot=p_scaled_density, device="pdf")


Create a new data frame by concatenating horizontally the two data sets. Consider only the following columns: H3K27me3, H3K36me3 and H3K9me3.
df_marks_cell <- cbind(select(df_kidney, c("H3K27me3", "H3K36me3", "H3K9me3")),
                   select(df_liver,  c("H3K27me3", "H3K36me3", "H3K9me3")) )


Rename the header of the new data frame into: H3K27me3_liver, H3K36me3_liver, ..........
colnames(df_marks_cell) <- c("H3K27me3_kidney", "H3K36me3_kidney", "H3K9me3_kidney", "H3K27me3_liver", "H3K36me3_liver", "H3K9me3_liver")


Perform Principal Components Analysis on this data frame (hint: prcomp)
PCA <- prcomp(df_marks_cell, scale=TRUE)


How much variance is explained by PC1 and PC2?
PC1 explains 50.63% and PC2 23.28%
summary(PCA)
Importance of components:
  PC1    PC2    PC3     PC4     PC5     PC6
Standard deviation     1.7429 1.1818 0.9590 0.51405 0.47017 0.40092
Proportion of Variance 0.5063 0.2328 0.1533 0.04404 0.03684 0.02679
Cumulative Proportion  0.5063 0.7391 0.8923 0.93637 0.97321 1.00000


Plot a screeplot and cumulative screeplot
    var_explained = PCA$sdev^2 / sum(PCA$sdev^2)
    screeplot <- qplot(c(1:length(var_explained)), var_explained) + 
                 geom_line() + 
                 xlab("Principal Component") + 
                 ylab("Variance Explained") +
                 ggtitle("Scree Plot") +
                 ylim(0, 1)
         
    cum_var_explained = cumsum(var_explained)
    cum_screeplot <- qplot(c(1:length(cum_var_explained)), cum_var_explained) + 
                     geom_line() + 
                     xlab("Principal Component") + 
                     ylab("Variance Explained") +
                     ggtitle("Cumulative Scree Plot") +
                     ylim(0, 1)


Explore the principal components loadings. How many are there?
> summary(PCA$x)
  PC1                PC2                PC3                 PC4                PC5           
 Min.   : -3.1453   Min.   :-26.6655   Min.   :-20.22737   Min.   :-92.9514   Min.   :-25.56574  
 1st Qu.: -0.7478   1st Qu.: -0.2213   1st Qu.: -0.45789   1st Qu.: -0.2635   1st Qu.: -0.24920  
 Median :  0.2027   Median :  0.1941   Median : -0.05347   Median : -0.0446   Median :  0.04311  
 Mean   :  0.0000   Mean   :  0.0000   Mean   :  0.00000   Mean   :  0.0000   Mean   :  0.00000  
 3rd Qu.:  0.8873   3rd Qu.:  0.5127   3rd Qu.:  0.45702   3rd Qu.:  0.2857   3rd Qu.:  0.22380  
 Max.   :413.5897   Max.   :149.2764   Max.   :136.07828   Max.   : 16.9057   Max.   : 41.20703  
      PC6            
 Min.   :-12.500283  
 1st Qu.: -0.160926  
 Median : -0.007328  
 Mean   :  0.000000  
 3rd Qu.:  0.149057  
 Max.   :  5.646389


Plot the PC1 and PC2 loadings. How are the variables distributed across PC1 and PC2?
library(PCAtools)

biplot(PCA, showLoadings = TRUE,
       labSize = 5, pointSize = 5, sizeLoadingsNames = 3)


GenomicRanges


Create two GenomicRange objects from the liver and kidney data frames using columns 4,5 and 6 as metadata
BiocManager::install("GenomicRanges")

library("GenomicRanges")

gr_kidney <- GRanges(seqnames = df_kidney$Chr,
                    ranges = IRanges(start=df_kidney$Start, end=df_kidney$End),
                    mcols= select(df_kidney,c("H3K27me3", "H3K36me3", "H3K9me3")))

gr_liver <- GRanges(seqnames = df_liver$Chr,
                    ranges = IRanges(start=df_liver$Start, end=df_liver$End),
                    mcols= select(df_liver,c("H3K27me3", "H3K36me3", "H3K9me3")))


Calculate the total number of bases covered in these two objects
coverage(gr_kidney)


Subset one of the objects to have only chr2 lines
gr_kidney[seqnames(gr_kidney) == "chr2"]
gr_liver[seqnames(gr_liver) == "chr2"]


Shift one of the object’s ranges by 100 upstream
gr_kidney_shfited <- shift(gr_kidney, +100)


Find overlapping regions between the two objects
findOverlaps(gr_kidney_shfited, gr_liver)
    Hits object with 2725435 hits and 0 metadata columns:
            queryHits subjectHits
            <integer>   <integer>
        [1]         1           1
        [2]         1           2
        [3]         2           2
        [4]         2           3
        [5]         3           3
        ...       ...         ...
  [2725431]   1362726     1362726
  [2725432]   1362726     1362727
  [2725433]   1362727     1362727
  [2725434]   1362727     1362728
  [2725435]   1362728     1362728
  -------
  queryLength: 1362728 / subjectLength: 1362728


Make a GRangeList object from the two GRange objects
GRangesList("Kidney" = gr_kidney, "Liver" = gr_liver)


SLUR


ChIP Seq Analysis

Nextflow ChIP-seq pipeline

Organize Data

Organize your raw data it is a good practice to have the raw data placed in one folder:


Enter your working directory /vol/COMPEPIWS/groups//tasks
cd /vol/COMPEPIWS/groups/chipseq2/tasks


List the raw fastq files under /vol/COMPEPIWS/data/reduced/ChIP-seq/
$ ls  /vol/COMPEPIWS/data/reduced/ChIP-seq/
kidney_14.5_H3K27ac_1.fastq.gz   kidney_14.5_H3K9me3_2.fastq.gz   kidney_15.5_H3K9ac_1.fastq.gz   liver_14.5_H3K4me1_2.fastq.gz   liver_15.5_H3K36me3_1.fastq.gz
kidney_14.5_H3K27ac_2.fastq.gz   kidney_14.5_control_1.fastq.gz   kidney_15.5_H3K9ac_2.fastq.gz   liver_14.5_H3K4me3_1.fastq.gz   liver_15.5_H3K36me3_2.fastq.gz
kidney_14.5_H3K27me3_1.fastq.gz  kidney_14.5_control_2.fastq.gz   kidney_15.5_H3K9me3_1.fastq.gz  liver_14.5_H3K4me3_2.fastq.gz   liver_15.5_H3K4me1_1.fastq.gz
kidney_14.5_H3K27me3_2.fastq.gz  kidney_15.5_H3K27ac_1.fastq.gz   kidney_15.5_H3K9me3_2.fastq.gz  liver_14.5_H3K9ac_1.fastq.gz    liver_15.5_H3K4me1_2.fastq.gz
kidney_14.5_H3K36me3_1.fastq.gz  kidney_15.5_H3K27ac_2.fastq.gz   kidney_15.5_control_1.fastq.gz  liver_14.5_H3K9ac_2.fastq.gz    liver_15.5_H3K4me3_1.fastq.gz
kidney_14.5_H3K36me3_2.fastq.gz  kidney_15.5_H3K27me3_1.fastq.gz  kidney_15.5_control_2.fastq.gz  liver_14.5_H3K9me3_1.fastq.gz   liver_15.5_H3K4me3_2.fastq.gz
kidney_14.5_H3K4me1_1.fastq.gz   kidney_15.5_H3K27me3_2.fastq.gz  liver_14.5_H3K27ac_1.fastq.gz   liver_14.5_H3K9me3_2.fastq.gz   liver_15.5_H3K9ac_1.fastq.gz
kidney_14.5_H3K4me1_2.fastq.gz   kidney_15.5_H3K36me3_1.fastq.gz  liver_14.5_H3K27ac_2.fastq.gz   liver_14.5_control_1.fastq.gz   liver_15.5_H3K9ac_2.fastq.gz
kidney_14.5_H3K4me3_1.fastq.gz   kidney_15.5_H3K36me3_2.fastq.gz  liver_14.5_H3K27me3_1.fastq.gz  liver_14.5_control_2.fastq.gz   liver_15.5_H3K9me3_1.fastq.gz
kidney_14.5_H3K4me3_2.fastq.gz   kidney_15.5_H3K4me1_1.fastq.gz   liver_14.5_H3K27me3_2.fastq.gz  liver_15.5_H3K27ac_1.fastq.gz   liver_15.5_H3K9me3_2.fastq.gz
kidney_14.5_H3K9ac_1.fastq.gz    kidney_15.5_H3K4me1_2.fastq.gz   liver_14.5_H3K36me3_1.fastq.gz  liver_15.5_H3K27ac_2.fastq.gz   liver_15.5_control_1.fastq.gz
kidney_14.5_H3K9ac_2.fastq.gz    kidney_15.5_H3K4me3_1.fastq.gz   liver_14.5_H3K36me3_2.fastq.gz  liver_15.5_H3K27me3_1.fastq.gz  liver_15.5_control_2.fastq.gz
kidney_14.5_H3K9me3_1.fastq.gz   kidney_15.5_H3K4me3_2.fastq.gz   liver_14.5_H3K4me1_1.fastq.gz   liver_15.5_H3K27me3_2.fastq.gz


How many files per cell_type per time point per replicate are there?
There are two types cell_type; kidney and liver. Each cell_type has 32 files.


What are the histone marks you are dealing with?
H3K27ac, H3K27me3, H3K4me1, H3K4me3, H3K9ac, H3K9me3, H3K36me3


how many "control" per replicate do you have?
There are 8 control in total and 4 controls per replicate.
    ls -lah /vol/COMPEPIWS/data/reduced/ChIP-seq/ | grep "control" 
    -rw-r--r-- 1 abdosa abdosa  63M Feb 24  2020 kidney_14.5_control_1.fastq.gz
    -rw-r--r-- 1 abdosa abdosa  40M Feb 24  2020 kidney_14.5_control_2.fastq.gz
    -rw-r--r-- 1 abdosa abdosa 141M Feb 24  2020 kidney_15.5_control_1.fastq.gz
    -rw-r--r-- 1 abdosa abdosa 138M Feb 24  2020 kidney_15.5_control_2.fastq.gz
    -rw-r--r-- 1 abdosa abdosa  32M Feb 24  2020 liver_14.5_control_1.fastq.gz
    -rw-r--r-- 1 abdosa abdosa  34M Feb 24  2020 liver_14.5_control_2.fastq.gz
    -rw-r--r-- 1 abdosa abdosa 139M Feb 24  2020 liver_15.5_control_1.fastq.gz
    -rw-r--r-- 1 abdosa abdosa 222M Feb 24  2020 liver_15.5_control_2.fastq.gz


Create a “data” folder in your group working directory and create symbolic links of the relevant files according to your group to the “data” folder
ln -s files/*.csv data/


Generating sample sheet


prepare the samplesheet. You can prepare it manually in any editor of your choice, but it is recom- mended to generate it on the command line taking the advantage of the file names, this is particularly useful when you have many samples. You can loop over the files and extract the information you need to create the samplesheet (hint: basename, cut).
#change the path to where you files are located. 
for i in  /vol/COMPEPIWS/data/reduced/ChIP-seq/*;
do 
    filename=$(basename ${i} .fastq.gz);
    fastq1=$i;
    group=$(basename ${i}|cut -d "_" -f 1-3);
    replicate=$(basename ${fastq1}|cut -d "." -f 1-2|cut -d "_" -f 4 );
    mark=$(echo ${group}|cut -d "_" -f 3);
    control=$(echo $group|cut -d _ -f 1-2|awk '{print $0"_Input"}' ); 
    if [ $mark == "control" ] ;
    then 
        echo "$control,$replicate,$fastq1,,,"; 
    else  
        echo "$group,$replicate,$fastq1,,$mark,$control"; 
    fi ; 
    done 
    |awk 'BEGIN { print "group,replicate,fastq_1,fastq_2,antibody,control" } {print}'  > sampleheet.csv 


Running the pipeline

Based on the contents of config file:
params {
    max_memory = 123.GB
    max_cpus = 12
    max_time = 24.h

    // Genome references
    fasta = '/vol/COMPEPIWS/pipelines/references/splitted/mm10_reduced.fa'
    fasta_index = '/vol/COMPEPIWS/pipelines/references/splitted/mm10_reduced.fa.fai'
    gtf = '/vol/COMPEPIWS/pipelines/references/genome_genes.gtf'
    blacklist = '/vol/COMPEPIWS/pipelines/references/mm10-blacklist-reduced.bed'
    macs_gsize = '152134205'
    bwa_index = '/vol/COMPEPIWS/pipelines/references/splitted/atacseq/BWAIndex/mm10_reduced.fa'

    single_end = 'True'
    narrow_peak = 'True'
    // skip some steps
    skip_diff_analysis = 'True'
    save_macs_pileup = 'False'
    skip_preseq = 'True'
    skip_plot_profile = 'True'
    skip_plot_fingerprint = 'True'
    skip_spp = 'True'
}


process{

    withLabel:process_low {
      cpus = 1
      memory = 2.GB
      time = 2.h
    }
    withLabel:process_medium {
      cpus = 2
      memory = 4.GB
      time = 4.h
    }
    withLabel:process_high {
      cpus = 4
      memory = 8.GB
      time = 8.h
    }

    errorStrategy = 'ignore'


Is the data paired end or single end?
The data is single-end


Which steps of the pipelines are skipped?

Skip Differential Analysis
Skip Preseq
Skip deepTools plotProfile
Skip deepTools plotFingerprint
Skip Phantompeakqualtools


Why is the blacklist file used? What do these regions represent?
Blacklist file is used in order to ensure that reads that align to specific genomic regions will be filtered out. These regions are often found at specific types of repeats such as centromeres, telomeres and satellite repeats. The ENCODE blacklists regions typically appear uniquely mappable so simple mappability filters do not remove them. These blacklists are applicable to functional genomic data like ChIP-seq based on short-read sequencing (20-100bp reads).


Which tool will be used for mapping?
Burrows-Wheeler Aligner - BWA


List in order the main steps of the pipeline for processing ChIP-seq data?
The successive steps of the PIpeline are listed below:

Raw read QC (FastQC)
Adapter trimming (Trim Galore)
Alignment (BWA)
Merge alignments from multiple libraries of the same sample (picard)
Create IGV session file containing bigWig tracks, peaks and       differential sites for data visualisation (IGV)
Present QC for raw read, alignment, peak-calling and differential binding results (MultiQC, R)


Quality Control

Exploring the result folder


Have a look into the pipeline_info/execution_report.html:
• How long did the pipeline take to finish?
7h 43m 20s
• Which step took the most CPU/Memory usage?
The most CPU usage has been noted in case og BWA_MEM.
Highest percentage of resources being alloted to Consensus_Peaks

Highest memory usage has been noted in case of MERGED_BAM


Where are the aligned reads located (*bam files)? You will use these files later
/vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/
find /vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run -name '*bam'
/vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/liver_15.5_H3K9me3_R2.mLb.clN.sorted.bam
/vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/kidney_15.5_H3K4me3_R2.mLb.clN.sorted.bam
/vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/liver_15.5_H3K9ac_R1.mLb.clN.sorted.bam
...


Where are the peaks files located (*narrowPeak)? which tool was used for peak calling?
/vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results//bwa/mergedLibrary/macs/narrowPeak
$ find /vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/ -name '*narrowPeak'
/vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/multiqc/narrowPeak
/vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/macs/narrowPeak
/vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/macs/narrowPeak/liver_14.5_H3K9ac_R2_peaks.narrowPeak
/vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/macs/narrowPeak/kidney_14.5_H3K4me1_R2_peaks.narrowPeak
/vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/macs/narrowPeak/kidney_15.5_H3K36me3_R2_peaks.narrowPeak
/vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/macs/narrowPeak/kidney_15.5_H3K27ac_R2_peaks.narrowPeak
...


Where are the consensus peaks located (*consensus_peaks.bed)?
For each histon marker under the following directory:
/vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/macs/narrowPeak/consensus/
$ find /vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/ -name '*consensus_peaks.bed'
/vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/macs/narrowPeak/consensus/H3K4me1/H3K4me1.consensus_peaks.bed
/vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/macs/narrowPeak/consensus/H3K27me3/H3K27me3.consensus_peaks.bed
/vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/macs/narrowPeak/consensus/H3K27ac/H3K27ac.consensus_peaks.bed
/vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/macs/narrowPeak/consensus/H3K36me3/H3K36me3.consensus_peaks.bed
/vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/macs/narrowPeak/consensus/H3K9ac/H3K9ac.consensus_peaks.bed
/vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/macs/narrowPeak/consensus/H3K9me3/H3K9me3.consensus_peaks.bed
...


Where are the signal tracks located (*bigWig)? You will use these files later
/vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/bigwig/
$ find /vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/ -name '*bigWig'
/vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/bigwig/kidney_14.5_H3K4me1_R2.bigWig
/vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/bigwig/kidney_15.5_H3K27me3_R1.bigWig
/vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/bigwig/kidney_14.5_H3K27ac_R2.bigWig
/vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/bigwig/kidney_14.5_H3K9ac_R1.bigWig
/vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/bigwig/liver_14.5_Input_R1.bigWig


MultiQC

. Explore the different section of the MultiQC HTML report generated by the pipeline. You find it under
results/multiqc/narrowPeak folder.

In the “LIB: FASTQC (raw)” section:
• How many reads do you have for each sample?


• How good is the quality of the bases in the reads?

• How good is the quality of the reads?

• What are the read lengths for the samples?

• Is there a sign for adapter contamination?
No samples found with any adapter contamination > 0.1%
• compare the number of reads after trimming? do you loose many reads after trimming?

In the “MERGED LIB: SAMTools (unfiltered)”:
• How is the mapping efficiency?

From the graph we can see that SAMTools gives us a pretty good mapping efficiency since most of the reads have been mapped.


• What is the duplication rate?
From the plot below, we notice that the percentage of duplicate reads is much less as compared to the unique reads. So we can say the duplication rate is much low.


In the “MERGED LIB: MACS2 PEAK COUNT” section:
• Do you get the similar number of peaks for all histone marks?

From the graph we notice
• Overall: which marks get the highest/lowest number of peaks?
5. In the “MERGED LIB: MACS2 FRIP SCORE” section:
• What does FRiP score mean?
• which marks get the highest/lowest FRiP scores? How is it related to the mark signal distribution
(narrow/broad)?
deepTools2

deepTools has different quality control tools that can be used to evaluate the ChIP-seq data quality. We will use some of them (read about the tools from the Documentation).


Activate the core environment.


Explore the batch effect between your samples (hint: multiBigwigSummary, plotPCA, plotCorrelation, recognize the different parameters of each tool)?
• PCA plot might not be very informative due to the large number of samples, try to plot it in R with ggplot. There is a ready file to be read in R if you provide “--outFileNameData” option to “plotPCA” function.
# install.packages("tidyverse")
# install.packages("reshape2")

library("tidyverse")
library("reshape2")
path_2_data <- "/Users/zingo/Dropbox/CompEpi/Pipline/deeptools/PCAoutput"

x <- read.table(path_2_data, header=T)
xm <- melt(x[1:2,-c(66)], id.vars="Component")
xmd <- dcast(xm, variable ~ Component)

xmd$cell_type[grepl("liver",xmd$variable)] <- "liver"
xmd$cell_type[grepl("kidney",xmd$variable)] <- "kidney"
xmd$mark[grepl("H3K27ac",xmd$variable)] <- "H3K27ac"
xmd$mark[grepl("H3K27me3",xmd$variable)] <- "H3K27me3"
xmd$mark[grepl("H3K9me3",xmd$variable)] <- "H3K9me3"
xmd$mark[grepl("H3K36me3",xmd$variable)] <- "H3K36me3"
xmd$mark[grepl("H3K4me3",xmd$variable)] <- "H3K4me3"
xmd$mark[grepl("H3K4me1",xmd$variable)] <- "H3K4me1"
xmd$mark[grepl("H3K9ac",xmd$variable)] <- "H3K9ac"
xmd$mark[grepl("Input",xmd$variable)] <- "Input"


colnames(xmd) <- c("variable","PC1","PC2", "cell_type", "mark")

ggplot(xmd, aes(x=PC1, y=PC2, color=mark)) + 
  geom_point(aes(shape=cell_type), size=3) + 
  theme(legend.position="top") + xlab("PC1:21.5%") + ylab("PC2:15.6%")

ggsave("PCA_R.png")

java -Xmx3000M -Djava.awt.headless=true -jar 

/vol/COMPEPIWS/softwares/ChromHMM/ChromHMM.jar BinarizeBam 

-b  \ /vol/COMPEPIWS/softwares/ChromHMM/CHROMSIZES/mm10.txt 

inputs 

cellmarkfiletable.txt 

output_folder_name
(core) $ multiBigwigSummary bins -b /vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/bigwig/*bigWig  -o multiBigwigSummaryResults.npz
(core) $ plotCorrelation --corData multiBigwigSummaryResults.npz --corMethod spearman --whatToPlot heatmap --plotFile plotCorrelationSpearman
(core) $ plotCorrelation --corData multiBigwigSummaryResults.npz --corMethod pearson --whatToPlot heatmap --plotFile plotCorrelationPearson
(core) $ plotPCA  --corData multiBigwigSummaryResults.npz --plotFile PCA --outFileNameData PCAoutput


Based on the plotCorrelation outputs, which marks are more similar to each other.
In liver & kidney cells: H3K9ac ~ H3K4me3
In liver & kidney cells: H3K27ac smilar to H3K4me1
H3K36me3 kidney smilar to H3K36me3 liver
H3K27me3 kidney is similar H3K27me3 liver


How well are the signals (marks) separated from the background. Since you have many samples it would be wise to do each histone mark of all samples separately. i.e, one plot for all H3K4me3, another for H3K4me1 ........ etc (hint: plotFingerprint)


check the failed smaples and linked to here


![](https://i.imgur.com/4zAECUi.png)
![](https://i.imgur.com/pwuc46d.png)
![](https://i.imgur.com/iObY2hm.png)


``` {bash}
      /vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/kidney_14.5_H3K27ac_R1.mLb.clN.sorted.bam
      /vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/kidney_14.5_H3K27ac_R1.mLb.clN.sorted.bam
   /vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/kidney_14.5_Input_R1.mLb.clN.sorted.bam

declare -a arr=("H3K27ac" "H3K27me3" "H3K4me1" "H3K4me3" "H3K9ac" "H3K9me3" "H3K36me3" "Input")

for mark in "${arr[@]}"
do
    bams=($(ls /vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/*${mark}*bam)) 
    names=($(for i in ${bams[@]}; do echo $(basename $i)|cut -f 1-2 -d . ; done))

    echo -e "sbatch -J fp_${mark} -o output_${mark}.out --wrap=\"plotFingerprint -b ${bams[@]} -plot ${mark}_fingerprint.png --labels ${names[@]} --plotTitle ${mark}\ Fingerprint\""

done
```


Choose one of the replicates of one stage of one cell type and plot profiles of the 7 histone marks around TSS (+-1500bp) and across the gene body (+-1500bp). Use the gene file produced by your job under results/genome/genome_genes.bed (hint: plotProfile, plotHeatmap). Which marks are enriched at TSS and which are enriched at the gene body?
H3k4me3 is enriched at TTS
H3K36me3 is enriched at gene body
$ computeMatrix scale-regions \
-S /vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/bigwig/kidney_15.5_H*_R1.bigWig \
-R /vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/genome/genome_genes.bed \
-b 1500 -a 1500
-o Kidney_15.5_R1.mat.gz

$ plotProfile -m Kidney_15.5_R1.mat.gz \
          -out Kidney_15.5_R1_Profile_heatmap.png \
          -plotType heatmap
          --numPlotsPerRow 2
          --plotTitle "Kidney_15.5_R1 Profile Plot"


Exploratory Analysis using IGV

Steps 1 to 4 done:
IGV installed on local machine.
Opened IGV session from results/igv/narrowPeak/igv_session.xml
Loaded the gene tracks from results/genome/genome_genes.bed
After removing the consensus tracks:

![](https://i.imgur.com/cipak9y.png)

Uploading the gene tracks from results/genome/genome_genes.bed:

![](https://i.imgur.com/c6jIUgd.png)


Do you see clear peaks for all marks? which marks are sharp and which ones are broad(er)?
From the picture it is clear that we get clear peaks for the marks.

If we consider this gene on chromosome 18 we notice that the we get the sharp peaks around the  Transciption start sites are enriched for the marks H3K9Ac, H3K27Ac and H3K4me3 both in the kidney and liver cells.
Enrichment at the TSS look like sharp peaks.
Broader marks are noticed where the  gene body has been enriched.


Realize the local distribution of the different histone marks. Which one(s) overlap with TSS and which of them ovelap with gene body?
Analysisng Chromosome 19, gene_name "E030024N20Rik"; transcript_id "NR_033228"
Histone marks that overlap with TSS: H3K9Ac, H3K27Ac and H3K4me3
Histone marks tha overlap with the gene body: H3K4me1, H3K36me3,H3K9me3, H3K27 me3


Do you always see an agreement between the called peaks of the two replicates of the same mark in the same cell type? To get a quantitative answer have a look to the Upset plots under results/bwa/mergedLibrary/macs/narrowPeak/consensus//*.boolean.intersect.plot.pdf. What do you understand from this plot (choose two of the sharp marks)
We have analysed the intersect plots we observe the following for the respective Histone marks:
H3K4me1:
We do not always see an agreement between peaks called between the to replicates of the same cell type.
Considering Kidney 15.5 cells replicate 1 and replicate 2: We notice that for most of the cases there is an agreement between the peaks called.
The peaks in liver liver_14.5 cells repliacte 1 and 2 are not always in agreement with each other.
H3K4me3:
Kidney 14.5 cells replicate 1 and replicate 2: The peaks are not always at agreement with one another.
LIver 15.5 cells replicate 1 and replicate 2 : The peaks are not always in agreement.


Chromatin segmentation with ChromHMM

ChromHMM Installation

Running ChromHMM


Segmentation folder created
Create a folder called “inputs” and create symbolic links of all aligned read files to this folder (there should
be 64 bam files)!
To run BinarizeBam you need to


Prepare a tab delimited file “cellmarkfiletable.txt”. Prepare this file in such away that the replicates of one stage per cell type are pooled together. i.e at the end we want to have one segmentation result per cell type per stage: liver_14.5/15.5 and kidney_14.5/15.5. Use the control samples in “cellmarkfiletable.txt” file to adjust the binarization threshold (hint: loop over the bam files, extract the cell type and histone mark names from the file name).
declare -a cells=("kidney_14.5" "kidney_15.5" "liver_14.5" "liver_15.5")
declare -a marks=("H3K27ac" "H3K27me3" "H3K4me1" "H3K4me3" "H3K9ac" "H3K9me3" "H3K36me3")
declare -a replicates=("R1" "R2")

path=""

for sample in "${cells[@]}"
do
    for replicate in "${replicates[@]}"
    do
        for mark in "${marks[@]}"
            do
            printf "${sample}\t${mark}\t${path}${sample}_${mark}_${replicate}.mLb.clN.sorted.bam\t${path}${sample}_Input_${replicate}.mLb.clN.sorted.bam\n"	
        done
    done
done


Run the BinarizeBam command with 200bp as a bin size (be careful in selecting the right chromosom- Lengthfile) [Make sure you run this task on one of the worker nodes]
java -Xmx3000M -Djava.awt.headless=true -jar  /vol/COMPEPIWS/softwares/ChromHMM/ChromHMM.jar  BinarizeBam  -b 200   /vol/COMPEPIWS/softwares/ChromHMM/CHROMSIZES/mm10.txt inputs/ cellmarkfiletable.txt Chip_chrom_segmentation


Run the LearnModel command with different number of states:5-15(submitthe commandsas jobs(sbatch) to SLURM, refer to the generalTask 4.4): [Make sure you run this task on one of the master node]
states=($(seq 5 1 15 ))

for state in ${states[@]}
do
    sbatch --job-name=LM_state${state} -o job_$state.out --wrap="java -Xmx3000M -Djava.awt.headless=true -jar  /vol/COMPEPIWS/softwares/ChromHMM/ChromHMM.jar LearnModel /vol/COMPEPIWS/groups/chipseq2/tasks/segmentation/binarizedBAMs /vol/COMPEPIWS/groups/chipseq2/tasks/segmentation/learnModels_CellTypeTimePoint/"${state}"  ${state} mm10"
done


Explore the output files called webpage_$state.html. Especially, emission heatmap, Ref- Seq_TSS_neighborhood and overlap figures. Comment on the “new” states you get.

Emission parameters for the model with 5 states:

State 1: H3K27me3 -> Bivalent Promoter
State 2: no histone marks are enriched -> Heterochromatin
State 3: H3k36me3 ->Strong Transcription
State 4: H3K27ac, H3K4me1 -> Strong, TSS-dist enhancer
State 5: H3k4me3, H3k9ac,H3k27ac,H3K4me1 -> Active Promoter
Emission parameters for the model with 6 states:

State 1: H3K27me3 -> Bivalent Promoter
State 2: no histone marks are enriched -> no signal
State 3: H3k9me3 -> Heterochromatin
State 4: H3k36me3 -> Strong Transcription
State 5: H3k27ac, H3k4me1 -> Strong TSS-dist enhancer
State 6: H3k4me3, H3k9ac,H3k27ac,H3K4me1 -> Active Promoter


Compare the different models to assess what is the proper minimum number of states.Use the model with the highest number of states as a reference (hint: CompareModels function). What is the minimum reasonable number of states you think is appropriate?
 java -Xmx3000M -Djava.awt.headless=true -jar  /vol/COMPEPIWS/softwares/ChromHMM/ChromHMM.jar CompareModels emissions/emissions_15.txt emissions/ results_


A look at the graph suggests that the model with 5 and 6 states fails completely to enrich the 15th state abd the models with 5,6,7,8,9 states cannot enrich the state 4
So this makes it clear that a minimum of 10 state model is required to mark all the states.


The “cellmarkfiletable” in step 5 was prepared to produce 4 segmentation files (1 per stage per cell type). Repeat step 5 in a way that you merge the stages of each cell type in order to produce only 2 segmentation files (1 for each cell type).
https://i.imgur.com/icr4PAT.png
declare -a cells=("kidney" "liver")
declare -a stages=("14.5" "15.5")
declare -a marks=("H3K27ac" "H3K27me3" "H3K4me1" "H3K4me3" "H3K9ac" "H3K9me3" "H3K36me3")
declare -a replicates=("R1" "R2")

for sample in "${cells[@]}"
do
        for stage in "${stages[@]}"
        do
                for replicate in "${replicates[@]}"
                do
                        for mark in "${marks[@]}"
                        do
                                printf "${sample}\t${mark}\t${sample}_${stage}_${mark}_${replicate}.mLb.clN.sorted.bam\t${sample}_${stage}_Input_${replicate}.mLb.clN.sorted.bam\n"
                        done
                done
        done

done

java -Xmx3000M -Djava.awt.headless=true -jar  /vol/COMPEPIWS/softwares/ChromHMM/ChromHMM.jar  BinarizeBam  -b 200   /vol/COMPEPIWS/softwares/ChromHMM/CHROMSIZES/mm10.txt inputs/ binarizedBAMs_CellType/cellmarkfiletable.txt binarizedBAMs_CellType


Repeat steps 6 with the new “cellmarkfiletable”. Do it only with 15 states model
state=15
sbatch --job-name=LM_state${state} -o job_$state.out --wrap="java -Xmx3000M -Djava.awt.headless=true -jar  /vol/COMPEPIWS/softwares/ChromHMM/ChromHMM.jar LearnModel /vol/COMPEPIWS/groups/chipseq2/tasks/segmentation/binarizedBAMs_CellType /vol/COMPEPIWS/groups/chipseq2/tasks/segmentation/learnModels_CellType/"${state}"  ${state} mm10"


Based on the emission heatmap, overlap and enrichment of this new 15-state model, try to assign biological labels to the states. Use figure 2 as reference and use the shortcuts from table 1. In case you couldn’t find the proper annotation from figure 2 try to be creative and give a label by yourself.

Emission Heatmap

Overlap-Kidney cells:

Overlap-Liver cells:

From an analysis of the Fold enrichment plots of the kidney and liver cells it is clear that the
On analysis of the three graphs we can assign the following biological states to the 15 states:
Promoter -> Enhancer -> Transcription -> Heterochromatin
1 -> Tx_S    -> Strong Transcription
2 -> Tx_W    -> Weak Transcription
3 -> Tx_I    -> Transcription Initiation
4 -> Enh_W   -> Weak Enahncer
5 -> Enh_A   -> Active Enhancer
6 -> Enh_S   -> Strong TSS Prox Enhancer
7 -> Pr_A    -> Active promoter
8 -> Pr_W    -> Weak promoter
9 -> Pr_F    -> Flanking Promoter
10-> Pr_B    -> Bivalent Promoter(less enriched than state 7 for the regions show in figure 1)
11-> Enh_P   -> Poised Enhancer
12-> Het_P   -> Polycomb_associated heterochromatin
13-> NS      -> No signal
14-> Het_W   -> Weak Heterochromatin
15-> Mix


Rename the states in column 4 of the files *dense.bed (or *segments.bed) using the biological states from the previous question (consider the command MakeBrowserFiles from ChromHMM or awk in bash to do the task). Note: Redirect the output into new files in a subfolder called “relabeled” to avoid overwriting the original files.
java -Xmx3000M -Djava.awt.headless=true -jar  /vol/COMPEPIWS/softwares/ChromHMM/ChromHMM.jar MakeBrowserFiles -m labels.txt \
                 -n 15 \
                 kidney_15_dense.bed \
                 kidney_15 \
                 results/kidney_15_dense_


Run "reorder" command to generate the emission heatmap with the new labels.
 java -Xmx3000M -Djava.awt.headless=true -jar  /vol/COMPEPIWS/softwares/ChromHMM/ChromHMM.jar Reorder -m ../relabeled/labels.txt ../model_15.txt ./


Copied the relabeled segmentation files of Q12 and the emission heatmap of Q13 into the shared folder under your group sub-directory /vol/COMPEPIWS/groups/shared/


Exploration of chromatin state segmentation using IGV


Which state(s) is(are) more likely overlapping with TSS?
Gene: Rbbp8

Enhancer and Promoter regions are likely to overlap with TSS


Upload the segmentation of the other cell type (one replicate). Can you pin point some differential states
between the two cell types? save some snapshots.

It can be clearly seen from the plot that the two differential states are : Enh_W in liver cells and Enh_S in kidney cells.


Exploratory Analysis using R

In this section we will explore the segmentation results of 15-state model (the relabeled) by looking to their length distributions, genome coverage percentages. We will do this in R and ggplot package for plotting. This task will teach how to visualize multiple layers of information in one plot. You will try to summarize all what you need in a data frame for a later easy plotting. The final data frame should look like:
chromosome   start    end   state     sample     length
chr18        12334    13034  2_Enh_A    liver       700
The steps 1-9 will help you to prepare such a table. However, feel free to do it in your own way.


Activate the “core” environment.
source /vol/COMPEPIWS/conda/miniconda3/bin/activate /vol/COMPEPIWS/conda/miniconda3/envs/core


Start R


Read all *(reordered)dense.bed files from task 4.2.12 from “segmentation” directory results into R. (hint: use list.files, lapply, read.table functions)

dir <- "/Users/zingo/Dropbox/CompEpi/Pipline/CHMM/segmentation"
regex <- "*dense.bed$"
filenames <- list.files(path=dir,
                        pattern=regex,
                        full.names=TRUE)

filenames

BEDs <- lapply(filenames,
               read.table,
               header=FALSE,
               sep="\t",
               stringsAsFactors=FALSE,
               skip=1)

BEDs[[1]]$sample <- "kidney"
BEDs[[2]]$sample <- "liver"


Merge the results into one data frame (hint: rbind function).
df <- rbind(BEDs[[1]], BEDs[[2]])


Extract the sample names, they should be two unique names (hint: strsplit).


Make a vector of names with length equal to the data frame length and consider the different name numbers for each sample (hint: for, rep)


Add a sample name column to the data frame.


Calculate the segment length and add it as a new column to the data frame.
segments_df <- select(df, V1, V2, V3, V4, sample)
colnames(segments_df) <- c("chromosome", "start", "end", "state", "sample")
segments_df$length <- segments_df$end - segments_df$start


Calculate the segment percentage per sample (hint: melt, aggregate, dcast functions).
library(dplyr)
df_m  <- melt(segments_df, id.vars=c("state", "sample","length")) %>%
         group_by(state, sample) %>% 
         summarise(lengths = sum(length))

sum_geome_kidney <- sum(df_m$lengths[df_m$sample=="kidney"])
sum_geome_liver  <- sum(df_m$lengths[df_m$sample=="liver"])

df_mm <- df_m %>% mutate(percentage=case_when(sample=="kidney" ~ lengths / sum_geome_kidney * 100,
                                              sample == "liver" ~ lengths / sum_geome_liver * 100,
                                              T ~ 0))


Plot and save the length distributions per sample (hint: ggplot() + geom_boxplot() + facet_wrap(), ggsave).
p_length_ditribution <- ggplot(segments_df, aes(y=length, x=reorder(state,length), fill=sample)) + 
                    geom_boxplot() +
                    scale_y_continuous(trans='log2') +
                    xlab("States") +
                    ylab("Segments Lengths") 


Plot and save the segment genomic percentage per sample as stacked bar plots (hint: ggplot() +
geom_barplot(), ggsave).
ggplot(df_mm, aes(fill=state, y=percentage, x=sample)) + 
    geom_bar(position="stack", stat="identity")


Based on the plots from task 10 and 11, comment on the state distributions/lengths across the cell types and stages.
13_NS state in both samples are the longest segments with around 67% of the genomic regions.
8_Pr_W state in both samples are the shortest segments with around 0.14% of the genomic regions
4_Enh_W in kdiney 5.79% of the genomic regions while in liver 3.62%


Integrative Analysis

Loading the IGV session:

What are the methylation states of the promoters and the gene body in general (high, low)?

The gene body is Highly methylated region and the promoters are the unmethylated regions

Which states from DNA methylation segmentation (MethylSeekR) overlap with chromatin segmentation
(ChromHMM).

PMD: Enhancer region
UMR: Promoter, Weak TSS
HMR: TX_S(Strong TRanscription)

Do you observe overlap between the accessibility peaks and specific chromatin states? Which ones?


From the plot it is clearly visible that there is an overlap between the accessibility peaks and specific chromatin states. THis is because the annotated chromatin state corresponding to the accessible region Interval_362 is a promoter or a Transcription start site.

Can you find some examples for unexpressed genes. What are the methylation states of their promoters and
bodies? Which histone marks are enriched at the promoters and bodies of these genes?

Gene cndp1: No gene expression


Do you observe positive or negative associations between DNA methylation, chromatin accessibility, indi-
vidual histone marks and gene expression? Do the association differ in certain types of regions (e.g. promot-
ers, gene bodies, enhancers)


DNA methylation is negatively correlated to Chromatin accessibilty.
From the figure, it is clearly visible that the HMRs(Highly methylated regions) are the closed chromatin.
Interval_643:UMR:Gene expression occurs