Skip to content

Instantly share code, notes, and snippets.

@debangana234
Created July 31, 2024 17:30
Show Gist options
  • Save debangana234/5f25a6d2556778283fd481c9a6dbde41 to your computer and use it in GitHub Desktop.
Save debangana234/5f25a6d2556778283fd481c9a6dbde41 to your computer and use it in GitHub Desktop.
title author date theme highlight output tags
ChIP-seq2
Fadel Berakdar, Debangana Chakraborty
united
tango
html_document
toc toc_depth number_sections toc_float
true
3
true
collapsed smooth_scroll
true
false
chipseq

General Tasks

Get to know the system

  1. Run sinfo -lN on the master node. How many nodes do you see? How many CPUs and how much RAM does each node have?

    There are 12 cores and 120000 MB of memory in each node.

    $ sinfo -lN
    Mon Aug  9 14:58:58 2021
    NODELIST                             NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON              
    bibigrid-worker-1-1-etqcgckjrdsvsxe      1    debug*        idle   12   12:1:1 121000        0      1   (null) none                
    bibigrid-worker-1-2-etqcgckjrdsvsxe      1    debug*        idle   12   12:1:1 121000        0      1   (null) none                
    bibigrid-worker-1-3-etqcgckjrdsvsxe      1    debug*        idle   12   12:1:1 121000        0      1   (null) none                
    bibigrid-worker-1-4-etqcgckjrdsvsxe      1    debug*        idle   12   12:1:1 121000        0      1   (null) none
    
  2. Run df -h /vol/COMPEPIWS/. What is the volume’s size? The volume size is 3.9T

    $ df -h /vol/COMPEPIWS/
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/vdb        3.9T  1.4T  2.4T  37% /vol/COMPEPIWS
    
  3. Check the temp folder location by: echo $TMPDIR

    $ echo $TMPDIR
    /vol/COMPEPIWS/tmp
    
  4. Your group working directory is /vol/COMPEPIWS/groups/

    $ ls /vol/COMPEPIWS/groups/chipseq2/
    conda  data  tasks
    
  5. Each of this directories has the following structure: /vol/COMPEPIWS/groups/ |__ data (where you will make symbolic links for the raw data) |__ conda (for some of conda tasks) |__ tasks (where you execute the pipelines and the downstream tasks, try to keep yourself organized inside this folder)

  6. The raw data (fastq files) that you are going to use to run the pipelines is available under this folder: /vol/COMPEPIWS/data/reduced/ and each file ID has the following scheme: <cell_type><assay/mark><biological_replicate>.fastq.gz e.g. liver_14.5_RNAseq_1.fastq.gz There is an exception for ATAC data that has an additional info for Read1 (R1) and Read2 (R2) since this data is paired end. e.g. liver_14.5_ATAC_1{R1,R2}.fastq.gz

    $ ls /vol/COMPEPIWS/data/reduced/ChIP-seq/
    kidney_14.5_H3K27ac_1.fastq.gz   kidney_14.5_H3K9me3_2.fastq.gz   kidney_15.5_H3K9ac_1.fastq.gz   liver_14.5_H3K4me1_2.fastq.gz   liver_15.5_H3K36me3_1.fastq.gz
    kidney_14.5_H3K27ac_2.fastq.gz   kidney_14.5_control_1.fastq.gz   kidney_15.5_H3K9ac_2.fastq.gz   liver_14.5_H3K4me3_1.fastq.gz   liver_15.5_H3K36me3_2.fastq.gz
    kidney_14.5_H3K27me3_1.fastq.gz  kidney_14.5_control_2.fastq.gz   kidney_15.5_H3K9me3_1.fastq.gz  liver_14.5_H3K4me3_2.fastq.gz   liver_15.5_H3K4me1_1.fastq.gz
    kidney_14.5_H3K27me3_2.fastq.gz  kidney_15.5_H3K27ac_1.fastq.gz   kidney_15.5_H3K9me3_2.fastq.gz  liver_14.5_H3K9ac_1.fastq.gz    liver_15.5_H3K4me1_2.fastq.gz
    kidney_14.5_H3K36me3_1.fastq.gz  kidney_15.5_H3K27ac_2.fastq.gz   kidney_15.5_control_1.fastq.gz  liver_14.5_H3K9ac_2.fastq.gz    liver_15.5_H3K4me3_1.fastq.gz
    kidney_14.5_H3K36me3_2.fastq.gz  kidney_15.5_H3K27me3_1.fastq.gz  kidney_15.5_control_2.fastq.gz  liver_14.5_H3K9me3_1.fastq.gz   liver_15.5_H3K4me3_2.fastq.gz
    kidney_14.5_H3K4me1_1.fastq.gz   kidney_15.5_H3K27me3_2.fastq.gz  liver_14.5_H3K27ac_1.fastq.gz   liver_14.5_H3K9me3_2.fastq.gz   liver_15.5_H3K9ac_1.fastq.gz
    kidney_14.5_H3K4me1_2.fastq.gz   kidney_15.5_H3K36me3_1.fastq.gz  liver_14.5_H3K27ac_2.fastq.gz   liver_14.5_control_1.fastq.gz   liver_15.5_H3K9ac_2.fastq.gz
    kidney_14.5_H3K4me3_1.fastq.gz   kidney_15.5_H3K36me3_2.fastq.gz  liver_14.5_H3K27me3_1.fastq.gz  liver_14.5_control_2.fastq.gz   liver_15.5_H3K9me3_1.fastq.gz
    kidney_14.5_H3K4me3_2.fastq.gz   kidney_15.5_H3K4me1_1.fastq.gz   liver_14.5_H3K27me3_2.fastq.gz  liver_15.5_H3K27ac_1.fastq.gz   liver_15.5_H3K9me3_2.fastq.gz
    kidney_14.5_H3K9ac_1.fastq.gz    kidney_15.5_H3K4me1_2.fastq.gz   liver_14.5_H3K36me3_1.fastq.gz  liver_15.5_H3K27ac_2.fastq.gz   liver_15.5_control_1.fastq.gz
    kidney_14.5_H3K9ac_2.fastq.gz    kidney_15.5_H3K4me3_1.fastq.gz   liver_14.5_H3K36me3_2.fastq.gz  liver_15.5_H3K27me3_1.fastq.gz  liver_15.5_control_2.fastq.gz
    kidney_14.5_H3K9me3_1.fastq.gz   kidney_15.5_H3K4me3_2.fastq.gz   liver_14.5_H3K4me1_1.fastq.gz   liver_15.5_H3K27me3_2.fastq.gz
    

Tips and Tricks

Working with tables on the command line, or awk

We will explore and manipulate a GTF file that contains gene annotation

  1. Do this task on one of the worker nodes

  2. Run: head /vol/COMPEPIWS/pipelines/references/genome_genes.gtf to explore the first few lines of this file.

    $ head /vol/COMPEPIWS/pipelines/references/genome_genes.gtf
    chr1	unknown	exon	3214482	3216968	.	-	.	gene_id "Xkr4"; gene_name "Xkr4"; p_id "P15391"; transcript_id "NM_001011874"; tss_id "TSS27105";
    chr1	unknown	stop_codon	3216022	3216024	.	-	.	gene_id "Xkr4"; gene_name "Xkr4"; p_id "P15391"; transcript_id "NM_001011874"; tss_id "TSS27105";
    chr1	unknown	CDS	3216025	3216968	.	-	2	gene_id "Xkr4"; gene_name "Xkr4"; p_id "P15391"; transcript_id "NM_001011874"; tss_id "TSS27105";
    chr1	unknown	CDS	3421702	3421901	.	-	1	gene_id "Xkr4"; gene_name "Xkr4"; p_id "P15391"; transcript_id "NM_001011874"; tss_id "TSS27105";
    chr1	unknown	exon	3421702	3421901	.	-	.	gene_id "Xkr4"; gene_name "Xkr4"; p_id "P15391"; transcript_id "NM_001011874"; tss_id "TSS27105";
    chr1	unknown	CDS	3670552	3671348	.	-	0	gene_id "Xkr4"; gene_name "Xkr4"; p_id "P15391"; transcript_id "NM_001011874"; tss_id "TSS27105";
    chr1	unknown	exon	3670552	3671498	.	-	.	gene_id "Xkr4"; gene_name "Xkr4"; p_id "P15391"; transcript_id "NM_001011874"; tss_id "TSS27105";
    chr1	unknown	start_codon	3671346	3671348	.	-	.	gene_id "Xkr4"; gene_name "Xkr4"; p_id "P15391"; transcript_id "NM_001011874"; tss_id "TSS27105";
    chr1	unknown	exon	4290846	4293012	.	-	.	gene_id "Rp1"; gene_name "Rp1"; p_id "P17361"; transcript_id "NM_001195662"; tss_id "TSS6138";
    chr1	unknown	stop_codon	4292981	4292983	.	-	.	gene_id "Rp1"; gene_name "Rp1"; p_id "P17361"; transcript_id "NM_001195662"; tss_id "TSS6138";
    
  3. Count the number of lines in this file

    $ wc -l /vol/COMPEPIWS/pipelines/references/genome_genes.gtf
    671462 /vol/COMPEPIWS/pipelines/references/genome_genes.gtf
    
  4. Count the number of "exon" in this file (hint: grep, awk)

    $ awk '{ if ($3=="exon") {print $0} }' /vol/COMPEPIWS/pipelines/references/genome_genes.gtf | wc -l
    324748
    
    $ awk '{ if ($3=="exon") {count++}} END { print count }' /vol/COMPEPIWS/pipelines/references/genome_genes.gtf
    324748
    
  5. Count the number of "exon" that are longer than 1000bp considering cloumns 4 and 5 as coordinates of the exon

    $ awk '{ if ($3=="exon" && $5-$4 + 1 > 1000) {count++}} END { print count }' /vol/COMPEPIWS/pipelines/references/genome_genes.gtf
    19601
    

    Note: GTF format is 1-based

  6. Count the number of "exon" that has Sox17 as gene_id

    awk '{ if ($10=="\"Sox17\";" && $3=="exon") {count++}} END { print count }' /vol/COMPEPIWS/pipelines/references/genome_genes.gtf
    20
    
    $ awk -vFS="\t" '{ if ($3=="exon") {split($9, a, ";"); if (a[1]=="gene_id \"Sox17\"") {count++} }} END { print count }' /vol/COMPEPIWS/pipelines/references/genome_genes.gtf 
    20
    

    Note: Unexpected field separation using awk

    $ cat -A /vol/COMPEPIWS/pipelines/references/genome_genes.gtf | head -1
    chr1^Iunknown^Iexon^I3214482^I3216968^I.^I-^I.^Igene_id "Xkr4"; gene_name "Xkr4"; p_id "P15391"; transcript_id "NM_001011874"; tss_id "TSS27105";$
    
  7. Count the number of "exon" in chr2

    $ awk '{ if ($1== "chr2" && $3=="exon") {count++}} END { print count }' /vol/COMPEPIWS/pipelines/references/genome_genes.gtf
    29373
    
  8. Count the frequency of each feature in column 3 (hint: uniq)

    $ cut -f3 /vol/COMPEPIWS/pipelines/references/genome_genes.gtf | sort | uniq -c 
     286783 CDS
     324748 exon
      29978 start_codon
      29953 stop_codon
    
  9. Sort the file according to column1 and then column3. redirect the output into a new file.

    $ sort -k1,1 -k3,3 /vol/COMPEPIWS/pipelines/references/genome_genes.gtf > /vol/COMPEPIWS/groups/chipseq2/tasks/genome_genes_sorted_1_then_3.gtf
    

Conda and Bioconda

  1. Load the “base” environment if it is not loaded

    $ source /vol/COMPEPIWS/conda/miniconda3/bin/activate
    
  2. create a new conda environment under your group conda directory called as your group’s name (hint: create with -p option)

    (base) ~$ conda create -p /vol/COMPEPIWS/groups/chipseq2/conda/env_chipSeq2 
    Collecting package metadata (current_repodata.json): done
    Solving environment: done
    
    
    ==> WARNING: A newer version of conda exists. <==
      current version: 4.9.2
      latest version: 4.10.3
    
    Please update conda by running
    
        $ conda update -n base -c defaults conda
    
    
    
    ## Package Plan ##
    
      environment location: /vol/COMPEPIWS/groups/chipseq2/conda/env_chipSeq2
    
    
    
    Proceed ([y]/n)? y
    
    Preparing transaction: done
    Verifying transaction: done
    Executing transaction: done
    #
    # To activate this environment, use
    #
    #     $ conda activate /vol/COMPEPIWS/groups/chipseq2/conda/env_chipSeq2
    #
    # To deactivate an active environment, use
    #
    #     $ conda deactivate
    
  3. activate the created environment

    (base) $ source activate /vol/COMPEPIWS/groups/chipseq2/conda/env_chipSeq2
    (env_chipSeq2) $ conda env list
    # conda environments:
    #
    base                     /vol/COMPEPIWS/conda/miniconda3
    core                     /vol/COMPEPIWS/conda/miniconda3/envs/core
    gembs                    /vol/COMPEPIWS/conda/miniconda3/envs/gembs
                          *  /vol/COMPEPIWS/groups/chipseq2/conda/env_chipSeq2
                          *  
    
  4. install the following tools using conda: ggplot2 (r-package), fastqc and the specific version of bedtools (version 2.22). (hint: search for the package here. The first hit is most probably what you need. Click on it and see the exact command for installing the package)

    (env_chipSeq2) $ conda install -c conda-forge r-ggplot2
    (env_chipSeq2) $ conda install -c bioconda fastqc bedtools=2.22
    
    
  5. make sure the tools have been installed properly by: tool_name --help. For “ggplot2” start R in the terminal and load the library with library(ggplot2)

    (env_chipSeq2) $ which R
    /vol/COMPEPIWS/groups/chipseq2/conda/env_chipSeq2/bin/R
    
    (env_chipSeq2) $ R
    
    R version 4.0.5 (2021-03-31) -- "Shake and Throw"
    Copyright (C) 2021 The R Foundation for Statistical Computing
    Platform: x86_64-conda-linux-gnu (64-bit)
    
    R is free software and comes with ABSOLUTELY NO WARRANTY.
    You are welcome to redistribute it under certain conditions.
    Type 'license()' or 'licence()' for distribution details.
    
    R is a collaborative project with many contributors.
    Type 'contributors()' for more information and
    'citation()' on how to cite R or R packages in publications.
    
    Type 'demo()' for some demos, 'help()' for on-line help, or
    'help.start()' for an HTML browser interface to help.
    Type 'q()' to quit R.
    
    > library("ggplot2")
    Keep up to date with changes at https://www.tidyverse.org/blog/
    
  6. update bedtools to the latest version, check the new version number.

    (env_chipSeq2) $ conda update -c bioconda bedtools
    
  7. deactivate the environment

    (env_chipSeq2) $ conda deactivate
    (base) $ conda deactivate
    

Basics in R

In this section we will learn some basic operations in R and GenomicRanges and learn how to visualize data using “ggplot2”. We will use two data sets. Both represent counts matrix in genomic bins of mouse genome.

Data frames

  1. activate the core conda environment and start R session

    source /vol/COMPEPIWS/conda/miniconda3/bin/activate /vol/COMPEPIWS/conda/miniconda3/envs/core
    
  2. Load the following libraries: “ggplot2”, “reshape2” and “GenomicRanges”

    library("reshape2")
    library("GenomicRanges")
    library("ggplot2")
    
  3. Read into R the following two tables: /vol/COMPEPIWS/groups/shared/*txt. Consider the first row as a header.

    path_kidney <- "/Users/zingo/Dropbox/CompEpi/data/groups/shared/kidney_14.5_mouse_1_2kbW_bed_counts.txt"
    path_liver <- "/Users/zingo/Dropbox/CompEpi/data/groups/shared/liver_14.5_mouse_1_2kbW_bed_counts.txt"
    
    df_kidney <- read.table(path_kidney, header=T)
    df_liver  <- read.table(path_liver, header=T)
    
  4. What is the dimension of each table?

    dim(df_kidney)
    [1] 1362728       6
    dim(df_liver)
    [1] 1362728       6
    
  5. What are the column names of each table?

    colnames(df_kidney)
    [1] "Chr"      "Start"    "End"      "H3K27me3" "H3K36me3" "H3K9me3" 
    colnames(df_liver)
    [1] "Chr"      "Start"    "End"      "H3K27me3" "H3K36me3" "H3K9me3"
    
  6. Calculate the genome length from each data set

    kidney_genome_size <- sum(df_kidney$End) - sum(df_kidney$Start)
    kidney_genome_size
    [1] 2725456000
    liver_genome_size <- sum(df_liver$End) - sum(df_liver$Start)
    liver_genome_size
    [1] 2725456000
    
  7. Create a data frame by concatenating vertically the two data sets. Add a new column called (cell_type) to annotate the rows of liver table with "liver" and the kidney’s rows with "kidney"

    df_kidney$cell_type <- "kidney"
    df_liver$cell_type <- "liver"
    df_kidney_liver <- rbind(df_kidney, df_liver)
    
  8. What is the dimension of the new data frame?

    dim(df_kidney_liver)
    [1] 2725456       7
    
  9. Reshape the data frame (excluding chr, start, end) from wide into long format using "cell_type" as variable id. What are the dimensions of this dataframe? (hint: melt function)

    df_marks <- df_kidney_liver %>% 
                select(c("H3K27me3", "H3K36me3", "H3K9me3", "cell_type"))  %>% 
                melt(id.vars = "cell_type", variable.name = "mark", value.name = "enrichment")
                
    dim(df_marks)
    [1] 8176368       3
    
  10. Plot the density of each variable considering the cell_type in two different panels. Give different trans- parent color for each variable (hint: ggplot, geom_density, facet_wrap). Save the plot as a pdf (hint: ggsave).

    p_density <- ggplot(df_marks, aes(x=enrichment, fill=mark)) + 
             geom_density(alpha=0.3) +
             facet_grid(cell_type~mark) +
             labs(x="Enrichment", y="Density")
    
    ggsave(filename="/Users/zingo/Dropbox/CompEpi/density.png", plot=p_density, device="png")
    
    

  11. The plots look very skewed, try to set the limit of X axis to 100 and re-plot. Save the plot as a pdf.

    p_scaled_density <- ggplot(df_marks, aes(x=enrichment, fill=mark)) + 
                    geom_density(alpha=0.3) +
                    facet_grid(cell_type~mark) +
                    labs(x="Enrichment", y="Density") +
                    xlim(0, 100)
    
    ggsave(filename="/Users/zingo/Dropbox/CompEpi/scaled_density.png", plot=p_scaled_density, device="pdf")
    

  12. Create a new data frame by concatenating horizontally the two data sets. Consider only the following columns: H3K27me3, H3K36me3 and H3K9me3.

    df_marks_cell <- cbind(select(df_kidney, c("H3K27me3", "H3K36me3", "H3K9me3")),
                       select(df_liver,  c("H3K27me3", "H3K36me3", "H3K9me3")) )
    
  13. Rename the header of the new data frame into: H3K27me3_liver, H3K36me3_liver, ..........

    colnames(df_marks_cell) <- c("H3K27me3_kidney", "H3K36me3_kidney", "H3K9me3_kidney", "H3K27me3_liver", "H3K36me3_liver", "H3K9me3_liver")
    
  14. Perform Principal Components Analysis on this data frame (hint: prcomp)

    PCA <- prcomp(df_marks_cell, scale=TRUE)
    
  15. How much variance is explained by PC1 and PC2?

    PC1 explains 50.63% and PC2 23.28%

    summary(PCA)
    Importance of components:
      PC1    PC2    PC3     PC4     PC5     PC6
    Standard deviation     1.7429 1.1818 0.9590 0.51405 0.47017 0.40092
    Proportion of Variance 0.5063 0.2328 0.1533 0.04404 0.03684 0.02679
    Cumulative Proportion  0.5063 0.7391 0.8923 0.93637 0.97321 1.00000
    
  16. Plot a screeplot and cumulative screeplot

        var_explained = PCA$sdev^2 / sum(PCA$sdev^2)
        screeplot <- qplot(c(1:length(var_explained)), var_explained) + 
                     geom_line() + 
                     xlab("Principal Component") + 
                     ylab("Variance Explained") +
                     ggtitle("Scree Plot") +
                     ylim(0, 1)
             
        cum_var_explained = cumsum(var_explained)
        cum_screeplot <- qplot(c(1:length(cum_var_explained)), cum_var_explained) + 
                         geom_line() + 
                         xlab("Principal Component") + 
                         ylab("Variance Explained") +
                         ggtitle("Cumulative Scree Plot") +
                         ylim(0, 1)
    

  17. Explore the principal components loadings. How many are there?

    > summary(PCA$x)
      PC1                PC2                PC3                 PC4                PC5           
     Min.   : -3.1453   Min.   :-26.6655   Min.   :-20.22737   Min.   :-92.9514   Min.   :-25.56574  
     1st Qu.: -0.7478   1st Qu.: -0.2213   1st Qu.: -0.45789   1st Qu.: -0.2635   1st Qu.: -0.24920  
     Median :  0.2027   Median :  0.1941   Median : -0.05347   Median : -0.0446   Median :  0.04311  
     Mean   :  0.0000   Mean   :  0.0000   Mean   :  0.00000   Mean   :  0.0000   Mean   :  0.00000  
     3rd Qu.:  0.8873   3rd Qu.:  0.5127   3rd Qu.:  0.45702   3rd Qu.:  0.2857   3rd Qu.:  0.22380  
     Max.   :413.5897   Max.   :149.2764   Max.   :136.07828   Max.   : 16.9057   Max.   : 41.20703  
          PC6            
     Min.   :-12.500283  
     1st Qu.: -0.160926  
     Median : -0.007328  
     Mean   :  0.000000  
     3rd Qu.:  0.149057  
     Max.   :  5.646389
    
  18. Plot the PC1 and PC2 loadings. How are the variables distributed across PC1 and PC2?

    library(PCAtools)
    
    biplot(PCA, showLoadings = TRUE,
           labSize = 5, pointSize = 5, sizeLoadingsNames = 3)
    

GenomicRanges

  1. Create two GenomicRange objects from the liver and kidney data frames using columns 4,5 and 6 as metadata

    BiocManager::install("GenomicRanges")
    
    library("GenomicRanges")
    
    gr_kidney <- GRanges(seqnames = df_kidney$Chr,
                        ranges = IRanges(start=df_kidney$Start, end=df_kidney$End),
                        mcols= select(df_kidney,c("H3K27me3", "H3K36me3", "H3K9me3")))
    
    gr_liver <- GRanges(seqnames = df_liver$Chr,
                        ranges = IRanges(start=df_liver$Start, end=df_liver$End),
                        mcols= select(df_liver,c("H3K27me3", "H3K36me3", "H3K9me3")))
    
  2. Calculate the total number of bases covered in these two objects

    coverage(gr_kidney)
    
  3. Subset one of the objects to have only chr2 lines

    gr_kidney[seqnames(gr_kidney) == "chr2"]
    gr_liver[seqnames(gr_liver) == "chr2"]
    
  4. Shift one of the object’s ranges by 100 upstream

    gr_kidney_shfited <- shift(gr_kidney, +100)
    
  5. Find overlapping regions between the two objects

    findOverlaps(gr_kidney_shfited, gr_liver)
        Hits object with 2725435 hits and 0 metadata columns:
                queryHits subjectHits
                <integer>   <integer>
            [1]         1           1
            [2]         1           2
            [3]         2           2
            [4]         2           3
            [5]         3           3
            ...       ...         ...
      [2725431]   1362726     1362726
      [2725432]   1362726     1362727
      [2725433]   1362727     1362727
      [2725434]   1362727     1362728
      [2725435]   1362728     1362728
      -------
      queryLength: 1362728 / subjectLength: 1362728
    
  6. Make a GRangeList object from the two GRange objects

    GRangesList("Kidney" = gr_kidney, "Liver" = gr_liver)
    

SLUR


ChIP Seq Analysis

Nextflow ChIP-seq pipeline

Organize Data

Organize your raw data it is a good practice to have the raw data placed in one folder:

  1. Enter your working directory /vol/COMPEPIWS/groups//tasks

    cd /vol/COMPEPIWS/groups/chipseq2/tasks
    
  2. List the raw fastq files under /vol/COMPEPIWS/data/reduced/ChIP-seq/

    $ ls  /vol/COMPEPIWS/data/reduced/ChIP-seq/
    kidney_14.5_H3K27ac_1.fastq.gz   kidney_14.5_H3K9me3_2.fastq.gz   kidney_15.5_H3K9ac_1.fastq.gz   liver_14.5_H3K4me1_2.fastq.gz   liver_15.5_H3K36me3_1.fastq.gz
    kidney_14.5_H3K27ac_2.fastq.gz   kidney_14.5_control_1.fastq.gz   kidney_15.5_H3K9ac_2.fastq.gz   liver_14.5_H3K4me3_1.fastq.gz   liver_15.5_H3K36me3_2.fastq.gz
    kidney_14.5_H3K27me3_1.fastq.gz  kidney_14.5_control_2.fastq.gz   kidney_15.5_H3K9me3_1.fastq.gz  liver_14.5_H3K4me3_2.fastq.gz   liver_15.5_H3K4me1_1.fastq.gz
    kidney_14.5_H3K27me3_2.fastq.gz  kidney_15.5_H3K27ac_1.fastq.gz   kidney_15.5_H3K9me3_2.fastq.gz  liver_14.5_H3K9ac_1.fastq.gz    liver_15.5_H3K4me1_2.fastq.gz
    kidney_14.5_H3K36me3_1.fastq.gz  kidney_15.5_H3K27ac_2.fastq.gz   kidney_15.5_control_1.fastq.gz  liver_14.5_H3K9ac_2.fastq.gz    liver_15.5_H3K4me3_1.fastq.gz
    kidney_14.5_H3K36me3_2.fastq.gz  kidney_15.5_H3K27me3_1.fastq.gz  kidney_15.5_control_2.fastq.gz  liver_14.5_H3K9me3_1.fastq.gz   liver_15.5_H3K4me3_2.fastq.gz
    kidney_14.5_H3K4me1_1.fastq.gz   kidney_15.5_H3K27me3_2.fastq.gz  liver_14.5_H3K27ac_1.fastq.gz   liver_14.5_H3K9me3_2.fastq.gz   liver_15.5_H3K9ac_1.fastq.gz
    kidney_14.5_H3K4me1_2.fastq.gz   kidney_15.5_H3K36me3_1.fastq.gz  liver_14.5_H3K27ac_2.fastq.gz   liver_14.5_control_1.fastq.gz   liver_15.5_H3K9ac_2.fastq.gz
    kidney_14.5_H3K4me3_1.fastq.gz   kidney_15.5_H3K36me3_2.fastq.gz  liver_14.5_H3K27me3_1.fastq.gz  liver_14.5_control_2.fastq.gz   liver_15.5_H3K9me3_1.fastq.gz
    kidney_14.5_H3K4me3_2.fastq.gz   kidney_15.5_H3K4me1_1.fastq.gz   liver_14.5_H3K27me3_2.fastq.gz  liver_15.5_H3K27ac_1.fastq.gz   liver_15.5_H3K9me3_2.fastq.gz
    kidney_14.5_H3K9ac_1.fastq.gz    kidney_15.5_H3K4me1_2.fastq.gz   liver_14.5_H3K36me3_1.fastq.gz  liver_15.5_H3K27ac_2.fastq.gz   liver_15.5_control_1.fastq.gz
    kidney_14.5_H3K9ac_2.fastq.gz    kidney_15.5_H3K4me3_1.fastq.gz   liver_14.5_H3K36me3_2.fastq.gz  liver_15.5_H3K27me3_1.fastq.gz  liver_15.5_control_2.fastq.gz
    kidney_14.5_H3K9me3_1.fastq.gz   kidney_15.5_H3K4me3_2.fastq.gz   liver_14.5_H3K4me1_1.fastq.gz   liver_15.5_H3K27me3_2.fastq.gz
    
  3. How many files per cell_type per time point per replicate are there? There are two types cell_type; kidney and liver. Each cell_type has 32 files.

  4. What are the histone marks you are dealing with? H3K27ac, H3K27me3, H3K4me1, H3K4me3, H3K9ac, H3K9me3, H3K36me3

  5. how many "control" per replicate do you have? There are 8 control in total and 4 controls per replicate.

        ls -lah /vol/COMPEPIWS/data/reduced/ChIP-seq/ | grep "control" 
        -rw-r--r-- 1 abdosa abdosa  63M Feb 24  2020 kidney_14.5_control_1.fastq.gz
        -rw-r--r-- 1 abdosa abdosa  40M Feb 24  2020 kidney_14.5_control_2.fastq.gz
        -rw-r--r-- 1 abdosa abdosa 141M Feb 24  2020 kidney_15.5_control_1.fastq.gz
        -rw-r--r-- 1 abdosa abdosa 138M Feb 24  2020 kidney_15.5_control_2.fastq.gz
        -rw-r--r-- 1 abdosa abdosa  32M Feb 24  2020 liver_14.5_control_1.fastq.gz
        -rw-r--r-- 1 abdosa abdosa  34M Feb 24  2020 liver_14.5_control_2.fastq.gz
        -rw-r--r-- 1 abdosa abdosa 139M Feb 24  2020 liver_15.5_control_1.fastq.gz
        -rw-r--r-- 1 abdosa abdosa 222M Feb 24  2020 liver_15.5_control_2.fastq.gz
    
  6. Create a “data” folder in your group working directory and create symbolic links of the relevant files according to your group to the “data” folder

    ln -s files/*.csv data/
    

Generating sample sheet

  1. prepare the samplesheet. You can prepare it manually in any editor of your choice, but it is recom- mended to generate it on the command line taking the advantage of the file names, this is particularly useful when you have many samples. You can loop over the files and extract the information you need to create the samplesheet (hint: basename, cut).
    #change the path to where you files are located. 
    for i in  /vol/COMPEPIWS/data/reduced/ChIP-seq/*;
    do 
        filename=$(basename ${i} .fastq.gz);
        fastq1=$i;
        group=$(basename ${i}|cut -d "_" -f 1-3);
        replicate=$(basename ${fastq1}|cut -d "." -f 1-2|cut -d "_" -f 4 );
        mark=$(echo ${group}|cut -d "_" -f 3);
        control=$(echo $group|cut -d _ -f 1-2|awk '{print $0"_Input"}' ); 
        if [ $mark == "control" ] ;
        then 
            echo "$control,$replicate,$fastq1,,,"; 
        else  
            echo "$group,$replicate,$fastq1,,$mark,$control"; 
        fi ; 
        done 
        |awk 'BEGIN { print "group,replicate,fastq_1,fastq_2,antibody,control" } {print}'  > sampleheet.csv 
    

Running the pipeline

Based on the contents of config file:

params {
    max_memory = 123.GB
    max_cpus = 12
    max_time = 24.h

    // Genome references
    fasta = '/vol/COMPEPIWS/pipelines/references/splitted/mm10_reduced.fa'
    fasta_index = '/vol/COMPEPIWS/pipelines/references/splitted/mm10_reduced.fa.fai'
    gtf = '/vol/COMPEPIWS/pipelines/references/genome_genes.gtf'
    blacklist = '/vol/COMPEPIWS/pipelines/references/mm10-blacklist-reduced.bed'
    macs_gsize = '152134205'
    bwa_index = '/vol/COMPEPIWS/pipelines/references/splitted/atacseq/BWAIndex/mm10_reduced.fa'

    single_end = 'True'
    narrow_peak = 'True'
    // skip some steps
    skip_diff_analysis = 'True'
    save_macs_pileup = 'False'
    skip_preseq = 'True'
    skip_plot_profile = 'True'
    skip_plot_fingerprint = 'True'
    skip_spp = 'True'
}


process{

    withLabel:process_low {
      cpus = 1
      memory = 2.GB
      time = 2.h
    }
    withLabel:process_medium {
      cpus = 2
      memory = 4.GB
      time = 4.h
    }
    withLabel:process_high {
      cpus = 4
      memory = 8.GB
      time = 8.h
    }

    errorStrategy = 'ignore'
  1. Is the data paired end or single end? The data is single-end

  2. Which steps of the pipelines are skipped?

    • Skip Differential Analysis
    • Skip Preseq
    • Skip deepTools plotProfile
    • Skip deepTools plotFingerprint
    • Skip Phantompeakqualtools
  3. Why is the blacklist file used? What do these regions represent? Blacklist file is used in order to ensure that reads that align to specific genomic regions will be filtered out. These regions are often found at specific types of repeats such as centromeres, telomeres and satellite repeats. The ENCODE blacklists regions typically appear uniquely mappable so simple mappability filters do not remove them. These blacklists are applicable to functional genomic data like ChIP-seq based on short-read sequencing (20-100bp reads).

  4. Which tool will be used for mapping? Burrows-Wheeler Aligner - BWA

  5. List in order the main steps of the pipeline for processing ChIP-seq data? The successive steps of the PIpeline are listed below:

    • Raw read QC (FastQC)
    • Adapter trimming (Trim Galore)
    • Alignment (BWA)
    • Merge alignments from multiple libraries of the same sample (picard)
    • Create IGV session file containing bigWig tracks, peaks and differential sites for data visualisation (IGV)
    • Present QC for raw read, alignment, peak-calling and differential binding results (MultiQC, R)

Quality Control

Exploring the result folder

  1. Have a look into the pipeline_info/execution_report.html: • How long did the pipeline take to finish? 7h 43m 20s

    • Which step took the most CPU/Memory usage? The most CPU usage has been noted in case og BWA_MEM. Highest percentage of resources being alloted to Consensus_Peaks

    Highest memory usage has been noted in case of MERGED_BAM

  2. Where are the aligned reads located (*bam files)? You will use these files later /vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/

    find /vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run -name '*bam'
    /vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/liver_15.5_H3K9me3_R2.mLb.clN.sorted.bam
    /vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/kidney_15.5_H3K4me3_R2.mLb.clN.sorted.bam
    /vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/liver_15.5_H3K9ac_R1.mLb.clN.sorted.bam
    ...
    
  3. Where are the peaks files located (*narrowPeak)? which tool was used for peak calling? /vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results//bwa/mergedLibrary/macs/narrowPeak

    $ find /vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/ -name '*narrowPeak'
    /vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/multiqc/narrowPeak
    /vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/macs/narrowPeak
    /vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/macs/narrowPeak/liver_14.5_H3K9ac_R2_peaks.narrowPeak
    /vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/macs/narrowPeak/kidney_14.5_H3K4me1_R2_peaks.narrowPeak
    /vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/macs/narrowPeak/kidney_15.5_H3K36me3_R2_peaks.narrowPeak
    /vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/macs/narrowPeak/kidney_15.5_H3K27ac_R2_peaks.narrowPeak
    ...
    
  4. Where are the consensus peaks located (*consensus_peaks.bed)?

    For each histon marker under the following directory: /vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/macs/narrowPeak/consensus/

    $ find /vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/ -name '*consensus_peaks.bed'
    /vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/macs/narrowPeak/consensus/H3K4me1/H3K4me1.consensus_peaks.bed
    /vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/macs/narrowPeak/consensus/H3K27me3/H3K27me3.consensus_peaks.bed
    /vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/macs/narrowPeak/consensus/H3K27ac/H3K27ac.consensus_peaks.bed
    /vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/macs/narrowPeak/consensus/H3K36me3/H3K36me3.consensus_peaks.bed
    /vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/macs/narrowPeak/consensus/H3K9ac/H3K9ac.consensus_peaks.bed
    /vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/macs/narrowPeak/consensus/H3K9me3/H3K9me3.consensus_peaks.bed
    ...
    
  5. Where are the signal tracks located (*bigWig)? You will use these files later /vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/bigwig/

    $ find /vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/ -name '*bigWig'
    /vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/bigwig/kidney_14.5_H3K4me1_R2.bigWig
    /vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/bigwig/kidney_15.5_H3K27me3_R1.bigWig
    /vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/bigwig/kidney_14.5_H3K27ac_R2.bigWig
    /vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/bigwig/kidney_14.5_H3K9ac_R1.bigWig
    /vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/bigwig/liver_14.5_Input_R1.bigWig
    

MultiQC

. Explore the different section of the MultiQC HTML report generated by the pipeline. You find it under results/multiqc/narrowPeak folder.

  1. In the “LIB: FASTQC (raw)” section: • How many reads do you have for each sample?

• How good is the quality of the bases in the reads?

• How good is the quality of the reads?

• What are the read lengths for the samples?

• Is there a sign for adapter contamination?

No samples found with any adapter contamination > 0.1%

• compare the number of reads after trimming? do you loose many reads after trimming?

  1. In the “MERGED LIB: SAMTools (unfiltered)”: • How is the mapping efficiency? From the graph we can see that SAMTools gives us a pretty good mapping efficiency since most of the reads have been mapped.

• What is the duplication rate? From the plot below, we notice that the percentage of duplicate reads is much less as compared to the unique reads. So we can say the duplication rate is much low.

  1. In the “MERGED LIB: MACS2 PEAK COUNT” section: • Do you get the similar number of peaks for all histone marks?

From the graph we notice

• Overall: which marks get the highest/lowest number of peaks? 5. In the “MERGED LIB: MACS2 FRIP SCORE” section: • What does FRiP score mean? • which marks get the highest/lowest FRiP scores? How is it related to the mark signal distribution (narrow/broad)?

deepTools2

deepTools has different quality control tools that can be used to evaluate the ChIP-seq data quality. We will use some of them (read about the tools from the Documentation).

  1. Activate the core environment.

  2. Explore the batch effect between your samples (hint: multiBigwigSummary, plotPCA, plotCorrelation, recognize the different parameters of each tool)? • PCA plot might not be very informative due to the large number of samples, try to plot it in R with ggplot. There is a ready file to be read in R if you provide “--outFileNameData” option to “plotPCA” function.

    # install.packages("tidyverse")
    # install.packages("reshape2")
    
    library("tidyverse")
    library("reshape2")
    path_2_data <- "/Users/zingo/Dropbox/CompEpi/Pipline/deeptools/PCAoutput"
    
    x <- read.table(path_2_data, header=T)
    xm <- melt(x[1:2,-c(66)], id.vars="Component")
    xmd <- dcast(xm, variable ~ Component)
    
    xmd$cell_type[grepl("liver",xmd$variable)] <- "liver"
    xmd$cell_type[grepl("kidney",xmd$variable)] <- "kidney"
    xmd$mark[grepl("H3K27ac",xmd$variable)] <- "H3K27ac"
    xmd$mark[grepl("H3K27me3",xmd$variable)] <- "H3K27me3"
    xmd$mark[grepl("H3K9me3",xmd$variable)] <- "H3K9me3"
    xmd$mark[grepl("H3K36me3",xmd$variable)] <- "H3K36me3"
    xmd$mark[grepl("H3K4me3",xmd$variable)] <- "H3K4me3"
    xmd$mark[grepl("H3K4me1",xmd$variable)] <- "H3K4me1"
    xmd$mark[grepl("H3K9ac",xmd$variable)] <- "H3K9ac"
    xmd$mark[grepl("Input",xmd$variable)] <- "Input"
    
    
    colnames(xmd) <- c("variable","PC1","PC2", "cell_type", "mark")
    
    ggplot(xmd, aes(x=PC1, y=PC2, color=mark)) + 
      geom_point(aes(shape=cell_type), size=3) + 
      theme(legend.position="top") + xlab("PC1:21.5%") + ylab("PC2:15.6%")
    
    ggsave("PCA_R.png")
    

    java -Xmx3000M -Djava.awt.headless=true -jar
    /vol/COMPEPIWS/softwares/ChromHMM/ChromHMM.jar BinarizeBam
    -b \ /vol/COMPEPIWS/softwares/ChromHMM/CHROMSIZES/mm10.txt
    inputs
    cellmarkfiletable.txt
    output_folder_name

    (core) $ multiBigwigSummary bins -b /vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/bigwig/*bigWig  -o multiBigwigSummaryResults.npz
    (core) $ plotCorrelation --corData multiBigwigSummaryResults.npz --corMethod spearman --whatToPlot heatmap --plotFile plotCorrelationSpearman
    (core) $ plotCorrelation --corData multiBigwigSummaryResults.npz --corMethod pearson --whatToPlot heatmap --plotFile plotCorrelationPearson
    (core) $ plotPCA  --corData multiBigwigSummaryResults.npz --plotFile PCA --outFileNameData PCAoutput
    
  3. Based on the plotCorrelation outputs, which marks are more similar to each other. In liver & kidney cells: H3K9ac ~ H3K4me3 In liver & kidney cells: H3K27ac smilar to H3K4me1 H3K36me3 kidney smilar to H3K36me3 liver H3K27me3 kidney is similar H3K27me3 liver

  4. How well are the signals (marks) separated from the background. Since you have many samples it would be wise to do each histone mark of all samples separately. i.e, one plot for all H3K4me3, another for H3K4me1 ........ etc (hint: plotFingerprint)

check the failed smaples and linked to here
![](https://i.imgur.com/4zAECUi.png)
![](https://i.imgur.com/pwuc46d.png)
![](https://i.imgur.com/iObY2hm.png)



``` {bash}
      /vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/kidney_14.5_H3K27ac_R1.mLb.clN.sorted.bam
      /vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/kidney_14.5_H3K27ac_R1.mLb.clN.sorted.bam
   /vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/kidney_14.5_Input_R1.mLb.clN.sorted.bam

declare -a arr=("H3K27ac" "H3K27me3" "H3K4me1" "H3K4me3" "H3K9ac" "H3K9me3" "H3K36me3" "Input")

for mark in "${arr[@]}"
do
    bams=($(ls /vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/*${mark}*bam)) 
    names=($(for i in ${bams[@]}; do echo $(basename $i)|cut -f 1-2 -d . ; done))

    echo -e "sbatch -J fp_${mark} -o output_${mark}.out --wrap=\"plotFingerprint -b ${bams[@]} -plot ${mark}_fingerprint.png --labels ${names[@]} --plotTitle ${mark}\ Fingerprint\""

done
```
  1. Choose one of the replicates of one stage of one cell type and plot profiles of the 7 histone marks around TSS (+-1500bp) and across the gene body (+-1500bp). Use the gene file produced by your job under results/genome/genome_genes.bed (hint: plotProfile, plotHeatmap). Which marks are enriched at TSS and which are enriched at the gene body? H3k4me3 is enriched at TTS H3K36me3 is enriched at gene body
    $ computeMatrix scale-regions \
    -S /vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/bwa/mergedLibrary/bigwig/kidney_15.5_H*_R1.bigWig \
    -R /vol/COMPEPIWS/groups/chipseq2/tasks/nextflow_run/results/genome/genome_genes.bed \
    -b 1500 -a 1500
    -o Kidney_15.5_R1.mat.gz
    
    $ plotProfile -m Kidney_15.5_R1.mat.gz \
              -out Kidney_15.5_R1_Profile_heatmap.png \
              -plotType heatmap
              --numPlotsPerRow 2
              --plotTitle "Kidney_15.5_R1 Profile Plot"
    
    

Exploratory Analysis using IGV

Steps 1 to 4 done: IGV installed on local machine. Opened IGV session from results/igv/narrowPeak/igv_session.xml Loaded the gene tracks from results/genome/genome_genes.bed

After removing the consensus tracks:

![](https://i.imgur.com/cipak9y.png)

Uploading the gene tracks from results/genome/genome_genes.bed:

![](https://i.imgur.com/c6jIUgd.png)
  1. Do you see clear peaks for all marks? which marks are sharp and which ones are broad(er)?

    From the picture it is clear that we get clear peaks for the marks.

    If we consider this gene on chromosome 18 we notice that the we get the sharp peaks around the Transciption start sites are enriched for the marks H3K9Ac, H3K27Ac and H3K4me3 both in the kidney and liver cells. Enrichment at the TSS look like sharp peaks. Broader marks are noticed where the gene body has been enriched.

  2. Realize the local distribution of the different histone marks. Which one(s) overlap with TSS and which of them ovelap with gene body?

    Analysisng Chromosome 19, gene_name "E030024N20Rik"; transcript_id "NR_033228" Histone marks that overlap with TSS: H3K9Ac, H3K27Ac and H3K4me3 Histone marks tha overlap with the gene body: H3K4me1, H3K36me3,H3K9me3, H3K27 me3

  3. Do you always see an agreement between the called peaks of the two replicates of the same mark in the same cell type? To get a quantitative answer have a look to the Upset plots under results/bwa/mergedLibrary/macs/narrowPeak/consensus//*.boolean.intersect.plot.pdf. What do you understand from this plot (choose two of the sharp marks)

    We have analysed the intersect plots we observe the following for the respective Histone marks:

    H3K4me1: We do not always see an agreement between peaks called between the to replicates of the same cell type. Considering Kidney 15.5 cells replicate 1 and replicate 2: We notice that for most of the cases there is an agreement between the peaks called.

    The peaks in liver liver_14.5 cells repliacte 1 and 2 are not always in agreement with each other.

    H3K4me3: Kidney 14.5 cells replicate 1 and replicate 2: The peaks are not always at agreement with one another. LIver 15.5 cells replicate 1 and replicate 2 : The peaks are not always in agreement.

Chromatin segmentation with ChromHMM

ChromHMM Installation

Running ChromHMM


Segmentation folder created Create a folder called “inputs” and create symbolic links of all aligned read files to this folder (there should be 64 bam files)!

To run BinarizeBam you need to

  1. Prepare a tab delimited file “cellmarkfiletable.txt”. Prepare this file in such away that the replicates of one stage per cell type are pooled together. i.e at the end we want to have one segmentation result per cell type per stage: liver_14.5/15.5 and kidney_14.5/15.5. Use the control samples in “cellmarkfiletable.txt” file to adjust the binarization threshold (hint: loop over the bam files, extract the cell type and histone mark names from the file name).

    declare -a cells=("kidney_14.5" "kidney_15.5" "liver_14.5" "liver_15.5")
    declare -a marks=("H3K27ac" "H3K27me3" "H3K4me1" "H3K4me3" "H3K9ac" "H3K9me3" "H3K36me3")
    declare -a replicates=("R1" "R2")
    
    path=""
    
    for sample in "${cells[@]}"
    do
        for replicate in "${replicates[@]}"
        do
            for mark in "${marks[@]}"
                do
                printf "${sample}\t${mark}\t${path}${sample}_${mark}_${replicate}.mLb.clN.sorted.bam\t${path}${sample}_Input_${replicate}.mLb.clN.sorted.bam\n"	
            done
        done
    done
    
  2. Run the BinarizeBam command with 200bp as a bin size (be careful in selecting the right chromosom- Lengthfile) [Make sure you run this task on one of the worker nodes]

    java -Xmx3000M -Djava.awt.headless=true -jar  /vol/COMPEPIWS/softwares/ChromHMM/ChromHMM.jar  BinarizeBam  -b 200   /vol/COMPEPIWS/softwares/ChromHMM/CHROMSIZES/mm10.txt inputs/ cellmarkfiletable.txt Chip_chrom_segmentation
    
  3. Run the LearnModel command with different number of states:5-15(submitthe commandsas jobs(sbatch) to SLURM, refer to the generalTask 4.4): [Make sure you run this task on one of the master node]

    states=($(seq 5 1 15 ))
    
    for state in ${states[@]}
    do
        sbatch --job-name=LM_state${state} -o job_$state.out --wrap="java -Xmx3000M -Djava.awt.headless=true -jar  /vol/COMPEPIWS/softwares/ChromHMM/ChromHMM.jar LearnModel /vol/COMPEPIWS/groups/chipseq2/tasks/segmentation/binarizedBAMs /vol/COMPEPIWS/groups/chipseq2/tasks/segmentation/learnModels_CellTypeTimePoint/"${state}"  ${state} mm10"
    done
    
    
  4. Explore the output files called webpage_$state.html. Especially, emission heatmap, Ref- Seq_TSS_neighborhood and overlap figures. Comment on the “new” states you get.

    Emission parameters for the model with 5 states:

    State 1: H3K27me3 -> Bivalent Promoter State 2: no histone marks are enriched -> Heterochromatin State 3: H3k36me3 ->Strong Transcription State 4: H3K27ac, H3K4me1 -> Strong, TSS-dist enhancer State 5: H3k4me3, H3k9ac,H3k27ac,H3K4me1 -> Active Promoter

    Emission parameters for the model with 6 states:

    State 1: H3K27me3 -> Bivalent Promoter State 2: no histone marks are enriched -> no signal State 3: H3k9me3 -> Heterochromatin State 4: H3k36me3 -> Strong Transcription State 5: H3k27ac, H3k4me1 -> Strong TSS-dist enhancer State 6: H3k4me3, H3k9ac,H3k27ac,H3K4me1 -> Active Promoter

  5. Compare the different models to assess what is the proper minimum number of states.Use the model with the highest number of states as a reference (hint: CompareModels function). What is the minimum reasonable number of states you think is appropriate?

     java -Xmx3000M -Djava.awt.headless=true -jar  /vol/COMPEPIWS/softwares/ChromHMM/ChromHMM.jar CompareModels emissions/emissions_15.txt emissions/ results_
    

    A look at the graph suggests that the model with 5 and 6 states fails completely to enrich the 15th state abd the models with 5,6,7,8,9 states cannot enrich the state 4 So this makes it clear that a minimum of 10 state model is required to mark all the states.

  6. The “cellmarkfiletable” in step 5 was prepared to produce 4 segmentation files (1 per stage per cell type). Repeat step 5 in a way that you merge the stages of each cell type in order to produce only 2 segmentation files (1 for each cell type). https://i.imgur.com/icr4PAT.png

    declare -a cells=("kidney" "liver")
    declare -a stages=("14.5" "15.5")
    declare -a marks=("H3K27ac" "H3K27me3" "H3K4me1" "H3K4me3" "H3K9ac" "H3K9me3" "H3K36me3")
    declare -a replicates=("R1" "R2")
    
    for sample in "${cells[@]}"
    do
            for stage in "${stages[@]}"
            do
                    for replicate in "${replicates[@]}"
                    do
                            for mark in "${marks[@]}"
                            do
                                    printf "${sample}\t${mark}\t${sample}_${stage}_${mark}_${replicate}.mLb.clN.sorted.bam\t${sample}_${stage}_Input_${replicate}.mLb.clN.sorted.bam\n"
                            done
                    done
            done
    
    done
    
    java -Xmx3000M -Djava.awt.headless=true -jar  /vol/COMPEPIWS/softwares/ChromHMM/ChromHMM.jar  BinarizeBam  -b 200   /vol/COMPEPIWS/softwares/ChromHMM/CHROMSIZES/mm10.txt inputs/ binarizedBAMs_CellType/cellmarkfiletable.txt binarizedBAMs_CellType
    
  7. Repeat steps 6 with the new “cellmarkfiletable”. Do it only with 15 states model

    state=15
    sbatch --job-name=LM_state${state} -o job_$state.out --wrap="java -Xmx3000M -Djava.awt.headless=true -jar  /vol/COMPEPIWS/softwares/ChromHMM/ChromHMM.jar LearnModel /vol/COMPEPIWS/groups/chipseq2/tasks/segmentation/binarizedBAMs_CellType /vol/COMPEPIWS/groups/chipseq2/tasks/segmentation/learnModels_CellType/"${state}"  ${state} mm10"
    
  8. Based on the emission heatmap, overlap and enrichment of this new 15-state model, try to assign biological labels to the states. Use figure 2 as reference and use the shortcuts from table 1. In case you couldn’t find the proper annotation from figure 2 try to be creative and give a label by yourself.

    Emission Heatmap

    Overlap-Kidney cells:

    Overlap-Liver cells:

    From an analysis of the Fold enrichment plots of the kidney and liver cells it is clear that the On analysis of the three graphs we can assign the following biological states to the 15 states:

    Promoter -> Enhancer -> Transcription -> Heterochromatin 1 -> Tx_S -> Strong Transcription 2 -> Tx_W -> Weak Transcription 3 -> Tx_I -> Transcription Initiation 4 -> Enh_W -> Weak Enahncer 5 -> Enh_A -> Active Enhancer 6 -> Enh_S -> Strong TSS Prox Enhancer 7 -> Pr_A -> Active promoter 8 -> Pr_W -> Weak promoter 9 -> Pr_F -> Flanking Promoter 10-> Pr_B -> Bivalent Promoter(less enriched than state 7 for the regions show in figure 1) 11-> Enh_P -> Poised Enhancer 12-> Het_P -> Polycomb_associated heterochromatin 13-> NS -> No signal 14-> Het_W -> Weak Heterochromatin 15-> Mix

  9. Rename the states in column 4 of the files *dense.bed (or *segments.bed) using the biological states from the previous question (consider the command MakeBrowserFiles from ChromHMM or awk in bash to do the task). Note: Redirect the output into new files in a subfolder called “relabeled” to avoid overwriting the original files.

    java -Xmx3000M -Djava.awt.headless=true -jar  /vol/COMPEPIWS/softwares/ChromHMM/ChromHMM.jar MakeBrowserFiles -m labels.txt \
                     -n 15 \
                     kidney_15_dense.bed \
                     kidney_15 \
                     results/kidney_15_dense_
    
  10. Run "reorder" command to generate the emission heatmap with the new labels.

     java -Xmx3000M -Djava.awt.headless=true -jar  /vol/COMPEPIWS/softwares/ChromHMM/ChromHMM.jar Reorder -m ../relabeled/labels.txt ../model_15.txt ./
    
  11. Copied the relabeled segmentation files of Q12 and the emission heatmap of Q13 into the shared folder under your group sub-directory /vol/COMPEPIWS/groups/shared/

Exploration of chromatin state segmentation using IGV

  1. Which state(s) is(are) more likely overlapping with TSS? Gene: Rbbp8

    Enhancer and Promoter regions are likely to overlap with TSS

  2. Upload the segmentation of the other cell type (one replicate). Can you pin point some differential states between the two cell types? save some snapshots.

    It can be clearly seen from the plot that the two differential states are : Enh_W in liver cells and Enh_S in kidney cells.

Exploratory Analysis using R

In this section we will explore the segmentation results of 15-state model (the relabeled) by looking to their length distributions, genome coverage percentages. We will do this in R and ggplot package for plotting. This task will teach how to visualize multiple layers of information in one plot. You will try to summarize all what you need in a data frame for a later easy plotting. The final data frame should look like: chromosome start end state sample length chr18 12334 13034 2_Enh_A liver 700

The steps 1-9 will help you to prepare such a table. However, feel free to do it in your own way.

  1. Activate the “core” environment.

    source /vol/COMPEPIWS/conda/miniconda3/bin/activate /vol/COMPEPIWS/conda/miniconda3/envs/core
    
  2. Start R

  3. Read all *(reordered)dense.bed files from task 4.2.12 from “segmentation” directory results into R. (hint: use list.files, lapply, read.table functions)

    
    dir <- "/Users/zingo/Dropbox/CompEpi/Pipline/CHMM/segmentation"
    regex <- "*dense.bed$"
    filenames <- list.files(path=dir,
                            pattern=regex,
                            full.names=TRUE)
    
    filenames
    
    BEDs <- lapply(filenames,
                   read.table,
                   header=FALSE,
                   sep="\t",
                   stringsAsFactors=FALSE,
                   skip=1)
    
    BEDs[[1]]$sample <- "kidney"
    BEDs[[2]]$sample <- "liver"
    
  4. Merge the results into one data frame (hint: rbind function).

    df <- rbind(BEDs[[1]], BEDs[[2]])
    
  5. Extract the sample names, they should be two unique names (hint: strsplit).

  6. Make a vector of names with length equal to the data frame length and consider the different name numbers for each sample (hint: for, rep)

  7. Add a sample name column to the data frame.

  8. Calculate the segment length and add it as a new column to the data frame.

    segments_df <- select(df, V1, V2, V3, V4, sample)
    colnames(segments_df) <- c("chromosome", "start", "end", "state", "sample")
    segments_df$length <- segments_df$end - segments_df$start
    
  9. Calculate the segment percentage per sample (hint: melt, aggregate, dcast functions).

    library(dplyr)
    df_m  <- melt(segments_df, id.vars=c("state", "sample","length")) %>%
             group_by(state, sample) %>% 
             summarise(lengths = sum(length))
    
    sum_geome_kidney <- sum(df_m$lengths[df_m$sample=="kidney"])
    sum_geome_liver  <- sum(df_m$lengths[df_m$sample=="liver"])
    
    df_mm <- df_m %>% mutate(percentage=case_when(sample=="kidney" ~ lengths / sum_geome_kidney * 100,
                                                  sample == "liver" ~ lengths / sum_geome_liver * 100,
                                                  T ~ 0))
    
  10. Plot and save the length distributions per sample (hint: ggplot() + geom_boxplot() + facet_wrap(), ggsave).

    p_length_ditribution <- ggplot(segments_df, aes(y=length, x=reorder(state,length), fill=sample)) + 
                        geom_boxplot() +
                        scale_y_continuous(trans='log2') +
                        xlab("States") +
                        ylab("Segments Lengths") 
    

  11. Plot and save the segment genomic percentage per sample as stacked bar plots (hint: ggplot() + geom_barplot(), ggsave).

    ggplot(df_mm, aes(fill=state, y=percentage, x=sample)) + 
        geom_bar(position="stack", stat="identity")
    

  12. Based on the plots from task 10 and 11, comment on the state distributions/lengths across the cell types and stages. 13_NS state in both samples are the longest segments with around 67% of the genomic regions. 8_Pr_W state in both samples are the shortest segments with around 0.14% of the genomic regions 4_Enh_W in kdiney 5.79% of the genomic regions while in liver 3.62%


Integrative Analysis

Loading the IGV session:

  1. What are the methylation states of the promoters and the gene body in general (high, low)?

The gene body is Highly methylated region and the promoters are the unmethylated regions

  1. Which states from DNA methylation segmentation (MethylSeekR) overlap with chromatin segmentation (ChromHMM).

PMD: Enhancer region UMR: Promoter, Weak TSS HMR: TX_S(Strong TRanscription)

  1. Do you observe overlap between the accessibility peaks and specific chromatin states? Which ones?

From the plot it is clearly visible that there is an overlap between the accessibility peaks and specific chromatin states. THis is because the annotated chromatin state corresponding to the accessible region Interval_362 is a promoter or a Transcription start site.

  1. Can you find some examples for unexpressed genes. What are the methylation states of their promoters and bodies? Which histone marks are enriched at the promoters and bodies of these genes?

Gene cndp1: No gene expression

  1. Do you observe positive or negative associations between DNA methylation, chromatin accessibility, indi- vidual histone marks and gene expression? Do the association differ in certain types of regions (e.g. promot- ers, gene bodies, enhancers)

DNA methylation is negatively correlated to Chromatin accessibilty. From the figure, it is clearly visible that the HMRs(Highly methylated regions) are the closed chromatin.

Interval_643:UMR:Gene expression occurs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment