Skip to content

Instantly share code, notes, and snippets.

@arq5x

arq5x/hw1.md Secret

Last active March 11, 2024 15:17
Show Gist options
  • Star 4 You must be signed in to star a gist
  • Fork 5 You must be signed in to fork a gist
  • Save arq5x/c0eb84bce2086fbfbe9184668ef87b31 to your computer and use it in GitHub Desktop.
Save arq5x/c0eb84bce2086fbfbe9184668ef87b31 to your computer and use it in GitHub Desktop.
Applied Computational Genomics Homeworks

Homework #1

Setup

For this homework, we will learn a bit about protein coding genes and transcripts in the human genome by working with a GTF file from Ensembl that describes all of the protein coding and non coding genes that have been annotated for human. Now, we know that most have more than one isoform, and accordingly, this GTF reports information about the exons, introns, and UTRs of each transcript for each gene. But first, a couple of warm-up questions.

Question #1

Create a new directory in your home directory called homework-01. Navigate into that directory. Show your work by providing the command you used.

Question #2

Using a UNIX pipe, write a command that counts how many directories are in the /home directory on malibu. Show your work by providing the command you used.

Question #3

Unix computers have a convenient file located at /usr/share/dict/words that stores all (or at least the majority) of words in the English language. What is the 55,000th word in that file? Show your work by providing the command you used.

Question #4

Using a Unix command, how many visible and imvisible characters are in the 55,000th word? Show your work by providing the command you used.

Collecting the data we will use for the rest of this homework

Let's start by downloading this GTF file from Ensemble using a new UNIX command called curl. curl transfers data from a remote file (i.e., on an FTP or HTTP site) directly to your terminal. In other words, it is the command line way of going to a website and downloading something. This command is super useful because most real world analyses require getting data or annotations from other resources such as the UCSC genome browser.

So, you must start by downloading this file. Note that we use the ">" to redirect the data retrieved by curl into a new file called human.genes.gtf.gz

curl ftp://ftp.ensembl.org/pub/release-87/gtf/homo_sapiens/Homo_sapiens.GRCh38.87.gtf.gz > human.genes.gtf.gz

Now, notice that the filename ends in ".gz". This means that it is compressed with a program called gzip. If you have ever used the compression utilities on your Mac or Windows machine, you have likely seen files that end with ".zip". This is the exact same concept, but ".gz" file were compressed with a slightly different tool. So, before we get started, we have to first "decompress" this file into a plain old text file (in GTF format as we discuss in class) that we can easily work with using the UNIX commands we have learned thus far. We do this as follows using yet another new UNIX command called gzip. The -d option tells gzip to decompress the file.

gzip -d human.genes.gtf.gz

Now you should see that the file is now called "human.genes.gtf" --- gzip automatically removed the ".gz" extension once it finished decompressing the file.

Question #5

How many GTF data lines are in this file? Note that the first few lines in the file beginning with "#" are so-called "header" lines describing thing like the creation date, the genome version (more on that later in the course), etc. Header lines should not be counted as data lines.

Show your work by providing the command you used.

Question #6

How many GTF data lines are in this file for protein coding genes? Note that entries (lines in the file for protein coding genes will contain the following text: gene_biotype "protein_coding"

Use the string above with a command you have learned to find such lines.

Hint: the UNIX pipe may come in handy here...

Show your work by providing the single command you used.

Question #7

How many GTF data lines are in this file for exons from protein coding genes?

Show your work by providing the single command you used.

Question #8

How many CDS exons ("CDS" in column 3) from protein coding genes exist on per each chromosome (column 1 in the GTF file)? That is, the count per chromsome.

Show your work by providing the command or commands you used.

Question #9

Explain how you might design an analysis of this file that would reveal how many distinct protein coding genes there are in the human genome. Hint: you may not have learned all of the command you might need - the point is to think about what what you could do with the commands you know of and what limitations would have to be addressed.

Bonus

Using grep and the Unix words file from above, write a command that returns only five-letter words. Hint: it is okay to do some googling.

Homework #2

Setup

First, we need to download the FASTA file for chr22 from build 38 of the human genome.

curl https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr22.fa.gz > chr22.fa.gz
gzip -d chr22.fa.gz

Question #1

How many G or C nucleotides (the GC content) are there on chr22? Show your work by providing the command(s) you used.

Question #2

First read this tutorial on using nano (https://linuxize.com/post/how-to-use-nano-text-editor/). Nano is a text editor for Unix that allows you to create and edit text files. You can skip the installation section, as nano is already installed on malibu. Using nano, you will add all of the commands from question 1 into a new file in nano and save it as gc_content.sh. Once saved, you have created your first shell script. A shell script is a series of one or more Unix commands that can be run as an automated "recipe". After being saved, you can run the shell script by typing bash gc_content.sh on the command line. Create the script, run it, and provide both the command you ran and the output.

Question #3

Recall question #8 from Homework 1. Create a UNIX pipeline using sort and uniq that answers it with a single command. Refer to the slides at the end of the "Pattern searching in the human genome" deck for a brief tutorial on using sort and uniq. There are also many examples of this online. Show your work by providing the single command you used.

Question #4

How many lines in the chr22 file have exactly 15 cytosines? Show your work by providing the command(s) you used.

Question #5

How many lines in the chr22 file have >=15 cytosines? Note: checkout the -n option for the sort command. Show your work by providing the command(s) you used.

Question #6

What fraction of chr22 is unknown sequence (Ns?) Show your work by providing the command(s) you used.

Question #7

Restriction enzymes cleave DNA molecules at or near a specific sequence of bases. For example, the HindIII (Hin D 3) enzyme cuts at the "/" in this motif: 5'A/AGCTT3'. How many perfectly matching HindIII restriction enzyme cut sites are there on chr22? Don't worry about sites that span two lines - just care about sites that are fully contained on a single line.

Question #8

Given the chr22 fasta file we have used for the rest of this homework, devise a conceptual strategy for using Unix commands you know of so far (and possibly a wish list of methods that you are not yet aware of) to describe how you would conduct an in silico digest of chr22 using HindIII. That is, if HindIII cuts at the sites described above, how could you predict exactly the sequences and their lengths that would result from chr22?

Bonus (not req'd but worth extra points)

(1 pt) 1. How many distinct gaps (continuous runs of Ns) are there on chromosome 22? Show your work.

Homework #3

Installing new software!!!

One of the biggest barriers for new genomics researchers is the confusion and complexity associated with installing new pieces of software to run on UNIX platforms. The basic workflow is as follows:

# Download the code for the software with curl or wget, or git.
# Compile the code into an executable (a.k.a. binary) program
# Copy the executable (e.g., seqtk, samtools, bedtools, etc.) into a directory that is in your PATH

To get your feet wet, we will install both seqtk for use in this homework. First, let's create a new directory in your home directory called "src". This is where we will download and compile all of the software we download. Like "bin" for your binaries, "src" is a traditional name for the directory we create for storing custom software installations.

cd ~
mkdir src
cd src

###Installing seqtk

Download the source code from github (this is where most software is hosted now) using the git command.

git clone https://github.com/lh3/seqtk

This creates a new directory called "seqtk". Navigate into it.

cd seqtk

Now we compile the source code for the seqtk binary with the make command, which runs a series of instructions for building the program that are outlined in the file called "Makefile" in the seqtk directory. You may see some warnings. You can ignore them.

make

This should have created a new binary called "seqtk". Let's check.

ls

Now copy this binary to the bin directory in your home directory. To copy files in UNIX we use cp, where the convention is cp FROM TO:

cp seqtk ~/bin/

You should now be able to run seqtk from any directory since the "~/bin" directory is in your PATH.

seqtk

Homework setup

Create a new directory in your home directory with the prosaic name "hw4". Move into that directory. Now we will download real FASTQ files resulting from an Illumina paired-end sequencing run of human genomic DNA. Since this is from a paired-end sequencing run, there are two files --- one for each end of each DNA fragment. As such, the two files are named 1805.1.fq and 1805.2.fq.

curl https://home.chpc.utah.edu/~u1007787/1805.1.fq > 1805.1.fq
curl https://home.chpc.utah.edu/~u1007787/1805.2.fq > 1805.2.fq

Conveniently, the sequence for each end of each fragment is consistently ordered in each file. For example, let's look at the first 12 lines (3 sequences as each sequence record occupies 4 lines) of each file:

@C19G9ACXX:3:1101:1147:71334/1
TGGCTCGAAGCGGGCACTGGCGATCTTGGCCACGGACAGCTGCTCATGGT
+
=BCDGE;DFBG=FEGDEFDEE>EDFGFBFFDAA;EEB;<A?AABCCB??@
@C19G9ACXX:3:1101:1232:59804/1
TTTTAATTTTTCAATGATTTCATCAATAATATTAGCAATAGCTATTTTCA
+
ABCCDFEEEEEFEFEGEEFFGEEGECDCEECDFBFGEEDDGGGDEFFEEC
@C19G9ACXX:3:1101:1236:78358/1
CTTCCTCTTCTTCACCTCCCTGCTGCAGCACTTCAGCTTCTCCGTGGCCG
+
ADCCFGEGEGGDGDFFFGFFGEFGGG=EFEEFFFCBFGFGGGE?B@BED=
@C19G9ACXX:3:1101:1147:71334/2
GTCAGCACAACCTGGACATCGAGTGTCCCATGTACACCAACCTCAGTCGT
+
>CBB@EEDDFCFGDBEDDDG>CGAEBFEEEDF<DEEEEEGFFGGCFBE=?
@C19G9ACXX:3:1101:1232:59804/2
ATTACAGAAAATGATATACAAATTGCATTAGATGATGCCAAAATCAACTT
+
@BCBCDDEFFDDEDDDDCEEFFEFGGEEFCGEDFEDGGFEFFGEGEFEFD
@C19G9ACXX:3:1101:1236:78358/2
GATCCCAACGAGGGCGTGAGCAGGGGACCCGAGTTGGAACTACCACATTG
+
@CBCEEDFE:CEFFB>B7CDFCGBFFEFFF>ED@DCGEEEFDEDDDCBA<

Notice that aside from the /1 and /2 the sequence IDs for each record are identical in each file, indicating that they came from the two 5' ends of the same clonally amplified clusters on the Illumina flowcell. This makes life easier for us --- remember, sorting is good!

Now, let's get to the fun part. Please note that many of these questions will need to be answered with a mixture of standard UNIX commands you have learned thus far, but also with the seqtk toolkit that we just installed. Please consult their documentation websites and help menus (e.g., just type seqtk then Enter) for ideas of how to answer the questions. Note that seqtk has many subcommands (e.g., seqtk comp) and one can get further information about what the subcommands do and what their output means by typing the subcommand followed by -h (e.g., seqtk comp -h).

Question #1

First, a quick question about DNA sequencing technologies. If you wanted to find all single-nucleotide polymorphism in a baby's genome with the fewest errors, what modern DNA sequencing technology would you choose? Why?

Question #2

Is the length of every sequence in the FASTQ files the same? Show your work.

Question #3

How many nucleotides were sequenced in total for this experiment? Show your work.

Question #4

What is the overall GC content of the FASTQ files? Show your work.

Question #5

How does the average Phred quality score for the first read (sequence) position compare to the average Phred quality score for the last position? Why is this? Show your work.

Question #6

Offline, I aligned these fastq files to build 37 human genome using bwa mem. The result is a SAM that you can download with the following command.

curl https://home.chpc.utah.edu/~u1007787/1805.sam > 1805.sam

Convert the SAM file to a BAM file named 1805.bam with samtools. Then use samtools to sort the BAM file by chromosome and name the sorted file 1805.sorted.bam. Then index the sorted BAM file with samtools. Show your work.

Question #7

What is the most common mapping quality (MAPQ) for the alignments in the BAM file? How many alignments have that mapping quality? What does that mapping quality reflect in terms of the estimated probability that the mapping location is wrong?

Question #8

The samtools mpileup command reports the depth of coverage and the alleles observed at each position in the genome. Using this command, figure out which column of the output represents the depth of sequencing coverage and use the UNIX sort command to report which two positions in the genome have the highest depth of sequencing coverage in this BAM file. You will want to read up on how to use the UNIX sort command to sort data by a specific column (in this case, the depth column of the output of samtools mpileup): https://www.geeksforgeeks.org/sort-command-linuxunix-examples/#:~:text=%2Dk%20Option%3A%20Unix%20provides%20the,sort%20on%20the%20second%20column.

Question #9

Use the scp command to transfer the sorted BAM file and index to your OSX or Windows Desktop.

Load this BAM file into IGV using the File>Load from URL option.

Using IGV only, what is the total aligned sequencing depth on chromosome 2, position 21,236,251? How many G alleles and T alleles were observed?

Question #10

Given the alleles observed on chromosome 2, position 21,236,251, do you think this is evidence of a genetic variant in this individual? Why or why not?

Overview

The goal of this homework is to hone your skills working with both UNIX and R for data processing, and analysis. Submit your answers as an HTML output of an RMarkDown file named unid.hw3.html. You can learn about using RMarkdown for creating "computable documents" in this series

UNIX

1 (1 pt). Using curl or wget, download the RNA-seq cDNA counts file we used on the Feb 9 to your HOME directory on malibu. Rename it to genecounts.tsv. Show your work. https://raw.githubusercontent.com/quinlan-lab/sllobs-biostats/master/data/lecture-03/airway_scaledcounts.subset.euro.fixed.tsv

2 (2 pts). This dataset includes genes where at least one of the treated or control replicates has 0 read counts or the data are "NA" (not available). One can make an argument that such genes should be excluded from analyss because they are likely problematic. Use grep to exclude these lines (be careful that you are excluding what you want to exclude) and save the results as genecounts.clean.tsv. Show your work.

Interlude (not graded). Transfer genecounts.clean.tsv from malibu to your computer (you choose the directory) with scp for Mac users and Putty SCP for Windows users. For example, the command below will copy the genecounts.clean.tsv from your home directory on malibu to your Desktop on an OSX machine. You need to replace UNID with your UNID and you will be prompted for your password.

scp UNID@malibu.genetics.utah.edu:~/genecounts.clean.tsv ~/Desktop/

RStudio

3 (0.5 pt). In Rstudio, create a new R markdown document called rnaseq-analysis.R. Establish your working directory to be the same directory to which you scped the file above. Show your work.

4 (1.0 pt). Load genecounts.clean.tsv into a data frame. Remember that you have a header line and that the decimal indicator is ,. Show your work.

5 (0.5 pt). How many genes are in the data frame? Show your work.

6 (2 pts). Use R functions and named data frame columns to find the gene with the max cDNA count in the control_2 condition. Hint, you might need to google how to find rows in a data frame that match a condition. This is also a nice resource. Show your work.

7 (1 pt). What are the total cDNA counts for each condition? Show your work.

8 (1 pt). Make a scatter plot of the mean expression for the control condition (x) and the mean expression for the treated condition (y). Use abline to plot y=x as a comparison. Show your work.

9 (1 pt). Comment on how (or not) the total cDNA counts for each condition relate to the pattern you see in the scatter plot from #7?

10 (2 pts). Plot the mean cDNA counts for all four conditions (x-axis) versus the log2 ratio of the treated / control (y-axis). Please describe what pattern do you observe and comment on why you think you are seeing this pattern? You may be able to see the pattern more clearly if you restrict the x-axis range to genes with a mean cDNA count of <10000. Hint for this problem: you will need to read up on and use the rowMeans function. Show your work.

Homework #4

Installing new software!!!

One of the biggest barriers for new genomics researchers is the confusion and complexity associated with installing new pieces of software to run on UNIX platforms. The basic workflow is as follows:

# Download the code for the software with curl or wget, or git.
# Compile the code into an executable (a.k.a. binary) program
# Copy the executable (e.g., samtools, bedtools, etc.) into a directory that is in your PATH

To get your feet wet, we will install both seqtk and bioawk for use in this homework. First let's create a new directory in your home directory called "src". This is where we will download and compile all of the software we download. Like "bin" for your binaries, "src" is a traditional name for the directory we create for storing custom software installations.

cd ~
mkdir src
cd src

###Installing seqtk

Download the source code from github (this is where most software is hosted now) using the git command.

git clone https://github.com/lh3/seqtk

This creates a new directory called "seqtk". Navigate into it.

cd seqtk

Now we compile the source code for the seqtk binary with the make command, which runs a series of instructions for building the program that are outlined in the file called "Makefile" in the seqtk directory. You may see some warnings. You can ignore them.

make

This should have created a new binary called "seqtk". Let's check.

ls

Now copy this binary to the bin directory in your home directory. To copy files in UNIX we use cp, where the convention is cp FROM TO:

cp seqtk ~/bin/

You should now be able to run seqtk from any directory since the "~/bin" directory is in your PATH.

seqtk

###Installing bioawk

The pattern is the exact same - we are just getting source code from a different "repository" on Github.

cd ~/src
git clone https://github.com/lh3/bioawk
cd bioawk
make
cp bioawk ~/bin/

Homework setup

There is a new directory in your home directory with the prosaic name "hw3". Therein lies real live FASTQ files resulting from an Illumina paired end sequencing run of genomic DNA from C. elegans (details). Since this is from a paired-end sequencing run, there are two files --- one for each end of each DNA fragment. As such, the two files are named SRR3750603_1.fastq and SRR3750603_2.fastq.

# for reproducibility:
curl ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR375/003/SRR3750603/SRR3750603_1.fastq.gz | zcat | head -n 1000000 | gzip > SRR3750603_1.fastq.gz
curl ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR375/003/SRR3750603/SRR3750603_2.fastq.gz | zcat | head -n 1000000 | gzip > SRR3750603_2.fastq.gz
gunzip SRR3750603_1.fastq.gz &
gunzip SRR3750603_2.fastq.gz &

Conveniently, the sequence for each end of each fragment is consistently ordered in each file. For example, let's look at the first 12 lines (3 sequences as each sequence record occupies 4 lines) of each file:

$ head -n 12 SRR3750603_1.fastq
@SRR3750603.1 HWI-D00372:204:HA3E2ADXX:1:1101:1384:2116/1
NTCGGAACATTTTTTCTTCAAAAATATGAAAAATCACCTAATTTATCTGAAAATGACATTTANNNCAGTNNNNNNNATTGGGAAAGTGCTCGATTTNCGGA
+
#0<FFFFFFFFFFIIIIIIFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII###00<B#######00<BBFFFFFBFFFFFFFFF#00BB
@SRR3750603.2 HWI-D00372:204:HA3E2ADXX:1:1101:1290:2192/1
TGTAATTTACTTTGTTCAGTTAGACTCTTAATTAGACTAAAAACGGTCTCAAAAAGTATAATTTCATAATGAGACACCTTTAAAAATTCTACGTTTTTATG
+
BBBFFFFFFBFFFFIIIIIIIIFFFIFFIIFIIIIIFFFFIIFFFIIFFIIIIFIIFFFIFFFIFIIIIIIIIIFFFFFFFFFFFFFFBBFBBBFFFFFFF
@SRR3750603.3 HWI-D00372:204:HA3E2ADXX:1:1101:1256:2198/1
AGTTTTCTCAAACACAGAAAACATATGGGAGTTTCTCAAACAATGGACAATGAGTGATCACCGATATTTGATACAAATCGACCAACTCGGCTCATATTCTC
+
BBBFFFFFFFFFFFBFFFFIIFFFIIIIIFFFFFBFFFFFFFIIIIIFFIIIIIIIIIFFFFFFIFFFFFFFFFBFFFBBFB<<BBB7<B7B7BBBFF<B<

$ head -n 12 SRR3750603_2.fastq
@SRR3750603.1 HWI-D00372:204:HA3E2ADXX:1:1101:1384:2116/2
TCTNNAATAAAATTTATTTCAGCTGGCATATTTGTGAAAATTTTCCGAAAATCGAGCACTTTCCCAATTATTTCAACTGAAATAAATGTCATTTTCAGATA
+
<<<##0<BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB<BBBBBBBBBBBBBBBBBB<B<BBBBBBBBB<<<<<<<
@SRR3750603.2 HWI-D00372:204:HA3E2ADXX:1:1101:1290:2192/2
TTTGAAAAATAACTTTTAAGAAAACCTCTTTCCTCTATTAGTAAGGTAATTTAACGATTTTAAGAAAACGTTAAAACCAATTCTCTAAAAAAATTTCAAAA
+
BBBFFFBFFFFBFFFIIFFIIIFFB<BFFFFFFFBFFFIFIFIFIIIIIIIIIFBFFFIIIFFIIIFFFFFIIFFFB<FBBF<BBBFBFBB<<BFFBBBBB
@SRR3750603.3 HWI-D00372:204:HA3E2ADXX:1:1101:1256:2198/2
TTGATTTTTTCATAAATAATTTACAATCTATCACATGAACCTTTGAACAAAAAGAAGCGCTCCTGGAGAATATGAGCCGAGTTGGTCGATTTGTATCAAAT
+
BBBFFFFFFFBFFFFFFFFFIIIFFFIFFFIFFFFFFFIFFFFFIFFBBBFIIIIIIBFBFBFFFFBFFFFFFFFFFBBBFFBFFFBBBFFFFFFFBBBFB

Notice that aside from the /1 and /2 the sequence IDs for each record are identical in each file, indicating that they came from the two 5' ends of the same clonally amplified clusters on the Illumina flowcell. This makes life easier for us --- remember, sorting is good!

Now, let's get to the fun part. Please note that many of these questions will need to be answered with a mixture of standard UNIX commands you have learned thus far, but also with the seqtk and bioawk toolkits that we just installed. Please consult their documentation websites and help menus (e.g., just type seqtk then Enter) for ideas of how to answer the questions. Note that seqtk has many subcommands (e.g., seqtk comp) and one can get further information about what the subcommands do and what their output means by typing the subcommand followed by -h (e.g., seqtk comp -h).

Question #1

If you wanted to find all SNPs in a baby's genome with the fewest errors, what modern DNA sequencing technology would you choose? Why?

Question #2

You sequence 1000 genomes from individuals in Tibet and 1000 individuals in Hawaii. You observe 10,000 SNPs where the alternate allele is present on 40% of Tibetan chromosomes, yet only 1% of Hawaiian chromosomes. What processes could explain this observation?

Question #3

How does a mutation become a polymorphism?

Question #4

For giggles, imagine we raised a bunch of money to sequence TA Jon's genome. We find all the genetic variation in his genome and compare it to all the known genetic variants in the human population and find that 1,356 variants observed in Jon's genome have never been observed before. Are they all de novo mutations in his genome? Why or why not?

Question #5

How many paired end sequences resulted from this run? Show your work.

Question #6

How many nucleotides were sequenced in total for this experiment? Show your work.

Question #7

What is the overall GC content of the FASTQ files? How does it compare to the GC content of the C. elegans genome (you will need to look this up). Show your work.

Question #8

How does the average Phred quality score for the first read (sequence) position compare to the average Phred quality score for the last position? Why is this? Show your work.

Question #9

Is the length of every sequence in the FASTQ files the same? Show your work.

Question #10

What is the longest mononucleotide run in any sequence? Show your work.

Question #11

In class we talked about the fact that PCR amplification can cause dupicate DNA fragments (PCR duplicates) in the resulting sequencing data. How many PCR duplicates do you find in these data? Note that a PCR duplicate would have to be identified by inspecting both ends of each sequenced fragment. What fraction of the sequenced fragments were PCR duplicates? Hint: you will need to use bioawk for this question.

Note that while not strictly necessary to answer this question, the UNIX paste command may come in handy. While it can do many different things (http://www.theunixschool.com/2012/07/10-examples-of-paste-command-usage-in.html), but it's in simplest form paste places the lines from two files side by side one another:

$ cat a.txt
yuge
loud
head

$ cat b.txt
small
quiet
tail

$ paste a.txt b.txt
yuge    small
loud    quiet
head    tail

Question #1

  1. Let's pretend that the SARS-CoV-2 genome is exactly 10,000 base pairs long. Research around the world has converged on a mutation rate (i.e., the probability of seeing a mutation) of 2 X 10^-4. What random process could you use to model the count of observed mutations given this rate? What is the probability of observing at least 1 mutation given the genome size and the mutation rate?

Question #2

  1. Let's assume that the number of aligned cDNA molecules in an RNA-seq experiment at a given gene is a Poisson random variable. Your lab is obsessed with the UVLOVER gene and the mean number of reads to UVLOVER in skin from the ear lobe cell type is 10. You do RNA-seq on skin collected from a control individual without any sun exposure in the last week (quarantine) and also an experimental individual who sunbathes her earlobes daily. You observe 18 reads aligned to UVLOVER in the experimental individual and 9 in the control.

    2a. Using your intuition, do you think this is likely to reflect statistically-significant differential expression in UVLOVER between the two individuals?

    2b. Write a simulation in R that computes how likely it to see at least a 2:1 difference in expression by chance in either direction (i.e., 2:1 contol:exp or 2:1 exp:control) if the mean expression for UVLOVER is 10 and cDNA sampling is a Poisson process. What is the probability of seeing this by chance? Show you work.

    2c. What is the probability of seeing at least a 4 fold expression difference by chance if the mean expression is 30?

Installing new software!!!

One of the biggest barriers for new genomics researchers is the confusion and complexity associated with installing new pieces of software to run on UNIX platforms. The basic workflow is as follows:

# Download the code for the software with curl or wget, or git.
# Compile the code into an executable (a.k.a. binary) program
# Copy the executable (e.g., samtools, bedtools, etc.) into a directory that is in your PATH

To get your feet wet, we will install both seqtk and bioawk for use in this homework. First let's create a new directory in your home directory called "src". This is where we will download and compile all of the software we download. Like "bin" for your binaries, "src" is a traditional name for the directory we create for storing custom software installations.

cd ~
mkdir src
cd src

###Installing seqtk

Download the source code from github (this is where most software is hosted now) using the git command.

git clone https://github.com/lh3/seqtk

This creates a new directory called "seqtk". Navigate into it.

cd seqtk

Now we compile the source code for the seqtk binary with the make command, which runs a series of instructions for building the program that are outlined in the file called "Makefile" in the seqtk directory. You may see some warnings. You can ignore them.

make

This should have created a new binary called "seqtk". Let's check.

ls

Now copy this binary to the bin directory in your home directory. To copy files in UNIX we use cp, where the convention is cp FROM TO:

cp seqtk ~/bin/

You should now be able to run seqtk from any directory since the "~/bin" directory is in your PATH.

seqtk

###Installing bioawk

The pattern is the exact same - we are just getting source code from a different "repository" on Github.

cd ~/src
git clone https://github.com/lh3/bioawk
cd bioawk
make
cp bioawk ~/bin/

Homework setup

Make a new directory in your home directory with the prosaic name "hw4". Therein lies real live FASTQ files resulting from an Illumina paired end sequencing run of genomic DNA from C. elegans (details). Since this is from a paired-end sequencing run, there are two files --- one for each end of each DNA fragment. As such, the two files are named SRR3750603_1.fastq and SRR3750603_2.fastq.

curl ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR375/003/SRR3750603/SRR3750603_1.fastq.gz | zcat | head -n 100000 | gzip > SRR3750603_1.fastq.gz
curl ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR375/003/SRR3750603/SRR3750603_2.fastq.gz | zcat | head -n 100000 | gzip > SRR3750603_2.fastq.gz
gunzip SRR3750603_1.fastq.gz &
gunzip SRR3750603_2.fastq.gz &

Conveniently, the sequence for each end of each fragment is consistently ordered in each file. For example, let's look at the first 12 lines (3 sequences as each sequence record occupies 4 lines) of each file:

$ head -n 12 SRR3750603_1.fastq
@SRR3750603.1 HWI-D00372:204:HA3E2ADXX:1:1101:1384:2116/1
NTCGGAACATTTTTTCTTCAAAAATATGAAAAATCACCTAATTTATCTGAAAATGACATTTANNNCAGTNNNNNNNATTGGGAAAGTGCTCGATTTNCGGA
+
#0<FFFFFFFFFFIIIIIIFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII###00<B#######00<BBFFFFFBFFFFFFFFF#00BB
@SRR3750603.2 HWI-D00372:204:HA3E2ADXX:1:1101:1290:2192/1
TGTAATTTACTTTGTTCAGTTAGACTCTTAATTAGACTAAAAACGGTCTCAAAAAGTATAATTTCATAATGAGACACCTTTAAAAATTCTACGTTTTTATG
+
BBBFFFFFFBFFFFIIIIIIIIFFFIFFIIFIIIIIFFFFIIFFFIIFFIIIIFIIFFFIFFFIFIIIIIIIIIFFFFFFFFFFFFFFBBFBBBFFFFFFF
@SRR3750603.3 HWI-D00372:204:HA3E2ADXX:1:1101:1256:2198/1
AGTTTTCTCAAACACAGAAAACATATGGGAGTTTCTCAAACAATGGACAATGAGTGATCACCGATATTTGATACAAATCGACCAACTCGGCTCATATTCTC
+
BBBFFFFFFFFFFFBFFFFIIFFFIIIIIFFFFFBFFFFFFFIIIIIFFIIIIIIIIIFFFFFFIFFFFFFFFFBFFFBBFB<<BBB7<B7B7BBBFF<B<

$ head -n 12 SRR3750603_2.fastq
@SRR3750603.1 HWI-D00372:204:HA3E2ADXX:1:1101:1384:2116/2
TCTNNAATAAAATTTATTTCAGCTGGCATATTTGTGAAAATTTTCCGAAAATCGAGCACTTTCCCAATTATTTCAACTGAAATAAATGTCATTTTCAGATA
+
<<<##0<BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB<BBBBBBBBBBBBBBBBBB<B<BBBBBBBBB<<<<<<<
@SRR3750603.2 HWI-D00372:204:HA3E2ADXX:1:1101:1290:2192/2
TTTGAAAAATAACTTTTAAGAAAACCTCTTTCCTCTATTAGTAAGGTAATTTAACGATTTTAAGAAAACGTTAAAACCAATTCTCTAAAAAAATTTCAAAA
+
BBBFFFBFFFFBFFFIIFFIIIFFB<BFFFFFFFBFFFIFIFIFIIIIIIIIIFBFFFIIIFFIIIFFFFFIIFFFB<FBBF<BBBFBFBB<<BFFBBBBB
@SRR3750603.3 HWI-D00372:204:HA3E2ADXX:1:1101:1256:2198/2
TTGATTTTTTCATAAATAATTTACAATCTATCACATGAACCTTTGAACAAAAAGAAGCGCTCCTGGAGAATATGAGCCGAGTTGGTCGATTTGTATCAAAT
+
BBBFFFFFFFBFFFFFFFFFIIIFFFIFFFIFFFFFFFIFFFFFIFFBBBFIIIIIIBFBFBFFFFBFFFFFFFFFFBBBFFBFFFBBBFFFFFFFBBBFB

Notice that aside from the /1 and /2 the sequence IDs for each record are identical in each file, indicating that they came from the two 5' ends of the same clonally amplified clusters on the Illumina flowcell. This makes life easier for us --- remember, sorting is good!

Now, let's get to the fun part. Please note that many of these questions will need to be answered with a mixture of standard UNIX commands you have learned thus far, but also with the seqtk and bioawk toolkits that we just installed. Please consult their documentation websites and help menus (e.g., just type seqtk then Enter) for ideas of how to answer the questions. Note that seqtk has many subcommands (e.g., seqtk comp) and one can get further information about what the subcommands do and what their output means by typing the subcommand followed by -h (e.g., seqtk comp -h).

Question #3

How many paired end sequences resulted from this run? Show your work.

Question #4

How many nucleotides were sequenced in total for this experiment? Show your work.

Question #5

What is the overall GC content of the FASTQ files? How does it compare to the GC content of the C. elegans genome (you will need to look this up). Show your work.

Question #6

How does the average Phred quality score for the first read (sequence) position compare to the average Phred quality score for the last position? Why is this? Show your work.

Question #7

Is the length of every sequence in the FASTQ files the same? Show your work.

Question #8

What is the longest mononucleotide run in any sequence? Show your work.

Question #9

In class we talked about the fact that PCR amplification can cause dupicate DNA fragments (PCR duplicates) in the resulting sequencing data. How many PCR duplicates do you find in these data? Note that a PCR duplicate would have to be identified by inspecting both ends of each sequenced fragment. What fraction of the sequenced fragments were PCR duplicates? Hint: you will need to use bioawk for this question.

Note that while not strictly necessary to answer this question, the UNIX paste command may come in handy. While it can do many different things (http://www.theunixschool.com/2012/07/10-examples-of-paste-command-usage-in.html), but it's in simplest form paste places the lines from two files side by side one another:

$ cat a.txt
yuge
loud
head

$ cat b.txt
small
quiet
tail

$ paste a.txt b.txt
yuge    small
loud    quiet
head    tail

Question #10

If you wanted to find all SNPs in a baby's genome with the fewest errors, what modern DNA sequencing technology would you choose? Why?

Homework #4

Installing new software!!!

One of the biggest barriers for new genomics researchers is the confusion and complexity associated with installing new pieces of software to run on UNIX platforms. The basic workflow is as follows:

# Download the code for the software with curl or wget, or git.
# Compile the code into an executable (a.k.a. binary) program
# Copy the executable (e.g., seqtk, samtools, bedtools, etc.) into a directory that is in your PATH

To get your feet wet, we will install both seqtk and bioawk for use in this homework. First, let's create a new directory in your home directory called "src". This is where we will download and compile all of the software we download. Like "bin" for your binaries, "src" is a traditional name for the directory we create for storing custom software installations.

cd ~
mkdir src
cd src

###Installing seqtk

Download the source code from github (this is where most software is hosted now) using the git command.

git clone https://github.com/lh3/seqtk

This creates a new directory called "seqtk". Navigate into it.

cd seqtk

Now we compile the source code for the seqtk binary with the make command, which runs a series of instructions for building the program that are outlined in the file called "Makefile" in the seqtk directory. You may see some warnings. You can ignore them.

make

This should have created a new binary called "seqtk". Let's check.

ls

Now copy this binary to the bin directory in your home directory. To copy files in UNIX we use cp, where the convention is cp FROM TO:

cp seqtk ~/bin/

You should now be able to run seqtk from any directory since the "~/bin" directory is in your PATH.

seqtk

Homework setup

Create a new directory in your home directory with the prosaic name "hw4". Move into that directory. Now we will download real FASTQ files resulting from an Illumina paired-end sequencing run of human genomic DNA. Since this is from a paired-end sequencing run, there are two files --- one for each end of each DNA fragment. As such, the two files are named 1805.1.fq and 1805.2.fq.

wget https://home.chpc.utah.edu/~u1007787/hw4/1805.1.fq
wget https://home.chpc.utah.edu/~u1007787/hw4/1805.2.fq 

Conveniently, the sequence for each end of each fragment is consistently ordered in each file. For example, let's look at the first 12 lines (3 sequences as each sequence record occupies 4 lines) of each file:

@C19G9ACXX:3:1101:1147:71334/1
TGGCTCGAAGCGGGCACTGGCGATCTTGGCCACGGACAGCTGCTCATGGT
+
=BCDGE;DFBG=FEGDEFDEE>EDFGFBFFDAA;EEB;<A?AABCCB??@
@C19G9ACXX:3:1101:1232:59804/1
TTTTAATTTTTCAATGATTTCATCAATAATATTAGCAATAGCTATTTTCA
+
ABCCDFEEEEEFEFEGEEFFGEEGECDCEECDFBFGEEDDGGGDEFFEEC
@C19G9ACXX:3:1101:1236:78358/1
CTTCCTCTTCTTCACCTCCCTGCTGCAGCACTTCAGCTTCTCCGTGGCCG
+
ADCCFGEGEGGDGDFFFGFFGEFGGG=EFEEFFFCBFGFGGGE?B@BED=
@C19G9ACXX:3:1101:1147:71334/2
GTCAGCACAACCTGGACATCGAGTGTCCCATGTACACCAACCTCAGTCGT
+
>CBB@EEDDFCFGDBEDDDG>CGAEBFEEEDF<DEEEEEGFFGGCFBE=?
@C19G9ACXX:3:1101:1232:59804/2
ATTACAGAAAATGATATACAAATTGCATTAGATGATGCCAAAATCAACTT
+
@BCBCDDEFFDDEDDDDCEEFFEFGGEEFCGEDFEDGGFEFFGEGEFEFD
@C19G9ACXX:3:1101:1236:78358/2
GATCCCAACGAGGGCGTGAGCAGGGGACCCGAGTTGGAACTACCACATTG
+
@CBCEEDFE:CEFFB>B7CDFCGBFFEFFF>ED@DCGEEEFDEDDDCBA<

Notice that aside from the /1 and /2 the sequence IDs for each record are identical in each file, indicating that they came from the two 5' ends of the same clonally amplified clusters on the Illumina flowcell. This makes life easier for us --- remember, sorting is good!

Now, let's get to the fun part. Please note that many of these questions will need to be answered with a mixture of standard UNIX commands you have learned thus far, but also with the seqtk toolkit that we just installed. Please consult their documentation websites and help menus (e.g., just type seqtk then Enter) for ideas of how to answer the questions. Note that seqtk has many subcommands (e.g., seqtk comp) and one can get further information about what the subcommands do and what their output means by typing the subcommand followed by -h (e.g., seqtk comp -h).

Question #1

First, a quick question about DNA sequencing technologies. If you wanted to find all single-nucleotide polymorphism in a baby's genome with the fewest errors, what modern DNA sequencing technology would you choose? Why?

Question #2

How many paired-end sequences resulted from this run? Show your work.

Question #3

How many nucleotides were sequenced in total for this experiment? Show your work.

Question #4

What is the overall GC content of the FASTQ files? Show your work.

Question #5

How does the average Phred quality score for the first read (sequence) position compare to the average Phred quality score for the last position? Why is this? Show your work.

Question #6

Is the length of every sequence in the FASTQ files the same? Show your work.

Question #7

Offline, I aligned these fastq files to build 37 human genome using bwa mem. The result is a SAM that you can download with the following command.

wget https://home.chpc.utah.edu/~u1007787/1805.sam

Convert the SAM file to a BAM file named 1805.bam with samtools. Then use samtools to sort the BAM file by chromosome and name the sorted file 1805.sorted.bam. Then index the sorted BAM file with samtools. Show your work.

Question #8

What is the most common mapping quality (MAPQ) for the alignments in the BAM file? How many alignments have that mapping quality? What does that mapping quality reflect in terms of the estimated probability that the mapping location is wrong?

Question #9

The samtools mpileup command reports the depth of coverage and the alleles observed at each position in the genome. Using this command, figure out which column of the output represents the depth of sequencing coverage and use the UNIX sort command to report which two positions in the genome have the highest depth of sequencing coverage in this BAM file. You will want to read up on how to use the UNIX sort command to sort data by a specific column (in this case, the depth column of the output of samtools mpileup): https://www.geeksforgeeks.org/sort-command-linuxunix-examples/#:~:text=%2Dk%20Option%3A%20Unix%20provides%20the,sort%20on%20the%20second%20column.

Question #10

The sorted BAM file (and its index) are available at the following URL: https://home.chpc.utah.edu/~u1007787/hw4/1805.sorted.bam

Load this BAM file into IGV using the File>Load from URL option.

Using IGV only, what is the total aligned sequencing depth on chromosome 2, position 21,236,251? How many G alleles and T alleles were observed?

Question #11

Given the alleles observed on chromosome 2, position 21,236,251, do you think this is evidence of a genetic variant in this individual? Why or why not?

Setup

We are going to be working with a VCF file that was create by running the freebayes variant caller on three BAM files that have been sorted and indexed for a father (1847.workshop.bam), mother (1805.workshop.bam), and child (4805.workshop.bam). These alignments come from an exome capture experiment for each sample, but in order to minimize disk usage, these BAMs represent alignments samples from a small subset of the genome.

Let's create and move into a hw5 directory:

mkdir hw5; cd ~/hw5

I started by running freebayes on our family "trio" (mom, dad, kid) using the following command to make a VCF file of genetic variants observed in the family. You do not need to run this command

freebayes -f human_g1k_v37.fasta 1847.workshop.bam 1805.workshop.bam 4805.workshop.bam > trio.vcf &

The result is a VCF file of all the genetic variation observed in this family "trio". Recall that a VCF file has both "header" lines and "data" lines (that is, lines in the file that represent variants observed in the family). The resulting file can be downloaded to malibu with:

wget https://home.chpc.utah.edu/~u1007787/trio.vcf

Using UNIX commands as well as bcftools (installed on Malibu, you will need to refer to the bcftools documentation), answer the following questions.

Question #0

I used samtools sort to sort the alignments in each of the BAM files above by chromosome and chromosome position. Why was this necessary for freebayes to find genetic variation?

Question #1

How many variants did freebayes detect? Show your work.

Question #2

How many variants did freebayes detect with a P(SNP) (i.e., QUAL) greater than 40? Show your work.

Question #3

How many transitions were found, regardless of quality? How many transversions reference? Show your work.

Question #4

Is the ratio of transitions and tranversions what you would expect? Refer to lectures to devise an expectation.

Question #5

How many insertions were found? How many deletions? Show your work.

Question #6

How many SNPs had a total depth (DP INFO field) greater than or equal to 20? Show your work.

Question #7

How many variants had at least one unknown (missing) genotype? Hint: have a look at bcftools view Show your work.

Question #8

How many variants had an allele frequency (AF INFO field) of 1/6? Show your work.

Question #9

How many variants had a QUAL >= 20, an alternate allele count (AC) of 1 (i.e., a HET in one of the 3 samples), and all 3 samples with a called genotype (i.e., allele number or AN=6)? Show your work.

Question #10

How many of the variants from question #8 appear to be de novo mutations (recall that 4805 is the child)? Is this number expected? Why or why not? Show your work.

Question #11

By looking at the predicted de novo mutation variants, can you suggest ways to eliminate potentialy low quality de novo mutations and enrich for real mutations?

Question #12

Let's imagine you sequence the genomes of 1000 people from Salt Lake City. At a particular polymorphic site where the alleles are A and G, you observe 1 A/A homozygote, 250 A/G heterozygotes, and 749 G/G homozygotes. What are the allele frequencies for the A allele (p) and the G allele (q)? Given these allele frequencies, are the genotype frequencies you observe in line with what you would expect under Hardy Weinberg Equlibrium? Why or why not? Hint: refer back to the slides and read this

Homework #4

Step #1

First, complete the samtools tutorial and feel free to play around with the data a bit to get your feet wet.

Step #2

Install IGV on your laptop by following the instructions here. I recommend using the "Java Web Start" version with 1.2Gb of RAM. This should work nicely with both Windows and OSX. Note that if you have any problems you may need to install a newer version of Java on your computer.

Step #3

Complete this excellent IGV tutorial

Step #4

Download the sorted BAM and index file from Amazon to your laptop desktop. These are the same files you created by doing the tutorial.

Question #1

Is the BAM file 'sample.sorted.bam' sorted or unsorted? How do you know?

Question #2

How many alignments are in 'sample.sorted.bam' Show your work.

Question #3

How many alignments in 'sample.sorted.bam' overlap chromosome 1 from position 12398392 to 12399392? Note that this can be done with a single 'samtools view' command making use of the BAM index (sample.sorted.bam.bai). Show your work.

Question #4

The 'sample.sorted.bam' file includes paired-end alignments. For a given pair, it is often the case that the TLEN is positive for one end of the pair and negative for the other end. Why is this?

Question #5

Using only the positive TLEN for each paired-end alignment, what is the min, max, and mode (most frequent) TLEN observed in the BAM file? Show your work.

Question #6

What fraction of the paired-end alignments are "properly paired"? Hint: use the 'samtools flagstat' command. Does this seem like a reasonable fraction for a high quality DNA fragment library.

Question #7

How many alignments are on the positive strand and on the negative strand? Hint: you need to test the FLAG field. Show your work.

Question #8

Using IGV, how many sequences aligned to exon 17 of the ATM gene?

NOTE: make sure you set the reference genome to "Human hg19"

Question #9

At what position is the apparent SNP in exon 17 of the ATM gene? What allele is observed for this individual?

Question #10

Following question #8, what do you think the individual's genotype is at this site based on the alignments? You may need to refer back to earlier lectures if you have forgotten about genotypes.

Question #11

What do you think the individual's genotype is at the SNP in following locus: chr17:41,244,367-41,244,497 What are the Phred quality scores associated with the observed alternate allele? Does this make you more or less confident in your genotype assesment?

Question #12

What is your assessment of the two apparent SNPs at this locus: chr7:55,227,815-55,228,328 Are they any less or more confident than the SNP in question #11? What else can you say about the inheritance of these two SNPs?

Question #13

Is there a SNP at position chr7:55,224,334? Why or why not?

Complete questions 1-7 at the bottom of the bedtools tutorial. Submit the commands and answers you came up with. You will need to use awk and perhaps other UNIX tools to complete the tutorial.

  1. You have sequenced your genome and are in a frantic search (unlcear why) for heterozygous variants in your genome. Let's assume that for a particular locus, the truth is that you are heterozygous A/G. If the chance of sample either allele is 0.5, what is the probability of only observed one of the two alleles in 5 aligned sequences? Show your work.

  2. Same as #1, but what if you had 10 aligned reads? What about for 30 aligned reads? Show your work.

  3. Now, assume that the variant caller you are using is very strict when calling heterozygotes. It requires at least 5 reads supporting a non-reference allele (in this case, A is the reference allele and G is the alternate allele). Again, let's assume that for a locus, the truth is that you are heterozygous A/G and that you have 15 aligned reads. What is the chance of the caller seeing sufficient evidence for the alternate allele such that it makes a variant call? Show your work

  4. Same as number 3, but now assume there is a slight reference bias against the alignment of reads with the alternate allele, such that the probability of "success"fully aligning the alternate allele is 0.45 instead of 0.5. Show your work

  5. Let's pretend that the SARS-CoV-2 genome is exactly 10,000 base pairs long. Research around the world has converged on a mutation rate (i.e., the probability of seeing a mutation) of 2 X 10^-4. What random process could you use to model the count of observed mutations given this rate? What is the probability of observing at least 1 mutation given the genome size and the mutation rate?

  6. Let's assume that the number of aligned cDNA molecules in an RNA-seq experiment at a given gene is a Poisson random variable. Your lab is obsessed with the UVLOVER gene and the mean number of reads to UVLOVER in skin from the ear lobe cell type is 10. You do RNA-seq on skin collected from a control individual without any sun exposure in the last week (quarantine) and also an experimental individual who sunbathes her earlobes daily. You observe 18 reads aligned to UVLOVER in the experimental individual and 9 in the control.

    6a. Using your intuition, do you think this is likely to reflect statistically-significant differential expression in UVLOVER between the two individuals?

    6b. Write a simulation in R that computes how likely it to see at least a 2:1 difference in expression by chance in either direction (i.e., 2:1 contol:exp or 2:1 exp:control) if the mean expression for UVLOVER is 10 and cDNA sampling is a Poisson process. What is the probability of seeing this by chance? Show you work.

    6c. What is the probability of seeing at least a 4 fold expression difference by chance if the mean expression is 30?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment