- State from the beginning that you are focusing on exome
- Talk through the pLI figure in more detail
- Any slide needs to be talked through in detail if you present it
- Make it clear that missense variants are what you are seeing
- Should the observed CpG density be modulated by the codon content of the region?
- Model ideas?
- Exclude singletons???
- Use just synonymous density as the independent variable (a sketch follows this list)
- Molly Przeworski, CpG
- Simplify the explanation of each of the essential gene descriptions
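A minimal sketch of the model idea in the notes above (synonymous density as the only independent variable); the per-region table, its column names, and the ordinary least squares fit are illustrative assumptions, not the model that was actually discussed:

import pandas as pd
import statsmodels.api as sm

# hypothetical per-region table: one row per exome region
regions = pd.read_csv("regions.tsv", sep="\t")

# single independent variable, per the note above
X = sm.add_constant(regions[["synonymous_density"]])
y = regions["observed_missense"]  # or observed CpG density

fit = sm.OLS(y, X).fit()
print(fit.summary())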
genome.txt
chr1 10
go.sh
cat a.bed
chr1 10 50 10
cat b.bed
chr1 20 40 20
cat c.bed
chr1 30 33 30
# Find the sub-intervals shared and unique to each file.
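A minimal sketch (not the go.sh solution) of that partitioning in Python, hard-coding the three intervals shown above; in practice bedtools multiinter does the same thing across many files and chromosomes:

# cut chr1 at every interval boundary and report which files cover each piece
beds = {"a.bed": (10, 50), "b.bed": (20, 40), "c.bed": (30, 33)}

points = sorted({p for se in beds.values() for p in se})
for lo, hi in zip(points, points[1:]):
    covered = [name for name, (s, e) in beds.items() if s <= lo and hi <= e]
    if covered:
        print("chr1", lo, hi, ",".join(sorted(covered)), sep="\t")

Rows listing all three files are the shared sub-intervals; rows listing a single file are unique to that file.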
notes.md
merge-callable.py
#!/usr/bin/env python
"""
given a callable file from goleft depth on stdin, merge adjacent LOW_COVERAGE and CALLABLE regions.
also split regions that are larger than max_region.
this is to even out parallelism for sending regions to freebayes.
"""
from __future__ import print_function, division
import sys
from itertools import groupby
from operator import itemgetter
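The preview stops at the imports; what follows is a minimal sketch of the merge/split the docstring describes, not the script's actual code. The 4-column chrom/start/end/label input format, the max_region default, and the function name are assumptions:

def merge_and_split(lines, max_region=50000):  # default size is a guess
    # keep only LOW_COVERAGE and CALLABLE rows from goleft depth output
    rows = (l.rstrip("\r\n").split("\t") for l in lines)
    keep = [(c, int(s), int(e)) for c, s, e, label in rows
            if label in ("LOW_COVERAGE", "CALLABLE")]
    # merge runs of touching regions on the same chromosome
    merged = []
    for chrom, start, end in keep:
        if merged and merged[-1][0] == chrom and merged[-1][2] == start:
            merged[-1][2] = end
        else:
            merged.append([chrom, start, end])
    # split anything longer than max_region into max_region-sized chunks
    for chrom, start, end in merged:
        for s in range(start, end, max_region):
            yield chrom, s, min(s + max_region, end)

for chrom, start, end in merge_and_split(sys.stdin):
    print("%s\t%d\t%d" % (chrom, start, end))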
gist:0dbc7f40a29c898f79482b98ace232a8
# GeneDx: https://www.genedx.com/test-catalog/disorders/early-onset-epileptic-encephalopathy-andor-infantile-spasms/
# Invitae: https://www.invitae.com/en/physician/tests/03402/
ADSL
ALDH7A1
ALG13
ARHGEF9
ARID1B
ARX
ATP1A2
ATP6AP2 |
run.py
# STEP 1: for all fams, print each SGS region (chrom, start, end)
import sys
import subprocess as sub

def run_jobs(commands):
    """
    This function takes a set of max_work commands and executes
    them with rj
    """
    f = open('rungemini.sh', 'w')
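    # (sketch, not the original code) one way run_jobs could continue: write one
    # command per line to rungemini.sh and then execute it; how the "rj" tool
    # mentioned in the docstring is actually invoked is not shown in this preview.
    for cmd in commands:
        f.write(cmd + "\n")
    f.close()
    sub.check_call(["bash", "rungemini.sh"])  # or hand rungemini.sh to rj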
gist:8d9c2767d2495ba4b9bf6f555d29c088
# make sure all rows have 99 fields
$ awk 'BEGIN{FS="\t"} {print NF}' /uufs/chpc.utah.edu/common/home/u1072926/gemini_queries/all.txt | uniq
99
# how many HET (1) and HOM_ALT (3) genotypes were there?
$ awk 'BEGIN{FS="\t"} {print $99}' /uufs/chpc.utah.edu/common/home/u1072926/gemini_queries/all.txt | sort | uniq -c
# get rid of headers except for the first one
(head -n 1 /uufs/chpc.utah.edu/common/home/u1072926/gemini_queries/all.txt; grep -v gt_types /uufs/chpc.utah.edu/common/home/u1072926/gemini_queries/all.txt)
make-simrep-micsat-bed.sh
# download simple repeats from UCSC and convert to BED
curl -s http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/simpleRepeat.txt.gz \
  | gzcat \
  | cut -f 2-5 \
  > simrep.hg19.bed
# download microsatellites from UCSC and convert to BED
curl -s http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/microsat.txt.gz \
  | gzcat \
  | cut -f 2-5 \
make-humvar-snps.bed
# get HumVar
wget ftp://genetics.bwh.harvard.edu/pph2/training/humvar-2011_12.predictions.tar.gz
tar -zxvf humvar-2011_12.predictions.tar.gz
# get dbSNP
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/snp138.txt.gz
gunzip snp138.txt.gz
# get the deleterious SNPs
grep snp138.txt -wFf <(grep rs humvar-2011_12.deleterious.pph.output | cut -f 5) \
sim.R
tosses <- 200
experiments <- 1000
hist((rbinom(experiments, tosses, 0.5) / tosses),
     breaks=20, xlim=c(0,1),
     main=paste("Distribution of % heads from", experiments,
                "experiments with", tosses, "tosses each"),
     xlab = "Fraction of tosses that were heads")