Skip to content

Instantly share code, notes, and snippets.

View arq5x's full-sized avatar

Aaron Quinlan arq5x

View GitHub Profile
@arq5x
arq5x / Overview.md
Last active November 17, 2022 23:03
Analysis of rare disease VCF with bcftools

Analysis of rare disease VCF with bcftools

Big Picture - You are the genome detective.

Here's the deal.

  • We have sequenced the exomes of an affected child, and their unaffected parents. The child has a rare skin condition.
  • We aligned their exome data to the human reference genome.
  • We called variants using GATK.
  • The resulting VCF file is called trio.trim.vep.vcf.gz.
@arq5x
arq5x / all_the_seqs.txt
Created November 14, 2022 23:46
toy data for CSHL.
GTTGTACTTCGTTCAATCGGTAGGTGTTTAACCGGATGGTCACGCCTACC
CAAGCATACTTCATTCAGTCAGGCGAAATTATTGCCAGGTCGCCGCCTAC
GTTGTACTTCGTTCAGTCGGTGGTGTTTAACTGGGTCATCGCCTACCGTG
AGTAATACTTCGTTCAGTTTGTGGAAGGTAGTGTTTAACCGGTTGCTCGC
GGTATGCGCTGGTCAAATCGGAGAGTGGGTGTTTATCGGATGGATCGCTG
CGGTGCGTGCTGTGATTCATCTTTGACTGGTGTTTATGGTCGGTCTTTAC
CATTGTACTTCCCGTTCAGTTTATCAAATTTGGTGTTTATATGAACCATA
GATGGTGGTGGTGGCGGTGGCGCTGGTGTTGCATGGTGTTGCGCATTATT
GTATCGCGTTCAGTTTCGGAAGGTGGTGTTTAACCAGTTCGCCGCCTACC
GTTTTTTCGTTCATTTGGTACGGTGTTGCGCCGGTCGCCGCCTACCGTGA
chr1 10
@arq5x
arq5x / go.sh
Last active November 23, 2021 02:38
compute average scores for share intervals
cat a.bed
chr1 10 50 10
cat b.bed
chr1 20 40 20
cat c.bed
chr1 30 33 30
# Find the sub-intervals shared and unique to each file.
@arq5x
arq5x / notes.md
Created March 8, 2017 00:17
Jim's Thesis Talk
  1. State from the beginning that you are focusing on exome
  2. Talk through the pLI figure in more detail
  3. Any slide needs to talked through in detail if you present it
  4. Make it clear what missense variants are what you are seeing
  5. Should the observed CpG density be modulated by the codon content of the region
  6. Model ideas?
    • Exclude singletons???
    • use just synonymous density as the independent variable
  7. Molly Przezorski, CpG
  8. Simplify the explanation of each of the essential gene descriptions
#!/usr/bin/env python
"""
given a callable file from goleft depth on stdin, merge adjacent LOW_COVERAGE and CALLABLE regions.
also split regions that are larger than max_region.
this is to even out parallelism for sending regions to freebayes.
"""
from __future__ import print_function, division
import sys
from itertools import groupby
from operator import itemgetter
@arq5x
arq5x / gist:0dbc7f40a29c898f79482b98ace232a8
Created November 13, 2016 20:50
Union EIEE Genes on GeneDx and Invitae Gene Panels
# GeneDx: https://www.genedx.com/test-catalog/disorders/early-onset-epileptic-encephalopathy-andor-infantile-spasms/
# Invitae: https://www.invitae.com/en/physician/tests/03402/
ADSL
ALDH7A1
ALG13
ARHGEF9
ARID1B
ARX
ATP1A2
ATP6AP2
@arq5x
arq5x / run.py
Created November 10, 2016 20:39
python batch submission
# STEP 1: for all fams, print each SGS region (chrom, start, end)
import sys
import subprocess as sub
def run_jobs(commands):
"""
This function takes a set of max_work commands and executes
them with rj
"""
f = open('rungemini.sh', 'w')
# mnake sure all rows have 99 fields
$ awk 'BEGIN{FS="\t"} {print NF}' /uufs/chpc.utah.edu/common/home/u1072926/gemini_queries/all.txt | uniq
99
# how many HET (1) and HOM_ALT (3) genotypes were there?
$ awk 'BEGIN{FS="\t"} {print $99}' /uufs/chpc.utah.edu/common/home/u1072926/gemini_queries/all.txt | sort | uniq -c
# get rid of headers except for the first one
(head -n 1 /uufs/chpc.utah.edu/common/home/u1072926/gemini_queries/all.txt; grep -v gt_types /uufs/chpc.utah.edu/common/home/u1072926/gemini_queries/all.txt)
# download simple repeats from UCSC and convert to BED
curl -s http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/simpleRepeat.txt.gz \
| gzcat \
| cut -f 2-5 \
> simrep.hg19.bed
# download microsatellites from UCSC and convert to bed
curl -s http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/microsat.txt.gz \
| gzcat \
| cut -f 2-5 \