Skip to content

Instantly share code, notes, and snippets.

@nouyang-curoverse
Last active August 29, 2015 14:10
Show Gist options
  • Save nouyang-curoverse/b17a3820d2ccf36f3e47 to your computer and use it in GitHub Desktop.
Save nouyang-curoverse/b17a3820d2ccf36f3e47 to your computer and use it in GitHub Desktop.
Scripts for working with FASTA files

GA4GH Regions of Interest (FASTA Files)

Regions:

  • BRCA1
  • BRCA2
  • SMA
  • KIR
  • MHC

Overview

  • Quickstart: How to download the FASTA files
  • Verification: Describes my verification steps and conclusions
  • Verify Sequences Yourself: Describes how to replicate my verification steps
  • Extract Sequences Yourself: Describes how to independently extract the regions of interest from the reference genomes
  • About: Meta-information about this project

Quickstart

I created a collection of the FASTA files for BRCA1, BRCA2, and SMA genes as extracted from hg19 and hg38 reference genomes. I also created a short biopython script, exactsearch.py, for quickly verifying whether a record in a FASTA file is a direct match against for any of the records in another FASTA file (or directory of FASTA files).

The FASTA files are in the root directory, and the LRG and NCBI reference sequences (which are created separately from the reference genome) are in the "test" directory for each gene. The file listing is as follows:

brca1test
    LRG_292.fasta
    ncbi_brca1.fasta
brca2test
    LRG_293.fasta
    ncbi_brca2.fasta
smatest
    LRG_676.fasta
    ncbi_sma.fasta
brca1-hg19.fa
brca1-hg38.fa
brca2-hg19.fa
brca2-hg38.fa
sma-hg19.fa
sma-hg38.fa

To get the FASTA files

A. if you have access to an Arvados VM, try

$ arv get qr1hi-4zz18-ykp04ajmg4z7wol/ . #note trailing dot!

B. Otherwise, they're publicly available:

$ wget --mirror --no-parent --no-host --cut-dirs=3 https://workbench.qr1hi.arvadosapi.com/collections/download/qr1hi-4zz18-ykp04ajmg4z7wol/4llcsi6jnlvdoro5cqilihjlla1cao87zs72du04l8n6jrweg8/

You can preview the collection contents at

A. https://workbench.qr1hi.arvadosapi.com/collections/qr1hi-4zz18-ykp04ajmg4z7wol (login with any google account)

or, if you'd prefer to not login, preview the contents at

B. https://workbench.qr1hi.arvadosapi.com/collections/download/qr1hi-4zz18-ykp04ajmg4z7wol/4llcsi6jnlvdoro5cqilihjlla1cao87zs72du04l8n6jrweg8/

Verification

I verified my sequences against the NCBI and LRG reference FASTA files. The coordinates come from the NCBI gene definitions for each gene. My conclusions:

brca1

  • brca1-hg19 is the exact same as brca1-hg38
  • brca1-hg19 and brca1-hg38 are exactly the reverse complement of LRG_292 and the NCBI reference sequences

brca2

  • brca2-hg19 is the exact same as brca2-hg38
  • brca2-hg19 and brca2-hg38 have a single base variation (T->C) from LRG_293 and the NCBI reference sequences, that is they are an exact match except for one base

sma

  • sma-hg19 is the exact same as sma-hg38
  • sma-hg19 and sma-hg38 exactly match LRG_676 and the NCBI reference sequences

Also, each NCBI reference sequence is an exact substring of the LRG reference sequence.

Verify Sequences Yourself

If you'd like to verify the file contents, exactsearch.py will quickly tell you whether a sequence exactly matches another (either as originally in the file, as reversed, as complemented, or as reverse complemented), and for actual alignments you can run blastn. For example, if we want to verify brca1-hg18.fa against the LRG definition and the NCBI definition,

Download Reference Genes

The reference genes I verified against come from:

BRCA1

BRCA2

SMA

Exact Search

Get python script and biopython dependency

$ sudo apt-get install pip git
$ pip install biopython
$ git clone <this gist>

Run python script

$ python exactsearch.py dir brca1-hg19.fa brca1test
Number of matches found for brca1-hg19.fa :
2
Results: (match filename, match sequence name, query match type, query[:10], subject length, query length)
['brca1test/LRG_292.fasta', 'LRG_292g (genomic sequence)', 'reverse complement', 'GTACCTTGAT', 193689, 81189]
['brca1test/ncbi_brca1.fasta', 'gi|262359905:92501-173689 Homo sapiens breast cancer 1, early onset (BRCA1), RefSeqGene (LRG_292) on chromosome 17', 'reverse complement', 'GTACCTTGAT', 81189, 81189]

Blastn

If exactsearch.py shows zero matches, it's possible there's simply a SNP or two (a base is deleted or changed). To find out, run blastn . After installing blast as per http://www.ncbi.nlm.nih.gov/books/NBK52640/, for instance:

$ wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.2.30+-x64-linux.tar.gz
$ tar zxvpf ncbi-blast-2.2.30+-x64-linux.tar.gz
$ vi ~/.bashrc
  PATH="$PATH:/data/tools/ncbi-blast-2.2.30+/bin"
  BLASTDB="/data/tools/ncbi-blast-2.2.30+/db"
$ source ~/.bashrc

Make a single fasta file containing all the sequences you want to query against. For instance, I created an lrg292_ncbi FASTA file that looks like

>gi|262359905:92501-173689 Homo sapiens breast cancer 1, early onset (BRCA1), RefSeqGene (LRG_292) on chromosome 17
GTACCTTGATTTCGTATTCTGAGAGGCTGCTGCTTAGCGGTAGCCCCTTGGTTTCCGTGGCAACGGAAAA
[...]
>LRG_292g (genomic sequence)
TGTGTGTATGAAGTTAACTTCAAAGCAAGCTTCCTGTGCTGAGGGGGTGGGAGGTAAGGG
[...]

Then make the local database

$ makeblastdb –in /data/fasta/brca/lrg292_ncbi.fa –dbtype nucl –parse_seqids    

Then blast search that database

$ blastn -db /data/fasta/brca/lrg292_ncbi.fa -query /data/fasta/brca/hg19-brca1.fa  -out=blast_brca1.out

If you open the blast_brca1.out file, you'll see plaintext results:

>LRG_293p1 (protein translated from transcript t1 of LRG_293)
Length=86452

 Score = 1.555e+05 bits (84190),  Expect = 0.0
 Identities = 84192/84193 (99%), Gaps = 0/84193 (0%)
 Strand=Plus/Plus


Query  39721  TAATGACAATGAGATTCATCAGTTTAACAAAAACAACTCCAATCAAGCAGTAGCTGTAAC  39780
              |||||||||||||||||||||||||||||||||||||||||||||||||| |||||||||
Sbjct  41980  TAATGACAATGAGATTCATCAGTTTAACAAAAACAACTCCAATCAAGCAGCAGCTGTAAC  42039

Note that you can tell that they are aligned in the same direction (both query and subject are increasing indices) and as they are originally. If the complements aligned instead, you'd see Strand=Plus/Minus:

> gi|262359905:92501-173689 Homo sapiens breast cancer 1, early 
onset (BRCA1), RefSeqGene (LRG_292) on chromosome 17
Length=81189

 Score = 1.499e+05 bits (81189),  Expect = 0.0
 Identities = 81189/81189 (100%), Gaps = 0/81189 (0%)
 Strand=Plus/Minus

Query  1      TGGAAGTGTTTGCTACCAAGTTTATTTGCAGTGTTAACAGCACAACATTTACAAAACGTA  60
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  81189  TGGAAGTGTTTGCTACCAAGTTTATTTGCAGTGTTAACAGCACAACATTTACAAAACGTA  81130

Extract Sequences Yourself

After obtaining the chromosomal location from NCBI for each gene, I extracted the gene out of the hg19 and hg38 FASTA files with samtools.

First, obtain the reference sequence FASTA files.

The collection may be previewed at

A. https://workbench.qr1hi.arvadosapi.com/collections/qr1hi-4zz18-ak4xkyri37st0j8 if you wish to login to Arvados

B. https://workbench.qr1hi.arvadosapi.com/collections/download/qr1hi-4zz18-ak4xkyri37st0j8/4zuujj9l1biskm1q229naj8ltn80jwv5y5ffk0yotb2mq68n7x/ if you don't wish to login

Otherwise, download them directly (NOTE: This collection is around 4 gigabytes!)

$ wget --mirror --no-parent --no-host --cut-dirs=3 https://workbench.qr1hi.arvadosapi.com/collections/download/qr1hi-4zz18-ak4xkyri37st0j8/4zuujj9l1biskm1q229naj8ltn80jwv5y5ffk0yotb2mq68n7x/ 

Install samtools to manipulate the FASTA files

git clone https://github.com/samtools/samtools.git
git clone https://github.com/samtools/htslib.git

Use samtools to extract the relevant regions, as defined by NCBI

The exact samtools commands

samtools faidx /data/fasta/hg19/chr17.fa chr17:41196312-41277500 > /data/fasta/brca/brca1-hg19.fa
samtools faidx /data/fasta/hg38/chr17.fa chr17:43,044,295-43,125,483 > /data/fasta/brca/brca1-hg38.fa
samtools faidx /data/fasta/hg19/chr13.fa chr13:32,889,617-32,973,809 > /data/fasta/brca/brca2-hg19.fa
samtools faidx /data/fasta/hg38/chr13.fa chr13:32,315,480-32,399,672 > /data/fasta/brca/brca2-hg38.fa
samtools faidx /data/fasta/hg19/chr5.fa chr5:70,220,768-70,248,839 > /data/fasta/sma/sma-hg19.fa
samtools faidx /data/fasta/hg38/chr5.fa chr5:70,924,941-70,953,012 > /data/fasta/sma/sma-hg38.fa

About

Version

  • v0.1 -- (02 Dec 2014) BRCA1, BRCA2, SMA genes on hg19 and hg38 FASTA files created.

Todo

  • KIR and MHC regions for hg19 and hg38.
  • Find good definitions for location of all regions against hg18.

Notes

  • v0.1 -- The python exactsearch.py program is documented in the source code below.

Author

Nancy Ouyang (nancy@curoverse.com)

from Bio import SeqIO
from Bio.Alphabet import IUPAC
import sys
import glob
def exactsearch(isDirSearch, queryfile, subject):
"""Searches for whether the given sequence exactly matches any of the sequences in another directory or file.
That is, given query seq S, it searches whether seq S, rev(S), complement(S), rev(complement(S))
exists as a substring in any of the records in any of the files in the given subject directory or file.
isDirSearch -- Boolean. Specifies whether we are searching against a directory or single FASTA file
queryfile -- String. The FASTA file. Should only contain one sequence.
subject -- String. 1. The directory of FASTA files, can be a relative or absolute path. Each file can
contain multiple sequences.
2. The FASTA file to search against, can contain multiple sequences.
An example directory structure:
./
brca1-hg19.fa
brca1-hg38.fa
brca1test/
LRG_292.fasta
ncbi_brca1.fasta
Example calls:
$ python exactsearch.py dir brca1-hg19.fa brca1test
$ python exactsearch.py file brca1-hg19.fa brca1-hg38.fa
Example results:
$ python exactsearch.py dir brca1-hg19.fa brca1test
Searching through the following records in file brca1test/LRG_292.fasta :
record id LRG_292g
record id LRG_292t1
record id LRG_292p1
Searching through the following records in file brca1test/ncbi_brca1.fasta :
record id gi|262359905:92501-173689
Number of matches found for brca1-hg19.fa :
2
Results: (match filename, match sequence name, query match type, query[:10], subject length, query length)
['brca1test/LRG_292.fasta', 'LRG_292g (genomic sequence)', 'reverse complement', 'GTACCTTGAT', 193689, 81189]
['brca1test/ncbi_brca1.fasta', 'gi|262359905:92501-173689 Homo sapiens breast cancer 1, early onset (BRCA1), RefSeqGene (LRG_292) on chromosome 17', 'reverse complement', 'GTACCTTGAT', 81189, 81189]
Note: In plain English, this means that the sequence in brca-hg19.fa exactly matches both
1) a subset of the LRG_292g record in LRG_292.fasta
2) the full gi|BRCA1 record in ncbi_brca1.fasta
"""
alpha = IUPAC.unambiguous_dna
query = SeqIO.read(queryfile, "fasta", alpha) #parse for multiple, read() for single in fasta
query_list = {"original":query.seq, "reverse":query.seq[::-1], "complement":query.seq.complement(), \
"reverse complement":query.seq.reverse_complement()}
results = []
if isDirSearch:
subject = subject + '/*'
#print 'subject', subject
#print 'query_list', query_list
for fafile in glob.glob(subject):
print "Searching through the following records in file", fafile, ": "
#print 'fafile', fafile
for record in SeqIO.parse(fafile, "fasta", alpha):
print 'record id', record.id
for descr, seq in query_list.iteritems():
#print 'seq', seq
#print 'searching.... ', descr, str(seq)[:60]
#print 'in this: ', str(record.seq)[:60]
if seq.upper() in record.seq.upper():
results.append([fafile, record.description, descr, str(seq)[:10], len(record.seq), len(seq)])
return results
else:
for record in SeqIO.parse(subject, "fasta", alpha):
print 'Searching record id', record.id
for descr, seq in query_list.iteritems():
if seq.upper() in record.seq.upper():
results.append([subject, record.description, descr, str(seq)[:10], len(record.seq), len(seq)])
return results
if __name__=="__main__":
if len(sys.argv) != 4:
raise Exception('Not enough arguments')
elif sys.argv[1] == 'dir':
isDirSearch = True
elif sys.argv[1] == 'file':
isDirSearch = False
else:
raise Exception('Must supply whether we are searching against a file or a directory')
results = exactsearch(isDirSearch, sys.argv[2], sys.argv[3])
print "Number of matches found for", sys.argv[2], ':\n', len(results)
print "Results: (match filename, match sequence name, query match type, query[:10], subject length, query length)"
print ('\n'.join(map(str, results)))
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment