nouyang-curoverse/Readme.md

## Readme.md

      
    Raw
  

              Readme.md
            
          
    GA4GH Regions of Interest (FASTA Files)

Regions:

BRCA1
BRCA2
SMA
KIR
MHC

Overview


Quickstart: How to download the FASTA files
Verification: Describes my verification steps and conclusions
Verify Sequences Yourself: Describes how to replicate my verification steps
Extract Sequences Yourself: Describes how to independently extract the regions of interest from the reference genomes
About: Meta-information about this project

Quickstart

I created a collection of the FASTA files for BRCA1, BRCA2, and SMA genes as extracted from hg19 and hg38 reference genomes. I also created a short biopython script, exactsearch.py, for quickly verifying whether a record in a FASTA file is a direct match against for any of the records in another FASTA file (or directory of FASTA files).
The FASTA files are in the root directory, and the LRG and NCBI reference sequences (which are created separately from the reference genome) are in the "test" directory for each gene. The file listing is as follows:
brca1test
    LRG_292.fasta
    ncbi_brca1.fasta
brca2test
    LRG_293.fasta
    ncbi_brca2.fasta
smatest
    LRG_676.fasta
    ncbi_sma.fasta
brca1-hg19.fa
brca1-hg38.fa
brca2-hg19.fa
brca2-hg38.fa
sma-hg19.fa
sma-hg38.fa

To get the FASTA files
A. if you have access to an Arvados VM, try
$ arv get qr1hi-4zz18-ykp04ajmg4z7wol/ . #note trailing dot!

B. Otherwise, they're publicly available:
$ wget --mirror --no-parent --no-host --cut-dirs=3 https://workbench.qr1hi.arvadosapi.com/collections/download/qr1hi-4zz18-ykp04ajmg4z7wol/4llcsi6jnlvdoro5cqilihjlla1cao87zs72du04l8n6jrweg8/

You can preview the collection contents at
A. https://workbench.qr1hi.arvadosapi.com/collections/qr1hi-4zz18-ykp04ajmg4z7wol (login with any google account)
or, if you'd prefer to not login, preview the contents at
B. https://workbench.qr1hi.arvadosapi.com/collections/download/qr1hi-4zz18-ykp04ajmg4z7wol/4llcsi6jnlvdoro5cqilihjlla1cao87zs72du04l8n6jrweg8/
Verification

I verified my sequences against the NCBI and LRG reference FASTA files. The coordinates come from the NCBI gene definitions for each gene. My conclusions:
brca1

brca1-hg19 is the exact same as brca1-hg38
brca1-hg19 and brca1-hg38 are exactly the reverse complement of LRG_292 and the NCBI reference sequences

brca2

brca2-hg19 is the exact same as brca2-hg38
brca2-hg19 and brca2-hg38 have a single base variation (T->C) from LRG_293 and the NCBI reference sequences, that is they are an exact match except for one base

sma

sma-hg19 is the exact same as sma-hg38
sma-hg19 and sma-hg38 exactly match LRG_676 and the NCBI reference sequences

Also, each NCBI reference sequence is an exact substring of the LRG reference sequence.
Verify Sequences Yourself

If you'd like to verify the file contents, exactsearch.py will quickly tell you whether a sequence exactly matches another (either as originally in the file, as reversed, as complemented, or as reverse complemented), and for actual alignments you can run blastn. For example, if we want to verify brca1-hg18.fa against the LRG definition and the NCBI definition,
Download Reference Genes

The reference genes I verified against come from:
BRCA1

http://www.ncbi.nlm.nih.gov/nuccore/NG_005905.2?from=92501&to=173689&report=fasta
http://ftp.ebi.ac.uk/pub/databases/lrgex/LRG_292.xml

BRCA2

http://www.ncbi.nlm.nih.gov/nuccore/NG_012772.3?from=5001&to=89193&report=fasta
http://ftp.ebi.ac.uk/pub/databases/lrgex/LRG_293.xml

SMA

http://www.ncbi.nlm.nih.gov/nuccore/NG_008691.1?from=5001&to=33072&report=fasta
http://ftp.ebi.ac.uk/pub/databases/lrgex/LRG_676.xml

Exact Search

Get python script and biopython dependency
$ sudo apt-get install pip git
$ pip install biopython
$ git clone <this gist>

Run python script
$ python exactsearch.py dir brca1-hg19.fa brca1test
Number of matches found for brca1-hg19.fa :
2
Results: (match filename, match sequence name, query match type, query[:10], subject length, query length)
['brca1test/LRG_292.fasta', 'LRG_292g (genomic sequence)', 'reverse complement', 'GTACCTTGAT', 193689, 81189]
['brca1test/ncbi_brca1.fasta', 'gi|262359905:92501-173689 Homo sapiens breast cancer 1, early onset (BRCA1), RefSeqGene (LRG_292) on chromosome 17', 'reverse complement', 'GTACCTTGAT', 81189, 81189]

Blastn

If exactsearch.py shows zero matches, it's possible there's simply a SNP or two (a base is deleted or changed). To find out, run blastn . After installing blast as per http://www.ncbi.nlm.nih.gov/books/NBK52640/, for instance:
$ wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.2.30+-x64-linux.tar.gz
$ tar zxvpf ncbi-blast-2.2.30+-x64-linux.tar.gz
$ vi ~/.bashrc
  PATH="$PATH:/data/tools/ncbi-blast-2.2.30+/bin"
  BLASTDB="/data/tools/ncbi-blast-2.2.30+/db"
$ source ~/.bashrc

Make a single fasta file containing all the sequences you want to query against. For instance, I created an lrg292_ncbi FASTA file that looks like
>gi|262359905:92501-173689 Homo sapiens breast cancer 1, early onset (BRCA1), RefSeqGene (LRG_292) on chromosome 17
GTACCTTGATTTCGTATTCTGAGAGGCTGCTGCTTAGCGGTAGCCCCTTGGTTTCCGTGGCAACGGAAAA
[...]
>LRG_292g (genomic sequence)
TGTGTGTATGAAGTTAACTTCAAAGCAAGCTTCCTGTGCTGAGGGGGTGGGAGGTAAGGG
[...]

Then make the local database
$ makeblastdb –in /data/fasta/brca/lrg292_ncbi.fa –dbtype nucl –parse_seqids    

Then blast search that database
$ blastn -db /data/fasta/brca/lrg292_ncbi.fa -query /data/fasta/brca/hg19-brca1.fa  -out=blast_brca1.out

If you open the blast_brca1.out file, you'll see plaintext results:
>LRG_293p1 (protein translated from transcript t1 of LRG_293)
Length=86452

 Score = 1.555e+05 bits (84190),  Expect = 0.0
 Identities = 84192/84193 (99%), Gaps = 0/84193 (0%)
 Strand=Plus/Plus


Query  39721  TAATGACAATGAGATTCATCAGTTTAACAAAAACAACTCCAATCAAGCAGTAGCTGTAAC  39780
              |||||||||||||||||||||||||||||||||||||||||||||||||| |||||||||
Sbjct  41980  TAATGACAATGAGATTCATCAGTTTAACAAAAACAACTCCAATCAAGCAGCAGCTGTAAC  42039

Note that you can tell that they are aligned in the same direction (both query and subject are increasing indices) and as they are originally. If the complements aligned instead, you'd see Strand=Plus/Minus:
> gi|262359905:92501-173689 Homo sapiens breast cancer 1, early 
onset (BRCA1), RefSeqGene (LRG_292) on chromosome 17
Length=81189

 Score = 1.499e+05 bits (81189),  Expect = 0.0
 Identities = 81189/81189 (100%), Gaps = 0/81189 (0%)
 Strand=Plus/Minus

Query  1      TGGAAGTGTTTGCTACCAAGTTTATTTGCAGTGTTAACAGCACAACATTTACAAAACGTA  60
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  81189  TGGAAGTGTTTGCTACCAAGTTTATTTGCAGTGTTAACAGCACAACATTTACAAAACGTA  81130

Extract Sequences Yourself

After obtaining the chromosomal location from NCBI for each gene, I extracted the gene out of the hg19 and hg38 FASTA files with samtools.
First, obtain the reference sequence FASTA files.
The collection may be previewed at
A. https://workbench.qr1hi.arvadosapi.com/collections/qr1hi-4zz18-ak4xkyri37st0j8 if you wish to login to Arvados
B. https://workbench.qr1hi.arvadosapi.com/collections/download/qr1hi-4zz18-ak4xkyri37st0j8/4zuujj9l1biskm1q229naj8ltn80jwv5y5ffk0yotb2mq68n7x/ if you don't wish to login
Otherwise, download them directly (NOTE: This collection is around 4 gigabytes!)
$ wget --mirror --no-parent --no-host --cut-dirs=3 https://workbench.qr1hi.arvadosapi.com/collections/download/qr1hi-4zz18-ak4xkyri37st0j8/4zuujj9l1biskm1q229naj8ltn80jwv5y5ffk0yotb2mq68n7x/ 

Install samtools to manipulate the FASTA files
git clone https://github.com/samtools/samtools.git
git clone https://github.com/samtools/htslib.git

Use samtools to extract the relevant regions, as defined by NCBI

BRCA1: http://www.ncbi.nlm.nih.gov/gene/672
BRCA2: http://www.ncbi.nlm.nih.gov/gene/675
SMA: http://www.ncbi.nlm.nih.gov/gene/6606

The exact samtools commands
samtools faidx /data/fasta/hg19/chr17.fa chr17:41196312-41277500 > /data/fasta/brca/brca1-hg19.fa
samtools faidx /data/fasta/hg38/chr17.fa chr17:43,044,295-43,125,483 > /data/fasta/brca/brca1-hg38.fa
samtools faidx /data/fasta/hg19/chr13.fa chr13:32,889,617-32,973,809 > /data/fasta/brca/brca2-hg19.fa
samtools faidx /data/fasta/hg38/chr13.fa chr13:32,315,480-32,399,672 > /data/fasta/brca/brca2-hg38.fa
samtools faidx /data/fasta/hg19/chr5.fa chr5:70,220,768-70,248,839 > /data/fasta/sma/sma-hg19.fa
samtools faidx /data/fasta/hg38/chr5.fa chr5:70,924,941-70,953,012 > /data/fasta/sma/sma-hg38.fa

About

Version


v0.1 -- (02 Dec 2014)
BRCA1, BRCA2, SMA genes on hg19 and hg38 FASTA files created.

Todo


KIR and MHC regions for hg19 and hg38.
Find good definitions for location of all regions against hg18.

Notes


v0.1 -- The python exactsearch.py program is documented in the source code below.

Author

Nancy Ouyang (nancy@curoverse.com)

  
## exactsearch.py
from Bio import SeqIO
from Bio.Alphabet import IUPAC
import sys
import glob

def exactsearch(isDirSearch, queryfile, subject):
    """Searches for whether the given sequence exactly matches any of the sequences in another directory or file.
    That is, given query seq S, it searches whether seq S, rev(S), complement(S), rev(complement(S))
    exists as a substring in any of the records in any of the files in the given subject directory or file.

    isDirSearch -- Boolean. Specifies whether we are searching against a directory or single FASTA file
    queryfile -- String. The FASTA file. Should only contain one sequence.
    subject -- String. 1. The directory of FASTA files, can be a relative or absolute path. Each file can
                   contain multiple sequences.
               2. The FASTA file to search against, can contain multiple sequences.

    An example directory structure:
        ./
          brca1-hg19.fa
          brca1-hg38.fa
          brca1test/
            LRG_292.fasta
            ncbi_brca1.fasta

    Example calls:
        $ python exactsearch.py dir brca1-hg19.fa brca1test
        $ python exactsearch.py file brca1-hg19.fa brca1-hg38.fa

    Example results:
        $ python exactsearch.py dir brca1-hg19.fa brca1test
        Searching through the following records in file brca1test/LRG_292.fasta :
        record id LRG_292g
        record id LRG_292t1
        record id LRG_292p1
        Searching through the following records in file brca1test/ncbi_brca1.fasta :
        record id gi|262359905:92501-173689
        Number of matches found for brca1-hg19.fa :
        2
        Results: (match filename, match sequence name, query match type, query[:10], subject length, query length)
        ['brca1test/LRG_292.fasta', 'LRG_292g (genomic sequence)', 'reverse complement', 'GTACCTTGAT', 193689, 81189]
        ['brca1test/ncbi_brca1.fasta', 'gi|262359905:92501-173689 Homo sapiens breast cancer 1, early onset (BRCA1), RefSeqGene (LRG_292) on chromosome 17', 'reverse complement', 'GTACCTTGAT', 81189, 81189]

        Note: In plain English, this means that the sequence in brca-hg19.fa exactly matches both
            1) a subset of the LRG_292g record in LRG_292.fasta
            2) the full gi|BRCA1 record in ncbi_brca1.fasta

    """
    alpha = IUPAC.unambiguous_dna

    query = SeqIO.read(queryfile, "fasta", alpha) #parse for multiple, read() for single in fasta
    query_list = {"original":query.seq, "reverse":query.seq[::-1], "complement":query.seq.complement(), \
            "reverse complement":query.seq.reverse_complement()}

    results = []
    if isDirSearch:
        subject = subject + '/*'
        #print 'subject', subject
        #print 'query_list', query_list
        for fafile in glob.glob(subject):
            print "Searching through the following records in file", fafile, ": "
            #print 'fafile', fafile
            for record in SeqIO.parse(fafile, "fasta", alpha):
                print 'record id', record.id
                for descr, seq in query_list.iteritems():
                    #print 'seq', seq
                    #print 'searching.... ', descr, str(seq)[:60]
                    #print 'in this: ', str(record.seq)[:60]
                    if seq.upper() in record.seq.upper():
                        results.append([fafile, record.description, descr, str(seq)[:10], len(record.seq), len(seq)])
        return results
    else:
        for record in SeqIO.parse(subject, "fasta", alpha):
            print 'Searching record id', record.id
            for descr, seq in query_list.iteritems():
                if seq.upper() in record.seq.upper():
                    results.append([subject, record.description, descr, str(seq)[:10], len(record.seq), len(seq)])
        return results

if __name__=="__main__":
    if len(sys.argv) != 4:
        raise Exception('Not enough arguments')
    elif sys.argv[1] == 'dir':
        isDirSearch = True
    elif sys.argv[1] == 'file':
        isDirSearch = False
    else:
        raise Exception('Must supply whether we are searching against a file or a directory')

    results = exactsearch(isDirSearch, sys.argv[2], sys.argv[3])
    print "Number of matches found for", sys.argv[2], ':\n', len(results)
    print "Results: (match filename, match sequence name, query match type, query[:10], subject length, query length)"
    print ('\n'.join(map(str, results)))
	from Bio import SeqIO
	from Bio.Alphabet import IUPAC
	import sys
	import glob

	def exactsearch(isDirSearch, queryfile, subject):
	"""Searches for whether the given sequence exactly matches any of the sequences in another directory or file.
	That is, given query seq S, it searches whether seq S, rev(S), complement(S), rev(complement(S))
	exists as a substring in any of the records in any of the files in the given subject directory or file.

	isDirSearch -- Boolean. Specifies whether we are searching against a directory or single FASTA file
	queryfile -- String. The FASTA file. Should only contain one sequence.
	subject -- String. 1. The directory of FASTA files, can be a relative or absolute path. Each file can
	contain multiple sequences.
	2. The FASTA file to search against, can contain multiple sequences.

	An example directory structure:
	./
	brca1-hg19.fa
	brca1-hg38.fa
	brca1test/
	LRG_292.fasta
	ncbi_brca1.fasta

	Example calls:
	$ python exactsearch.py dir brca1-hg19.fa brca1test
	$ python exactsearch.py file brca1-hg19.fa brca1-hg38.fa

	Example results:
	$ python exactsearch.py dir brca1-hg19.fa brca1test
	Searching through the following records in file brca1test/LRG_292.fasta :
	record id LRG_292g
	record id LRG_292t1
	record id LRG_292p1
	Searching through the following records in file brca1test/ncbi_brca1.fasta :
	record id gi\|262359905:92501-173689
	Number of matches found for brca1-hg19.fa :
	2
	Results: (match filename, match sequence name, query match type, query[:10], subject length, query length)
	['brca1test/LRG_292.fasta', 'LRG_292g (genomic sequence)', 'reverse complement', 'GTACCTTGAT', 193689, 81189]
	['brca1test/ncbi_brca1.fasta', 'gi\|262359905:92501-173689 Homo sapiens breast cancer 1, early onset (BRCA1), RefSeqGene (LRG_292) on chromosome 17', 'reverse complement', 'GTACCTTGAT', 81189, 81189]

	Note: In plain English, this means that the sequence in brca-hg19.fa exactly matches both
	1) a subset of the LRG_292g record in LRG_292.fasta
	2) the full gi\|BRCA1 record in ncbi_brca1.fasta

	"""
	alpha = IUPAC.unambiguous_dna

	query = SeqIO.read(queryfile, "fasta", alpha) #parse for multiple, read() for single in fasta
	query_list = {"original":query.seq, "reverse":query.seq[::-1], "complement":query.seq.complement(), \
	"reverse complement":query.seq.reverse_complement()}

	results = []
	if isDirSearch:
	subject = subject + '/*'
	#print 'subject', subject
	#print 'query_list', query_list
	for fafile in glob.glob(subject):
	print "Searching through the following records in file", fafile, ": "
	#print 'fafile', fafile
	for record in SeqIO.parse(fafile, "fasta", alpha):
	print 'record id', record.id
	for descr, seq in query_list.iteritems():
	#print 'seq', seq
	#print 'searching.... ', descr, str(seq)[:60]
	#print 'in this: ', str(record.seq)[:60]
	if seq.upper() in record.seq.upper():
	results.append([fafile, record.description, descr, str(seq)[:10], len(record.seq), len(seq)])
	return results
	else:
	for record in SeqIO.parse(subject, "fasta", alpha):
	print 'Searching record id', record.id
	for descr, seq in query_list.iteritems():
	if seq.upper() in record.seq.upper():
	results.append([subject, record.description, descr, str(seq)[:10], len(record.seq), len(seq)])
	return results

	if __name__=="__main__":
	if len(sys.argv) != 4:
	raise Exception('Not enough arguments')
	elif sys.argv[1] == 'dir':
	isDirSearch = True
	elif sys.argv[1] == 'file':
	isDirSearch = False
	else:
	raise Exception('Must supply whether we are searching against a file or a directory')

	results = exactsearch(isDirSearch, sys.argv[2], sys.argv[3])
	print "Number of matches found for", sys.argv[2], ':\n', len(results)
	print "Results: (match filename, match sequence name, query match type, query[:10], subject length, query length)"
	print ('\n'.join(map(str, results)))