Regions:
- BRCA1
- BRCA2
- SMA
- KIR
- MHC
- Quickstart: How to download the FASTA files
- Verification: Describes my verification steps and conclusions
- Verify Sequences Yourself: Describes how to replicate my verification steps
- Extract Sequences Yourself: Describes how to independently extract the regions of interest from the reference genomes
- About: Meta-information about this project
I created a collection of the FASTA files for BRCA1
, BRCA2
, and SMA
genes as extracted from hg19
and hg38
reference genomes. I also created a short biopython script, exactsearch.py
, for quickly verifying whether a record in a FASTA file is a direct match against for any of the records in another FASTA file (or directory of FASTA files).
The FASTA files are in the root directory, and the LRG and NCBI reference sequences (which are created separately from the reference genome) are in the "test" directory for each gene. The file listing is as follows:
brca1test
LRG_292.fasta
ncbi_brca1.fasta
brca2test
LRG_293.fasta
ncbi_brca2.fasta
smatest
LRG_676.fasta
ncbi_sma.fasta
brca1-hg19.fa
brca1-hg38.fa
brca2-hg19.fa
brca2-hg38.fa
sma-hg19.fa
sma-hg38.fa
To get the FASTA files
A. if you have access to an Arvados VM, try
$ arv get qr1hi-4zz18-ykp04ajmg4z7wol/ . #note trailing dot!
B. Otherwise, they're publicly available:
$ wget --mirror --no-parent --no-host --cut-dirs=3 https://workbench.qr1hi.arvadosapi.com/collections/download/qr1hi-4zz18-ykp04ajmg4z7wol/4llcsi6jnlvdoro5cqilihjlla1cao87zs72du04l8n6jrweg8/
You can preview the collection contents at
A. https://workbench.qr1hi.arvadosapi.com/collections/qr1hi-4zz18-ykp04ajmg4z7wol (login with any google account)
or, if you'd prefer to not login, preview the contents at
I verified my sequences against the NCBI and LRG reference FASTA files. The coordinates come from the NCBI gene definitions for each gene. My conclusions:
brca1
- brca1-hg19 is the exact same as brca1-hg38
- brca1-hg19 and brca1-hg38 are exactly the reverse complement of LRG_292 and the NCBI reference sequences
brca2
- brca2-hg19 is the exact same as brca2-hg38
- brca2-hg19 and brca2-hg38 have a single base variation (T->C) from LRG_293 and the NCBI reference sequences, that is they are an exact match except for one base
sma
- sma-hg19 is the exact same as sma-hg38
- sma-hg19 and sma-hg38 exactly match LRG_676 and the NCBI reference sequences
Also, each NCBI reference sequence is an exact substring of the LRG reference sequence.
If you'd like to verify the file contents, exactsearch.py
will quickly tell you whether a sequence exactly matches another (either as originally in the file, as reversed, as complemented, or as reverse complemented), and for actual alignments you can run blastn
. For example, if we want to verify brca1-hg18.fa
against the LRG definition and the NCBI definition,
The reference genes I verified against come from:
BRCA1
- http://www.ncbi.nlm.nih.gov/nuccore/NG_005905.2?from=92501&to=173689&report=fasta
- http://ftp.ebi.ac.uk/pub/databases/lrgex/LRG_292.xml
BRCA2
- http://www.ncbi.nlm.nih.gov/nuccore/NG_012772.3?from=5001&to=89193&report=fasta
- http://ftp.ebi.ac.uk/pub/databases/lrgex/LRG_293.xml
SMA
- http://www.ncbi.nlm.nih.gov/nuccore/NG_008691.1?from=5001&to=33072&report=fasta
- http://ftp.ebi.ac.uk/pub/databases/lrgex/LRG_676.xml
Get python script and biopython dependency
$ sudo apt-get install pip git
$ pip install biopython
$ git clone <this gist>
Run python script
$ python exactsearch.py dir brca1-hg19.fa brca1test
Number of matches found for brca1-hg19.fa :
2
Results: (match filename, match sequence name, query match type, query[:10], subject length, query length)
['brca1test/LRG_292.fasta', 'LRG_292g (genomic sequence)', 'reverse complement', 'GTACCTTGAT', 193689, 81189]
['brca1test/ncbi_brca1.fasta', 'gi|262359905:92501-173689 Homo sapiens breast cancer 1, early onset (BRCA1), RefSeqGene (LRG_292) on chromosome 17', 'reverse complement', 'GTACCTTGAT', 81189, 81189]
If exactsearch.py
shows zero matches, it's possible there's simply a SNP or two (a base is deleted or changed). To find out, run blastn
. After installing blast
as per http://www.ncbi.nlm.nih.gov/books/NBK52640/, for instance:
$ wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.2.30+-x64-linux.tar.gz
$ tar zxvpf ncbi-blast-2.2.30+-x64-linux.tar.gz
$ vi ~/.bashrc
PATH="$PATH:/data/tools/ncbi-blast-2.2.30+/bin"
BLASTDB="/data/tools/ncbi-blast-2.2.30+/db"
$ source ~/.bashrc
Make a single fasta file containing all the sequences you want to query against. For instance, I created an lrg292_ncbi FASTA file that looks like
>gi|262359905:92501-173689 Homo sapiens breast cancer 1, early onset (BRCA1), RefSeqGene (LRG_292) on chromosome 17
GTACCTTGATTTCGTATTCTGAGAGGCTGCTGCTTAGCGGTAGCCCCTTGGTTTCCGTGGCAACGGAAAA
[...]
>LRG_292g (genomic sequence)
TGTGTGTATGAAGTTAACTTCAAAGCAAGCTTCCTGTGCTGAGGGGGTGGGAGGTAAGGG
[...]
Then make the local database
$ makeblastdb –in /data/fasta/brca/lrg292_ncbi.fa –dbtype nucl –parse_seqids
Then blast search that database
$ blastn -db /data/fasta/brca/lrg292_ncbi.fa -query /data/fasta/brca/hg19-brca1.fa -out=blast_brca1.out
If you open the blast_brca1.out file, you'll see plaintext results:
>LRG_293p1 (protein translated from transcript t1 of LRG_293)
Length=86452
Score = 1.555e+05 bits (84190), Expect = 0.0
Identities = 84192/84193 (99%), Gaps = 0/84193 (0%)
Strand=Plus/Plus
Query 39721 TAATGACAATGAGATTCATCAGTTTAACAAAAACAACTCCAATCAAGCAGTAGCTGTAAC 39780
|||||||||||||||||||||||||||||||||||||||||||||||||| |||||||||
Sbjct 41980 TAATGACAATGAGATTCATCAGTTTAACAAAAACAACTCCAATCAAGCAGCAGCTGTAAC 42039
Note that you can tell that they are aligned in the same direction (both query and subject are increasing indices) and as they are originally. If the complements aligned instead, you'd see Strand=Plus/Minus
:
> gi|262359905:92501-173689 Homo sapiens breast cancer 1, early
onset (BRCA1), RefSeqGene (LRG_292) on chromosome 17
Length=81189
Score = 1.499e+05 bits (81189), Expect = 0.0
Identities = 81189/81189 (100%), Gaps = 0/81189 (0%)
Strand=Plus/Minus
Query 1 TGGAAGTGTTTGCTACCAAGTTTATTTGCAGTGTTAACAGCACAACATTTACAAAACGTA 60
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 81189 TGGAAGTGTTTGCTACCAAGTTTATTTGCAGTGTTAACAGCACAACATTTACAAAACGTA 81130
After obtaining the chromosomal location from NCBI for each gene, I extracted the gene out of the hg19
and hg38
FASTA files with samtools
.
First, obtain the reference sequence FASTA files.
The collection may be previewed at
A. https://workbench.qr1hi.arvadosapi.com/collections/qr1hi-4zz18-ak4xkyri37st0j8 if you wish to login to Arvados
B. https://workbench.qr1hi.arvadosapi.com/collections/download/qr1hi-4zz18-ak4xkyri37st0j8/4zuujj9l1biskm1q229naj8ltn80jwv5y5ffk0yotb2mq68n7x/ if you don't wish to login
Otherwise, download them directly (NOTE: This collection is around 4 gigabytes!)
$ wget --mirror --no-parent --no-host --cut-dirs=3 https://workbench.qr1hi.arvadosapi.com/collections/download/qr1hi-4zz18-ak4xkyri37st0j8/4zuujj9l1biskm1q229naj8ltn80jwv5y5ffk0yotb2mq68n7x/
Install samtools
to manipulate the FASTA files
git clone https://github.com/samtools/samtools.git
git clone https://github.com/samtools/htslib.git
Use samtools
to extract the relevant regions, as defined by NCBI
BRCA1
: http://www.ncbi.nlm.nih.gov/gene/672BRCA2
: http://www.ncbi.nlm.nih.gov/gene/675SMA
: http://www.ncbi.nlm.nih.gov/gene/6606
The exact samtools commands
samtools faidx /data/fasta/hg19/chr17.fa chr17:41196312-41277500 > /data/fasta/brca/brca1-hg19.fa
samtools faidx /data/fasta/hg38/chr17.fa chr17:43,044,295-43,125,483 > /data/fasta/brca/brca1-hg38.fa
samtools faidx /data/fasta/hg19/chr13.fa chr13:32,889,617-32,973,809 > /data/fasta/brca/brca2-hg19.fa
samtools faidx /data/fasta/hg38/chr13.fa chr13:32,315,480-32,399,672 > /data/fasta/brca/brca2-hg38.fa
samtools faidx /data/fasta/hg19/chr5.fa chr5:70,220,768-70,248,839 > /data/fasta/sma/sma-hg19.fa
samtools faidx /data/fasta/hg38/chr5.fa chr5:70,924,941-70,953,012 > /data/fasta/sma/sma-hg38.fa
- v0.1 -- (02 Dec 2014) BRCA1, BRCA2, SMA genes on hg19 and hg38 FASTA files created.
- KIR and MHC regions for hg19 and hg38.
- Find good definitions for location of all regions against hg18.
- v0.1 -- The python
exactsearch.py
program is documented in the source code below.
Nancy Ouyang (nancy@curoverse.com)