Skip to content

Instantly share code, notes, and snippets.

Embed
What would you like to do?
Find chromosome regions for given SSR GenBank IDs
#!/usr/bin/env python3
#
# This script will BLAST a series of GenBank IDs representing SSR sequences from
# Sirjusingh, C., Kohn, L.M., 2001. Characterization of microsatellites in the
# fungal plant pathogen, Sclerotinia sclerotiorum. Molecular Ecology 1, 267e269.
#
# This uses biopython to download the XML, save it to a file in the working
# directory, and then looks for matches that have Sclerotinia chromosome in the
# title and has a score of > 100. If there's a better way to do this, I would
# like to know it. Currently, it can download a few results at a time and then
# it just hangs.
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML
import os
import re
import sys
# Each of these sequences represents the conventional name and GenBank ID for
# the SSR genotypes defined in Sirjusingh and Kohn, 2001
seqs = {
"41-1" : "AF377899",
"5-2" : "AF377900",
"6-2" : "AF377901",
"7-2" : "AF377902",
"9-2" : "AF377903",
"11-2" : "AF377905",
"12-2" : "AF377906",
"13-2" : "AF377907",
"5-3" : "AF377908",
"7-3" : "AF377909",
"8-3" : "AF377910",
"17-3" : "AF377911",
"20-3" : "AF377912",
"23-4" : "AF377913",
"36-4" : "AF377914",
"42-4" : "AF377916",
"50-4" : "AF377917",
"55-4" : "AF377918",
"92-4" : "AF377919",
"99-4" : "AF377926",
"106-4" : "AF377921",
"110-4" : "AF377922",
"114-4" : "AF377923",
"117-4" : "AF377924",
"119-4" : "AF377925"
}
for s in seqs.keys():
print("Processing " + str(seqs[s]), file = sys.stderr)
xml = str(seqs[s]) + ".xml"
# Check if the xml file has already been downloaded.
# If it hasn't, download it with qblast()
if not os.path.isfile(xml):
print("Downloading...", file = sys.stderr)
result_handle = NCBIWWW.qblast("blastn", "nt", str(seqs[s]))
print("Done", file = sys.stderr)
with open(xml, "w") as out_handle:
out_handle.write(result_handle.read())
# Read in the file and grab the blast records
result_handle = open(xml)
blast_records = NCBIXML.read(result_handle)
for alignment in blast_records.alignments:
HSP = alignment.hsps[0]
# Determine if the alignment comes from Ssc genome. We don't care about
# any other sequence
if re.search(r'Sclerotinia.+?chromosome', alignment.title) and HSP.score > 100:
# grab the title
res = s + ";" + str(seqs[s]) + ";" + alignment.title + ";"
# grab the positions on chromosome
res = res + str(HSP.sbjct_start) + ";"
res = res + str(HSP.sbjct_end)
# print to stdout
print(res)
else:
next
@zkamvar
Copy link
Author

zkamvar commented Nov 1, 2017

Okay, after running this for an ungodly amount of time, I finally got a csv file (yes, I'm storing this in Box... I'm lazy) that worked and, after processing in R (hey, that's where I'm comfortable), I found this:

the_loci <- c("5-2", "6-2", "7-2", "8-3", "9-2", "12-2", "17-3", "20-3", 
                            "55-4", "110-4", "114-4")
suppressPackageStartupMessages(library("tidyverse"))
readr::read_delim("~/Box Sync/ssr_regions/ssr_regions.csv", delim = ";", col_names = c("SSR", "GenBankID", "Chromosome", "start", "stop")) %>% 
    mutate(chrom = gsub("^.+?chromosome ([0-9]{1,2}).+?$", "\\1", Chromosome)) %>%
    arrange(chrom, start) %>% 
    filter(SSR %in% the_loci) %>% 
    group_by(chrom) %>% 
    mutate(diff = start - lag(stop)) %>% 
    select(SSR, chrom, start, stop, diff)
#> Parsed with column specification:
#> cols(
#>   SSR = col_character(),
#>   GenBankID = col_character(),
#>   Chromosome = col_character(),
#>   start = col_integer(),
#>   stop = col_integer()
#> )
#> # A tibble: 11 x 5
#> # Groups:   chrom [7]
#>      SSR chrom   start    stop    diff
#>    <chr> <chr>   <int>   <int>   <int>
#>  1   6-2    11 1822978 1823456      NA
#>  2   8-3    12 1235695 1235947      NA
#>  3  55-4    16 1064252 1064423      NA
#>  4  20-3     3 2365031 2364754      NA
#>  5   5-2     3 2550148 2549831  185394
#>  6   9-2     3 2856158 2855801  306327
#>  7  17-3     4    2591    2935      NA
#>  8 114-4     4   58802   58434   55867
#>  9   7-2     4 1617049 1617220 1558615
#> 10  12-2     5  778518  778734      NA
#> 11 110-4     6  546391  546765      NA

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment