@paulgradie
Created August 18, 2014 01:41
Outdated Version: For making all your reference chromosome files evenly formatted to lines of 50bp PRIOR to concatenation.
__author__ = 'Paul G'
from sys import argv
import os
import re
script = argv
print """\nHi USER (lol Tron joke!) This simple script is meant to help you format your reference file.\n
It is really only meant to handle one chromosome file at a time; a future update will allow for an entire reference fix.
Purpose:
Adjusting the line lengths of your reference chromosome PRIOR TO CONCATENATION by leaving the fasta header untouched,
removing line breaks, and then redistributing the line breaks. Many genome browsers such as IGV won't accept a reference
file that doesn't have lines formatted evenly with 50bp.\n\n\nThis script does NOT overwrite your original file!!
"""
print """Usage:
1. When you see the '>' prompt, type the file name of your reference chromosome and hit enter.
2. If the file is not in your current directory, type a relative or absolute path to it instead.
Output: A file called "FixedRef.fa", written to your current directory, that you can rename to whatever you'd like. I recommend writing 'chr#_fixed.fa'.
"""
#ask the user which file to reformat (hash this out when testing with the line below)
prompt = raw_input("Type your filename or path plus filename>")
#prompt = "formattest.fa" ##For testing only - unhash to test
#make a new file called FixedRef.fa
finaloutput = open("FixedRef.fa", 'w')
tempref = ""
tempheader = ""
#open the old chr.fa file and go through it line by line
with open(prompt, "r") as ref:
    for line in ref:
        #if it's the fasta header, remember it so it can be written to the output unchanged
        if line.startswith(">"):
            tempheader = line
        #otherwise, append the sequence line without its '\n' marker
        else:
            tempref += line.rstrip('\n')
#write the header first, then the sequence with a '\n' line return after every 50 characters
finaloutput.write(tempheader)
finaloutput.write(re.sub("(.{50})", "\\1\n", tempref, 0, re.DOTALL))
finaloutput.close()
print "All done."
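
The script above handles one chromosome per run and holds the whole sequence in memory. The following is a minimal sketch of the "entire reference" variant mentioned in the greeting text, not the author's planned update. It assumes a multi-record FASTA file and re-wraps each record as it streams through. The names `rewrap_fasta` and `rewrap_file` are hypothetical, and the code is written to run under both Python 2.7 and Python 3.

```python
def rewrap_fasta(lines, width=50):
    """Yield FASTA lines with every sequence re-wrapped to `width` characters.

    `lines` is any iterable of strings (for example an open file).
    Header lines (starting with '>') are passed through unchanged.
    """
    buffered = ""  # sequence characters of the current record not yet emitted
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # Flush the short final line of the previous record, if any.
            if buffered:
                yield buffered
                buffered = ""
            yield line
        elif line:
            buffered += line
            # Emit full-width lines as soon as enough characters are available.
            while len(buffered) >= width:
                yield buffered[:width]
                buffered = buffered[width:]
    # Flush whatever is left of the last record.
    if buffered:
        yield buffered


def rewrap_file(in_path, out_path, width=50):
    """Write a re-wrapped copy of in_path to out_path; the input is not modified."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for out_line in rewrap_fasta(src, width):
            dst.write(out_line + "\n")
```

Calling `rewrap_file("chr1.fa", "chr1_fixed.fa")` would write the re-wrapped copy next to the original (both file names are placeholders). Because at most one input line plus `width - 1` leftover characters are buffered at a time, memory use stays small even for large chromosomes.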