@walterst
Created July 12, 2013 14:28
Parser that pulls barcodes from fastq labels and writes them to a separate barcodes fastq file. See the comments at the top of the code for a usage example. Requires PyCogent 1.5.3 to be installed (http://sourceforge.net/projects/pycogent/files/PyCogent/1.5.3/PyCogent-1.5.3.tgz/download)
#!/usr/bin/env python
# Usage:
# python parse_bcs_from_fastq_labels.py X Y Z A
# where X is the input fastq file, Y is the output barcode reads file,
# Z is the character to split on in the label (quote it on the command line, e.g. '#'),
# and A is the number of characters to trim from the end of the label (0 for none).
# This assumes the barcode is at the end of the label and that the number of
# characters following it is the same for every read.
""" Example: for the sequences below, run
python parse_bcs_from_fastq_labels.py fastq_fp bc_reads.fastq '#' 2
to generate the barcodes file.
@MCIC-SOLEXA_0051_FC:1:1:14637:1026#CGATGT/1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+MCIC-SOLEXA_0051_FC:1:1:14637:1026#CGATGT/1
cQRQOXXXXX_T___WTWWTQTVTV_____BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@MCIC-SOLEXA_0051_FC:1:1:4065:1039#CGATGT/1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+MCIC-SOLEXA_0051_FC:1:1:4065:1039#CGATGT/1
KPPPQWWWWWQQ________BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
"""
from sys import argv
from cogent.parse.fastq import MinimalFastqParser
f = open(argv[1], "U")
bc_out = open(argv[2], "w")
char_to_split = argv[3]
chars_to_trim = int(argv[4])
for data in MinimalFastqParser(f, strict=False):
    # Read in current label
    curr_label = data[0].strip()
    # Keep only the text after the last occurrence of the split character
    curr_bc_read = curr_label.split(char_to_split)[-1]
    # Trim trailing characters; skipped for 0 because [0:-0] would return an empty string
    if chars_to_trim > 0:
        curr_bc_read = curr_bc_read[:-chars_to_trim]
    # Create fake quality score since not going to get real data, match length of barcode
    curr_bc_qual = "F" * len(curr_bc_read)
    bc_out.write("@%s\n" % curr_label)
    bc_out.write("%s\n" % curr_bc_read)
    bc_out.write("+%s\n" % curr_label)
    bc_out.write("%s\n" % curr_bc_qual)
# Close both files so the output is flushed to disk
f.close()
bc_out.close()
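
The label handling can be tried without PyCogent installed. The following is a minimal sketch of the split-and-trim step applied to the first example label from the docstring. The helper names `barcode_from_label` and `barcode_record` are hypothetical and not part of the script; the sketch assumes that a trim value of 0 should leave the barcode untouched, as the usage comment says.

```python
def barcode_from_label(label, split_char, chars_to_trim=0):
    """Return the barcode found at the end of a fastq label (hypothetical helper)."""
    # Keep the text after the last occurrence of the split character
    tail = label.strip().split(split_char)[-1]
    # Only slice when there is something to trim; tail[:-0] would be empty
    if chars_to_trim > 0:
        tail = tail[:-chars_to_trim]
    return tail


def barcode_record(label, split_char, chars_to_trim=0):
    """Format one four-line fastq record for the barcode read (hypothetical helper)."""
    label = label.strip()
    bc = barcode_from_label(label, split_char, chars_to_trim)
    # Placeholder quality string, one "F" per barcode base
    qual = "F" * len(bc)
    return "@%s\n%s\n+%s\n%s\n" % (label, bc, label, qual)


label = "MCIC-SOLEXA_0051_FC:1:1:14637:1026#CGATGT/1"
print(barcode_from_label(label, "#", 2))
print(barcode_record(label, "#", 2))
```

Splitting on '#' leaves `CGATGT/1`, and trimming two characters removes the `/1` read suffix, which is why the docstring example passes `'#' 2`.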