Last active
January 20, 2022 17:18
Takes a demultiplexed, gzipped FASTQ file as input and prints each barcode found with its read count and percentage of total reads, sorted in ascending order of frequency.
## Usage: python3 count-barcode-freq.py <fastq_file.gz>
## Example: python3 count-barcode-freq.py sample.fastq.gz
from operator import itemgetter
import sys, gzip

barcodes = {}
# Open in text mode so each line is read as str rather than bytes
with gzip.open(sys.argv[1], "rt") as fastq:
    for i, line in enumerate(fastq):
        # The header is the first line of each 4-line record; testing only for
        # a leading '@' would also match quality lines that start with '@'
        if i % 4 != 0 or not line.startswith('@'): continue
        # The barcode is taken as the last colon-separated field of the header
        bc = line.split(':')[-1].strip()
        barcodes[bc] = barcodes.get(bc, 0) + 1

total = sum(barcodes.values())
# Print barcode, read count, and percentage of reads, least frequent first
for k, v in sorted(barcodes.items(), key=itemgetter(1)):
    print(k, v, round(v/total*100, 2))
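The script treats whatever follows the last colon in the header as the barcode, so its output depends on what the sequencing pipeline wrote in that position. The sketch below is not part of the original script. It applies the same split to two headers and checks whether the result looks like an index sequence. The helper names `barcode_field` and `looks_like_index_sequence` and the second header are hypothetical, and the check assumes an index is made up only of A, C, G, T, or N, optionally as two indexes joined by `+`.

```python
import re

def barcode_field(header: str) -> str:
    """Return the last colon-separated field of a FASTQ header line."""
    return header.split(":")[-1].strip()

def looks_like_index_sequence(field: str) -> bool:
    """True if the field is only A/C/G/T/N, optionally a dual index joined by '+'."""
    return re.fullmatch(r"[ACGTN]+(\+[ACGTN]+)?", field) is not None

# Header whose last field is a number rather than a sequence
header_numeric = "@FS10000408:4:BNT40310-1714:1:1101:1050:1000 1:N:0:15"
# Hypothetical header with a dual index sequence in the last field
header_sequence = "@FS10000408:4:BNT40310-1714:1:1101:1050:1000 1:N:0:ACGTAC+TTGCAA"

for h in (header_numeric, header_sequence):
    field = barcode_field(h)
    print(field, looks_like_index_sequence(field))
```

If the last field is a plain number for every read, the script will report that number as the only "barcode" with 100% of reads. One possible explanation, stated here as an assumption, is that the file was generated with a sample number in that position instead of the index sequence.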
Hello Dr. Khalfan,
Thank you for making your code public on GitHub. I have run this python script on one of my files and am getting the output:
15 849464 100.0
The first four lines of my fastq.gz file are:
@FS10000408:4:BNT40310-1714:1:1101:1050:1000 1:N:0:15
GGTTTGCTCTGGTTATTGAAACTTCTTGACTGTGTTCTCTTGATTTTCCCCGGTTTGATAGTTTAGCCGGCTTTGCTTCATTCTTCAGCGAAGTGGCAAATCTAGCCAATAACAAAAAAGTCAAGGAGGTGGTTTTCTACTGGAAGTAC
+
,,FFF,FFFFFFFFFFF,FFFFF:FFF,F:F:F,FFFFFFF,F:FFFF,F,,:FF,:,FF,F:,F:FFF::FFFF:F,:FFFFF,F,,::FF:FF,FF:,FFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFF
I was expecting an oligonucleotide, not a numeral, as the barcode. Could you give me some guidance? These reads were generated on an Illumina iSeq machine using Nextera adapters.
Thank you for your time!
Best Regards,
René