Fastqc is a program to perform some basic quality checks on fastq files. It makes nice html reports for a given file, but (as far as I can tell) doesn't provide a straightfowrard way to compare the results across files (which might represent different library preps, sequencing lanes or samples).
Here is the (really pretty hacky) solution to aggregating these stats that I came up with. This all assumes that you have a directory where reports for each fastq file are in a subdirectories containing the reports with names ./library_name.L001.R1.fastqc/fastqc_data.txt
. We will then use regular expressions to match just those parts of the file we care about.
import os
import re
#percent sequences left after de_dup
re_dup = re.compile('Total Deduplicated Percentage\t(\d\d\.\d)')