@inutano
Last active November 6, 2015 05:39

Documentation of Whole-SRA-FastQC

code: Quanto

Software

Based entirely on FastQC.

Version

This document is based on FastQC version 0.11.3

Basic Modules

Basic quality check modules from FastQC

filename

Input file name, e.g. "SRR000001.fastq.gz". Not suitable as an identifier, because the value can be something like "stdin" depending on the calculation workflow.

  • type: String

file_type

Either 'Conventional base calls' or 'colorspace'. Most data are base calls; data from SOLiD are colorspace and must be converted to base calls.

  • type: Categorical

encoding

e.g. 'Sanger / Illumina 1.9'

  • type: Categorical

total_sequences

Total number of sequences in the input file.

  • type: Integer

filtered_sequences

Number of sequences removed by the filtering option. This must be 0, since we do not filter any sequences.

  • type: Integer

sequence_length

e.g. '36' or '50-150'.

  • type: Integer or Range
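
Because this value is either a single integer or a range, downstream code has to handle both shapes. A minimal Python sketch, assuming a range is always written as 'min-max' as in the example above; the function name parse_sequence_length is hypothetical and is not part of FastQC or Quanto:

```python
def parse_sequence_length(value):
    """Return (min_length, max_length) for a 'Sequence length' value.

    Accepts a single length such as '36' or a range such as '50-150'.
    """
    text = str(value).strip()
    if "-" in text:
        # Range form: split once on the hyphen and convert both ends.
        low, high = text.split("-", 1)
        return int(low), int(high)
    # Single-length form: minimum and maximum are the same.
    length = int(text)
    return length, length
```

For a single length the two returned values are equal, so callers can treat both forms uniformly.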

percent_gc

Overall percent GC.

  • type: Integer

per_base_sequence_quality

Mean, median, lower quartile, upper quartile, 10th percentile, and 90th percentile of the Phred score for each base position.

  • type: dataframe (integer, float)

per_tile_sequence_quality

(new module, not yet implemented)

per_sequence_quality_scores

Count of sequences for each Phred score.

  • type: dataframe (integer, float)

per_base_sequence_content

Percentages of G, A, T, and C for each base position.

  • type: dataframe (integer, float)

per_sequence_gc_content

Count of sequences for each %GC.

  • type: dataframe (integer, float)

per_base_n_content

Percentage of bases called as N for each base position.

  • type: dataframe (integer, float)

sequence_length_distribution

Count of sequences for each sequence length.

  • type: dataframe (integer, float)

total_duplicate_percentage

Percentage of all sequences that are duplicates.

  • type: float

sequence_duplication_levels

Relative count of duplicated sequences for each duplication level.

  • type: dataframe (integer, float)

overrepresented_sequences

List of sequences that account for more than 0.1% of the total, each with its count, percentage, and possible source (e.g. an adapter sequence).

  • type: dataframe (string, float)
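
The 0.1% threshold can be checked against the Count column and total_sequences. A minimal Python sketch with hypothetical function names; it only reproduces the percentage arithmetic and does not reproduce how FastQC itself collects the counts:

```python
def overrepresented_percentage(count, total_sequences):
    """Percentage of the total that a sequence with this count represents."""
    return 100.0 * count / total_sequences


def is_overrepresented(count, total_sequences, threshold_percent=0.1):
    """True when the sequence exceeds the threshold (0.1% of the total by default)."""
    return overrepresented_percentage(count, total_sequences) > threshold_percent
```

With the first row of the example file further down (count 66145 out of 33692804 sequences), this gives roughly 0.196%, which is above the threshold.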

kmer_content

Count, overall observed/expected ratio, maximum observed/expected ratio, and position of the maximum observed/expected ratio for each 5-mer.

  • type: dataframe (string, float)

Custom Modules

Custom quality indicator modules

min_length

Minimum sequence length, as reported by the sequence length module.

  • type: float

max_length

Maximum sequence length, as reported by the sequence length module.

  • type: float

mean_sequence_length

Mean sequence length, as reported by the sequence length module.

  • type: float

median_sequence_length

Median sequence length, as reported by the sequence length module.

  • type: float
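
The four length statistics above can be derived from a length/count table such as sequence_length_distribution. A minimal Python sketch, assuming every length is a single integer (binned ranges, if present, would need extra handling); the function name length_stats and the choice of the lower weighted median are assumptions, not necessarily what Quanto does:

```python
def length_stats(distribution):
    """Summarise a list of (length, count) pairs.

    Returns min, max, count-weighted mean and count-weighted (lower) median.
    """
    pairs = sorted(
        (int(length), float(count))
        for length, count in distribution
        if float(count) > 0
    )
    total = sum(count for _, count in pairs)
    if total == 0:
        raise ValueError("distribution contains no sequences")

    # Weighted mean: each length contributes in proportion to its count.
    mean = sum(length * count for length, count in pairs) / total

    # Lower weighted median: the first length at which the cumulative
    # count reaches half of the total.
    half = total / 2.0
    cumulative = 0.0
    median = pairs[-1][0]
    for length, count in pairs:
        cumulative += count
        if cumulative >= half:
            median = length
            break

    return {
        "min_length": float(pairs[0][0]),
        "max_length": float(pairs[-1][0]),
        "mean_sequence_length": mean,
        "median_sequence_length": float(median),
    }
```

For a file in which every read has the same length, all four values equal that length.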

overall_mean_quality_score

Quality indicator for the input file, based on the mean values of per_base_sequence_quality.

  • type: float

overall_median_quality_score

Quality indicator for the input file, based on the median values of per_base_sequence_quality.

  • type: float
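
The descriptions of the two overall quality scores do not say how the per-base values are combined. The Python sketch below assumes the simplest reading, an arithmetic average of the per-base Mean column and of the per-base Median column; both that aggregation and the function name are assumptions, so check the Quanto code before relying on it:

```python
from statistics import mean


def overall_quality_scores(per_base_rows):
    """Collapse per-base quality rows into two file-level scores.

    per_base_rows: list of dicts with 'mean' and 'median' keys,
    one dict per base position.
    """
    means = [float(row["mean"]) for row in per_base_rows]
    medians = [float(row["median"]) for row in per_base_rows]
    return {
        # Assumption: average of the per-base Mean column.
        "overall_mean_quality_score": mean(means),
        # Assumption: average of the per-base Median column.
        "overall_median_quality_score": mean(medians),
    }
```

If the intended definition is instead the median of the per-base medians, swapping in statistics.median for the second value is a one-line change.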

overall_n_content

Overall percentage of bases called as N, calculated from per_base_n_content.

  • type: float
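
One way to obtain this value is to average the per-base N percentages. The Python sketch below assumes exactly that; the unweighted average equals the true overall percentage only when all reads have the same length, and the function name is hypothetical:

```python
def overall_n_content(per_base_n_percentages):
    """Average the per-base N percentages into one file-level percentage.

    Assumption: every base position is covered by the same number of reads,
    so an unweighted average is appropriate.
    """
    values = [float(v) for v in per_base_n_percentages]
    if not values:
        raise ValueError("no per-base N content values given")
    return sum(values) / len(values)
```

For variable-length reads, each position would need to be weighted by the number of reads that reach it, which is not available from per_base_n_content alone.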

Example of fastqc_data.txt

Example plots are available on the official FastQC page; see the "good Illumina data" and "bad Illumina data" examples. The example below was produced by FastQC 0.10.1, as its first line shows.

##FastQC	0.10.1
>>Basic Statistics	pass
#Measure	Value
Filename	ERR055260.fastq
File type	Conventional base calls
Encoding	Sanger / Illumina 1.9
Total Sequences	33692804
Filtered Sequences	0
Sequence length	36
%GC	40
>>END_MODULE
>>Per base sequence quality	pass
#Base	Mean	Median	Lower Quartile	Upper Quartile	10th Percentile	90th Percentile
1	38.09500847718106	39.0	38.0	40.0	35.0	40.0
2	37.703108058326045	39.0	38.0	40.0	33.0	40.0
3	37.38177641730264	39.0	38.0	40.0	33.0	40.0
4	37.75079236504032	39.0	38.0	40.0	33.0	40.0
5	37.715360496561814	39.0	38.0	40.0	33.0	40.0
6	37.88910848737908	39.0	38.0	40.0	35.0	40.0
7	37.7323402647046	39.0	38.0	40.0	33.0	40.0
8	37.696287788929645	39.0	38.0	40.0	33.0	40.0
9	37.65292689798095	39.0	38.0	40.0	33.0	40.0
10	37.574305005899774	39.0	38.0	40.0	33.0	40.0
11	37.67899068299569	39.0	38.0	40.0	33.0	40.0
12	37.39158314042369	39.0	37.0	40.0	33.0	40.0
13	37.38735989441544	39.0	37.0	40.0	33.0	40.0
14	37.2906411410579	39.0	37.0	40.0	33.0	40.0
15	37.17269708392332	39.0	36.0	40.0	32.0	40.0
16	37.22397919152113	39.0	37.0	40.0	33.0	40.0
17	37.10818915516797	39.0	36.0	40.0	32.0	40.0
18	37.02408362925211	39.0	36.0	40.0	32.0	40.0
19	37.07573323371958	39.0	36.0	40.0	32.0	40.0
20	36.95778353739867	39.0	36.0	40.0	31.0	40.0
21	37.08180610316672	39.0	36.0	40.0	33.0	40.0
22	36.990251004339086	39.0	36.0	40.0	32.0	40.0
23	37.02335727237187	39.0	36.0	40.0	32.0	40.0
24	36.93700628181614	39.0	36.0	40.0	32.0	40.0
25	37.028989513606525	39.0	36.0	40.0	32.0	40.0
26	36.98949529400996	39.0	36.0	40.0	33.0	40.0
27	36.79439235748975	39.0	36.0	40.0	32.0	40.0
28	36.543570342201264	38.0	36.0	40.0	31.0	40.0
29	36.43908939724933	38.0	36.0	40.0	31.0	40.0
30	36.523010106252954	38.0	36.0	40.0	31.0	40.0
31	36.429458498022306	38.0	36.0	40.0	31.0	40.0
32	36.27531036004009	38.0	36.0	40.0	31.0	40.0
33	36.12885104487	38.0	35.0	40.0	30.0	40.0
34	35.739400080800635	38.0	35.0	40.0	29.0	40.0
35	35.66179745681006	38.0	35.0	40.0	29.0	40.0
36	35.6744608136503	38.0	35.0	40.0	29.0	40.0
>>END_MODULE
>>Per sequence quality scores	pass
#Quality	Count
2	50286.0
3	966.0
4	1304.0
5	1936.0
6	2218.0
7	3957.0
8	4201.0
9	5279.0
10	6306.0
11	6334.0
12	8558.0
13	10742.0
14	12620.0
15	14313.0
16	17574.0
17	22568.0
18	27724.0
19	35020.0
20	43862.0
21	54141.0
22	65589.0
23	80119.0
24	101310.0
25	131505.0
26	173663.0
27	230580.0
28	304747.0
29	395319.0
30	505505.0
31	636941.0
32	798371.0
33	1016143.0
34	1363547.0
35	1939626.0
36	2835615.0
37	4343740.0
38	5375975.0
39	1.2696099E7
40	368501.0
>>END_MODULE
>>Per base sequence content	warn
#Base	G	A	T	C
1	21.568317674005407	27.723905080740685	28.783710017130065	21.924067228123846
2	20.859004009526448	30.14009743146298	29.378557538203598	19.622341020806967
3	19.871759020533563	30.333583765680793	30.451993479885303	19.342663733900338
4	20.112516407814205	29.907275072849743	30.627476606398616	19.352731912937433
5	20.77839328134104	30.234710295214462	30.45225092295607	18.53464550048843
6	19.751613744030394	30.1310583998951	30.18988111572674	19.927446740347765
7	19.529743330257617	30.02529454002178	30.939361132257005	19.505600997463603
8	19.2779686510499	30.019016227293406	30.809971494505817	19.893043627150877
9	19.4511712664278	29.394952282485644	31.11639079655591	20.037485654530645
10	19.733634265818598	29.61723742162058	30.385820896474836	20.263307416085986
11	19.86803428323437	29.365699141470742	30.45093539973882	20.315331175556068
12	19.64830531765774	29.320177685419118	30.60173323656885	20.42978376035429
13	19.906591923901615	29.467431680663918	30.411805440710722	20.214170954723745
14	19.900890409714787	29.49094708769267	30.55503186971319	20.053130632879355
15	19.683514616355467	29.496283538763947	30.76681299662682	20.05338884825377
16	19.587630640655497	29.375961110271497	30.722661135594414	20.31374711347859
17	19.566697387370905	29.37546248747952	30.63057619069045	20.42726393445912
18	19.52858835969841	29.697801346542725	30.523075491134545	20.250534802624323
19	19.756595503300943	29.432590413074557	30.59961705769576	20.211197025928744
20	19.852909837958276	29.28972904718764	30.29984681595512	20.55751429889896
21	19.736086672988094	29.505445732566514	30.431483233036943	20.32698436140845
22	20.00252041949373	29.274455756190548	30.51929130030258	20.20373252401314
23	19.962701827963027	29.42996077144544	30.442031479481496	20.16530592111004
24	19.800753300318966	29.59609416895074	30.577903222302304	20.025249308427995
25	19.891263421783677	29.602993541469186	30.486007070590333	20.01973596615681
26	19.744303667145363	29.335705260385424	30.677747191399167	20.242243881070042
27	19.94891082123505	29.38188304603278	30.572581372777712	20.09662475995446
28	19.752763824584026	29.42595398115277	30.428625055961504	20.392657138301697
29	19.730910493528526	29.43132308014495	30.505187398472387	20.332579027854138
30	20.202162847086	29.29213026470447	30.290274002697696	20.21543288551183
31	19.92989728726722	29.482277901839876	30.353785993200173	20.234038817692728
32	20.130666577989917	29.32103200735859	30.49401461622464	20.05428679842686
33	19.89157010642012	29.65369525514942	30.408405906898427	20.046328731532036
34	19.98428803966568	29.35088750701782	30.687609140515583	19.977215312800915
35	20.029303586605614	29.357663434601644	30.47588440546533	20.13714857332741
36	20.054531525485384	29.322394776047727	30.429153358681578	20.193920339785315
>>END_MODULE
>>Per base GC content	pass
#Base	%GC
1	43.49238490212925
2	40.481345030333415
3	39.214422754433905
4	39.465248320751634
5	39.31303878182947
6	39.67906048437816
7	39.03534432772122
8	39.17101227820078
9	39.48865692095844
10	39.99694168190459
11	40.18336545879044
12	40.07808907801203
13	40.12076287862536
14	39.95402104259414
15	39.736903464609235
16	39.90137775413409
17	39.99396132183002
18	39.77912316232273
19	39.96779252922968
20	40.41042413685724
21	40.06307103439654
22	40.20625294350687
23	40.12800774907306
24	39.82600260874696
25	39.910999387940485
26	39.986547548215405
27	40.045535581189505
28	40.14542096288572
29	40.063489521382664
30	40.41759573259783
31	40.16393610495995
32	40.18495337641678
33	39.937898837952154
34	39.9615033524666
35	40.16645215993302
36	40.248451865270695
>>END_MODULE
>>Per sequence GC content	pass
#GC Content	Count
0	2030.0
1	2569.0
2	3108.0
3	3108.0
4	6722.5
5	10337.0
6	10337.0
7	25111.0
8	39885.0
9	39885.0
10	68722.5
11	97560.0
12	97560.0
13	163166.5
14	228773.0
15	382853.0
16	536933.0
17	536933.0
18	1078223.0
19	1619513.0
20	1619513.0
21	1577108.5
22	1534704.0
23	1534704.0
24	1513886.5
25	1493069.0
26	1572658.5
27	1652248.0
28	1652248.0
29	1806665.5
30	1961083.0
31	1961083.0
32	2127069.0
33	2293055.0
34	2293055.0
35	2425408.0
36	2557761.0
37	2557761.0
38	2745179.0
39	2932597.0
40	2927845.0
41	2923093.0
42	2923093.0
43	2821301.0
44	2719509.0
45	2719509.0
46	2603949.0
47	2488389.0
48	2488389.0
49	2340317.0
50	2192245.0
51	2004598.0
52	1816951.0
53	1816951.0
54	1624925.5
55	1432900.0
56	1432900.0
57	1286259.0
58	1139618.0
59	1139618.0
60	952493.0
61	765368.0
62	765368.0
63	635691.5
64	506015.0
65	411790.5
66	317566.0
67	317566.0
68	254479.5
69	191393.0
70	191393.0
71	151128.0
72	110863.0
73	110863.0
74	86738.5
75	62614.0
76	48186.5
77	33759.0
78	33759.0
79	25211.0
80	16663.0
81	16663.0
82	12244.0
83	7825.0
84	7825.0
85	5538.0
86	3251.0
87	3251.0
88	2228.5
89	1206.0
90	817.5
91	429.0
92	429.0
93	292.5
94	156.0
95	156.0
96	144.0
97	132.0
98	132.0
99	167.5
100	203.0
>>END_MODULE
>>Per base N content	pass
#Base	N-Count
1	0.0
2	0.01642487220713361
3	0.5503192907304479
4	0.004570708926452069
5	7.330942239179619E-4
6	1.1278372675660952E-4
7	0.002813657183296469
8	2.967992809384461E-6
9	0.0029145689388155407
10	0.0032143362125633713
11	0.0012376530015133203
12	0.0
13	0.0
14	0.0
15	0.0
16	0.0
17	0.0
18	0.0
19	0.0
20	0.0
21	0.0
22	0.0
23	0.0
24	0.0
25	5.935985618768922E-6
26	0.02445032476370919
27	0.03007467113749274
28	0.0
29	0.0
30	0.0010744133969971747
31	0.006645335900211808
32	0.03687434266379254
33	0.022093738473057924
34	0.0
35	0.0
36	0.0
>>END_MODULE
>>Sequence Length Distribution	pass
#Length	Count
36	3.3692804E7
>>END_MODULE
>>Sequence Duplication Levels	fail
#Total Duplicate Percentage	79.75348905003098
#Duplication Level	Relative count
1	100.0
2	94.58492688413948
3	70.533183352081
4	43.69853768278965
5	25.700787401574804
6	14.77615298087739
7	9.275590551181102
8	6.717660292463442
9	4.893138357705287
10++	79.76377952755905
>>END_MODULE
>>Overrepresented sequences	warn
#Sequence	Count	Percentage	Possible Source
GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCG	66145	0.19631788437673514	Illumina Paired End PCR Primer 2 (97% over 36bp)
TTATTCTATGTTATTCTATGTTATTCTATGTTATTC	55521	0.16478592876983464	No Hit
AATAACATAGAATAACATAGAATAACATAGAATAAC	52868	0.1569118438465377	No Hit
ATAGAATAACATAGAATAACATAGAATAACATAGAA	50994	0.1513498253217512	No Hit
CTATGTTATTCTATGTTATTCTATGTTATTCTATGT	50545	0.15001719655033757	No Hit
CATAGAATAACATAGAATAACATAGAATAACATAGA	49336	0.14642889324379177	No Hit
GAATAACATAGAATAACATAGAATAACATAGAATAA	48688	0.14450563390331064	No Hit
TATGTTATTCTATGTTATTCTATGTTATTCTATGTT	48627	0.14432458634193818	No Hit
GTTATTCTATGTTATTCTATGTTATTCTATGTTATT	48349	0.1434994843409293	No Hit
TGTTATTCTATGTTATTCTATGTTATTCTATGTTAT	47439	0.14079861088438944	No Hit
AGAATAACATAGAATAACATAGAATAACATAGAATA	46916	0.13924635064508137	No Hit
TAGAATAACATAGAATAACATAGAATAACATAGAAT	45861	0.13611511823118078	No Hit
ATGTTATTCTATGTTATTCTATGTTATTCTATGTTA	44430	0.1318679205209516	No Hit
ACATAGAATAACATAGAATAACATAGAATAACATAG	41366	0.12277399055299762	No Hit
TCTATGTTATTCTATGTTATTCTATGTTATTCTATG	41338	0.12269088675433484	No Hit
TTCTATGTTATTCTATGTTATTCTATGTTATTCTAT	40405	0.11992174946317916	No Hit
AACATAGAATAACATAGAATAACATAGAATAACATA	38890	0.1154252403569617	No Hit
ATAACATAGAATAACATAGAATAACATAGAATAACA	38263	0.11356430886547762	No Hit
TAACATAGAATAACATAGAATAACATAGAATAACAT	37993	0.11276295080694383	No Hit
>>END_MODULE
>>Kmer Content	warn
#Sequence	Count	Obs/Exp Overall	Obs/Exp Max	Max Obs/Exp Position
CTATG	3682525	3.1166635	3.6598775	6
CAGCA	1692370	2.2376845	5.340712	17
AGCAG	1409890	1.8827376	5.0392146	18
TGCCG	888770	1.6877342	6.104747	26
GAGCG	846420	1.6750124	6.33894	9
GCCGA	791525	1.5509424	6.0667715	27
GCAGG	780815	1.5451841	6.1764627	19
AGCGG	772975	1.5296693	6.2083983	10
GCGGT	619000	1.187152	5.674342	11
CCGAG	565840	1.1087272	5.6430373	28
>>END_MODULE