Step 1.3, Step 2.1 to 2.3 and Step 3.1 can be done for a given fastq file and reference fasta using the following shell script: https://github.com/sumeetg23/Tools/blob/master/Shell/CRISPR-Generate-Count.sh
Step 3.2 script is available here: https://github.com/sumeetg23/Tools/blob/master/R/CRISPR_CountSummary_v1.2.R
Step 1.1 (if file not in fasta format)
Input - Tab delimited text file with column 1 as a unique ID and column 2 as the sequence.
Example of the input file content:
pool10_1_essentials_10001_MED6_1_brunello_published TCATCATTCGGAAGCAACAG
pool10_1_essentials_10001_MED6_2_brunello_published ATGAAGTGGTCAAAATGCAG
pool10_1_essentials_10001_MED6_3_brunello_published ATCAGTTATAAACTCTAGAG
pool10_1_essentials_10001_MED6_4_brunello_published TGCCACCAATACCCTTTGGA
Step 1.2
awk '{print ">"$1"\n"$2}' <input tab delimited file> > <output fasta file>
example: awk '{print ">"$1"\n"$2}' test.txt > test.fa
Output result:
>pool10_1_essentials_10001_MED6_1_brunello_published
TCATCATTCGGAAGCAACAG
>pool10_1_essentials_10001_MED6_2_brunello_published
ATGAAGTGGTCAAAATGCAG
>pool10_1_essentials_10001_MED6_3_brunello_published
ATCAGTTATAAACTCTAGAG
>pool10_1_essentials_10001_MED6_4_brunello_published
TGCCACCAATACCCTTTGGA
Step 1.3 Build bowtie index
bowtie-build <input fasta file> <basename for bowtie index>
Example: bowtie-build guides.fa guides
This will generate a bunch of bowtie index files that have ebwt extension
Step 2.1: Change to the directory with the gzipped fastq files
Step 2.2: If read length unknown, check the read length for a fastq file (for any given flow cell/lane, this should be the same for all fastq files. So just need to check 1 of the files)
zcat <gzipped_fastq_file> | head -n 4 | awk '{if(NR%4==2) print length($1)}'
Step 2.3: Run the following command
bowtie -3 30 -n 0 -l 20 -p 4 -S <BOWTIE INDEX> <input gzipped fastq file> <output sam filename>
note: "-3" paramter is the bases to be trimmed from the 3' end, the value would (readLength_from_step2.2 - length_of_reference_Sequences_from_step1)
Step 3.1: Summarize counts for each SAM file.
grep -v ^\@ <sam/alignment file> | awk -F\"\t\" '{ if(\$4=1 && \$13==\"MD:Z:20\") print \$3 }' | sort | uniq -c | awk '{ print \$2\"\t\"\$1 }' > <out file with .count extension>
Step 3.2: Summarize individual count files into a single file.
Run "./CRISPR_CountSummary_v1.2.R". It is available here https://github.com/sumeetg23/Tools/tree/master/R
The script takes 2 arguments - input directory (directory with count files from step 3.1) and output file.
If you run the R script without arguments, it will print out help.