Skip to content

Instantly share code, notes, and snippets.

@sumeetg23
Last active March 4, 2023 01:20
Show Gist options
  • Save sumeetg23/1cfb183ccaace56206712dd2a4bc0ac4 to your computer and use it in GitHub Desktop.
Save sumeetg23/1cfb183ccaace56206712dd2a4bc0ac4 to your computer and use it in GitHub Desktop.
CRISPR Screen - NGS FASTQ to Count File

Step 1.3, Step 2.1 to 2.3 and Step 3.1 can be done for a given fastq file and reference fasta using the following shell script: https://github.com/sumeetg23/Tools/blob/master/Shell/CRISPR-Generate-Count.sh

Step 3.2 script is available here: https://github.com/sumeetg23/Tools/blob/master/R/CRISPR_CountSummary_v1.2.R

Step 1: Generate Bowtie Index

Step 1.1 (if file not in fasta format)

  Input - Tab delimited text file with column 1 as a unique ID and column 2 as the sequence.
  
  Example of the input file content:
  pool10_1_essentials_10001_MED6_1_brunello_published     TCATCATTCGGAAGCAACAG
  pool10_1_essentials_10001_MED6_2_brunello_published     ATGAAGTGGTCAAAATGCAG
  pool10_1_essentials_10001_MED6_3_brunello_published     ATCAGTTATAAACTCTAGAG
  pool10_1_essentials_10001_MED6_4_brunello_published     TGCCACCAATACCCTTTGGA

Step 1.2

  awk '{print ">"$1"\n"$2}' <input tab delimited file> > <output fasta file>

  example: awk '{print ">"$1"\n"$2}' test.txt > test.fa
  
  Output result:
    >pool10_1_essentials_10001_MED6_1_brunello_published
    TCATCATTCGGAAGCAACAG
    >pool10_1_essentials_10001_MED6_2_brunello_published
    ATGAAGTGGTCAAAATGCAG
    >pool10_1_essentials_10001_MED6_3_brunello_published
    ATCAGTTATAAACTCTAGAG
    >pool10_1_essentials_10001_MED6_4_brunello_published
    TGCCACCAATACCCTTTGGA

Step 1.3 Build bowtie index

    bowtie-build <input fasta file> <basename for bowtie index>
    
    Example: bowtie-build guides.fa guides
    This will generate a bunch of bowtie index files that have ebwt extension

Step 2: Align reads in FASTQ file

Step 2.1: Change to the directory with the gzipped fastq files

Step 2.2: If read length unknown, check the read length for a fastq file (for any given flow cell/lane, this should be the same for all fastq files. So just need to check 1 of the files)

    zcat <gzipped_fastq_file> | head -n 4 | awk '{if(NR%4==2) print length($1)}'

Step 2.3: Run the following command

    bowtie -3 30 -n 0 -l 20 -p 4 -S <BOWTIE INDEX> <input gzipped fastq file> <output sam filename>
  
    note: "-3" paramter is the bases to be trimmed from the 3' end, the value would (readLength_from_step2.2 - length_of_reference_Sequences_from_step1)

Step 3: Summarize Count

Step 3.1: Summarize counts for each SAM file.

    grep -v ^\@ <sam/alignment file> | awk -F\"\t\" '{ if(\$4=1 && \$13==\"MD:Z:20\") print \$3 }' | sort | uniq -c | awk '{ print \$2\"\t\"\$1 }' > <out file with .count extension>

Step 3.2: Summarize individual count files into a single file.

    Run "./CRISPR_CountSummary_v1.2.R". It is available here https://github.com/sumeetg23/Tools/tree/master/R
    
    The script takes 2 arguments - input directory (directory with count files from step 3.1) and output file.

    If you run the R script without arguments, it will print out help.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment