@ipurusho
Last active September 27, 2023 16:58
A brief tutorial on how to run the STAR aligner on medinfo.mssm.edu

###Download STAR###

Obtain the STAR source from https://github.com/alexdobin/STAR

Add the following to your .bashrc file and source it:

export PATH=/path/to/STAR/bin/:$PATH

###Generate Reference Genome###

Before using STAR, a reference genome index must be built using STAR's genomeGenerate mode. This requires a genome fasta file and a GTF/GFF reference annotation. The index can be built with the following command:

STAR --runThreadN {number of cores} --runMode genomeGenerate --genomeDir /path/to/resulting/STAR/genome/ --genomeFastaFiles /path/to/genome/fasta/file --sjdbGTFfile /path/to/GTF/or/GFF --sjdbOverhang {read length - 1}

Note: --sjdbOverhang depends on the read length of your fastq files; the ideal value is read length - 1. The default value is 100, which is said to work well in most cases.
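If you are unsure of the read length, it can be estimated from the fastq file itself. The following is a minimal sketch, not part of STAR: `sjdb_overhang` is a hypothetical helper name, and it assumes a gzipped, four-line-per-record fastq. It scans the first reads and prints the longest read length minus one.

```shell
# Hypothetical helper (not a STAR command): print (longest read length - 1)
# over the first N reads of a gzipped fastq file.
# Usage: sjdb_overhang sample_R1.fastq.gz [number_of_reads]
sjdb_overhang() {
    fastq_gz=$1
    n_reads=${2:-1000}
    # Each fastq record is 4 lines; the sequence is the 2nd line of each record.
    gzip -dc "$fastq_gz" | head -n $((n_reads * 4)) |
        awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) }
             END { print max - 1 }'
}
```

The printed value could then be passed along, for example `--sjdbOverhang "$(sjdb_overhang /path/to/R1.fastq.gz)"`. If reads were trimmed to varying lengths, the first few reads may not be representative, so a larger sample may be safer.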

STAR Alignment

After you have built a genome for STAR, you can proceed to align single-end or paired-end fastq files to this reference using the following command:

STAR --runMode alignReads --outSAMtype BAM Unsorted --readFilesCommand zcat --genomeDir /path/to/STAR/genome/folder --outFileNamePrefix {sample name}  --readFilesIn  /path/to/R1 /path/to/R2

Note:

  • --outSAMtype can be left as BAM Unsorted if you are going to use HTSeq for read counting, since HTSeq expects name-sorted .bam files by default (which you can easily produce with samtools beforehand). Alternatively, use the option BAM SortedByCoordinate if you are aligning reads to generate TDFs for viewing.
  • --readFilesCommand should remain zcat if raw samples are gzipped (i.e. .fastq.gz extension). Omit this flag otherwise.
  • R1 and R2 can accommodate comma-separated input, which enables mapping of technical replicates, namely fastqs for the same sample sequenced on multiple lanes. Just make sure the R2 technical replicates are in the same order as R1.
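One way to keep the R1 and R2 lists in the same order is to derive each R2 name from its R1 name while building both lists. The sketch below is only an illustration: `lane_lists` is a hypothetical helper, the `sampleA_L00x` file names are made up, and it assumes mates differ only by the `_R1.fastq.gz` / `_R2.fastq.gz` suffix and that file names contain no spaces or commas.

```shell
# Hypothetical helper: given the R1 files for one sample, print the two
# comma-separated lists that --readFilesIn expects (R1 list, then R2 list).
lane_lists() {
    r1_list=""
    r2_list=""
    for r1 in "$@"; do
        r2="${r1%_R1.fastq.gz}_R2.fastq.gz"   # mate file for the same lane
        r1_list="${r1_list:+$r1_list,}$r1"    # append with a comma, no spaces
        r2_list="${r2_list:+$r2_list,}$r2"
    done
    printf '%s %s\n' "$r1_list" "$r2_list"
}

# Example with made-up lane files:
lane_lists sampleA_L001_R1.fastq.gz sampleA_L002_R1.fastq.gz
# prints: sampleA_L001_R1.fastq.gz,sampleA_L002_R1.fastq.gz sampleA_L001_R2.fastq.gz,sampleA_L002_R2.fastq.gz
```

Because the output is two space-separated words, it can be placed unquoted after `--readFilesIn`, e.g. `--readFilesIn $(lane_lists sampleA_L00*_R1.fastq.gz)`.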

Below is a concise example of how to loop through an entire directory. Since STAR is incredibly quick, it's suitable to run each alignment in serial:

for i in *_R1.fastq.gz; do
    STAR --runMode alignReads --genomeLoad LoadAndKeep --readFilesCommand zcat --outSAMtype BAM Unsorted --genomeDir /path/to/STAR/genome --readFilesIn "$i" "${i%_R1.fastq.gz}_R2.fastq.gz" --runThreadN 10 --outFileNamePrefix "${i%_R1.fastq.gz}"
done

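Note: The --genomeLoad LoadAndKeep option loads the genome index into shared memory and keeps it there after each run, so subsequent alignments skip the loading step. Once all alignments are finished, the genome can be released with a separate STAR call using --genomeLoad Remove and the same --genomeDir.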
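The loop relies on the shell expansion `${i%_R1.fastq.gz}`, which strips the `_R1.fastq.gz` suffix from each file name to recover the sample name. Before launching a long batch, it may help to preview how the files will be paired. The following is a rough sketch (`preview_pairs` is a hypothetical helper, not part of STAR) that prints the sample name, both mate files, and whether the R2 file exists, without running any alignment.

```shell
# Hypothetical dry-run helper: for each R1 file, print a tab-separated line
# with the sample name, the R1 file, the expected R2 file, and a status flag.
preview_pairs() {
    for r1 in "$@"; do
        sample="${r1%_R1.fastq.gz}"     # strip the R1 suffix -> sample name
        r2="${sample}_R2.fastq.gz"      # expected mate file
        if [ -e "$r2" ]; then
            status=ok
        else
            status=missing_R2
        fi
        printf '%s\t%s\t%s\t%s\n' "$sample" "$r1" "$r2" "$status"
    done
}

# Usage, from the directory holding the fastq files:
#   preview_pairs *_R1.fastq.gz
```

Any line flagged `missing_R2` points to a sample whose mate file does not follow the assumed naming scheme and would cause STAR to fail on that iteration.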

@AlveenaZulfiqar

AlveenaZulfiqar commented Oct 9, 2018

Hi
I am new to this field. If there is no reference genome available, how can we use STAR to align our reads back to the assembled transcriptome?

@carloslizama

I'm trying to run this script from the terminal but something is wrong. I hope you can help me:

for i in /media/carlos/DATAPART1/Lung/*.fastq; do
STAR --genomeDir /media/carlos/DATAPART1/genomeDir --runThreadN 4 --runMode alignReads --readFilesIn $i ${/media/carlos/DATAPART1/Lung/i%fastq} --outSAMtype BAM SortedByCoordinate --outSAMattributes Standard --outReadsUnmapped Fastx --outFilterIntronMotifs RemoveNoncanonical --outFileNamePrefix /media/carlos/DATAPART1/Results_lung/${i%.fastq} --limitBAMsortRAM 3000000000 --sjdbGTFfile /home/carlos/STAR-2.7.2a/Mus_musculus.GRCm38.97.gtf
done

I got this error
star_for_fastq.sh: line 1: syntax error near unexpected token $'do\r'' 'tar_for_fastq.sh: line 1: for i in /media/carlos/DATAPART1/Lung/*.fastq; do

@necrosnake91

I'm trying to run this script from the terminal but something is wrong. I hope you can help me:

for i in /media/carlos/DATAPART1/Lung/*.fastq; do
STAR --genomeDir /media/carlos/DATAPART1/genomeDir --runThreadN 4 --runMode alignReads --readFilesIn $i ${/media/carlos/DATAPART1/Lung/i%fastq} --outSAMtype BAM SortedByCoordinate --outSAMattributes Standard --outReadsUnmapped Fastx --outFilterIntronMotifs RemoveNoncanonical --outFileNamePrefix /media/carlos/DATAPART1/Results_lung/${i%.fastq} --limitBAMsortRAM 3000000000 --sjdbGTFfile /home/carlos/STAR-2.7.2a/Mus_musculus.GRCm38.97.gtf
done

I got this error
star_for_fastq.sh: line 1: syntax error near unexpected token $'do\r'' 'tar_for_fastq.sh: line 1: for i in /media/carlos/DATAPART1/Lung/*.fastq; do

Hi Carlos,

I'm performing an alignment using STAR. I constructed a similar for loop and ran it. Below, I leave you my code:

#Align reads against the reference genome
#Generate the list with the name of the forward (r1) and reverse (r2) files
for i in {1..2}; do
r1="t0$i-R1.clean.fastq.gz"
r2="t0$i-R2.clean.fastq.gz" 
STAR --genomeDir ../../Humano \ #Path to the index generated previously
--runThreadN 32 \ #Number of cores
--readFilesIn $r1 $r2 \ #Path to the input files (forward and reverse)
--outFileNamePrefix t0$i.aligned \ #Prefix to the output files
--outSAMtype BAM SortedByCoordinate \ 
--quantMode TranscriptomeSAM #SAM file required for RSEM to quantify the mapped reads
done

But I got a similar error, just like you. I'm trying to figure out how to fix this problem. I'll be in touch if I have good news.

@MotaharehJadidi

Hi everyone!

I'm totally new to computational biology and I'd appreciate it if you could help me understand better!
I get a little confused by seeing different command lines everywhere; some of the commands here are not in the STAR manual.
How can I find how to align my reads to the reference genome?
For example I don't get what "for i in *_R1.fastq.gz; do" is and how it can be used....

@EOMAK91

EOMAK91 commented Apr 6, 2022

@MotaharehJadidi that's just a loop, it's looping through all the files so that you don't have to physically type out each file name