@masaomi
Last active September 23, 2022 03:21
Bio373_2022_Command_Practice_with_answers.md
Bio373_2022_Command_Practice_without_answers.md

Contents

  1. URL
  2. B-Fabric registration
  3. First task: login the server
  4. Check the server specs
  5. Make your own working directory
  6. Calculate G-C content
  7. FastQC
  8. Advanced practice
  9. For Unix experienced students
  10. Appendix: Text Editor
  11. Answers for Questions

Created by gh-md-toc


1. URL

2. B-Fabric registration

Link

FGCZ servers fgcz_servers.png

Note

  • All projects at FGCZ are managed in the B-Fabric system
  • After sequencing, you can access the NGS data through B-Fabric, gStore, or SUSHI
  • If you already have an FGCZ account, you do not have to register again

ToDo

  1. Go to https://fgcz-bfabric.uzh.ch/
  2. Click Register
  3. Follow the instructions and input your information

After the registration,

  • Please send your account name to masaomi.hatakeyama@uzh.ch
  • Please wait until Masa has set up your configuration
  • Log in to B-Fabric again and make sure you have a new project (p1535) by clicking [My Projects]

myproject_bfabric.png

3. First task: login the server

  1. Start and open a terminal

terminal

  2. Log in to the server
 $ ssh your_BFabric_account_name@172.23.30.6
 your_BFabric_account_name@172.23.30.6's password:

Note

  • PLEASE replace your_BFabric_account_name with YOUR BFabric ACCOUNT NAME
  • The password characters are not shown in the terminal, but the system can recognize your password. Please press enter-key after typing your password
  • If the login fails, please try again. If it fails several times, please confirm your account name and password by logging in to B-Fabric.
  • This server address, 172.23.30.6, is valid only inside the University LAN. You need to use VPN if you access it from home (i.e. outside of the LAN).

Reference

  • ssh: connect to a remote computer with Secure SHell Protocol
  • scp: Secure CoPy, file copy via ssh protocol between local computer and a remote server

What is SSH?

ssh

Secure Shell: (Wikipedia) http://en.wikipedia.org/wiki/Secure_Shell

 Secure Shell (SSH) is a cryptographic network protocol for secure data communication, 
 remote command-line login, remote command execution, and other secure network 
 services between two networked computers that connects, via a secure channel over 
 an insecure network, a server and a client (running SSH server and SSH client programs, respectively)
 $ hostname
   (make sure you are working on the server fgcz-kl-003, not on Mac)
 $ pwd

4. Check the server specs

Check the system (OS) information

 $ uname -a

Check the CPU and RAM information of the server (How many CPUs are there? How many bytes is the total RAM capacity?)

 $ cat /proc/cpuinfo
 $ cat /proc/meminfo

Check the currently running processes

 $ htop
 (in order to exit from htop, type 'q')

Reference

  • pwd: show the current working directory path
  • ls: show files and sub directories in the current directory
  • uname: show OS information
  • cat: show text file contents
  • /proc/cpuinfo, /proc/meminfo: files that contain hardware information about the computer
  • htop: show running processes, memory usage, and so on.
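
The two questions above can also be answered without reading the whole files. The following is a minimal sketch, assuming a Linux system where /proc is available (as on the course server); the variable names are only for illustration.

```shell
# Count logical CPUs: /proc/cpuinfo has one "processor" entry per logical CPU.
cpu_count=$(grep -c '^processor' /proc/cpuinfo)
echo "Logical CPUs: ${cpu_count}"

# Total RAM: the MemTotal line of /proc/meminfo is labelled kB
# (the kernel counts 1024 bytes per kB here).
mem_total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
mem_total_bytes=$(( mem_total_kb * 1024 ))
echo "Total RAM: ${mem_total_kb} kB = ${mem_total_bytes} bytes"
```

grep -c prints the number of matching lines, so no extra wc -l is needed.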

5. Make your own working directory

Please make your directory under /scratch as follows:

 $ cd /scratch/bio373_2022
 $ mkdir your_name
  • cd: Change the working Directory
  • mkdir: MaKe a new DIRectory

Note

  • Please DO NOT type your_name literally. Use YOUR NAME, such as masaomi or tim, so that it is clear who uses the directory (first name, family name, or nickname are all fine)
  • Use this directory, rather than your home directory, for the command practice during the course, because not enough disk space is allocated to user home directories.
  • Save all data in this directory. Otherwise your data might be deleted by another user, or you might delete another user's data by accident

Copy a sample file to your working directory

 $ cp /scratch/bio373_2022/data/TAIR10_chr1.fa.gz /scratch/bio373_2022/your_name/
 $ ls

Note

  • NOT your_name but YOUR NAME again.

Change the current working directory to your directory

 $ cd /scratch/bio373_2022/your_name
 $ ls
  • Make sure that there is TAIR10_chr1.fa.gz file.

Decompress the compressed file & confirm the decompressed file

 $ gunzip -c TAIR10_chr1.fa.gz > TAIR10_chr1.fa
 $ ls

Note

  • gunzip: decompress a file that was compressed by the gzip command
  • -c: write the output to standard output and keep the original file
  • >: (redirection) send the standard output to a file
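
To see what -c and > do without touching the course data, here is a small sketch on a throwaway file. The file name toy.fa and its content are made up for illustration.

```shell
# Work in a temporary directory so that no real data is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Create a tiny example FASTA file and compress it.
printf '>gene1\nACGTACGT\n' > toy.fa
gzip toy.fa                      # replaces toy.fa with toy.fa.gz

# -c writes the decompressed data to standard output; > saves it to a file.
gunzip -c toy.fa.gz > toy.fa

# Both files exist now, because -c left the compressed file in place.
ls toy.fa toy.fa.gz
```

Without -c, gunzip toy.fa.gz would replace the compressed file by the decompressed one.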

Check if the file is really decompressed

 $ ls
 $ file TAIR10_chr1.fa.gz
 $ file TAIR10_chr1.fa

Note

  • file: check the file type

Check and compare the file sizes between compressed and decompressed files with different options

 $ ls -l
 $ ls -lh

Note

  • -l: show detailed information about files
  • -lh: show file sizes with a unit prefix (K, M, G, ...)

Show the first and last few lines of the FASTA file

 (Check the differences of the commands below)
 $ head TAIR10_chr1.fa
 $ tail TAIR10_chr1.fa
 $ head -n 100 TAIR10_chr1.fa
 $ head -n 100 TAIR10_chr1.fa|less
 $ less TAIR10_chr1.fa
 (To exit less, type 'q'; to go to the next page, press the space key; the enter key scrolls one line)

Note

  • head: show the first 10 lines (-n changes the number of lines)
  • tail: show the last 10 lines
  • | (pipe): the output of a command is passed to the next command as its input
  • less: show text one screen at a time; typing 'h' shows the help for its sub-commands
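
Pipes can be chained. As a small sketch, the numbers 1 to 100 produced by seq stand in for the lines of a file; the variable names are only for illustration.

```shell
# seq prints the numbers 1..100, one per line, as stand-in data.
first3=$(seq 1 100 | head -n 3)      # lines 1 to 3
last2=$(seq 1 100 | tail -n 2)       # lines 99 and 100

# head keeps the first 20 lines, tail then keeps the last 5 of those,
# which picks out lines 16 to 20.
middle=$(seq 1 100 | head -n 20 | tail -n 5)

echo "$first3"
echo "$last2"
echo "$middle"
```

The same pattern works on the FASTA file, for example to look at a block of lines in the middle of it.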

List all the gene annotation lines

 $ grep '>' TAIR10_chr1.fa
 $ grep '>' TAIR10_chr1.fa | less
 (to exit less, type 'q'; to go to the next page, type space)

Count the total number of lines, words, and characters

 $ wc TAIR10_chr1.fa

Count only the total number of lines

 $ wc -l TAIR10_chr1.fa

Count only the number of characters

 $ wc -c TAIR10_chr1.fa

6. Calculate G-C content

Count how many genes are defined in the FASTA file (TAIR10_chr1.fa, A. thaliana Chromosome1) (Count only the gene annotation lines)

 $ grep '>' TAIR10_chr1.fa | wc -l
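
As a side note, grep can count matching lines by itself. A minimal sketch on a made-up file (toy.fa is hypothetical); the pattern '^>' only matches lines that start with >.

```shell
# Create a tiny example FASTA file with two annotation lines.
workdir=$(mktemp -d)
printf '>gene1 desc\nACGT\nGGCC\n>gene2 desc\nTTAA\n' > "$workdir/toy.fa"

# Two ways to count the annotation lines: grep piped into wc -l,
# or grep -c, which prints the number of matching lines directly.
n_pipe=$(grep '>' "$workdir/toy.fa" | wc -l)
n_count=$(grep -c '^>' "$workdir/toy.fa")

echo "with wc -l: ${n_pipe}, with grep -c: ${n_count}"
```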

Count the total number of nucleotide bases (Do not count the gene annotation line)

Hint: grep with the -v option prints only the lines that do not match, so it can skip the annotation lines. Check grep --help or search by Google!

 $ grep -v '>' TAIR10_chr1.fa | wc -c
 $ grep -v '>' TAIR10_chr1.fa | wc -l
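
Note that wc -c counts the newline character at the end of each line too. The sketch below shows this on a made-up file (toy.fa is hypothetical) and uses tr -d to drop the newlines before counting.

```shell
# Tiny example: 6 bases spread over 2 sequence lines.
workdir=$(mktemp -d)
printf '>gene1\nACGT\nGG\n' > "$workdir/toy.fa"

# 6 bases + 2 newline characters = 8
with_newlines=$(grep -v '>' "$workdir/toy.fa" | wc -c)

# tr -d '\n' deletes the newlines first, leaving the 6 bases
bases_only=$(grep -v '>' "$workdir/toy.fa" | tr -d '\n' | wc -c)

echo "with newlines: ${with_newlines}, bases only: ${bases_only}"
```

Subtracting the line count (wc -l) from the character count gives the same number of bases.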

Question1

  • What is the GC content of the FASTA file (TAIR10_chr1.fa, A. thaliana chromosome1)?

Hints

  • Hint1: sed command can replace characters in a line
  • Hint2: wc can count lines and characters
 $ echo 'AAATTTGGGCCC' | sed 's/A/Z/g'
 ZZZTTTGGGCCC

Reference/Hint

Compress TAIR10_chr1.fa with different options (What is the difference?)

 $ gzip -h
 (for example)
 $ time gzip --fast -c TAIR10_chr1.fa > TAIR10_chr1.fa.fast.gz
 $ time gzip --best -c TAIR10_chr1.fa > TAIR10_chr1.fa.best.gz
 $ ls -lh
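
The comparison can be tried on generated data as well. This is only a sketch: toy.fa and its repetitive content are made up, so the sizes will not resemble those of the real file.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Generate 2000 lines of repetitive stand-in sequence data.
for i in $(seq 1 2000); do echo "ACGTTGCAACGGTCAGT${i}"; done > toy.fa

# Compress the same input with the fastest and the strongest setting.
gzip --fast -c toy.fa > toy.fast.gz
gzip --best -c toy.fa > toy.best.gz

# wc -c < file prints only the byte count, without the file name.
orig_size=$(wc -c < toy.fa)
fast_size=$(wc -c < toy.fast.gz)
best_size=$(wc -c < toy.best.gz)
echo "original: ${orig_size}  --fast: ${fast_size}  --best: ${best_size}"

# Both archives decompress to exactly the original data.
gunzip -c toy.fast.gz | cmp - toy.fa && gunzip -c toy.best.gz | cmp - toy.fa && echo "lossless"
```

The option only trades speed against file size; the decompressed data is identical in both cases.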

Question2

  • How large (how many bytes) is the total data size (after decompression)?

7. FastQC

Copy the following file into your working directory (/scratch/bio373_2022/your_name)

  $ hostname
    (make sure you are working on the server fgcz-kl-003)
  $ pwd
    (make sure you are working in the working directory, /scratch/bio373_2022/your_name)
  $ cp /scratch/bio373_2022/data/akam_samples1.tgz ./
    (do not forget the dot at the end, which means the current working directory)
  $ ls

Extract (decompress) it under your directory

 $ ls
  (make sure akam_samples1.tgz is in your current working directory)
 $ tar zxvf akam_samples1.tgz
 $ ls
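
The letters after tar are single-letter options. The sketch below walks through them on a made-up archive (toy_samples.tgz and its content are hypothetical).

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Prepare a small directory to archive.
mkdir toy_samples
printf '@read1\nACGT\n+\nIIII\n' > toy_samples/toy.fastq

# c = create an archive, z = compress with gzip, f = archive file name
tar czf toy_samples.tgz toy_samples
rm -r toy_samples

# t = list the archive contents without extracting
tar tzf toy_samples.tgz

# x = extract, v = verbose (print each extracted file name)
tar xzvf toy_samples.tgz
ls toy_samples
```

Listing with t first is a safe way to check what an archive will create before extracting it.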

Check fastq file

 $ ls
  (make sure there is an akam_samples1/ directory)
 $ cd akam_samples1
 $ cat akam_sample1.fastq | more
 (to exit, type 'q')
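
A FASTQ record normally occupies four lines: a header starting with @, the sequence, a + line, and the quality string. Under that assumption the number of reads is the line count divided by four. A sketch on a made-up file (toy.fastq is hypothetical):

```shell
# Tiny example FASTQ file with two reads.
workdir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nIIII\n' > "$workdir/toy.fastq"

# Show the first record (4 lines).
head -n 4 "$workdir/toy.fastq"

# Number of reads = number of lines / 4
n_lines=$(wc -l < "$workdir/toy.fastq")
n_reads=$(( n_lines / 4 ))
echo "reads: ${n_reads}"
```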

Note

  • There are data for 4 samples.
  • sample1 and 2 are controls (normal condition)
  • sample3 and 4 are insecticide-treated
  • the libraries of sample1 and 3 were prepared by FGCZ staff
  • the libraries of sample2 and 4 were prepared by students

FastQC (Quality Control)

 $ source /usr/local/ngseq/etc/lmod_profile
 $ module load QC/FastQC/0.11.9
 $ ls
   (make sure akam_sample1.fastq is there)
 $ fastqc akam_sample1.fastq
 $ ls
   (check the generated files)

Note

  • module: sets up the environment variables needed for the fastqc command. This command is configured and available only on the fgcz-kl-003 server

Question3

  • What kind of files are generated by fastqc command?

Log out from fgcz-kl-003

 $ exit

Download the result file from the server. DO NOT forget the dot at the end (it means the current directory on the local computer)!!

  $ hostname
     (make sure you are on your Mac, not on fgcz-kl-003)
  $ cd ~/Desktop
  $ scp your_BFabric_account_name@172.23.30.6:/scratch/bio373_2022/your_name/akam_samples1/akam_sample1_fastqc.html ./

Note

  • Again, replace your_BFabric_account_name and your_name

  • Be careful to distinguish the lowercase letter l (L) from the number 1 (one)

  • Open the downloaded html file in a web browser

fastqc_report

Check it out!!

  • Which points are good/bad in the QC report?

Let's Try!!

  • Check the other samples
 akam_sample2.fastq
 akam_sample3.fastq
 akam_sample4.fastq

Question4

  • How different are they?

8. Advanced practice

  • MultiQC can gather all the FastQC results into one report.
  • After generating the FastQC results for samples 1-4, run the following in the akam_samples1 directory:
 $ . "/usr/local/ngseq/miniconda3/etc/profile.d/conda.sh"
 $ conda activate multiqc
 $ multiqc .

Note

  • conda is a package manager that manages the installation of software/libraries. Like module, it is used to activate installed software (an environment).
  • Please do not forget the final dot, separated from multiqc by one space.
  • At the end you will get a multiqc_report.html file. Download it to your local computer with the scp command and open it in a web browser

multiqc

MultiQC

9. For Unix experienced students

Let's try to make a shell script using a text editor and run several fastqc commands at once.

fastqc_batch.sh

 source /usr/local/ngseq/etc/lmod_profile
 module load QC/FastQC/0.11.9
 for i in `seq 1 3`
 do
   fastqc akam_samples1/akam_sample${i}.fastq
 done

To execute it

 $ bash fastqc_batch.sh

Note

  • Please make the shell script using a text editor (it is fine to make it locally and upload it to the server, but for text editor practice, let's try one of the CLI text editors such as vi or nano)
  • In this example, the shell script file name is fastqc_batch.sh, but any file name works.
  • seq: make a list of numbers
  • for: repeat the commands between do and done, assigning each element of the list to the variable i in turn
  • The loop variable is referred to with the $ symbol
  • To execute the shell script, pass it to bash
  • In a shell script, backquotes `` are replaced by the output of the command inside them (command substitution). The script fastqc_batch2.sh below applies this to a list of files

Additional samples

  • /scratch/bio373_2022/data/akam_samples2.tgz

  • Copy it into your working directory

  • Extract the archived files (please use tar command)

  • Execute FastQC (by the following shell script)

fastqc_batch2.sh

 source /usr/local/ngseq/etc/lmod_profile
 module load QC/FastQC/0.11.9
 for file in `ls -1 akam_samples2/w2271_*.fastq`
 do
   fastqc $file
 done
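
A for loop can also iterate over a file name pattern directly, without calling ls. The following is a sketch of that alternative, not the course's script: the toy_*.fastq files are made up, and echo stands in for the real fastqc call so that it runs anywhere.

```shell
# Create two empty stand-in files in a temporary directory.
workdir=$(mktemp -d)
mkdir "$workdir/toy_samples"
touch "$workdir/toy_samples/toy_1.fastq" "$workdir/toy_samples/toy_2.fastq"

# The shell expands the pattern to the matching file names itself.
# Quoting "$file" keeps names with spaces in one piece.
processed=0
for file in "$workdir"/toy_samples/toy_*.fastq
do
  echo "would run: fastqc $file"
  processed=$((processed + 1))
done
echo "files processed: ${processed}"
```

On the server, the echo line would be replaced by fastqc "$file" after loading the module as in the scripts above.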

Appendix: Text Editor

The following well-known text editors are available on this server.

  • vi

  • emacs

  • nano

  • I recommend nano if you have no experience with command-line text editors

How to use

  • make a new text file
 $ nano text_file_name.txt
  • save and exit
 Ctrl + x, then y, then Enter

Note

  • In the nano menu, the symbol ^ means the Control key (e.g. ^X is Ctrl + x)

Answers for Questions

Question1, What is the GC content of the FASTA file (TAIR10_chr1.fa, A. thaliana chromosome1)?

masaomi@fgcz-kl-003:/scratch/bio373_2022/masa
$ grep -v '>' TAIR10_chr1.fa| wc -c
25441459
$ grep -v '>' TAIR10_chr1.fa| wc -l
257293
$ echo 25441459-257293|bc
25184166 # total number of ATGC characters
$ grep -v '>' TAIR10_chr1.fa| sed 's/A//g'|sed 's/T//g'|wc -c
10112141
$ grep -v '>' TAIR10_chr1.fa| sed 's/A//g'|sed 's/T//g'|wc -l
257293
$ echo 10112141-257293|bc
9854848 # total number of GC characters
$ echo 9854848/25184166.0|bc -l
.39131127074051211384
  • GC content = 39.13%
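
The same calculation can be written more compactly with tr. This is a sketch on a made-up file (toy.fa is hypothetical), assuming the sequence consists of uppercase letters only.

```shell
# Tiny example: 12 bases in total, 6 of them G or C.
workdir=$(mktemp -d)
printf '>gene1\nAACC\nGGTT\n>gene2\nGCAT\n' > "$workdir/toy.fa"

# All bases: drop the annotation lines, delete newlines, count characters.
total=$(grep -v '>' "$workdir/toy.fa" | tr -d '\n' | wc -c)

# G and C only: tr -cd 'GC' deletes every character except G and C
# (this removes the newlines as well).
gc=$(grep -v '>' "$workdir/toy.fa" | tr -cd 'GC' | wc -c)

# awk does the floating point division.
gc_percent=$(awk -v gc="$gc" -v total="$total" 'BEGIN { printf "%.2f", 100 * gc / total }')
echo "GC content: ${gc_percent}%"
```

If a file contains lowercase bases, the tr set would have to be extended to 'GCgc'.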

Question2, How large (how many bytes) is the total data size (after decompression)?

$ time gzip --fast -c TAIR10_chr1.fa > TAIR10_chr1.fa.fast.gz

real	0m0.493s
user	0m0.460s
sys	0m0.017s
$ time gzip --best -c TAIR10_chr1.fa > TAIR10_chr1.fa.best.gz

real	0m18.072s
user	0m18.047s
sys	0m0.024s

$ ls -l TAIR10_chr1.fa
-rw-rw-r--+ 1 masaomi SG_Employees 27134133 Sep 23 16:05 TAIR10_chr1.fa
$ ls -l TAIR10_chr1.fa.fast.gz
-rw-rw-rw-+ 1 masaomi SG_Employees 9662176 Sep 24 11:15 TAIR10_chr1.fa.fast.gz
$ ls -l TAIR10_chr1.fa.best.gz
-rw-rw-rw-+ 1 masaomi SG_Employees 8127926 Sep 24 11:16 TAIR10_chr1.fa.best.gz
  • before compression: 27134133 bytes
  • after compression: 8127926 to 9662176 bytes (depending on the compression option)

Question3, What kind of files are generated by fastqc command?

$ ls
akam_sample1.fastq  akam_sample2.fastq  akam_sample3.fastq  akam_sample4.fastq

$ fastqc akam_sample1.fastq

$ ls
akam_sample1.fastq        akam_sample1_fastqc.zip  akam_sample3.fastq
akam_sample1_fastqc.html  akam_sample2.fastq       akam_sample4.fastq
  • which means
    • xxx_fastqc.html and xxx_fastqc.zip files are generated after fastqc command

Question4, How different are they (fastqc results of sample1-4)?

  • see

    1. sample1
    2. sample2
    3. sample3
    4. sample4
  • I would say,

    • Overall, the base qualities are good
    • but a lot of adapter contamination is found
    • We would need further adapter filtering or trimming before starting downstream data analyses