## Bio373_2022_Command_Practice.md

      
    Contents

URL
B-Fabric registration
First task: login the server
Check the server specs
Make your own working directory
Calculate G-C content
FastQC
Advanced practice
For Unix experienced students
Appendix: Text Editor
Answers for Questions

Created by gh-md-toc

1. URL


https://gist.github.com/masaomi/999d1177c00116e61909220c1d40e32e

2. B-Fabric registration

Link

https://fgcz-bfabric.uzh.ch

FGCZ servers


B-Fabric: Project management, user management

https://fgcz-bfabric.uzh.ch


gStore: File server, all results are stored here

https://fgcz-gstore.uzh.ch/projects/p29536


SUSHI: Application server, typical NGS applications can run

https://fgcz-sushi.uzh.ch


Note

All projects at FGCZ are managed under B-Fabric system
You can access the NGS data after the sequencing through B-Fabric, gStore, or SUSHI
If you have your account at FGCZ, you do not have to register again

ToDo

Go to https://fgcz-bfabric.uzh.ch/
Click Register
Follow the instructions and input your information

After the registration,

Please send your account name to masaomi.hatakeyama@uzh.ch
Please wait a while until Masa sets up your configuration
Log in to B-Fabric again and make sure you have a new project (p1535) by clicking [My Projects]


3. First task: login the server


Start and open a terminal


Log in to the server

 $ ssh your_BFabric_account_name@172.23.30.6
 your_BFabric_account_name@172.23.30.6's password:
Note

PLEASE replace your_BFabric_account_name with YOUR BFabric ACCOUNT NAME
The password characters are not shown in the terminal, but the system still receives them. Press the Enter key after typing your password
If the login fails, please try again. If it fails several times, confirm your account name and password by logging in to B-Fabric.
This server address, 172.23.30.6, is valid only inside the University LAN. You need to use VPN if you access it from home (i.e. from outside the LAN).

Reference

ssh: connect to a remote computer with Secure SHell Protocol
scp: Secure CoPy, file copy via ssh protocol between local computer and a remote server

What is SSH?

Secure Shell: (Wikipedia) http://en.wikipedia.org/wiki/Secure_Shell
 Secure Shell (SSH) is a cryptographic network protocol for secure data communication,
 remote command-line login, remote command execution, and other secure network
 services between two networked computers that connects, via a secure channel over
 an insecure network, a server and a client (running SSH server and SSH client programs, respectively)

 $ hostname
   (make sure you are working on the server fgcz-kl-003, not on Mac)
 $ pwd
4. Check the server specs

Check the system (OS) information
 $ uname -a
Check the CPU and RAM information of the server (How many CPUs are there? How large is the total RAM?)
 $ cat /proc/cpuinfo
 $ cat /proc/meminfo
Check the currently running processes
 $ htop
 (in order to exit from htop, type 'q')
Reference

pwd: show the current working directory path
ls: show files and sub directories in the current directory
uname: show OS information
cat: show text file contents
/proc/cpuinfo, /proc/meminfo: they contain computer hardware information
htop: show running processes, memory usage, and so on.
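
The two questions above (number of CPUs, total RAM) can also be answered without reading the whole files. The following is a minimal sketch, assuming the usual Linux layout in which /proc/cpuinfo has one line starting with "processor" per CPU and /proc/meminfo has a "MemTotal:" line in kB. The function names count_cpus and total_ram_kb are hypothetical helpers, not commands installed on the server.

```shell
# count_cpus [FILE]: count the "processor" entries in a cpuinfo-style file
# (assumption: one such line per CPU; defaults to /proc/cpuinfo)
count_cpus() {
  grep -c '^processor' "${1:-/proc/cpuinfo}"
}

# total_ram_kb [FILE]: print the MemTotal value (in kB) from a meminfo-style file
# (defaults to /proc/meminfo)
total_ram_kb() {
  awk '/^MemTotal:/ { print $2 }' "${1:-/proc/meminfo}"
}
```

On the server, calling count_cpus and total_ram_kb without arguments should read the real /proc files.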

5. Make your own working directory

Please make your directory under /scratch as follows:
 $ cd /scratch/bio373_2022
 $ mkdir your_name

cd: Change Working directory
mkdir: MaKe a new DIRectory

Note

Do NOT type your_name literally; use YOUR NAME, such as masaomi or tim, so that we can identify who uses the directory (first name, family name, or nickname are all fine)
Use this directory, rather than your home directory, for the command practice during the course, because not enough disk space is allocated to user home directories.
Save all data in this directory; otherwise your data might be deleted by another user, or you might delete another user's data by accident

Copy a sample file to your working directory
 $ cp /scratch/bio373_2022/data/TAIR10_chr1.fa.gz /scratch/bio373_2022/your_name/
 $ ls
Note

NOT your_name but YOUR NAME again.

Change the current working directory to your directory
 $ cd /scratch/bio373_2022/your_name
 $ ls

Make sure that the file TAIR10_chr1.fa.gz is there.

Decompress the compressed file & confirm the decompressed file
 $ gunzip -c TAIR10_chr1.fa.gz > TAIR10_chr1.fa
 $ ls
Note

gunzip: decompress a file that was compressed by the gzip command
-c: write the result to standard output and keep the original file
>: (redirection) change the standard output to a file
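
The combination of -c and > can be tried on a throwaway file before touching the real data. This is a minimal sketch that uses a hypothetical demo.fa in a temporary directory; on the server you would work with TAIR10_chr1.fa.gz instead.

```shell
# Hypothetical demo file in a temporary directory (not part of the course data)
workdir=$(mktemp -d)
printf '>demo_seq\nACGTACGT\n' > "$workdir/demo.fa"

# Plain gzip compresses in place: demo.fa is replaced by demo.fa.gz
gzip "$workdir/demo.fa"

# gunzip -c writes the decompressed data to standard output,
# and > redirects it into a new file; demo.fa.gz stays untouched
gunzip -c "$workdir/demo.fa.gz" > "$workdir/demo.fa"

# Both the compressed and the decompressed file should now be listed
ls "$workdir"
```

Without -c, gunzip would replace demo.fa.gz with demo.fa, so only one of the two files would remain.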

Check if the file is really decompressed
 $ ls
 $ file TAIR10_chr1.fa.gz
 $ file TAIR10_chr1.fa
Note

file: check the file type

Check and compare the file sizes between compressed and decompressed files with different options
 $ ls -l
 $ ls -lh
Note

-l: show detailed information about files
-lh: file size is shown with unit prefix (k, M, G, ...)

Show the first and last few lines of the FASTA file
 (Check the differences of the commands below)
 $ head TAIR10_chr1.fa
 $ tail TAIR10_chr1.fa
 $ head -n 100 TAIR10_chr1.fa
 $ head -n 100 TAIR10_chr1.fa|less
 $ less TAIR10_chr1.fa
 (To exit less command mode, type 'q', to go to the next page, type space or enter key)
Note

head: show first 10 lines
tail: show last 10 lines
| (pipe): command output is passed to the next command as an input
less: show text file data one screen at a time; typing the 'h' key shows the sub-commands
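
head, tail, and the pipe can be combined to cut out any range of lines. Here is a small sketch that uses the output of seq (the numbers 1 to 100, one per line) as a stand-in for a long text file, so it runs without the FASTA file.

```shell
# First 10 lines (the default for head)
first_ten=$(seq 1 100 | head)

# Last 3 lines
last_three=$(seq 1 100 | tail -n 3)

# Lines 41 to 50: keep the first 50 lines, then the last 10 of those
middle=$(seq 1 100 | head -n 50 | tail -n 10)
echo "$middle"
```

The same pattern works on a file, for example head -n 50 TAIR10_chr1.fa | tail -n 10.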

List all the gene annotation lines
 $ grep '>' TAIR10_chr1.fa
 $ grep '>' TAIR10_chr1.fa | less
 (to exit less, type 'q', to go to the next page, type space)
Count the total number of lines, words, and characters
 $ wc TAIR10_chr1.fa
Count only the total number of lines
 $ wc -l TAIR10_chr1.fa
Count only the number of characters
 $ wc -c TAIR10_chr1.fa
6. Calculate G-C content

Count how many genes are defined in the FASTA file (TAIR10_chr1.fa, A. thaliana Chromosome1)
(Count only the gene annotation lines)
 $ grep '>' TAIR10_chr1.fa | wc -l
Count the total number of nucleotide bases
(Do not count the gene annotation line)
Hint: grep with -v option can skip a specific line. Check grep --help or search by Google!
 $ grep -v '>' TAIR10_chr1.fa | wc -c
 $ grep -v '>' TAIR10_chr1.fa | wc -l
Question1

What is the GC content of the FASTA file (TAIR10_chr1.fa, A. thaliana chromosome1)?

Hints

Hint1: sed command can replace characters in a line
Hint2: wc can count lines and characters

 $ echo 'AAATTTGGGCCC' | sed 's/A/Z/g'
 ZZZTTTGGGCCC
Reference/Hint

GC-Content (wikipedia):http://en.wikipedia.org/wiki/GC-content
sed command: replace strings, -e "s/[target character]/[replace character]/g"
regular expression: character pattern, wikipedia http://en.wikipedia.org/wiki/Regular_expression
line break is also counted as one character in wc command
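
The last two hints interact: when sed deletes characters and wc counts what is left, the line break is still counted. A minimal sketch on the same toy sequence as above (the variable names are only for illustration):

```shell
# Delete every A and T, leaving only G and C
remaining=$(echo 'AAATTTGGGCCC' | sed 's/[AT]//g')
echo "$remaining"

# echo adds a line break and wc -c counts it too,
# so the result is one more than the number of bases
count_with_newline=$(echo "$remaining" | wc -c)
echo "$count_with_newline"
```

Six bases remain, but wc -c reports 7. For a multi-line file, the number of line breaks (wc -l) therefore has to be subtracted from the character count.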

Compress TAIR10_chr1.fa with different options (What is the difference?)
 $ gzip -h
 (for example)
 $ time gzip --fast -c TAIR10_chr1.fa > TAIR10_chr1.fa.fast.gz
 $ time gzip --best -c TAIR10_chr1.fa > TAIR10_chr1.fa.best.gz
 $ ls -lh
Question2

How much (How many bytes) is the total data size (after decompression)?

7. FastQC

Copy the following file into your working directory (/scratch/bio373_2022/your_name)
  $ hostname
    (make sure you are working on the server fgcz-kl-003)
  $ pwd
    (make sure you are working in the working directory, /scratch/bio373_2022/your_name)
  $ cp /scratch/bio373_2022/data/akam_samples1.tgz ./
    (do not forget the dot at the end, which means the current working directory)
  $ ls
Extract (decompress) it under your directory
 $ ls
  (make sure akam_samples1.tgz is in your current working directory)
 $ tar zxvf akam_samples1.tgz
 $ ls
Check fastq file
 $ ls
  (make sure the akam_samples1/ directory is there)
 $ cd akam_samples1
 $ cat akam_sample1.fastq | more
 (to exit, type 'q')
Note

There are data from 4 samples.
sample1 and 2 are controls (normal condition)
sample3 and 4 are insecticide-treated
The sample1 and 3 libraries were prepared by FGCZ staff
The sample2 and 4 libraries were prepared by students

FastQC (Quality Control)
 $ source /usr/local/ngseq/etc/lmod_profile
 $ module load QC/FastQC/0.11.9
 $ ls
   (make sure akam_sample1.fastq is there)
 $ fastqc akam_sample1.fastq
 $ ls
   (check the generated files)
Note

module: set up the environment variables for the fastqc command. This command is configured and available only on the fgcz-kl-003 server

Question3

What kind of files are generated by fastqc command?

Logout from fgcz-kl-003
 $ exit
Download the result file from the server. DO NOT forget the dot at the end (it means the current directory on the local computer)!!
  $ hostname
     (make sure you are on your Mac, not on fgcz-kl-003)
  $ cd ~/Desktop
  $ scp your_BFabric_account_name@172.23.30.6:/scratch/bio373_2022/your_name/akam_samples1/akam_sample1_fastqc.html ./
Note


Again, replace your_BFabric_account_name and your_name


Be careful of the small letter l (L) and the number 1 (one)


Open the downloaded html file in a web browser


Check it out!!

What looks good or bad in the QC report?

Let's Try!!


Check the other samples

 akam_sample2.fastq
 akam_sample3.fastq
 akam_sample4.fastq
Question4

How different are they?

8. Advanced practice


MultiQC can gather all the FastQC results into one report.
After generating the FastQC results for all of samples 1-4, try running the following in the akam_samples1 directory:

 $ . "/usr/local/ngseq/miniconda3/etc/profile.d/conda.sh"
 $ conda activate multiqc
 $ multiqc .
Note

conda is a package manager that manages the installation of software and libraries. Like module, it activates an installed software environment.
Please do not forget the final dot after a space.
You will get a multiqc_report.html file at the end. Download it to your local computer with the scp command and open it in a web browser there


MultiQC

http://multiqc.info/
https://doi.org/10.1093/bioinformatics/btw354

9. For Unix experienced students

Let's try to write a shell script with a text editor and run several fastqc commands at once.
fastqc_batch.sh
 source /usr/local/ngseq/etc/lmod_profile
 module load QC/FastQC/0.11.9
 for i in `seq 1 4`
 do
   fastqc akam_samples1/akam_sample${i}.fastq
 done
To execute it
 $ bash fastqc_batch.sh
Note

Please write the shell script with a text editor (it is fine to write it locally and upload it to the server, but for text editor practice, try one of the CLI text editors such as vi or nano)
In this example, the shell script is named fastqc_batch.sh, but any file name works.
seq: make a list of numbers
for: repeat the commands between do and done, assigning each element of the list to the variable i in turn
The loop variable is referred to with the $ symbol
To execute the shell script, call it with bash
In a shell script, backquotes `...` are replaced by the output of the enclosed command. The script fastqc_batch2.sh below applies the same technique with ls instead of seq

Additional samples


/scratch/bio373_2022/data/akam_samples2.tgz


Copy it into your working directory


Extract the archived files (please use tar command)


Execute FastQC (by the following shell script)


fastqc_batch2.sh
 source /usr/local/ngseq/etc/lmod_profile
 module load QC/FastQC/0.11.9
 for file in `ls -1 akam_samples2/w2271_*.fastq`
 do
   fastqc "$file"
 done
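
As a variation on fastqc_batch2.sh, the for loop can iterate over a file name pattern directly, without calling ls inside backquotes. The sketch below is one possible way to write this and is not part of the course scripts; run_on_fastq is a hypothetical helper name, and the command to run is passed as an argument so that the loop can be tried with echo first.

```shell
# run_on_fastq DIR CMD...: run CMD once for every *.fastq file in DIR
run_on_fastq() {
  dir=$1
  shift
  for file in "$dir"/*.fastq
  do
    # If no file matches, the pattern itself is left in $file; skip it
    [ -e "$file" ] || continue
    # "$@" is the command given after DIR; the file name is appended to it
    "$@" "$file"
  done
}
```

For a dry run, run_on_fastq akam_samples2 echo prints the file names that would be processed; replacing echo with fastqc (after module load) would run FastQC on each of them. Quoting "$file" keeps file names with spaces intact.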
Appendix: Text Editor

The following well-known text editors are available on this server.


vi


emacs


nano


I recommend nano if you have no experience with command-line text editors


How to use

make a new text file

 $ nano text_file_name.txt

save and exit

 Ctrl + x, then y, then Enter
Note

The symbol ^ shown in the nano menu means the Control key

Answers for Questions

Question1, What is the GC content of the FASTA file (TAIR10_chr1.fa, A. thaliana chromosome1)?
masaomi@fgcz-kl-003:/scratch/bio373_2022/masa
$ grep -v '>' TAIR10_chr1.fa| wc -c
25441459
$ grep -v '>' TAIR10_chr1.fa| wc -l
257293
$ echo 25441459-257293|bc
25184166 # total number of ATGC characters
$ grep -v '>' TAIR10_chr1.fa| sed 's/A//g'|sed 's/T//g'|wc -c
10112141
$ grep -v '>' TAIR10_chr1.fa| sed 's/A//g'|sed 's/T//g'|wc -l
257293
$ echo 10112141-257293|bc
9854848 # total number of GC characters
$ echo 9854848/25184166.0|bc -l
.39131127074051211384


GC contents = 39.13%
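
The same calculation can be wrapped into one function. The following is an alternative sketch, not the method used above: it uses tr -cd to keep only the wanted letters, so line breaks never enter the count, and it divides by the number of A, C, G, and T characters only. That means any other symbols (for example N) are ignored, whereas the sed approach above counts everything that is not A or T as GC. The function name gc_percent is hypothetical.

```shell
# gc_percent FILE: print the GC content of a FASTA file as a percentage
gc_percent() {
  # Drop the header lines, then delete every character that is not G or C
  gc=$(grep -v '>' "$1" | tr -cd 'GCgc' | wc -c)
  # Same again, keeping all four bases, to get the denominator
  total=$(grep -v '>' "$1" | tr -cd 'ACGTacgt' | wc -c)
  # awk does the floating-point division and formats two decimals
  awk -v gc="$gc" -v total="$total" 'BEGIN { printf "%.2f\n", 100 * gc / total }'
}
```

Usage on the server would be gc_percent TAIR10_chr1.fa. If the file contains characters other than A, C, G, and T, the result can differ slightly from the value computed above.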

Question2, How much (How many bytes) is the total data size (after decompression)?
$ time gzip --fast -c TAIR10_chr1.fa > TAIR10_chr1.fa.fast.gz

real	0m0.493s
user	0m0.460s
sys	0m0.017s
$ time gzip --best -c TAIR10_chr1.fa > TAIR10_chr1.fa.best.gz

real	0m18.072s
user	0m18.047s
sys	0m0.024s

$ ls -l TAIR10_chr1.fa
-rw-rw-r--+ 1 masaomi SG_Employees 27134133 Sep 23 16:05 TAIR10_chr1.fa
$ ls -l TAIR10_chr1.fa.fast.gz
-rw-rw-rw-+ 1 masaomi SG_Employees 9662176 Sep 24 11:15 TAIR10_chr1.fa.fast.gz
$ ls -l TAIR10_chr1.fa.best.gz
-rw-rw-rw-+ 1 masaomi SG_Employees 8127926 Sep 24 11:16 TAIR10_chr1.fa.best.gz


before compression: 27134133 bytes
after compression: 8127926-9662176 bytes (depending on the compression option)
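
To compare the two options more directly, the compressed size can be expressed as a percentage of the original size. A minimal sketch using the byte counts from the ls -l output above; percent_of_original is a hypothetical helper name.

```shell
# percent_of_original COMPRESSED ORIGINAL:
# print the compressed size as a percentage of the original size
percent_of_original() {
  awk -v c="$1" -v o="$2" 'BEGIN { printf "%.2f\n", 100 * c / o }'
}

# File sizes in bytes, copied from the ls -l output above
percent_of_original 9662176 27134133   # gzip --fast
percent_of_original 8127926 27134133   # gzip --best
```

With these numbers, --fast keeps roughly 36% of the original size and --best roughly 30%, while the time output above shows that --best took about 18 seconds instead of about half a second.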

Question3, What kind of files are generated by fastqc command?
$ ls
akam_sample1.fastq  akam_sample2.fastq  akam_sample3.fastq  akam_sample4.fastq

$ fastqc akam_sample1.fastq

$ ls
akam_sample1.fastq        akam_sample1_fastqc.zip  akam_sample3.fastq
akam_sample1_fastqc.html  akam_sample2.fastq       akam_sample4.fastq


which means

xxx_fastqc.html and xxx_fastqc.zip files are generated after fastqc command


Question4, How different are they (fastqc results of sample1-4)?


see

sample1
sample2
sample3
sample4


I would say,

Overall, the base qualities are good
but a lot of adapter contaminations are found
We would need further adapter filtering or trimming before starting downstream data analyses