Contents
- URL
- B-Fabric registration
- First task: log in to the server
- Check the server specs
- Make your own working directory
- Calculate G-C content
- FastQC
- Advanced practice
- For Unix experienced students
- Appendix: Text Editor
- Answers for Questions
Link
- B-Fabric: Project management, user management
- gStore: File server, all results are stored here
- SUSHI: Application server, typical NGS applications can run
Note
- All projects at FGCZ are managed in the B-Fabric system
- After sequencing, you can access the NGS data through B-Fabric, gStore, or SUSHI
- If you already have an account at FGCZ, you do not have to register again
ToDo
- Go to https://fgcz-bfabric.uzh.ch/
- Click Register
- Follow the instructions and input your information
After the registration,
- Please send your account name to masaomi.hatakeyama@uzh.ch
- Please wait for a while until Masa sets up your configuration
- Log in to B-Fabric again, and please make sure you have a new project (p1535) by clicking [My Projects]
- Start and open a terminal
- Log in to the server
$ ssh your_BFabric_account_name@172.23.30.6
your_BFabric_account_name@172.23.30.6's password:
Note
- PLEASE replace your_BFabric_account_name with YOUR BFabric ACCOUNT NAME
- The password characters are not shown in the terminal, but the system still receives them. Please press the Enter key after typing your password
- If the login fails, please try again. If it fails several times, please confirm your account name and password by logging in to B-Fabric.
- This server address, 172.23.30.6, is valid only inside the University LAN. You need to use VPN if you access it from home (i.e. outside of the LAN).
Reference
- ssh: connect to a remote computer with Secure SHell Protocol
- scp: Secure CoPy, file copy via ssh protocol between local computer and a remote server
What is SSH?
Secure Shell: (Wikipedia) http://en.wikipedia.org/wiki/Secure_Shell
Secure Shell (SSH) is a cryptographic network protocol for secure data communication,
remote command-line login, remote command execution, and other secure network
services between two networked computers that connects, via a secure channel over
an insecure network, a server and a client (running SSH server and SSH client programs, respectively)
$ hostname
(make sure you are working on the server fgcz-kl-003, not on Mac)
$ pwd
Check the system (OS) information
$ uname -a
Check CPU and RAM information of the server (How many CPUs are there? How many bytes is the total RAM capacity?)
$ cat /proc/cpuinfo
$ cat /proc/meminfo
Check the currently running processes
$ htop
(in order to exit from htop, type 'q')
Reference
- pwd: show the current working directory path
- ls: show files and sub directories in the current directory
- uname: show OS information
- cat: show text file contents
- /proc/cpuinfo, /proc/meminfo: they contain computer hardware information
- htop: show running processes, memory usage, and so on.
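The two questions above can also be answered without reading the whole files. The following is a minimal sketch, assuming a Linux system: each logical CPU appears as one "processor" line in /proc/cpuinfo, and the MemTotal line in /proc/meminfo reports the total RAM in kB. The variable names cpus and mem_kb are only for illustration.

```shell
# Count logical CPUs: one "processor" line per CPU (Linux only)
cpus=$(grep -c '^processor' /proc/cpuinfo)
echo "logical CPUs: $cpus"

# Total RAM: the MemTotal line is reported in kB
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
echo "total RAM: $mem_kb kB"
```

Multiply the kB value by 1024 to get bytes.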
Please make your directory under /scratch as follows:
$ cd /scratch/bio373_2022
$ mkdir your_name
- cd: Change Working directory
- mkdir: MaKe a new DIRectory
Note
- Please DO NOT type your_name literally; type YOUR NAME, such as masaomi or tim, to identify who uses the directory (first name, family name, nickname, whatever you like)
- You should use this directory for your command practice during the course, rather than your home directory, because not enough disk space is allocated to user home directories.
- All data should be saved in this directory, otherwise your data might be deleted by another user or you might delete other user's data by accident
Copy a sample file to your working directory
$ cp /scratch/bio373_2022/data/TAIR10_chr1.fa.gz /scratch/bio373_2022/your_name/
$ ls
Note
- NOT your_name but YOUR NAME again.
Change the current working directory to your directory
$ cd /scratch/bio373_2022/your_name
$ ls
- Make sure that the TAIR10_chr1.fa.gz file is there.
Decompress the compressed file & confirm the decompressed file
$ gunzip -c TAIR10_chr1.fa.gz > TAIR10_chr1.fa
$ ls
Note
- gunzip: decompress a file that was compressed by the gzip command
- -c: write the decompressed data to standard output and keep the original file
- >: (redirection) change the standard output to a file
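The effect of -c can be tried on a small throwaway file first. This is a minimal sketch; the temporary directory and the file names demo.fa and demo_copy.fa are made up for illustration and are not part of the course data.

```shell
# Work in a temporary directory so nothing in your own directory is touched
tmp=$(mktemp -d)
printf '>seq1\nACGT\n' > "$tmp/demo.fa"

# Plain gzip replaces demo.fa with demo.fa.gz
gzip "$tmp/demo.fa"

# gunzip -c writes to standard output; > redirects it into a new file,
# so the compressed original stays in place
gunzip -c "$tmp/demo.fa.gz" > "$tmp/demo_copy.fa"

ls "$tmp"
```

After this, ls should list both demo.fa.gz and demo_copy.fa. Without -c, gunzip would replace the .gz file with the decompressed one.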
Check if the file is really decompressed
$ ls
$ file TAIR10_chr1.fa.gz
$ file TAIR10_chr1.fa
Note
- file: check the file type
Check and compare the file sizes between compressed and decompressed files with different options
$ ls -l
$ ls -lh
Note
- -l: show detailed information about files
- -lh: show the file size with a unit prefix (K, M, G, ...)
Show the first and last few lines of the FASTA file
(Check the differences of the commands below)
$ head TAIR10_chr1.fa
$ tail TAIR10_chr1.fa
$ head -n 100 TAIR10_chr1.fa
$ head -n 100 TAIR10_chr1.fa|less
$ less TAIR10_chr1.fa
(To exit less command mode, type 'q', to go to the next page, type space or enter key)
Note
- head: show first 10 lines
- tail: show last 10 lines
- | (pipe): command output is passed to the next command as an input
- less: show text file data one screen at a time; typing the 'h' key shows the sub-commands
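As a small illustration of the pipe, head and tail can be combined to show a range of lines from the middle of a file. This is a sketch on a made-up file of numbers (the name nums is hypothetical), so the result is easy to verify.

```shell
# Make a test file with the numbers 1..100, one per line
nums=$(mktemp)
seq 1 100 > "$nums"

# head keeps lines 1-20, then tail keeps the last 5 of those: lines 16-20
range=$(head -n 20 "$nums" | tail -n 5)
echo "$range"   # prints 16, 17, 18, 19, 20 on separate lines
```

The same pattern works on TAIR10_chr1.fa when you want to look at a specific part of the file.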
List all the gene annotation lines
$ grep '>' TAIR10_chr1.fa
$ grep '>' TAIR10_chr1.fa | less
(to exit less, type 'q', to go to the next page, type space)
Count the total number of lines, words, and characters
$ wc TAIR10_chr1.fa
Count only the total number of lines
$ wc -l TAIR10_chr1.fa
Count only the number of characters
$ wc -c TAIR10_chr1.fa
Count how many genes are defined in the FASTA file (TAIR10_chr1.fa, A. thaliana Chromosome1) (Count only the gene annotation lines)
$ grep '>' TAIR10_chr1.fa | wc -l
Count the total number of nucleotide bases (Do not count the gene annotation line)
Hint: grep with -v option can skip a specific line. Check grep --help or search by Google!
$ grep -v '>' TAIR10_chr1.fa | wc -c
$ grep -v '>' TAIR10_chr1.fa | wc -l
Question1
- How much is the GC content in the FASTA file (TAIR10_chr1.fa, A. thaliana chromosome1)?
Hints
- Hint1: sed command can replace characters in a line
- Hint2: wc can count lines and characters
$ echo 'AAATTTGGGCCC' | sed 's/A/Z/g'
ZZZTTTGGGCCC
Reference/Hint
- GC-Content (wikipedia):http://en.wikipedia.org/wiki/GC-content
- sed command: replace strings, -e "s/[target character]/[replace character]/g"
- regular expression: character pattern, wikipedia http://en.wikipedia.org/wiki/Regular_expression
- line break is also counted as one character in wc command
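The last point is easy to check on a tiny example. The following sketch uses a made-up two-line FASTA file (the name fa is hypothetical) and the tr command, which is not used elsewhere in this tutorial, to delete the line breaks before counting.

```shell
# Toy FASTA file: one header line and 8 bases on two lines
fa=$(mktemp)
printf '>gene1\nACGT\nGGCC\n' > "$fa"

# wc -c counts the 8 bases plus 2 line breaks
with_breaks=$(grep -v '>' "$fa" | wc -c)
echo $with_breaks       # prints 10

# tr -d '\n' deletes the line breaks, so only the bases are counted
bases_only=$(grep -v '>' "$fa" | tr -d '\n' | wc -c)
echo $bases_only        # prints 8
```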
Compress TAIR10_chr1.fa with different options (What is the difference?)
$ gzip -h
(for example)
$ time gzip --fast -c TAIR10_chr1.fa > TAIR10_chr1.fa.fast.gz
$ time gzip --best -c TAIR10_chr1.fa > TAIR10_chr1.fa.best.gz
$ ls -lh
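Whatever option is chosen, gzip is lossless: only the size and the run time differ, and the decompressed data are identical. A minimal sketch on a generated toy file (the directory work and the file toy.fa are made up for illustration):

```shell
# Generate a repetitive toy file of 2000 lines
work=$(mktemp -d)
for i in $(seq 1 2000); do echo "ACGTTGCAACGGTTAACCGG$i"; done > "$work/toy.fa"

# Compress with the fastest and with the strongest setting
gzip --fast -c "$work/toy.fa" > "$work/toy.fast.gz"
gzip --best -c "$work/toy.fa" > "$work/toy.best.gz"

# Compare the sizes in bytes
wc -c "$work/toy.fa" "$work/toy.fast.gz" "$work/toy.best.gz"

# Both archives decompress to exactly the original content
gunzip -c "$work/toy.fast.gz" | cmp - "$work/toy.fa" && echo "fast: identical"
gunzip -c "$work/toy.best.gz" | cmp - "$work/toy.fa" && echo "best: identical"
```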
Question2
- How much (How many bytes) is the total data size (after decompression)?
Copy the following file into your working directory (/scratch/bio373_2022/your_name)
$ hostname
(make sure you are working on the server fgcz-kl-003)
$ pwd
(make sure you are working in the working directory, /scratch/bio373_2022/your_name)
$ cp /scratch/bio373_2022/data/akam_samples1.tgz ./
(do not forget the dot at the end, which means the current working directory)
$ ls
Extract (decompress) it under your directory
$ ls
(make sure akam_samples1.tgz is in your current working directory)
$ tar zxvf akam_samples1.tgz
$ ls
Check fastq file
$ ls
(make sure the akam_samples1/ directory is there)
$ cd akam_samples1
$ cat akam_sample1.fastq | more
(to exit, type 'q')
Note
- There are data for 4 samples.
- sample1 and 2 are controls (normal condition)
- sample3 and 4 are insecticide-treated
- sample1 and 3 libraries were prepared by an FGCZ staff member
- sample2 and 4 libraries were prepared by students
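A quick way to see how many reads a FASTQ file holds is to divide its line count by 4. This is a sketch that assumes every record has exactly four lines (header, sequence, +, quality), which is the usual layout; the toy file below is made up and stands in for akam_sample1.fastq.

```shell
# Toy FASTQ file with two reads (4 lines per read)
fq=$(mktemp)
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nIIII\n' > "$fq"

# Count the lines and divide by 4 to get the number of reads
lines=$(wc -l < "$fq")
reads=$((lines / 4))
echo "$reads reads"   # prints: 2 reads
```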
FastQC (Quality Control)
$ source /usr/local/ngseq/etc/lmod_profile
$ module load QC/FastQC/0.11.9
$ ls
(make sure akam_sample1.fastq is there)
$ fastqc akam_sample1.fastq
$ ls
(check the generated files)
Note
- module: set up the environment variables appropriately for the fastqc command. This command is configured and available only on the fgcz-kl-003 server
Question3
- What kind of files are generated by fastqc command?
Logout from fgcz-kl-003
$ exit
Download the result file from the server. DO NOT forget the dot at the end (it means the current directory on the local computer)!!
$ hostname
(make sure you are now on your Mac, not on fgcz-kl-003)
$ cd ~/Desktop
$ scp your_BFabric_account_name@172.23.30.6:/scratch/bio373_2022/your_name/akam_samples1/akam_sample1_fastqc.html ./
Note
- Again, replace your_BFabric_account_name and your_name
- Be careful to distinguish the lowercase letter l (L) from the number 1 (one)
- Open the downloaded html file in a web browser
Check it out!!
- What is good or bad in the QC report?
- Check the other samples
akam_sample2.fastq
akam_sample3.fastq
akam_sample4.fastq
Question4
- How different are they?
- MultiQC can gather all the fastqc results into one file nicely.
- After making the FastQC results for all of samples 1-4, run the following in the akam_samples1 directory:
$ . "/usr/local/ngseq/miniconda3/etc/profile.d/conda.sh"
$ conda activate multiqc
$ multiqc .
Note
- conda is a package manager that manages the installation of software/libraries. Like module, it activates installed software (an environment).
- Please do not forget the final dot, separated by one space.
- At the end you will get a multiqc_report.html file. Download it to your local computer with the scp command and open it in a web browser there.
MultiQC
Let's try to make a shell script using a text editor and run several fastqc commands at once.
fastqc_batch.sh
source /usr/local/ngseq/etc/lmod_profile
module load QC/FastQC/0.11.9
for i in `seq 1 3`
do
fastqc akam_samples1/akam_sample${i}.fastq
done
To execute it
$ bash fastqc_batch.sh
Note
- Please make the shell script using a text editor (it is fine to make it locally and upload to the server, but for the text editor practice, let's try to use one of the CLI text editors such as vi or nano)
- In this example, the shell script file name is fastqc_batch.sh but it does not matter whatever the file name is.
- seq: make a list of numbers
- for: repeat the commands between do and done, assigning each element of the list to the variable i in turn
- The loop variable is referenced with the $ symbol (e.g. ${i})
- To execute the shell script, pass it to bash
- In a shell script, a command enclosed in backquotes `` is replaced by its output. As an application of this, the following script works in the same way as the shell script above
Additional samples
- /scratch/bio373_2022/data/akam_samples2.tgz
- Copy it in your working directory
- Extract the archived files (please use tar command)
- Execute FastQC (by the following shell script)
fastqc_batch2.sh
source /usr/local/ngseq/etc/lmod_profile
module load QC/FastQC/0.11.9
for file in `ls -1 akam_samples2/w2271_*.fastq`
do
fastqc $file
done
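As an alternative sketch, the shell can expand the * wildcard by itself, so the loop does not need ls at all. This variant also keeps working if a file name contains spaces. To stay runnable anywhere, the example below creates two empty, made-up files in a temporary directory and only prints the fastqc command instead of running it; the file names w2271_a.fastq and w2271_b.fastq are hypothetical.

```shell
# Set up a stand-in for the akam_samples2 directory with two empty files
demo=$(mktemp -d)
mkdir "$demo/akam_samples2"
touch "$demo/akam_samples2/w2271_a.fastq" "$demo/akam_samples2/w2271_b.fastq"

count=0
# The shell replaces the pattern with the list of matching files
for file in "$demo"/akam_samples2/w2271_*.fastq
do
  # Skip the literal pattern if nothing matched
  [ -e "$file" ] || continue
  echo "would run: fastqc $file"
  count=$((count + 1))
done
echo "$count files found"
```

On the server, replace the echo line with the real fastqc "$file" call after loading the module.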
The following well-known text editors are available on this server.
- vi
- emacs
- nano
- I recommend nano if you have no experience with a command-line text editor
How to use
- make a new text file
$ nano text_file_name.txt
- save and exit
Ctrl + x, y
Note
- In the nano menu at the bottom of the screen, the symbol ^ means the Control key (e.g. ^X is Ctrl + x)
Question1, How much is the GC content in the FASTA file (TAIR10_chr1.fa, A. thaliana chromosome1)?
masaomi@fgcz-kl-003:/scratch/bio373_2022/masa
$ grep -v '>' TAIR10_chr1.fa| wc -c
25441459
$ grep -v '>' TAIR10_chr1.fa| wc -l
257293
$ echo 25441459-257293|bc
25184166 # total number of ATGC characters
$ grep -v '>' TAIR10_chr1.fa| sed 's/A//g'|sed 's/T//g'|wc -c
10112141
$ grep -v '>' TAIR10_chr1.fa| sed 's/A//g'|sed 's/T//g'|wc -l
257293
$ echo 10112141-257293|bc
9854848 # total number of GC characters
$ echo 9854848/25184166.0|bc -l
.39131127074051211384
- GC content = 39.13%
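The same calculation can be written more compactly. The following is an alternative sketch, not the method used above: it uses tr instead of sed to keep only G and C characters, and awk instead of bc for the division. It runs on a made-up toy file so the result can be checked by hand; on the server, the toy file would be replaced by TAIR10_chr1.fa.

```shell
# Toy FASTA file: 6 A/T bases and 6 G/C bases, so the GC content is 0.5
fa=$(mktemp)
printf '>chr_demo\nAAATTT\nGGGCCC\n' > "$fa"

# Total number of bases: drop the header line and the line breaks
total=$(grep -v '>' "$fa" | tr -d '\n' | wc -c)

# Number of G and C bases: tr -cd deletes every character except G, C, g, c
gc=$(grep -v '>' "$fa" | tr -cd 'GCgc' | wc -c)

# Divide with awk, printing four decimal places
ratio=$(awk -v gc="$gc" -v total="$total" 'BEGIN { printf "%.4f", gc / total }')
echo "GC content: $ratio"   # prints: GC content: 0.5000
```

Note that, as in the answer above, any character other than A, T, G, C in the sequence lines (for example N) is still counted in the total.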
Question2, How much (How many bytes) is the total data size (after decompression)?
$ time gzip --fast -c TAIR10_chr1.fa > TAIR10_chr1.fa.fast.gz
real 0m0.493s
user 0m0.460s
sys 0m0.017s
$ time gzip --best -c TAIR10_chr1.fa > TAIR10_chr1.fa.best.gz
real 0m18.072s
user 0m18.047s
sys 0m0.024s
$ ls -l TAIR10_chr1.fa
-rw-rw-r--+ 1 masaomi SG_Employees 27134133 Sep 23 16:05 TAIR10_chr1.fa
$ ls -l TAIR10_chr1.fa.fast.gz
-rw-rw-rw-+ 1 masaomi SG_Employees 9662176 Sep 24 11:15 TAIR10_chr1.fa.fast.gz
$ ls -l TAIR10_chr1.fa.best.gz
-rw-rw-rw-+ 1 masaomi SG_Employees 8127926 Sep 24 11:16 TAIR10_chr1.fa.best.gz
- before compression: 27134133 bytes
- after compression: 8127926 to 9662176 bytes (depending on the compression option)
Question3, What kind of files are generated by fastqc command?
$ ls
akam_sample1.fastq akam_sample2.fastq akam_sample3.fastq akam_sample4.fastq
$ fastqc akam_sample1.fastq
$ ls
akam_sample1.fastq akam_sample1_fastqc.zip akam_sample3.fastq
akam_sample1_fastqc.html akam_sample2.fastq akam_sample4.fastq
- which means
- xxx_fastqc.html and xxx_fastqc.zip files are generated after fastqc command
Question4, How different are they (fastqc results of sample1-4)?