masaomi/bio334_day2_part1.md Secret

## bio334_day2_part1.md

      
    Raw
  

              bio334_day2_part1.md
            
          
    Day2 Part1 9-10

Goal


Understand command line argument and file input/output process

Example1


Command line argument

Source code
import sys          # import sys library
                           
print(sys.argv)     # print command line arguments
Result
$ python3 day2_1_example1.py 123 abc
['day2_1_example1.py', '123', 'abc']

Note

sys.argv returns the file path and the command line arguments with list
Importing sys module is needed before calling sys.argv

Example2


Load data from a file

import sys                  # Import sys module to use open method
file = open("input.fa")     # Open the file input.fa and return the file handle
for line in file:           # Repeat reading line by line from the file
    print(line.rstrip())    # Show the line with the line break removed
file.close()                # Close the file
Result
$ python3 day2_1_example2.py  
>seq00000
GAAGACTAGT
>seq00001
GAAGACTAGT
>seq00002
GAAGACTAGT
>seq00003
GAAGACTAGT
>seq00004
GAAGACTAGT
....

Note


Before running the script, please check whether the input.fa in the current working directory.


To read the contents of a file, you need the following:

import sys module
open a file
read line by line using for statement
close the file


The open and close processes are necessary to tell the python interpreter about where the data location is and reading/writing from/to a file safely


Example 3

import sys                  # Import sys module to use open method
file = open(sys.argv[1])    # Open the file and return the file handle
for line in file:           # Repeat reading line by line from the file
    print(line.rstrip())    # Show the line with the line break removed
file.close()                # Close the file
Result
$ python3 day2_1_example3.py  input.fa
>seq00000
GAAGACTAGT
>seq00001
GAAGACTAGT
>seq00002
GAAGACTAGT
>seq00003
GAAGACTAGT
>seq00004
GAAGACTAGT
....

Note

Please pay attention to the command line argument compared to Example2

Exercise1


Load a FASTA file by specifying the file path with the command line argument, and calculate the genome size (the total number of nucleotides)

import sys      # import sys package for open and close methods

genome_size = 0                             # Initialize a variable to store the genome size
file = open(sys.argv[1])        # Open a fasta file, athal_genome.fa
for line in file:               # Read line by line using for statement
  #
  # Your code here
  #
file.close()                                # Close the file

print("genome size =", genome_size, "bp")   # Show the genome_size
Expected result
$ python day2_1_exercise1.py athal_genome.fa
genome size = 119667750 bp
Advanced exercise1


Calculate GC content (the ratio of (GC/genome size)) of A. thaliana genome (athal_genome.fa)

Expected result
$ python day2_1_advancecd1.py athal_genome.fa
Genome size = 119667750 bp
GC content = GC count/genome size = 0.35999748470243653

Advanced exercise2


Read a FASTQ file (sample.fq) and calculate 1. the total number of reads, 2. the average length of reads, and 3. GC content

Hint

Fastq file format is as follows:

@HWI-ST1357:71:D2FNTACXX:1:1101:1235:2123 1:N:0:CGATGT
AGTCTCTCTGCTACCTTCATCGTCTACCCATGGCGGCCTTAGTTCCTGCGTCCTTACTAAACCCTCCGTTTCTCTTACGAAGGTCTTCGCGGTTCCCGAAA
+
@@<DFFFFHHHHHIIJIIJIIJHGHIIJJCGIIJIJGIJJGHIGAFH@FB@GHFHFFEDFFEEDBDDDADDCCDDCCCD<??BACCCCBBBDBBB@A####


The first line is an annotation line
The second line is the sequence data
The third line is nothing
The last line is quality data for each nucleotide

Expected result
$ python day2_1_advanced2.py sample.fq
Total reads = 250
Average read length = 101.0
GC content = GC count/total = 0.4575049504950495