- Understand command line argument and file input/output process
- Command line argument
Source code
import sys # import sys library
print(sys.argv) # print command line arguments
Result
$ python3 day2_1_example1.py 123 abc ['day2_1_example1.py', '123', 'abc']
Note
sys.argv
returns the file path and the command line arguments withlist
- Importing
sys
module is needed before callingsys.argv
- Load data from a file
import sys # Import sys module to use open method
file = open("input.fa") # Open the file input.fa and return the file handle
for line in file: # Repeat reading line by line from the file
print(line.rstrip()) # Show the line with the line break removed
file.close() # Close the file
Result
$ python3 day2_1_example2.py >seq00000 GAAGACTAGT >seq00001 GAAGACTAGT >seq00002 GAAGACTAGT >seq00003 GAAGACTAGT >seq00004 GAAGACTAGT ....
Note
-
Before running the script, please check whether the input.fa in the current working directory.
-
To read the contents of a file, you need the following:
- import sys module
- open a file
- read line by line using
for
statement - close the file
-
The open and close processes are necessary to tell the python interpreter about where the data location is and reading/writing from/to a file safely
import sys # Import sys module to use open method
file = open(sys.argv[1]) # Open the file and return the file handle
for line in file: # Repeat reading line by line from the file
print(line.rstrip()) # Show the line with the line break removed
file.close() # Close the file
Result
$ python3 day2_1_example3.py input.fa >seq00000 GAAGACTAGT >seq00001 GAAGACTAGT >seq00002 GAAGACTAGT >seq00003 GAAGACTAGT >seq00004 GAAGACTAGT ....
Note
- Please pay attention to the command line argument compared to Example2
- Load a FASTA file by specifying the file path with the command line argument, and calculate the genome size (the total number of nucleotides)
import sys # import sys package for open and close methods
genome_size = 0 # Initialize a variable to store the genome size
file = open(sys.argv[1]) # Open a fasta file, athal_genome.fa
for line in file: # Read line by line using for statement
#
# Your code here
#
file.close() # Close the file
print("genome size =", genome_size, "bp") # Show the genome_size
Expected result
$ python day2_1_exercise1.py athal_genome.fa genome size = 119667750 bp
- Calculate GC content (the ratio of (GC/genome size)) of A. thaliana genome (athal_genome.fa)
Expected result
$ python day2_1_advancecd1.py athal_genome.fa Genome size = 119667750 bp GC content = GC count/genome size = 0.35999748470243653
- Read a FASTQ file (sample.fq) and calculate 1. the total number of reads, 2. the average length of reads, and 3. GC content
Hint
- Fastq file format is as follows:
@HWI-ST1357:71:D2FNTACXX:1:1101:1235:2123 1:N:0:CGATGT
AGTCTCTCTGCTACCTTCATCGTCTACCCATGGCGGCCTTAGTTCCTGCGTCCTTACTAAACCCTCCGTTTCTCTTACGAAGGTCTTCGCGGTTCCCGAAA
+
@@<DFFFFHHHHHIIJIIJIIJHGHIIJJCGIIJIJGIJJGHIGAFH@FB@GHFHFFEDFFEEDBDDDADDCCDDCCCD<??BACCCCBBBDBBB@A####
- The first line is an annotation line
- The second line is the sequence data
- The third line is nothing
- The last line is quality data for each nucleotide
Expected result
$ python day2_1_advanced2.py sample.fq Total reads = 250 Average read length = 101.0 GC content = GC count/total = 0.4575049504950495