This is a list of common mistakes and bad practices seen in assignments submitted for the Bioinformatics lesson.
Prefer using with instead of open/close
https://www.pythonforbeginners.com/files/reading-and-writing-files-in-python
One bonus of using this method is that any files opened will be closed automatically after you are done. This leaves less to worry about during cleanup.
Avoid:
f = open('filename.txt', 'w')
...do stuff...
f.close()
Better:
with open('filename.txt', 'w') as f:
    ...do stuff...
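A minimal runnable sketch of the same idea, assuming a throwaway file in a temporary directory (the name demo.txt is made up for illustration). It shows that the file is still open inside the with block and already closed right after it, even though close() is never called explicitly.

```python
import os
import tempfile

# Hypothetical file in a temporary directory, used only for this demonstration.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "demo.txt")

with open(path, "w") as f:
    f.write("chr1\tHAVANA\tgene\n")
    inside_closed = f.closed   # the file is still open inside the block

after_closed = f.closed        # closed automatically when the block ends

print(inside_closed, after_closed)  # False True
```

The file is also closed if the code inside the block raises an exception, which is easy to forget with a manual open/close pair.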
Use continue where possible to skip unwanted items early. This will save you a level of indentation.
Avoid:
for line in f1:
    if not "HAVANA" in line:
        ...do stuff...
Better:
for line in f1:
    if "HAVANA" in line:
        continue
    ...do stuff...
When parsing, reading, or opening large files, make sure that you are using the right tools the right way. Calling DataFrame.append in a loop is extremely slow and is not intended for loading large files. The method was deprecated in pandas 1.4 and removed in pandas 2.0.
Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.
For example, collect the data in plain Python structures first and create the DataFrame once, with from_dict.
Never use append to transform a pandas DataFrame:
Avoid:
df5 = pd.DataFrame(columns = ['FeatureTranscript', 'GeneIDTranscript'])
for index, row in dataframe.iterrows():
    if row['Feature']=="transcript":
        # See note 3 (-2)
        df5 = df5.append({'FeatureTranscript' : row['Feature'], 'GeneIDTranscript' : row['GeneID']}, ignore_index=True)
Better:
df5 = dataframe[dataframe['Feature'] == 'transcript'][['Feature', 'GeneID']]
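When the rows really have to be produced one at a time (for example while parsing), the "collect in a list, build once" advice can be sketched as below. The records list is invented sample data, and the column names are taken from the example above. This is one possible way to do it, not the only one.

```python
import pandas as pd

# Hypothetical (feature, gene id) records; in practice these come from parsing a file.
records = [
    ("gene", "G1"),
    ("transcript", "G1"),
    ("exon", "G1"),
    ("transcript", "G2"),
]

# Collect plain dicts in a list (cheap) instead of growing a DataFrame row by row.
rows = []
for feature, gene_id in records:
    if feature != "transcript":
        continue
    rows.append({"FeatureTranscript": feature, "GeneIDTranscript": gene_id})

# Build the DataFrame once, at the end.
df5 = pd.DataFrame(rows, columns=["FeatureTranscript", "GeneIDTranscript"])

# The same table built from a dict of columns with from_dict.
columns = {
    "FeatureTranscript": [row["FeatureTranscript"] for row in rows],
    "GeneIDTranscript": [row["GeneIDTranscript"] for row in rows],
}
df5_alt = pd.DataFrame.from_dict(columns)

print(df5)
```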
Use sum() over boolean expressions (in a list comprehension or generator expression) to count things. Do not use count = count + 1.
Remember: True + True = 2, True + False = 1
Avoid:
cnt=0
for item2 in dataframe.Parent:
    if item1==item2:
        cnt=cnt+1
Better:
cnt = sum(item1==item2 for item2 in dataframe.Parent)
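A small self-contained illustration of why this works, using an invented list in place of dataframe.Parent. Each comparison produces True or False, and sum() treats them as 1 and 0.

```python
# Hypothetical stand-in for the dataframe.Parent column.
parents = ["gene1", "gene2", "gene1", "gene3", "gene1"]
item1 = "gene1"

# Each comparison yields True or False; sum() adds them up as 1 and 0.
cnt = sum(item1 == item2 for item2 in parents)

# For a real pandas Series the same count can be written as
# (dataframe.Parent == item1).sum()
print(cnt)  # 3
```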
Most of the time in Python you don't need an index to iterate over a list. You can get the items of the list directly.
Avoid:
for i in range(len(lines1)):
    ls = lines1[i].split('\t')
Better:
for line in lines1:
    ls = line.split('\t')
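For the less common case where the position is also needed (for example to report a line number), enumerate gives both the index and the item without falling back to range(len(...)). A short sketch with made-up lines:

```python
# Hypothetical tab-separated lines.
lines1 = ["chr1\tHAVANA\tgene", "chr1\tHAVANA\texon"]

features = []
# start=1 makes the counter behave like a human-readable line number.
for line_number, line in enumerate(lines1, start=1):
    ls = line.split("\t")
    features.append((line_number, ls[2]))

print(features)  # [(1, 'gene'), (2, 'exon')]
```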
Better still: use functions.
Use list comprehensions.
Avoid:
topGeneID = []
for i in range(len(lines1)):
    ls = lines1[i].split('\t')
    if ls[2] == 'gene':
        continue
    else:
        topGeneID.append(lines1[i].split('\t')[-1].split(';')[2].split('=')[1])
Better:
f = lambda x : x.split('\t')[-1].split(';')[2].split('=')[1]
topGeneID = [f(line) for line in lines1 if line.split('\t')[2] != 'gene']
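The same comprehension can also be written with small named functions instead of a lambda, which gives the parsing step a name and a docstring. The function names and the sample lines below are invented for illustration; the lines only assume that the gene ID is the third key=value pair of the last column, as in the code above.

```python
def third_attribute_value(line):
    """Return the value of the third key=value pair in the last column."""
    attributes = line.split("\t")[-1]
    return attributes.split(";")[2].split("=")[1]


def feature_type(line):
    """Return the third tab-separated column (the feature type)."""
    return line.split("\t")[2]


# Hypothetical GFF-like lines.
lines1 = [
    "chr1\tHAVANA\tgene\t1\t100\t.\t+\t.\tID=g1;Name=A;gene_id=G1",
    "chr1\tHAVANA\ttranscript\t1\t100\t.\t+\t.\tID=t1;Parent=g1;gene_id=G1",
    "chr1\tHAVANA\texon\t1\t50\t.\t+\t.\tID=e1;Parent=t1;gene_id=G1",
    "chr2\tHAVANA\ttranscript\t5\t80\t.\t-\t.\tID=t2;Parent=g2;gene_id=G2",
]

topGeneID = [third_attribute_value(line) for line in lines1 if feature_type(line) != "gene"]
print(topGeneID)  # ['G1', 'G1', 'G2']
```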
Do not use Python built-in function names as variable names. The fact that you can does not mean that you should! Try:
a = str(123)
print(a)  # prints: 123
str = 5
a = str(123)  # raises TypeError: 'int' object is not callable
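A sketch that reproduces the problem safely inside a function, so the built-in is only hidden locally, next to a version that uses a descriptive name instead. Both function names are made up for this example.

```python
def shadowing_demo():
    # Inside this function the name str is rebound to an int,
    # which hides the built-in str type for the rest of the function.
    str = 5
    try:
        str(123)
    except TypeError as error:
        return f"TypeError: {error}"
    return "no error"


def safe_demo():
    # A descriptive variable name leaves the built-in untouched.
    count = 5
    return str(count)


print(shadowing_demo())
print(safe_demo())
```

The same applies to other tempting names such as list, dict, max, sum, id, type and input.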
Use the Counter class (from the collections module) when you want to count something in Python.
Avoid:
cnt_gene = cnt_exon = cnt_cds = cnt_transcript = 0
for i in range(len(genelist)):
    if(genelist[i][1] == 'gene'):
        cnt_gene += 1
    elif(genelist[i][1] == 'transcript'):
        cnt_transcript += 1
    elif(genelist[i][1] == 'exon'):
        cnt_exon += 1
    elif(genelist[i][1] == 'CDS'):
        cnt_cds += 1
Better:
from collections import Counter
items_to_count = [item[1] for item in genelist if item[1] in ['gene', 'transcript', 'exon', 'CDS']]
print (Counter(items_to_count))
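A runnable version with invented data, showing a few Counter conveniences: it accepts a generator directly, missing keys count as zero, and most_common returns the top entries. The contents of genelist here are an assumption about its shape (pairs whose second element is the feature type).

```python
from collections import Counter

# Hypothetical stand-in for genelist: (id, feature type) pairs.
genelist = [
    ("g1", "gene"), ("t1", "transcript"), ("e1", "exon"),
    ("e2", "exon"), ("c1", "CDS"), ("u1", "UTR"),
]
wanted = {"gene", "transcript", "exon", "CDS"}

# Counter can consume a generator; no intermediate list is needed.
counts = Counter(item[1] for item in genelist if item[1] in wanted)

print(counts["exon"])         # 2
print(counts["start_codon"])  # 0, missing keys count as zero
print(counts.most_common(1))  # [('exon', 2)]
```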
Python has a max function!
NEVER EVER DO:
max_value = counters2[len(counters2)-1]
for i in range(len(counters2)):
    if(counters2[i] > max_value):
        max_location = i
        max_value = counters2[i]
Instead do:
max_value = max(counters2)
max_location = counters2.index(max_value)
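If value and position are both needed, max can also find them in a single pass by comparing (index, value) pairs through its key argument. A sketch with made-up numbers; on ties both variants report the first maximum.

```python
# Hypothetical counts with a tie for the maximum.
counters2 = [3, 9, 4, 9, 1]

# Two passes over the list: one for max, one for index.
max_value = max(counters2)
max_location = counters2.index(max_value)

# Single pass: compare (index, value) pairs by their value.
location_one_pass, value_one_pass = max(enumerate(counters2), key=lambda pair: pair[1])

print(max_location, max_value)            # 1 9
print(location_one_pass, value_one_pass)  # 1 9
```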
How to iterate all lines in a file:
with open(filename) as f:
    for line in f:
        ...
How not to iterate all lines in a file:
f = open(filename)
for line in f.read().splitlines():
    ...
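The second form loads the whole file into memory at once and never closes it, while iterating over the file object reads one line at a time. One detail to keep in mind: lines read this way keep their trailing newline. A sketch, assuming a small temporary file created only for the demonstration:

```python
import os
import tempfile

# Hypothetical two-line file written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "demo.gff")
with open(path, "w") as f:
    f.write("chr1\tHAVANA\tgene\nchr1\tHAVANA\texon\n")

features = []
with open(path) as f:
    for line in f:
        # Lines keep their trailing newline when iterating over a file.
        fields = line.rstrip("\n").split("\t")
        features.append(fields[2])

print(features)  # ['gene', 'exon']
```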
Python has the special value None, which is similar to null in other languages.
Avoid:
a = 'None'
a = 'TIPOTA'
a = 'PRAMA'
Better:
a = None
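A string such as 'None' is an ordinary non-empty string, so it is truthy and cannot be told apart from real data. The real None can be tested with is None. A sketch with a hypothetical helper, find_gene_id, that returns None when nothing is found:

```python
def find_gene_id(attributes, key="gene_id"):
    """Return the value stored under key, or None if the key is absent."""
    for pair in attributes.split(";"):
        name, _, value = pair.partition("=")
        if name == key:
            return value
    return None


found = find_gene_id("ID=t1;gene_id=G1")
missing = find_gene_id("ID=t1;Parent=g1")

# Identity comparison is the idiomatic test for None.
if missing is None:
    print("no gene_id present")

print(found)         # G1
print(bool("None"))  # True, the string 'None' is not empty
```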
Python has a garbage collector! Therefore, there is no need to del a variable before reassigning it.
Avoid:
a = [1,2,3]
del a
a = [5,6,7]
Better:
a = [1,2,3]
a = [4,5,6]
A very good coding practice is to never leave an if..elif chain hanging without a final else:
Avoid:
if x=='a':
    ...do_stuff()..
elif x == 'b':
    ...do_other_stuff()..
Better:
if x=='a':
    ...do_stuff()..
elif x == 'b':
    ...do_other_stuff()..
else:
    raise Exception('This is weird')
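A runnable sketch of the same pattern. The function strand_to_sign and its inputs are invented for illustration, and it raises ValueError rather than a bare Exception, which is a slightly more specific choice and not something the original prescribes.

```python
def strand_to_sign(strand):
    """Map a strand symbol to +1 or -1, failing loudly on anything else."""
    if strand == "+":
        return 1
    elif strand == "-":
        return -1
    else:
        # Without this branch an unexpected value would silently return None.
        raise ValueError(f"Unexpected strand: {strand!r}")


print(strand_to_sign("+"))  # 1
print(strand_to_sign("-"))  # -1
```

Calling strand_to_sign("?") stops the program with a clear error message instead of letting a wrong value travel further down the analysis.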