kantale/notes.md

## notes.md

      
This is a list of common mistakes and bad practices seen in assignments submitted for the Bioinformatics lesson.

Note 1

Prefer using with instead of open/close
https://www.pythonforbeginners.com/files/reading-and-writing-files-in-python

One bonus of using this method is that any files opened will be closed automatically after you are done. This leaves less to worry about during cleanup.

Avoid:
f = open('filename.txt', 'w')
...do stuff..
f.close()
Better:
with open('filename.txt', 'w') as f:
    ...do stuff...
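
The main benefit shows up when something goes wrong halfway through. The following is a minimal sketch, assuming a scratch file and a deliberately raised error (both made up for illustration): the file is closed even though the body of the with block fails.

```python
import os
import tempfile

# Hypothetical scratch file, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'filename.txt')

try:
    with open(path, 'w') as f:
        f.write('chr1\t100\n')
        # Simulate a failure in the middle of the work.
        raise ValueError('something went wrong while writing')
except ValueError:
    pass

# The with block closed the file on the way out, despite the exception.
closed_after_error = f.closed
print(closed_after_error)  # True
```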

Note 2

Use continue where possible. It saves you a level of indentation.
Avoid:
for line in f1:
    if "HAVANA" not in line:
        ...do stuff...
Better:
for line in f1:
    if "HAVANA" in line:
        continue
        
    ...do stuff..
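
A runnable version of the same pattern, as a sketch. The sample lines and the kept list are invented here; only the early continue comes from the note.

```python
# Made-up GTF-like lines; the second column is the annotation source.
lines = [
    'chr1\tHAVANA\tgene',
    'chr1\tENSEMBL\tgene',
    'chr2\tHAVANA\texon',
    'chr2\tENSEMBL\texon',
]

kept = []
for line in lines:
    if 'HAVANA' in line:
        continue  # skip early, so the main logic stays at one indentation level

    fields = line.split('\t')
    kept.append((fields[0], fields[2]))

print(kept)  # [('chr1', 'gene'), ('chr2', 'exon')]
```
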
Note 3

When parsing, reading, or opening large files, make sure that you are using the right tools in the right way.
DataFrame.append in pandas is extremely slow and is not intended for loading large files. It was deprecated in pandas 1.4 and removed in pandas 2.0.

Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.

For example, collect the data in a plain list or dict first and build the DataFrame once at the end, e.g. with pd.DataFrame.from_dict.
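
One way this could look, as a sketch that assumes pandas is installed. The sample lines and column names are invented for illustration: the rows are collected in a list of dicts and the DataFrame is built a single time.

```python
import pandas as pd

# Made-up tab-separated lines standing in for a large annotation file.
lines = ['gene\tG1', 'transcript\tG1', 'exon\tG1', 'transcript\tG2']

# Collect plain dicts in a list first (appending to a list is cheap)...
rows = []
for line in lines:
    feature, gene_id = line.split('\t')
    rows.append({'Feature': feature, 'GeneID': gene_id})

# ...then build the DataFrame once at the end.
df = pd.DataFrame(rows)
print(df.shape)  # (4, 2)
```

For a real tab-separated file, pd.read_csv(filename, sep='\t') usually does all of this in one call.
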
Note 3.1

Never use append to transform a pandas DataFrame:
Avoid:
df5 = pd.DataFrame(columns = ['FeatureTranscript', 'GeneIDTranscript'])
for index, row in dataframe.iterrows():
    if row['Feature']=="transcript":
        # See Note 3 (-2)
        df5 = df5.append({'FeatureTranscript' : row['Feature'], 'GeneIDTranscript' : row['GeneID']}, ignore_index=True)
Better:
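# Select the matching rows and the two columns in one step, then rename to match the names above
df5 = dataframe.loc[dataframe['Feature'] == 'transcript', ['Feature', 'GeneID']]
df5 = df5.rename(columns={'Feature': 'FeatureTranscript', 'GeneID': 'GeneIDTranscript'})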
Note 4

Use sum() over a comprehension of boolean expressions to count things. Do not use count = count + 1.
Remember: True + True = 2, True + False = 1
Avoid:
cnt = 0
for item2 in dataframe.Parent:
    if item1 == item2:
        cnt = cnt + 1
Better:
cnt = sum(item1==item2 for item2 in dataframe.Parent)
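
A self-contained sketch of the same idea, with a plain list (made up here) standing in for dataframe.Parent:

```python
# Made-up values standing in for dataframe.Parent.
parents = ['g1', 'g2', 'g1', 'g3', 'g1']
item1 = 'g1'

# Each comparison yields True or False; sum() treats them as 1 or 0.
cnt = sum(item1 == item2 for item2 in parents)
print(cnt)  # 3
```

For a simple equality test on a list, parents.count(item1) gives the same number.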

Note 5

Most of the time in Python you don't need an index to iterate over a list. You can get the items of the list directly.
Avoid:
for i in range(len(lines1)): 
    ls = lines1[i].split('\t')
Better:
for line in lines1: 
    ls = line.split('\t')
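
For the cases where the position really is needed (for example, to report a line number), enumerate gives both the index and the item. A small sketch with made-up lines:

```python
# Made-up tab-separated lines.
lines1 = ['chr1\tgene', 'chr2\texon']

numbered = []
for line_number, line in enumerate(lines1, start=1):
    ls = line.split('\t')
    numbered.append((line_number, ls[0]))

print(numbered)  # [(1, 'chr1'), (2, 'chr2')]
```
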
Note 6

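Don't Repeat Yourself (DRY).
Avoid:
Copying and pasting the same block of code with small changes.
Better:
Put the shared logic in a function and call it.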
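
As an illustration only (the helper count_feature and the sample lines are hypothetical, not taken from any assignment), logic that would otherwise be pasted once per feature type can live in one function:

```python
def count_feature(lines, feature):
    """Count the tab-separated lines whose third column equals `feature`."""
    return sum(line.split('\t')[2] == feature for line in lines)


# Made-up annotation lines.
lines = [
    'chr1\tHAVANA\tgene',
    'chr1\tHAVANA\texon',
    'chr1\tHAVANA\texon',
]

# One function, called twice, instead of two nearly identical loops.
n_genes = count_feature(lines, 'gene')
n_exons = count_feature(lines, 'exon')
print(n_genes, n_exons)  # 1 2
```
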
Note 7

Use list comprehensions.
Avoid:
topGeneID = []

for i in range(len(lines1)):
    ls = lines1[i].split('\t')
    if ls[2] == 'gene':
        continue
    else:
        topGeneID.append(lines1[i].split('\t')[-1].split(';')[2].split('=')[1])
Better:
def f(x):
    # last column -> third 'key=value' attribute -> its value
    return x.split('\t')[-1].split(';')[2].split('=')[1]

topGeneID = [f(line) for line in lines1 if line.split('\t')[2] != 'gene']
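
To try the comprehension out, here is a sketch with invented input. The four-column layout and the attribute names are assumptions chosen so that the third key=value pair of the last column holds the gene ID.

```python
# Made-up lines: chromosome, source, feature, attributes.
lines1 = [
    'chr1\tHAVANA\tgene\tID=g1;Name=A;gene_id=G1',
    'chr1\tHAVANA\ttranscript\tID=t1;Parent=g1;gene_id=G1',
    'chr1\tHAVANA\texon\tID=e1;Parent=t1;gene_id=G1',
    'chr2\tHAVANA\ttranscript\tID=t2;Parent=g2;gene_id=G2',
]


def third_attribute_value(line):
    # Last column -> third 'key=value' pair -> the value part.
    return line.split('\t')[-1].split(';')[2].split('=')[1]


# Keep every line that is not a gene, and extract its gene ID.
topGeneID = [
    third_attribute_value(line)
    for line in lines1
    if line.split('\t')[2] != 'gene'
]
print(topGeneID)  # ['G1', 'G1', 'G2']
```
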
Note 8

Do not use the names of Python built-in functions as variable names. The fact that you can does not mean that you should!
Try:
a = str(123)
print(a)  # prints: 123

str = 5
a = str(123)  # Raises TypeError: 'int' object is not callable
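
The same experiment can be run safely inside a function, so that the built-in is only shadowed locally. This is a sketch; the function name shadowing_demo is made up.

```python
def shadowing_demo():
    str = 5  # inside this function, the name str now refers to an int
    try:
        str(123)
    except TypeError as error:
        # str() is unusable here, so read the message from error.args.
        return error.args[0]
    return None


message = shadowing_demo()
print(message)

# Outside the function the built-in is untouched.
print(str(123))  # 123
```

Other names that are easy to shadow by accident include list, dict, max, sum, id, type and input.
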
Note 9

Use the Counter class from collections when you want to count things in Python.
Avoid:
cnt_gene = cnt_exon = cnt_cds = cnt_transcript = 0
for i in range(len(genelist)):
    if(genelist[i][1] == 'gene'):
        cnt_gene += 1
    elif(genelist[i][1] == 'transcript'):
        cnt_transcript += 1
    elif(genelist[i][1] == 'exon'):
        cnt_exon += 1
    elif(genelist[i][1] == 'CDS'):
        cnt_cds += 1
Better:
from collections import Counter
items_to_count = [item[1] for item in genelist if item[1] in ('gene', 'transcript', 'exon', 'CDS')]
print(Counter(items_to_count))
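
A runnable sketch with a made-up genelist, showing a few things Counter gives you for free:

```python
from collections import Counter

# Made-up records; the second element is the feature type.
genelist = [
    ('chr1', 'gene'), ('chr1', 'transcript'), ('chr1', 'exon'),
    ('chr1', 'exon'), ('chr1', 'CDS'), ('chr1', 'start_codon'),
]
wanted = {'gene', 'transcript', 'exon', 'CDS'}

counts = Counter(item[1] for item in genelist if item[1] in wanted)

print(counts['exon'])         # 2
print(counts['UTR'])          # 0, missing keys count as zero
print(counts.most_common(1))  # [('exon', 2)]
```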

Note 10

Python has a max function!
NEVER EVER DO:
max_value = counters2[len(counters2)-1]
for i in range(len(counters2)):
    if(counters2[i] > max_value):
        max_location = i
        max_value = counters2[i]
Instead do:
max_value = max(counters2)
max_location = counters2.index(max_value)
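
If both the value and its position are needed, max can also return them in a single pass. A sketch with made-up numbers; on ties it reports the first maximum, the same as counters2.index(max_value).

```python
counters2 = [3, 9, 4, 9]  # made-up counts

# enumerate yields (index, value) pairs; compare the pairs by value only.
max_location, max_value = max(enumerate(counters2), key=lambda pair: pair[1])
print(max_location, max_value)  # 1 9
```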

Note 11

How to iterate all lines in a file:
with open(filename) as f:
   for line in f:
      ...
How not to iterate all lines in a file (this reads the whole file into memory at once and never closes it):
f = open(filename)
for line in f.read().splitlines():
   ...
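
One practical difference to keep in mind: when iterating over the file object, each line keeps its trailing newline, whereas splitlines() removes it. A sketch using a scratch file created just for this example:

```python
import os
import tempfile

# Hypothetical scratch file with three lines.
path = os.path.join(tempfile.mkdtemp(), 'example.txt')
with open(path, 'w') as f:
    f.write('first\nsecond\nthird\n')

stripped = []
with open(path) as f:
    for line in f:  # reads one line at a time
        stripped.append(line.rstrip('\n'))  # drop the trailing newline

print(stripped)  # ['first', 'second', 'third']
```
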
Note 12

Python has the special value None, which is similar to null in other languages.
Avoid:
a = 'None'
a = 'TIPOTA'
a = 'PRAMA'
Better:
a = None
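
A related habit is to test for it with is None rather than comparing against a string. A small sketch with a made-up dictionary:

```python
gene_lengths = {'G1': 1200, 'G2': 850}  # made-up data

# dict.get returns None when the key is missing.
length = gene_lengths.get('G3')

if length is None:
    message = 'no length recorded'
else:
    message = 'length is ' + str(length)

print(message)  # no length recorded
```
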
Note 13

Python has a garbage collector! Therefore, you do not need to del a variable before reassigning it.
Avoid:
a = [1,2,3]
del a
a = [5,6,7]
Better:
a = [1,2,3]
a = [5,6,7]
Note 14

A very good coding practice is to never leave an if...elif chain without a final else:
Avoid:
if x == 'a':
    do_stuff()
elif x == 'b':
    do_other_stuff()
Better:
if x == 'a':
    do_stuff()
elif x == 'b':
    do_other_stuff()
else:
    raise Exception('This is weird')
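
A runnable sketch of the same pattern. The function describe and its return values are made up, and ValueError is used here as a more specific choice than the bare Exception above.

```python
def describe(x):
    if x == 'a':
        return 'first branch'
    elif x == 'b':
        return 'second branch'
    else:
        # Fail loudly instead of silently doing nothing.
        raise ValueError(f'Unexpected value for x: {x!r}')


print(describe('a'))  # first branch

try:
    describe('c')
except ValueError as error:
    print(error)  # Unexpected value for x: 'c'
```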