@kantale
Last active June 15, 2019 08:58
TEI_assignments_notes

This is a list of common mistakes and bad practices seen in assignments submitted for the Bioinformatics lesson.

Note 1

Prefer using with instead of open/close

https://www.pythonforbeginners.com/files/reading-and-writing-files-in-python

One bonus of using this method is that any files opened will be closed automatically after you are done. This leaves less to worry about during cleanup.

Avoid:

f = open('filename.txt', 'w')
...do stuff..
f.close()

Better:

with open('filename.txt', 'w') as f:
    ...do stuff...
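
As a small runnable sketch of the point above (the temporary path and the file contents are made up for illustration), the file is closed as soon as the with block ends, without any explicit close call:

```python
import os
import tempfile

# Hypothetical output path inside a throwaway temporary directory.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "filename.txt")

with open(path, "w") as f:
    f.write("chr1\tHAVANA\tgene\n")
    inside = f.closed   # False: the file is still open inside the block

outside = f.closed      # True: closed automatically once the block ends

with open(path) as f:
    content = f.read()
```

The same cleanup happens if an exception is raised inside the block, which is easy to get wrong with a manual close.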

Note 2

Use continue where possible. It saves you a level of indentation.

Avoid:

for line in f1:
    if not "HAVANA" in line:
       ...do stuff...

Better:

for line in f1:
    if "HAVANA" in line:
        continue
        
    ...do stuff..
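
A runnable version of the same pattern might look like this. The sample lines and the kept list are invented for illustration; they stand in for the file f1 and for whatever "do stuff" means in your assignment:

```python
# Made-up sample lines standing in for the file f1.
lines = [
    "chr1\tHAVANA\tgene",
    "chr1\tENSEMBL\tgene",
    "chr2\tHAVANA\texon",
    "chr2\tENSEMBL\texon",
]

kept = []
for line in lines:
    if "HAVANA" in line:
        continue  # skip early; the main logic stays at one indentation level

    fields = line.split("\t")
    kept.append((fields[0], fields[2]))
```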

Note 3

When parsing, reading, or opening large files, make sure that you are using the right tools the right way. The pandas DataFrame.append method is extremely slow and is not intended for loading large files. It was deprecated in pandas 1.4 and removed in pandas 2.0.

Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.

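For example, collect the values in plain Python lists or dicts while reading the file, and build the DataFrame once at the end with pandas.DataFrame.from_dict.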
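
A minimal sketch of that approach, assuming pandas is installed. The sample lines and the column names Feature and GeneID are made up to resemble the assignment data:

```python
import pandas as pd

# Hypothetical tab-separated lines: feature type, then gene id.
lines = ["gene\tG1", "transcript\tG1", "exon\tG1", "transcript\tG2"]

# Accumulate values in ordinary Python lists (cheap appends)...
columns = {"Feature": [], "GeneID": []}
for line in lines:
    feature, gene_id = line.split("\t")
    columns["Feature"].append(feature)
    columns["GeneID"].append(gene_id)

# ...and build the DataFrame a single time at the end.
df = pd.DataFrame.from_dict(columns)
```

Appending to a list is cheap, while every DataFrame append copies the whole table, so the cost of the loop above grows linearly rather than quadratically with the number of rows.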

Note 3.1

Never use append to transform a pandas DataFrame:

Avoid:

df5 = pd.DataFrame(columns = ['FeatureTranscript', 'GeneIDTranscript'])
for index, row in dataframe.iterrows():
    if row['Feature']=="transcript":
        # See Note 3 (-2)
        df5 = df5.append({'FeatureTranscript' : row['Feature'], 'GeneIDTranscript' : row['GeneID']}, ignore_index=True)

Better:

df5 = dataframe[dataframe['Feature'] == 'transcript'][['Feature', 'GeneID']]

Note 4

Use sum() over boolean expressions (in a list comprehension or a generator expression) to count things, instead of count = count + 1. Remember: True + True = 2 and True + False = 1.

Avoid:

cnt=0
for item2 in dataframe.Parent:
    if item1==item2:
        cnt=cnt+1

Better:

cnt = sum(item1==item2 for item2 in dataframe.Parent)
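
To see why this works, here is a small self-contained sketch. The parents list is a made-up stand-in for dataframe.Parent:

```python
# Made-up stand-in for dataframe.Parent.
parents = ["geneA", "geneB", "geneA", "geneC", "geneA"]
item1 = "geneA"

# Each comparison yields True or False; sum() treats them as 1 and 0.
cnt = sum(item1 == item2 for item2 in parents)

# For a plain Python list, list.count gives the same answer.
same = parents.count(item1)
```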

Note 5

Most of the time in Python you don't need an index to iterate over a list. You can get the items of the list directly.

Avoid:

for i in range(len(lines1)): 
    ls = lines1[i].split('\t')

Better:

for line in lines1: 
    ls = line.split('\t')
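
In the less common case where you really do need the position as well, the built-in enumerate gives you both without range(len(...)). A short sketch with made-up data:

```python
lines1 = ["chr1\tgene", "chr2\texon", "chr3\tCDS"]

numbered = []
# enumerate yields (position, item) pairs; start=1 gives 1-based line numbers.
for line_number, line in enumerate(lines1, start=1):
    ls = line.split("\t")
    numbered.append((line_number, ls[1]))
```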

Note 6

Don't Repeat Yourself (DRY)

Avoid: copy-pasting the same block of code with small changes.

Better: put the shared logic in a function and call it.
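
As an illustration only (the function name count_feature and the sample lines are hypothetical, not taken from any submission), one function called with different arguments can replace several near-identical loops:

```python
def count_feature(lines, feature):
    """Count tab-separated lines whose third column equals `feature`."""
    return sum(line.split("\t")[2] == feature for line in lines)


lines1 = [
    "chr1\tHAVANA\tgene",
    "chr1\tHAVANA\texon",
    "chr1\tHAVANA\texon",
]

# One function, called twice, instead of two copy-pasted loops.
n_genes = count_feature(lines1, "gene")
n_exons = count_feature(lines1, "exon")
```

If the counting rule ever changes, it now changes in one place.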

Note 7

Use list comprehensions.

Avoid:

topGeneID = []

for i in range(len(lines1)):
    ls = lines1[i].split('\t')
    if ls[2] == 'gene':
        continue
    else:
        topGeneID.append(lines1[i].split('\t')[-1].split(';')[2].split('=')[1])

Better:

def gene_id(line):
    return line.split('\t')[-1].split(';')[2].split('=')[1]

topGeneID = [gene_id(line) for line in lines1 if line.split('\t')[2] != 'gene']

Note 8

Do not use the names of Python built-in functions as variable names. The fact that you can does not mean that you should! Try:

a = str(123)
print(a)  # prints: 123

str = 5
a = str(123)  # Raises TypeError: 'int' object is not callable

Note 9

Use the Counter class from the collections module when you want to count things in Python.

Avoid:

cnt_gene = cnt_exon = cnt_cds = cnt_transcript = 0
for i in range(len(dataframes)):
    if(genelist[i][1] == 'gene'):
        cnt_gene += 1
    elif(genelist[i][1] == 'transcript'):
        cnt_transcript += 1
    elif(genelist[i][1] == 'exon'):
        cnt_exon += 1
    elif(genelist[i][1] == 'CDS'):
        cnt_cds += 1

Better:

from collections import Counter
items_to_count = [item[1] for item in genelist if item[1] in ['gene', 'transcript', 'exon', 'CDS']]
print (Counter(items_to_count))
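
A self-contained sketch of what Counter gives you. The genelist below is made-up data in the same (id, feature) shape as above:

```python
from collections import Counter

# Made-up stand-in for genelist: (id, feature) pairs.
genelist = [("g1", "gene"), ("t1", "transcript"), ("e1", "exon"),
            ("e2", "exon"), ("c1", "CDS"), ("x1", "start_codon")]

wanted = {"gene", "transcript", "exon", "CDS"}
counts = Counter(item[1] for item in genelist if item[1] in wanted)

n_exons = counts["exon"]           # look up one count like a dict
n_missing = counts["five_prime"]   # missing keys count as 0, no KeyError
top = counts.most_common(1)        # most frequent item with its count
```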

Note 10

Python has a max function!

NEVER EVER DO:

max_value = counters2[len(counters2)-1]
for i in range(len(counters2)):
    if(counters2[i] > max_value):
        max_location = i
        max_value = counters2[i]

Instead do:

max_value = max(counters2)
max_location = counters2.index(max_value)
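
If you prefer a single pass over the list, one possible variant combines max with enumerate and a key function. This is a sketch with made-up numbers; like index, it reports the first position when the maximum appears more than once:

```python
counters2 = [3, 9, 4, 9, 1]

# max over (index, value) pairs, compared by value only.
max_location, max_value = max(enumerate(counters2), key=lambda pair: pair[1])
```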

Note 11

How to iterate all lines in a file:

with open(filename) as f:
   for line in f:
      ...

How not to iterate all lines in a file (this reads the whole file into memory at once and never closes it):

f = open(filename)
for line in f.read().splitlines():
   ...
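
A runnable sketch of the recommended loop, using a small temporary file created just for the example. Note that each line keeps its trailing newline, so you usually strip it first:

```python
import os
import tempfile

# Create a small throwaway file to read back.
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "w") as f:
    f.write("first\nsecond\nthird\n")

lengths = []
with open(path) as f:
    for line in f:                  # one line at a time, not the whole file
        line = line.rstrip("\n")    # each line keeps its trailing newline
        lengths.append(len(line))
```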

Note 12

Python has the special value None, which is similar to null in other languages. Use it, rather than a made-up string, to mean "no value".

Avoid:

a = 'None'
a = 'TIPOTA'
a = 'PRAMA'

Better:

a = None
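
A short sketch of how None is typically returned and checked. The function find_gene and the gene names are hypothetical examples:

```python
def find_gene(name, genes):
    """Return the first gene equal to `name`, or None if there is none."""
    for gene in genes:
        if gene == name:
            return gene
    return None


found = find_gene("BRCA1", ["TP53", "BRCA1"])
missing = find_gene("MYC", ["TP53", "BRCA1"])

# Test for None with `is`, never by comparing to the string 'None'.
missing_is_none = missing is None
```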

Note 13

Python has a garbage collector! Therefore there is no need to delete a variable before assigning it a new value.

Avoid:

a = [1,2,3]
del a
a = [5,6,7]

Better:

a = [1,2,3]
a = [5,6,7]

Note 14

A very good coding practice is to never leave an if..elif chain hanging without a final else:

Avoid:

if x == 'a':
    do_stuff()
elif x == 'b':
    do_other_stuff()

Better:

if x == 'a':
    do_stuff()
elif x == 'b':
    do_other_stuff()
else:
    raise Exception('This is weird')
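
A runnable sketch of the same idea. The function describe and its return values are invented for the example, and ValueError is one reasonable choice of exception rather than a requirement:

```python
def describe(x):
    if x == "a":
        return "first"
    elif x == "b":
        return "second"
    else:
        # Unexpected input fails loudly instead of silently doing nothing.
        raise ValueError(f"Unexpected value: {x!r}")


result = describe("b")

try:
    describe("c")
except ValueError as error:
    message = str(error)
```

Without the final else, describe("c") would quietly return None and the mistake would surface much later, far from its cause.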