@uniqueg
Last active December 9, 2021 18:17
Intro to Programming: Follow-up tutorial session Dec 9th

Intro to Programming: Q&A tutorial session, Dec 9th, 2021


13:20:50 From Linnea: I just wanted to ask how we should name the file in our code - so that you dont have to alter it when you take the single tasks. Sequence or seq.fa?

@Linnea: It looks like you are asking about the input file? Its name shouldn't matter, because we will test your functions individually, not the whole code together. Specifically, we will import your function definitions and pass in whatever inputs we want. It is important, though, that you use the function definitions exactly as we defined them: do not change the function names or argument names. It is okay to add additional arguments, as long as you provide defaults for them.

An example to highlight how we will consume/test your code. Let's say you have defined this function:

def main(fasta):
  # function code goes here
  pass

Now when we get your solution, we will import your function main and then run it like so:

our_input_file = "/path/to/some/test/fasta/file.fa"

main(fasta=our_input_file)

Then we check that main() behaves as expected by comparing the outputs we expect with the outputs that we actually get from your implementation.
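A rough sketch of what such a check could look like. Everything inside main() here is a made-up stand-in (it just counts records), and the extra argument `header_char` is hypothetical too; it only illustrates that an additional argument with a default does not get in the way of our call. The real assignment functions and test files are different.

```python
import os
import tempfile


def main(fasta, header_char=">"):
    """Hypothetical stand-in: count the records in a FASTA file.

    `fasta` is the argument name from the assignment; `header_char` is an
    extra argument invented for this sketch, allowed because it has a default.
    """
    count = 0
    with open(fasta) as handle:
        for line in handle:
            if line.startswith(header_char):
                count += 1
    return count


# Roughly what we do on our side: prepare a test file, call, compare.
with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False) as tmp:
    tmp.write(">seq1\nACGT\n>seq2\nGGCC\n")
    our_input_file = tmp.name

expected = 2
actual = main(fasta=our_input_file)  # the extra argument keeps its default
os.remove(our_input_file)
assert actual == expected
```

Note that the call only works because the function name (main) and the argument name (fasta) are unchanged.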


13:27:42 From zoë von arx: Do we have to turn in 4 different files or just one with all the four tasks?

@Zoe: One file with all functions.

In principle, the file can include any other code as well; we will just ignore it and only use/test the functions as outlined in the assignment. However, if there are syntax or other errors in the code that prevent us from importing your functions, we won't be able to run them. So we recommend that, before you submit, you remove all code that is not inside one of the functions we defined (or inside helper functions that you defined yourself and that those functions call; probably not the case for most of you). After you have removed all code outside of functions, you can "Run" the file on Replit to ensure that it doesn't throw an error. It shouldn't produce any output (the functions are not called), so an empty output is fine/expected.


13:35:09 From zoë von arx: When exactly should i translate the dna into rna

@Zoe: Actually, you don't need to convert the DNA into RNA at all in this assignment. If we sequence RNA, we always first create a cDNA of it, so we will always deal with DNA here, even if the original biological sequences are RNA.


13:47:00 From Svenja junco: just to make sure the input file / se.faq the data there, we refer to it coding in the main file or we have to copy it directly into the main file?

@Svenja: Yes, you refer to it in the code by filename (e.g., seq.fa). Do not copy data into the code directly. But also remember (as mentioned in other comments here): you should not refer to filenames inside the functions. You pass a file (by filename, as a string or as a variable that contains that filename) as an argument to the function, like so:

def func_a(input_file):
  # function code
  # do something with variable `input_file`
  # do NOT do something with a literal file, like `"seq.fa"`
  pass

Now, it's okay if you have code outside the function to call it (for your own tests):

my_input_file = "seq.fa"

func_a(input_file=my_input_file)

Or call the function directly with a string value:

func_a(input_file="seq.fa")

However, this code is for you - we will need only the function code! In fact it is better/safer if you remove any code apart from the function code, because any code we won't need for testing will increase the risk that there is something wrong with the code that will stop us from being able to import your functions and test them.

And to make this clear again about the input files: we have our own copies, so you don't need to mail yours to us. In fact, we have multiple test files, each designed to test the different behaviors of each function.


13:54:49 From Sarina Burkard To Everyone: will the input file have more than one sequence like the one on Adam has or do you just test one sequence at a time?

@Sarina: I would expect that our input files (we haven't prepared them yet) will mostly have multiple records. For sure we will test whether your code will be able to handle multiple records. And maybe we will test whether it will also work fine if we pass just one record.
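To illustrate what "handling multiple records" means, here is a minimal sketch of reading FASTA-formatted lines into a dictionary. The function name read_fasta and the choice of a dictionary are assumptions made for this example, not something the assignment prescribes. The same code path works for one record or for many.

```python
def read_fasta(lines):
    """Collect FASTA records from an iterable of lines into a dict.

    Hypothetical helper: keys are the header lines without '>',
    values are the sequences, joined across line breaks.
    """
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


multi = read_fasta([">a", "ACG", "TTT", ">b", "GG"])
single = read_fasta([">only", "ACGT"])
print(multi)   # {'a': 'ACGTTT', 'b': 'GG'}
print(single)  # {'only': 'ACGT'}
```

Because an open file is also an iterable of lines, the same function could be called as read_fasta(open(input_file)) inside a function that receives the filename as an argument.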


13:59:44 From Maren Anheuser: A question to Task 1: The function find_longest_orf returns the length of the orf. Should the start and stop codon be included in the length or is the length only the sequence between start and stop codon?

@Maren: The ORF length should include both the start and stop codon, i.e., the minimum possible length of an ORF is 6 (when a stop codon immediately follows a start codon)! Since we didn't clearly specify that, though, we will accept solutions that do not include the stop codon. However, we don't think there is a good argument for excluding the start codon (it is translated, after all!), so you will definitely have to include that one.
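A small sketch of the length convention, assuming the start codon is ATG and the stop codons are TAA, TAG and TGA. The function orf_length is a hypothetical helper for illustration only; it is not the find_longest_orf function from the assignment.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def orf_length(sequence, start):
    """Length of the ORF that begins at index `start`, or 0 if there is none.

    The returned length includes both the start and the stop codon.
    """
    if sequence[start:start + 3] != "ATG":
        return 0
    # walk through the in-frame codons that follow the start codon
    for i in range(start + 3, len(sequence) - 2, 3):
        if sequence[i:i + 3] in STOP_CODONS:
            return i + 3 - start
    return 0  # no in-frame stop codon found


print(orf_length("ATGTAA", 0))     # 6
print(orf_length("ATGAAATGA", 0))  # 9
```

Under the alternative convention that leaves out the stop codon, the same two sequences would give 3 and 6.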


14:16:13 From Elisabeth Martín: What is meant by the longest open reading frame: If I start reading at my start codon, is the longest open reading frame until I reach the first stop codon? Or is the longest open reading frame until I reach the last stop codon in the gene (even though there were stop codons before that)?

@Elisabeth: The open reading frame (ORF) ends with the first in-frame stop codon. An in-frame codon is a codon that is among the set of consecutive, non-overlapping triplets following a start codon. For example, in the sequence GCCATGTATGATAAGTCTTGAGTTTAA, you have the following reading frames:

1: GCC'ATG'TAT'GAT'AAG'TCT'TGA'GTT'TAA
2: G'CCA'TGT'ATG'ATA'AGT'CTT'GAG'TTT'AA
3: GC'CAT'GTA'TGA'TAA'GTC'TTG'AGT'TTA'A

The first one here is the only one that has an ORF, because it is the only one that has both a start codon (position 4, index 3) and a stop codon in the same frame. In fact it has two stop codons in frame, but as I said, translation will usually stop at the first in-frame stop codon (TGA, position 19, index 18), so it will end there and not continue to the second stop codon (TAA, position 25, index 24).

In the second frame, there is no (complete) ORF because, while there is a start codon (position 8, index 7), there is no in-frame stop codon.

In the third frame, there is also no ORF because, while there are actually two stop codons (TGA, position 9, index 8; TAA, position 12, index 11), there is no in-frame start codon before them.
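The three cases can be reproduced with a short sketch that walks through one frame at a time, using the 27-nucleotide sequence from the frames listed above. The function name first_orf_in_frame and its return format are assumptions made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_orf_in_frame(sequence, frame):
    """Return (start_index, stop_index) of the first ORF in a frame, or None.

    `frame` is 0, 1 or 2. `stop_index` is the index of the first base of
    the first in-frame stop codon after the start codon.
    """
    start = None
    for i in range(frame, len(sequence) - 2, 3):
        codon = sequence[i:i + 3]
        if start is None and codon == "ATG":
            start = i  # remember the first start codon in this frame
        elif start is not None and codon in STOP_CODONS:
            return (start, i)  # stop at the FIRST in-frame stop codon
    return None  # no start codon, or no stop codon after it


seq = "GCCATGTATGATAAGTCTTGAGTTTAA"
for frame in range(3):
    print(frame + 1, first_orf_in_frame(seq, frame))
# 1 (3, 18)
# 2 None
# 3 None
```

Stop codons that appear before any start codon (as in frame 3) are skipped, and the search ends at the first stop codon after the start, so the second stop codon in frame 1 is never reached.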


14:30:56 From sofia ecca: Is it necessary to convert nucleotides into its complementaries during translation? for example U in T, T in U, C in G and G in C ... because this is not indicated in the text but should be part of the synthesis process..

@Sofia: No, we want to only translate the sequence as it is passed to the function. If we had a use case to also (or only) look at the reverse complement, we could write a function that reverse complements the sequence, then call the translate function with the reverse complemented sequence. This is more efficient than implementing every function that deals with DNA sequences to take care of both directions (or add maybe just the complement, or just the reverse?), just because we might need it some day. We want to keep functions as small as possible to do one clearly defined thing well.

Also, note that the complement of both U and T is A (and vice versa). The difference is that U is only used in RNA sequences, while T is used in DNA sequences. So your question implies two different conversions: (1) generating the reverse complement of a sequence (A>T, C>G, G>C, T>A), and (2) converting an RNA to a DNA sequence (U>T). We need neither here, because we are only dealing with DNA sequences. You might be confused by the introduction to the assignment mentioning RNA sequencing. But even in RNA sequencing, after extracting and processing the RNA we want to sequence, we first convert the RNA into cDNA (https://en.wikipedia.org/wiki/Complementary_DNA) and then sequence that cDNA, so what we get out is basically always DNA, even if it represents an actual RNA sequence from inside the cell.
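Just to show how small and separate these two conversions are, here is a minimal sketch. Neither function is needed for the assignment, and the names reverse_complement and rna_to_dna are chosen for this example only.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(dna):
    """Complement every base and reverse the order (DNA input only)."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def rna_to_dna(rna):
    """Replace every U with T; all other bases stay as they are."""
    return rna.replace("U", "T")


print(reverse_complement("ATGC"))  # GCAT
print(rna_to_dna("AUGC"))          # ATGC
```

A function that only handles the sequence as it is passed in can then be combined with these when needed, e.g. by calling it with reverse_complement(sequence) as its input, instead of building both directions into every function.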


14:32:21 From Kai Kurth: you said that we don’t need to define the sequence in our code, because you will use one of your own to test the tasks. Is this the same for the start and the length in task 2 (def translate_dna(sequence, start, length))?

@Kai: Yes. Do not "hardcode" input inside functions, that's precisely what arguments are for. The advantage of functions is that they allow us to abstract a specific behavior and then implement it such that we can apply that behavior to different inputs by passing them to the function as arguments. You lose that advantage if you do not make use of these arguments and instead "hard code" the inputs inside the functions or overwrite the variables that contain the arguments. If you do that to test your function, that's okay in principle, but you need to remember to remove that before submission. Given that there is a risk of forgetting to do that, we strongly recommend against that practice.

Example:

## DO THIS

# define your function with an argument to take your inputs,
# then make use of that argument inside the function code
def my_function(my_sequence):
  # do something with `my_sequence` here
  pass

# to test/use your function, call it _outside_ of the function definition
my_function("ACGT")

#-----

## DON'T DO THIS

# do not just define functions that do not take inputs via arguments
def my_bad_function():
  # do not "hard code" inputs
  my_sequence = "ACGTGCAGT"

#-----

## DON'T DO THIS EITHER

# an argument for passing input to the function is there, but...
def my_naughty_function(my_sequence):
  # do not override the variable that receives that input!!
  my_sequence = "ACGTGCAGT"

The last two functions are really not very useful, as they will always do exactly the same thing - there is no flexibility to run the function (and its behavior) on any other inputs. The last one is even worse: because it defines an argument, you will think that you can run it on a different sequence, but you can't, because whatever value is passed into the function is overridden by the assignment to my_sequence. This is a big source of bugs, especially because the code will likely not raise any errors down the road (your hard-coded input is likely totally valid) - which makes the problem hard to spot!

So, instead, use the first pattern: keep your functions clean of "hard-coded" inputs, then call that function from outside of the scope of the function to test it.
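To see how silently the override goes wrong, here is a small runnable sketch. The two functions (counting G and C bases) are made up for this illustration and are not part of the assignment.

```python
def gc_count(my_sequence):
    """Count G and C bases in whatever sequence is passed in."""
    return my_sequence.count("G") + my_sequence.count("C")


def naughty_gc_count(my_sequence):
    """Same idea, but the argument is overridden by a hard-coded value."""
    my_sequence = "ACGTGCAGT"  # whatever was passed in is lost here
    return my_sequence.count("G") + my_sequence.count("C")


print(gc_count("AAAA"), gc_count("GGCC"))                  # 0 4
print(naughty_gc_count("AAAA"), naughty_gc_count("GGCC"))  # 5 5
```

The second function runs without any error and returns a plausible number, but it is the same number for every input.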
