@audy
Created December 4, 2019 05:39
A one-hot-encoder for DNA sequences that I find myself writing repeatedly
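def one_hot_encode(sequence: str, alphabet=["G", "A", "T", "C"]):
    """
    One-hot encode a string using a pre-defined alphabet.

    Returns a flat list of len(sequence) * len(alphabet) ints. Raises
    ValueError if a character is not in the alphabet.

    >>> one_hot_encode("GATC")
    [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
    """
    # One block of len(alphabet) zeros per character, flattened into one list.
    vector = [0] * (len(alphabet) * len(sequence))
    for n, character in enumerate(sequence):
        # Set the slot for this character inside the n-th block.
        vector[alphabet.index(character) + (n * len(alphabet))] = 1
    return vector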
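
The flat layout above (one block of `len(alphabet)` slots per character) can be inverted by walking the vector block by block. The sketch below is not part of the original gist: `one_hot_decode` is a hypothetical helper name, and it assumes the same default alphabet order ("G", "A", "T", "C") as the encoder.

```python
def one_hot_decode(vector, alphabet=("G", "A", "T", "C")):
    """Invert a flat one-hot vector back into a string.

    Hypothetical helper, not part of the original gist. Assumes the
    vector was built with the same alphabet, in the same order.
    """
    width = len(alphabet)
    if len(vector) % width != 0:
        raise ValueError("vector length must be a multiple of the alphabet size")
    characters = []
    for start in range(0, len(vector), width):
        # Each block of `width` slots encodes exactly one character.
        block = list(vector[start:start + width])
        if block.count(1) != 1 or block.count(0) != width - 1:
            raise ValueError(f"block at offset {start} is not one-hot: {block}")
        characters.append(alphabet[block.index(1)])
    return "".join(characters)
```

Under these assumptions, decoding the vector shown in the docstring above gives back the original sequence, which makes a convenient round-trip check for the encoder.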