Skip to content

Instantly share code, notes, and snippets.

@tgs
Last active April 19, 2023 16:06
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save tgs/3f1ebeae3adc29d760cbc19c37d2afe8 to your computer and use it in GitHub Desktop.
Save tgs/3f1ebeae3adc29d760cbc19c37d2afe8 to your computer and use it in GitHub Desktop.
Translate NCBI2na data from hexadecimal to standard bases
#!/usr/bin/env python
"""
Translate compact hex sequence representation to standard bases.
If you have NCBI ASN.1 data, the easiest way to get a FASTA version is to
use NCBI's asn2fasta tool, which is available here:
https://ftp.ncbi.nlm.nih.gov/asn1-converters/by_program/asn2fasta/
NCBI2na is a compact representation of sequence data, where each base just
takes 2 bits. It's commonly found in text ASN.1 files as hexadecimal strings.
In binary ASN.1, it is found as binary data - this script does not handle that.
In a string of hexadecimal digits, each digit represents two nucleotide bases.
And the hex string represents actual characters, so only even numbers of hex
digits are valid. So for any given hex string, there are four different
numbers of nucleotides it could represent. This means you need to provide the
length of the nucleotide sequence you expect.
>>> import ncbi2na
>>> ncbi2na.decode_ncbi2na(20, u'0123456789')
'AAACAGATCACCCGCTGAGC'
>>> ncbi2na.decode_ncbi2na(17, u'0123456789')
'AAACAGATCACCCGCTG'
Since we have the expected length, we can sanity-check it against the input
string:
>>> ncbi2na.decode_ncbi2na(16, u'0123456789')
Traceback (most recent call last):
...
ValueError: Found more nuc bases than expected
>>> ncbi2na.decode_ncbi2na(21, u'0123456789')
Traceback (most recent call last):
...
ValueError: Found fewer nuc bases than expected
This should work from Python 2 or 3, but the hexadecimal string
must be unicode because the `translate` method works very differently
in byte strings.
"""
from itertools import product
__all__ = ['decode_ncbi2na']
_nucs = u'ACGT'
_replacements = list(map(u''.join, product(_nucs, _nucs)))
_hex2nuc = dict(zip(map(ord, u'0123456789ABCDEF'), _replacements))
if bytes is str: # python2
_text_type = unicode
else:
_text_type = str
def decode_ncbi2na(length, hex_str):
if not isinstance(hex_str, _text_type):
raise TypeError("hex_str must be unicode/text type")
seq_untrimmed = hex_str.translate(_hex2nuc)
raw_len = len(seq_untrimmed)
if length > raw_len:
raise ValueError("Found fewer nuc bases than expected")
if length < raw_len - 3:
raise ValueError("Found more nuc bases than expected")
return seq_untrimmed[:length]
@MrTomRod
Copy link

Hey Thomas

Could you briefly explain what is missing and how to fix it?

Best, Thomas

@tgs
Copy link
Author

tgs commented Apr 18, 2023

@MrTomRod - I've added support for setting the length of the nucleotide sequence, so that sequences can be extracted correctly. Thanks for asking, hope it helps - let me know if there are more issues

@tgs
Copy link
Author

tgs commented Apr 19, 2023

If you have ncbi2na data, it's likely to be in asn.1, right? My actual suggestion is to use asn2fasta from https://ftp.ncbi.nlm.nih.gov/asn1-converters/by_program/asn2fasta/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment