
@chadselph
Created April 12, 2012 23:39
gsm encoding for python

(From http://stackoverflow.com/questions/2452861/python-library-for-converting-plain-text-ascii-into-gsm-7-bit-character-set, ran out of space in the comment section.)

Running the original file does not work in either Python 2 or 3.

In Python 2, the program prints this:

64868d8d903a7390938d85

This is wrong: the code treats the index of a character in gsm as its GSM code point, but in Python 2 gsm is a bytestring in which some characters take up multiple bytes, so the indexes do not line up. gsm is actually equal to

'@\xc2\xa3$\xc2\xa5\xc3\xa8\xc3\xa9\xc3\xb9\xc3\xac\xc3\xb2\xc3\x87\n\xc3\x98\xc3\xb8
\r\xc3\x85\xc3\xa5\xce\x94_\xce\xa6\xce\x93\xce\x9b\xce\xa9\xce\xa0\xce\xa8\xce\xa3\xce
\x98\xce\x9e\x1b\xc3\x86\xc3\xa6\xc3\x9f\xc3\x89 !"#\xc2\xa4%&\'()*+,-./0123456789:;<=>?
\xc2\xa1ABCDEFGHIJKLMNOPQRSTUVWXYZ\xc3\x84\xc3\x96\xc3\x91\xc3\x9c`\xc2\xbfabcdefghijklm
nopqrstuvwxyz\xc3\xa4\xc3\xb6\xc3\xb1\xc3\xbc\xc3\xa0'

Notice that non-ASCII characters take at least 2 bytes to encode in UTF-8, so gsm is longer than 128 bytes. gsm.find(c) returns the index of the byte, which is no longer in sync with the GSM code points. For example:

>>> gsm.find('$')  # we might expect this to return 2, the GSM codepoint for '$'
3
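The same mismatch can be reproduced in Python 3 as a rough sketch. It assumes the source file is saved as UTF-8, so that encoding gsm to UTF-8 approximates the bytestring Python 2 holds; the name gsm_bytes is introduced here only for illustration and is not part of the original files.

```python
# Python 3 sketch: character indexes versus UTF-8 byte indexes.
gsm = ("@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞ\x1bÆæßÉ !\"#¤%&'()*+,-./0123456789:;<=>?"
       "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ`¿abcdefghijklmnopqrstuvwxyzäöñüà")

# Assumption: this approximates the Python 2 bytestring for a UTF-8 source file.
gsm_bytes = gsm.encode("utf-8")

print(len(gsm))              # 128 characters, one per GSM code point
print(len(gsm_bytes))        # 166 bytes, because 38 characters need 2 bytes each
print(gsm.find("$"))         # 2, the GSM code point
print(gsm_bytes.find(b"$"))  # 3, a byte offset

# Looking characters up by byte offset reproduces the wrong Python 2 output.
wrong = bytes(gsm_bytes.find(c.encode("utf-8")) for c in "Hello World").hex()
print(wrong)                 # 64868d8d903a7390938d85
```

Every character that follows a multi-byte character is shifted by one more byte, which is why the error grows toward the end of the table.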
# -*- coding: utf8 -*-
# (original file for reference)
gsm = ("@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞ\x1bÆæßÉ !\"#¤%&'()*+,-./0123456789:;<=>?"
"¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ`¿abcdefghijklmnopqrstuvwxyzäöñüà")
ext = ("````````````````````^```````````````````{}`````\\````````````[~]`"
"|````````````````````````````````````€``````````````````````````")
def gsm_encode(plaintext):
    res = ""
    for c in plaintext:
        idx = gsm.find(c);
        if idx != -1:
            res += chr(idx)
            continue
        idx = ext.find(c)
        if idx != -1:
            res += chr(27)
            res += chr(idx)
    return res.encode('hex')

print(gsm_encode("Hello World"))
# -*- coding: utf8 -*-
"""
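The approach mostly works in Python 3, where strings are sequences of characters rather than bytes. However,
gsm_encode needs two changes there: build the result in a bytearray, because chr() no longer yields a byte, and
hex-encode it with binascii, because str.encode('hex') is gone. Like so: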
"""
import binascii
gsm = ("@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞ\x1bÆæßÉ !\"#¤%&'()*+,-./0123456789:;<=>?"
"¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ`¿abcdefghijklmnopqrstuvwxyzäöñüà")
ext = ("````````````````````^```````````````````{}`````\\````````````[~]`"
"|````````````````````````````````````€``````````````````````````")
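def gsm_encode(plaintext):
    res = bytearray()
    for c in plaintext:
        idx = gsm.find(c)
        if idx != -1:
            # Basic table: the character's index is its GSM code point.
            res.append(idx)
            continue
        idx = ext.find(c)
        if idx != -1:
            # Extension table: escape byte (0x1B) followed by the index.
            res.append(27)
            res.append(idx)
        # Characters found in neither table are silently dropped.
    # hexlify() returns bytes; decode so the result prints as plain hex.
    return binascii.hexlify(res).decode('ascii')

print(gsm_encode("Hello World"))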
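The Python 3 version above silently skips any character found in neither table, and it relies on the backtick acting as a filler in ext. As a hedged alternative sketch, the lookups can be done with dictionaries and unencodable characters can raise an error instead. The names GSM_INDEX, EXT_INDEX and gsm_encode_strict are hypothetical and not part of the original gist; the EXT_INDEX values are the same positions the ext string encodes, written out explicitly.

```python
import binascii

gsm = ("@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞ\x1bÆæßÉ !\"#¤%&'()*+,-./0123456789:;<=>?"
       "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ`¿abcdefghijklmnopqrstuvwxyzäöñüà")

# Hypothetical lookup tables (assumption: same mappings as gsm and ext above).
GSM_INDEX = {ch: i for i, ch in enumerate(gsm)}
EXT_INDEX = {"^": 0x14, "{": 0x28, "}": 0x29, "\\": 0x2F, "[": 0x3C,
             "~": 0x3D, "]": 0x3E, "|": 0x40, "€": 0x65}

def gsm_encode_strict(plaintext):
    """Encode to hex like gsm_encode, but raise on unencodable characters."""
    res = bytearray()
    for c in plaintext:
        if c in GSM_INDEX:
            res.append(GSM_INDEX[c])
        elif c in EXT_INDEX:
            res.append(27)            # escape to the extension table
            res.append(EXT_INDEX[c])
        else:
            raise ValueError("not encodable in GSM 7-bit: %r" % c)
    return binascii.hexlify(res).decode("ascii")

print(gsm_encode_strict("Hello World"))  # 48656c6c6f20576f726c64
print(gsm_encode_strict("{€}"))          # 1b281b651b29
```

Whether to raise, drop, or substitute a placeholder such as "?" is a design choice that depends on the caller; raising simply makes the data loss visible.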
# -*- coding: utf8 -*-
"""
For Python2, you can make it work if gsm is a "unicode" instead of a
bytestring. Make this happen by decoding the bytes from utf8.
"""
gsm = ("@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞ\x1bÆæßÉ !\"#¤%&'()*+,-./0123456789:;<=>?"
"¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ`¿abcdefghijklmnopqrstuvwxyzäöñüà").decode('utf8')
ext = ("````````````````````^```````````````````{}`````\\````````````[~]`"
"|````````````````````````````````````€``````````````````````````")
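# ext must be unicode as well: it contains the non-ASCII character '€', so
# ext.find() on the bytestring raises UnicodeDecodeError for unicode input.
ext = ext.decode('utf8')

def gsm_encode(plaintext):
    # plaintext should be a unicode object so that each c is one character.
    res = ""
    for c in plaintext:
        idx = gsm.find(c)
        if idx != -1:
            res += chr(idx)
            continue
        idx = ext.find(c)
        if idx != -1:
            res += chr(27)
            res += chr(idx)
    return res.encode('hex')

print(gsm_encode(u"Hello World"))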