-
-
Save rspeer/7559750 to your computer and use it in GitHub Desktop.
""" | |
This file contains code that, when run on Python 2.7.5 or earlier, creates | |
a string that should not exist: u'\Udeadbeef'. That's a single "character" | |
that's illegal in Python because it's outside the valid Unicode range. | |
It then uses it to crash various things in the Python standard library and | |
corrupt a database. | |
On Python 3... well, this file is full of syntax errors on Python 3. But | |
if you were to change the print statements and byte literals and stuff: | |
* You'd probably see the same bug on Python 3.2. | |
* On Python 3.3, you'd just get an error making the string on the first line. | |
* On Python 3.3.3, the error even makes sense. | |
On narrow builds of Python, u'\Udeadbeef' gets immediately truncated to | |
u'\ubeef', a totally safe character. (It's a nonsense syllable in | |
Korean.) For once, narrow Python's half-assed Unicode support has saved you. | |
The relevant bug is: http://bugs.python.org/issue19279 | |
""" | |
# Use a bug in the UTF-7 decoder to create a string containing codepoint | |
# U+DEADBEEF. (Keep in mind that Unicode ends at U+10FFFF.) | |
deadbeef = '+d,+6t,+vu8-'.decode('utf-7', 'replace')[-1] | |
print repr(deadbeef) | |
# outputs u'\Udeadbeef'. That's not a valid string literal. | |
import codecs | |
with codecs.open('deadbeef.txt', 'w', encoding='utf-8') as outfile: | |
print >> outfile, deadbeef | |
# writes a non-UTF-8 file | |
try: | |
with codecs.open('deadbeef.txt', encoding='utf-8') as infile: | |
print infile.read() | |
except UnicodeDecodeError: | |
print "Boom! Broke your text file." | |
import re | |
try: | |
re.match(u'[A-%s]' % deadbeef, u'test') | |
except MemoryError: | |
print "Boom! Broke your regular expression." | |
import sqlite3 | |
db = sqlite3.connect('deadbeef.db') | |
db.execute(u'CREATE TABLE deadbeef (id integer primary key, value text)') | |
db.execute(u'INSERT INTO deadbeef (value) VALUES (?)', u'\U0001f602') | |
db.execute(u'SELECT * FROM deadbeef').fetchall() | |
# This works fine. I'm just convincing you that SQLite has no problem with | |
# Unicode itself. | |
db.execute(u'INSERT INTO deadbeef (value) VALUES (?)', deadbeef) | |
try: | |
db.execute(u'SELECT * FROM deadbeef').fetchall() | |
except sqlite3.OperationalError: | |
print "Boom! Corrupted your database." | |
# As a bonus, if you run that SQLite query at the IPython prompt, it gets | |
# a second error trying to print out the error message. |
Feature request. Print statement messages should be in lolcatz form.
Boom! I haz broke ur regular expressions!
What's a "narrow build"?
Python 2.7.3 getting this:
: File name too long
./deadbeef_character.py: line 25: syntax error near unexpected token `('
./deadbeef_character.py: line 25: `deadbeef = '+d,+6t,+vu8-'.decode('utf-7', 'replace')[-1]'
On gentoo 2.7.5 it doesn't work (u'\U1eadbeef'), maybe they patched it already?
Has anyone tried MySQL or PostgreSQL? This example only crashed SQLite though, which is super lightweight may not handle unicode errors robustly.
OSX - works fine
Ubuntu 12.04.2 LTS - Boom boom boom
FreeBSD 9.2 - yeah... locks up until the process is killed.
Linux Mint 16 64bit, starting the script with Sublime or Spyder => system crash or at least freezes (too unpatient to wait much longer)
@peterbe: it's a Python compile flag which controls whether Unicode support includes only the Basic Multilingual Plane or the full range of Unicode characters (i.e. does it end at 0x10000 or 0x10FFFF). See http://www.python.org/dev/peps/pep-0261/
This used to only be of interest to those of us working with relatively obscure multilingual content but has become a lot more important for most people now that things outside the BMP like Emoji have become very common. It means that len()
won't work as expected on those characters in most Python 2.x builds. Try running https://github.com/acdha/unix_tools/blob/master/bin/unicode-characters.py under both Python 2 and 3 if you're severely bored.
Slightly simpler way to get a hold of deadbeef:'+d,+6t,+vu8-'.decode('utf-7', 'ignore')
It works fine on OSX only because OSX's default Python is a narrow build. (Kind of disappointing for an OS with otherwise good support for lots of characters, including emoji.) The character just ends up being '\ubeef'.
Might be related: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=469644
I love it. So simple yet so effective.