@rspeer
Last active June 12, 2023 12:24
"""
This file contains code that, when run on Python 2.7.5 or earlier, creates
a string that should not exist: u'\Udeadbeef'. That's a single "character"
that's illegal in Python because it's outside the valid Unicode range.

It then uses it to crash various things in the Python standard library and
corrupt a database.

On Python 3... well, this file is full of syntax errors on Python 3. But
if you were to change the print statements and byte literals and stuff:

* You'd probably see the same bug on Python 3.2.
* On Python 3.3, you'd just get an error making the string on the first line.
* On Python 3.3.3, the error even makes sense.

On narrow builds of Python, u'\Udeadbeef' gets immediately truncated to
u'\ubeef', a totally safe character. (It's a nonsense syllable in
Korean.) For once, narrow Python's half-assed Unicode support has saved you.

The relevant bug is: http://bugs.python.org/issue19279
"""
# Use a bug in the UTF-7 decoder to create a string containing codepoint
# U+DEADBEEF. (Keep in mind that Unicode ends at U+10FFFF.)
deadbeef = '+d,+6t,+vu8-'.decode('utf-7', 'replace')[-1]
print repr(deadbeef)
# outputs u'\Udeadbeef'. That's not a valid string literal.
import codecs
with codecs.open('deadbeef.txt', 'w', encoding='utf-8') as outfile:
    print >> outfile, deadbeef
# writes a non-UTF-8 file
try:
    with codecs.open('deadbeef.txt', encoding='utf-8') as infile:
        print infile.read()
except UnicodeDecodeError:
    print "Boom! Broke your text file."
import re
try:
    re.match(u'[A-%s]' % deadbeef, u'test')
except MemoryError:
    print "Boom! Broke your regular expression."
import sqlite3
db = sqlite3.connect('deadbeef.db')
db.execute(u'CREATE TABLE deadbeef (id integer primary key, value text)')
db.execute(u'INSERT INTO deadbeef (value) VALUES (?)', (u'\U0001f602',))
db.execute(u'SELECT * FROM deadbeef').fetchall()
# This works fine. I'm just convincing you that SQLite has no problem with
# Unicode itself.
db.execute(u'INSERT INTO deadbeef (value) VALUES (?)', (deadbeef,))
try:
    db.execute(u'SELECT * FROM deadbeef').fetchall()
except sqlite3.OperationalError:
    print "Boom! Corrupted your database."
# As a bonus, if you run that SQLite query at the IPython prompt, it gets
# a second error trying to print out the error message.
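
For readers on Python 3, the range limit mentioned in the comments above (Unicode ends at U+10FFFF) can be checked directly. This is a minimal sketch, not part of the original gist: `is_valid_code_point` and `try_chr` are hypothetical helper names, and the code assumes Python 3, where `chr` refuses out-of-range values.

```python
MAX_CODE_POINT = 0x10FFFF  # last code point defined by Unicode


def is_valid_code_point(value):
    """Return True if value lies inside the Unicode range (hypothetical helper)."""
    return 0 <= value <= MAX_CODE_POINT


def try_chr(value):
    """Return chr(value), or None if Python refuses to build the character."""
    try:
        return chr(value)
    except (ValueError, OverflowError):
        # Which exception is raised can depend on the value and the
        # interpreter version, so both are caught here.
        return None


print(is_valid_code_point(0xDEADBEEF))  # False
print(try_chr(0xDEADBEEF))              # None
print(repr(try_chr(0x1F602)))           # the emoji used in the SQLite example
```

The check is deliberately separate from `chr` so that it can be applied to raw integers before any string is built.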
@mdesantis

I love it. So simple yet so effective.

@mcormier

Feature request. Print statement messages should be in lolcatz form.

Boom! I haz broke ur regular expressions!

@peterbe

peterbe commented Nov 21, 2013

What's a "narrow build"?

@theonewolf

Python 2.7.3 getting this:

: File name too long
./deadbeef_character.py: line 25: syntax error near unexpected token `('
./deadbeef_character.py: line 25: `deadbeef = '+d,+6t,+vu8-'.decode('utf-7', 'replace')[-1]'

@jandk

jandk commented Nov 21, 2013

On Gentoo with 2.7.5 it doesn't work (u'\U1eadbeef'). Maybe they patched it already?

@ye

ye commented Nov 21, 2013

Has anyone tried MySQL or PostgreSQL? This example only crashed SQLite, which is super lightweight and may not handle Unicode errors robustly.

@leepa

leepa commented Nov 21, 2013

OSX - works fine
Ubuntu 12.04.2 LTS - Boom boom boom
FreeBSD 9.2 - yeah... locks up until the process is killed.

@deedeethepinhead

Linux Mint 16 64-bit: starting the script with Sublime or Spyder causes a system crash, or at least a freeze (I was too impatient to wait much longer).

@acdha

acdha commented Nov 21, 2013

@peterbe: it's a Python compile flag which controls whether Unicode support includes only the Basic Multilingual Plane or the full range of Unicode characters (i.e. does it end at 0xFFFF or 0x10FFFF). See http://www.python.org/dev/peps/pep-0261/

This used to only be of interest to those of us working with relatively obscure multilingual content but has become a lot more important for most people now that things outside the BMP like Emoji have become very common. It means that len() won't work as expected on those characters in most Python 2.x builds. Try running https://github.com/acdha/unix_tools/blob/master/bin/unicode-characters.py under both Python 2 and 3 if you're severely bored.
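
A quick way to see which kind of build is running, and what that does to `len()`, is sketched below. The helper names `build_kind` and `utf16_units` are made up for illustration. The snippet is written for Python 3, where every build reports the full range, but `sys.maxunicode` is the same attribute a Python 2 interpreter exposes.

```python
import sys


def build_kind():
    """Classify the running interpreter as a 'narrow' or 'wide' Unicode build."""
    # Narrow builds report sys.maxunicode == 0xFFFF; wide builds and
    # Python 3.3+ report 0x10FFFF.
    return "wide" if sys.maxunicode > 0xFFFF else "narrow"


def utf16_units(text):
    """Count UTF-16 code units, which is what len() reports on a narrow build."""
    return len(text.encode("utf-16-le")) // 2


emoji = u"\U0001f602"  # a character outside the BMP
print("build: %s" % build_kind())
print("len(): %d" % len(emoji))
print("UTF-16 code units: %d" % utf16_units(emoji))
```

On a narrow build the last two numbers agree, because the character is stored as a surrogate pair. On a wide build, `len()` counts it as one character.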

@gsakkis

gsakkis commented Nov 21, 2013

Slightly simpler way to get a hold of deadbeef: '+d,+6t,+vu8-'.decode('utf-7', 'ignore')

@rspeer
Author

rspeer commented Nov 23, 2013

It works fine on OSX only because OSX's default Python is a narrow build. (Kind of disappointing for an OS with otherwise good support for lots of characters, including emoji.) The character just ends up being '\ubeef'.
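
The arithmetic behind that result can be sketched in a few lines. This assumes the truncation amounts to keeping the low 16 bits of the value, which is a simplified model of what a narrow build stores, not a trace of the interpreter's actual code path. It is written for Python 3.

```python
import unicodedata

bogus = 0xDEADBEEF            # the out-of-range value from the gist
truncated = bogus & 0xFFFF    # assumption: only the low 16 bits survive

print(hex(truncated))                    # 0xbeef
print(unicodedata.name(chr(truncated)))  # prints a HANGUL SYLLABLE name
```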
