Elia Robyn Lake (Robyn Speer) rspeer

/* This Rust code scans through the Common Crawl, looking for text that's
* not English. I suspect I may learn much later that it's terrible,
* unidiomatic Rust, but it would take me months to learn what good Rust is.
*
* We depend on some external libraries:
*
* - html5ever: an HTML parser (we only use its low-level tokenizer)
* - encoding: handles text in all the encodings that WHATWG recognizes
* - string_cache: interns a bunch of frequently-used strings, like tag names -- necessary to use
* the html5ever tokenizer
"""
This file contains code that, when run on Python 2.7.5 or earlier, creates
a string that should not exist: u'\Udeadbeef'. That's a single "character"
that's illegal in Python because it's outside the valid Unicode range.
It then uses it to crash various things in the Python standard library and
corrupt a database.
On Python 3... well, this file is full of syntax errors on Python 3. But
if you were to change the print statements and byte literals and stuff:
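This is not the gist's own continuation, which the preview does not show. It is a minimal sketch, assuming Python 3.3 or later, of the "valid Unicode range" the docstring refers to. The helper names (`is_valid_code_point`, `try_chr`, `literal_compiles`) are hypothetical and exist only for illustration.

```python
import sys

# On Python 3.3+ this is always 0x10FFFF, the top of the Unicode code space.
MAX_CODE_POINT = sys.maxunicode


def is_valid_code_point(n):
    """Return True if n lies inside the Unicode code space."""
    return 0 <= n <= MAX_CODE_POINT


def try_chr(n):
    """Return chr(n), or None if Python refuses to build the character."""
    try:
        return chr(n)
    except (ValueError, OverflowError):
        # Out-of-range values are rejected; which exception is raised
        # can depend on how large the value is, so both are caught here.
        return None


def literal_compiles(source):
    """Return True if the given string-literal source compiles."""
    try:
        compile(source, "<sketch>", "eval")
        return True
    except SyntaxError:
        # An escape such as \Udeadbeef is rejected at compile time.
        return False


if __name__ == "__main__":
    print(is_valid_code_point(0xDEADBEEF))
    print(try_chr(0xDEADBEEF))
    print(literal_compiles(r"u'\Udeadbeef'"))
```

Under these assumptions, the out-of-range value never becomes a string object on Python 3: the literal is rejected at compile time and `chr` refuses the value at run time.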