/* This Rust code scans through the Common Crawl, looking for text that's
 * not English. I suspect I may learn much later that it's terrible,
 * unidiomatic Rust, but it would take me months to learn what good Rust is.
 *
 * We depend on some external libraries:
 *
 * - html5ever: an HTML parser (we only use its low-level tokenizer)
 * - encoding: handles text in all the encodings that WHATWG recognizes
 * - string_cache: interns a bunch of frequently-used strings, like tag names -- necessary to use
 *   the html5ever tokenizer
"""
This file contains code that, when run on Python 2.7.5 or earlier, creates
a string that should not exist: u'\Udeadbeef'. That's a single "character"
that's illegal in Python because it's outside the valid Unicode range.
It then uses it to crash various things in the Python standard library and
corrupt a database.
On Python 3... well, this file is full of syntax errors on Python 3. But
if you were to change the print statements and byte literals and stuff: