Update: Added a Python version (pwned-passwords-sqlite-build.py) that preserves counts and uses binary keys rather than text.
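The core of that approach is simple enough to sketch. This is a minimal, illustrative version, not the actual pwned-passwords-sqlite-build.py: it assumes the dump's HASH:COUNT line format and packs each 40-character hex hash into a 20-byte blob.

import sqlite3

# Sketch of a counts-preserving build with binary keys (assumed details, not the real script).
conn = sqlite3.connect("pwned-passwords-by-hash-v4.sqlite")
conn.execute("PRAGMA journal_mode=memory")
conn.execute("CREATE TABLE hashes(hash BLOB PRIMARY KEY, count INTEGER) WITHOUT ROWID")

def rows(path):
    # Each line of the "ordered by hash" dump looks like UPPERHEX_SHA1:COUNT.
    with open(path, "r") as f:
        for line in f:
            h, _, count = line.rstrip("\n").partition(":")
            # 40 hex chars -> 20 raw bytes, halving the key size on disk.
            yield bytes.fromhex(h), int(count)

with conn:
    conn.executemany("INSERT INTO hashes VALUES (?, ?)",
                     rows("pwned-passwords-sha1-ordered-by-hash-v4.txt"))
conn.close()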
Last executed 2019-06-25 with the v4 dump:
- Make sure you have 60 GB free disk space and some extra to spare. Alternatively, take a walk on the wild side and delete source files as soon as you've used them.
- Download the SHA-1 (ordered by hash) torrent from https://haveibeenpwned.com/Passwords
- Unpack and strip off the counts:
7z x -so pwned-passwords-sha1-ordered-by-hash-v4.7z pwned-passwords-sha1-ordered-by-hash-v4.txt | sed 's/:.*//' > hashes.lst
- Load into sqlite:
$ sqlite3 pwned-passwords-by-hash-v4.sqlite
sqlite> pragma journal_mode=memory;
sqlite> CREATE TABLE hashes("hash" BLOB PRIMARY KEY) WITHOUT ROWID;
sqlite> .mode csv
sqlite> .import hashes.lst hashes
- Confirm that queries will use the primary key's index:
sqlite> EXPLAIN QUERY PLAN SELECT * FROM hashes WHERE hash = "D657187D9C9C1AD04FDA5132338D405FDB112FA1" LIMIT 1;
0|0|0|SEARCH TABLE hashes USING PRIMARY KEY (hash=?)
A sample query should return almost instantly:
sqlite> SELECT * FROM hashes WHERE hash = "D657187D9C9C1AD04FDA5132338D405FDB112FA1" LIMIT 1;
D657187D9C9C1AD04FDA5132338D405FDB112FA1
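From an application, the lookup is just a SHA-1 plus an indexed SELECT. A hypothetical helper, assuming the text-key schema built above (uppercase hex, as in the dump):

import hashlib
import sqlite3

conn = sqlite3.connect("pwned-passwords-by-hash-v4.sqlite")

def is_pwned(password):
    # The dump stores uppercase hex SHA-1 digests, so match that format exactly.
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    cur = conn.execute("SELECT 1 FROM hashes WHERE hash = ? LIMIT 1", (digest,))
    return cur.fetchone() is not None

print(is_pwned("password"))  # True: almost certainly present in the dump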
A few notes on construction:
- The rollback journal is kept in memory (pragma journal_mode=memory) rather than on disk to speed up the initial import.
- Making the hash the primary key tells sqlite that the data is unique and ordered—and the primary key data is stored in a B-tree, not duplicated into an additional index.
- We can also avoid storing rowids; no idea how much space this saves us, but no harm here: https://www.sqlite.org/withoutrowid.html
2.24 GB, not bad. Some folks at Threat Stack used a Bloom filter with a 1 in 800 chance of false positives, which took up 1 GB of memory. A 1-in-10-million rate is more to my taste, but 2 GB sits in that borderline space where I'm not really sure whether I'd rather have it in memory or on disk.
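For context, the standard Bloom filter sizing formula, bits per element = -ln(p) / (ln 2)^2, roughly reproduces both numbers if you assume on the order of 550 million hashes in the v4 dump (my assumption, not a figure from the post):

from math import log

def bloom_bytes(n_items, p_false_positive):
    # Optimally sized Bloom filter: m = -n * ln(p) / (ln 2)^2 bits.
    bits = -n_items * log(p_false_positive) / (log(2) ** 2)
    return bits / 8

n = 550_000_000  # assumed approximate size of the v4 dump
print(bloom_bytes(n, 1 / 800) / 2**30)         # ~0.9 GiB, close to Threat Stack's 1 GB
print(bloom_bytes(n, 1 / 10_000_000) / 2**30)  # ~2.1 GiB, comparable to the SQLite file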
Curious that you found the Bloom filter to be slower than SQLite. Threat Stack found the same thing. I find that unexpected since SQLite probably has to go out to disk multiple times to search its index, and the Bloom filter just has to run a bunch of hashes and make random memory accesses.
There might be some differences in what's being measured. If I time the performance of actual user logins, many of those passwords are unfortunately going to be the same as each other, so some parts of the index (or the filter) will already be in the disk (or memory) cache. But if you benchmark with random strings, you'll get a lot of cache misses and the average time will be worse.
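One way to see the cache effect is to time the same lookup two ways: uniformly random hex strings versus a small set of keys queried over and over (a rough proxy for users sharing popular passwords). A hypothetical sketch against the SQLite file built above:

import random
import sqlite3
import time

conn = sqlite3.connect("pwned-passwords-by-hash-v4.sqlite")

def lookup(h):
    return conn.execute("SELECT 1 FROM hashes WHERE hash = ? LIMIT 1", (h,)).fetchone()

def avg_seconds(hashes):
    start = time.perf_counter()
    for h in hashes:
        lookup(h)
    return (time.perf_counter() - start) / len(hashes)

random_hashes = ["%040X" % random.getrandbits(160) for _ in range(10_000)]
repeated_hashes = random.sample(random_hashes, 100) * 100  # same keys repeatedly; pages stay cached

print("random:   %.3f ms/query" % (1000 * avg_seconds(random_hashes)))
print("repeated: %.3f ms/query" % (1000 * avg_seconds(repeated_hashes)))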