Update: Added a Python version (pwned-passwords-sqlite-build.py
) that preserves counts and uses binary keys rather than text.
Last executed 2019-06-25 with the v4 dump:
- Make sure you have 60 GB free disk space and some extra to spare. Alternatively, take a walk on the wild side and delete source files as soon as you've used them.
- Download the SHA-1 (ordered by hash) torrent from https://haveibeenpwned.com/Passwords
- Unpack and strip off the counts:
7z x -so pwned-passwords-sha1-ordered-by-hash-v4.7z pwned-passwords-sha1-ordered-by-hash-v4.txt | sed 's/:.*//' > hashes.lst
- Load into sqlite:
$ sqlite3 pwned-passwords-by-hash-v4.sqlite sqlite> pragma journal_mode=memory; sqlite> CREATE TABLE hashes("hash" BLOB PRIMARY KEY) WITHOUT ROWID; sqlite> .mode csv sqlite> .import hashes.lst hashes
- Confirm that queries will use the primary key's index:
A sample query should return almost instantly:sqlite> EXPLAIN QUERY PLAN SELECT * FROM hashes WHERE hash = "D657187D9C9C1AD04FDA5132338D405FDB112FA1" LIMIT 1; 0|0|0|SEARCH TABLE hashes USING PRIMARY KEY (hash=?)
sqlite> SELECT * FROM hashes WHERE hash = "D657187D9C9C1AD04FDA5132338D405FDB112FA1" LIMIT 1; D657187D9C9C1AD04FDA5132338D405FDB112FA1
A few notes on construction:
- Journaling is disabled to speed up initial import.
- Making the hash the primary key tells sqlite that the data is unique and ordered—and the primary key data is stored in a B-tree, not duplicated into an additional index.
- We can also avoid storing rowids; no idea how much space this saves us, but no harm here: https://www.sqlite.org/withoutrowid.html
For me it was a big bloom filter, so it was on disk. I think multiple hashing + disk access was more expensive than 2 disk accesses (my sqlite table also had
PRIMARY KEY
+WITHOUT ROWID
like in this gist).