Update: Added a Python version (pwned-passwords-sqlite-build.py
) that preserves counts and uses binary keys rather than text.
Last executed 2019-06-25 with the v4 dump:
- Make sure you have 60 GB free disk space and some extra to spare. Alternatively, take a walk on the wild side and delete source files as soon as you've used them.
- Download the SHA-1 (ordered by hash) torrent from https://haveibeenpwned.com/Passwords
- Unpack and strip off the counts:
7z x -so pwned-passwords-sha1-ordered-by-hash-v4.7z pwned-passwords-sha1-ordered-by-hash-v4.txt | sed 's/:.*//' > hashes.lst
- Load into sqlite:
$ sqlite3 pwned-passwords-by-hash-v4.sqlite sqlite> pragma journal_mode=memory; sqlite> CREATE TABLE hashes("hash" BLOB PRIMARY KEY) WITHOUT ROWID; sqlite> .mode csv sqlite> .import hashes.lst hashes
- Confirm that queries will use the primary key's index:
A sample query should return almost instantly:sqlite> EXPLAIN QUERY PLAN SELECT * FROM hashes WHERE hash = "D657187D9C9C1AD04FDA5132338D405FDB112FA1" LIMIT 1; 0|0|0|SEARCH TABLE hashes USING PRIMARY KEY (hash=?)
sqlite> SELECT * FROM hashes WHERE hash = "D657187D9C9C1AD04FDA5132338D405FDB112FA1" LIMIT 1; D657187D9C9C1AD04FDA5132338D405FDB112FA1
A few notes on construction:
- Journaling is disabled to speed up initial import.
- Making the hash the primary key tells sqlite that the data is unique and ordered—and the primary key data is stored in a B-tree, not duplicated into an additional index.
- We can also avoid storing rowids; no idea how much space this saves us, but no harm here: https://www.sqlite.org/withoutrowid.html
No worries! The original version of the script did indeed store hex strings because I was too lazy to write the Python script and just wanted to load directly from CSV. :-)
Good to know about using vacuum. I had idly wondered if there was any benefit to be had from a post-build cleanup, since the incremental build might not result in the most compact structure, but I wasn't sure what that would entail—and it sounds like this is exactly that. From the SQLite docs:
The B-tree probably is managed with a lot of partial pages to improve insertion time, at the cost of space, and a post-build pass would then perform compaction. A 14% size reduction is well worth the extra build time! I wonder if it also improves access time at all.