Update: Added a Python version (pwned-passwords-sqlite-build.py
) that preserves counts and uses binary keys rather than text.
Last executed 2019-06-25 with the v4 dump:
- Make sure you have 60 GB free disk space and some extra to spare. Alternatively, take a walk on the wild side and delete source files as soon as you've used them.
- Download the SHA-1 (ordered by hash) torrent from https://haveibeenpwned.com/Passwords
- Unpack and strip off the counts:
7z x -so pwned-passwords-sha1-ordered-by-hash-v4.7z pwned-passwords-sha1-ordered-by-hash-v4.txt | sed 's/:.*//' > hashes.lst
- Load into sqlite:
$ sqlite3 pwned-passwords-by-hash-v4.sqlite sqlite> pragma journal_mode=memory; sqlite> CREATE TABLE hashes("hash" BLOB PRIMARY KEY) WITHOUT ROWID; sqlite> .mode csv sqlite> .import hashes.lst hashes
- Confirm that queries will use the primary key's index:
A sample query should return almost instantly:sqlite> EXPLAIN QUERY PLAN SELECT * FROM hashes WHERE hash = "D657187D9C9C1AD04FDA5132338D405FDB112FA1" LIMIT 1; 0|0|0|SEARCH TABLE hashes USING PRIMARY KEY (hash=?)
sqlite> SELECT * FROM hashes WHERE hash = "D657187D9C9C1AD04FDA5132338D405FDB112FA1" LIMIT 1; D657187D9C9C1AD04FDA5132338D405FDB112FA1
A few notes on construction:
- Journaling is disabled to speed up initial import.
- Making the hash the primary key tells sqlite that the data is unique and ordered—and the primary key data is stored in a B-tree, not duplicated into an additional index.
- We can also avoid storing rowids; no idea how much space this saves us, but no harm here: https://www.sqlite.org/withoutrowid.html
14% size recudction seems to be always applicable. I have now a full pwned db and a 1 million most used passwords db if there is not enough space to use the full db.
The 1 million is about 15 MB and will be part of the docker image (better than 15 gb suppllied with the docker image lol). So in production you can mount the full db to the container.
I made a benchmark. With the 1 million db I can make 100.000 calls in about 550 ms (node16, better-sqlite, exclusive mode as it will only have one connection).
I will test the performance later with the full database.
100.000 calls a second is actually pretty well.