Update: Added a Python version (pwned-passwords-sqlite-build.py
) that preserves counts and uses binary keys rather than text.
Last executed 2019-06-25 with the v4 dump:
- Make sure you have 60 GB free disk space and some extra to spare. Alternatively, take a walk on the wild side and delete source files as soon as you've used them.
- Download the SHA-1 (ordered by hash) torrent from https://haveibeenpwned.com/Passwords
- Unpack and strip off the counts:
7z x -so pwned-passwords-sha1-ordered-by-hash-v4.7z pwned-passwords-sha1-ordered-by-hash-v4.txt | sed 's/:.*//' > hashes.lst
- Load into sqlite:
$ sqlite3 pwned-passwords-by-hash-v4.sqlite sqlite> pragma journal_mode=memory; sqlite> CREATE TABLE hashes("hash" BLOB PRIMARY KEY) WITHOUT ROWID; sqlite> .mode csv sqlite> .import hashes.lst hashes
- Confirm that queries will use the primary key's index:
A sample query should return almost instantly:sqlite> EXPLAIN QUERY PLAN SELECT * FROM hashes WHERE hash = "D657187D9C9C1AD04FDA5132338D405FDB112FA1" LIMIT 1; 0|0|0|SEARCH TABLE hashes USING PRIMARY KEY (hash=?)
sqlite> SELECT * FROM hashes WHERE hash = "D657187D9C9C1AD04FDA5132338D405FDB112FA1" LIMIT 1; D657187D9C9C1AD04FDA5132338D405FDB112FA1
A few notes on construction:
- Journaling is disabled to speed up initial import.
- Making the hash the primary key tells sqlite that the data is unique and ordered—and the primary key data is stored in a B-tree, not duplicated into an additional index.
- We can also avoid storing rowids; no idea how much space this saves us, but no harm here: https://www.sqlite.org/withoutrowid.html
I wanted to know how much of each hash would be needed to uniquely identify any compromised password's hash among the full set (if it was known to be compromised, so I wrote a little program to check what the longest shared prefix is between any two consecutive hashes:
I ran
p7zip -d --stdout ~/tmp/pwned-passwords-sha1-ordered-by-hash-v5.7z | python3 ~/adhoc/pwned.py
and it turns out the largest shared prefix was 14 hex characters out of 40 (7 bytes out of 20). If anyone's curious, here's how long it takes to reach each record length prefix:I think this means that if you truncated hashes to 10 bytes, there would be a 1 in 10 million chance of a random password colliding with a pwned hash:
log10(256 ^ (10 - 7)) = 7.22...
.If you have a billion user registrations, on average you might expect 100 truly random passwords to fail the test, and the user would have to generate a new one. That seems acceptable, and would result in a 2x reduction in DB size.
(Someone check my math?)