@timmc
Last active October 4, 2024 14:00
Building a sqlite DB for the Pwned Passwords data
#!/usr/bin/env python3
# Build a SQLite3 DB for looking up SHA-1 hashes of leaked passwords.
#
# This can be fed the txt file from one of Have I Been Pwned's hash
# lists available from https://haveibeenpwned.com/Passwords -- but any
# text file with line format ``hash-hex:count`` will work.
#
# When run on the v5 hash-ordered SHA-1 file, expect the build to take
# about 35 minutes and produce a 15.7 GiB file (~30.5 bytes per record).
#
# This example shows querying the resulting database for the
# vulnerable password "qwerty123", and finding that it was present
# with a count of 621679:
#
# >>> import hashlib
# >>> import sqlite3
# >>> conn = sqlite3.connect("pwned-passwords-sha1-with-counts-v5.sqlite")
# >>> hash_bytes = hashlib.sha1('qwerty123'.encode()).digest()
# >>> conn.execute("SELECT * FROM hashes WHERE hash = :sha1 LIMIT 1", {'sha1': hash_bytes}).fetchone()
# (b'\\\xec\x17[\x16^=^b\xc9\xe1<\xe8H\xefo\xea\xc8\x1b\xff', 621679)

import sqlite3
import sys


def record_generator(in_path):
    """Yield (sha1_bytes, count) tuples from a hash-hex:count text file."""
    with open(in_path) as hashes:
        for line in hashes:
            (sha1_hex, count_str) = line.split(':', 1)
            sha1_bytes = bytes.fromhex(sha1_hex)
            count = int(count_str)
            yield (sha1_bytes, count)


def build(in_path, out_path):
    with sqlite3.connect(out_path) as conn:
        conn.execute('pragma journal_mode=memory')
        conn.execute('CREATE TABLE hashes("hash" BLOB PRIMARY KEY, "count" INT) WITHOUT ROWID')
        conn.executemany(
            'INSERT INTO hashes(hash, count) VALUES (?, ?)',
            record_generator(in_path))
        conn.commit()


def main(*args):
    if len(args) != 2:
        print("Usage: build.py <hash-and-count-input.txt> <output.sqlite>")
        sys.exit(1)
    build(*args)


if __name__ == '__main__':
    main(*sys.argv[1:])

Update: Added a Python version (pwned-passwords-sqlite-build.py) that preserves counts and uses binary keys rather than text.

Last executed 2019-06-25 with the v4 dump:

  1. Make sure you have 60 GB free disk space and some extra to spare. Alternatively, take a walk on the wild side and delete source files as soon as you've used them.
  2. Download the SHA-1 (ordered by hash) torrent from https://haveibeenpwned.com/Passwords
  3. Unpack and strip off the counts:
    7z x -so pwned-passwords-sha1-ordered-by-hash-v4.7z pwned-passwords-sha1-ordered-by-hash-v4.txt | sed 's/:.*//' > hashes.lst
    
  4. Load into sqlite:
    $ sqlite3 pwned-passwords-by-hash-v4.sqlite
    sqlite> pragma journal_mode=memory;
    sqlite> CREATE TABLE hashes("hash" BLOB PRIMARY KEY) WITHOUT ROWID;
    sqlite> .mode csv
    sqlite> .import hashes.lst hashes
    
  5. Confirm that queries will use the primary key's index:
    sqlite> EXPLAIN QUERY PLAN SELECT * FROM hashes WHERE hash = "D657187D9C9C1AD04FDA5132338D405FDB112FA1" LIMIT 1;
    0|0|0|SEARCH TABLE hashes USING PRIMARY KEY (hash=?)
    
    A sample query should return almost instantly (a Python equivalent of this lookup is sketched after these steps):
    sqlite> SELECT * FROM hashes WHERE hash = "D657187D9C9C1AD04FDA5132338D405FDB112FA1" LIMIT 1;
    D657187D9C9C1AD04FDA5132338D405FDB112FA1
    
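For reference, here is a minimal Python sketch of the same lookup against the v4 database built above (file and table names are taken from step 4; the dump's hex is uppercase, so the digest is uppercased to match):

import hashlib
import sqlite3

conn = sqlite3.connect("pwned-passwords-by-hash-v4.sqlite")
# The v4 build stores each hash as an uppercase 40-character hex string.
sha1_hex = hashlib.sha1("qwerty123".encode()).hexdigest().upper()
row = conn.execute(
    "SELECT 1 FROM hashes WHERE hash = ? LIMIT 1", (sha1_hex,)).fetchone()
print("pwned!" if row else "not found")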

A few notes on construction:

  • Journaling is disabled to speed up initial import.
  • Making the hash the primary key tells sqlite that the data is unique and ordered—and the primary key data is stored in a B-tree, not duplicated into an additional index.
  • We can also avoid storing rowids; no idea how much space this saves us, but no harm here (a sketch for measuring it follows these notes): https://www.sqlite.org/withoutrowid.html
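
If you're curious how much space WITHOUT ROWID actually saves, one way to measure it is to build two small sample databases, one with and one without rowids, and compare their page counts. A sketch (the sample size and file names here are made up for illustration):

import os
import sqlite3

def db_size_bytes(path):
    conn = sqlite3.connect(path)
    (pages,) = conn.execute("PRAGMA page_count").fetchone()
    (page_size,) = conn.execute("PRAGMA page_size").fetchone()
    conn.close()
    return pages * page_size

def build_sample(path, rows, without_rowid):
    if os.path.exists(path):
        os.remove(path)
    conn = sqlite3.connect(path)
    suffix = " WITHOUT ROWID" if without_rowid else ""
    conn.execute('CREATE TABLE hashes("hash" BLOB PRIMARY KEY)' + suffix)
    conn.executemany("INSERT INTO hashes(hash) VALUES (?)", ((r,) for r in rows))
    conn.commit()
    conn.close()

# One million random 20-byte values stand in for real SHA-1 hashes.
rows = [os.urandom(20) for _ in range(1_000_000)]
build_sample("sample-without-rowid.sqlite", rows, True)
build_sample("sample-with-rowid.sqlite", rows, False)
print(db_size_bytes("sample-without-rowid.sqlite"),
      db_size_bytes("sample-with-rowid.sqlite"))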
@assafmo commented Mar 2, 2020

Is comparing BLOB faster than TEXT?

@assafmo commented Mar 2, 2020

You can also watch the .import progress with pv -d $(pidof sqlite3) in another terminal.

@timmc (author) commented Mar 3, 2020

Dunno! I'd love to see someone do the experiment, although the query time is already so low that I'm not sure it's worth it. But the best improvement would be to use binary instead of hex to get the size down.

You could probably also keep the frequency counts in there without too much trouble, if that was desirable.
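
As a quick illustration of the binary-vs-hex point (the updated script at the top already does this): storing the raw digest instead of the hex string halves every key.

import hashlib

sha1_hex = hashlib.sha1("qwerty123".encode()).hexdigest()  # 40 hex characters
sha1_bin = bytes.fromhex(sha1_hex)                         # the same digest as 20 raw bytes
print(len(sha1_hex), len(sha1_bin))                        # 40 20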

@timmc-edx

I also ran this without the count field, and it only saves about 5% of the space—the counts are each stored in a little less than 2 bytes, on average.

@timmc (author) commented Aug 5, 2020

I wanted to know how much of each hash would be needed to uniquely identify any compromised password's hash among the full set (assuming it was known to be compromised), so I wrote a little program to check the longest shared prefix between any two consecutive hashes:

import sys

def shared_prefix_len(s1, s2):
    """Compute length of shared prefix between two strings"""
    cnt = 0
    for (c1, c2) in zip(s1, s2):
        if c1 == c2:
            cnt += 1
        else:
            break
    return cnt

def find_longest_prefix_pair():
    """
    Find longest prefix between consecutive lines, with progress reporting.
    """
    num_seen = 0
    longest_yet = 0
    last_line = None
    for line in sys.stdin:
        if last_line is not None:
            prefix_length = shared_prefix_len(line, last_line)
            if prefix_length > longest_yet:
                print("New record: %s (%s chars) after %s lines" %
                      (line[:prefix_length], prefix_length, num_seen))
                longest_yet = prefix_length
        last_line = line
        num_seen += 1
        if num_seen % 1000000 == 0:
            print("Seen %s lines" % num_seen)


if __name__ == '__main__':
    find_longest_prefix_pair()

I ran p7zip -d --stdout ~/tmp/pwned-passwords-sha1-ordered-by-hash-v5.7z | python3 ~/adhoc/pwned.py and it turns out the largest shared prefix was 14 hex characters out of 40 (7 bytes out of 20). If anyone's curious, here's how many lines it took before each record-setting prefix length was found:

prefix length   lines searched before finding
8               1
9               58
11              1358
12              1763751
13              19515399
14              129347251

I think this means that if you truncated hashes to 10 bytes, there would be a 1 in 10 million chance of a random password colliding with a pwned hash: log10(256 ^ (10 - 7)) = 7.22....

If you have a billion user registrations, on average you might expect 100 truly random passwords to fail the test, and the user would have to generate a new one. That seems acceptable, and would result in a 2x reduction in DB size.

(Someone check my math?)

@assafmo commented Aug 6, 2020

Nice! I think it depends on the use case. If you're willing to accept false positives, go with a Bloom filter; it'll be much smaller in size (but slower than sqlite in my experience): https://hur.st/bloomfilter/?n=572611621&p=1.0E-7&m=&k=
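
For readers unfamiliar with the structure, here is a minimal Bloom filter sketch in Python. This is an illustration, not assafmo's implementation; the bit and hash counts are parameters you would take from a calculator like the one linked above (for the full data set it suggests roughly a 2.24 GB filter with about 23 hash functions).

import hashlib

class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive each probe position from SHA-1 of a one-byte salt plus the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(bytes([i]) + key).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

# Toy sizes, nowhere near the real filter's dimensions:
bf = BloomFilter(num_bits=10_000_000, num_hashes=23)
bf.add(hashlib.sha1(b"qwerty123").digest())
print(bf.might_contain(hashlib.sha1(b"qwerty123").digest()))           # True
print(bf.might_contain(hashlib.sha1(b"some random string").digest()))  # almost certainly False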

@timmc (author) commented Aug 6, 2020

2.24GB, not bad. Some folks at Threat Stack used a Bloom filter with a 1 in 800 chance of false positives, which took up 1 GB of memory. 1 in 10 million is more to my taste, but 2 GB is in this sort of borderline space where I'm not really sure whether I'd rather have it in memory or on disk.

Curious that you found the Bloom filter to be slower than SQLite. Threat Stack found the same thing. I find that unexpected since SQLite probably has to go out to disk multiple times to search its index, and the Bloom filter just has to run a bunch of hashes and make random memory accesses.

There might be some differences in what's being measured. If I time the performance of actual user logins, many of those passwords are unfortunately going to be the same as each other, so some parts of the index (or the filter) will already be in disk (or memory) cache. But if you check the performance for random strings, you'll have a lot of cache misses and the average time will be worse.
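
A rough sketch of that measurement difference, against the database built by the script at the top (file, table, and column names are from that script; the "cold" keys are random digests that will mostly miss both SQLite's page cache and the OS cache):

import hashlib
import os
import sqlite3
import time

conn = sqlite3.connect("pwned-passwords-sha1-with-counts-v5.sqlite")
QUERY = 'SELECT "count" FROM hashes WHERE hash = ? LIMIT 1'

def time_lookups(keys):
    start = time.perf_counter()
    for key in keys:
        conn.execute(QUERY, (key,)).fetchone()
    return time.perf_counter() - start

hot = [hashlib.sha1(b"123456").digest()] * 10_000                      # same pages every time
cold = [hashlib.sha1(os.urandom(16)).digest() for _ in range(10_000)]  # mostly cache misses
print("hot:", time_lookups(hot), "cold:", time_lookups(cold))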

@assafmo commented Aug 6, 2020

For me it was a big bloom filter, so it was on disk. I think multiple hashing + disk access was more expensive than 2 disk accesses (my sqlite table also had PRIMARY KEY + WITHOUT ROWID like in this gist).

@timmc (author) commented Nov 17, 2021

Uzlopak, can you expand on what that does better or differently than the previous scripts?

@Uzlopak commented Nov 17, 2021

TBH I am now a bit ashamed. I extended that other code to store the data as binary and thought you were also storing strings here, so I wanted to show that it could be stored more compactly. I was actually very proud of my change, as it brought 66 GB down to 17 GB.

So yeah... never mind my post. I should have used your snippet from the beginning.

Btw, I ran a VACUUM after the import in DB Browser for SQLite and reduced the file further from 17.6 GB to 15.4 GB. Maybe your script should also do a VACUUM call after all the hashes are imported?

@timmc (author) commented Nov 17, 2021

No worries! The original version of the script did indeed store hex strings because I was too lazy to write the Python script and just wanted to load directly from CSV. :-)

Good to know about using vacuum. I had idly wondered if there was any benefit to be had from a post-build cleanup, since the incremental build might not result in the most compact structure, but I wasn't sure what that would entail—and it sounds like this is exactly that. From the SQLite docs:

Running VACUUM ensures that each table and index is largely stored contiguously within the database file. In some cases, VACUUM may also reduce the number of partially filled pages in the database, reducing the size of the database file further.

The B-tree probably is managed with a lot of partial pages to improve insertion time, at the cost of space, and a post-build pass would then perform compaction. A 14% size reduction is well worth the extra build time! I wonder if it also improves access time at all.
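
For concreteness, a sketch of what that could look like in the build() function from the script at the top (VACUUM can't run inside a transaction, so it goes after the commit):

def build(in_path, out_path):
    with sqlite3.connect(out_path) as conn:
        conn.execute('pragma journal_mode=memory')
        conn.execute('CREATE TABLE hashes("hash" BLOB PRIMARY KEY, "count" INT) WITHOUT ROWID')
        conn.executemany(
            'INSERT INTO hashes(hash, count) VALUES (?, ?)',
            record_generator(in_path))
        conn.commit()
        # Compact the partially filled pages left behind by the bulk insert.
        conn.execute('VACUUM')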

@Uzlopak commented Nov 17, 2021

The 14% size reduction seems to apply consistently. I now have a full pwned DB and a 1-million-most-used-passwords DB for when there is not enough space to use the full one.

The 1-million DB is about 15 MB and will be part of the Docker image (better than shipping 15 GB with the image lol). So in production you can mount the full DB into the container.

I made a benchmark. With the 1-million DB I can make 100,000 calls in about 550 ms (Node 16, better-sqlite3, exclusive mode as it will only have one connection).

I will test the performance later with the full database.

100,000 calls a second is actually pretty good.

@Uzlopak commented Nov 17, 2021

I'm posting this also in case somebody needs a Node.js script for using the sqlite3 pwned DB in their product.

Awesome:

// The lookup module (imported by the test below as src/utils/password/pwned):
import sqlite3 from "better-sqlite3";
import crypto from "crypto";
import { join } from "path";

const dbPath = process.env.PWNED_DB_PATH || join(__dirname, "../../assets/pwned_100000.sqlite3");

console.log(`Loading pwned database from ${dbPath}`);

const db = new sqlite3(dbPath, {
	fileMustExist: true,
	readonly: true,
});
db.pragma("foreign_keys = false");
db.pragma("journal_mode = off");
db.pragma("synchronous = off");
db.pragma("locking_mode = exclusive");
db.pragma("secure_delete = false");
db.pragma("automatic_index = false");
db.pragma("page_size = 512");

const dbStatement = db.prepare("SELECT COUNT(hash) AS c FROM pwned WHERE hash = ?");

// Default export so the test below can do `import pwned from ...`.
export default function pwned(value: string) {
	const hash = crypto.createHash("sha1").update(value).digest();
	return dbStatement.get(hash).c === 1;
}
// The accompanying test file:
import { expect } from "chai";
import { describe, it } from "mocha";
import pwned from "../../../src/utils/password/pwned";

describe.only(
	"pwned",
	() => {
		const cases: [string, boolean][] = [
			["123456", true],
			["Password", true],
			["secret", true],
			["P@ssword", true],
			["Was ist Aufklärung?", false],
		];
		cases.forEach(function ([value, isInTestDb]) {
			it(`should return ${isInTestDb} for '${value}'`, function () {
				expect(pwned(value), value).to.be.equal(isInTestDb);
			})
		});

		it('benchmark', () => {
			const start = Date.now();
			for (let i = 0, il = 50000; i < il; i++) {
				pwned("P@ssword");
				pwned("Was ist Aufklärung?");
			}
			const end = Date.now();

			console.log(`100000 pwned checks took ${end-start} ms`);
		});
	}
);
Loading pwned database from /home/uzlopak/project/pwned.sqlite3


  pwned
    ✔ should return true for '123456'
    ✔ should return true for 'Password'
    ✔ should return true for 'secret'
    ✔ should return true for 'P@ssword'
    ✔ should return false for 'Was ist Aufklärung?'
100000 pwned checks took 462 ms
    ✔ benchmark (462ms)

Also, in this calculation we have to consider that generating the SHA-1 takes some cycles, so the lookups themselves are probably even faster. About 200,000 pwned checks per second is more than sufficient. :) There was also no performance difference between the full and the lite database.

I will publish an npm package for this.
