@timmc
Last active October 4, 2024 14:00
Building a sqlite DB for the Pwned Passwords data

pwned-passwords-sqlite-build.py:
#!/usr/bin/env python3
# Build a SQLite3 DB for looking up SHA-1 hashes of leaked passwords.
#
# This can be fed the txt file from one of Have I Been Pwned's hash
# lists available from https://haveibeenpwned.com/Passwords -- but any
# text file with line format ``hash-hex:count`` will work.
#
# When run on the v5 hash-ordered SHA-1 file, expect the build to take
# about 35 minutes and produce a 15.7 GiB file (~30.5 bytes per record).
#
# This example shows querying the resulting database for the
# vulnerable password "qwerty123", and finding that it was present
# with a count of 621679:
#
# >>> import hashlib, sqlite3
# >>> conn = sqlite3.connect("pwned-passwords-sha1-with-counts-v5.sqlite")
# >>> hash_bytes = hashlib.sha1('qwerty123'.encode()).digest()
# >>> conn.execute("SELECT * FROM hashes WHERE hash = :sha1 LIMIT 1", {'sha1': hash_bytes}).fetchone()
# (b'\\\xec\x17[\x16^=^b\xc9\xe1<\xe8H\xefo\xea\xc8\x1b\xff', 621679)
import sqlite3
import sys


def record_generator(in_path):
    """Yield (sha1_bytes, count) records from a hash-hex:count text file."""
    with open(in_path) as hashes:
        for line in hashes:
            (sha1_hex, count_str) = line.split(':', 1)
            sha1_bytes = bytes.fromhex(sha1_hex)
            count = int(count_str)
            yield (sha1_bytes, count)


def build(in_path, out_path):
    with sqlite3.connect(out_path) as conn:
        # Journaling is disabled to speed up the initial bulk import.
        conn.execute('pragma journal_mode=memory')
        conn.execute('CREATE TABLE hashes("hash" BLOB PRIMARY KEY, "count" INT) WITHOUT ROWID')
        conn.executemany(
            'INSERT INTO hashes(hash, count) VALUES (?, ?)',
            record_generator(in_path))
        conn.commit()


def main(*args):
    if len(args) != 2:
        print("Usage: build.py <hash-and-count-input.txt> <output.sqlite>")
        sys.exit(1)
    build(*args)


if __name__ == '__main__':
    main(*sys.argv[1:])
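
For reference, a typical invocation (a sketch, assuming the script is saved as pwned-passwords-sqlite-build.py and fed the v5 hash-ordered file named in the header comment):

    $ python3 pwned-passwords-sqlite-build.py pwned-passwords-sha1-ordered-by-hash-v5.txt pwned-passwords-sha1-with-counts-v5.sqlite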

Update: Added a Python version (pwned-passwords-sqlite-build.py) that preserves counts and uses binary keys rather than text.

Last executed 2019-06-25 with the v4 dump:

  1. Make sure you have 60 GB free disk space and some extra to spare. Alternatively, take a walk on the wild side and delete source files as soon as you've used them.
  2. Download the SHA-1 (ordered by hash) torrent from https://haveibeenpwned.com/Passwords
  3. Unpack and strip off the counts:
    $ 7z x -so pwned-passwords-sha1-ordered-by-hash-v4.7z pwned-passwords-sha1-ordered-by-hash-v4.txt | sed 's/:.*//' > hashes.lst
    
  4. Load into sqlite:
    $ sqlite3 pwned-passwords-by-hash-v4.sqlite
    sqlite> pragma journal_mode=memory;
    sqlite> CREATE TABLE hashes("hash" BLOB PRIMARY KEY) WITHOUT ROWID;
    sqlite> .mode csv
    sqlite> .import hashes.lst hashes
    
  5. Confirm that queries will use the primary key's index:
    sqlite> EXPLAIN QUERY PLAN SELECT * FROM hashes WHERE hash = 'D657187D9C9C1AD04FDA5132338D405FDB112FA1' LIMIT 1;
    0|0|0|SEARCH TABLE hashes USING PRIMARY KEY (hash=?)
    
    A sample query should return almost instantly (see the Python sketch after this list for the same lookup from code):
    sqlite> SELECT * FROM hashes WHERE hash = 'D657187D9C9C1AD04FDA5132338D405FDB112FA1' LIMIT 1;
    D657187D9C9C1AD04FDA5132338D405FDB112FA1
    
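The same lookup from Python (a minimal sketch, assuming the hex-text schema built above; the v4 dump stores hashes as uppercase hex):

    import hashlib
    import sqlite3

    conn = sqlite3.connect("pwned-passwords-by-hash-v4.sqlite")
    # Uppercase the digest to match the dump's formatting.
    sha1_hex = hashlib.sha1("qwerty123".encode()).hexdigest().upper()
    row = conn.execute(
        "SELECT 1 FROM hashes WHERE hash = ? LIMIT 1", (sha1_hex,)).fetchone()
    print("pwned!" if row else "not found")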

A few notes on construction:

  • Journaling is disabled to speed up initial import.
  • Making the hash the primary key tells sqlite that the data is unique and ordered, and the primary key data is stored in the table's B-tree rather than duplicated into an additional index.
  • We can also avoid storing rowids; it's unclear how much space this saves, but there's no harm here (the pragma sketch below shows one way to inspect the file layout): https://www.sqlite.org/withoutrowid.html
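
To see how the file is laid out, standard SQLite pragmas can report it (a sketch against the v4 database built above):

    import sqlite3

    conn = sqlite3.connect("pwned-passwords-by-hash-v4.sqlite")
    # page_size * page_count is the file size; freelist_count counts wholly unused pages.
    page_size, = conn.execute("PRAGMA page_size").fetchone()
    page_count, = conn.execute("PRAGMA page_count").fetchone()
    freelist, = conn.execute("PRAGMA freelist_count").fetchone()
    print(f"{page_size * page_count / 2**30:.2f} GiB total, {freelist} free pages")
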
@timmc (Author) commented Nov 17, 2021

No worries! The original version of the script did indeed store hex strings because I was too lazy to write the Python script and just wanted to load directly from CSV. :-)

Good to know about using vacuum. I had idly wondered if there was any benefit to be had from a post-build cleanup, since the incremental build might not result in the most compact structure, but I wasn't sure what that would entail—and it sounds like this is exactly that. From the SQLite docs:

Running VACUUM ensures that each table and index is largely stored contiguously within the database file. In some cases, VACUUM may also reduce the number of partially filled pages in the database, reducing the size of the database file further.

The B-tree is probably managed with a lot of partially filled pages to improve insertion time at the cost of space, and a post-build pass would then perform compaction. A 14% size reduction is well worth the extra build time! I wonder if it also improves access time at all.
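
The compaction pass itself would be just this (a sketch against the build script above; note that VACUUM can't run inside a transaction):

    import sqlite3

    conn = sqlite3.connect("pwned-passwords-sha1-with-counts-v5.sqlite")
    conn.isolation_level = None  # autocommit, since VACUUM must run outside a transaction
    conn.execute("VACUUM")       # rewrites the file contiguously, merging partially filled pages
    conn.close()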

@Uzlopak commented Nov 17, 2021

The 14% size reduction seems to apply consistently. I now have a full pwned DB, plus a 1-million-most-used-passwords DB for when there isn't enough space for the full one.

The 1-million DB is about 15 MB and will be part of the Docker image (better than shipping 15 GB with the image, lol). In production you can then mount the full DB into the container.
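
(One hypothetical way to build such a lite DB: truncate HIBP's ordered-by-prevalence download and reuse the build script above; the filenames here are assumed.)

    $ head -n 1000000 pwned-passwords-sha1-ordered-by-count-v4.txt > top-1m.txt
    $ python3 pwned-passwords-sqlite-build.py top-1m.txt pwned_1m.sqlite3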

I ran a benchmark: with the 1-million DB I can make 100,000 calls in about 550 ms (Node 16, better-sqlite3, exclusive locking mode since there will only be one connection).

I will test the performance later with the full database.

Well over 100,000 calls a second is actually pretty good.

@Uzlopak
Copy link

Uzlopak commented Nov 17, 2021

I'm also posting this in case somebody needs a Node.js script to use the SQLite pwned DB in their product.

Awesome:

import sqlite3 from "better-sqlite3";
import crypto from "crypto";
import { join } from "path";

const dbPath = process.env.PWNED_DB_PATH || join(__dirname, "../../assets/pwned_100000.sqlite3");

console.log(`Loading pwned database from ${dbPath}`);

const db = new sqlite3(dbPath, {
	fileMustExist: true,
	readonly: true,
});
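// Read-only, single-connection lookup workload, so journaling, syncing, and
// locking overhead can all be dropped.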
db.pragma("foreign_keys = false");
db.pragma("journal_mode = off");
db.pragma("synchronous = off");
db.pragma("locking_mode = exclusive");
db.pragma("secure_delete = false");
db.pragma("automatic_index = false");
db.pragma("page_size = 512");

const dbStatement = db.prepare("SELECT COUNT(hash) AS c FROM pwned where hash = ?");

export function pwned(value: string) {
	const hash = crypto.createHash("sha1").update(value).digest();
	return dbStatement.get(hash).c === 1;
}
And a mocha test file:

import { expect } from "chai";
import { describe, it } from "mocha";
import { pwned } from "../../../src/utils/password/pwned";

describe(
	"pwned",
	() => {
		const cases: [string, boolean][] = [
			["123456", true],
			["Password", true],
			["secret", true],
			["P@ssword", true],
			["Was ist Aufklärung?", false],
		];
		cases.forEach(function ([value, isInTestDb]) {
			it(`should return ${isInTestDb} for '${value}'`, function () {
				expect(pwned(value), value).to.be.equal(isInTestDb);
			})
		});

		it('benchmark', () => {
			const start = Date.now();
			for (let i = 0, il = 50000; i < il; i++) {
				pwned("P@ssword");
				pwned("Was ist Aufklärung?");
			}
			const end = Date.now();

			console.log(`100000 pwned checks took ${end-start} ms`);
		});
	}
);
Output:

Loading pwned database from /home/uzlopak/project/pwned.sqlite3


  pwned
    ✔ should return true for '123456'
    ✔ should return true for 'Password'
    ✔ should return true for 'secret'
    ✔ should return true for 'P@ssword'
    ✔ should return false for 'Was ist Aufklärung?'
100000 pwned checks took 462 ms
    ✔ benchmark (462ms)

We also have to consider that generating the SHA-1 takes some cycles, so the raw lookups are probably even faster. 100,000 checks in 462 ms works out to roughly 216,000 checks per second, which is more than sufficient. :) Also, there is no performance difference between the full and the lite database.

I will publish an npm package for this.
