Skip to content

Instantly share code, notes, and snippets.

@timmc
Last active October 4, 2024 14:00
Show Gist options
  • Save timmc/df5fbb6e069fb8c1c4e181a29930ace3 to your computer and use it in GitHub Desktop.
Save timmc/df5fbb6e069fb8c1c4e181a29930ace3 to your computer and use it in GitHub Desktop.
Building a sqlite DB for the Pwned Passwords data
#!/usr/bin/env python3
# Build a SQLite3 DB for looking up SHA-1 hashes of leaked passwords.
#
# This can be fed the txt file from one of Have I Been Pwned's hash
# lists available from https://haveibeenpwned.com/Passwords -- but any
# text file with line format ``hash-hex:count`` will work.
#
# When run on the v5 hash-ordered SHA-1 file, expect the build to take
# about 35 minutes and produce a 15.7 GiB file (~30.5 bytes per record).
#
# This example shows querying the resulting database for the
# vulnerable password "qwerty123", and finding that it was present
# with a count of 621679:
#
# >>> import sqlite3
# >>> conn = sqlite3.connect("pwned-passwords-sha1-with-counts-v5.sqlite")
# >>> hash_bytes = hashlib.sha1('qwerty123'.encode()).digest()
# >>> conn.execute("SELECT * FROM hashes WHERE hash = :sha1 LIMIT 1", {'sha1': hash_bytes}).fetchone()
# (b'\\\xec\x17[\x16^=^b\xc9\xe1<\xe8H\xefo\xea\xc8\x1b\xff', 621679)
import os
import sqlite3
import sys
def record_generator(in_path):
with open(in_path) as hashes:
for line in hashes:
(sha1_hex, count_str) = line.split(':', 2)
sha1_bytes = bytes.fromhex(sha1_hex)
count = int(count_str)
yield (sha1_bytes, count)
def build(in_path, out_path):
with sqlite3.connect(out_path) as conn:
conn.execute('pragma journal_mode=memory')
conn.execute('CREATE TABLE hashes("hash" BLOB PRIMARY KEY, "count" INT) WITHOUT ROWID')
conn.executemany(
'INSERT INTO hashes(hash, count) VALUES (?, ?)',
record_generator(in_path))
conn.commit()
def main(*args):
if len(args) != 2:
print("Usage: build.py <hash-and-count-input.txt> <output.sqlite>")
sys.exit(1)
build(*args)
if __name__ == '__main__':
main(*sys.argv[1:])

Update: Added a Python version (pwned-passwords-sqlite-build.py) that preserves counts and uses binary keys rather than text.

Last executed 2019-06-25 with the v4 dump:

  1. Make sure you have 60 GB free disk space and some extra to spare. Alternatively, take a walk on the wild side and delete source files as soon as you've used them.
  2. Download the SHA-1 (ordered by hash) torrent from https://haveibeenpwned.com/Passwords
  3. Unpack and strip off the counts:
    7z x -so pwned-passwords-sha1-ordered-by-hash-v4.7z pwned-passwords-sha1-ordered-by-hash-v4.txt | sed 's/:.*//' > hashes.lst
    
  4. Load into sqlite:
    $ sqlite3 pwned-passwords-by-hash-v4.sqlite
    sqlite> pragma journal_mode=memory;
    sqlite> CREATE TABLE hashes("hash" BLOB PRIMARY KEY) WITHOUT ROWID;
    sqlite> .mode csv
    sqlite> .import hashes.lst hashes
    
  5. Confirm that queries will use the primary key's index:
    sqlite> EXPLAIN QUERY PLAN SELECT * FROM hashes WHERE hash = "D657187D9C9C1AD04FDA5132338D405FDB112FA1" LIMIT 1;
    0|0|0|SEARCH TABLE hashes USING PRIMARY KEY (hash=?)
    
    A sample query should return almost instantly:
    sqlite> SELECT * FROM hashes WHERE hash = "D657187D9C9C1AD04FDA5132338D405FDB112FA1" LIMIT 1;
    D657187D9C9C1AD04FDA5132338D405FDB112FA1
    

A few notes on construction:

  • Journaling is disabled to speed up initial import.
  • Making the hash the primary key tells sqlite that the data is unique and ordered—and the primary key data is stored in a B-tree, not duplicated into an additional index.
  • We can also avoid storing rowids; no idea how much space this saves us, but no harm here: https://www.sqlite.org/withoutrowid.html
@Uzlopak
Copy link

Uzlopak commented Nov 17, 2021

I post this also for the case, that somebody maybe needs a script for nodejs to use the sqlite3 pwned db in his product.

Awesome:

import sqlite3 from "better-sqlite3";
import crypto from "crypto";
import { join } from "path";

const dbPath = process.env.PWNED_DB_PATH || join(__dirname, "../../assets/pwned_100000.sqlite3");

console.log(`Loading pwned database from ${dbPath}`);

const db = new sqlite3(dbPath, {
	fileMustExist: true,
	readonly: true,
});
db.pragma("foreign_keys = false");
db.pragma("journal_mode = off");
db.pragma("synchronous = off");
db.pragma("locking_mode = exclusive");
db.pragma("secure_delete = false");
db.pragma("automatic_index = false");
db.pragma("page_size = 512");

const dbStatement = db.prepare("SELECT COUNT(hash) AS c FROM pwned where hash = ?");

export function pwned(value: string) {
	if (!dbStatement) {
	}

	const hash = crypto.createHash("sha1").update(value).digest();
	return dbStatement.get(hash).c === 1;
}
import { expect } from "chai";
import { describe, it } from "mocha";
import pwned from "../../../src/utils/password/pwned";

describe.only(
	"pwned",
	() => {
		const cases: [string, boolean][] = [
			["123456", true],
			["Password", true],
			["secret", true],
			["P@ssword", true],
			["Was ist Aufklärung?", false],
		];
		cases.forEach(function ([value, isInTestDb]) {
			it(`should return ${isInTestDb} for '${value}'`, function () {
				expect(pwned(value), value).to.be.equal(isInTestDb);
			})
		});

		it('benchmark', () => {
			const start = Date.now();
			for (let i = 0, il = 50000; i <il; i++) {
				pwned("P@ssword");
				pwned("Was ist Aufklärung?");
			}
			const end = Date.now();

			console.log(`100000 pwned checks took ${end-start} ms`);
		});
	}
);
Loading pwned database from /home/uzlopak/project/pwned.sqlite3


  pwned
    ✔ should return true for '123456'
    ✔ should return true for 'Password'
    ✔ should return true for 'secret'
    ✔ should return true for 'P@ssword'
    ✔ should return false for 'Was ist Aufklärung?'
100000 pwned checks took 462 ms
    ✔ benchmark (462ms)

Also in this calculation we have to consider, that generating the sha1 takes some cycles. So probably it is even faster. About 200.000 pwned checks per second is more than sufficient. :). Also no performance difference between full and the lite database.

Will publish a npm package for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment