@mlin
Last active December 1, 2023 13:56
static.wiki database compression

Context: static.wiki and Show HN post

We downloaded static.wiki's 40.3 GiB SQLite database of English Wikipedia and compressed it with sqlite_zstd_vfs, our read/write Zstandard compression layer for SQLite3. The compressed version is 10.4 GiB (26% of the original size), and the VFS supports HTTP random access in the spirit of the original. We don't yet have a WebAssembly build, so for now it's a library for CLI and desktop apps. You can try it out on Linux or macOS x86-64:

pip3 install genomicsqlite
genomicsqlite https://f000.backblazeb2.com/file/mlin-public/static.wiki/en.zstd.db \
    "select text from wiki_articles where title = 'SQLite'"

(Replace the query with any other, use .schema to see the schema, or omit the query to enter the sqlite3 interactive REPL.)

sqlite_zstd_vfs is a building block of our Genomics Extension for SQLite, which bundles it along with other domain-specific features. This demonstrates a general-purpose use case of the compression layer originally intended for storing genomics big data.
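To make the idea of a compressed store with random access more concrete, here is a minimal Python sketch. It is not how sqlite_zstd_vfs is implemented: it uses zlib from the standard library as a stand-in for Zstandard, keeps the compressed pages in a plain in-memory list, and the names `compress_pages`, `read_range`, and `PAGE_SIZE` are made up for illustration. The point is only that compressing fixed-size pages independently lets a reader decompress just the pages that overlap the bytes it wants.

```python
import zlib
from typing import List

PAGE_SIZE = 64 * 1024  # hypothetical page size chosen for this sketch


def compress_pages(data: bytes, page_size: int = PAGE_SIZE) -> List[bytes]:
    """Split data into fixed-size pages and compress each page independently."""
    return [
        zlib.compress(data[off:off + page_size], 9)
        for off in range(0, len(data), page_size)
    ]


def read_range(pages: List[bytes], offset: int, length: int,
               page_size: int = PAGE_SIZE) -> bytes:
    """Return bytes [offset, offset + length), decompressing only overlapping pages."""
    if length <= 0:
        return b""
    first = offset // page_size                 # first page touching the range
    last = (offset + length - 1) // page_size   # last page touching the range
    chunk = b"".join(zlib.decompress(p) for p in pages[first:last + 1])
    start = offset - first * page_size          # position of offset within chunk
    return chunk[start:start + length]
```

In this sketch, larger pages give the compressor more context per page, while smaller pages mean less data to fetch and decompress per read; that is a general property of page-wise compression, not a measurement of sqlite_zstd_vfs.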

If you want to download the whole 10 GiB compressed database, please get it from Zenodo instead of Backblaze B2 so that I'm not charged. You can then use the genomicsqlite CLI with the local file in place of the URL above, or write a program using any of the GenomicSQLite language bindings.
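As a rough sketch of the "write a program" route in Python: the query below is the same one used in the CLI example, and the table and column names (`wiki_articles`, `title`, `text`) come from it. The helper names `fetch_article_text` and `open_compressed` are made up for illustration, and `open_compressed` assumes the `genomicsqlite` package is installed and that `genomicsqlite.connect(path, read_only=True)` returns a regular `sqlite3` connection; check the GenomicSQLite documentation before relying on that.

```python
import sqlite3
from typing import Optional


def fetch_article_text(conn: sqlite3.Connection, title: str) -> Optional[str]:
    """Return the text of the article with the given title, or None if absent."""
    row = conn.execute(
        "SELECT text FROM wiki_articles WHERE title = ?", (title,)
    ).fetchone()
    return row[0] if row else None


def open_compressed(path: str) -> sqlite3.Connection:
    """Open a compressed database read-only (assumes genomicsqlite is installed)."""
    import genomicsqlite  # imported lazily so fetch_article_text works without it
    return genomicsqlite.connect(path, read_only=True)
```

With a local copy of the file, `fetch_article_text(open_compressed("en.zstd.db"), "SQLite")` should return the same text as the CLI query above. The parameterized query (`?`) avoids quoting problems with titles that contain apostrophes.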

We used the following commands to generate en.zstd.db from static.wiki's original en.db:

genomicsqlite en.db --compact --inner-page-KiB 64 --outer-page-KiB 2 --level 19 -o en.zstd.db
genomicsqlite --dbi en.zstd.db

The second command generates a .dbi helper file to be served alongside the main database file, which the extension can use (optionally) to streamline web access.
