swayson/sqlite_random_sample.sql

Created February 17, 2016 19:57

Star () You must be signed in to star a gist
Fork () You must be signed in to fork a gist

Learn more about clone URLs
Clone this repository at <script src="https://gist.github.com/swayson/84fc86da20db89b56eac.js"></script>
Save swayson/84fc86da20db89b56eac to your computer and use it in GitHub Desktop.

Download ZIP

Efficient way to do random sampling in SQLite.

Raw

sqlite_random_sample.sql

	SELECT * FROM table
	WHERE _ROWID_ >= (abs(random()) % (SELECT max(_ROWID_) FROM table))
	LIMIT 1

alecco commented May 9, 2021

If the full sample is of k rows (say 100k) and the underlying table has n rows (e.g. 100m).

For each of the k rows, this approach will pick a random rowid in the range of valid rowids. But since there might be wholes (e.g. deleted rows) it has to go find the first valid rowid (the >= part). So it's doing k searches in the B-Tree of rowids of the table.

My proposal traverses sequentially the whole of the rowids and samples that to j size (say 1m out of 100m), then sorts and picks k out of that. You can fine-tune the sampling j to get closer to k. Or at least fit in L2/L3 cache.

As long as k is not trivially small, sequential traversal of the rowid B-Tree should be much faster. YMMV

Author

swayson commented May 23, 2021 •

edited

Wow, this is an old gist I forgot about 😅

This picks one random row in the table by taking the first row in a random range.

Here is my proposal for random sampling multiple rows in an efficient way:
https://gist.github.com/alecco/9976dab8fda8256ed403054ed0a65d7b

Looks great and thanks for the detailed description on the implementation. Think this is great learning material and will help folks 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment