Skip to content

Instantly share code, notes, and snippets.

@swayson
Created February 17, 2016 19:57
Show Gist options
  • Star 2 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save swayson/84fc86da20db89b56eac to your computer and use it in GitHub Desktop.
Save swayson/84fc86da20db89b56eac to your computer and use it in GitHub Desktop.
Efficient way to do random sampling in SQLite.
SELECT * FROM table
WHERE _ROWID_ >= (abs(random()) % (SELECT max(_ROWID_) FROM table))
LIMIT 1
@Hibou57
Copy link

Hibou57 commented Dec 22, 2017

Why is ROWID enclosed in “_” ?

@Hibou57
Copy link

Hibou57 commented Dec 22, 2017

Since there may be holes in the rowid sequence, there is no warranty it will always return something, even non-empty tables.

@alecco
Copy link

alecco commented May 8, 2021

This picks one random row in the table by taking the first row in a random range.

Here is my proposal for random sampling multiple rows in an efficient way:
https://gist.github.com/alecco/9976dab8fda8256ed403054ed0a65d7b

@alecco
Copy link

alecco commented May 9, 2021

If the full sample is of k rows (say 100k) and the underlying table has n rows (e.g. 100m).

For each of the k rows, this approach will pick a random rowid in the range of valid rowids. But since there might be wholes (e.g. deleted rows) it has to go find the first valid rowid (the >= part). So it's doing k searches in the B-Tree of rowids of the table.

My proposal traverses sequentially the whole of the rowids and samples that to j size (say 1m out of 100m), then sorts and picks k out of that. You can fine-tune the sampling j to get closer to k. Or at least fit in L2/L3 cache.

As long as k is not trivially small, sequential traversal of the rowid B-Tree should be much faster. YMMV

@swayson
Copy link
Author

swayson commented May 23, 2021

Wow, this is an old gist I forgot about 😅

This picks one random row in the table by taking the first row in a random range.

Here is my proposal for random sampling multiple rows in an efficient way:
https://gist.github.com/alecco/9976dab8fda8256ed403054ed0a65d7b

Looks great and thanks for the detailed description on the implementation. Think this is great learning material and will help folks 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment