Created
February 17, 2016 19:57
-
-
Save swayson/84fc86da20db89b56eac to your computer and use it in GitHub Desktop.
Efficient way to do random sampling in SQLite.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
SELECT * FROM table | |
WHERE _ROWID_ >= (abs(random()) % (SELECT max(_ROWID_) FROM table)) | |
LIMIT 1 |
Wow, this is an old gist I forgot about 😅
This picks one random row in the table by taking the first row in a random range.
Here is my proposal for random sampling multiple rows in an efficient way:
https://gist.github.com/alecco/9976dab8fda8256ed403054ed0a65d7b
Looks great and thanks for the detailed description on the implementation. Think this is great learning material and will help folks 👍
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
If the full sample is of k rows (say 100k) and the underlying table has n rows (e.g. 100m).
For each of the k rows, this approach will pick a random rowid in the range of valid rowids. But since there might be wholes (e.g. deleted rows) it has to go find the first valid rowid (the >= part). So it's doing k searches in the B-Tree of rowids of the table.
My proposal traverses sequentially the whole of the rowids and samples that to j size (say 1m out of 100m), then sorts and picks k out of that. You can fine-tune the sampling j to get closer to k. Or at least fit in L2/L3 cache.
As long as k is not trivially small, sequential traversal of the rowid B-Tree should be much faster. YMMV