Skip to content

Instantly share code, notes, and snippets.

@tnj
Last active November 12, 2019 07:44
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save tnj/38cfd344d539f2e2882dce10971bce73 to your computer and use it in GitHub Desktop.
Save tnj/38cfd344d539f2e2882dce10971bce73 to your computer and use it in GitHub Desktop.
Efficiently finds a record by creation time in a huge table even without having index on created_at
class ActiveRecord::Base
# @param [Time] created_at
# @return [ActiveRecord::Base]
def self.find_closest_created_at(created_at)
return if self.last.created_at < created_at
high = self.last.id
low = self.first.id
while low <= high
mid = (high - low) / 2 + low
closest = self.select(:id, :created_at).where('id >= ?', mid).first
return if closest.nil?
return self.find(closest.id) if high == low
if created_at > closest.created_at
low = mid + 1
else
high = mid
end
end
end
end
@tnj
Copy link
Author

tnj commented Jun 30, 2016

Usage

Install this as an initializer on your Rails project, or just paste whole code into rails console.

Example

You have a Log model that contains billions of rows and doesn't have an index on created_at.

To retrieve rows for the last 24 hours, do the following two steps.

  1. retrieve id of the row closest to the time of 24 hours ago
  2. build where query using the id to narrow down the range of data using a primary key.
Log.where('id >= ?', Log.find_closest_created_at(24.hours.ago).id).where(type: 'error')

It generates a query that uses the primary key to determine the range and runs very fast, even on a table with billions of rows.

mysql> explain SELECT `logs`.* FROM `logs`  WHERE (id >= 1165448944) AND `logs`.`type` = "error"
+----+-------------+-------+-------+---------------+---------+---------+------+--------+-------------+
| id | select_type | table | type  | possible_keys | key     | key_len | ref  | rows   | Extra       |
+----+-------------+-------+-------+---------------+---------+---------+------+--------+-------------+
|  1 | SIMPLE      | logs  | range | PRIMARY       | PRIMARY | 4       | NULL | 388992 | Using where |
+----+-------------+-------+-------+---------------+---------+---------+------+--------+-------------+
1 row in set (0.00 sec)

Above code is an equivalent to this naïve one:

Log.where('created_at > ?', 24.hours.ago).where(type: 'error')

However, this will result in a full table scan that doesn't return like forever. Not a good idea to run on a table with billions of rows.

mysql> explain SELECT `logs`.* FROM `logs`  WHERE (created_at >= '2016-06-29 12:00:00') AND `logs`.`type` = "error";
+----+-------------+-------+------+---------------+------+---------+------+------------+-------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows       | Extra       |
+----+-------------+-------+------+---------------+------+---------+------+------------+-------------+
|  1 | SIMPLE      | logs  | ALL  | NULL          | NULL | NULL    | NULL | 1000000000 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+------------+-------------+
1 row in set (0.05 sec)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment