Skip to content

Instantly share code, notes, and snippets.

@ericcj
Forked from siannopollo/README.md
Last active December 20, 2023 15:41
Show Gist options
  • Save ericcj/e58ae3b11ef14fc070862c8567cd1b7f to your computer and use it in GitHub Desktop.
Save ericcj/e58ae3b11ef14fc070862c8567cd1b7f to your computer and use it in GitHub Desktop.
Pluck in batches

pluck_in_batches

Sometimes you need to iterate over a ton of items and you don't want the overhead of creating AR objects out of all of them. Hell, you only need a few things! Well, #pluck has your back.

But what if you want to iterate over many tonnes of items?

Pluck in batches to the rescue!

This fork works on any orderable primary key not just integers just like http://api.rubyonrails.org/classes/ActiveRecord/Batches.html#method-i-find_in_batches

Enjoy!

# This assumes you are in Rails 4 and you can pluck multiple columns
class ActiveRecord::Relation
# pluck_in_batches: yields an array of *columns that is at least size
# batch_size to a block.
#
# Special case: if there is only one column selected than each batch
# will yield an array of columns like [:column, :column, ...]
# rather than [[:column], [:column], ...]
# Arguments
# columns -> an arbitrary selection of columns found on the table.
# batch_size -> How many items to pluck at a time
# &block -> A block that processes an array of returned columns.
# Array is, at most, size batch_size
#
# Returns
# nothing is returned from the function
def pluck_in_batches(*columns, batch_size: 1000)
if columns.empty?
raise "There must be at least one column to pluck"
end
# the :id to start the query at
batch_start = nil
# It's cool. We're only taking in symbols
# no deep clone needed
select_columns = columns.dup
# Find index of :id in the array
remove_id_from_results = false
id_index = columns.index(primary_key.to_sym)
# :id is still needed to calculate offsets
# add it to the front of the array and remove it when yielding
if id_index.nil?
id_index = 0
select_columns.unshift(primary_key)
remove_id_from_results = true
end
loop do
relation = self.reorder(table[primary_key].asc).limit(batch_size)
relation = relation.where(table[primary_key].gt(batch_start)) if batch_start
items = relation.pluck(*select_columns)
break if items.empty?
# Use the last id to calculate where to offset queries
last_item = items.last
batch_start = last_item.is_a?(Array) ? last_item[id_index] : last_item
# Remove :id column if not in *columns
items.map! { |row| row[1..-1] } if remove_id_from_results
yield items
break if items.size < batch_size
end
end
end
@arocca1
Copy link

arocca1 commented Sep 15, 2017

When fetching the id index from the columns array, I think you actually want to do something like this:
id_index = columns.index(primary_key.to_sym) || columns.index(primary_key) || columns.index("#{table_name}.#{primary_key}").
I ran into an issue here where I got no results because I specified the column as "<table_name>.<primary_key>"

@mathijssterk
Copy link

@ericcj Thanks for the gist!

When fetching a single non-primary key column, a nested array is returned. To fix this you could replace the code

      # Remove :id column if not in *columns
      items.map! { |row| row[1..-1] } if remove_id_from_results

with

        # Remove :id column if not in *columns
        if remove_id_from_results
          items.map! { |row| row[1..-1] }
          items.map! { |row| row.first } if items.first.length == 1
        end

@ukolovda
Copy link

ukolovda commented Oct 22, 2019

It may be useful to disable cache for avoid big memory usage:

items = relation.uncached { relation.pluck(*select_columns) }

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment