Why find_each(batch_size: 1) is helpful if objects reference lots of memory and operations are long.

In MPDX, the Google Contacts sync job takes a long time, and the loop that syncs each Google account could benefit from find_each(batch_size: 1). find_each pulls in records in batches and holds each batch in an array while it enumerates through it. Below is a comparison of memory usage with different batch sizes, using a similar but contrived MemHungry model. To set up, first run 6.times { MemHungry.create }.
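
As a rough illustration, here is a simplified sketch (not Rails' actual source) of the batching strategy find_each uses: it pages through the table in id order, loading batch_size records per query and keeping the whole batch referenced in an array while it yields each record.

def find_each_sketch(relation, batch_size: 1000)
  # Load the first batch, ordered by id.
  batch = relation.order(:id).limit(batch_size).to_a
  while batch.any?
    # Every record in the batch stays referenced by this array
    # for the duration of the loop over it.
    batch.each { |record| yield record }
    # Fetch the next batch, keyed off the last id seen.
    batch = relation.where("id > ?", batch.last.id).order(:id).limit(batch_size).to_a
  end
end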

Using the default batch_size of 1000, all 6 records sit in a single batch array, so none of their allocations can be garbage-collected until the loop finishes; the memory at the end reflects all 6 objects and the RAM they hold onto:

[1] pry(main)> MemHungry.all.find_each { |m| puts m.id; m.eat_memory; }
  MemHungry Load (0.5ms)  SELECT  "mem_hungries".* FROM "mem_hungries"   ORDER BY "mem_hungries"."id" ASC LIMIT 1000
1
Memory before GC: 112.63671875
Memory before allocation: 159.90234375
Memory after allocation: 1759.90234375
2
Memory before GC: 1759.90234375
Memory before allocation: 1759.90234375
Memory after allocation: 3359.90234375
3
Memory before GC: 3359.90234375
Memory before allocation: 3359.90234375
Memory after allocation: 4959.90234375
4
Memory before GC: 4959.90234375
Memory before allocation: 4959.90234375
Memory after allocation: 6559.90234375
5
Memory before GC: 6559.90234375
Memory before allocation: 6559.90234375
Memory after allocation: 8159.90234375
6
Memory before GC: 8159.90234375
Memory before allocation: 8159.90234375
Memory after allocation: 9150.45703125
=> nil

But when we run with batch_size: 1, memory stays lower because find_each holds only one model at a time (with the default of 1000, every model is held at once whenever there are fewer than 1000 records). Notice that from the third iteration on, memory plateaus at roughly two allocations' worth (~3359 MB) instead of growing, because each GC.start reclaims the previous record's array once find_each has moved past it:

[1] pry(main)> MemHungry.all.find_each(batch_size: 1) { |m| puts m.id; m.eat_memory; }
  MemHungry Load (0.5ms)  SELECT  "mem_hungries".* FROM "mem_hungries"   ORDER BY "mem_hungries"."id" ASC LIMIT 1
1
Memory before GC: 116.43359375
Memory before allocation: 159.27734375
Memory after allocation: 1759.28125
  MemHungry Load (0.6ms)  SELECT  "mem_hungries".* FROM "mem_hungries"  WHERE ("mem_hungries"."id" > 1)  ORDER BY "mem_hungries"."id" ASC LIMIT 1
2
Memory before GC: 1759.28515625
Memory before allocation: 1759.28515625
Memory after allocation: 3359.28515625
  MemHungry Load (0.5ms)  SELECT  "mem_hungries".* FROM "mem_hungries"  WHERE ("mem_hungries"."id" > 2)  ORDER BY "mem_hungries"."id" ASC LIMIT 1
3
Memory before GC: 3359.28515625
Memory before allocation: 1759.28515625
Memory after allocation: 3359.28515625
  MemHungry Load (0.5ms)  SELECT  "mem_hungries".* FROM "mem_hungries"  WHERE ("mem_hungries"."id" > 3)  ORDER BY "mem_hungries"."id" ASC LIMIT 1
4
Memory before GC: 3359.28515625
Memory before allocation: 1759.28515625
Memory after allocation: 3359.28515625
  MemHungry Load (0.5ms)  SELECT  "mem_hungries".* FROM "mem_hungries"  WHERE ("mem_hungries"."id" > 4)  ORDER BY "mem_hungries"."id" ASC LIMIT 1
5
Memory before GC: 3359.28515625
Memory before allocation: 1759.28515625
Memory after allocation: 3359.28515625
  MemHungry Load (0.5ms)  SELECT  "mem_hungries".* FROM "mem_hungries"  WHERE ("mem_hungries"."id" > 5)  ORDER BY "mem_hungries"."id" ASC LIMIT 1
6
Memory before GC: 3359.28515625
Memory before allocation: 1759.28515625
Memory after allocation: 3359.28515625
  MemHungry Load (0.5ms)  SELECT  "mem_hungries".* FROM "mem_hungries"  WHERE ("mem_hungries"."id" > 6)  ORDER BY "mem_hungries"."id" ASC LIMIT 1
=> nil

The trade-off of a smaller batch size is the cost of an extra query to retrieve each record one by one. When the records feed a big Sidekiq background job that takes a long time and allocates a lot of memory per record, fetching them one by one makes the most sense. If the per-record operation were faster (or the models didn't hold references to the allocated objects), a larger batch size would be the better choice.
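
For concreteness, here is a hypothetical sketch of the sync job shape described above (GoogleAccount and sync_contacts! are illustrative names, not the actual MPDX code):

class GoogleContactsSyncWorker
  include Sidekiq::Worker

  def perform
    # With batch_size: 1, only the current account (and whatever memory
    # its sync references) is retained between iterations, at the cost
    # of one extra query per record.
    GoogleAccount.all.find_each(batch_size: 1) do |account|
      account.sync_contacts! # long-running, memory-heavy per-record work
    end
  end
end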

class MemHungry < ActiveRecord::Base
  # Allocate a large array and report process memory before and after,
  # forcing a GC first so the "before allocation" number reflects what
  # is actually still reachable.
  def eat_memory
    puts "Memory before GC: #{rss_mb}"
    GC.start
    puts "Memory before allocation: #{rss_mb}"
    @reference = Array.new(200 * 2 ** 20, 1) # ~200 million 1's, roughly 1.6 GB at 8 bytes per slot
    puts "Memory after allocation: #{rss_mb}"
  end

  # Resident set size in MB, via the newrelic_rpm gem's memory sampler.
  def rss_mb
    NewRelic::Agent::Samplers::MemorySampler.new.sampler.get_sample
  end
end
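
The rss_mb helper leans on the newrelic_rpm gem's internal memory sampler. If New Relic isn't in the bundle, a minimal stand-in (an assumption, not part of the original gist) is to shell out to ps, which reports RSS in kilobytes on Linux and macOS:

# Hypothetical alternative to the New Relic sampler: read the process's
# resident set size via ps (reported in KB) and convert to MB.
def rss_mb
  `ps -o rss= -p #{Process.pid}`.to_i / 1024.0
end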