@bgvo
Last active August 29, 2015 14:08
Getting incrementing ranges of data (over a million docs)
# Fetch the next batch of 10,000 items in _id order, excluding ids already in `set`
olds = Item.order_by(id: 1).skip(new_batch * 10000).limit(10000).not_in(id: set)
olds.each do |doc|
  ...
end
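
For context, a minimal sketch (not part of the original gist) of how this batching might be driven, assuming new_batch is a plain counter and set holds the ids excluded up front:

# Hypothetical driver loop for the skip-based batching above.
new_batch = 0
loop do
  olds = Item.order_by(id: 1).skip(new_batch * 10000).limit(10000).not_in(id: set)
  break if olds.count.zero?
  olds.each do |doc|
    # ... process doc ...
  end
  new_batch += 1
end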
@kuadrosx

Using skip is a bad idea in any database. If you have a time field, you should paginate on that instead. Also, $nin can be slow, especially if "set" is big, so maybe you should add a field to track whether the Item was already processed:

# Only unprocessed items newer than the last one seen, in ascending time order
olds = Item.where(processed: false, :t.gt => last_time).order_by(t: 1)

olds.each do |doc|
  ...
  doc.set(processed: true)
  last_time = doc.t
end

If that is not enough and you need more speed, you should use Moped directly, and also use the no_timeout option so the cursor doesn't time out during a long iteration:

# Raw Moped query; keep ascending time order so last_time can be used to resume
olds = Item.collection.find(processed: false, :t => {:$gt => last_time}).sort(t: 1)
olds.each do |doc|
  ...
  Item.collection.find(_id: doc['_id']).update(:$set => {:processed => true})
  last_time = doc['t']
end
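
As a hedged sketch (assuming a Mongoid 3/4-era Criteria#no_timeout, which asks the server not to expire the cursor while the loop runs), the no_timeout option mentioned above could be applied like this:

# Sketch only: assumes Criteria#no_timeout is available in your Mongoid version.
Item.where(processed: false, :t.gt => last_time)
    .order_by(t: 1)
    .no_timeout
    .each do |doc|
  # ... process doc ...
  doc.set(processed: true)
  last_time = doc.t
end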

Don't forget to add indexes on t and processed.
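
A minimal sketch of what that could look like on the model, assuming Item is a Mongoid document and that the field types shown here are illustrative:

class Item
  include Mongoid::Document

  field :t, type: Time
  field :processed, type: Boolean, default: false

  # Compound index covering the queries above (processed equality + t range/sort)
  index({ processed: 1, t: 1 }, background: true)
end

With Mongoid, the declared indexes can then be created with rake db:mongoid:create_indexes.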
