@metaskills
Created January 9, 2012 12:55
Re: ActiveRecord Performance: Adapter Yielded Non-Cached Rows (October 2010)

Aaron,

This is Ken Collins (metaskills). I finally finished up the TinyTds [1] gem integration into the SQL Server adapter as a low-level connection mode and have some free time. I modeled that gem after Brian's Mysql2 gem interface, meaning that, among other things, it lazily loads each row right from the wire and can disable row caching.
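
To give you a concrete picture, usage looks roughly like this (a sketch, not exact code):

  require 'tiny_tds'

  client = TinyTds::Client.new(:username => 'sa', :password => '...', :host => 'mydb')
  result = client.execute("SELECT * FROM [users]")

  # With row caching off, each row is read from the wire as the block
  # runs instead of being buffered into one big array first.
  result.each(:cache_rows => false) do |row|
    # row is a Hash for this one record only
  end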

I want to explore a possible ActiveRecord performance optimization that takes advantage of this and I'm looking for advice and a bit of your mentoring. Here are my initial "raw" thoughts and questions below. I have sent two diffs [2] for reference too.

It would be great if this change was done in such a way that it just worked when the connection allowed. The ideal goal would be to allow something like this:

User.all.each do |user|
  # Each user comes across the wire lazily.
  # The adapter does not need to load up "n" hashes and save them all.
end

There are places in ActiveRecord we could fall back to for an interface/behavior change if this low-level implementation idea failed, like batches.rb. Being an adapter author, I am usually pretty good at finding the right places, but I got stumped last night on the feasibility of this idea given the ActiveRecord::Relation#to_a method. It seems to me the way it is done would require a few fundamental changes. Thoughts on this and the general idea?
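
For reference, #to_a is roughly this today (paraphrased from memory, not the exact source):

  # ActiveRecord::Relation, approximately:
  def to_a
    # Every row is fetched and instantiated before the caller sees the
    # first record, so there is nothing left to stream lazily.
    @records ||= @klass.find_by_sql(arel.to_sql)
  end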

  • Ken

[1] http://github.com/rails-sqlserver/tiny_tds
[2] https://gist.github.com/bc33adb7d40dac1d0fbc

Hey Ken,

I've been thinking about this problem for a while. Brian mentioned wanting to do something like this. I've merged prepared statement caching, and part of that work is that I want to start wrapping results in a consistent object:

http://github.com/rails/rails/blob/master/activerecord/lib/active_record/result.rb#L1-30

Right now, it's a dummy object that just contains the rows and column info. I was thinking we can have database-specific subclasses. The database-specific subclasses could use whatever data structure they want. For example, SQLite3 can keep a cursor inside that object and only read rows when they're requested.

I think it might be a good optimization. However, it seems that this would only be a win in the case where the user fetches many rows but only uses a few of them. I'm wondering how often that happens. What do you think?

-- Aaron Patterson http://tenderlovemaking.com/

Hey Aaron,

> I've been thinking about this problem for a while. Brian mentioned wanting to do something like this. I've merged prepared statement caching, and part of that work is that I want to start wrapping results in a consistent object:
>
> http://github.com/rails/rails/blob/master/activerecord/lib/active_record/result.rb#L1-30
>
> Right now, it's a dummy object that just contains the rows and column info. I was thinking we can have database-specific subclasses. The database-specific subclasses could use whatever data structure they want. For example, SQLite3 can keep a cursor inside that object and only read rows when they're requested.

Interesting, that subclass reminds me of what I would have to do for different connection modes in the adapter, mainly :odbc and :adonet. Basically a standard way of jamming old-school C interfaces into something AR can cope with. In my case, for those modes, it meant a lot of wasted objects iterating and building new collections. One thing I had to do in my adapter was build an abstraction on top of #select and #execute that passed options down to the raw connection, which uses a result/handle object to build the expected return value. Options like :fetch being either :all, :one, :rows, etc. This [1] might give you a clue as to the Rube Goldberg machine I have now :/ BTW, :dblib is the connection mode for TinyTds.
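
In pseudo-code, that abstraction is something like this (the helper names are made up):

  # Hypothetical sketch of the :fetch dispatch described above.
  def raw_select(sql, options = {})
    handle = raw_connection_run(sql)
    case options[:fetch]
    when :rows then handle_to_rows(handle)          # array of arrays
    when :one  then handle_to_hashes(handle).first  # single hash
    else            handle_to_hashes(handle)        # :all, the default
    end
  end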

One thing I liked with TinyTds is that I have a query option (passed to each, :first => true) where even if you selected * from users, it would only get the first result and cancel all remaining results. Premature optimization, maybe, but it hooks nicely into the AR interface for #select_one, #select_rows, etc. Either way, some direct feedback on that class: it looks a little slim. Where are the options for how the results are returned, like as an array? I know #select_rows would want to pass something down so that a low-level connection can do the legwork of passing down the correct query option so the C code can build the object. I am also wondering if it is pragmatic to initialize an object with everything partitioned too. In Mysql2::Result and TinyTds::Result, these things are a bit lazy. They only spring into existence when you start iterating over them with the query parameters. It could be I was up way too late last night (I was, BTW :) but at first glance it looks like this would be another layer in the way of the #raw_connection doing the legwork. I'd have to see more context and try to build a subclass and see.
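
For example (sketched from the TinyTds interface described above):

  # Only the first row is built; TinyTds cancels the remaining results.
  client.execute("SELECT * FROM [users]").each(:first => true) do |row|
    # one hash, no matter how many rows the query matched
  end

  # :as => :array maps nicely onto AR's #select_rows.
  client.execute("SELECT [id], [name] FROM [users]").each(:as => :array) do |row|
    # row comes back as an array instead of a hash
  end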

> I think it might be a good optimization. However, it seems that this would only be a win in the case where the user fetches many rows but only uses a few of them. I'm wondering how often that happens. What do you think?

My gut feeling is this: AR wants a hash or an array of objects to build other objects. Let's say that is a hash for each of 100 rows; those objects are intermediary and will almost always be GC'ed right away. Why should the low-level connection be required to build that large object first just so it can be iterated over and forgotten about? I look at it like the talk on view flushing. I have no numbers on it; it might be a low gain in small practical usage. But perhaps it has a bigger gain for bulk copies and lean ActiveModels that want to do large migrations. There was a reason #find_each was done in batches.rb. I still think the structure of AR should accommodate this, and my gut feeling is it would be a good performance gain.
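
For context, #find_each in batches.rb works around this at the interface level by issuing many small queries instead of streaming one big one:

  User.find_each(:batch_size => 1000) do |user|
    # only one batch of intermediate hashes is alive at any moment
  end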

BTW, I'm having a hard time keeping up with your changes :) The Arel2 stuff is going to be a ground-up rewrite for SQL Server. It took me 6 weeks last time to pass all the AR tests. Keep it up, but man, you're killing me!!!! :) Let's keep talking; I really think this is a good idea and I would love to see more. Right now I gotta figure out why 1.9.2 on AR 2.3.8 is slower than 1.8.6. What a night.

  • Ken

[1] http://github.com/rails-sqlserver/activerecord-sqlserver-adapter/blob/master/lib/active_record/connection_adapters/sqlserver/database_statements.rb#L282

Well, that's encapsulated in the AR::Result object. Databases usually return an array for the columns and an array of arrays for the rows. I don't think it's a good idea for database drivers to return results as hashes in the first place, because column order is important.

We could use hashes in 1.9 because hashes in 1.9 are ordered, but that is not true for 1.8. That's the motivation for the AR::Result object. It simply holds the column and row information. If someone uses the object as an array, it will return ordered hashes for each row. However, if you just want the row information without columns, just call #rows.
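
So usage is roughly this (sketched from the class as it stands on master):

  result = ActiveRecord::Result.new(%w(id name), [[1, "Ken"], [2, "Aaron"]])

  result.each do |row|
    row # => {"id" => 1, "name" => "Ken"}
  end

  result.rows # => [[1, "Ken"], [2, "Aaron"]] -- no hash work done at all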

> I know #select_rows would want to pass something down so that a low-level connection can do the legwork of passing down the correct query option so the C code can build the object. I am also wondering if it is pragmatic to initialize an object with everything partitioned too. In Mysql2::Result and TinyTds::Result, these things are a bit lazy. They only spring into existence when you start iterating over them with the query parameters. It could be I was up way too late last night (I was, BTW :) but at first glance it looks like this would be another layer in the way of the #raw_connection doing the legwork. I'd have to see more context and try to build a subclass and see.

The other database adapters are "lazy" too. They only create objects as you iterate over the results in the cursor. The old mysql adapter and the sqlite3 adapter return lists. PG returns a hash, though, which is annoying because sometimes you need a list.

The bottom line, though, is that you can compose a hash out of the column and row info. Sometimes you need just the row info without the column info, and in that case (if you're using hashes) you must do work to extract that data. With the AR::Result object, we lazily create the list and cache it.
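
That lazy list is just a memoized zip of columns over rows, approximately:

  # Inside AR::Result, roughly:
  def hash_rows
    @hash_rows ||= rows.map { |row| Hash[columns.zip(row)] }
  end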

> I think it might be a good optimization. However, it seems that this would only be a win in the case where the user fetches many rows but only uses a few of them. I'm wondering how often that happens. What do you think?

> My gut feeling is this: AR wants a hash or an array of objects to build other objects. Let's say that is a hash for each of 100 rows; those objects are intermediary and will almost always be GC'ed right away.

The intermediate hashes are not thrown away. AR::Base holds on to them and you can see that here:

http://github.com/rails/rails/blob/master/activerecord/lib/active_record/base.rb#L1423-1423

At least I think this is correct. ;-)
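Roughly, instantiation keeps the row hash as the model's attribute storage (sketched, not the exact source):

  # AR::Base, approximately:
  def self.instantiate(record)
    object = allocate
    # The "intermediate" row hash lives exactly as long as the model does.
    object.instance_variable_set(:@attributes, record)
    object
  end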

> Why should the low-level connection be required to build that large object first just so it can be iterated over and forgotten about? I look at it like the talk on view flushing. I have no numbers on it; it might be a low gain in small practical usage. But perhaps it has a bigger gain for bulk copies and lean ActiveModels that want to do large migrations. There was a reason #find_each was done in batches.rb. I still think the structure of AR should accommodate this, and my gut feeling is it would be a good performance gain.

I think you could accomplish this with the AR::Result object. Just as pseudo-code:

  class MyAdapter
    class MyResult < AR::Result
      def initialize(cursor)
        @cursor = cursor
      end

      # Raw rows straight off the cursor; no hashes built.
      def rows
        @cursor.walk_over_rows
      end

      # Hashes are only built as each row is yielded.
      def each
        @cursor.each_row { |row| yield build_hash(row) }
      end
    end

    def exec(sql, name, binds)
      MyResult.new(@real_connection.execute(sql))
    end
  end

Note that you defer the creation of extra objects until someone asks for them.
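
So from AR's side, usage of the sketch above would look like:

  result = adapter.exec("SELECT * FROM users", "User Load", [])
  result.each do |hash|
    # each hash springs into existence here, one row at a time
  end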

> BTW, I'm having a hard time keeping up with your changes :) The Arel2 stuff is going to be a ground-up rewrite for SQL Server. It took me 6 weeks last time to pass all the AR tests. Keep it up, but man, you're killing me!!!! :) Let's keep talking; I really think this is a good idea and I would love to see more. Right now I gotta figure out why 1.9.2 on AR 2.3.8 is slower than 1.8.6. What a night.

Good luck! That doesn't sound like fun. :'(
