@metaskills
Created January 9, 2012 12:55
Re: ActiveRecord Performance: Adapter Yielded Non-Cached Rows (October 2010)

Aaron,

This is Ken Collins (metaskills). I finally finished up the TinyTds [1] gem integration into the SQL Server adapter as a low-level connection mode and have some free time. I modeled that gem after Brian's Mysql2 gem interface, meaning that, among other things, it lazily loads each row right from the wire and can disable row caching.
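
To give you a concrete picture, usage looks roughly like this (a sketch, not exact code):

  require 'tiny_tds'

  client = TinyTds::Client.new(:username => 'sa', :password => '...', :host => 'mydb')
  result = client.execute("SELECT * FROM [users]")

  # With row caching off, each row is read from the wire as the block
  # runs instead of being buffered into one big array first.
  result.each(:cache_rows => false) do |row|
    # row is a Hash for this one record only
  end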

I want to explore a possible ActiveRecord performance optimization that takes advantage of this and I'm looking for advice and a bit of your mentoring. Here are my initial "raw" thoughts and questions below. I have sent two diffs [2] for reference too.

It would be great if this change was done in such a way that it just worked when the connection allowed. The ideal goal would be to allow something like this:

User.all.each do |user|
  # Each user comes across the wire lazily.
  # The adapter does not need to load up "n" hashes and save them all.
end

There are places in ActiveRecord we could fall back to for an interface/behavior change if this low-level implementation idea failed, like batches.rb. Being an adapter author, I am usually pretty good at finding the right places, but I got stumped last night on the feasibility of this idea given the ActiveRecord::Relation#to_a method. It seems to me the way it is done would require a few fundamental changes. Thoughts on this and the general idea?
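
For reference, #to_a is roughly this today (paraphrased from memory, not the exact source):

  # ActiveRecord::Relation, approximately:
  def to_a
    # Every row is fetched and instantiated before the caller sees the
    # first record, so there is nothing left to stream lazily.
    @records ||= @klass.find_by_sql(arel.to_sql)
  end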

  • Ken

[1] http://github.com/rails-sqlserver/tiny_tds
[2] https://gist.github.com/bc33adb7d40dac1d0fbc

Hey Ken,

I've been thinking about this problem for a while. Brian mentioned wanting to do something like this. I've merged prepared statement caching, and part of that work is that I want to start wrapping results in a consistent object:

http://github.com/rails/rails/blob/master/activerecord/lib/active_record/result.rb#L1-30

Right now, it's a dummy object that just contains the rows and column info. I was thinking we can have database-specific subclasses. The database-specific subclasses could use whatever data structure they want. For example, SQLite3 can keep a cursor inside that object and only read rows when they're requested.

I think it might be a good optimization. However, it seems that this would only be a win in the case where the user fetches many rows but only uses a few of them. I'm wondering how often that happens. What do you think?

-- Aaron Patterson http://tenderlovemaking.com/

Hey Aaron,

> I've been thinking about this problem for a while. Brian mentioned wanting to do something like this. I've merged prepared statement caching, and part of that work is that I want to start wrapping results in a consistent object:
>
> http://github.com/rails/rails/blob/master/activerecord/lib/active_record/result.rb#L1-30
>
> Right now, it's a dummy object that just contains the rows and column info. I was thinking we can have database-specific subclasses. The database-specific subclasses could use whatever data structure they want. For example, SQLite3 can keep a cursor inside that object and only read rows when they're requested.

Interesting, that subclass reminds me of what I would have to do for different connection modes in the adapter, mainly :odbc and :adonet. Basically a standard way of jamming old-school C interfaces into something AR can cope with. In my case, for those modes, it meant a lot of wasted objects iterating and building new collections. One thing I had to do in my adapter was build an abstraction on top of #select and #execute that passed options down to the raw connection, which uses a result/handle object to build the expected return value. Options like :fetch being either :all, :one, :rows, etc. This [1] might give you a clue as to the Rube Goldberg machine I have now :/ BTW, :dblib is the connection mode for TinyTds.
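
In pseudo-code, that abstraction is something like this (the helper names are made up):

  # Hypothetical sketch of the :fetch dispatch described above.
  def raw_select(sql, options = {})
    handle = raw_connection_run(sql)
    case options[:fetch]
    when :rows then handle_to_rows(handle)          # array of arrays
    when :one  then handle_to_hashes(handle).first  # single hash
    else            handle_to_hashes(handle)        # :all, the default
    end
  end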

One thing I liked with TinyTds is that I have a query option (passed to each, :first => true) where even if you selected * from users, it would only get the first result and cancel all remaining results. Premature optimization, maybe, but it hooks nicely into the AR interface for #select_one, #select_rows, etc. Either way, some direct feedback on that class: it looks a little slim. Where are the options for how the results are returned, like as an array? I know #select_rows would want to pass something down so that a low-level connection can do the legwork of passing down the correct query option so the C code can build the object. I am also wondering if it is pragmatic to initialize an object with everything partitioned too. In Mysql2::Result and TinyTds::Result, these things are a bit lazy. They only spring into existence when you start iterating over them with the query parameters. It could be I was up way too late last night (I was, BTW :) but at first glance it looks like this would be another layer in the way of the #raw_connection doing the legwork. I'd have to see more context and try to build a subclass and see.
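
For example (sketched from the TinyTds interface described above):

  # Only the first row is built; TinyTds cancels the remaining results.
  client.execute("SELECT * FROM [users]").each(:first => true) do |row|
    # one hash, no matter how many rows the query matched
  end

  # :as => :array maps nicely onto AR's #select_rows.
  client.execute("SELECT [id], [name] FROM [users]").each(:as => :array) do |row|
    # row comes back as an array instead of a hash
  end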

> I think it might be a good optimization. However, it seems that this would only be a win in the case where the user fetches many rows but only uses a few of them. I'm wondering how often that happens. What do you think?

My gut feeling is this: AR wants a hash or an array of objects to build other objects. Let's say that is a hash for each of 100 rows; those objects are intermediary and will almost always be GC'ed right away. Why should the low-level connection be required to build that large object first just so it can be iterated over and forgotten about? I look at it like the talk on view flushing. I have no numbers on it; it might be a low gain in small practical usage. But perhaps it has a bigger gain for bulk copies and lean ActiveModels that want to do large migrations. There was a reason #find_each was done in batches.rb. I still think the structure of AR should accommodate this, and my gut feeling is it would be a good performance gain.
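
For context, #find_each in batches.rb works around this at the interface level by issuing many small queries instead of streaming one big one:

  User.find_each(:batch_size => 1000) do |user|
    # only one batch of intermediate hashes is alive at any moment
  end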

BTW, I'm having a hard time keeping up with your changes :) The Arel2 stuff is going to be a ground-up rewrite for SQL Server. It took me 6 weeks last time to pass all the AR tests. Keep it up, but man, you're killing me!!!! :) Let's keep talking; I really think this is a good idea and I would love to see more. Right now I gotta figure out why 1.9.2 on AR 2.3.8 is slower than 1.8.6. What a night.

  • Ken

[1] http://github.com/rails-sqlserver/activerecord-sqlserver-adapter/blob/master/lib/active_record/connection_adapters/sqlserver/database_statements.rb#L282

Well, that's encapsulated in the AR::Result object. Databases usually return an array for the columns and an array of arrays for the rows. I don't think it's a good idea for database drivers to return results as hashes in the first place, because column order is important.

We could use hashes in 1.9 because hashes in 1.9 are ordered, but that is not true for 1.8. That's the motivation for the AR::Result object. It simply holds the column and row information. If someone uses the object as an array, it will return ordered hashes for each row. However, if you just want the row information without columns, just call #rows.
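
So usage is roughly this (sketched from the class as it stands on master):

  result = ActiveRecord::Result.new(%w(id name), [[1, "Ken"], [2, "Aaron"]])

  result.each do |row|
    row # => {"id" => 1, "name" => "Ken"}
  end

  result.rows # => [[1, "Ken"], [2, "Aaron"]] -- no hash work done at all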

> I know #select_rows would want to pass something down so that a low-level connection can do the legwork of passing down the correct query option so the C code can build the object. I am also wondering if it is pragmatic to initialize an object with everything partitioned too. In Mysql2::Result and TinyTds::Result, these things are a bit lazy. They only spring into existence when you start iterating over them with the query parameters. It could be I was up way too late last night (I was, BTW :) but at first glance it looks like this would be another layer in the way of the #raw_connection doing the legwork. I'd have to see more context and try to build a subclass and see.

The other database adapters are "lazy" too. They only create objects as you iterate over the results in the cursor. The old mysql adapter and the sqlite3 adapter return lists. PG returns a hash, though, which is annoying because sometimes you need a list.

The bottom line, though, is that you can compose a hash out of the column and row info. Sometimes you need just the row info without the column info, and in that case (if you're using hashes) you must do work to extract that data. With the AR::Result object, we lazily create the list and cache it.
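
That lazy list is just a memoized zip of columns over rows, approximately:

  # Inside AR::Result, roughly:
  def hash_rows
    @hash_rows ||= rows.map { |row| Hash[columns.zip(row)] }
  end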

> I think it might be a good optimization. However, it seems that this would only be a win in the case where the user fetches many rows but only uses a few of them. I'm wondering how often that happens. What do you think?

> My gut feeling is this: AR wants a hash or an array of objects to build other objects. Let's say that is a hash for each of 100 rows; those objects are intermediary and will almost always be GC'ed right away.

The intermediate hashes are not thrown away. AR::Base holds on to them and you can see that here:

http://github.com/rails/rails/blob/master/activerecord/lib/active_record/base.rb#L1423-1423

At least I think this is correct. ;-)
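Roughly, instantiation keeps the row hash as the model's attribute storage (sketched, not the exact source):

  # AR::Base, approximately:
  def self.instantiate(record)
    object = allocate
    # The "intermediate" row hash lives exactly as long as the model does.
    object.instance_variable_set(:@attributes, record)
    object
  end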

> Why should the low-level connection be required to build that large object first just so it can be iterated over and forgotten about? I look at it like the talk on view flushing. I have no numbers on it; it might be a low gain in small practical usage. But perhaps it has a bigger gain for bulk copies and lean ActiveModels that want to do large migrations. There was a reason #find_each was done in batches.rb. I still think the structure of AR should accommodate this, and my gut feeling is it would be a good performance gain.

I think you could accomplish this with the AR::Result object. Just as pseudo-code:

  class MyAdapter
    class MyResult < AR::Result
      def initialize(cursor)
        @cursor = cursor
      end

      # Raw rows straight off the cursor; no hashes built.
      def rows
        @cursor.walk_over_rows
      end

      # Hashes are only built as each row is yielded.
      def each
        @cursor.each_row { |row| yield build_hash(row) }
      end
    end

    def exec(sql, name, binds)
      MyResult.new(@real_connection.execute(sql))
    end
  end

Note that you defer the creation of extra objects until someone asks for them.
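
So from AR's side, usage of the sketch above would look like:

  result = adapter.exec("SELECT * FROM users", "User Load", [])
  result.each do |hash|
    # each hash springs into existence here, one row at a time
  end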

> BTW, I'm having a hard time keeping up with your changes :) The Arel2 stuff is going to be a ground-up rewrite for SQL Server. It took me 6 weeks last time to pass all the AR tests. Keep it up, but man, you're killing me!!!! :) Let's keep talking; I really think this is a good idea and I would love to see more. Right now I gotta figure out why 1.9.2 on AR 2.3.8 is slower than 1.8.6. What a night.

Good luck! That doesn't sound like fun. :'(
