kourge/dive-into-ruby-enumeration.md

## dive-into-ruby-enumeration.md

      
    Raw
  

              dive-into-ruby-enumeration.md
            
          
    Dive into Ruby enumeration

Enumeration is Ruby's term for iteration, and there are significant differences
in how it is handled within Ruby's existing conventions when compared to other
programming languages. Let us first establish some terms:

Enumerable, a built-in
mixin. Other languages usually call this "iterable".
Enumerator, a built-in
class. Other languages usually call this "iterator".

Refresher: each with a block

The most basic way to iterate over something in Ruby is to call the #each
method on an object, with a block:
[1, 2, 3].each { |x| puts x }
The #each method forms the backbone of enumeration in Ruby. For example, it is
very common to make something enumerable by including the mixin Enumerable in
a class, and the only pre-requisite for using this mixin is implementing the
#each method. This allows the class to get a lot of methods "for free", e.g.
#any? and #all? can all be implemented in terms of #each.
Blocks and yield

Let's say we want to make an object that produces 1, 2, and 3. The easy
way is to write (1..3), but let's do this from scratch:
class Podium
  include Enumerable

  def each
    yield 1
    yield 2
    yield 3
  end
end
Here, we are on the other side of the fence: rather than using #each, we are
now implementing #each. The yield keyword is of interest: it calls the
block that is passed to the method. In other words, the following code is
equivalent to the code above:
class Podium
  include Enumerable

  def each(&blk)
    blk.call(1)
    blk.call(2)
    blk.call(3)
  end
end
The & sigil in &blk binds the block — provided by the caller — to the name
blk. If the caller failed to provide a block, then blk becomes nil. The
following two are equivalent:

yield v if block_given?
blk.call(v) unless blk.nil?

Remember that a block by itself is not an object. Consult
"What is the difference between a block, a proc, and a lambda in Ruby?"
for more information on this topic.

Warning: if you are coming from Python, the yield keyword in Ruby does not
turn a function into a generator. The strict Ruby equivalent of a Python
generator is called a Fiber,
and fibers are usually not used for iteration, unlike their Python
counterparts. Hang tight for a straightforward Ruby analog to Pythonic
iterative generators! We will explore them near the end.

Materialization and enumerators

The #each method effectively divorces how to iterate over a container from
what to do with each iterated element. The method handles the how, and the
given block handles the what. In other words, the method is responsible for
materializing every element inside the container.
What if we only want the very first element from #each, and we do not have
access to other methods, such as #[]? That is where an enumerator comes in.
Starting from Ruby 1.9, the convention is that if we call #each without a
block, then it returns an
enumerator object. We can
then call the #next method on that enumerator to produce just one element. We
can see this behavior in a built-in array:
irb(main):001:0> a = [3, 5, 7]
=> [3, 5, 7]
irb(main):002:0> e = a.each
=> #<Enumerator: [3, 5, 7]:each>
irb(main):003:0> e.next
=> 3

Our Podium class doesn't have this behavior yet, but Ruby provides the
Object#to_enum method on all objects to make it easy to support this behavior:
class Podium
  include Enumerable

  def each
    return to_enum unless block_given?
    yield 1
    yield 2
    yield 3
  end
end
It works exactly the same way:
irb(main):011:0> p = Podium.new
=> #<Podium:0x00007fe189942358>
irb(main):012:0> e = p.each
=> #<Enumerator: #<Podium:0x00007fe189942358>:each>
irb(main):013:0> e.next
=> 1

A deeper question is: why would we want to materialize one element at a time?
The value lies in space efficiency: by supporting one-at-a-time materialization,
we can operate in a low-memory environment without fear of running out of memory
when given an enormous data source.
Aside: #to_proc

In Ruby, the #to_proc method is the convention for converting something into a
callable. For example, the built-in type Symbol implements #to_proc, so that
we can abbreviate { |x| x.odd? } into :odd?.to_proc. This process is so
common that the unary operator & does the exact same thing, allowing us to
further abbreviate :odd?.to_proc as &:odd.
The choice of the punctuation & visually connects to writing &blk in an
arguments list, where the meaning of & similarly means "capture a block as a
proc".
For a deeper dive, consult
"How Does Symbol#to_proc Work?".
Operation chaining: intermediate data structures

A common idiom in Ruby is to chain multiple enumerable methods in order to
produce a list of results. For example, if we want to take the numbers from 0 to
99 (inclusive), add one to each of them, pick only the odd ones, and sum them
up, we could write:
(0..99)
  .map(&:succ)
  .select(&:odd?)
  .reduce(0, &:+)
The (0..99) is syntax sugar for constructing a
Range object. The desugared form
is Range.new(0, 99). This chain of methods looks great until we look at what
each step of the process does:
irb(main):001:0> (0..99).map(&:succ)
=> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100]

It turns out that each call produces an entirely new array! If our data source
had a million elements, we would be creating an additional, intermediate
million-long array for each additional method we add to the chain.
In the case of arrays, we can cheat this cost away by using mutative methods:
(0..99)
  .to_a
  .map!(&:succ)
  .select!(&:odd?)
  .reduce(0, &:+)
Here, we first create a new array from (0..99) using #to_a, which guarantees
us access to those mutative methods. But this forces us to remember when
mutative methods are available, and which ones are possible. The cleaner
solution should be that every method call avoids immediately doing any work
until we actually want the result!
Aside: ActiveRecord prior art

In Rails, when dealing with data models, ActiveRecord actually does
avoid doing any work until you want the result.
Every method you call gives back a relation object, which records the SQL query
it has built up so far. It's not until you try to use a model object that the
query gets executed!
Operation chaining: removing intermediates

It turns out that there is a method called Enumerator#lazy that gives us the
ability to eliminate intermediates for free:
(0..99)
  .each
  .lazy
  .map(&:succ)
  .select(&:odd?)
  .reduce(0, &:+)
The call above does not build any intermediate arrays. This is so common, in
fact, that another convention is that if an object supports returning an
enumerator from #each, it also supports calling #lazy as a shorthand for
.each.lazy:
(0..99)
  .lazy
  .map(&:succ)
  .select(&:odd?)
  .reduce(0, &:+)
We can now apply the same behavior to podium:
class Podium
  include Enumerable

  def each
    return to_enum unless block_given?
    yield 1
    yield 2
    yield 3
  end

  def lazy
    each.lazy
  end
end
We can verify that it's not until we actually try to use each value that those
values are calculated:
irb(main):015:0> p = Podium.new
=> #<Podium:0x00007fb64cf13ce0>
irb(main):016:0> p.lazy.map(&:succ).reject(&:odd?)
=> #<Enumerator::Lazy: #<Enumerator::Lazy: #<Enumerator::Lazy: #<Enumerator: #<Podium:0x00007fb64cf13ce0>:each>>:map>:reject>
irb(main):017:0> p.lazy.map(&:succ).reject(&:odd?).each { |i| puts i }
2
4
=> nil

The Enumerator::Lazy class, used to implement Enumerator#lazy, also supports
a #force method that lets use force an eager evaluation all at once:
irb(main):018:0> p.lazy.map(&:succ).reject(&:odd?).force
=> [2, 4]

Its behavior is the same as calling #to_a.
Converted enumerator vs enumerator from scratch

So far we've built an enumerator by having an object wrap itself in an
enumerator. It turns out there is a second way to do this: by passing a block to
Enumerator.new:
def podium
  Enumerator.new do |yielder|
    yielder.yield 1
    yielder.yield 2
    yielder.yield 3
  end
end
It can be used in exactly the same way:
irb(main):008:0> p = podium
=> #<Enumerator: #<Enumerator::Generator:0x00007fdaa15e9d08>:each>
irb(main):009:0> p.lazy.map(&:succ).reject(&:odd?)
=> #<Enumerator::Lazy: #<Enumerator::Lazy: #<Enumerator::Lazy: #<Enumerator: #<Enumerator::Generator:0x00007fdaa15e9d08>:each>>:map>:reject>
irb(main):010:0> p.lazy.map(&:succ).reject(&:odd?).each { |i| puts i }
2
4
=> nil

Making an enumerator from scratch is very handy if we find ourselves not needing
a class.

If you're coming from Python, the block form of Enumerator.new is the
closest Ruby analog to Python generators. In fact, the low-level way to use an
enumerator works almost exactly the same way:
irb(main):008:0> p = podium
=> #<Enumerator: #<Enumerator::Generator:0x00007fa2021f2b78>:each>
irb(main):009:0> p.next
=> 1
irb(main):010:0> p.next
=> 2
irb(main):011:0> p.next
=> 3
irb(main):012:0> p.next
Traceback (most recent call last):
        2: from (irb):12
        1: from (irb):12:in `next'
StopIteration (iteration reached an end)

Enumerators in Ruby are also powered by fibers, which brings us to the
question: when should one be used over the other? If the intent is to make
something iterable, then enumerators are a better choice, because they include
the Enumerable mixin by default. Otherwise, if the purpose is to use
coroutines as a way to perform flow control, then fibers provide a much more
direct API.