I optimize bulk anagram record creation from ~15 minutes (using the "naive" ActiveRecord approach) to ~10 seconds.
I'm making an app to practice for anagram games (think Scrabble, Words with Friends, etc.). Given a list of words, I need a way to quickly look up anagrams that exist as words in that list.
I'll be using the TWL06 (tournament word list used for Scrabble in the US and Canada), which has ~180k words, stored as a .txt file separated with line breaks.
For example, given the word "parsec" I want to find all anagrams that are words in the TWL06 ("crapes", "capers", "recaps", etc.)
- Background: Anagram lookup
- Background: Building an in-memory anagram index
- Rails, ActiveRecord, SQL anagram finding
- Notes
To determine whether two words are anagrams, I sort each word's letters alphabetically and compare. If two words have the same set of sorted letters, they are anagrams of each other.
The words "capers" and "parsec" are anagrams:
> 'capers'.chars.sort.join #=> 'aceprs'
> 'parsec'.chars.sort.join #=> 'aceprs'
These sorted strings of letters are called "alphagrams." We can get the alphagram of a word by just sorting chars:
def alphagram(word)
  word.chars.sort.join
end
I have a way to determine if two words are anagrams, so now let's find anagrams given a particular word.
One way we could approach this is to check every word in the list, something like this:
- get the alphagram of our 'target' word
- loop over all words in our list
- check if each word's alphagram is a match for our target word
- if so, add that to our list of anagrams
def find_anagrams(file_path, word)
  target_alphagram = alphagram(word)
  found_anagrams = []
  File.readlines(file_path).each do |line|
    word_text = line.chomp # remove newline characters
    if target_alphagram == alphagram(word_text)
      found_anagrams << word_text
    end
  end
  found_anagrams
end
def alphagram(word)
  word.chars.sort.join
end
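A self-contained usage sketch (the five-word list and temp-file setup are made up for illustration):

```ruby
require 'tempfile'

def alphagram(word)
  word.chars.sort.join
end

def find_anagrams(file_path, word)
  target_alphagram = alphagram(word)
  found_anagrams = []
  File.readlines(file_path).each do |line|
    word_text = line.chomp # remove newline characters
    found_anagrams << word_text if target_alphagram == alphagram(word_text)
  end
  found_anagrams
end

# Write a tiny, made-up word list to a temp file
file = Tempfile.new('toy_words')
file.write(%w[capers parsec scrape pots stop].join("\n"))
file.flush

p find_anagrams(file.path, 'recaps') # => ["capers", "parsec", "scrape"]
```

Note that the target word doesn't itself need to appear in the list for the lookup to work.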
This works, but the algorithm traverses the entire list of n words, meaning lookup is O(n). In practice, for our list of ~180k words this means the following benchmark times:
> puts Benchmark.measure { find_anagrams('lib/word_lists/twl06.txt', 'parsec') }
0.750808 0.010643 0.761451 ( 0.802890)
Almost a whole second to look up anagrams for a single word is unacceptable for a request in production. I need a better way.
To get faster lookup, I built an "index" hash called anagrams. The key is an alphagram (our sorted string of letters) and the value is an array (all words in the list you can make with those letters).
> anagrams['aceprs']
#=> ["capers", "crapes", "escarp", "pacers", "parsec", "recaps", "scrape", "secpar", "spacer"]
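In miniature, building such an index looks like this (using a made-up five-word list rather than the full TWL06):

```ruby
toy_words = %w[capers parsec scrape pots stop]

# Default each missing key to a new, distinct array
anagrams = Hash.new { |hash, key| hash[key] = [] }
toy_words.each { |word| anagrams[word.chars.sort.join] << word }

p anagrams['aceprs'] # => ["capers", "parsec", "scrape"]
p anagrams['opst']   # => ["pots", "stop"]
```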
Here, lookup is going to be very fast regardless of the number of records, because hash lookup in Ruby is O(1).
> puts Benchmark.measure { anagrams['aceprs'] }
0.000008 0.000003 0.000011 ( 0.000006)
Clearly, this is better. So how did I build it?
First I made a simple PORO to store our words and anagrams. I initialize a WordList from a given file path. To build a hash like the one above, it iterates through all the words in the word list, adding each word to the array under its corresponding alphagram key.
class WordList
  attr_accessor :words
  attr_accessor :anagrams

  def initialize(file_path)
    @anagrams = Hash.new([])
    @words = File.readlines(file_path).map do |line|
      word = line.chomp # remove newline characters
      @anagrams[alphagram(word)] += [word]
      word
    end
  end

  def alphagram(word)
    word.chars.sort.join
  end
end
Now we can initialize the object and retrieve anagrams directly from the hash using alphagrams:
> wl = WordList.new('twl06.txt')
> wl.words.count
#=> 178691
> wl.anagrams.count
#=> 161019
> wl.anagrams[alphagram('parsec')]
#=> ["capers", "crapes", "escarp", "pacers", "parsec", "recaps", "scrape", "secpar", "spacer"]
> puts Benchmark.measure { wl = WordList.new('twl06.txt') }
1.052380 0.051231 1.103611 ( 1.134766)
This takes just a bit longer than a single lookup in our brute-force scheme above, and the build time will also grow O(n). However, we only need to do this once; once we have our WordList object, hash lookups are almost free.
This speed does come at a price, because we need to keep the word list and anagrams hash around in memory all the time.
> wl = nil
> puts MemoryProfiler.report { wl = WordList.new('twl06.txt') }.total_retained_memsize
37130656
(I'm interested in retained memory here; see this thoughtbot article for a good writeup on this.)
At first 37.1 MB might not seem like much, but I've dealt with enough performance/memory problems in Rails apps to know that practically this is more than we want sitting around all the time.
Next, I'll look into the more practical problem of efficiently storing and retrieving anagrams in a database with Rails and ActiveRecord.
I made two models to store words, alphagrams, and their relationship. Like this:
 ------------                --------------
| alphagrams |             | words        |
|------------|            /|--------------|
| text       |------------| text         |
| created_at |            \| alphagram_id |
| updated_at |             | created_at   |
 ------------              | updated_at   |
                            --------------
class Alphagram < ApplicationRecord
  has_many :words
end

class Word < ApplicationRecord
  belongs_to :alphagram
end
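For reference, the migration behind this schema would look something like the sketch below (the column types, index, and Rails version tag are my assumptions, not taken from an actual migration file):

```ruby
class CreateAlphagramsAndWords < ActiveRecord::Migration[5.2]
  def change
    create_table :alphagrams do |t|
      t.string :text
      t.timestamps
    end
    # index assumed, for fast lookups by alphagram text
    add_index :alphagrams, :text

    create_table :words do |t|
      t.string :text
      t.references :alphagram, foreign_key: true
      t.timestamps
    end
  end
end
```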
Now I'll make a modified version of WordList to populate these tables instead of just creating an object in memory.
The most straightforward way of doing this with ActiveRecord is to just loop through all the words, and create the appropriate records:
class WordList
  def initialize(file_path)
    File.readlines(file_path).each do |line|
      word_text = line.chomp
      ag = Alphagram.find_or_create_by(text: alphagram(word_text))
      Word.create(text: word_text, alphagram: ag)
    end
  end

  def alphagram(word)
    word.chars.sort.join
  end
end
So how does this perform compared to my 1s build above?
> puts Benchmark.measure { WordList.new('twl06.txt') }
501.579515 283.982253 785.561768 (869.650190)
Almost 15 minutes!
This works out to a per-word clock time of ~4.8ms.
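That per-word figure is just the benchmark's total clock time divided by the number of words:

```ruby
total_seconds = 869.650190   # clock time from the benchmark above
word_count    = 178_691      # words in the TWL06 list
per_word_ms   = total_seconds / word_count * 1000
puts per_word_ms.round(2)    # prints 4.87
```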
In reality I might not have to worry about this problem too much -- in production I'll only have to build the word list once. But right now I'm experimenting, trying different approaches with different word lists, and I don't want to wait 10-15 minutes every time. Given how fast the hash approach above is, I should be able to get this down to a reasonable amount of time for a single (albeit somewhat long) request.
Let's look at the SQL being executed for a single word:
Alphagram Load (0.1ms) SELECT "alphagrams".* FROM "alphagrams" WHERE "alphagrams"."text" = ? LIMIT ? [["text", "aabeltt"], ["LIMIT", 1]]
(0.0ms) begin transaction
Alphagram Create (0.4ms) INSERT INTO "alphagrams" ("text", "created_at", "updated_at") VALUES (?, ?, ?) [["text", "aabeltt"], ["created_at", "2019-07-01 08:22:59.814528"], ["updated_at", "2019-07-01 08:22:59.814528"]]
(0.7ms) commit transaction
(0.0ms) begin transaction
Word Create (0.4ms) INSERT INTO "words" ("text", "alphagram_id", "created_at", "updated_at") VALUES (?, ?, ?, ?) [["text", "abettal"], ["alphagram_id", 161545], ["created_at", "2019-07-01 08:22:59.817320"], ["updated_at", "2019-07-01 08:22:59.817320"]]
(0.7ms) commit transaction
Right off the bat I can see it's creating a separate transaction for each create. Let's wrap each word's statements in a single transaction:
class WordList
  def initialize(file_path)
    File.readlines(file_path).each do |line|
      ActiveRecord::Base.transaction do
        word_text = line.chomp
        ag = Alphagram.find_or_create_by(text: alphagram(word_text))
        Word.create(text: word_text, alphagram: ag)
      end
    end
  end

  def alphagram(word)
    word.chars.sort.join
  end
end
Now the SQL for a single word looks like:
(0.1ms) begin transaction
Alphagram Load (0.2ms) SELECT "alphagrams" ...
Alphagram Create (0.3ms) INSERT INTO "alphagrams" ...
Word Create (0.2ms) INSERT INTO "words" ...
(0.8ms) commit transaction
When benchmarked, this gives a per-word clock time of ~3.8ms. A little better.
But we still have a transaction for every word. Let's try one transaction for all the words:
class WordList
  def initialize(file_path)
    ActiveRecord::Base.transaction do # I moved this line up
      File.readlines(file_path).each do |line|
        word_text = line.chomp
        ag = Alphagram.find_or_create_by(text: alphagram(word_text))
        Word.create(text: word_text, alphagram: ag)
      end
    end
  end

  def alphagram(word)
    word.chars.sort.join
  end
end
Now we have one transaction creating all words and alphagrams:
(0.1ms) begin transaction
Alphagram Load (0.2ms) SELECT "alphagrams" ...
Alphagram Create (1.5ms) INSERT INTO "alphagrams" ...
Word Create (0.3ms) INSERT INTO "words" ...
Alphagram Load (0.1ms) SELECT "alphagrams" ...
Alphagram Create (0.1ms) INSERT INTO "alphagrams" ...
Word Create (0.1ms) INSERT INTO "words" ...
Alphagram Load (0.1ms) SELECT "alphagrams" ...
Alphagram Create (1.1ms) INSERT INTO "alphagrams" ...
Word Create (0.1ms) INSERT INTO "words" ...
...
(1.2ms) commit transaction
Which brings us down to a per-word clock time of ~2.0ms.
This is pretty good -- less than half of where we started -- but ActiveRecord is still executing thousands of separate SQL statements. SQL can be a lot faster than this, but to take advantage of that speed I'm going to have to write a little SQL myself.
The approach here needs to be a little different. I'm going to use SQL INSERTs to create all the Word and Alphagram records in just two statements, one per table. Because a Word needs an alphagram_id to have a proper belongs_to relationship, we first need to create all the Alphagram records before creating the Word records.

In both cases, I'm building one big array of strings to pass to the VALUES clause of the INSERT:
class WordList
  def initialize(file_path)
    # first read all records into memory
    alphagrams_hash = Hash.new([])
    words = File.readlines(file_path).map do |line|
      word = line.chomp
      alphagrams_hash[alphagram(word)] += [word]
      word
    end

    time = Time.now

    # build alphagram value strings
    # (interpolating directly is only OK because the words contain nothing but letters)
    alphagrams_for_db = alphagrams_hash.keys.map do |ag_text|
      "('#{ag_text}','#{time}','#{time}')"
    end.join(',')

    # INSERT alphagrams
    ActiveRecord::Base.connection.execute(
      "INSERT INTO alphagrams(text,created_at,updated_at) VALUES #{alphagrams_for_db}"
    )

    # build word value strings
    words_for_db = words.map do |word_text|
      alphagram_id = Alphagram.find_by(text: alphagram(word_text)).id
      "('#{word_text}',#{alphagram_id},'#{time}','#{time}')"
    end.join(',')

    # INSERT words
    ActiveRecord::Base.connection.execute(
      "INSERT INTO words(text,alphagram_id,created_at,updated_at) VALUES #{words_for_db}"
    )
  end

  def alphagram(word)
    word.chars.sort.join
  end
end
Now the SQL looks like this: two big INSERTs, with a lot of SELECTs in the middle to fetch the alphagram_ids:
INSERT INTO alphagrams(text,created_at,updated_at) VALUES ('aa','2019-07-01 12:36:04 +0200','2019-07-01 12:36:04 +0200'),('aah','2019-07-01 12:36:04 +0200','2019-07-01 12:36:04 +0200'),('aadeh','2019-07-01 12:36:04 +0200','2019-07-01 12:36:04 +0200') ...
Alphagram Load (1.1ms) SELECT "alphagrams".* FROM "alphagrams" WHERE "alphagrams"."text" = ? LIMIT ? [["text", "aa"], ["LIMIT", 1]]
Alphagram Load (0.1ms) SELECT "alphagrams".* FROM "alphagrams" WHERE "alphagrams"."text" = ? LIMIT ? [["text", "aah"], ["LIMIT", 1]]
Alphagram Load (0.1ms) SELECT "alphagrams".* FROM "alphagrams" WHERE "alphagrams"."text" = ? LIMIT ? [["text", "aadeh"], ["LIMIT", 1]]
Alphagram Load (0.1ms) SELECT "alphagrams".* FROM "alphagrams" WHERE "alphagrams"."text" = ? LIMIT ? [["text", "aaghin"], ["LIMIT", 1]]
...
(4.6ms) INSERT INTO words(text,alphagram_id,created_at,updated_at) VALUES ('aa',163662,'2019-07-01 12:36:04 +0200','2019-07-01 12:36:04 +0200'),('aah',163663,'2019-07-01 12:36:04 +0200','2019-07-01 12:36:04 +0200'),('aahed',163664,'2019-07-01 12:36:04 +0200','2019-07-01 12:36:04 +0200') ...
This gets us down to a per-word clock time of ~0.18ms. Nice!
There is one more SQL optimization I can see here, which comes from the fact that we're still doing an individual lookup for each word's corresponding alphagram_id on the line:
alphagram_id = Alphagram.find_by(text: alphagram(word_text)).id
I can eliminate these individual lookups by loading all the newly created Alphagram records with a single SELECT before I create the words, then building a hash:
alphagram_ids = {}
Alphagram.all.each do |ag|
  alphagram_ids[ag.text] = ag.id
end
Now, when setting up the word values, I can get the alphagram_id without a SELECT statement for every record:
alphagram_id = alphagram_ids[alphagram(word_text)]
This last change doesn't save much -- just the ~0.1-0.2ms per word we were spending on individual SELECT statements to load each alphagram_id.
Finally, I get a per-word clock time of ~0.05ms.
> puts Benchmark.measure { WordList.new('lib/word_lists/twl06.txt') }
8.319700 0.555217 8.874917 ( 9.011809)
That's more like it: all 180k words parsed into anagrams and stored in the DB in under 10 seconds.
Now that I have all the records in the database, a quick note about retrieval performance. With the models I made above, finding anagrams is a matter of just looking up the appropriate Alphagram and reading its associated Words:
Alphagram.find_by(text: 'aceprs').words.pluck(:text)
#=> ["capers", "crapes", "escarp", "pacers", "parsec", "recaps", "scrape", "secpar", "spacer"]
This executes two SELECT statements, giving a pretty speedy lookup:
> puts Benchmark.measure { Alphagram.find_by(text: 'aceprs').words.pluck(:text) }
0.000912 0.000080 0.000992 ( 0.001290)
This isn't quite as fast as my in-memory hash lookup above, and won't be O(1) either -- more like O(log n) with a database index on alphagrams.text. Still, this is plenty fast to execute (even for a much larger list of words) within a single request.
- For all benchmarking I turned off logging in the console (conf.echo = false and ActiveRecord::Base.logger = nil).
- For performance measurements I used Ruby's built-in Benchmark, and for memory I used the memory_profiler gem.
- I used a local SQLite DB for all the above queries -- I may replicate this on a proper Postgres instance when the app is a bit more developed.
- This was not scientific at all -- I repeated each measurement 2 or 3 times and took averages, but don't claim any degree of statistical significance here.