I optimize bulk anagram record creation from ~15 minutes (using the "naive" ActiveRecord approach) to ~10 seconds.
I'm making an app to practice for anagram games (think Scrabble, Words with Friends, etc.). Given a list of words, I need a way to quickly look up anagrams that exist as words in that list.
I'll be using the TWL06 (tournament word list used for Scrabble in the US and Canada), which has ~180k words, stored as a .txt file separated with line breaks.
For example, given the word "parsec" I want to find all anagrams that are words in the TWL06 ("crapes", "capers", "recaps", etc.)
- Background: Anagram lookup
- Background: Building an in-memory anagram index
- Rails, ActiveRecord, SQL anagram finding
- Notes
To determine whether two words are anagrams, I sort each word's letters alphabetically and compare. If two words have the same set of sorted letters, they are anagrams of each other.
The words "capers" and "parsec" are anagrams:
> 'capers'.chars.sort.join #=> 'aceprs'
> 'parsec'.chars.sort.join #=> 'aceprs'
These sorted strings of letters are called "alphagrams." We can get the alphagram of a word by just sorting chars:
def alphagram(word)
  word.chars.sort.join
end
I have a way to determine if two words are anagrams, so now let's find anagrams given a particular word.
One way we could approach this is to check every word in the list, something like this:
- get the alphagram of our 'target' word
- loop over all words in our list
- check if each word's alphagram is a match for our target word
- if so, add that to our list of anagrams
def find_anagrams(file_path, word)
  target_alphagram = alphagram(word)
  found_anagrams = []
  File.readlines(file_path).each do |line|
    word_text = line.chomp # remove newline characters
    if target_alphagram == alphagram(word_text)
      found_anagrams << word_text
    end
  end
  found_anagrams
end
def alphagram(word)
  word.chars.sort.join
end
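A self-contained usage sketch (the five-word list and temp-file setup are made up for illustration):

```ruby
require 'tempfile'

def alphagram(word)
  word.chars.sort.join
end

def find_anagrams(file_path, word)
  target_alphagram = alphagram(word)
  found_anagrams = []
  File.readlines(file_path).each do |line|
    word_text = line.chomp # remove newline characters
    found_anagrams << word_text if target_alphagram == alphagram(word_text)
  end
  found_anagrams
end

# Write a tiny, made-up word list to a temp file
file = Tempfile.new('toy_words')
file.write(%w[capers parsec scrape pots stop].join("\n"))
file.flush

p find_anagrams(file.path, 'recaps') # => ["capers", "parsec", "scrape"]
```

Note that the target word doesn't itself need to appear in the list for the lookup to work.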
This works, but the algorithm traverses the entire list of n words, meaning lookup is O(n). In practice, for our list of ~180k words this means the following benchmark times:
> puts Benchmark.measure { find_anagrams('lib/word_lists/twl06.txt', 'parsec') }
0.750808 0.010643 0.761451 ( 0.802890)
Almost a whole second to look up anagrams for a single word is unacceptable for a request in production. I need a better way.
To get faster lookup, I built an "index" hash called anagrams. The key is an alphagram (our sorted string of letters) and the value is an array (all words in the list you can make with those letters).
> anagrams['aceprs']
#=> ["capers", "crapes", "escarp", "pacers", "parsec", "recaps", "scrape", "secpar", "spacer"]
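In miniature, building such an index looks like this (using a made-up five-word list rather than the full TWL06):

```ruby
toy_words = %w[capers parsec scrape pots stop]

# Default each missing key to a new, distinct array
anagrams = Hash.new { |hash, key| hash[key] = [] }
toy_words.each { |word| anagrams[word.chars.sort.join] << word }

p anagrams['aceprs'] # => ["capers", "parsec", "scrape"]
p anagrams['opst']   # => ["pots", "stop"]
```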
Here, lookup is going to be very fast regardless of the number of records, because hash lookup in Ruby is O(1).
> puts Benchmark.measure { anagrams['aceprs'] }
0.000008 0.000003 0.000011 ( 0.000006)
Clearly, this is better. So how did I build it?
First I made a simple PORO to store our words and anagrams. I initialize a WordList from a given file path. To build a hash like the one above, it iterates through all the words in the word list, adding each word to the array under its corresponding alphagram key.
class WordList
  attr_accessor :words
  attr_accessor :anagrams

  def initialize(file_path)
    @anagrams = Hash.new([])
    @words = File.readlines(file_path).map do |line|
      word = line.chomp # remove newline characters
      @anagrams[alphagram(word)] += [word]
      word
    end
  end

  def alphagram(word)
    word.chars.sort.join
  end
end
Now we can initialize the object and retrieve anagrams directly from the hash using alphagrams:
> wl = WordList.new('twl06.txt')
> wl.words.count
#=> 178691
> wl.anagrams.count
#=> 161019
> wl.anagrams[alphagram('parsec')]
#=> ["capers", "crapes", "escarp", "pacers", "parsec", "recaps", "scrape", "secpar", "spacer"]
> puts Benchmark.measure { wl = WordList.new('twl06.txt') }
1.052380 0.051231 1.103611 ( 1.134766)
This takes just a bit longer than a single lookup in our brute-force scheme above, and the build time will also grow O(n). However, we only need to do this once; once we have our WordList object, hash lookups are almost free.
This speed does come at a price, because we need to keep the word list and anagrams hash around in memory all the time.
> wl = nil
> puts MemoryProfiler.report { wl = WordList.new('twl06.txt') }.total_retained_memsize
37130656
(I'm interested in retained memory here; see this thoughtbot article for a good writeup on this.)
At first 37.1 MB might not seem like much, but I've dealt with enough performance/memory problems in Rails apps to know that practically this is more than we want sitting around all the time.
Next, I'll look into the more practical problem of efficiently storing and retrieving anagrams in a database with Rails and ActiveRecord.
I made two models to store words, alphagrams, and their relationship. Like this:
 ------------                --------------
| alphagrams |             | words        |
|------------|            /|--------------|
| text       |------------| text         |
| created_at |            \| alphagram_id |
| updated_at |             | created_at   |
 ------------              | updated_at   |
                            --------------
class Alphagram < ApplicationRecord
  has_many :words
end

class Word < ApplicationRecord
  belongs_to :alphagram
end
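For reference, the migration behind this schema would look something like the sketch below (the column types, index, and Rails version tag are my assumptions, not taken from an actual migration file):

```ruby
class CreateAlphagramsAndWords < ActiveRecord::Migration[5.2]
  def change
    create_table :alphagrams do |t|
      t.string :text
      t.timestamps
    end
    # index assumed, for fast lookups by alphagram text
    add_index :alphagrams, :text

    create_table :words do |t|
      t.string :text
      t.references :alphagram, foreign_key: true
      t.timestamps
    end
  end
end
```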
Now I'll make a modified version of WordList to populate these tables instead of just creating an object in memory.
The most straightforward way of doing this with ActiveRecord is to just loop through all the words, and create the appropriate records:
class WordList
  def initialize(file_path)
    File.readlines(file_path).each do |line|
      word_text = line.chomp
      ag = Alphagram.find_or_create_by(text: alphagram(word_text))
      Word.create(text: word_text, alphagram: ag)
    end
  end

  def alphagram(word)
    word.chars.sort.join
  end
end
So how does this perform compared to my 1s build above?
> puts Benchmark.measure { WordList.new('twl06.txt') }
501.579515 283.982253 785.561768 (869.650190)
Almost 15 minutes!
This works out to a per-word clock time of ~4.8ms.
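That per-word figure is just the benchmark's total clock time divided by the number of words:

```ruby
total_seconds = 869.650190   # clock time from the benchmark above
word_count    = 178_691      # words in the TWL06 list
per_word_ms   = total_seconds / word_count * 1000
puts per_word_ms.round(2)    # prints 4.87
```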
In reality I might not have to worry about this problem too much -- in production I'll only have to build the word list once. But right now I'm experimenting, trying different approaches with different word lists, and I don't want to wait 10-15 minutes every time. Given how fast the hash approach above is, I should be able to get this down to a reasonable amount of time for a single (albeit somewhat long) request.
Let's look at the SQL being executed for a single word:
Alphagram Load (0.1ms) SELECT "alphagrams".* FROM "alphagrams" WHERE "alphagrams"."text" = ? LIMIT ? [["text", "aabeltt"], ["LIMIT", 1]]
(0.0ms) begin transaction
Alphagram Create (0.4ms) INSERT INTO "alphagrams" ("text", "created_at", "updated_at") VALUES (?, ?, ?) [["text", "aabeltt"], ["created_at", "2019-07-01 08:22:59.814528"], ["updated_at", "2019-07-01 08:22:59.814528"]]
(0.7ms) commit transaction
(0.0ms) begin transaction
Word Create (0.4ms) INSERT INTO "words" ("text", "alphagram_id", "created_at", "updated_at") VALUES (?, ?, ?, ?) [["text", "abettal"], ["alphagram_id", 161545], ["created_at", "2019-07-01 08:22:59.817320"], ["updated_at", "2019-07-01 08:22:59.817320"]]
(0.7ms) commit transaction
Right off the bat I can see it's creating a separate transaction for each create. Let's wrap each word's statements in a single transaction:
class WordList
  def initialize(file_path)
    File.readlines(file_path).each do |line|
      ActiveRecord::Base.transaction do
        word_text = line.chomp
        ag = Alphagram.find_or_create_by(text: alphagram(word_text))
        Word.create(text: word_text, alphagram: ag)
      end
    end
  end

  def alphagram(word)
    word.chars.sort.join
  end
end
Now the SQL for a single word looks like:
(0.1ms) begin transaction
Alphagram Load (0.2ms) SELECT "alphagrams" ...
Alphagram Create (0.3ms) INSERT INTO "alphagrams" ...
Word Create (0.2ms) INSERT INTO "words" ...
(0.8ms) commit transaction
When benchmarked, this gives a per-word clock time of ~3.8ms. A little better.
But we still have a transaction for every word. Let's try one transaction for all the words:
class WordList
  def initialize(file_path)
    ActiveRecord::Base.transaction do # I moved this line up
      File.readlines(file_path).each do |line|
        word_text = line.chomp
        ag = Alphagram.find_or_create_by(text: alphagram(word_text))
        Word.create(text: word_text, alphagram: ag)
      end
    end
  end

  def alphagram(word)
    word.chars.sort.join
  end
end
Now we have one transaction creating all words and alphagrams:
(0.1ms) begin transaction
Alphagram Load (0.2ms) SELECT "alphagrams" ...
Alphagram Create (1.5ms) INSERT INTO "alphagrams" ...
Word Create (0.3ms) INSERT INTO "words" ...
Alphagram Load (0.1ms) SELECT "alphagrams" ...
Alphagram Create (0.1ms) INSERT INTO "alphagrams" ...
Word Create (0.1ms) INSERT INTO "words" ...
Alphagram Load (0.1ms) SELECT "alphagrams" ...
Alphagram Create (1.1ms) INSERT INTO "alphagrams" ...
Word Create (0.1ms) INSERT INTO "words" ...
...
(1.2ms) commit transaction
Which brings us down to a per-word clock time of ~2.0ms.
This is pretty good -- less than half of where we started -- but ActiveRecord is still executing thousands of separate SQL statements. SQL can be a lot faster than this, but to take advantage of that speed I'm going to have to write a little SQL myself.
The approach here needs to be a little different. I'm going to use SQL INSERTs to create all the Word and Alphagram records in just two statements, one per table. Because a Word needs an alphagram_id to have a proper belongs_to relationship, we first need to create all the Alphagram records before creating the Word records.

In both cases, I'm building one big array of strings to pass to the VALUES clause of the INSERT:
class WordList
  def initialize(file_path)
    # first read all records into memory
    alphagrams_hash = Hash.new([])
    words = File.readlines(file_path).map do |line|
      word = line.chomp
      alphagrams_hash[alphagram(word)] += [word]
      word
    end

    time = Time.now

    # build alphagram value strings
    # (interpolating directly is only OK because the words contain nothing but letters)
    alphagrams_for_db = alphagrams_hash.keys.map do |ag_text|
      "('#{ag_text}','#{time}','#{time}')"
    end.join(',')

    # INSERT alphagrams
    ActiveRecord::Base.connection.execute(
      "INSERT INTO alphagrams(text,created_at,updated_at) VALUES #{alphagrams_for_db}"
    )

    # build word value strings
    words_for_db = words.map do |word_text|
      alphagram_id = Alphagram.find_by(text: alphagram(word_text)).id
      "('#{word_text}',#{alphagram_id},'#{time}','#{time}')"
    end.join(',')

    # INSERT words
    ActiveRecord::Base.connection.execute(
      "INSERT INTO words(text,alphagram_id,created_at,updated_at) VALUES #{words_for_db}"
    )
  end

  def alphagram(word)
    word.chars.sort.join
  end
end
Now the SQL looks like this: two big INSERTs, with a lot of SELECTs in the middle to fetch the alphagram_ids:
INSERT INTO alphagrams(text,created_at,updated_at) VALUES ('aa','2019-07-01 12:36:04 +0200','2019-07-01 12:36:04 +0200'),('aah','2019-07-01 12:36:04 +0200','2019-07-01 12:36:04 +0200'),('aadeh','2019-07-01 12:36:04 +0200','2019-07-01 12:36:04 +0200') ...
Alphagram Load (1.1ms) SELECT "alphagrams".* FROM "alphagrams" WHERE "alphagrams"."text" = ? LIMIT ? [["text", "aa"], ["LIMIT", 1]]
Alphagram Load (0.1ms) SELECT "alphagrams".* FROM "alphagrams" WHERE "alphagrams"."text" = ? LIMIT ? [["text", "aah"], ["LIMIT", 1]]
Alphagram Load (0.1ms) SELECT "alphagrams".* FROM "alphagrams" WHERE "alphagrams"."text" = ? LIMIT ? [["text", "aadeh"], ["LIMIT", 1]]
Alphagram Load (0.1ms) SELECT "alphagrams".* FROM "alphagrams" WHERE "alphagrams"."text" = ? LIMIT ? [["text", "aaghin"], ["LIMIT", 1]]
...
(4.6ms) INSERT INTO words(text,alphagram_id,created_at,updated_at) VALUES ('aa',163662,'2019-07-01 12:36:04 +0200','2019-07-01 12:36:04 +0200'),('aah',163663,'2019-07-01 12:36:04 +0200','2019-07-01 12:36:04 +0200'),('aahed',163664,'2019-07-01 12:36:04 +0200','2019-07-01 12:36:04 +0200') ...
This gets us down to a per-word clock time of ~0.18ms. Nice!
There is one more SQL optimization I can see here, which comes from the fact that we're still doing an individual lookup for each word's corresponding alphagram_id on the line:
alphagram_id = Alphagram.find_by(text: alphagram(word_text)).id
I can eliminate these individual lookups by loading all the newly created Alphagram records with a single SELECT before I create the words, then building a hash:
alphagram_ids = {}
Alphagram.all.each do |ag|
  alphagram_ids[ag.text] = ag.id
end
Now, when setting up the word values, I can get the alphagram_id without a SELECT statement for every record:
alphagram_id = alphagram_ids[alphagram(word_text)]
This last change doesn't save much -- just the ~0.1-0.2ms per word we were spending on individual SELECT statements to load each alphagram_id.
Finally, I get a per-word clock time of ~0.05ms.
> puts Benchmark.measure { WordList.new('lib/word_lists/twl06.txt') }
8.319700 0.555217 8.874917 ( 9.011809)
That's more like it: all 180k words parsed into anagrams and stored in the DB in under 10 seconds.
Now that I have all the records in the database, a quick note about retrieval performance. With the models I made above, finding anagrams is a matter of just looking up the appropriate Alphagram and reading its associated Words:
Alphagram.find_by(text: 'aceprs').words.pluck(:text)
#=> ["capers", "crapes", "escarp", "pacers", "parsec", "recaps", "scrape", "secpar", "spacer"]
This executes two SELECT statements, giving a pretty speedy lookup:
> puts Benchmark.measure { Alphagram.find_by(text: 'aceprs').words.pluck(:text) }
0.000912 0.000080 0.000992 ( 0.001290)
This isn't quite as fast as my in-memory hash lookup above, and won't be O(1) either -- more like O(log n) with a database index on alphagrams.text. Still, this is plenty fast to execute (even for a much larger list of words) within a single request.
- For all benchmarking I turned off logging in the console (conf.echo = false and ActiveRecord::Base.logger = nil).
- For performance measurements I used Ruby's built-in Benchmark, and for memory I used the memory_profiler gem.
- I used a local SQLite DB for all the above queries -- I may replicate this on a proper Postgres instance when the app is a bit more developed.
- This was not scientific at all -- I repeated each measurement 2 or 3 times and took averages, but don't claim any degree of statistical significance here.