Nate Berkopec's Ruby Performance Newsletters
A memory-saving ActiveRecord setting has been used by just one application ever, according to GitHub
There's a common performance problem in many Rails background jobs.
Background jobs often do operations across large sets of data. Basically, they do silly things like User.all.each(&:send_daily_newsletter).
So, there's a problem with that query. In development and test environments, User.all will probably return a few rows, maybe a dozen at most. Most developers have extremely limited seed data on their local machines.
In production, however, User.all will probably return quite a few rows. Depending on the app you work on, maybe a few hundred thousand.
There's a tiiiiiny issue with a result set that returns 100,000 rows, and it's not just that the SQL query will take a long time to return. It will have irreversible effects on your Ruby app too!
The problem is that large result sets in ActiveRecord can really blow up your memory usage. Due to some intricacies of the memory allocator Ruby uses, once a Ruby process uses a lot of memory, it tends to not give it back to the operating system, even though the memory is garbage collected. This is the allocator's fault, not Ruby's. I gave a conf talk at RubyKaigi about this behavior. We might be able to fix it over the next few years, but for now, we have to deal with it.
This is the kind of behavior that can often go overlooked and blow up in production. Even worse, most people seem to not really care about what goes on in their background jobs as long as they work, but then complain that their queues are filling up or they're running out of memory.
Having the tendency to fetch massive result sets is just one of the reasons background jobs can be slow and use lots of memory, but it's a common one. Since Rails 5.0 though, we've had a config setting that helps us to identify when this is happening. It's called warn_on_records_fetched_greater_than.
Just one app has ever used it, according to GitHub.
When set to an integer, this setting will print a warning to the logs if any ActiveRecord::Relation object returns a result set greater than that integer.
It's intended as a reminder that if you're fetching large result sets, you should probably be using find_each or other methods from ActiveRecord::Batches.
So, instead of
User.all.each(&:send_daily_newsletter)
You can use:
User.find_in_batches do |group|
  group.each(&:send_daily_newsletter)
end
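Or, equivalently, find_each, which yields records one at a time while still fetching them from the database in batches of 1,000:
User.find_each(&:send_daily_newsletter)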
I think it would be reasonable for everyone to go into their config/environments/development.rb and drop this warning in right now. 1500 would be a sane value, as a result set greater than 1500 would be fetched in at least 2 pretty full batches.
config.active_record.warn_on_records_fetched_greater_than = 1500
Then, you'll see this warning in the logs when you exceed that limit:
Query fetched 1501 User records: SELECT "users".* FROM "users" ORDER BY "users"."id" ASC LIMIT $1
Now, I'm not sure if I would drop this into application.rb or my production.rb environment file. It uses ActiveSupport::Notifications.subscribe("sql.active_record"), which means that we're using notifications to subscribe to every single SQL query. Local benchmarks show a very small overhead per query, but "adding overhead to every ActiveRecord request" is not always something you want to do.
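For context, here's roughly what subscribing to that notification looks like - a generic sketch of the sql.active_record hook for logging query timings, not the actual implementation of the setting:
# Generic sketch: log every query's name, duration, and SQL via the
# same sql.active_record notification the setting subscribes to.
ActiveSupport::Notifications.subscribe("sql.active_record") do |name, start, finish, id, payload|
  duration_ms = (finish - start) * 1000
  Rails.logger.debug("#{payload[:name]} (#{duration_ms.round(1)}ms): #{payload[:sql]}")
end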
It's probably useful to just turn it on in development all the time, and in production turn it on only when you're running background jobs. For example, you could drop this into config/initializers:
if Sidekiq.server?
  ActiveRecord::Base.warn_on_records_fetched_greater_than = 1500
end
I hope that this has given you a new pull request to submit today. Happy Friday deploy!
Until next time,
-Nate
ActiveRecord::QueryCache — do you know what those CACHE lines in your logs mean?
There's one really common performance mistake I see constantly in Ruby/Rails web applications. It relies on a little-known and poorly-understood feature known as the ActiveRecord QueryCache.
Have you ever written this line of code before?
@current_user ||= User.find_by(id: session[:user_id])
Seems innocuous, right? Not a problem!
But what happens when the user isn't logged in? That is, when session[:user_id] is nil.
Try it in your Rails console:
irb(main):002:0> User.find_by(id: nil)
User Load (4.9ms) SELECT "users".* FROM "users" WHERE "users"."id" IS NULL LIMIT $1 [["LIMIT", 1]]
=> nil
Hmmm. A query is executed for a null id user! Well, that's obviously inefficient. But current_user is usually checked many times per page. What happens if we call current_user many times during the request? It won't be memoized in @current_user, because the result of the query was nil.
In your console, you'll see the query execute multiple times, like this:
irb(main):003:0> User.find_by(id: nil)
User Load (0.5ms) SELECT "users".* FROM "users" WHERE "users"."id" IS NULL LIMIT $1 [["LIMIT", 1]]
=> nil
irb(main):004:0> User.find_by(id: nil)
User Load (0.5ms) SELECT "users".* FROM "users" WHERE "users"."id" IS NULL LIMIT $1 [["LIMIT", 1]]
=> nil
But that's not the way it will work in production, because in prod we have something called the ActiveRecord::QueryCache turned on. Turn that on in your console by doing this:
irb(main):007:0> ActiveRecord::Base.connection_pool.enable_query_cache!
=> true
irb(main):008:0> User.find_by(id: nil)
User Load (0.5ms) SELECT "users".* FROM "users" WHERE "users"."id" IS NULL LIMIT $1 [["LIMIT", 1]]
=> nil
irb(main):009:0> User.find_by(id: nil)
CACHE User Load (0.0ms) SELECT "users".* FROM "users" WHERE "users"."id" IS NULL LIMIT $1 [["LIMIT", 1]]
=> nil
See how the repeat of that query looks different? Instead of saying User Load (0.5ms) it said CACHE User Load (0.0ms). 0 milliseconds! That's great! That means getting things from a "hot" ActiveRecord QueryCache is free, right?
Not so fast.
That little number, in milliseconds, is just the amount of time spent going to the database and back, waiting on the I/O for your database result. It does not include the time spent building the query, generating the correct SQL string, and, most importantly, copying and creating a new object.
We can see this overhead clearly when we benchmark some different uses of ActiveRecord when the QueryCache is on.
Here's the benchmark I wrote:
user = User.first
ActiveRecord::Base.connection_pool.enable_query_cache!
Benchmark.ips do |x|
  x.report("local variable") { user }
  x.report("AR every time") { User.find_by(id: 1) }
  x.report("AR every time, nil result") { User.find_by(id: nil) }
  x.compare!
end
And here's the result:
Comparison:
local variable: 27650490.8 i/s
AR every time: 6236.7 i/s - 4433.53x slower
AR every time, nil result: 4555.9 i/s - 6069.19x slower
When you realize that the alternative to using ActiveRecord::QueryCache is to store and access the result as a local variable, you can begin to see why it's slow in the first place. Local variable access is very, very fast. Executing the hundreds (perhaps thousands) of lines of Ruby that kick off when you call `User.find_by` is always going to be much, much slower than that.
In my original example, it's far safer to just have a safety valve when the session is nil or falsey:
@current_user ||= User.find_by(id: session[:user_id]) if session[:user_id]
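A related option, if you want the memoization to stick even when the lookup comes back nil, is the defined? idiom (a general Ruby pattern, not something from the original email - the method and variable names are just illustrative):
def current_user
  # defined? is true once @current_user has been assigned, even if it was assigned nil,
  # so the database lookup runs at most once per request.
  return @current_user if defined?(@current_user)

  @current_user = session[:user_id] ? User.find_by(id: session[:user_id]) : nil
end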
I did a quick Github search for this little "performance hack," and found over 100,000 examples of just this exact current_user situation alone.
Note how, on my machine, each call to the QueryCache took about 1/5 of a millisecond, while local variable access is basically free. I've had clients shave up to 100 milliseconds off controller actions through careful removal of QueryCache usage, simply by searching their logs for the "CACHE X Load" string.
So give it a shot. Pay more attention to "CACHE X Load" in your development logs. Because it certainly doesn't take 0.0ms, and it ain't free.
Hello Rubyists,
Most Ruby web applications will eventually run into a memory constraint. There's just two resources that you have to balance on a web application box: memory and CPU. Everyone that's running Ruby web applications is maximizing their headroom based on one or the other - 90% CPU utilization but 10% memory utilization, or, more frequently, 90% memory utilization and 10% CPU utilization.
This pattern of memory bottlenecking is frequently found in background job processing and in applications which serve low traffic but have a large amount of features (enterprise/B2B).
That means that if you can find a free or easy way to decrease your memory usage, you'll take it. It means you might be able to run more Ruby processes per server, decreasing your operational costs.
One way to reduce memory usage that's not exactly groundbreaking or extreme but that I think is extremely underused is to only `require` the parts of the Rails framework that you actually use.
This won't give you massive savings (I think anywhere between 2-10 megabytes per process would be a reasonable expectation), but every little bit helps.
Basically, the magic is all in a single line in your config/application.rb:
require 'rails/all'
All that does is load all.rb from Railties.
Instead of letting rails/all load all of our frameworks for us, we can pick and choose which frameworks to load ourselves:
require "active_record/railtie"
require "active_storage/engine"
require "action_controller/railtie"
require "action_view/railtie"
require "action_mailer/railtie"
require "active_job/railtie"
require "action_cable/engine"
require "action_mailbox/engine"
require "action_text/engine"
require "rails/test_unit/railtie"
require "sprockets/railtie"
From there, we can just comment out anything we don't actually need. There are some obvious ones in this list that some people will just never use, ever, in their application, if they've chosen a different gem or if the app doesn't need it: ActiveRecord, Sprockets, ActiveStorage, ActionCable and ActionMailbox come to mind.
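For example, an app with no file uploads, no websockets, and no inbound email might end up with something like this in config/application.rb (purely illustrative - keep whatever your app actually uses):
require "active_record/railtie"
require "action_controller/railtie"
require "action_view/railtie"
require "action_mailer/railtie"
require "active_job/railtie"
# require "active_storage/engine"
# require "action_cable/engine"
# require "action_mailbox/engine"
# require "action_text/engine"
require "rails/test_unit/railtie"
require "sprockets/railtie"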
derailed-benchmarks, by Richard Schneeman, is great for measuring the gains you can get from these kinds of changes. The output looks like this:
active_record/railtie: 3.3203 MiB
This memory usage comes from initializers, railties, and code loading. If this app didn't use ActiveRecord (it does, but say it used Mongo), that would be a wasted 3MB just hanging around.
One of the biggest gains you can make is usually removing ActionMailer, if your application sends no mails. Depending on the version of the mail gem you depend on, the savings can be a few MB to as much as 30 megabytes. Sprockets, ActiveRecord, and ActiveStorage will save you a few megabytes each as well.
I like this hack as it only requires a few lines of change and it's not very brittle. It also makes any future Rails frameworks (e.g. the new ActionText and ActionMailbox) opt-in rather than opt-out, which I prefer.
I gave a talk at Railsconf a while ago about the underlying modularity of the Rails framework, where I talk much more in depth about this exact hack and use it to help me write a tweet-length (140 characters or less) Rails application.
Talk again soon,
-Nate
Most of the time, software developers are encouraged to think in terms of the implementation. We are simply agents whose job it is to Implement the Spec. The user stories are treated like stone tablets, engraved upon the Mount, and while we can do whatever we want to implement them, the stories and specs themselves are above our paygrade to question or modify.
Siloing engineers into an "implementation-only" mindset can cause issues with software performance, as performance is a requirement that often goes unspecified and unsaid.
When was the last time you wrote a performance spec into a feature request or user story? Of course, when the PM says "the user should be able to change their password", they meant "change their password some time before the heat death of the universe", right?
It's often up to you, the engineering team, to make that implicit requirement explicit.
However, there's also a number of performance improvements you can make when you help shape the spec.
Take pagination. "A user should be able to view their posts" - great. But it's 3 years later and now the PostsController#index is creaking under the load of users with thousands of posts. Adding pagination here is a *design-level decision*, but changing the page that way would increase performance by an order of magnitude.
Sometimes, providing an exact answer is computationally expensive. Often we can provide slightly inexact answers much faster - for example, by caching data and serving slightly-out-of-date results.
The fastest code is no code. Sometimes the performance cost of a feature or part of the page is so great that it's just not worth including. Maybe it should just be moved to a separate page, rather than a heavily-trafficked index action, so that people who really need that feature have to go to a special area.
I sped up page loads on Rubygems.org by changing the Webfont strategy. Instead of loading several weights of a heavy font from Typekit, I loaded just a few weights of a visually similar but comparatively lightweight font from Google. The results were pretty spectacular. That was using the design space to improve performance.
Rails apps that serve a JSON API often run into performance issues due to failures in the design space. Over time, their endpoints all become "kitchen sinks", serving hundreds of KB of all kinds of data. Rather than redesign the endpoint (or try a new strategy for fulfilling requests, such as GraphQL), they just keep adding things to the endpoint until it feels like it just serves the entire database in JSON.
Next time you're running into a performance issue, see if there's not a better and easier way to modify the spec just slightly for an outsized performance gain. Remember that performance is a requirement too, even if it's unstated.
Until next time,
-Nate
Hello Rubyists!
I'm experimenting with some additional email-only content this year, as a small thank you for subscribing to this list. To start with, I'm doing a many-part series on Sidekiq. This will be based on all the lessons I've learned over the years and applied with my clients.
Most of us are now using Sidekiq to run our background jobs. It's a big step up in throughput from single-threaded job processors, like Resque, and it's a big scalability boost over SQL-backed queues like DelayedJob. However, over the years, Sidekiq has exposed a few problems in the C-Ruby implementation related to threads, specifically around memory and the GVL (Global VM Lock). As a result, we've "learned" a lot about deploying Sidekiq in production since its 1.0 release.
Recently, a few settings in Sidekiq core (as of this writing, at v5.2.5) have changed their default values. I'd like to talk about one of those settings changes, and why I think *everyone*, regardless of Sidekiq version, should backport these changes to their own Sidekiq installs.
## Fragmentation and Concurrency ##
With Sidekiq 5.2.0, the default concurrency setting was changed from 25 to 10: https://github.com/mperham/sidekiq/commit/8ff96ae0b0358dc273d19ae1f8474f6ff4fd2b64
Why?
One of the most important things we've learned over the years about Sidekiq is that a bad interaction between the C-Ruby runtime and the `malloc` memory allocator included in Linux's glibc can cause extremely high memory usage. I'll talk about what causes this bad interaction in a later email, but for now, let's just concentrate on the effects.
Sidekiq with high concurrency settings, when running on Linux, can have what *looks like* a "memory leak". A single Sidekiq process can slowly grow from 256MB of memory usage to 1GB in less than 24 hours. However, rather than a leak, this is actually memory fragmentation.
Memory fragmentation occurs when a memory space starts to look like swiss cheese: it's got lots of little holes all over the place of odd and strange sizes, and the entire space isn't full. Remember defragmenting your disk in Windows 95? It's exactly like that.
I'm going to share this great Stack answer (https://stackoverflow.com/questions/3770457/what-is-memory-fragmentation) by Steve Jessop because it's got a great ASCII art explanation:
=========================================================
Imagine that you have a "large" (32 bytes) expanse of free memory:
----------------------------------
|                                |
----------------------------------
Now, allocate some of it (5 allocations):
----------------------------------
|aaaabbccccccddeeee              |
----------------------------------
Now, free the first four allocations but not the fifth:
----------------------------------
|              eeee              |
----------------------------------
Now, try to allocate 16 bytes. Oops, I can't, even though there's nearly double that much free.
=========================================================
What we've learned since Sidekiq became popular is that memory fragmentation becomes much *worse* in a direct relationship with the number of Ruby threads. Sidekiq's "concurrency" setting is just a setting of how many threads it will run at once. High numbers lead to more fragmentation. You could imagine that 1 thread uses log(x) memory over time, so increasing the number of threads leads to n * log(x) memory use. Similarly, Ruby processes which use just one thread (Unicorn, DelayedJob, Resque) hardly experience this issue at all.
This led to the reduction of the default concurrency setting. But, aren't we losing a lot of throughput and capacity by reducing the total thread count by 60 percent? Will I have to run 3x the amount of Sidekiq processes or servers in production after making this change?!
I'm not so sure. This is where the GVL comes in.
Another thing we've learned in production over the years is just how useful a thread is. You might think of this as "the marginal benefit of adding 1 additional thread". What we're sure of is that there are greatly diminishing returns. The second thread adds more throughput than the third, which adds more throughput than the fourth, and so on. However, at some point, the fragmentation and memory cost of 1 additional thread is greater than the added throughput.
The Global Virtual Machine Lock is an intimidating concept for people, but it's actually quite simple. The clues are in the name. This is a *global lock*, that is, only one thing can hold this lock at a time. What is it a lock around? The Ruby Virtual Machine. So, only one thread in our Ruby processes can run the Ruby virtual machine at a time. In effect, we cannot run Ruby code in parallel. But our threads perform *other work* which is not running Ruby code and does not require the Ruby Virtual Machine. The biggest task in this category is I/O: sending and receiving data, particularly across the network. This is because I/O in CRuby runtime is implemented in C code, not in Ruby.
So, the marginal usefulness of one additional thread with CRuby is actually just proportional to how much of the workload is I/O calls. This intuition is formalized by something called Amdahl's Law (https://en.wikipedia.org/wiki/Amdahl%27s_law), which mathematically relates speedup in work, the amount of processor cores we can take advantage of, and the portion of the work which can be done in parallel.
As already stated, with CRuby, the percentage of the work we can do in parallel is the same thing as the amount of work that is I/O. Many background jobs do quite a bit of I/O: lots of database calls, lots of network calls to external services like your mail server or your credit card processor.
Let's say the average background job spends 75% of its time in I/O. According to Amdahl's Law (https://en.wikipedia.org/wiki/Amdahl%27s_law#/media/File:AmdahlsLaw.svg), we can expect about a 3x speedup with 8 threads and 8 processors. After that 8th thread, the benefits become quite marginal.
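If you want to plug in your own numbers, Amdahl's Law is simple enough to compute directly (a quick sketch; p is the I/O fraction of a job, n is the thread count, and the outputs are approximate):
```
# Amdahl's Law: speedup(n) = 1 / ((1 - p) + p / n)
# p = fraction of the work that can run in parallel (for CRuby, the I/O fraction)
# n = number of threads
def amdahl_speedup(p, n)
  1.0 / ((1 - p) + p / n.to_f)
end

amdahl_speedup(0.75, 8)   # => ~2.9, the "about 3x" figure above
amdahl_speedup(0.95, 128) # => ~17, why extremely I/O-heavy jobs can still benefit from huge thread counts
amdahl_speedup(0.50, 5)   # => ~1.7, why low-I/O jobs gain little from more threads
```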
What you'll also notice here is that if you're doing an extremely high amount of I/O in your Sidekiq jobs (90-95% I/O), high thread settings may still be useful to you. If all of your Sidekiq jobs spend 950 milliseconds waiting on a network call from Mailchimp and 50 milliseconds running Ruby code, thread counts as high as 128 may be useful to you. Memory fragmentation costs will be extremely high (more on that in a minute), but your threads would probably still be much more efficient than running more Sidekiq processes. Likewise, if your background jobs are doing very little I/O (less than 50% of their total runtime) you may want concurrency settings as low as 5.
In conclusion, while the memory costs of adding 1 additional thread are linear, the benefits are not. 10 threads is a great default for most apps using Sidekiq, but 5 to 128 is a reasonable range. According to Amdahl's Law, the following table will provide a good starting point:
| % Time in I/O | Concurrency Setting |
|---------------|---------------------|
| 50% or less | 5 |
| 50-75% | 10 |
| 75-90% | 25 |
| 91%+ | 125 |
Judge the amount of time spent in I/O for your background jobs with a "back-of-the-envelope" calculation based on your APM (New Relic, Scout, Skylight) dashboards. Look at the top 5 jobs by "percentage of time consumed" (that metric is always average time multiplied by jobs-per-minute), and get a sense of how much time each job spends waiting on the database or calling out on the network.
You may have the intuition here that concurrency settings should somehow be related to the number of CPU cores available. In theory, if we set concurrency correctly, we should always have 1 thread running Ruby code (holding the GVL) at any given time. The other threads will either be waiting on I/O to return or idling, neither of which uses a lot of CPU. So, in theory, each Sidekiq process should completely saturate 1 CPU thread when we're under heavy load. So, concurrency settings are actually *independent* of the CPU thread/core count, and in order to take advantage of additional CPU threads, we're going to need to spawn more Sidekiq *processes*, which is something I'll cover in a separate email.
So I've covered the "theoretical" or the "starting point" of where to set your concurrency - so what are the metrics in production that will tell you whether or not your settings are correct?
If memory usage is too high after changing this setting, you'll have to tune concurrency downward. I'll talk about managing Sidekiq memory usage in later emails, but remember that high concurrency will directly lead to high memory usage. If CPU utilization is low *while you're under high load* (that is, while the process is working through a full queue), you may benefit from tuning the concurrency setting upward.
Don't forget that your concurrency setting *must not exceed* the available ActiveRecord database pool size. More on that in a later email.
Well, that's all I've got on Sidekiq's concurrency setting. You can reply to this email with questions, comments or ideas for future emails in this series. Here's what's coming up:
* MALLOC_ARENA_MAX and Jemalloc
* Idempotency
* Multi-threading safety
* Setting database connection pools safely
* Running multiple processes per host
* Fanout, batching and push_bulk
* Queue design
* Locks (SQL + Redis)
* Deployment and host configuration
* Scaling (queue depths)
* Timeouts
* Maybe even more!
See you next week with more!
-Nate
Hello Rubyists,
Welcome to the THIRD part of my many-part series on Sidekiq in practice, based on the many years of experience I've had working with Sidekiq in production on client apps.
I've gotten some feedback about how plain-text emails like mine are poorly formatted in many email clients. To fix that, you may now read this email as a (markdown-formatted) Github Gist: https://gist.github.com/nateberkopec/56d16a58b5666c46d8346f2f36e8444d
Links to previous emails in this series are at the bottom of this email.
This week's email is about *how to make jobs idempotent*.
Let's look at some real-world examples in CodeTriage (https://www.codetriage.com/), an open-source Rails application.
CodeTriage has a very simple job which updates an ActiveRecord object with some information from Github (https://github.com/codetriage/codetriage/blob/master/app/jobs/update_repo_info_job.rb):
class UpdateRepoInfoJob < ApplicationJob
  def perform(repo)
    repo.update_from_github
  end
end
`update_from_github` does a network call and overwrites some attributes on a Repo with whatever the current info is on Github. **This operation is inherently idempotent**. If you enqueue this job 100 times for the same `repo`, the end state of the repository's database row **is exactly the same**, despite the fact that you performed 100 network calls and 100 row updates. This job is completely idempotent without any extra code.
A lot of jobs are like this. They're just like pushing the "push to walk" button or the "open door" button in the elevator. Adding uniqueness constraints to these jobs is pretty silly!
Another job, SendSingleTriageEmailJob, sends a single email to a user (https://github.com/codetriage/codetriage/blob/master/app/jobs/send_single_triage_email_job.rb). Here's a simplified version of that real-world job:
class SendSingleTriageEmailJob < ApplicationJob
  def perform(id, create: false)
    repo_sub = RepoSubscription.find_by(id: id)
    return unless repo_sub

    assignment = find_assignment_for_repo_sub(repo_sub)

    if assignment
      assignment.update!(delivered: true)
      UserMailer.send_triage(assignment: assignment).deliver_later
    end
  end

  private

  def find_assignment_for_repo_sub(repo_sub)
    repo_sub.user.issue_assignments.order(:created_at).eager_load(:repo_subscription)
            .where(repo_subscriptions: { repo_id: repo_sub.repo_id }).last
  end
end
This job is not idempotent. Run it twice with the same arguments, and you will send two emails. How can we make this job idempotent?
Note how the `delivered` column is updated right before we enqueue an email to send. We can work with this attribute to ensure idempotency.
A sort of "naive idempotency" can be achieved simply by checking to see if the assignment is already `delivered`:
def perform(id, create: false)
  # ...
  if assignment && !assignment.delivered
    assignment.update!(delivered: true)
    UserMailer.send_triage(assignment: assignment).deliver_later
  end
end
This will work 99% of the time. This "naive idempotency" - update a database column when the operation is done, and check before starting the work to make sure the state of the column is not yet changed - is Good Enough for a lot of jobs and work. You could think of this idempotency pattern as "Check State, Change State": check to see if the state is already what we wanted it to be, and if not, do the work to change it.
However, it's not completely robust. What happens if this job is enqueued twice *and* two threads start processing this job at the exact same time? Then, you've got a race condition, where both threads may think that the assignment is not delivered!
To fix this, we can introduce a row-level database lock:
def perform(id, create: false)
  # ...
  return unless assignment

  assignment.with_lock do
    return if assignment.delivered
    assignment.update!(delivered: true)
    UserMailer.send_triage(assignment: assignment).deliver_later
  end
end
The introduction of this row-level pessimistic lock ensures that only 1 Sidekiq thread can be executing the block at one time. This more or less guarantees the idempotency of this job - any additional threads beyond the first will enter the locked block *only after the first job has completed*, which means `assignment.delivered` will be true, exiting the block.
However, there's one last bug here. How do we know the email is actually sent? Currently, we don't - it's given to another job via `deliver_later`. To be completely robust, we should only update the delivered attribute after the email is confirmed to have sent:
def perform(id, create: false)
  # ...
  return unless assignment

  assignment.with_lock do
    return if assignment.delivered
    UserMailer.send_triage(assignment: assignment).deliver_now
    assignment.update!(delivered: true)
  end
end
You may be wondering - Nate, didn't you just change a Redis lock (job uniqueness) to a row-level SQL lock? Yes. Yes I did.
There are three reasons row-level SQL locks are superior to Redis locks:
* Locks are "first class citizens" in a SQL database. Literally the entire database is designed around locking. Locks in Redis are much more "ad-hoc" - Redis is a key/value store, it's not designed to manage locks.
* Job uniqueness (i.e. Redis locks for idempotency) adds an extra network roundtrip to *every single job enqueue* in your client, increasing latency *during the web response*.
* The big one: Job uniqueness locks live for at least as long as the job is in the queue (and, depending on your setup, while the job is executing). Row-level SQL locks only exist *during the specific portion of the job execution which requires idempotency*. This means that fewer row-level SQL locks will exist at any one time than a comparable Redis-based locking approach.
There are, however, some drawbacks, of course.
* Row-level SQL locks *hold the database connection* while waiting for the lock to unlock. This connection cannot be given to any other thread while we're waiting for it to unlock. This means that we can't really do anything inside the lock block that will take a large amount of time.
* This means we need to set a lock timeout in our database. Postgres, for example, doesn't have one by default.
In conclusion: don't just slap unique constraints on every job, use the "Check State, Change State" pattern to make jobs idempotent, and consider row-level database locks before deciding you *must* do distributed locks in Redis with a job uniqueness plugin.
Until next week,
-Nate
Hey Rubyists,
Welcome to the fourth part of my Sidekiq in Practice email series. This email series is intended to be a "missing manual" for running Sidekiq in production, a guide to all the little details of running and scaling our beloved background job framework. I hope you're all finding it useful. I'm really enjoying your comments and questions as well - please do remember that you can reply directly to any of my emails to contact me.
This week's email is about SQL database connection pools.
This email is available as markdown/Github Gist here: https://gist.github.com/nateberkopec/2d1fcf77dc61e747438252e3895badf0
Links to previous emails in this series are at the end.
## DB Connection Math
Database connection pools. Everybody's favorite setting to forget about.
I had a client a while back whose entire Sidekiq installation had been brought to the ground - jobs taking 30+ seconds, throughput nearing zero - all because their database connection pool size was set to the old Rails default of 5 and their Sidekiq concurrency was the default 25.
Database connection pools are really confusing for people because you have to manage database connections at 4 different levels:
1. At the database
2. At each server/dyno
3. Per Ruby process (this is where the ActiveRecord "pool" setting takes effect)
4. Per thread.
Let me talk a bit about each level.
First, there's the database. We have just one SQL database in most setups. A database can only handle so many connections. Generally, over 500 connections, things start to slow down if the server doesn't have enough CPU resources. Heroku, for example, enforces a 500 connection limit on all of their database plans.
The reason Heroku has a connection limit is because idle database connections are not free, and they want you to use a database connection pooler at the dyno level to reduce idle connection count. I've seen benchmarks showing that a MySQL database with 1000 idle connections is 1% as fast as a database with just 1 idle connection - it's that bad! Of course, more CPU resources means more of an ability to handle lots of idle connections.
What does a connection pooler actually do? Let's go down a level to a single server to find out.
Connection poolers manage our database connections on a per-dyno basis. I'm going to use pgbouncer, the popular postgres pooler (ha!), as an example but all database engines have similar projects. Pgbouncer is a proxy for database connections. It sits between your Ruby processes and your database.
Most pgbouncer deployments, such as the Heroku pgbouncer buildpack, run an instance of pgbouncer on each of your servers. All of its settings, therefore, are on a per-server basis (there is no "awareness" of what's going on in other servers).
Regardless of whether or not you use a connection pooler, the total connections to a database equals the number of connections per server times the number of servers. Total DB Connections = Connections per server * server count.
We also have a connection pool in each Rails process - this is the ActiveRecord connection pool. We set it in database.yml. Total DB Connections = AR Database Pool Size * Processes per server (usually set with WEB_CONCURRENCY or SIDEKIQ_COUNT) * Server count.
As an example, if you're using the default ActiveRecord pool size of 5, the default Sidekiq concurrency of 10, 5 Sidekiq processes per server, and you have 5 servers running, you'll use 125 database connections.
However, I already alluded to the idea that running a Sidekiq concurrency *higher* than the number of available database connections in the pool might be a bad idea. Why?
Threads are the things which actually need and use database connections. They drive the entire calculation. Think of threads as a single "context of execution", one little worker that executes jobs. All of our Ruby threads share the same memory, but they can execute entirely different jobs concurrently.
All of our Sidekiq threads need an ActiveRecord database connection.
If a thread can't get a free database connection within 5 seconds, you get a connection timeout error. That's bad. But even if they don't timeout, we don't want our threads to spend time *waiting* for a free database connection *at all* - that's wasted time. It may not raise an exception, but if each thread ends up waiting for a free database connection because concurrency is set to 10 but the pool size is just 5 and 5 threads are already using connections, then we're adding latency to job execution that doesn't need to be there.
In Sidekiq world, then, total DB connections = minimum(Threads That Need a Database Connection, AR Database Pool Size) * Processes per Server (WEB_CONCURRENCY or SIDEKIQ_COUNT) * Server Count.
Note the "minimum" function there. Setting your database pool to 100 and your Sidekiq concurrency to 10 won't use 100 connections, because there's only 10 threads to actually *use* the database connections you've made available.
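As a worked sketch of that formula, using the same illustrative numbers as the earlier example (pool of 5, concurrency of 10, 5 processes per server, 5 servers):
```
# Back-of-the-envelope Sidekiq connection math (all numbers illustrative).
sidekiq_concurrency  = 10  # threads per Sidekiq process
ar_pool_size         = 5   # the "pool" setting in database.yml
processes_per_server = 5   # e.g. SIDEKIQ_COUNT
server_count         = 5

total_connections =
  [sidekiq_concurrency, ar_pool_size].min * processes_per_server * server_count
# => 125, the same figure as the example above
```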
Now that we've got the definitions and the theory down, let's answer the two important questions:
1. What do I set my ActiveRecord pool size to, given a particular Sidekiq concurrency setting?
2. When do I need to use a connection pooler (such as pgbouncer), and what settings should I use?
**Most of the time, your ActiveRecord pool size should be exactly the same as your Sidekiq concurrency setting**. Setting the pool size to a number *smaller* than Sidekiq's concurrency means some Sidekiq threads will become blocked waiting for a free database connection. Setting the pool size to a number *higher* than Sidekiq's concurrency has no effect.
A great way to do this is to use a single environment variable to set thread counts across your entire codebase. Luckily, this is supported and encouraged by Sidekiq.
In database.yml:
```
production:
  url: <%= ENV["DATABASE_URL"] %>
  pool: <%= ENV["DB_POOL"] || ENV['RAILS_MAX_THREADS'] || 5 %>
```
In your Procfile or whatever you use to start Sidekiq:
```
$ RAILS_MAX_THREADS=10 bundle exec sidekiq
```
The magic of RAILS_MAX_THREADS is that Sidekiq will use it to configure its own concurrency if you haven't specified that anywhere else (like in sidekiq.yml) (https://github.com/mperham/sidekiq/blob/85a1be368486e22e17ee8a30bce8b4a8f7b9dca2/test/test_cli.rb#L36). So, we can use it to set our database pool size and Sidekiq concurrency at the same time!
Second, **you need to use a connection pooler if you have a large number of idle threads**. This does often happen with Sidekiq, as Sidekiq load can be very "bursty" as big batches of jobs get enqueued. The rest of the time, those idle database connections just add load to your database.
Pgbouncer has a lot of config settings (https://pgbouncer.github.io/config.html), but the main one is the `default_pool_size`. That's the number of *outgoing* connections pgbouncer will make to the database. Thus, the idea is to set this default pool size to some number *less than* your current Sidekiq concurrency times the number of sidekiq processes per server.
For example, if we have a Sidekiq concurrency of 10 and have 4 Sidekiq processes on our server, we might set the pgbouncer pool size to something like 20 (half of 10 times 4).
What happens, though, when *all 40* of those threads need to talk to the database at the same time?
What connection poolers CAN'T do is reduce database load from *active* connections that need to *do work*. If, like in the previous example, you have more Sidekiq threads that want to talk to the database than you have connections in the pgbouncer pool, those Sidekiq threads will have to wait until a connection becomes available.
Don't bother with pgbouncer or other connection pools until you absolutely are required to by your database provider, or you exceed 500 total connections. It's generally not worth the hassle and, frequently, as applications scale, they have lots of *active* database connections and not lots of *idle* ones, and so a connection pooler doesn't solve their problems.
If you're struggling with connection limits, particularly on Heroku Postgres, consider Amazon RDS, which allows higher connection limits in the thousands on their higher-end database plans.
One other approach might actually be to reduce Sidekiq's concurrency setting.
I delved into Sidekiq's concurrency setting very deeply in a recent email newsletter (here: https://gist.github.com/nateberkopec/b0a10f2f5659b76c6e52a129f03fb3b2). Summary: The best Sidekiq thread count setting for you depends on the percentage of I/O done in a job. Sometimes, reducing Sidekiq concurrency is better than increasing it in order to reduce database connection load, because each additional Sidekiq thread has less additional benefit than the one that came before it, but it still uses 1 more db connection. Marginal costs are linear, but marginal benefits are diminishing. For example, 10 might be an appropriate compromise for someone trying to save on DB connections, even if a Sidekiq process at concurrency 20 could process 25% more jobs than one at concurrency 10.
Hello Rubyists!
Welcome to the fifth part of my Sidekiq in Practice email series. This email series is intended to be a "missing manual" for running Sidekiq in production, a guide to all the little details of running and scaling our beloved background job framework. I hope you're all finding it useful. I'm really enjoying your comments and questions as well - please do remember that you can reply directly to any of my emails to contact me.
This week's email is about why many apps experience high memory usage, bloat and leaks with Sidekiq.
This email is available as markdown/Github Gist here: https://gist.github.com/nateberkopec/56936904705da5a1fa8e6f74cb08c012
Links to previous emails in this series are at the end.
## Memory Bloat and "Leaks"
I get this one a lot. "My Sidekiq instance uses 14GB of memory!" "My Sidekiq instance has a memory leak, it's growing out of control!" I'm not sure why, but everyone seems to blame Sidekiq first.
I guess it could never be the perfect, 100% bug-free code we're all writing, right?
Well, in reality, most Sidekiq memory issues are caused by **your own code loading too many objects into memory at once** or **by the system memory allocator**.
Loading lots of objects at once tends to result in memory "cliffs" - memory usage that was 1GB one minute and 2GB the next, but the memory doesn't go back down. I call this "memory bloat". Allocators can cause long, slow growth over time that many assume to be a memory leak (it isn't). This is memory fragmentation.
Let's talk through each problem in turn, but first: the root cause of both of these issues.
### A Root Issue: Ruby's Memory is Disorganized, and Immovable
A question I get a lot is "why isn't my memory usage going *down*?" You allocate a million objects in a controller action - well, ok, fine, but shouldn't the garbage collector get rid of all of those objects so that memory usage goes back down to what it was before? Most are surprised when memory usage *doesn't* go back down after Jane in Accounting runs a CSV export that loads 10,000 Users into memory at once. This is what we have GC for!
Remember that the Ruby runtime (that most of us use) is just a C program. We say we're running Ruby in production, but we're really also running C in production. Many (most) C programs suffer from a critical memory issue: they can't move things around very easily in memory. This is because of a C language feature called pointers. Pointers are just raw virtual memory addresses. Often, C programs pass around these pointers and *expect some data to be there, no matter what*. This is how Ruby's memory management works internally.
C extensions in particular often hold pointers to data in the Ruby heap. If Ruby moved that data somewhere else, you would probably crash Ruby as the data in that memory address wouldn't be what the C extension expected.
So, in general with Ruby, we can't move objects around once they're created. Memory is more or less immovable.
Now, the second characteristic that contributes to these issues is that memory is frequently much more disorganized than we think. For example, consider that CSV action that allocates 10,000 user objects. You may think that's not a problem - even though they're immovable, your virtual memory will just look like 10,000 user objects packed together like sardines, all neat in a row.
Unfortunately, that's not how it works in practice. Creating a single ActiveRecord object is an extremely complex process that involves object allocations on several different layers of abstraction: in the database driver, in Arel, in ActiveRecord itself. There are several *caches* as well in ActiveRecord that are being filled with entries each time you create a new ActiveRecord object. So, far from 10,000 user objects with nothing in between, the total number of objects created during your 10,000 user export is probably only 5-10% User objects, and there's a bunch of objects in between them. Some of those objects may be active and long-living, such as cache entries.
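If you want to see this for yourself, a rough way to count allocations around a chunk of code is GC.stat (standard library only; the exact numbers will vary wildly between apps and Rails versions):
```
# Rough allocation-counting sketch - the delta includes everything allocated
# by the code in between, not just the User objects themselves.
before = GC.stat(:total_allocated_objects)
users  = User.limit(10_000).to_a
after  = GC.stat(:total_allocated_objects)
puts "allocated #{after - before} objects for #{users.size} users"
```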
When it comes to freeing memory, operating systems really need chunks of 4kb or so completely free in order to reclaim that memory from a program. Even one little byte of active memory in a 4kb memory "page" means the program must hold onto that memory.
So, now you're seeing the problem:
1. Ruby's heap is disorganized, and long-lived objects often are allocated right next to ones that won't live through the next GC.
2. Ruby cannot move objects around in the heap.
3. The operating system needs a contiguous 4kb chunk of free memory to reclaim, otherwise the program keeps the memory.
That's a recipe for slow, steady memory growth over time.
Aaron Patterson (GitHub) has been investigating approaches to fixing #1 and #2 for several years now, but it's hard work and is still ongoing.
This week, a blog post from Hongli Lai has added some further understanding to this issue. That post is here: https://www.joyfulbikeshedding.com/blog/2019-03-14-what-causes-ruby-memory-bloat.html
Hongli identified another cause of memory bloat: `malloc` doesn't release space back to the operating system if that free space is in the middle of the heap. `malloc` only releases space back to the kernel if that free space is at the *end* of the heap.
Hongli's patch is interesting, but there's a lot of study yet to be done on that, so I won't get too much further into it here.
This means that the position of the last, live, un-freed memory in your heap is *extremely important*. It also greatly explains why *Ruby's long-term memory usage tends to equal its maximum instantaneous memory pressure*. That is, in the long run, your Ruby process will use as much memory as it needs to at any possible moment, and it won't tend to use much *less* than that.
Our updated understanding of the problem now looks like this:
1. Ruby's heap is disorganized, and long-lived objects often are allocated right next to ones that won't live through the next GC.
2. Ruby cannot move objects around in the heap.
3. The operating system needs a contiguous 4kb chunk of free memory to reclaim, otherwise the program keeps the memory.
4. The default Linux `malloc` will not release memory back to the OS unless it is at the *end* of the Ruby heap.
Next week, I'll talk about what this understanding of the problem means for our possible solutions.
## Questions/Comments from Last Week (Database Connections)
Remember, you can always reply to this email to send me your own questions and comments.
From Andrew Babichev:
> Some libraries, e.g. globalize, eagerly grab an AR connection in the main (master) thread before Sidekiq spawns worker threads. Particularly, globalize gathers info from the translation table during model class load, on the translates macro/directive/declaration/class method call. Hence it's essential to have 1 extra connection in the AR pool for the master thread. Of course, this is a specific AR extension gem subject (and its laziness considerations), however people pretty often forget about this kind of problem, keeping Sidekiq concurrency exactly equal to AR pool size.
Great point Andrew. If you're getting Connection Timeout errors when Sidekiq concurrency equals AR pool size exactly, this may be happening. Add extra connections to the DB pool until the errors stop - not a big issue.
## Plain text?
I'm wondering if I should switch back *off* of plain-text email to HTML. I switched to plain text originally because I enjoy the privacy it affords you - it's basically impossible for me to track you at all, and if I try to track your link clicks it's very obvious. However, in long-form emails like this the format becomes basically unusable, since email client formatting is pretty bad across the board for readability.
What do you think? HTML email of course doesn't *require* me to track you in any way, but it makes it a little easier. I sort of like the purity of the plaintext format, but if it results in unusable content, then there's no point?
Please reply to this email with your thoughts.
Hello Rubyists!
Welcome to the second part of my many-part series on Sidekiq in practice, based on the many years of experience I've had working with Sidekiq in production on client apps. The first email was very warmly received - thank you for all of your comments!
Eventually, I'll compile all of these emails into a short book that I'll be selling. So, slightly nicer formatting, maybe some other additional content and goodies will be included too.
This week's email is about idempotency.
You may view the previous email in this series, on Sidekiq's concurrency setting, here: http://eepurl.com/ghmV81
I have two quick points of clarification from that email:
Mike Perham wrote me on Twitter to say that he's skeptical of thread counts above 50, and that "people have reported instability" in MRI with thread counts that high. I was not able to find any reports online of anyone having issues with high thread counts in MRI. The only issue I could find on the Ruby core mailing list related to this was someone reporting a segfault with a high thread count, but it turned out that the error was actually that they were running out of memory.
I still think very high (50-128) concurrency settings may be appropriate for some users with very high I/O workloads. That said, most applications fall in the 50-75%-of-a-jobs-time-is-in-IO range, so the default setting of 10 is great for most (as I mentioned in the email). I do not have any evidence that high thread counts in MRI are unstable, but, as I mentioned (and as others have found), they will use quite a lot of memory. More on memory reduction in another email.
Second, I got an email from Benoit Daloze of the TruffleRuby project kindly reminding me that Ruby != MRI/CRuby, and that the Global VM Lock is a feature of the C Ruby runtime, *not* of the language itself, so alternative implementations like JRuby and TruffleRuby have no limit on the number of threads which can execute Ruby code in parallel. By the way, did you know TruffleRuby runs Sidekiq now? (https://twitter.com/nateberkopec/status/1096878033762365442)
Right, onto our main topic!
## Idempotency ##
It's a big computer-science word, that. Idempotency. It scares people into thinking it's more complicated than it really is. But, it can *really* save you a lot of headache and trouble if you understand it and can implement it in all of your Sidekiq workers.
Simply put, an *idempotent operation* is one where, if it is executed twice (or any number of times), the *result* or *end state* remains the same.
Multiplying a number by 1 is idempotent. No matter how many times you multiply any number by 1, you still have the same number. Imagine a Sidekiq job that took a row in the database, multiplied a number in the row by 1, and then saved the row. You could perform this job an infinite number of times and the result would be the same.
Multiplying by 2 (x * 2) is *not* an idempotent operation. Start with x = 1. Multiply by 2. Now you have 2. Perform the operation again - now you've got 4. And so on.
Mathematically, we could express an idempotent operation as f(f(x)) = f(x).
Imagine standing at a crosswalk. You push the button for the walk signal. This is an idempotent operation. You can push that button as many times as you want, but the operation is done, and you will not encourage the walk signal to appear any sooner. You could imagine that the pseudocode would be something like:
def push_walk_button
  @walk_requested = true
end
No matter how many times you run that method, the end result won't be any different. **The state remains the same**.
Why is this important for Sidekiq?
In practice, "exactly-once" delivery of data in distributed systems is extremely difficult, some may even say impossible. This means that messages in the system (in Sidekiq's case, job tuples of [SomeJobClass, *args]) may be delivered to the end consumer 0, 1, or many times. Sidekiq is generally designed to be an "at least once" system. If you enqueue a Sidekiq job, it will be executed at least once, but it may be executed multiple times.
This is where people get into trouble: how do you ensure that something happens *exactly once* in Sidekiq? For example - sending an email to a customer, or importing some users from an external service, or charging a user's recurring subscription?
While I just talked about how Sidekiq is generally an "at least once" system, I was talking mostly about "reliability" guarantees against things like people unplugging machines from walls and SSDs going up in puffs of smoke. In practice, the problem most people have with "at least once" delivery is *their own code* enqueueing the same job twice. For example, consider the following controller:
def create
  @user = create_user!(params[:user])
  SendWelcomeEmailJob.perform_async(@user)
end
What might happen if someone hits the form submission button twice in a row (and create_user! succeeds thanks to some poor data validation)? You guessed it, two welcome emails!
The first tool I see most people reach for to solve this problem is job uniqueness plugins. Essentially, this is attempting to outsource idempotency to someone else. The most popular plugin is sidekiq-unique-jobs: https://github.com/mhenrixon/sidekiq-unique-jobs
Job uniqueness plugins generally work by creating a global lock around the tuple of [WorkerClassName, arguments] while the job is in the queue. So, the same job can't be enqueued with the same arguments.
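Conceptually, the lock these plugins take amounts to something like this rough sketch (not sidekiq-unique-jobs' actual implementation, which is much more involved): a Redis SET NX on a digest of the class and arguments, checked before every enqueue.
require "digest"
require "json"
require "redis"

redis = Redis.new
user_id = 42 # stand-in for the newly created user's id

# Rough sketch only: hash the [class, args] tuple into a lock key...
digest = Digest::MD5.hexdigest([SendWelcomeEmailJob.name, [user_id]].to_json)

# ...and take the lock with an expiry. This is the extra Redis round-trip on
# every enqueue; a real plugin also has to release the lock when the job runs.
if redis.set("unique:#{digest}", 1, nx: true, ex: 24 * 60 * 60)
  SendWelcomeEmailJob.perform_async(user_id)
end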
In theory, a job uniqueness plugin should eliminate the need for the programmer to think about idempotency. Just mark all your jobs as `unique: true` and you're done, right? No welcome email will ever be sent twice! Voila!
There are two major problems with this approach: uniqueness can only be a best effort, and not guaranteed, and all of these unique job plugins greatly increase Redis load and enqueueing latency. I'm not going to talk too much about the former - just know that if you're running a bank on Sidekiq (!!!) you can't slap job uniqueness on your workers and be sure that everyone's deposits and withdrawals will only enqueue once.
The second part is more insidious. Job uniqueness *really* increases load on Sidekiq. Essentially each time you enqueue a job, you're creating a new lock. This can really get out of hand if you've got deep queues, too, since these locks live for as long as the job is in the queue. I'm going to talk more about locks in a later email, but they're relatively expensive on Redis. I've seen job uniqueness 10x a client's Redis load. And, as you may already know, Redis is designed to scale vertically (bigger and bigger single machines), so this is an expensive problem to create for yourself!
Any job uniqueness plugin will also increase the time it takes to enqueue a job. At least one additional network round-trip is required ("is this job already enqueued?"), doubling Redis load on enqueuing and doubling enqueuing latency.
What would I have you do instead?
Next week, we'll take a look at concrete examples of idempotent and non-idempotent workers, and how to make almost any worker idempotent.
Until then,
-Nate
Hey-o Rubyists!
This is part 6 of my Sidekiq in Practice email series. It's all about running our
beloved background job framework in production; a series of tips, tricks and concepts
that will help you to scale your application on Sidekiq.
This is the follow-up to the previous email in the series about memory. In that email,
we talked about some of the causes of excessive memory usage. In this email, we'll
be talking about the *solutions* to those problems.
This email is available as markdown/Github Gist here: https://gist.github.com/nateberkopec/62e318fdf0a48ed6880fd861b3def55b
Links to previous emails in this series are at the end.
### Memory Bloat: Allocate Less at Once
In the last email, I said:
> 1. Ruby's heap is disorganized, and long-lived objects often are allocated right next to ones that won't live through the next GC.
> 2. Ruby cannot move objects around in the heap.
> 3. The operating system needs a contiguous 4kb chunk of free memory to reclaim, otherwise the program keeps the memory.
> 4. As recently discovered by Hongli Lai, `malloc` only releases space back to the kernel if that free space is at the *end* of the heap.
These four factors lead to Ruby's characteristic memory behavior - long, slow logarithmic growth
over time that *approaches* an asymptotic limit but never quite reaches it.
The four factors of Ruby memory growth are *greatly* aggravated by one common allocation pattern.
This pattern is extremely common in background jobs as well. I think, as I describe it,
you'll immediately think of at least one background job that does this in your application.
Everyone's got one!
The pattern is allocating a massive, massive collection.
You know, that one job that loads all the users from the database before it sends them all an email?
That one. The job that exports a massive CSV file by loading it all into memory first. The one
that loads every company in the database before making a calculation. Literally every codebase I work
on has at least one of these.
They're pretty easy to spot in your production metrics, too. They leave behind a massive "cliff" in memory usage. Before these jobs get run, you're using 256MB of memory. Afterward, 512MB, 1GB, sometimes even worse.
If you're using Scout (https://scoutapp.com/), you can sort your background jobs by "maximum number of allocated objects". **Background jobs that create 1 million or more objects are bad** - you'll need to figure out ways of holding fewer objects in memory at once to reduce the memory impact of these sorts of jobs. More on the solutions later.
In New Relic, the way I identify these jobs is to first navigate to the memory usage of 1 of my worker instances. Once I've got that graph, I zoom in on one of these memory "cliffs" by clicking and dragging on the graph. This will narrow the time period that New Relic is looking at. I zoom in until I'm looking at only about 10 minutes worth of data. Then, I navigate to the transactions tab and see what jobs ran during this 10 minute timeslice. Usually, there are 1 or 2 which have an extremely long execution time (10 seconds or more). 99% of the time, that's the culprit.
And thanks to the 4 factors I listed above, memory won't be returned to the operating system and memory usage will remain constant after this massive growth, even though all of the objects created have been freed and there's no memory leak.
So, now you know which job is the one that's causing the issue, but how do you know what lines of code are the actual culprits? Well, I go through that process in The Complete Guide to Rails Performance (www.railsspeed.com) but truthfully, most of the time you don't need a full forensic accounting of every allocation. The problem is almost always 1 massive collection. Usually, it's pretty obvious where that is in a job.
90% of the time it's an each loop or an Enumerable call (like map or reduce). It looks like this:
```
User.some_scope.each do
```
If it's an ActiveRecord collection, one fix is pretty simple - use find_each instead (https://api.rubyonrails.org/classes/ActiveRecord/Batches.html):
```
User.some_scope.find_each do |user|
user.do_awesome_stuff
end
```
Rather than loading *every* user that matches User.some_scope into memory at once, find_each loads
them in batches of 1000. This drastically cuts down on the maximum memory pressure of these jobs.
The second reason I've seen for high-memory-bloat jobs is accidentally copying a collection.
There are many methods which copy the entire collection before performing an operation on them. They do this because these methods are intended to return a new object rather than modify the existing object. Here are some examples:
```
["a","b","c"].map # copies the array and everything in it
["a","b","c"].map! # modifies the array IN PLACE
```
Effectively, the "non-bang" versions of methods have *twice* the memory requirement of a "bang!"
version. This normally isn't a big deal. However, if the collection you're calling it on
is very large, then we could be talking about the difference between 1 million and 2 million
objects!
OK, so the most common ways to fix this are:
1. Reduce the number of objects you add to a collection at once, usually with something like `find_each`
2. Use `!` versions of methods on large (100k+ object) collections to modify them in-place
And finally, the third thing I'll do is to "fan-out" wherever possible. Fanouts use 2 job classes: one job class is the "organizer", which looks up all the records or items in a collection you need to do work on, and the 2nd job class is the "do-er", which performs the operation for one of those records or collection items.
I love fanouts for a lot of different reasons.
First, they clean up the code a bit. One job has the responsibility to gather up the data. The second job works on just one datum from that dataset. It's a nice division of work. SOLID and all that.
Second, fanouts *embrace the distributed nature of Sidekiq*. Calling `.each` on a 1 million element collection and having your job take 5 minutes to complete is *not distributed computing*. A fanout is also *way way faster*, since the actual work is done in parallel rather than serially.
Third, fanout jobs usually use much less memory than a single-job approach, which is the point of this email :)
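Here's a minimal sketch of what a fanout might look like for the "email every user" job from earlier. The class names, `some_scope` and `send_daily_newsletter` are placeholders, not anything your app necessarily has:

```
# The organizer only loads ids, in batches, and enqueues one small job per record.
class NewsletterFanoutJob
  include Sidekiq::Worker

  def perform
    User.some_scope.select(:id).find_each do |user|
      NewsletterSendJob.perform_async(user.id)
    end
  end
end

# The do-er loads a single record and does the actual work.
class NewsletterSendJob
  include Sidekiq::Worker

  def perform(user_id)
    User.find(user_id).send_daily_newsletter
  end
end
```

Each do-er job only ever holds a single record in memory, and the work is spread across every Sidekiq thread and process you have.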
## Two Quick Stories
Okay, to close this out, I've got two quick stories about memory bloat that I've gathered along the way in my client work.
One client had a background process that would run for 10+ minutes and grow in memory usage over the course of those 10 minutes. Usually with memory-bloated background jobs, you see a big spike in memory usage as the collection is created, and then memory usage stays constant. Not so here - this job would increase Sidekiq's memory usage by 10 MB a minute!
It turned out that it was iterating over a collection and doing something like this:
```
user = User.first
imported_data = a_big_csv_of_data
imported_data.each do |row|
  user.posts.find_or_create_by(row)
end
```
What was happening was that internally, `user.posts` was getting a large Post object added to it during every iteration of the loop. Since `user` and `user.posts` were always in scope, that huge `user.posts` array couldn't be garbage collected!
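One possible fix here - a sketch, assuming each row is a hash of Post attributes - is to go through the Post class directly, so nothing accumulates in the `user.posts` association cache between iterations:

```
user = User.first
imported_data = a_big_csv_of_data
imported_data.each do |row|
  # Creating through Post directly avoids appending every new record
  # to the loaded user.posts association target.
  Post.find_or_create_by(row.merge(user_id: user.id))
end
```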
Second story. A client had a Sidekiq job that would use massive amounts of memory. We sat down and worked through it using the memory profiling skills I discussed in the CGRP, and discovered something like this:
```
massive_collection.map { ... }
```
There wasn't really anything we could do about `massive_collection` easily - it would have required a pretty big rewrite. But one thing we could do easily was modify the collection in-place with `map!` rather than `map` - problem solved.
## Previous Emails in the Sidekiq-In-Practice Series:
* The Concurrency Settings http://eepurl.com/ghmV81
* Idempotency: Problem http://eepurl.com/gif1KH
* Idempotency: Solution http://eepurl.com/giY8L5
* Database Connections http://eepurl.com/gjXA55
* Memory: Problem http://eepurl.com/gkTHcX
Database, Ruby, Memory - the three areas to check when speeding up Ruby apps.
A little bit ago, I blogged about the non-technical reasons why Rails apps are slow (https://www.speedshop.co/2019/06/17/what-i-learned-teaching-rails-performance.html). Those are the primary reasons - the main ones. However, they're not that helpful when you're sitting down and trying to make the Rails app in front of you any faster. "Alright, Nate, I work in a messed up software organization, but what do I do about this 500 millisecond controller action right in front of me?"
Well, that's an entirely different checklist altogether! But it's a bit simpler and easier to run through.
When trying to make any Ruby code faster, I investigate these three areas, in order:
1. Database (not just time spent waiting on the database, but also time spent in ActiveRecord)
2. Ruby (time spent executing Ruby code)
3. Allocation (creating objects in memory)
So, let's talk about each area a little bit.
I always start with the database. It's the primary reason Ruby applications get slow. Cavalier usage of ActiveRecord leads to unnecessary queries, and even queries which "hit the cache" are far slower than their properly eager-loaded equivalents.
One example of poor database usage is covered in this article I wrote about count, exists and present?
count and exists? always execute a query. They're a common cause of unnecessary round-trips to the database or N+1s. That article goes pretty deeply into their correct usage.
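As a quick sketch of that behavior (the article has the full details), the difference matters most when the relation has already been loaded:

```
posts = Post.where(published: true)
posts.load      # runs SELECT "posts".* ... and caches the records

posts.count     # runs a second query: SELECT COUNT(*) ...
posts.exists?   # runs yet another query: SELECT 1 AS one ... LIMIT 1
posts.size      # uses the already-loaded records - no query
```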
In order to profile database usage, my weapon of choice is Rack-Mini-Profiler. RMP profiles every SQL query that hits the database and shows you how long it took *and* where it came from - which line numbers and which parent calls triggered that query. It's invaluable for figuring out where to eager load properly.
Next, I take a look at the time spent in Ruby itself. This is a bit more complex and involved, and it's often *not* the biggest area for improvement, which is why I start in the database layer (we're increasing cost and reducing benefit as we get down this list).
To investigate time spent in Ruby, we use Ruby profilers, such as stackprof, Ruby-Prof, rbspy and more. I prefer stackprof and sometimes ruby-prof. stackprof is extremely easy to use if you're using rack-mini-profiler, since it's built-in.
I want to know what my Ruby program or controller action or background job or whatever is actually doing. Sometimes these profilers make things incredibly obvious - oh, you're spending 20% of your time reading a YAML file on every request, for example. Sometimes it's less obvious, especially if you're suffering from an only-slightly-slow function that's called 500+ times during the profile. But figuring out what the Ruby code is actually doing is the second most important thing we can learn when asking "why is this Ruby slow?"
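If you're not going through rack-mini-profiler, a minimal standalone stackprof run looks something like this (the profiled block and the YAML file are just hypothetical stand-ins for whatever you suspect is slow):

```
require "stackprof"
require "yaml"

StackProf.run(mode: :wall, out: "tmp/stackprof.dump", interval: 1000) do
  # the code you want to profile goes here, for example:
  1_000.times { YAML.load_file("config/settings.yml") }
end

# Then inspect the results from the shell:
#   $ stackprof tmp/stackprof.dump --text
```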
Third is allocation. I think there's often an inordinate focus on garbage *collection*, but what more people should be focusing on is garbage *creation*. Creating objects is far from free - in fact, it's pretty expensive.
The Ruby core team uses a particular benchmark to measure their progress on Ruby 3x3 - the goal that Ruby 3 will be 3x as fast as Ruby 2. This benchmark essentially emulates an old NES console. It looks like we'll reach that goal - the benchmark currently runs about 2.5-2.8x faster than Ruby 2. However, this benchmark doesn't really create very many objects. It just generates a lot of CPU instructions and keeps memory access and allocation pretty low. This, in my opinion, is one of the primary reasons why that benchmark performs so much better than Rails benchmarks. Rails applications are only running about 70% faster on the latest version of Ruby when compared to Ruby 2, according to Noah Gibbs, maintainer of Rails Ruby Bench. So where did the rest of that improvement go? Object allocation!
Richard Schneeman gave an excellent conference talk this year on reducing object allocations to improve performance. You can see the RailsConf version of that talk here.
So, that's my usual process for making Ruby faster. Database, time in Ruby, memory allocation. I guess you could call it the DRM method (even though I hate DRM and love Ruby perf!). It's up to you, as the application developer, to figure out what work can be done more efficiently in those areas without compromising the correctness of the program. It still has to do what it has to do!
Until next time,
-Nate
Last week was the first week of my summer workshop tour. Part of the reason I do the workshops is because performance work is often much harder in practice than it is in theory, and I like to "be there" to help attendees when they get stuck applying performance improvements to their complex, real-world Rails apps.
Most applications that even need to start thinking about performance are successful legacy applications - the apps are often 4 years old or more, the business has now achieved product market fit and is making decent money, but now the app feels slow or is costing a lot of money to deploy. The old mantra: "Make it work, then make it clean, then make it fast" - they've now reached that final step. But doing performance work on that kind of application is a hell of a lot harder than blog posts and even my course can sometimes make it look.
Fixing N+1s with includes and eager_load is the perfect example.
Last Friday was the inaugural running of my ActiveRecord performance workshop, and one of the attendees was struggling with an N+1. It was easy to identify the final callsite of the N+1 (the line of code that actually triggered the query), but it was far, far harder to figure out where to insert the includes call to fix it.
includes and eager_load are actually pretty easy to understand conceptually, and your run of the mill intro blog post will make it seem like a piece of cake.
You've got a blog app that has posts and comments. If you want to render the comments for each post, make sure you Post.includes(:comments). Bam.
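In code, the toy-blog-app version looks something like this:

```
# N+1: one query for the posts, then one query per post for its comments
Post.all.each do |post|
  post.comments.each { |comment| puts comment.body }
end

# Eager loaded: two queries total, no matter how many posts there are
Post.includes(:comments).each do |post|
  post.comments.each { |comment| puts comment.body }
end
```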
However, in the real world, the distance between where the call is actually triggered and where the includes has to go is often several callstack layers apart.
In this particular example at my workshop on Friday, there were about 5 or 6 layers of indirection between where the includes call needed to go and the final callsite!
This is part of why I tell people to "bring the Frankenstein app" to my workshops. If I teach you how to use includes on a simple, toy app and then you go home, try it on your Frankenstein app and get stuck, I haven't done my job.
Anyway, figuring out where includes or eager_load goes is something of an art that requires a bit of knowledge about how ActiveRecord lazily prepares queries and a mental model of what's happening at each layer of the callstack. Working through specific examples together really seems to help people "get it".
So, if you've read a blog post (or even part of my course) and struggled to go implement it on your own app, know that you're not alone. Applying the concepts to a complex app is hard, and you'll need to do it a few times to get the hang of it. Stick with it!
-Nate
Performance work is mostly the application of the scientific method to making a software program faster or more resource-efficient.
First, we start with an observation. One possible observation might be that the program uses a lot of memory. Most Rails processes use about 512MB of resident set size, so a process using more than 1GB usually has a memory consumption issue. Or, as another example, maybe we open our New Relic dashboard and see that a particular controller action has an obvious N+1, because the User table is looked up 30 times on average during that action.
This observation about the state of the world moves us into the second stage of the scientific method - generating hypotheses. Sometimes, we can do this without the aid of any tools at all. For example, I know a lot of stuff from looking at hundreds of Rails applications over the years, and can form hypotheses about performance with very limited observations. I know, for example, that a Sidekiq process with 3GB of memory usage and a concurrency setting of 25 probably is experiencing a bad interaction with glibc malloc and should switch to an alternative memory allocator.
However, most of the time we need more observations and more data to generate better hypotheses about what is happening. This is what profiling is for - it's a bit like a microscope. It makes very minute behaviors observable, and helps us figure out what is going on. We can account for every millisecond or megabyte, down to the level of an individual line of code. This level of detail gives a very clear picture.
Profiling helps us to form a very detailed and precise hypothesis about what's wrong with the application. Let's say our profiler has identified that we spend 30% of our memory usage on a single line of code in the application. That's good to know. Now what?
You could just make a change that you think will fix it, and then deploy it to production. But that's skipping a step of the scientific method, which is to test our hypothesis against the real world by doing an experiment.
This is what a benchmark is. Profiling isn't the "real world", because it introduces a lot of overhead and makes our code a lot slower. Benchmarking, by comparison, has almost no overhead at all and can tell us the "truth" about our proposed change.
We craft a benchmark to test our hypothesis. In the case of a line of code allocating a lot of objects, I would probably write a custom benchmark using ObjectSpace.count_objects:
GC.disable
before = ObjectSpace.count_objects
# code to be benchmarked goes here
after = ObjectSpace.count_objects
# count_objects returns a Hash of counts keyed by object type, so diff it key-by-key
after.each { |type, count| puts "#{type}: #{count - before.fetch(type, 0)}" }
You could run this code by using ruby on the command line ("ruby my_benchmark.rb") or rails runner ("rails runner my_benchmark.rb") if you wanted to benchmark some code from a Rails application.
Now we have a test of our change. Usually, I would encourage you to check this test in to a benchmarks folder ("/benchmarks") and post the results in the pull request that you're making to your project.
This process is good, but of course, it can't be perfect. There will always be differences between our development machines and our production environment. But generally, the effect sizes will be similar, if not exactly the same. If they're not the same, you'll learn a lot by trying to track down why!
Then, we'll analyze what effect our change has in production using the same production metrics we used to start the whole investigation.
This process is simple and timeless. It's the scientific method:
* Observation (Noticing behavior in production or development)
* Hypothesis (Profiling to figure out what's causing the behavior)
* Test (Benchmarking)
* Analyze (Observing new production behavior)
There are major downsides to doing this process out of order. Premature optimization is the act of skipping this process altogether and making changes to applications without regard to observation or testing. Benchmark-driven development is the act of using microbenchmarks for things like appending to arrays or concatenating strings, skipping the observation and hypothesis steps, and then applying those changes to your application.
The scientific method isn't broken, so don't fix it - increase the rigor of your performance process and reduce wasted work and time by following its steps every time.
Talk to you next time,
Nate
Why work on performance? There are just two reasons.
It's really important to know why you're investing in performance work, because there's really only two reasons to do it.
The first is to improve the customer experience - to make the site "feel" faster. The second is to decrease your operational costs.
Knowing which of these two objectives is the more important one for your app and your business is really important, because they lead to completely different priorities and objectives in your performance work.
Deciding between these two priorities isn't too hard. If you're spending too much money, you want to spend less. I think most companies should be spending an amount of money on their service hosting (your "all-in" cost, essentially your total Heroku or AWS bill) that's equal to or less than their requests-per-minute. So, if you've got 1,000 RPM, you can probably spend $1k or less on your hosting per month. If you're doing worse than that, and reducing that cost would make a difference to your company, you can work on that.
Knowing when to work on customer experience can be a little more complicated - you generally want to fix your site being too slow *before* people complain about it. But the link between website latency and revenue is fairly well documented. If it feels slow to you, it feels slow to your customers, and they like using your site less, and they'll use it less!
So, if you want to focus on reducing operational costs, what do you do?
This goes into what I was talking about in Monday's email: Little's Law and offered traffic (Erlangs). These two very simple formulas establish the relationship between latency, horizontal scale, and the load your service can handle. They both use the same form: concurrent traffic equals average latency times the arrival rate of requests. For example, a service that has 100 requests per second and an average latency of 500 milliseconds will, on average, be serving 50 requests in parallel at any point in time. 50 parallel requests means you need to have the server capacity to handle that - not only in terms of EC2 instances and dynos, but also in your database plans. Reducing average latency decreases concurrent traffic and therefore reduces costs. I gave a talk at RubyKaigi this year about this very topic, and that talk has just been uploaded here.
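In code, that worked example is just one multiplication (numbers taken from the sentence above):

```
arrival_rate    = 100  # requests per second
average_latency = 0.5  # seconds per request
concurrent_requests = arrival_rate * average_latency
# => 50.0 requests being served in parallel, on average
```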
Focusing on operational costs means focusing on reducing average latency and configuring everything in your stack for maximum throughput. Frontend is irrelevant, because you're not paying for that CPU time (phew).
If instead you choose to focus on improving the customer's experience, your performance work priorities have to change in a way you may find surprising: you have to focus on the frontend. Your customers don't experience your site by just reading the HTML source once it returns successfully to their browser - there are dozens and dozens of additional resources and scripts that have to be downloaded, laid out and rendered. This is why I focus so much on frontend in the CGRP. Once people learn how to profile their frontends, they're always shocked at how much of their frontend load time is *not* their backend responses, but other things like WebFonts and JavaScript.
One of the common mistakes I see here is people who want to improve customer experience focusing on backend response times. Even if your backend response times are extremely slow - let's say 1 second - they probably only account for about 25% to 50% of the total page load or navigation experience. Profile it if you don't believe me, but scripting, layout and rendering really add up.
Know what you want to work on, and know your performance priorities.
-Nate
I've learned a lot over the last 4 years of teaching Rails performance.
I wrote a blog post about the things I've learned about Rails software shops while trying to teach them performance over the last 4 years. You can read that post here.
One question I got yesterday was "how much of what you do is Rails-specific?" The question was in regards to my course, the Complete Guide to Rails Performance.
The answer is that shockingly little of what I teach is specific to Ruby on Rails. People who have purchased and run through the CGRP already know this, but I think many who buy that course are surprised with how little of it actually talks about Rails.
Much of what I do isn't even language-specific. All of what I teach on horizontal and vertical scaling, for example, is applicable not just to any language but to any system that queues and does work. That's a big bucket that includes databases, background jobs, even the CPU itself. It turns out that so many of the systems around us, both in computer programming and in the world at large, can be modeled by M/M/c queues.
I do think there's sometimes a bit of prejudice behind that question about my "Rails-specificity", though. It assumes that there's something unique that makes Rails applications slow.
Is it Ruby? I don't think so. I've blogged about that in the past.
Is it something in the framework itself? This one is more of a maybe. Most Rails applications have issues with their usage of ActiveRecord. It's possible that the ActiveRecord pattern encourages poor database usage. However, I simply haven't seen enough non-Active-Record apps to know if someone has divined the secret solution yet. I doubt that they have, but I could be convinced. In any case, there aren't any perf problems caused by ActiveRecord usage that are unfixable. These problems make up the majority of my Rails-specific content.
So much of what I teach is applicable to any Ruby application. I'd estimate 80%+ or more of my workshops and course content is not specific to Rails at all. Your profiler, for example, doesn't care what framework you're using - it's just profiling whatever runs!
If you've got a non-Rails Ruby app that you get paid to work on every day, do tell me about it. You can reply to this email and it goes right to my inbox. I hope to hear from you.
-Nate
Time Consumed - one of the first metrics I check on a new client app
At Monday's Rails performance workshop, I asked the attendees if they knew what "Time Consumed" meant. No one raised their hands, which I found shocking - I think Time Consumed is probably the most useful metric any application performance monitor can provide!
Time Consumed is a prominent feature of New Relic and Scout dashboards. Skylight has a similar metric, called Agony. It's actually a really simple metric.
For the selected time period - say, the last 24 hours - take the total number of requests the app has served and multiply it by the average response latency. That's your total "time consumed". So, if our theoretical app served 10,000 requests in 24 hours and the average time per request was 100 milliseconds, that's a total time consumption of 1000 seconds.
Then, for any individual controller action, we can do the same calculation and calculate what *percentage* of the total time consumed that action took up.
So, if our UsersController#index action was called 100 times over that same 24 hour period, and its average latency was 300 milliseconds, then it consumed 30 seconds. New Relic and Scout express this as a percentage of the total time consumed - so 30 seconds divided by the total time consumed (1000 seconds) is 3%. New Relic and Scout would both report that controller action as 3% of time consumed.
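Using the numbers from the example above, the arithmetic is simply:

```
total_time_consumed  = 10_000 * 0.1  # => 1000.0 seconds, app-wide
action_time_consumed = 100 * 0.3     # => 30.0 seconds for UsersController#index

action_time_consumed / total_time_consumed * 100  # => 3.0 percent of time consumed
```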
What's so powerful about that is that 99% of web applications follow a very interesting pattern: the top 10 controller actions, sorted by time consumed, usually make up about 80% of the total time consumed!
So that means that 80% of the time your application processes are busy doing work, it's just 1 of 10 controller actions. That's crazy!
This is no surprise to readers of the Complete Guide to Rails Performance, of course. Most real-world sets of numbers follow this exact pattern. In nature, exponential distributions are actually the norm - not normal distributions. The world is not normally distributed.
So, what do we do with this information? Our APM is telling us what controllers make up 80% of our total time consumed, what does that mean?
This actually relates back to the discussion of offered and carried traffic that I've talked about on this newsletter before. Thanks to Little's Law, we can infer that if a controller action takes 50% of our time consumed, that means that 50% of the time that one of our web processes is active and doing work, it's performing that particular controller action. It's also accurate to say that 50% of our carried traffic is consumed by that controller action.
This has huge implications for performance optimization for throughput. If you're trying to scale your application to reduce request queueing times and improve your ability to handle traffic, making a controller action that takes up 50% of time consumed 2x faster means that you've *reduced the entire carried traffic of the application by 25%*.
Another way of looking at it might be that "time consumed" is "time spent waiting on your application by your customers". If a controller action takes up 50% of total time consumed, speeding it up by 2x means your customers are spending a total of 25% less time waiting on your backend in the future.
Whenever I come in to a new client application, one of the first things I do is look at the APM dashboard of controller actions and sort by time consumed. This gives me a list of 10 controller actions where I can focus my performance work to have the biggest overall impact on customer experience and on scaling the backend.
I hope that was helpful. Talk to you again soon.
-Nate
Confusing and often maligned - what is the GVL and what does it mean for you as a Ruby developer?
One thing that I find confuses many Rubyists is the Global Virtual Machine Lock, or the GVL. It's a unique feature to CRuby, and doesn't exist in JRuby or TruffleRuby.
Why is it called the GVL?
One thing that's confusing right off the bat is the name - isn't it the GIL? Well, GIL stands for Global Interpreter Lock, and it's something that was removed from Ruby in Ruby 1.9, when Koichi Sasada introduced YARV (Yet Another Ruby VM) to Ruby. We'll talk more about that change later and why it swapped a GIL for a GVL, but for now, let's just set out that the correct terminology for over a decade now has been GVL, not GIL.
So, you may be vaguely aware that this thing called the GVL exists and that it has something to do with parallelism and threads. But what is it *actually*, and how does it affect how Ruby's threads work? Let's dive in.
The TL;DR is that only one thread in any Ruby process can hold the global VM lock at any given time. Since a thread needs access to the Ruby Virtual Machine to actually run any Ruby code, effectively only one thread can run Ruby code at any given time. However, your programs actually do a large number of things that don't need access to the Ruby Virtual Machine. The most important is waiting on I/O, such as database and network calls. These actions are executed in C, and the GVL is explicitly released by the thread waiting on that I/O to return. When the I/O returns, the thread attempts to reacquire the GVL and continues to do whatever the program says.
Let's break down each piece of this in turn.
Processes vs Threads
What's a process, and what's a thread?
Processes are instances of a computer program. That program is actually run by one or many threads. The process itself isn't really the thing that runs the code - it's more of a collection of shared resources that all of the threads use together.
The most important resources that a process owns are the code (which the threads will then take and execute), the memory, and the file descriptors (sockets, ports, actual files, etc.).
Threads use these shared resources to execute the code. These threads share their memory with each other, and run the code when they are scheduled to by the operating system's kernel. The Ruby runtime itself doesn't manage when threads are executed - the operating system decides that.
There is one additional shared resource in a Ruby process that is extremely important - the Global VM Lock. Despite the very authoritative name, there really isn't anything "global" about the GVL. Each Ruby process has its own GVL, so it might be more accurate to say that it's a "process-wide VM lock". It's "global" in the same sense that a "global variable" is global.
Think of the GVL like the conch shell in Lord of the Flies - if you have it, you get to speak (or execute Ruby code, in this case). If the GVL is already locked by a different thread, other threads must wait for it to be released before they can acquire it.
Let's talk a little bit about the thing we're locking - the Ruby Virtual Machine.
What are we locking? The Virtual Machine
A virtual machine is a little bit like a CPU-within-a-CPU. Virtual machines are computer programs that take usually very simple instructions, and those instructions manipulate some internal state. A Turing machine, if it was implemented in software, would be a kind of virtual machine. We call them virtual machines and not machines because they're implemented in software, rather than in hardware, like a CPU is.
So we have a Ruby Virtual Machine that takes a simple instruction set. The interpreter generates those instructions from the Ruby code you write, and then a thread feeds those instructions into the Ruby VM.
Before Ruby 1.9, Ruby didn't really have a separate virtual machine step - it just had an interpreter. As your Ruby program ran, it actually interpreted each line of Ruby as it went. Now, we just interpret the code once, turn it into a series of VM instructions, and then execute those instructions. This is much faster than interpreting Ruby.
You can see what Ruby's VM instruction sequences look like by using the --dump option on the command line.
You can execute Ruby from the command line using the -e option:
$ ruby -e "puts 1 + 1"
2
You can then dump the instructions for this simple program by calling --dump=insns:
$ ruby --dump=insns -e "puts 1 + 1"
== disasm: #<ISeq:<main>@-e:1 (1,0)-(1,10)> (catch: FALSE)
0000 putself ( 1)[Li]
0001 putobject_INT2FIX_1_
0002 putobject_INT2FIX_1_
0003 opt_plus <callinfo!mid:+, argc:1, ARGS_SIMPLE>, <callcache>
0006 opt_send_without_block <callinfo!mid:puts, argc:1, FCALL|ARGS_SIMPLE>, <callcache>
0009 leave
Ruby is a "stack-based" VM. You can see how this works by looking at the generated instructions here - we push the integer 1 onto the stack two times, then call plus. When plus is called, there are two integers on the stack. Those two integers are replaced by the result, 2, which is then on the stack.
I mentioned before that your threads don't need the GVL all the time. You can see this by searching Ruby's source code - in particular, note how often the GVL is mentioned in io.c. You can also see this in gems that have C-extensions, such as pg, the Ruby postgres gem. When doing IO, the C code that's being run must explicitly lock and unlock the GVL.
So, that's how the GVL works. What does it all mean, though?
The upshot - what it all means for you
I sometimes see inaccurate descriptions of Ruby's multithreading or parallel behavior because the author is playing "fast and loose" with terminology. Given all that you now know about the GVL, I'd like to be very precise.
Performing two operations concurrently means that the start and end times of those operations overlapped at some point. For example, you and I sit down to sign a contract. However, there is only one pen. I sign where I'm supposed to, hand the pen to you, and then you sign. Then, you hand the pen back to me and I initial a few lines. You might say that we signed the contract concurrently, but never in parallel - there was only one pen, so we couldn't sign the contract at the exact same time.
Performing operations in parallel means that we are doing those operations *at the exact same instant*. In my contract example, it would mean that there were two pens (and probably two copies of the contract, otherwise it would get a little crowded).
In CRuby, we can execute Ruby code concurrently but not in parallel. Only one thread can hold the GVL at any time, so parallel execution of Ruby code is impossible.
However, we can still do important things in parallel, such as waiting on I/O. It would be inaccurate to say that "Ruby is not parallel", because the runtime's threads *do* run in parallel when waiting on I/O. They're not running Ruby language code, but they're Ruby threads all the same.
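A quick way to see this for yourself is to compare CPU-bound and I/O-bound work across threads - a rough sketch, where sleep stands in for a database or network call that releases the GVL:

```
require "benchmark"

cpu_work = -> { 5_000_000.times { |i| Math.sqrt(i) } }
io_work  = -> { sleep 0.5 } # releases the GVL while waiting

# CPU-bound: threads barely help, because only one thread can hold the GVL at a time
puts Benchmark.realtime { 4.times.map { Thread.new(&cpu_work) }.each(&:join) }
puts Benchmark.realtime { 4.times { cpu_work.call } }

# I/O-bound: the waiting overlaps, so the threaded version finishes roughly 4x sooner
puts Benchmark.realtime { 4.times.map { Thread.new(&io_work) }.each(&:join) }
puts Benchmark.realtime { 4.times { io_work.call } }
```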
This also means in an important sense that Ruby is "multi-core". Whenever a thread does not need to hold the GVL, it can and probably will be scheduled onto many different cores by the kernel thread scheduler.
This also means that "how many threads does my Sidekiq or Puma process need?" is a question answered by "how much time does that thread spend in non-GVL execution?" or "how much time does my program spend waiting on I/O?" Workloads with very high percentages of time spent in I/O (75% or more) often benefit from 16 threads or even more, but more typical workloads see benefit from just 3 to 5 threads.
Adding more threads to a Ruby process helps us to improve CPU utilization at less memory cost than an entire additional process. Adding 1 process might use 512MB of memory, but adding 1 thread will probably cause less than 64MB of additional memory usage. With 2 threads instead of 1, when the first thread releases the GVL and waits on I/O, our 2nd thread can pick up new work to do, increasing the throughput and utilization of our server.
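In practice, that guidance just translates into your Puma and Sidekiq settings - something like the sketch below, where the numbers are illustrative defaults rather than recommendations for your specific app:

```
# config/puma.rb
workers Integer(ENV.fetch("WEB_CONCURRENCY", 2))
max_threads = Integer(ENV.fetch("RAILS_MAX_THREADS", 5))
threads max_threads, max_threads

# config/sidekiq.yml would set the equivalent for Sidekiq, e.g.:
#   :concurrency: 10
```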
For now, the GVL exists and is going nowhere. It's simply too complicated to remove it. Other languages have similar locks: CPython has one too. The vaunted V8 JavaScript engine essentially has a "GVL" too, except V8 can create multiple VMs (V8 calls them Isolates, but only one thread can access an Isolate at a time).
I hope this has been a complete introduction to the GVL for you - if you have any questions, you can reply directly to this email.
-Nate