Dial back instance counts
Measure the true steady-state memory usage of each instance, then size your instance count to fit the memory you actually have. Aim for about 300MB per instance.
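One rough way to watch steady-state usage is to sample the resident set size (RSS) of a running process over time. The sketch below is only an illustration: the helper names `rss_mb` and `sample_rss` are made up for this example, and it assumes either a Linux `/proc` filesystem or a `ps` binary that accepts `-o rss=`.

```ruby
# Returns the resident set size of a process in megabytes.
# Assumption: Linux exposes VmRSS in /proc/<pid>/status; elsewhere we fall back to `ps`.
def rss_mb(pid = Process.pid)
  status = "/proc/#{pid}/status"
  kb =
    if File.readable?(status)
      File.read(status)[/^VmRSS:\s+(\d+)\s+kB/, 1].to_i
    else
      `ps -o rss= -p #{Integer(pid)}`.to_i
    end
  kb / 1024.0
end

# Takes a few readings so you can see whether usage has levelled off.
def sample_rss(samples: 3, interval: 0.1)
  Array.new(samples) do
    sleep interval
    rss_mb
  end
end

readings = sample_rss
puts format("RSS: min %.1f MB, max %.1f MB", readings.min, readings.max)
```

In a real app you would sample a web worker's pid over hours of production traffic rather than a few tenths of a second; memory usually keeps climbing for a while after boot before it levels off.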
Avoid major allocations
A good goal would be 300k allocations maximum per controller action. Average allocations per action should be less than 20k. Minimize the number and size of allocations.
- Use an APM. Skylight and Scout are strongest in tracking memory allocation.
- If you don't want to pay for an APM, look at memory_profiler and oink.
- Build your own profiling with ObjectSpace and GC.stat from stdlib.
- If all else fails, move heavy allocations to Rake tasks.
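As a minimal sketch of the "build your own" option, the snippet below counts objects allocated while a block runs, using `GC.stat(:total_allocated_objects)`, and compares the result to the 300k goal mentioned above. The names `count_allocations` and `ALLOCATION_BUDGET` are hypothetical, and the block is a stand-in for a controller action.

```ruby
# Hypothetical budget taken from the per-action goal above.
ALLOCATION_BUDGET = 300_000

# Returns how many objects were allocated while the block ran.
# total_allocated_objects is a running counter, so the difference
# is not affected by GC runs that happen in between.
def count_allocations
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

# Stand-in for the work done by a controller action.
allocated = count_allocations do
  1_000.times.map { |i| "row #{i}" }
end

puts "allocated #{allocated} objects"
puts "over budget!" if allocated > ALLOCATION_BUDGET

# ObjectSpace.count_objects gives a per-type snapshot of heap slots.
puts "strings on heap: #{ObjectSpace.count_objects[:T_STRING]}"
```

In a Rails app the same idea could be wrapped around a request, for example in a Rack middleware, but that wiring is left out here.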
derailed
Audit your Gemfile with bundle exec derailed bundle:mem and you're done! Remember that every gem in your Gemfile is immediately required upon startup, so look for opportunities to use "require: false". Sprockets will require most asset gems if it needs them, so you don't have to require them yourself. This reduces production memory usage, because asset gems aren't needed once your assets are precompiled.
gem 'sass', require: false
Use jemalloc
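jemalloc is simply a better malloc with better fragmentation avoidance. I prefer to compile Ruby with jemalloc, but you can also load it with the LD_PRELOAD environment variable. Note that LD_PRELOAD is read by the Linux dynamic linker; on macOS the equivalent variable is DYLD_INSERT_LIBRARIES.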
brew install jemalloc
LD_PRELOAD=/usr/local/Cellar/jemalloc/4.2.0/lib/libjemalloc.dylib ruby myscript.rb
or
./configure --with-jemalloc
make
make install
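To check whether either approach took effect, a quick sketch like the following can help. It assumes that a Ruby built with `--with-jemalloc` lists jemalloc among the linked libraries in `RbConfig::CONFIG` (under `MAINLIBS` on newer Rubies, `LIBS` on older ones); the helper names are hypothetical. The preload check only looks at the environment variable, not at what the linker actually loaded.

```ruby
require "rbconfig"

# True if this Ruby binary appears to have been linked against jemalloc.
# Assumption: the build records the library in MAINLIBS or LIBS.
def compiled_with_jemalloc?
  %w[MAINLIBS LIBS].any? do |key|
    RbConfig::CONFIG[key].to_s.include?("jemalloc")
  end
end

# True if a preload variable mentioning jemalloc is set for this process.
def jemalloc_preload_requested?(env = ENV)
  %w[LD_PRELOAD DYLD_INSERT_LIBRARIES].any? do |name|
    env[name].to_s.include?("jemalloc")
  end
end

puts "compiled with jemalloc: #{compiled_with_jemalloc?}"
puts "preload requested:      #{jemalloc_preload_requested?}"
```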
Use a forking webserver
Puma, Unicorn, Passenger all work. Be sure to use whatever "preload" options are available. Copy-on-write increases shared memory usage, which decreases overall memory usage. Remember that you may not see any improvement in RSS, because shared memory is sometimes included in how tools display the resident set.
Use a threaded webserver
Threads are lighter memory-wise than processes. Many webapps can benefit from threads, especially those that have lots of database I/O or interact with external webservices. Puma and Passenger Enterprise are threaded webservers.
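With Puma, forking, preloading and threads can be combined in one config file. The fragment below is a sketch rather than a recommended setup: the worker and thread counts are placeholder values, and the `WEB_CONCURRENCY` and `RAILS_MAX_THREADS` variable names are a common convention assumed here, not something Puma requires.

```ruby
# config/puma.rb

# Number of forked worker processes (placeholder default of 2).
workers Integer(ENV.fetch("WEB_CONCURRENCY", 2))

# Threads per worker: minimum and maximum (placeholder default of 5).
max_threads = Integer(ENV.fetch("RAILS_MAX_THREADS", 5))
threads max_threads, max_threads

# Load the app in the master process before forking, so workers
# can share that memory via copy-on-write.
preload_app!
```

Fewer workers with more threads usually means less total memory, at the cost of requiring thread-safe application code.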
Keep Ruby and Rails up-to-date
Ruby 2.2 and Rails 4.2 include very important performance improvements. Watch out for Ruby 2.4, which looks like it will include a faster Hash, faster Regex and better control over free slots.
Tune malloc
When using a threaded webserver, you may experience a major growth in memory usage. This is due to malloc's arena implementation, which can get pretty greedy in an effort to reduce thread contention for memory. If you see huge memory bloat with threads, try setting the MALLOC_ARENA_MAX environment variable to a number like 2 or 3. This will slow down your program slightly, so be sure to benchmark.
For more environment variables to tune malloc behavior, see mallopt.
| mallopt() option | Env var | Default value | Notes |
| --- | --- | --- | --- |
| M_TRIM_THRESHOLD | MALLOC_TRIM_THRESHOLD_ | 128KB | |
| M_TOP_PAD | MALLOC_TOP_PAD_ | 0 | |
| M_MMAP_THRESHOLD | MALLOC_MMAP_THRESHOLD_ | 128KB | 0 disables |
| M_MMAP_MAX | MALLOC_MMAP_MAX_ | 64 | 0 disables |
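These variables are read by malloc when the process starts, so they need to be set in the environment that launches Ruby, not from inside the app. A small boot-time check, sketched below with the hypothetical helper `malloc_settings`, can at least confirm what the process was started with.

```ruby
# Environment variables discussed above (note the trailing underscores).
MALLOC_ENV_VARS = %w[
  MALLOC_ARENA_MAX
  MALLOC_TRIM_THRESHOLD_
  MALLOC_TOP_PAD_
  MALLOC_MMAP_THRESHOLD_
  MALLOC_MMAP_MAX_
].freeze

# Returns a hash of variable name => configured value,
# or "(default)" when the variable is not set.
def malloc_settings(env = ENV)
  MALLOC_ENV_VARS.map { |name| [name, env.fetch(name, "(default)")] }.to_h
end

malloc_settings.each do |name, value|
  puts format("%-24s %s", name, value)
end
```

Logging this once at boot makes it easier to tell which setting was active when you compare benchmark runs.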
Tune GC
If you can't read gc.c and understand these variables yourself, don't touch them (yet).
GC tuning can fix:
- Too many free slots
- Slow startup
- Too many or too few GCs
Be careful. Fix one problem and you may make another worse.
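Before changing any GC setting, it helps to look at what the GC is doing now. The sketch below pulls a few counters from `GC.stat` that relate to the problems listed above; the helper name `gc_snapshot` and the idea of tracking a free-slot ratio are illustrative assumptions, not an official diagnostic.

```ruby
# Collects a few GC.stat values relevant to free slots and GC frequency.
def gc_snapshot
  stat = GC.stat
  live = stat[:heap_live_slots]
  free = stat[:heap_free_slots]
  {
    gc_runs: stat[:count],
    minor_gc_runs: stat[:minor_gc_count],
    major_gc_runs: stat[:major_gc_count],
    live_slots: live,
    free_slots: free,
    # Share of heap slots that are currently unused.
    free_ratio: free.to_f / (live + free)
  }
end

gc_snapshot.each do |name, value|
  puts format("%-14s %s", name, value)
end
```

Taking one snapshot right after boot and another under steady load, both before and after a change, is one way to catch the case where fixing one problem makes another worse.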