@schneems
Created February 6, 2014 17:48

I said

I wanted to talk about this so bad on the show, but it wasn't released yet. Yesterday we launched performance dynos https://blog.heroku.com/archives/2014/2/3/heroku-xl. Basically, each dyno gets a dedicated 6 GB of RAM and 8 cores. The idea is that if you really want to drop your tail latencies, there's no getting around the need for high concurrency. On a dyno like this you could easily run 12x the number of Unicorn or Puma workers, or more if you're using a Ruby that is copy-on-write friendly like 2.1.0. You can still scale out horizontally with more "performance" dynos, but this is one way you can also scale vertically. Ask me your performance dyno related questions, and I'll do my best to answer them here!
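
To make the "more workers per dyno" knob concrete, here is a minimal `config/unicorn.rb` sketch along the lines Heroku's docs suggest. `WEB_CONCURRENCY` is just a conventional env var, and the numbers are illustrative, not a recommendation:

```ruby
# config/unicorn.rb -- a minimal sketch. Set WEB_CONCURRENCY per dyno size,
# e.g. a handful of workers on a regular dyno, dozens on a PX dyno.
worker_processes Integer(ENV["WEB_CONCURRENCY"] || 3)
timeout 15

# Load the app once in the master process so a copy-on-write-friendly Ruby
# (2.0+) can share that memory across all forked workers.
preload_app true

before_fork do |server, worker|
  # Disconnect shared connections in the master so each worker gets its own.
  defined?(ActiveRecord::Base) && ActiveRecord::Base.connection.disconnect!
end

after_fork do |server, worker|
  defined?(ActiveRecord::Base) && ActiveRecord::Base.establish_connection
end
```

With `preload_app true` the workers fork from an already-booted master, which is exactly where the copy-on-write savings mentioned above come from.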

He said

Could you explain the "tail latencies" bit? I saw a bunch of people post stats from their dashboards showing the switch to PX dynos and the difference it made was really impressive, but I guess I don't understand why it made a difference. You mention you could run more unicorn/puma workers in a dyno, because it has so much more RAM, but isn't that the same as just running more dynos with fewer workers? Is it just that the PX dyno is dedicated, and regular dynos are not? Thanks!

I said

Great question! When most people think of the time it takes to render a web page, they think of it the way they experience it in development. Your web server is ready to go; as soon as you click a button or type in a link, the server works to process the request as fast as possible. In this scenario it doesn't matter if one page takes 200ms to load and another takes 1000ms. Your server is working as fast as it possibly can. What happens in production under load is quite different.

By tail latency, I am referring to the perc95 or perc99 times on your site. Instead of flat performance, imagine your website's response times spread across a bell curve. At the far left (perc1) are the fastest responses, similar to what you would see in development. In the middle is perc50, what most people see when they visit your site. At the far right-hand side is the long tail: the really slow responses. These don't happen all the time, but when they do... they are really slow. It's possible to have a website with a perc50 of 400ms and a perc95 of 2000ms or more. This is the tail latency. Even if you think "well, only a fraction of my users see this slow behavior," as you grow, that fraction represents more and more people, and users will perhaps think of your site as "slow" even if most pages load quickly.
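
To make those percentiles concrete, here is a quick sketch using a simple nearest-rank percentile (not tied to any particular monitoring tool) over a made-up set of response times:

```ruby
# Nearest-rank percentile over a list of response times (in ms).
def percentile(values, pct)
  sorted = values.sort
  sorted[((pct / 100.0) * (sorted.length - 1)).round]
end

# 90 "normal" responses plus 10 slow outliers -- the long tail.
times = Array.new(90) { rand(200..600) } + Array.new(10) { rand(1500..3000) }

percentile(times, 50)  # somewhere in the 200-600ms range: what most visitors see
percentile(times, 95)  # 1500ms or more: the tail latency
```

Even though 90% of requests are fast, the perc95 number lands squarely in the slow outliers, which is why a dashboard can show an impressive perc50 and an ugly perc95 at the same time.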

That's the what; let's look at the why. Remember when I said our hypothetical app had some pages that load quickly (200ms) and some slowly (1000ms)? Well, when your site begins to get more traffic than your servers can handle at a time, instead of returning errors, those requests get queued. This is a good thing, since otherwise you would need to maintain excess capacity just for spikes and surges. Previously it didn't matter that one page was slow and one was fast, but now what if a user requests a slow page, and another user right behind them requests a fast page? Even though the "fast" page renders in 200ms, it has to wait for the "slow" page to finish first, so the second request takes 1200ms (yikes!).

The most brute-force way out of this scenario is having enough capacity (concurrency) to handle all requests as they arrive. This is one area where PX dynos can help. Since there are extra resources, you can design your app to be more concurrent. We recommend the Puma or Unicorn web servers, as they can handle multiple requests at a time, and the more resources you have, the more concurrency each web server can handle. The neat thing about Ruby 2.0+ is that it is copy-on-write friendly, which is a fancy way of saying that a forked Ruby process takes up less memory than the process it was copied from, because memory is shared until it's written to. This means it is more efficient to run a bunch of Unicorn or Puma workers on one box than to run a few workers on many boxes. With a PX dyno you have more resources, and with the right Ruby you consume those resources more efficiently; this gives you more concurrency and helps the tail latency problem.
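
The 200ms-behind-1000ms arithmetic above can be sketched as a toy single-worker queue, where each request's completion time includes everything queued ahead of it:

```ruby
# Toy single-worker queue: requests are served in arrival order, so each
# request's completion time is the running sum of all service times so far.
def completion_times(service_times_ms)
  clock = 0
  service_times_ms.map { |t| clock += t }
end

# A 1000ms "slow" page arrives just before a 200ms "fast" page:
completion_times([1000, 200])  # => [1000, 1200] -- the fast page takes 1200ms

# Reverse the arrival order and the fast page is unaffected:
completion_times([200, 1000])  # => [200, 1200]
```

This is obviously a simplification of real queuing behavior, but it shows why a single slow request poisons everything stuck behind it on the same worker.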

The second thing that more workers per dyno helps with is the infamous routing problem. Our router in the Bamboo stack did something neat: it knew that you were using the Thin web server, which could only take one request at a time, and it knew whether your dyno had returned a response. Based on these two pieces of information we knew if your dyno was "busy" and could route to another dyno. This wasn't free. It requires maintaining distributed state across a distributed routing system, and as the system grew it became less efficient. When we moved to Cedar, we took away the server restriction, so now you could run Unicorn and take 5 requests at a time per dyno, and we no longer "knew" if a dyno was busy. So instead of limiting each dyno to only one request at a time (slow), the router sends the request to one of your dynos and doesn't try to manage distributed state (which, as I said before, actually slows things down a bit). This is similar to how ELB or NGINX routes requests. This is also where we started to see the problem above with the 200ms vs 1000ms latencies. Another way to describe an app like that is that it has high "request variance": each response time can be dramatically different. What if we knew exactly how much capacity your server had and never queued a fast request behind a slow one? Well, it turns out routers are a lousy place for this logic, and it's actually built into web servers by default.

If you are running Unicorn with 5 workers and 4 of them are busy, instead of waiting for a busy worker to become free, Unicorn sends the request to the free worker. It makes sense. Unicorn can do this because all of the workers live on the same machine, and it can communicate with them with very little overhead (none of the slowdown associated with a global distributed lock). But we still have a problem: if all 5 workers are working on slow requests, a fast request still has to be queued. Queuing is a good thing after all, as it allows for surge capacity. Ideally we would add a 6th worker, or a 7th, and so on until we have enough capacity that if only 10% of our requests are slow, we have enough Unicorn workers to process all of those slow requests at once while leaving some workers free for the "fast" requests. As I mentioned before, a PX dyno allows you to run a huge number of Unicorn/Puma workers (not to be confused with Heroku's "worker" dyno type). Since a regular dyno is 512 MB of RAM and a PX dyno is 12x the size, it stands to reason you could get 12x (or more) workers on a PX dyno. This gives us the extra concurrency that we want and gives each web server more capacity to handle high request variance. The result is that your slow requests are still slow, but not EXTRA slow, and the fast requests are fast again. Your site now behaves more like it does in development.
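
As back-of-the-envelope math, the worker count you can fit scales with dyno RAM. The 150 MB per-worker figure below is purely illustrative; measure your own app's footprint:

```ruby
# Rough capacity math: how many workers fit in a dyno's RAM.
# per_worker_mb is app-specific; 150 here is a made-up illustrative number.
def max_workers(dyno_ram_mb, per_worker_mb)
  dyno_ram_mb / per_worker_mb  # integer division: whole workers only
end

max_workers(512, 150)   # => 3  on a regular 512 MB dyno
max_workers(6144, 150)  # => 40 on a PX dyno (512 * 12 = 6144)
```

This linear estimate is actually pessimistic for a copy-on-write-friendly Ruby, since forked workers share memory with the master, which is part of why "12x or more" workers is realistic.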

In short, tail latencies come from high request variance plus a high volume of traffic. Increasing your concurrency and giving web servers more capacity to deal with long-running requests is a deadly combination against them. If your site doesn't get much traffic, you don't need all this, and a PX dyno won't actually make anything faster (only give you more throughput). We linked to some tools (Librato, New Relic, etc.); use them to see if you're seeing a high perc95 time. If so, it's likely you could benefit from a 2X or PX dyno and more capacity. Then use log-runtime-metrics to tune the number of Unicorn/Puma workers on each dyno and you're good to go.

This was a bit long-winded, so please follow up if I was unclear on some points or you want me to expand on others. There's an entire field dedicated to queuing theory, so I won't claim to know everything, but I've helped optimize real-world sites running on Heroku under real "internet" load.
