@almet
Created April 2, 2015 17:42

Scaling readinglist

There is actually quite a lot to say about what we've learned over the past week or so while scaling the readinglist servers.

Let's start at the beginning. When we pushed the first version of the readinglist service to production, it turned out that performance was pretty bad. We had a really low RPS (requests per second) and the webheads weren't using all the CPU available.

We attacked this from every possible direction: we added a lot of metrics to know what the average response time from PostgreSQL was, and investigated our use of uWSGI at the same time.
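
By "metrics" I mean something as simple as timing the calls that hit PostgreSQL and logging the result. Here is a minimal sketch of the idea (the logger and metric names are illustrative, not our exact setup):

```python
import functools
import logging
import time

logger = logging.getLogger("readinglist.metrics")


def timed(metric_name):
    """Log how long the wrapped call takes, in milliseconds."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed_ms = (time.time() - start) * 1000
                logger.info("%s %.1fms", metric_name, elapsed_ms)
        return wrapper
    return decorator


# e.g. @timed("storage.get_records") on the functions talking to PostgreSQL.
```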

We also added some profiling to our application, in order to understand what was making it slow.
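
Profiling a WSGI application can be as simple as wrapping it in a middleware that runs each request under cProfile. A minimal sketch of the idea (not our exact setup):

```python
import cProfile
import pstats


def profiling_middleware(app):
    """Run each request under cProfile and print the hottest calls."""
    def wrapped(environ, start_response):
        profiler = cProfile.Profile()
        response = profiler.runcall(app, environ, start_response)
        # Show the 10 most expensive entries, sorted by cumulative time.
        pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
        return response
    return wrapped


# Then wrap the app in the WSGI entry point:
# application = profiling_middleware(application)
```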

All of this allowed us to find a bunch of problems: connections weren't reused (a new pool was created each time!), uWSGI wasn't actually reading the right configuration file, and so on.
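
To illustrate the connection problem: the pool has to be created once at application startup and then reused, not rebuilt on every request. A minimal sketch, assuming a psycopg2-style pool (the DSN and function names are placeholders):

```python
from psycopg2.pool import ThreadedConnectionPool

# Created once at startup, then shared by all requests (placeholder DSN).
POOL = ThreadedConnectionPool(minconn=1, maxconn=10,
                              dsn="dbname=readinglist user=postgres")


def fetch_records(query, params):
    """Borrow a connection from the shared pool and hand it back afterwards."""
    conn = POOL.getconn()
    try:
        with conn.cursor() as cursor:
            cursor.execute(query, params)
            return cursor.fetchall()
    finally:
        POOL.putconn(conn)
```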

When we tried it locally, it was burning all the CPU (which is something you want in such a case, so good!).

After some more tweaking, we issued a new release and deployed it on our staging environment.

But that wasn't the end of it: the performance gain wasn't visible at all, which was a bit frustrating.

After a day of investigation, here is a summary of the changes:

  • Nginx can use the uwsgi protocol to communicate with uWSGI (rather than proxying plain HTTP). This avoids parsing requests multiple times and is a performance boost (see the configuration sketch after this list);
  • New Relic helped us see more clearly what was going wrong. It made it very easy to spot our bottlenecks and fix them;
  • It showed us, among other things, that some of our SQL queries were really slow. After checking the database instance, it turned out it was burning all of its CPU. We started a new instance with more CPU (and multiple cores), and it's now using only 10% of it.
  • I tried to tweak the number of uWSGI workers on stage, and it turns out that for one CPU (which is what we currently have), 3 is the right choice. Now the bottleneck is our webhead (we're using 100% CPU there), and I believe we would benefit a lot from multiple cores / CPUs (in which case having more workers would be needed as well).
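
To make the first and last points more concrete, here is a rough sketch of what the relevant configuration looks like. The server name, socket path and module are placeholders, not our actual production settings:

```nginx
# nginx: speak the uwsgi protocol to uWSGI instead of proxying plain HTTP.
server {
    listen 80;
    server_name readinglist.example.com;

    location / {
        include uwsgi_params;
        uwsgi_pass unix:/var/run/readinglist.sock;
    }
}
```

And the matching uWSGI side, with 3 workers for a single-CPU webhead:

```ini
; uWSGI: 3 workers for one CPU (values are illustrative).
[uwsgi]
; placeholder WSGI entry point and socket path
module = readinglist.wsgi
socket = /var/run/readinglist.sock
master = true
processes = 3
```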

The vast majority of the changes were configuration changes in our stack.

I also think we can tweak our SQL queries to be faster, but I'm no SQL expert, so I'll leave this part to people who know SQL better than I do.

— Alex

@tarekziade

> begining
beginning
> Connection pools weren't reused
Connections weren't reused.
> which is something you want in such cases
which is something you want in such case

Also:

  • it would be nice to explain how you profiled the python app (which packages, the setup, etc.)
  • I'd love to see a bunch of charts
  • what are the next steps? Would "continuous load testing" be possible & interesting?
