@kpdecker
Last active June 2, 2016 18:02
mobile.walmart.com #nodebf 2014

Mobile Server Side Rendering

This year marks the first time that we are doing full-scale rendering of our single-page application on the mobile.walmart.com Node.js tier, which has presented a number of challenges very different from the mostly IO-bound load of our prior #nodebf.

The infrastructure outlined for last year remains the same, but our Home, Item, and a few other pages are now prerendered on the server using fruit-loops and hula-hoop, which execute an optimized version of our client-side JavaScript to provide an SEO- and first-load-friendly version of the site.

To address concerns about the additional CPU load at peak, which we hope will prove unfounded or be mitigated by our work, we have also taken a variety of steps to increase the cache lifetimes of the pages served in this manner. In order of their impact:

Event Loop Management

The single largest issue we found was that under load we would see convoy effects on the event loop, as some pages took a seemingly benign 10-80ms to render. Under normal traffic patterns this is unlikely to be an issue, but under server-melting peak traffic it became a massive problem. At a certain level of concurrency, the backlog of these rendering operations would cause event loop durations measured in seconds, as both the server-side JavaScript (SSJS) and all other requests competed for time on the loop.

To fix this we split our view rendering pipeline into numerous async operations, each taking less than 1ms (we didn't break out process.hrtime to get a more precise measurement). This did increase the time taken to render complex views, since the pipeline now had the slight overhead of 50+ additional event loop executions to complete, but in practice the cost was minimal and we were able to handle other operations in parallel with the more expensive rendering operations without significant slowdowns.
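
A minimal sketch of the general pattern, not our actual pipeline; the fragments array and renderFragment function are hypothetical stand-ins for the view rendering steps:

  // Yield back to the event loop between small units of render work so that
  // concurrent requests are not starved by one long synchronous render.
  function renderInChunks(fragments, renderFragment, done) {
    var html = [];
    var index = 0;

    function step() {
      if (index >= fragments.length) {
        return done(null, html.join(''));
      }

      // Each fragment is kept well under 1ms of synchronous work.
      html.push(renderFragment(fragments[index]));
      index++;

      // setImmediate defers the next fragment so other callbacks can run in between.
      setImmediate(step);
    }

    step();
  }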

Automatic Failover

Since the site is built on the client-side JavaScript (CJS) foundation that was used last year, we have the option of disabling the SSJS pipeline under load. This provides fault tolerance and allows the system to respond dynamically to load without human intervention.

We tried a few different approaches before finally settling on limiting the number of concurrent rendering processes and queuing requests above and beyond that limit. If a given request has been in the queue too long, or the queue itself reaches a maximum size, then the lightweight CJS pipeline is used for that request.
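
The shape of that logic, heavily simplified; the limits here and the renderSSJS/renderCJS handlers are illustrative, not our actual pool implementation:

  // Illustrative queue and failover sketch, not the real fruit-loops pool code.
  var POOL_SIZE = 25,      // concurrent renders
      MAX_QUEUE = 100,     // requests allowed to wait
      QUEUE_TIMEOUT = 500; // ms a request may wait before falling back

  var active = 0,
      queue = [];

  function handleRequest(req, renderSSJS, renderCJS) {
    if (active < POOL_SIZE) {
      run(req, renderSSJS, renderCJS);
    } else if (queue.length < MAX_QUEUE) {
      queue.push({ req: req, queued: Date.now(), ssjs: renderSSJS, cjs: renderCJS });
    } else {
      // Queue is full: fall back to the lightweight client-side rendered shell.
      renderCJS(req);
    }
  }

  function run(req, renderSSJS, renderCJS) {
    active++;
    renderSSJS(req, function () {
      active--;
      dequeue();
    });
  }

  function dequeue() {
    var next;
    while ((next = queue.shift())) {
      if (Date.now() - next.queued > QUEUE_TIMEOUT) {
        // Waited too long: serve the CJS page instead of rendering server side.
        next.cjs(next.req);
        continue;
      }
      return run(next.req, next.ssjs, next.cjs);
    }
  }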

In addition to the queue limits, we also implemented an event-loop delay monitor, opting to use the CJS path if operations are delayed more than a given time period due to CPU-bound activity. While this does provide some relief, it is not as effective as the queue limiting, since it is based on sampling and requires some smoothing to be effective and to avoid false positives.
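
A hedged sketch of the sampling approach; the interval, threshold, and smoothing factor are illustrative only:

  // Sample event loop delay by scheduling a timer and measuring how late it fires.
  // A smoothed (exponential moving average) value avoids failing over on a single spike.
  var SAMPLE_INTERVAL = 100,    // ms between samples
      FAILOVER_THRESHOLD = 50,  // ms of smoothed delay before preferring the CJS path
      SMOOTHING = 0.3;

  var smoothedDelay = 0;

  function sample() {
    var start = Date.now();
    setTimeout(function () {
      var delay = Math.max(0, Date.now() - start - SAMPLE_INTERVAL);
      smoothedDelay = SMOOTHING * delay + (1 - SMOOTHING) * smoothedDelay;
      sample();
    }, SAMPLE_INTERVAL);
  }
  sample();

  function shouldFailover() {
    return smoothedDelay > FAILOVER_THRESHOLD;
  }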

The exact configuration of these parameters will vary from application to application, but for our own servers we have:

  "vm-pool-size": 25,
  "vm-max-queue": 100,
  "vm-queue-timeout": 500,

  "event-delay-failover": 50,

This means that a given server will render 25 pages at a time and hold up to 100 requests in the queue, each for up to 500ms. The event delay parameter prescribes failover at 50ms of delay (which is very high for a Node application and a sign that something is horribly wrong).

Caching

The entire design of the SSJS infrastructure attempts to create the best possible case for long-term caching of the responses:

  1. Broke some of our primary services into separate requests for long-term vs. short-term data.
  2. Applied the longest cache durations that we could on top of the above changes (a route-level sketch follows this list).
  3. Served only public content from the SSJS response pipeline.
  4. Aggressively cached any content that we could in our CDN.
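
For reference, the route-level shape of those cache settings in hapi looks roughly like this; the path, TTL, and renderProductPage helper are hypothetical, not our production values:

  // Hypothetical hapi route: long-lived, public cache headers so the CDN can hold the response.
  server.route({
    method: 'GET',
    path: '/product/{id}',
    config: {
      cache: {
        expiresIn: 5 * 60 * 1000,   // illustrative TTL only
        privacy: 'public'           // only public content is served from the SSJS pipeline
      }
    },
    handler: function (request, reply) {
      reply(renderProductPage(request.params.id));   // hypothetical render helper
    }
  });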

External caching helped somewhat, but our catalog is very diverse, so cache hit rates are not as high as we had hoped once cache durations and geographic distribution are factored in. With our traffic patterns, we found that using a shared catbox cache for API data had more of an impact, as it reduced the stress on our upstream services tier. This alone halved our response time, since the cache hit rate within the data centers was quite high.
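
A hedged illustration of that kind of shared API-data cache using catbox directly; the engine choice, segment and key names, TTL, and fetchFromService helper are all assumptions for the sake of the example:

  // Illustrative shared catbox cache in front of an upstream API call.
  var Catbox = require('catbox');

  var client = new Catbox.Client(require('catbox-memcached'), { partition: 'api-cache' });

  client.start(function (err) {
    if (err) { throw err; }
  });

  function getProduct(id, fetchFromService, callback) {
    var key = { segment: 'products', id: id };

    client.get(key, function (err, cached) {
      if (!err && cached) {
        // Cache hit within the data center: skip the upstream services call entirely.
        return callback(null, cached.item);
      }

      fetchFromService(id, function (err, product) {
        if (err) { return callback(err); }

        // Illustrative TTL only; real durations depend on how volatile the data is.
        client.set(key, product, 60 * 1000, function () {
          callback(null, product);
        });
      });
    });
  }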

@alexmcpherson

Great writeup, good luck tomorrow. Quick edit: s/intensive purposes/intents and purposes/

@zebulonj

Regarding your queue management, failover, and virtual machines... do I infer correctly that the queue management and failover processes take place in a process external to the worker virtual machines (e.g., the 25 VMs in your pool)? If not, would you mind elaborating on the above to describe the relationship between queue management and the rendering pipeline in the context of your infrastructure?

@kpdecker
Author

@zebulonj VM in the context of the parameters above refers to the fruit-loops page instances, which are isolated from the rest of the Node process via contextify. The queuing and failover mechanisms occur outside of this, in the Hapi context, via a combination of the fruit-loops pool instances and hula-hoop's page endpoint.

Basically the pool and page endpoint avoid handling a request in the VM if they think it's going to fail, so we never enter the more expensive path in that case.
