@kpdecker
Last active June 2, 2016 18:02
mobile.walmart.com #nodebf 2014

Mobile Server Side Rendering

This year marks the first time that we are doing full-scale rendering of our single-page application on the mobile.walmart.com Node.js tier, which has presented a number of challenges very different from the mostly IO-bound load of our prior #nodebf.

The infrastructure outlined for last year remains the same, but our Home, Item, and a few other pages are now prerendered on the server using fruit-loops and hula-hoop, which execute an optimized version of our client-side JavaScript to provide an SEO- and first-load-friendly version of the site.

To address concerns about the additional CPU load at peak, which we hope will prove unfounded or be mitigated by our work, we have also taken a variety of steps to increase the cache lifetimes of the pages served in this manner. In order of their impact:

Event Loop Management

The single largest issue we found was that under load we would see convoy effects on the event loop, as some pages took a seemingly benign 10-80ms to render. Under normal traffic patterns this is unlikely to be an issue, but under server-melting peak traffic it became a massive problem. At a certain level of concurrency, the backlog of these rendering operations would cause event loop durations measured in seconds, as both the server-side JavaScript (SSJS) and all other requests competed for time on the loop.

To fix this we split our view rendering pipeline into numerous async operations, each taking less than 1ms (we didn't break out process.hrtime to get a more precise measurement). This did increase the time taken to render complex views, since the pipeline now had the slight overhead of 50+ additional event loop executions to complete, but in practice the cost was minimal and we were able to handle other operations in parallel with the more expensive rendering operations without significant slowdowns.
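
A minimal sketch of the general pattern, not our actual pipeline; the fragments array and renderFragment function are hypothetical stand-ins for the view rendering steps:

  // Yield back to the event loop between small units of render work so that
  // concurrent requests are not starved by one long synchronous render.
  function renderInChunks(fragments, renderFragment, done) {
    var html = [];
    var index = 0;

    function step() {
      if (index >= fragments.length) {
        return done(null, html.join(''));
      }

      // Each fragment is kept well under 1ms of synchronous work.
      html.push(renderFragment(fragments[index]));
      index++;

      // setImmediate defers the next fragment so other callbacks can run in between.
      setImmediate(step);
    }

    step();
  }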

Automatic Failover

Since the site is built on the client-side JavaScript (CJS) foundation that was used last year, we have the option of disabling the SSJS pipeline under load. This provides fault tolerance and allows the system to respond dynamically to load without human intervention.

We tried a few different approaches before finally settling on limiting the number of concurrent rendering processes and queuing requests above and beyond that limit. If a given request has been in the queue too long, or the queue itself reaches a maximum size, then the lightweight CJS pipeline is used for that request.
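
The shape of that logic, heavily simplified; the limits here and the renderSSJS/renderCJS handlers are illustrative, not our actual pool implementation:

  // Illustrative queue and failover sketch, not the real fruit-loops pool code.
  var POOL_SIZE = 25,      // concurrent renders
      MAX_QUEUE = 100,     // requests allowed to wait
      QUEUE_TIMEOUT = 500; // ms a request may wait before falling back

  var active = 0,
      queue = [];

  function handleRequest(req, renderSSJS, renderCJS) {
    if (active < POOL_SIZE) {
      run(req, renderSSJS, renderCJS);
    } else if (queue.length < MAX_QUEUE) {
      queue.push({ req: req, queued: Date.now(), ssjs: renderSSJS, cjs: renderCJS });
    } else {
      // Queue is full: fall back to the lightweight client-side rendered shell.
      renderCJS(req);
    }
  }

  function run(req, renderSSJS, renderCJS) {
    active++;
    renderSSJS(req, function () {
      active--;
      dequeue();
    });
  }

  function dequeue() {
    var next;
    while ((next = queue.shift())) {
      if (Date.now() - next.queued > QUEUE_TIMEOUT) {
        // Waited too long: serve the CJS page instead of rendering server side.
        next.cjs(next.req);
        continue;
      }
      return run(next.req, next.ssjs, next.cjs);
    }
  }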

In addition to the queue limits, we also implemented an event-loop delay monitor, opting to use the CJS path if operations are delayed more than a given time period due to CPU-bound activity. While this does provide some relief, it is not as effective as the queue limiting, since it is based on sampling and requires some smoothing to be effective and to avoid false positives.
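
A hedged sketch of the sampling approach; the interval, threshold, and smoothing factor are illustrative only:

  // Sample event loop delay by scheduling a timer and measuring how late it fires.
  // A smoothed (exponential moving average) value avoids failing over on a single spike.
  var SAMPLE_INTERVAL = 100,    // ms between samples
      FAILOVER_THRESHOLD = 50,  // ms of smoothed delay before preferring the CJS path
      SMOOTHING = 0.3;

  var smoothedDelay = 0;

  function sample() {
    var start = Date.now();
    setTimeout(function () {
      var delay = Math.max(0, Date.now() - start - SAMPLE_INTERVAL);
      smoothedDelay = SMOOTHING * delay + (1 - SMOOTHING) * smoothedDelay;
      sample();
    }, SAMPLE_INTERVAL);
  }
  sample();

  function shouldFailover() {
    return smoothedDelay > FAILOVER_THRESHOLD;
  }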

The exact configuration of these parameters will vary from application to application, but for our own servers we have:

  "vm-pool-size": 25,
  "vm-max-queue": 100,
  "vm-queue-timeout": 500,

  "event-delay-failover": 50,

This means that a given server will render 25 pages at a time and hold up to 100 requests in the queue, each for up to 500ms. The event delay parameter prescribes failover at 50ms of delay (which is very high for a Node application and a sign that something is horribly wrong).

Caching

The entire design of the SSJS infrastructure attempts to create the best possible case for long-term caching of the responses:

  1. Broke some of our primary services into separate requests for long-term vs. short-term data.
  2. Applied the longest cache durations that we could on top of the above changes (a route-level sketch follows this list).
  3. Served only public content from the SSJS response pipeline.
  4. Aggressively cached any content that we could in our CDN.
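
For reference, the route-level shape of those cache settings in hapi looks roughly like this; the path, TTL, and renderProductPage helper are hypothetical, not our production values:

  // Hypothetical hapi route: long-lived, public cache headers so the CDN can hold the response.
  server.route({
    method: 'GET',
    path: '/product/{id}',
    config: {
      cache: {
        expiresIn: 5 * 60 * 1000,   // illustrative TTL only
        privacy: 'public'           // only public content is served from the SSJS pipeline
      }
    },
    handler: function (request, reply) {
      reply(renderProductPage(request.params.id));   // hypothetical render helper
    }
  });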

External caching helped somewhat, but our catalog is very diverse, so cache hit rates are not as high as we had hoped once cache durations and geographic distribution are factored in. With our traffic patterns, we found that using a shared catbox cache for API data had more of an impact, as it reduced the stress on our upstream services tier. This alone halved our response time, since the cache hit rate within the data centers was quite high.
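
A hedged illustration of that kind of shared API-data cache using catbox directly; the engine choice, segment and key names, TTL, and fetchFromService helper are all assumptions for the sake of the example:

  // Illustrative shared catbox cache in front of an upstream API call.
  var Catbox = require('catbox');

  var client = new Catbox.Client(require('catbox-memcached'), { partition: 'api-cache' });

  client.start(function (err) {
    if (err) { throw err; }
  });

  function getProduct(id, fetchFromService, callback) {
    var key = { segment: 'products', id: id };

    client.get(key, function (err, cached) {
      if (!err && cached) {
        // Cache hit within the data center: skip the upstream services call entirely.
        return callback(null, cached.item);
      }

      fetchFromService(id, function (err, product) {
        if (err) { return callback(err); }

        // Illustrative TTL only; real durations depend on how volatile the data is.
        client.set(key, product, 60 * 1000, function () {
          callback(null, product);
        });
      });
    });
  }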

@alexmcpherson

Great writeup, good luck tomorrow. Quick edit: s/intensive purposes/intents and purposes/

@zebulonj

Regarding your queue management, failover, and virtual machines... do I infer correctly that the queue management and failover processes take place in a process external to the worker virtual machines (e.g., the 25 VMs in your pool)? If not, would you mind elaborating on the above to describe the relationship between queue management and the rendering pipeline in the context of your infrastructure?

@kpdecker
Author

@zebulonj VM in the context of the parameters above refers to the fruit-loops page instances, which are isolated from the rest of the Node process via contextify. The queuing and failover mechanisms occur outside of this, in the Hapi context, via a combination of the fruit-loops pool instances and hula-hoop's page endpoint.

Basically the pool and page endpoint avoid handling a request in the VM if they think it's going to fail, so we never enter the more expensive path in that case.
