Mobile Server Side Rendering
This is the first year that we are doing full-scale server-side rendering of our SPA on the mobile.walmart.com Node.js tier, which has presented a number of challenges that are very different from the mostly IO-bound load of our prior #nodebf.
To address the additional CPU load concerns at peak, which we hope will prove unfounded or be mitigated by our work, we have also taken a variety of steps to increase the cache lifetimes of the pages that are served in this manner. In order of their impact:
Event Loop Management
To fix this we split our view rendering pipeline into numerous async operations, all taking less than 1ms each (we didn't break out process.hrtime to get a better measurement). This did increase the time taken to render complex views, as the pipeline now had the slight overhead of 50+ additional event loop executions to complete, but in practice the cost was minimal, and we were able to handle other operations in parallel with the more expensive rendering operations without seeing significant slowdowns.
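As a rough illustration of splitting a render into small async operations, the sketch below runs a list of short synchronous steps and yields to the event loop between them. The names renderInSteps and yieldToEventLoop, and the shape of the context object, are hypothetical and only serve the example; this is not the production pipeline.

```javascript
// Hypothetical helper: resolves on the next turn of the event loop, which
// gives pending I/O callbacks and timers a chance to run first.
function yieldToEventLoop() {
  return new Promise((resolve) => setImmediate(resolve));
}

// Runs each step in order, yielding between steps. Each step is assumed to
// be a short synchronous function that appends to context.html.
async function renderInSteps(steps, context) {
  for (const step of steps) {
    step(context);
    await yieldToEventLoop();
  }
  return context.html.join('');
}
```

The trade-off matches the one described above: every yield adds one extra event loop turn to the render, in exchange for other requests being able to make progress in between.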
We tried a few different approaches before finally settling on limiting the number of concurrent rendering processes and queuing the requests above and beyond the limit. If a given request has been in the queue for too long or the queue itself reaches a maximum size then the lightweight CJS pipeline will be used for that given request.
In addition to the queue limits, we also implemented an event-loop delay monitor, effectively opting to use the CJS path if operations are delayed by more than a given time period due to CPU-bound activities. While this does provide some relief, it is not as effective as the queue limiting, as it is based on sampling and requires some smoothing to be effective and avoid false positives.
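A sampled delay monitor with smoothing could look roughly like the sketch below. It assumes a timer that fires at a fixed interval, measures how late each tick arrives, and keeps an exponentially weighted average so that a single spike does not trigger failover. The class name, option names, and default values are hypothetical.

```javascript
const { performance } = require('perf_hooks');

class EventLoopDelayMonitor {
  constructor({ sampleInterval = 10, threshold = 50, smoothing = 0.2 } = {}) {
    this.sampleInterval = sampleInterval; // ms between samples
    this.threshold = threshold;           // smoothed delay (ms) that triggers failover
    this.smoothing = smoothing;           // weight of the newest sample, between 0 and 1
    this.delay = 0;                       // current smoothed delay in ms
    this.timer = null;
  }

  // Folds one lag sample into the exponentially weighted average.
  record(lag) {
    this.delay = this.smoothing * lag + (1 - this.smoothing) * this.delay;
  }

  start() {
    let expected = performance.now() + this.sampleInterval;
    this.timer = setInterval(() => {
      const now = performance.now();
      // A late tick means the event loop was busy with CPU-bound work.
      this.record(Math.max(0, now - expected));
      expected = now + this.sampleInterval;
    }, this.sampleInterval);
    this.timer.unref(); // do not keep the process alive just for sampling
  }

  stop() {
    clearInterval(this.timer);
    this.timer = null;
  }

  shouldFailover() {
    return this.delay > this.threshold;
  }
}
```

A request handler would check shouldFailover() before starting a server-side render and take the lightweight path when it returns true.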
The exact configuration of these parameters will vary from application to application, but for our own servers we have:
"vm-pool-size": 25,
"vm-max-queue": 100,
"vm-queue-timeout": 500,
"event-delay-failover": 50,
This means that a given server will render 25 pages at a time and hold up to 100 requests in the queue, each for up to 500ms. The event delay parameter prescribes failover at 50ms of delay (which is very high for a Node application and is a sign that something is horribly wrong).
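To make the interaction of the pool size, queue size, and queue timeout concrete, here is a minimal sketch of a limiter with a fallback. The RenderLimiter class, its option names, and the render and fallback callbacks are hypothetical stand-ins; only the three numeric values come from the configuration above.

```javascript
class RenderLimiter {
  constructor({ poolSize, maxQueue, queueTimeout }) {
    this.poolSize = poolSize;         // concurrent renders allowed
    this.maxQueue = maxQueue;         // waiting requests allowed
    this.queueTimeout = queueTimeout; // max ms a request may wait
    this.active = 0;
    this.queue = [];
  }

  // Resolves with render()'s result, or with fallback()'s result when the
  // queue is full or the request waited longer than queueTimeout.
  run(render, fallback) {
    if (this.active < this.poolSize) {
      return this._start(render);
    }
    if (this.queue.length >= this.maxQueue) {
      return Promise.resolve().then(fallback);
    }
    return new Promise((resolve) => {
      const entry = { render, resolve, timer: null };
      entry.timer = setTimeout(() => {
        // Waited too long: leave the queue and take the lightweight path.
        this.queue.splice(this.queue.indexOf(entry), 1);
        resolve(Promise.resolve().then(fallback));
      }, this.queueTimeout);
      this.queue.push(entry);
    });
  }

  _start(render) {
    this.active++;
    return Promise.resolve()
      .then(render)
      .finally(() => {
        this.active--;
        this._next();
      });
  }

  _next() {
    const entry = this.queue.shift();
    if (!entry) return;
    clearTimeout(entry.timer);
    entry.resolve(this._start(entry.render));
  }
}

// Values taken from the configuration quoted above.
const limiter = new RenderLimiter({ poolSize: 25, maxQueue: 100, queueTimeout: 500 });
```

A handler would call limiter.run(renderOnServer, sendClientSidePage), where both callbacks are whatever the application uses for the full render and for the lightweight response.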
The entire SSJS infrastructure is designed to make the best possible case for long-term caching of the responses. To that end, we:
- Broke some of our primary services into split requests, with long-term data and short-term data in distinct requests.
- Applied the longest cache values that we could given the above changes.
- Serve only public content from the SSJS response pipeline.
- Aggressively cache any content that can be in our CDN.
External caching helped somewhat, but our catalog is very diverse, so cache hit rates are not as high as we had hoped once cache durations and geographic distribution are factored in. With our traffic patterns, we found that using a shared catbox cache for API data had more of an impact, as it reduced the stress on our upstream services tier. This alone halved our response time, as the cache hit rate within the data centers was quite high.
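The general read-through pattern behind caching API data can be sketched as follows. This is an in-memory stand-in written for illustration only: it does not use the catbox API, and the function name createReadThroughCache and its parameters are hypothetical.

```javascript
// Wraps an async fetch function so that repeated requests for the same key
// within ttlMs are answered from the cache instead of the upstream service.
// The clock is injectable (defaults to Date.now) to keep the sketch testable.
function createReadThroughCache(fetchFn, ttlMs, now = Date.now) {
  const entries = new Map(); // key -> { value, expires }
  return async function get(key) {
    const hit = entries.get(key);
    if (hit && hit.expires > now()) {
      return hit.value; // cache hit: no upstream call
    }
    const value = await fetchFn(key); // cache miss: go upstream
    entries.set(key, { value, expires: now() + ttlMs });
    return value;
  };
}
```

Splitting long-term and short-term data into distinct requests fits this pattern naturally, since each kind of data can be wrapped with its own TTL.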