Dataloader dynamic batch


Problem statement

GraphQL is great for exposing a single API that hides the complexity of the systems behind it. As such, it is commonly used to aggregate data from different data sources: databases, other services, and so on.

But most corporate web services are built around RESTful APIs, so it is crucial to handle this use case efficiently.

Given that the reference implementation of GraphQL is written in Node, and that the language is closely tied to the frontend, Node is a very popular platform for the server.

However, Node applications must be built around the limitations of the event loop. Every Node developer knows that blocking the event loop is bad for performance: while it is blocked, the server cannot process new requests or make progress on other tasks. The problem is well documented in the following paper: https://edge.seas.harvard.edu/files/edge/files/wcui19vee.pdf.

When dealing with REST APIs, the most common data transfer format is JSON, so a GraphQL server is expected to parse JSON a lot. By nature, that operation is CPU bound.

In other words, this kind of server will spend most of its CPU cycles on processing JSON strings.

Sadly, parsing JSON in Node blocks the event loop. Since parsing runs in O(N) time, the bigger the response payload, the longer we block the loop.
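
As a rough, self-contained illustration (the payload below is synthetic), we can measure how long a single JSON.parse call keeps the event loop busy as the payload grows:

// Hypothetical micro-benchmark: build payloads of increasing size and time JSON.parse.
// Absolute numbers depend on the machine; the point is the roughly linear growth.
for (const items of [1_000, 10_000, 100_000]) {
  const payload = JSON.stringify(
    Array.from({ length: items }, (_, id) => ({ id, name: `product-${id}`, price: id * 0.1 }))
  );
  const start = process.hrtime.bigint();
  JSON.parse(payload);
  const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
  console.log(`${items} items -> ${elapsedMs.toFixed(1)} ms blocking the event loop`);
}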

There is a very well-known technique to reduce the number of requests we make against backends: DataLoaders. This component was created to address the N+1 problem by deduplicating and batching requests.

But batching implies fetching more resources in a single request, which by definition increases the size of the payload. And, as mentioned above, a bigger payload will likely block the event loop for longer.

The problem with fixed batches

The DataLoader library created by Facebook already provides a mechanism to set the maximum number of elements we allow in a batch. But there is an underlying assumption that the response shape is static, that is, that the shape of the data returned does not change over time.
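
For reference, that mechanism is the maxBatchSize option; a minimal sketch (the batch function and URL below are placeholders) looks like this:

const DataLoader = require('dataloader');

// Placeholder batch function: fetches all requested products in one backend call.
async function batchGetProducts(ids) {
  const res = await fetch(`https://backend.example.com/products?ids=${ids.join(',')}`);
  const products = await res.json();
  return ids.map((id) => products.find((product) => product.id === id));
}

// Fixed cap: never put more than 100 keys into a single batch.
const productLoader = new DataLoader(batchGetProducts, { maxBatchSize: 100 });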

However, this does not cover more sophisticated use cases. Imagine your GraphQL server is connected to an API that returns a list of products. Based on the current response size, you decide to limit your requests to batches of 100 products.

After a while, the API owner decides to nest a list of simples (product variants) inside each product so customers can select between them.

Suddenly, the event loop lag of your application increases considerably, and you have to find the cause. Once you find out that the new payload size is the culprit, you need to figure out the proper batch size all over again.

This situation can repeat multiple times during the lifetime of your system.

But, let’s consider a more substantial use case.

Introducing field filtering

As we showed above, backend payload sizes may change without prior notice. However, when the number of services our GraphQL server is connected to is small enough, occasionally changing the values on our data loaders may be acceptable.

But what happens if the response sizes are not stable, that is, when they change on a per-query basis?

Such is the case when we have the ability to filter fields out of responses in the backend (e.g., 1, 2, 3), so we can create a dependency map between GraphQL and backend fields. The benefit of using field filtering in JSON responses is outlined in this post.
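
As a hypothetical illustration (the fields query parameter and URL below are assumptions, not a specific backend's API), the set of fields requested from the backend could be derived from the GraphQL selection, so the payload only carries what the query needs:

// Hypothetical field-filtering call: only the listed fields are serialized
// by the backend, so the payload size varies from query to query.
async function fetchProducts(ids, selectedFields) {
  const fields = selectedFields.join(','); // e.g. "id,name" vs. dozens of fields
  const url = `https://backend.example.com/products?ids=${ids.join(',')}&fields=${fields}`;
  const res = await fetch(url);
  return res.json();
}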

While this brings a significant performance improvement, it makes it hard to find a fixed number for the dataloader batch size. In one query we may ask for a few fields, while in another we may request dozens.

At this point, it is very tempting to create a separate dataloader instance for each of the use cases that we manage. However, this does not solve the root problem and it does not scale well.

After all, we want a mechanism to maximise the utilisation of the CPU without stalling it.

Chosen solution

In most cases, the size of a response is driven by input parameters. For example, it is expected that requesting a larger number of elements from the server yields a bigger response. Thus, for an API that takes a limit as input, we can infer the size of the response from that number.

Similarly, it is intuitive that requesting more fields from the backend will increase the response size. In fact, as outlined in this paper, the size of a GraphQL response can be exponential in the size of the query. Though not mathematically proven here, it is fair to assume that similar field selection mechanisms, such as JSON field filtering, behave alike.

Hence, we have the means to predict the size of the responses feeding our GraphQL server.
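
As a purely illustrative sketch (the function name and constants below are made up, not part of any library), we could estimate a per-item payload size from the number of requested fields and derive a batch size from a byte budget we are willing to parse at once:

// Illustrative heuristic: estimate how many items fit in a "parse budget".
// AVERAGE_FIELD_BYTES and PARSE_BUDGET_BYTES are made-up tuning constants.
const AVERAGE_FIELD_BYTES = 150;        // rough serialized size of one field
const PARSE_BUDGET_BYTES = 256 * 1024;  // max payload we accept per backend call

function estimateMaxBatchSize(requestedFields) {
  const estimatedItemBytes = requestedFields.length * AVERAGE_FIELD_BYTES;
  // At least one item per batch, at most whatever fits in the budget.
  return Math.max(1, Math.floor(PARSE_BUDGET_BYTES / estimatedItemBytes));
}

// e.g. 3 requested fields -> ~582 items per batch, 40 fields -> ~43 items per batch
console.log(estimateMaxBatchSize(['id', 'name', 'price']));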

However, DataLoader does not allow us to tune the batches we dispatch without overriding the whole scheduling mechanism. That is not a bad alternative in itself, but other users won't benefit from the feature. Moreover, if we don't want to copy and paste the same functionality again and again, we will end up creating an internal library that we have to maintain.

Instead, we can add a function to the DataLoader options to hook into the decision of creating a new batch. Let's name this function shouldDispatch for now.

With this parameter, we are now able to change the batch size dynamically based on our application needs.

// Heuristic that returns the optimal batch size for the current conditions
// (for example, derived from the number of requested fields, as sketched above).
function computeMaxBatchSize() {
  // heuristic to get the optimal batch size.
}

const myLoader = new DataLoader(myBatchFn, {
  // Dispatch the pending batch once it grows beyond the computed size.
  shouldDispatch: (currentBatch) => {
    return currentBatch.length > computeMaxBatchSize();
  }
})

Implementation

The implementation is rather simple: we just need to stop reusing the existing batch, and thus create a new one, whenever the shouldDispatch function returns true:

// Inside DataLoader's internal getCurrentBatch(loader): keep reusing the
// existing batch only while shouldDispatch returns false.
if (
  existingBatch !== null &&
  !existingBatch.hasDispatched &&
  existingBatch.keys.length < loader._maxBatchSize &&
  (!existingBatch.cacheHits ||
    existingBatch.cacheHits.length < loader._maxBatchSize) &&
  !loader.shouldDispatch(existingBatch)
) {
  return existingBatch;
}
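
For the loader to expose shouldDispatch, the option also has to be read when the loader is constructed. A minimal sketch of that wiring (part of this fork, not the published dataloader API), defaulting to a function that always returns false so the stock behaviour is preserved:

// Inside the DataLoader constructor of the fork (sketch only).
// Without the option, shouldDispatch never forces an early dispatch.
this.shouldDispatch =
  options && typeof options.shouldDispatch === 'function'
    ? options.shouldDispatch
    : () => false;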

Improvements

A possible improvement is to monitor parsing operations and use that data to automatically adjust the size of the batch.
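
For instance, a sketch using Node's perf_hooks module (the thresholds and scaling factors below are arbitrary) could sample event loop delay and nudge the target batch size up or down:

const { monitorEventLoopDelay } = require('perf_hooks');

// Sample event loop delay and adapt the target batch size accordingly.
const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

let targetBatchSize = 100; // starting point; tuned at runtime

setInterval(() => {
  const p99Ms = histogram.percentile(99) / 1e6; // histogram reports nanoseconds
  if (p99Ms > 50) {
    // The loop is lagging: parse smaller payloads per batch.
    targetBatchSize = Math.max(10, Math.floor(targetBatchSize * 0.8));
  } else if (p99Ms < 10) {
    // Plenty of headroom: allow bigger batches again.
    targetBatchSize = Math.min(1000, Math.ceil(targetBatchSize * 1.2));
  }
  histogram.reset();
}, 5000);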

Alternative solutions

Worker Threads

For engineers used to languages with better support for multithreading/multiprocessing, it is natural to experiment with Node's worker_threads module, particularly when the problem we are trying to address is a unit of work starving the thread where the event loop runs.

However, worker threads are better suited for use cases where the data that flows between threads is small but the computation is expensive. In the JSON parsing use case, the overhead of passing the data around makes this solution nonviable.
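
To make that trade-off concrete, here is a minimal single-file sketch (not from the gist) that moves JSON.parse into a worker; the raw string is copied into the worker and the parsed object is structured-cloned back, so most of the parsing cost just moves around:

const { Worker, isMainThread, parentPort } = require('worker_threads');

if (isMainThread) {
  // Offload parsing to a worker thread.
  const worker = new Worker(__filename);
  worker.on('message', (parsed) => {
    console.log(`received ${parsed.length} items from the worker`);
    worker.terminate();
  });
  const payload = JSON.stringify(Array.from({ length: 100_000 }, (_, id) => ({ id })));
  worker.postMessage(payload);
} else {
  parentPort.on('message', (json) => {
    // Parsing happens off the main thread...
    const parsed = JSON.parse(json);
    // ...but sending the result back serializes/deserializes it all over again.
    parentPort.postMessage(parsed);
  });
}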

JSON Streams

An alternative could be to use a streaming parser. By pausing and resuming the parsing operation, we achieve the goal of releasing the event loop for other computations.
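
As a minimal sketch with a streaming parser, assuming a library such as stream-json (the gist does not name one), each array element is emitted as a separate event, so control returns to the event loop between elements:

const fs = require('fs');
const { parser } = require('stream-json');
const { streamArray } = require('stream-json/streamers/StreamArray');

// Parse a large JSON array incrementally instead of in one blocking call.
fs.createReadStream('products.json')
  .pipe(parser())
  .pipe(streamArray())
  .on('data', ({ value }) => {
    // Handle one product at a time; the event loop stays responsive in between.
  })
  .on('end', () => {
    console.log('done parsing');
  });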

But stream parsers usually rely on a custom parser to preserve the state of the parsing operation. Thus, all the work happens in JS land, which in general tends to yield worse performance than the native JSON.parse built into V8.

Though this kind of parser is good for keeping the memory footprint low, it fails to achieve the goal of increasing throughput.
