by Avi Flax
2 November 2012
The current codebase has many problems which negatively impact the performance and scalability of the system:
The codebase uses the Web server framework Tornado, which is designed from the ground up to use only non-blocking I/O, with all computation occuring on a single thread. This is an excellent design which has the potential to fully utilize the computing resources of a given machine — taking full advantage of CPU resources without requiring a large amount of memory. However, the design is predicated on all computation performing only non-blocking I/O, so that the single thread can jump from task to task instead of waiting for I/O to complete.
Unfortunately, the codebase uses blocking I/O to interact with its database. This means that if, say 100 users send concurrent requests to an instance of the API, and one of those requests sends a query to the database which requires 500 MS to process, then the other 99 users will have their requests delayed by 500 MS. And that’s in a best case scenario — in all likelihood, the other users would continue to delay each other.
There are ways to reduce the impact of using blocking I/O with a single-threaded framework, but they add their own costs.
There are 2 options for correcting this:
- Switch to a multi-threaded framework
- Change the database interaction to use non-blocking I/O
Our current DBMS, MongoDB, is a document-oriented database. This means that all data is stored in independent documents, and the system provides no facilities for relating documents to each other. This has performance advantages: “By keeping related data together in documents, queries can be much faster than in a relational database where related data is separated into multiple tables and then needs to be joined later.” (source: 10Gen)
However, the codebase uses the library MongoEngine to interact with the database. MongoEngine’s design is seriously flawed, as it attempts to restore relationships to the data model in the application tier. When this is translated to the database layer, however, it results in each interaction requiring many extra queries, which signficantly slows down the interaction, and puts extra load on the database server.
The solution is to drop MongoEngine and use MongoDB as it is intended, by keeping all closely-related data in a single document, and only retrieving related documents when absolutely necessary. This involves a small degree of denormalization, but that’s generally considered a small price to pay for the improvements in performance.
There are some cases wherein the API needs to return a particular dataset, say, all the Collections for a given Edition. In this case, and a few others, the codebase currently does some of the “filtering” in the application tier, instead of the database tier. This means that instead of querying the database for all Collections matching a given Edition, the application will retrieve all the Collections from the database, load them into memory, and loop over the entire set in order to filter it down. The application relies on garbage collection to free the utilized memory.
This is extremely inefficient.
The solution is to change all of these cases so that the “filtering” is specified as part of the database tier, so the DBMS can utilize its indices to make the filtering extremely fast, and drastically less data is transferred between tiers, and drastically less memory is used to process the query results.
The Reuters Next API — that is, the interface, as defined by its specification, as opposed to the current implementation — is intended to conform to the architectural style known as REST. Key aspects of REST are that clients and servers use a “uniform interface” to exchange representations of the current and desired states of “resources”. Resources are “the key abstraction of information in REST”. They essentially represent any concept which is useful to a system. Individual resources are identified by URLs; every request starts by identifying the resource which is the target of the request.
Any codebase which implements a RESTful system is best organized by having units of code which represent or implement each indivual resource. That way the paradigm of the implementation matches the paradigm of the specification, and of the way that clients interact with the system. Use cases then have a direct one-to-one mapping to code units. This results in an easy-to-understand structure for the code, new features can be added easily, and debugging can be very straightforward.
The current RN API codebase is not organized according to the resources which make up the API. Instead, the code units are grouped by the second-level path segment of the URL patterns which identify each type of resource. This has resulted in many code units which are very large, rambling, and exceedingly difficult to understand or change.
The solution is to reorganize the codebase so that its code units mirror the types of resources defined by the API specification.
The codebase is organized into classes which extend RequestHandler
. Each class is responsible for responding to certain requests. An attempt was made to implement most common functionality in a class named BaseHandler
; requests are then routed to subclasses of BaseHandler
which may override parts of how BaseHandler works. This approach, while well-intentioned, is extremely problematic, and was not executed well. There are some cases where subclasses override methods of BaseHandler
, and other cases where they need to implement certain specially methods which BaseHandler
then calls as a form of “hooks” — unfortunately those “hook” methods are inconsistent, under-documented, inefficient, and ultimately unnecessary.
The end result is that it’s very difficult to follow how a given request is handled — which code units are executed when. This also makes the codebase very rigid; it’s difficult to reuse certain pieces of the code while implementing slightly different behavior.
The solution is to refactor the codebase so that the code path followed by each kind of request is centralized in a single place, with a clear sequence of operations. Common operations should be implemented in shared functions.
All database interactions are mediated by a set of classes which are intended to model all the domain entities. This sort of “model layer” is useful in a GUI application, or in an application backed by a relational database, because the model classes inherit the ability to persist themselves.
In an API server backed by a document database, a “model layer” is unnecessary. Objects can be easily serialized and deserialized directly to and from the database as needed, decoupling domain modeling from storage.
The solution is to remove the “model” layer, and replace it with direct database interaction.
Some methods are hundreds of lines long, making them very difficult to follow, and limiting the composability of the behavior implemented in the methods.
The solution is to refactor these methods, breaking them up into small functions or methods, with each one having a clear, specific function.