Parting Thoughts on the Reuters Next API
by Avi Flax • February 2013
Since tomorrow is my last day on the project, I hope this brain dump proves useful.
Replacing Mongo Connector
I recommend that Mongo Connector be replaced sooner rather than later, for two main reasons: reliability and speed.
Reliability: Mongo Connector is not sufficiently reliable. It handles errors poorly; some errors even disable it altogether. It also has no mechanism for retrying a failed transaction.
Speed: I mean the speed of search in the UX; it could be a lot faster. Mongo Connector transfers “raw” documents from MongoDB, so when those documents are retrieved as search results, they still need their references resolved and other processing, including further trips to the database, before they can be sent to the client. The search resource would be much faster if this work were done before the documents are added to the Elasticsearch index.
I recommend you replace Mongo Connector with a new service which monitors the MongoDB oplog and creates a Celery job for every change. The service would then handle both changes processed by a local instance of the API server and changes processed by a remote instance and transferred via replication.
The Celery jobs would do all the work required to make the documents directly usable as search results, including retrieving referenced documents from the database, and then upsert each document into Elasticsearch. When the search resource then retrieves results from Elasticsearch, it can send them to the client immediately, with minimal post-processing.
Another advantage of using Celery jobs is that failed jobs can be retried later, which would make the system more resilient and search results more accurate over time.
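To make the proposal concrete, here is a minimal sketch of the core decision the oplog-monitoring service would make: translating an oplog entry into a job description. This assumes the standard oplog entry shape (op codes “i”, “u”, “d”; namespace in “ns”; document in “o”; update target in “o2”); the tailing loop, the Celery app, and the job names (“prepare_and_upsert”, “delete_from_index”) are hypothetical.

```python
# Sketch: translate one MongoDB oplog entry into an indexing job description.
# In the real service this would run inside a loop over a tailable cursor on
# local.oplog.rs, and each returned job would be enqueued as a Celery task
# (with retries enabled) that resolves references and upserts into Elasticsearch.

def oplog_entry_to_job(entry):
    """Map one oplog entry to a (job_name, args) pair, or None to skip it."""
    op = entry["op"]          # 'i' = insert, 'u' = update, 'd' = delete
    namespace = entry["ns"]   # e.g. "rnapi.items"
    if op in ("i", "u"):
        # For inserts the new document is in 'o'; for updates the target
        # document's _id is in 'o2'.
        doc_id = entry["o2"]["_id"] if op == "u" else entry["o"]["_id"]
        return ("prepare_and_upsert", (namespace, doc_id))
    if op == "d":
        return ("delete_from_index", (namespace, entry["o"]["_id"]))
    return None  # ignore no-ops, commands, etc.
```

Keeping this mapping a pure function makes it easy to unit-test independently of MongoDB and Celery.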
A third reason to replace Mongo Connector is that it is effectively abandonware: I posted a pull request to its GitHub repository almost two months ago, and it has been moldering with no response since.
Handling DELETE Requests
I know “hard” deletes are a business requirement, but I see it as risky and ill-advised to actually wipe out all traces that a given resource ever existed in our system.
I recommend that one of these options be considered instead:
- Repurpose the Item status “retracted” for these “hard” deletes. When an Item is retracted, don’t just change its status: also wipe out its content, replacing it with a notice that it has been retracted. The headline could be replaced as well, or “Retracted” could be prepended to it. While it’s true that the Item’s slug might conflict with a future Item an editor wishes to post, I think that’s OK. In fact, I think it’s correct; it should conflict, because an Item did exist with that slug, and its existence probably shouldn’t be completely erased from the system. When a client attempts to GET an Item with the status “retracted”, the API server should respond with the status line 410 Gone.
- Create a MongoDB collection to store records of these deletions; when a DELETE is processed, add a new document to that collection detailing exactly which document was deleted, when, why, and by whom. At least then there will be a record of the deletion, and we’ll be able to prove that we performed it.
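Both options can be sketched as small pure functions. All field names here (“status”, “content”, “headline”, etc.) and the notice text are assumptions, not the actual schema:

```python
# Sketch of the two options above, with hypothetical field names.
import copy
from datetime import datetime, timezone

RETRACTION_NOTICE = "This item has been retracted."

def retract_item(item):
    """Option 1: keep the Item (and its slug) but wipe its content."""
    retracted = copy.deepcopy(item)
    retracted["status"] = "retracted"
    retracted["content"] = RETRACTION_NOTICE
    retracted["headline"] = "Retracted: " + retracted.get("headline", "")
    return retracted

def status_for_get(item):
    """Answer GETs for a retracted Item with 410 Gone, otherwise 200."""
    return 410 if item.get("status") == "retracted" else 200

def deletion_record(doc, deleted_by, reason):
    """Option 2: an audit document to insert into a 'deletions' collection."""
    return {
        "deleted_document": doc,
        "deleted_by": deleted_by,
        "reason": reason,
        "deleted_at": datetime.now(timezone.utc).isoformat(),
    }
```

The two options aren’t mutually exclusive; the audit record could be written even when an Item is merely retracted.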
API Design Rough Edges
There are a few rough edges in the API design which I recommend be addressed when possible:
- The “author” resources should be removed and replaced by Collections of type “author” (or “journalist”). Any “special” data that needs to be tracked for each author/journalist can be stored in a “special” property of these Collections, just as we do with Companies, Countries, etc.
- The Collection type “stream” should be removed, in favor of …
- The “user” resources should be disabled and commented out of the codebase and the spec until they’re actually needed.
- The “source” resources should be made dynamic and database-backed, so a new “source” could be added on the fly at runtime.
- All query parameters which have an underscore in their name should be renamed to replace the underscore with a dash/hyphen character.
- All “list” resources should respond to a POST request with either:
  - a 303 with no body, or
  - a 201 with the body containing a representation of the updated state of the target “list” resource (rather than a representation of the newly created “single” resource).
- The Item properties:
  - “content_type” should be renamed to …
  - “subtype” should be optional
  - “url” should be removed until they can be rethought
Editions
The way Editions work, in particular how they relate to Collections and Items and the rules governing those relationships, is partly hypothetical, and therefore not consistently and fully documented or implemented.
I think it’s likely that problematic and undesired behavior will arise once multiple Editions are created, once Collections and Items are created which are intended to “live” in specific Editions other than the US edition, and especially once some Collections start being associated with more than one Edition.
Therefore I recommend that you make sure you have plenty of time to work on multiple Editions well before you actually need to launch them. I suspect you will need that time to iron out the requirements and to find and address the bugs and edge cases.
Collections without Editions
Recently, while working on RNAPIS-818, I noticed that the API allows clients to create Collection resources which are not associated with any Editions. This may be OK for Collections whose status is “draft”, but it should not be allowed for Collections whose status is “published”. I recommend that the API reject any Collection entity whose status is “published” and which includes no Edition references.
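The rule is simple enough to express as a validator the API could run on create and update. The field names (“status”, “editions”) are assumptions about the schema:

```python
# Sketch: validation the API could apply to a Collection entity on
# create/update. Field names are hypothetical.

def collection_errors(collection):
    """Return a list of validation errors; an empty list means the entity is OK."""
    errors = []
    if collection.get("status") == "published" and not collection.get("editions"):
        errors.append("a published Collection must reference at least one Edition")
    return errors
```

Draft Collections pass with no Editions; published ones must carry at least one Edition reference.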
Possible Duplicate Company and Index Collections
When working on RNAPIS-818, I also noticed that the production API contained 66,226 Company Collections. Curiously, it also contained 66,226 Index Collections. I doubt this is a coincidence; it seems likely that the process which creates Company Collections is also creating a Collection of type “index” for each company. I recommend this be investigated and corrected.
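A quick way to test this suspicion, given a dump of Collections, is to count Collections by type and look for Index Collections whose names match Company Collections. The functions and the “type”/“name” field names here are assumptions for illustration:

```python
# Sketch: check whether Index Collections shadow Company Collections
# one-for-one, which would confirm the duplication suspicion above.
from collections import Counter

def type_counts(collections):
    """Count Collections by their 'type' field."""
    return Counter(c["type"] for c in collections)

def suspicious_pairs(collections):
    """Index Collections whose name matches a Company Collection's name."""
    company_names = {c["name"] for c in collections if c["type"] == "company"}
    return [c for c in collections
            if c["type"] == "index" and c["name"] in company_names]
```

If the matched pairs account for (nearly) all 66,226 Index Collections, the company-creation process is almost certainly emitting both documents.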