
Parting Thoughts on the Reuters Next API

by Avi Flax • February 2013

Since tomorrow is my last day on the project, I hope this brain dump of my thoughts might prove useful.

Search

I recommend that Mongo Connector be replaced sooner rather than later. Two main reasons: reliability and speed.

Reliability: the Mongo Connector is clearly not sufficiently reliable. It doesn’t handle errors well; some errors even disable it altogether. And it has no mechanism for retrying a failed transaction.

Speed: I mean the speed of search in the UX; it could be a lot faster. The reason is that Mongo Connector transfers “raw” documents from MongoDB; once these documents are retrieved as search results, their references must be resolved and some other processing performed before they can be sent to the client, which involves retrieving data from the DB. The speed of the search resource would be much better if this work were done before the documents are added to the ElasticSearch index.

I recommend you replace Mongo Connector with a new service which monitors the MongoDB oplog and creates a Celery job for every change. This way the service will work for changes processed by a local instance of the API server, or changes processed by a remote instance and transferred via replication.
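A rough sketch of such a service, assuming illustrative namespace names and an `enqueue` callback standing in for a Celery task’s `.delay` (none of these names are from the project’s actual code):

```python
# Hypothetical sketch of the proposed oplog-tailing replacement service.
# WATCHED_NAMESPACES and `enqueue` are illustrative assumptions.

WATCHED_NAMESPACES = {"rnapi.items", "rnapi.collections"}

def relevant(entry):
    """True for oplog entries the indexer should act on: inserts,
    updates, and deletes in the content collections."""
    return (entry.get("ns") in WATCHED_NAMESPACES
            and entry.get("op") in ("i", "u", "d"))

def tail_oplog(client, enqueue):
    """Follow the oplog forever, handing each relevant entry to `enqueue`.
    This works for local writes and for writes arriving via replication,
    since both pass through the oplog."""
    import pymongo  # third-party; only needed when actually tailing
    oplog = client.local["oplog.rs"]
    # A tailable cursor blocks waiting for new entries, like `tail -f`.
    cursor = oplog.find(cursor_type=pymongo.CursorType.TAILABLE_AWAIT)
    for entry in cursor:
        if relevant(entry):
            enqueue(entry["ns"], entry["o"])
```

Because the filter is a pure function, it can be unit-tested without a running MongoDB instance.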

The Celery jobs would perform all the work required to prepare the documents to be directly usable as search results, including retrieving referenced documents from the DB, and then upsert the document into ElasticSearch. That way, when the search resource retrieves search results from ElasticSearch, they could be sent to the client immediately with minimal post-processing required.

Another advantage of using Celery jobs for this is that if they fail they can be retried later, which would make the system more resilient, and search results eventually more accurate.
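A sketch of what such a job might look like; the field names (`collection_refs`, `collections`) and the `fetch` stand-in for the DB lookup are illustrative assumptions, not the project’s actual schema:

```python
# Illustrative sketch only: `collection_refs`, `collections`, and `fetch`
# are assumed names, not the project's actual schema.

def resolve_references(item, fetch):
    """Replace an Item's reference ids with the referenced documents, so the
    stored search result can be sent to clients with minimal post-processing."""
    resolved = dict(item)
    resolved["collections"] = [fetch(ref)
                               for ref in item.get("collection_refs", [])]
    resolved.pop("collection_refs", None)
    return resolved

# The Celery task itself would wrap this and upsert into ElasticSearch,
# retrying on failure, along the lines of:
#
#   @app.task(bind=True, max_retries=5, default_retry_delay=60)
#   def index_item(self, item_id):
#       try:
#           item = db.items.find_one({"_id": item_id})
#           es.index(index="items", id=str(item_id),
#                    body=resolve_references(item, fetch))
#       except Exception as exc:
#           raise self.retry(exc=exc)  # failed jobs are retried later
```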

A third reason the Mongo Connector should be replaced is that it is effectively abandonware. I posted a Pull Request to its GitHub repository almost 2 months ago, and it’s been moldering with no response since then.

Handling DELETE Requests

I know “hard” deletes are a business requirement, but I see it as risky and ill-advised to actually wipe out all traces that a given resource ever existed in our system.

I recommend that instead one of these options be considered:

  • repurpose the Item status “retracted” for these “hard” deletes. When an Item is retracted, don’t just change its status: also wipe out its content, replacing it with a notice that it has been retracted. The headline could be replaced as well, or “retracted” could be prepended to it. While it’s true that the Item’s slug might cause conflicts with a future Item an editor wishes to post, I think that’s OK. Actually, I think it’s correct; it should conflict, because an Item did in fact exist with that slug, and its existence probably shouldn’t be completely erased from the system. When a client attempts to GET an Item with the status “retracted”, the API server should respond with the status line 410 Gone.

  • create a MongoDB collection to store records of these deletions; when a DELETE is processed, add a new document to that collection detailing exactly which document was deleted, and when, and why, and by whom. At least then there will be a record of the deletion, and we’ll be able to prove that we performed it.
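A minimal sketch of the first option; the field names and notice text are illustrative assumptions:

```python
# Hypothetical sketch of the "retracted" option; field names and the
# notice text are assumptions, not the project's actual schema.

def retract(item):
    """Keep the Item (and its slug) but blank out its content."""
    return {
        **item,
        "status": "retracted",
        "headline": "Retracted: " + item.get("headline", ""),
        "body": "This item has been retracted.",
    }

def status_code_for(item):
    """A GET for a retracted Item answers 410 Gone rather than 404,
    signaling that the resource existed but was deliberately removed."""
    return 410 if item.get("status") == "retracted" else 200
```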

API Design

There are a few rough edges of the API design which I recommend be addressed when possible:

  • The “author” resources should be removed and replaced by Collections of type “author” (or “journalist”). Any “special” data needed to be tracked for each author/journalist can be stored in a “special” property of these Collections, just as we do with Companies, Countries, etc.

  • The Collection type stream should be removed, in favor of topic

  • The “user” resources should be disabled and commented-out of the codebase and the spec until they’re actually needed

  • The “source” resources should be made dynamic and database-backed so a new “source” could be added on the fly at runtime

  • All query parameters which have an underscore in their name should be renamed to replace the underscore with a dash/hyphen character

  • All “list” resources should respond to a POST request with either:

    • a 303 with no body
    • a 201 with the body containing a representation of the updated state of the target “list” resource (rather than a representation of the newly created “single” resource)
  • The Item properties document_type and content_type should be renamed to type and subtype

    • subtype should be optional
  • The properties permalink and url should be removed until they can be rethought

Editions

The way Editions work, in particular the way they relate to Collections and Items and the rules governing those relationships, is partly hypothetical, and therefore not consistently and fully documented or implemented.

I think it’s likely that once multiple Editions are created, and Collections and Items are created that are intended to “live” in specific Editions other than the US edition, and especially once some Collections are associated with more than one Edition, problematic and undesired behavior will arise.

Therefore I recommend that you make sure that you have plenty of time to work on multiple Editions prior to the need to actually launch them. I suspect you will need the time to iron out the requirements and to find and address the bugs and edge cases.

Collections without Editions

Recently, when working on RNAPIS-818, I noticed that the API allows clients to create Collection resources which are not associated with any Editions. This may be OK for Collections whose status is draft, but it should not be allowed for Collections whose status is published. I recommend that the API reject any Collection entity whose status is published and which includes no Edition references.
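The proposed check could look roughly like this, assuming Collections carry a `status` field and a list of Edition references in an `editions` field (illustrative names):

```python
# Sketch of the proposed validation; `status` and `editions` are assumed
# field names, not necessarily the project's actual schema.

def validate_collection(collection):
    """Return a list of validation errors; an empty list means acceptable."""
    errors = []
    if (collection.get("status") == "published"
            and not collection.get("editions")):
        errors.append(
            "a published Collection must reference at least one Edition")
    return errors
```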

Possible Duplicate Company and Index Collections

When working on RNAPIS-818, I also noticed that the production API contained 66,226 Company Collections. Curiously, it also contained 66,226 Index Collections. I doubt this is a coincidence; it seems likely that the process which creates Company Collections is also creating a Collection of type index for each company. I recommend this be investigated and corrected.
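One way to start investigating, assuming each Collection document carries a `type` field: count Collections per type and compare. Against the live DB this would be an aggregation such as `db.collections.aggregate([{ $group: { _id: "$type", n: { $sum: 1 } } }])`; the function below is the in-memory equivalent.

```python
# Count Collections per type; equal company/index counts would support
# the duplicate-creation hypothesis.
from collections import Counter

def counts_by_type(collections):
    return Counter(c.get("type") for c in collections)
```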
