@ashleygwilliams
Last active November 1, 2016 17:36

Zero One Infinity Readmes

npm has only been a company for 3 years, but it has been a code base for around 5-6 years. Much of it has been rewritten, but the cores of the CLI and registry products are still the original code. Having only worked at npm for a year at this point, I still have a lot left to learn about how the whole system works.

Sometimes, a user files a bug that, in the process of debugging it, teaches you some things you didn't know about your own system. This is the story of one of those bugs.

The Bug

Over the past week or so, several people filed issues regarding some strange truncating in npm package pages. In one issue, a user reported what appeared to be a broken link in their README:

Another user pointed out that the entire end portion of their README was missing!

As a maintainer of npm's markdown parser, marky-markdown, I was concerned that these issues were a result of some parsing rule gone awry. However, another marky-markdown maintainer, @revin, quickly noted something odd: the description was cut off at exactly 255 characters, and the README was cut off at exactly 64kb. As @aredridel pointed out: those numbers are smoking guns.

Indeed, an internal npm service called registry-relational-follower was truncating both the READMEs and descriptions of packages published to the npm registry. This was a surprise to me and my colleagues and so I filed an issue on our public registry repo. In nearly no time at all, our CTO @ceejbot responded by saying that this was intended behavior(!) and closed the issue.

"TIL!" I thought. And that's when I decided to dig into how the registry handles READMEs... and why.

Zero One Infinity Rule

Before I dive into exactly what happens to your packages' READMEs between your writing and publishing them and their rendering on the npm website- let's address the 800lb gorilla in the room:

When I discovered that the registry was arbitrarily truncating READMEs, I thought: "Seems bad."

Maybe you thought this, too.

Indeed, at least one other person did- commenting on the closed issue:

This may be desired by npm, but I doubt any package authors desire their descriptions to be truncated. Also, see zero-one-infinity.

So, I should start by saying that commenting negatively on an already closed issue isn't the best move in the world. However, I appreciated this comment because it gave me new words to explain my own vaguely negative feelings about this truncation situation; fancy words with a nice name: The Zero One Infinity rule.

The Zero One Infinity rule is a guiding principle made popular by Dutch computer scientist Willem van der Poel and goes as follows:

Allow none of foo, one of foo, or any number of foo.

This principle stands to eliminate arbitrary restrictions of any kind. Functionally, it suggests that, if you are going to allow something at all, allow one thing or allow an infinite amount. This seems to be aligned with a symbiotic rule: the Principle of Least Astonishment, which states:

If a necessary feature has a high astonishment factor, it may be necessary to redesign the feature.

In the end, these principles are fancy, important-sounding ways of saying: arbitrary restrictions are surprising. And we shouldn't be surprising our users.

Ok, so now that we can agree that surprising users with strange and seemingly arbitrary restrictions is no bueno- why is it that the npm registry currently has this restriction? Certainly the npm developers don't want to be surprising developers, right?

An Archaeology of Registry Architecture

And indeed, they don't! The current restriction on description and README size is a bandaid that the npm registry developers were forced to apply as a result of the original architecture of the npm registry- large READMEs were making npm slow.

"How the heck---" you might be thinking. Reasonable. Let's take a look at that.

How npm Deals with READMEs on Publish

Currently, here is how your READMEs are dealt with by the registry:

When you type npm publish, the CLI tool takes a look at your .npmignore (or your .gitignore, if no .npmignore is present) and the files key of your package.json. Based on what it finds there, it takes the files you intend to publish and runs npm pack which packs everything up in a tarball, or .tar.gz file. npm doesn't allow you to ever ignore the README file, so that gets packed up no matter what!

So when you type npm publish your README gets packed into a package tarball. This is what gets downloaded when someone npm installs your package. But this is not the only thing that happens with your README.

So while npm publish runs npm pack, it also runs a script called publish.js that builds an object containing the package's metadata. Over the course of your package's life (as you publish new versions), this metadata grows. First, read-package-json is run and grabs the content of your README file based on what you've listed in your package.json. Then publish.js adds this README data to the metadata for your package. You can think of this metadata as a more verbose version of your package.json- if you ever want to check out what it looks like you can go to http://registry.npmjs.com/<package-name>, for example http://registry.npmjs.com/marky-markdown. As you'll see- there's README data in there for whichever version of your package has the latest tag!
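If you'd rather poke at that metadata from a script than from a browser, here is a minimal sketch in Python. The function names are mine, not anything from npm, and I'm assuming the README sits under a top-level readme key in the JSON document and that the slash in a scoped package name should be percent-encoded. Treat it as an illustration, not as an official client.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

REGISTRY = "http://registry.npmjs.com"  # the host used in this post


def metadata_url(package_name: str) -> str:
    """Build the metadata URL for a package name."""
    # Assumption: scoped names such as "@scope/name" need their slash
    # percent-encoded, while the leading "@" is left alone.
    return f"{REGISTRY}/{quote(package_name, safe='@')}"


def readme_from_metadata(metadata: dict) -> str:
    """Return the README stored in a metadata document, or "" if there is none."""
    # Assumption: the README lives under a top-level "readme" key.
    return metadata.get("readme", "")


def fetch_metadata(package_name: str) -> dict:
    """Download and parse the metadata document (this makes a network request)."""
    with urlopen(metadata_url(package_name)) as response:
        return json.load(response)
```

Calling readme_from_metadata(fetch_metadata("marky-markdown")) should give you the README text the registry holds for that package.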

Finally, publish.js sends this metadata, including your README, to npm-registry-couchapp's publish- and here is where we bump into our truncation situation.

While npm publish sends the entire README data to the registry, the entire README does not get written to the database. Instead, when the database receives the README, it truncates it at 64kb before inserting.
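To make the truncation concrete, here is a rough sketch of what "cut it off before inserting" could look like. This is not the actual code from the registry; the names are hypothetical, and I'm assuming both limits are counted in characters, which may not match how the real services count them.

```python
README_LIMIT = 64 * 1024   # 64kb, assumed here to be counted in characters
DESCRIPTION_LIMIT = 255    # characters


def truncate(text: str, limit: int) -> str:
    """Keep at most `limit` characters of `text`."""
    return text[:limit]


def truncate_metadata(metadata: dict) -> dict:
    """Return a copy of the metadata with README and description cut to size."""
    trimmed = dict(metadata)  # shallow copy, so the caller's dict is not modified
    if isinstance(trimmed.get("readme"), str):
        trimmed["readme"] = truncate(trimmed["readme"], README_LIMIT)
    if isinstance(trimmed.get("description"), str):
        trimmed["description"] = truncate(trimmed["description"], DESCRIPTION_LIMIT)
    return trimmed
```

Note that nothing in a sketch like this tells the publisher anything was dropped, which is exactly why the missing text was such a surprise.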

So while we talk about a package on the npm registry as a single entity- the truth is that a single package is actually made up of multiple components that are dealt with by the npm registry services differently. Notably, there's one service for tarballs, and another for metadata! And your README is added to both.

This means that the registry has 2 versions of your README:

  • The original version as a file in the package tarball
  • A potentially truncated version in the package metadata

As you may already be guessing- the reason that users have been seeing truncated READMEs on the npm website is that the npm website uses the README data from the package metadata. This makes a fair amount of sense- if we wanted to use the READMEs in the package tarballs, we'd have to unpack every package tarball to retrieve the README and that would not be super efficient. Reading README data from a JSON response, which is how the npm registry serves the package metadata, seems, at least, a little more reasonable than unpacking 300,000+ tarballs.
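If you want to check whether one of your own packages was affected, you can compare the two copies. The sketch below is only an illustration: the function names are made up, and I'm assuming the README sits at package/README.md inside the tarball, which will not hold for every package (READMEs come with different file names and casing).

```python
import io
import tarfile


def readme_from_tarball(tarball_bytes: bytes, member: str = "package/README.md") -> str:
    """Extract the README text from a gzipped package tarball held in memory."""
    # Assumption: `member` is where the README lives inside the tarball.
    with tarfile.open(fileobj=io.BytesIO(tarball_bytes), mode="r:gz") as archive:
        extracted = archive.extractfile(member)  # raises KeyError if absent
        if extracted is None:
            raise ValueError(f"{member} is not a regular file")
        return extracted.read().decode("utf-8")


def was_truncated(tarball_readme: str, metadata_readme: str) -> bool:
    """True when the metadata README is a strictly shorter prefix of the tarball README."""
    return (
        len(metadata_readme) < len(tarball_readme)
        and tarball_readme.startswith(metadata_readme)
    )
```

Even this tiny version has to gunzip and walk the archive to find a single file, which hints at why doing it for every package on every page view was never an attractive option.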

History Lesson Time

So now we know where the READMEs are truncated, and how those truncated READMEs are used- but it is still not necessarily clear why. This requires a bit of archaeology.

Like many things about npm, this truncation was not always the case. On January 20, 2014, @isaacs committed the 64kb README truncation to npm-registry-couchapp, and he had several very good reasons for doing so:

  • Firstly, allowing extremely large READMEs exposed us to a potential DDoS attack. An unsavory actor could automate the publishing of several packages with epically large READMEs and take down a bunch of npm's infrastructure.

  • Secondly, extremely large READMEs in the package metadata were exploding the file size of that document- making GET requests to retrieve package data very slow. Requesting the package metadata happens for every package on an npm install, so ostensibly a single npm install could be gummed up having to read several packages with very long READMEs- READMEs that wouldn't even be useful to the end user, who would either use the unpacked README from the tarball or wouldn't even need the README assuming the package was a transitive dependency far down in the dependency tree.

Interestingly enough, the predicament of exploding document size was a problem that npm had dealt with before.

Remember when we pointed out that a single package is actually a set of data managed by several different services? Like many things at npm: this was not always the case.

Originally, the npm registry was entirely contained by a single service, a CouchDB App, on top of a CouchDB database. CouchDB is a database that uses JSON for documents, JavaScript for MapReduce indexes, and regular HTTP for its API. CouchDB comes with a piece of out-of-the-box functionality, called CouchApp, which is a web application served directly from CouchDB. npm's registry was originally exclusively a CouchApp: packages were single, document-based entities with the tarballs as attachments on the documents. The simplicity of this architecture made it easy to work with and maintain: a totally reasonable version 1. However, soon after that, npm began to grow extremely quickly- package publishes and downloads exploded- and believe it or not: the original architecture scaled poorly. As packages grew in size and number, and dependency trees grew in length and complexity, performance ground to a halt and the npm registry would crash often. This was a period of intense growing pains for npm.

To mitigate this situation, @isaacs split the registry into two pieces: a registry that had only metadata (attachments were moved to an object store called Manta and removed from the CouchDB), which he called skim, and another registry that contained both the metadata and the tarball attachment, called full-fat. This splitting was the first of what would be (and will continue to be!) multiple refactoring efforts to reduce the size of the package metadata document and to distribute the processing of packages across multiple services to improve performance.
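As a toy illustration of the difference between the two, here is how a full-fat document could be turned into a skim one. This is a hypothetical sketch and not the real skim code: the only thing I'm relying on is that CouchDB keeps files attached to a document under its _attachments field, and I'm leaving out the part where the tarballs get moved to the object store.

```python
import copy


def skim(full_fat_doc: dict) -> dict:
    """Return a metadata-only copy of a package document."""
    skimmed = copy.deepcopy(full_fat_doc)  # leave the full-fat document untouched
    # `_attachments` is the CouchDB field that holds a document's attached files.
    skimmed.pop("_attachments", None)
    return skimmed
```

The same move, taking a heavy thing out of the document that everybody has to download, is what shows up again and again in the rest of this story.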

If you look at the npm registry architecture today, you'll see the effects of our now CTO @ceejbot's effort to continue to split the monolith: slowly separating out registry functionality into multiple smaller services, some of which are no longer backed by the original CouchDB, and are backed by Postgres.

Plans for the Future

Turns out that nobody thinks that arbitrarily restricting README length is a good thing. There are plans in the works for a registry version 3- and changing up the README lifecycle is definitely in the cards. Much like the original shift that @isaacs made when he created the skim and full-fat registry services- the team would ideally like to see the README data removed from the package metadata document and moved to a service that can render READMEs and serve them statically to the website. This would bring several awesome benefits:

  • Firstly, no more README truncating! Good-bye arbitrary restrictions!
  • Secondly, speeding up the website by moving the markdown parsing to its own service.
  • Thirdly, speeding up the website even more by pre-parsing the READMEs and serving them statically instead of parsing them on request (yes we cache, but still).
  • Fourthly, serve READMEs for all versions of a package! Since we have lowered the cost of READMEs, we can not only parse more of a single README, but we can now parse more READMEs :)

Because npm cares deeply about backwards compatibility, as the npm registry grows out of its CouchApp and CouchDB origins, all of the original endpoints and functionality of the original API will continue to be supported. So there will always be a service where you can request the metadata of a package and get the README for the latest version. However, npm itself is not required to continue to use that service, and moving on from it towards our vision of registry version 3 will be an awesome improvement, across several axes.

Happy Debugging!

A friend recently tweeted:

systems as designed are great, but systems as found are awful

This is not a shot at npm; this statement is pretty ubiquitously true. Most systems that are of any interest to anyone are the products of a long and likely complicated history of constraints and motivations, and such circumstances often produce strange results. As displeasing as the systems you find might be, there is still a pleasure in finding out how a system "works" (for certain values of "work", of course).

In the end, the "fix" for the "bug" was "we've got a plan for that, but it's gonna take a bit", which isn't all that satisfying. However, the process of tracking down a seemingly simple element of the npm registry system, exploring it across services and time, was extremely rewarding. In fact, in the process of writing this post I became aware that Crates.io, the website for the Rust Programming Language's package manager, Cargo, was dealing with a very similar situation regarding their package READMEs. However, instead of trying to remove them from their package metadata like us- they are considering putting them in! If I hadn't had the opportunity to dig around in the registry internals, I might not have been ready to offer them suggestions with the strength of 5 years of experience.

So — the moral of the story is: When you can, take the time to dig through the caves of your own software and ask questions about past decisions and lessons. And then write down what you learn. It might be helpful one day. Probably sooner than you think.
