@mikeal
Created December 27, 2013 18:25
CouchDB attachments manta thread
https://twitter.com/izs/status/416635696367951873
let's talk here, we have comments with more than 140 characters!

mikeal commented Dec 27, 2013

I understand why having a huge couch file is a pain in the ass, but for whom? For the people replicating, they want that data anyway. Now that the load on hosted npm has been decreased substantially because of Manta and the CDN, is that big file still an issue?

Also, what, does couch suck at big data? #trollface


indutny commented Dec 27, 2013

Well, we have isaacs' blog comments too. But whatever.

I'm helping one company with their local couchdb replica; they need attachments to be on the local network for:

  • Low latency
  • High reliability

And it's quite hard to have a reliable service in Russia if its owner is living in California :P I'm OK with having a separate public couchdb database that exists solely for replication, but I'm totally against making it CDN-only!


janl commented Dec 27, 2013

I think a little proxy that I can install alongside a local couch, or a couch plugin that mirrors the attachments from the CDN for my local users, would do the trick of separation and ease of use.


mikeal commented Dec 27, 2013

@janl yeah, the little node proxy sounds much more doable than a special replicator that also pulls down the attachments; I can't even begin to imagine how that would work.


indutny commented Dec 27, 2013

@janl @mikeal that won't work for my case, as they want to have it available even without access to the internet.


janl commented Dec 27, 2013

@indutny I do mean a proxy that syncs down the attachments for local use, not just a proxy to the web.


indutny commented Dec 27, 2013

@janl ok, that would work. And it seems that if @isaacs ignores minorities' issues, I'll need to implement it anyway :P


janl commented Dec 27, 2013

:D happy to help


isaacs commented Dec 27, 2013

Required reading: http://blog.npmjs.org/post/71267056460/fastly-manta-loggly-and-couchdb-attachments

Here's an idea I've been kicking around that would probably be a much better architecture. The tricky thing is figuring out the least-disruptive steps to get there, but this is the outline of the goal. Consider this a rough draft of my next post on the npm blog. I'll be posting it there when actual steps are more formalized, because I want to avoid discussing vaporware on that blog as much as possible.

Plan

There are two couches:

  1. Attachment-free registry, where the dist.tarball url is all you have to go on, and points to a CDN-fronted data warehouse. (Specifically, Fastly in front of Manta, for the foreseeable future, though it may one day make sense to move old rarely-requested tarballs to something slower and cheaper.) Call this R1.
  2. Attachment-filled registry, where the dist.tarball url is always going to point to the local copy. Call this R2.
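
As a purely hypothetical illustration (made-up package name, version, and local couch URL; the public URL just follows the registry's usual tarball pattern), the only per-version difference a client would see is where dist.tarball points:

// R1 / public registry: attachment-free, tarball served from the CDN-fronted warehouse
var r1version = {
  "name": "example-pkg",
  "version": "1.2.3",
  "dist": {
    "shasum": "…",
    "tarball": "https://registry.npmjs.org/example-pkg/-/example-pkg-1.2.3.tgz"
  }
}

// R2 / replication registry: tarball points at the local couch attachment (hypothetical host)
var r2version = {
  "name": "example-pkg",
  "version": "1.2.3",
  "dist": {
    "shasum": "…",
    "tarball": "http://localhost:5984/registry/example-pkg/example-pkg-1.2.3.tgz"
  }
}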

The registry.npmjs.org URL points to Fastly, which is in turn backed by R1 for json, and Manta for tarballs. If a Manta tarball request receives a 404, it then falls back to the appropriate attachment url on R1.

Publishes do a PUT to R1 (well, to Fastly, but it just pipes the request through, because it can't cache PUTs). This adds an attachment to R1.

A follower script listening to R1's changes feed sees the change with an attachment. It uploads the attachment to Manta. Then, it does a PUT to R1, removing the attachment.
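
A minimal sketch of that importer, assuming the follow and request modules; uploadToManta() is a hypothetical stand-in for whatever does the Manta upload (mcouch in practice), and the R1 URL is made up:

var follow = require('follow')
var request = require('request')

var R1 = 'http://localhost:5984/registry'   // hypothetical R1 location

follow({ db: R1, include_docs: true }, function (er, change) {
  if (er) throw er
  var doc = change.doc
  if (!doc || !doc._attachments) return     // nothing to import for this change

  var names = Object.keys(doc._attachments)
  var remaining = names.length
  names.forEach(function (name) {
    // stream the attachment out of couch and into Manta
    var src = request.get(R1 + '/' + encodeURIComponent(doc._id) + '/' + name)
    uploadToManta(doc._id, name, src, function (er) {
      if (er) throw er
      if (--remaining === 0) removeAttachments(doc)
    })
  })
})

// PUT the doc back without its attachments (the Manta copies stay put)
function removeAttachments (doc) {
  delete doc._attachments
  request.put({ url: R1 + '/' + encodeURIComponent(doc._id), json: doc }, function (er, res) {
    // a 409 here just means the doc changed underneath us; a later change picks it up again
  })
}

function uploadToManta (id, name, stream, cb) { /* hypothetical; node-manta or mcouch in reality */ cb() }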

A second follower script listening on R1's changes feed gets each document change, and compares it to the document in R2. It merges the R1 data update with the R2 data, and if necessary, fetches any missing attachments. Then it puts the resulting merged doc back into R2 (repeating the process if there are conflicts). If there are no necessary changes (e.g., if the only change was an attachment removal), then it leaves R2 as-is.
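
And a rough sketch of that second follower, again with hypothetical URLs; mergeDocs() stands in for the actual comparison/merge logic:

var follow = require('follow')
var request = require('request')

var R1 = 'http://localhost:5984/registry'   // hypothetical
var R2 = 'http://localhost:5985/registry'   // hypothetical

follow({ db: R1, include_docs: true }, function (er, change) {
  if (er) throw er
  syncDoc(change.doc)
})

function syncDoc (r1doc) {
  request.get({ url: R2 + '/' + encodeURIComponent(r1doc._id), json: true }, function (er, res, r2doc) {
    if (er) throw er
    var merged = mergeDocs(r1doc, res.statusCode === 404 ? null : r2doc)
    if (!merged) return     // no necessary changes (e.g. the only change was an attachment removal)
    request.put({ url: R2 + '/' + encodeURIComponent(merged._id), json: merged }, function (er, res) {
      if (!er && res.statusCode === 409) syncDoc(r1doc)   // conflict: re-read R2 and try again
    })
  })
}

// mergeDocs() would copy the R1 metadata onto the R2 doc (keeping R2's _rev) and
// fetch any tarballs R2 is missing; purely hypothetical here.
function mergeDocs (r1doc, r2doc) { /* ... */ }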

R2's size on disk is pushing 200GB with the current dataset. If 2014 is anything like 2013, then it's going to be around 2TB by the end of the year. Hopefully, before then, we'll have a solution for attachment storage that doesn't rely on a single contiguous file, or else this is not sustainable. (Several people have talked about patches to CouchDB that would make it possible for attachments to "actually" live elsewhere, and keeping R2 alive at all will eventually depend on that.)

Any changes to R2 must comply with the existing couch-centric view of things: packages are docs with a tarball attachment for each version. If we're going to break that, then that means breaking the contract that R2 implements.

R2 does not expose a "registry" interface. In fact, we'll maybe flat-out block any requests to /registry/_design/*/* so that it's impossible to even run views or rewrites. However, the actual registry database (design doc and documents) will be exactly as it is today.

The only purpose of R2 will be replication, and it will tend to be a few seconds behind the "real" registry. It will probably be slower than the "real" registry, but since it is doing many orders of magnitude less work, it ought to be faster than what we have today, and more stable.

R1's size on disk will be well under 1GB by current estimations. That means that the whole thing can be easily cached in memory on commodity compute zones, even if we have several more 10x years. For users who ARE comfortable going across the network for public packages, or who are only interested in metadata analysis, it'll take much less time to replicate R1 locally.

Separating the metadata and package data opens up a lot of interesting opportunities for other distributed architectures, but the primary goal here is to provide maximum continuity and quality of service.

Current Status and Steps to Get There

The current Manta attachment importer thing is a straight call to mcouch's command line util. So, it does nothing npm-specific, but it's a good start, and super useful for the purpose of exploring the options here.

The R1 -> R2 follower script is next. Once that's in place and working, I can do the R1 "import to Manta and then delete the attachment (but don't also delete it from Manta!)" bit.

Both of these are going to first be done on a 1/256th slice of the registry in a staging environment and tested thoroughly before being unleashed on the public registry.

Of course, I recognize that a lot of other people's worker scripts would have to be changed if all this randomly goes away or changes drastically. That sucks, and I don't like that. I do actually care a lot about minorities.

One potential way around that is to have http://isaacs.iriscouch.com/registry be the home of R2 (or at least, one copy of it), since that's effectively what it is today. In that case, the "only" things that have to change are my own automation workers and CDN configuration. The downside there is figuring out how to get each piece set up and working such that it doesn't cause problems if it runs in the current setup, but can also do its job in the new setup. Taking down the website, killing the download counters, or destroying _users docs are all unacceptable, even in the short term.

In any case, just like getting Fastly in front of the registry in the first place, there are steps to releasing such a thing. First, I'll test it myself. Then, reach out to people who I know will be affected and ask them to test it as well. Then, turn it on in such a way that we can quickly roll it back. Once it works in production for a while, then we start dismantling any pieces that are no longer necessary. It's more work, but that's the cost of popularity :)

Final Question

If you are a current replication user, and you depend on having the attachments locally:

  1. Do you ALSO depend on your replication being less than 1 second behind the "real" registry?
  2. Do you depend on replicating from a host that also exposes a "registry" API as well as a "couch" API?
  3. Is there anything else in this setup that will break your stuff or ruin your day?


indutny commented Dec 27, 2013

Delay is not that important (if it is < 10-15 minutes) :)


mikeal commented Dec 27, 2013

@izs this will work provided that you write a custom comparison between R1 and R2 documents, because the _rev will be pretty worthless. You'll also need to run constant compaction on R1 if you actually want to keep the disk size down.


janl commented Dec 27, 2013

  1. No.
  2. No.
  3. No-ish, except that the code path for a local installation then would be different from a regular setup. Special care needs to be taken that this doesn’t break in the future. I’m not too worried about this not working right away, since it used to be the default for so long.

For the viability of R2, as you know, keep close contact with the CouchDB folks through dev@ or JIRA (preferred) or IRC or me personally, but you all know that.


janl commented Dec 27, 2013

On the R1->R2 migrator script: that’s the one I haven’t managed to think through yet, but it seems to me the part of this that is most likely to make it fail.

Not to bikeshed too much:

An alternate setup could be a node “proxy” or couch plugin for offline users that knows how to mirror & update attachments from the CDN and serve them as a “local” or “offline” CDN. The proxy would rewrite dist.tarball to the “local CDN” and npm would be none the wiser.

This would allow using any sort of [distributed] binary update setup, including rsync and BitTorrent, which seems like more work for now, but might help down the road.

Then R2 is not really required any more.
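
A minimal sketch of that kind of proxy, assuming Node's http module plus the request module; the mirror directory, port, and the local couch replica URL are all hypothetical, and the mirror is assumed to use the same path layout as the CDN:

var http = require('http')
var fs = require('fs')
var request = require('request')

var LOCAL_COUCH = 'http://localhost:5984/registry'   // local replica of the metadata-only registry
var MIRROR = '/var/npm-mirror'                       // tarballs synced down from the CDN (rsync, bittorrent, ...)
var LOCAL = 'http://localhost:8080'                  // what dist.tarball gets rewritten to

http.createServer(function (req, res) {
  if (/\.tgz$/.test(req.url)) {
    // tarball request: serve it straight from the local mirror
    fs.createReadStream(MIRROR + req.url)
      .on('error', function () { res.statusCode = 404; res.end() })
      .pipe(res)
    return
  }
  // metadata request: read the doc from the local couch and rewrite dist.tarball
  request.get({ url: LOCAL_COUCH + req.url, json: true }, function (er, up, doc) {
    if (er || !doc || !doc.versions) { res.statusCode = 502; return res.end() }
    Object.keys(doc.versions).forEach(function (v) {
      var dist = doc.versions[v].dist
      dist.tarball = dist.tarball.replace(/^https?:\/\/[^\/]+/, LOCAL)
    })
    res.setHeader('content-type', 'application/json')
    res.end(JSON.stringify(doc))
  })
}).listen(8080)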


janl commented Dec 27, 2013

Addendum: the whole motivation for the “alternate setup” is to avoid having to have R2.

And an added benefit is that npm only needs to know about one type of operation as the remote and local setups would look alike.


isaacs commented Dec 27, 2013

@mikeal

this will work provided that you write a custom comparison between R1 and R2 documents, because the _rev will be pretty worthless. You'll also need to run constant compaction on R1 if you actually want to keep the disk size down.

The comparison between R1 and R2 docs will be pretty simple, because the replication job will be the only thing ever writing to R2. Basically: "Are there attachments for all the versions? If not, go get them, and slap them on."
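
Something along those lines, sketched with a hypothetical doc shape and no error handling:

// Which versions in the merged doc are missing their tarball attachment?
// Attachment names follow the "<name>-<version>.tgz" convention.
function missingAttachments (doc) {
  var have = doc._attachments || {}
  return Object.keys(doc.versions || {}).filter(function (v) {
    return !have[doc.name + '-' + v + '.tgz']
  })
}
// Each missing version's tarball would then be fetched from its dist.tarball URL
// and attached before the doc is written back to R2.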

@janl

the code path for a local installation then would be different from a regular setup. Special care needs to be taken that this doesn’t break in the future. I’m not too worried about this not working right away, since it used to be the default for so long.

What's the difference between a "local installation" and a "regular setup"? By "regular setup", do you mean the public registry?

For the viability of R2, as you know, keep close contact with the CouchDB folks through dev@ or JIRA (preferred) or IRC or me personally, but you all know that.

I do, you guys are great :)


isaacs commented Dec 27, 2013

@janl

... offline cdn ..
Then R2 is not really required any more.

R2 is still required for existing replicators to be minimally impacted. And, in fact, if R2 can be found at the current CouchDB location, then so much the better.

That proxy would be cool, though, and I'm sure people could benefit from it. See also:

http://npm.im/npm-proxy
http://npm.im/npm-registry-proxy
http://npm.im/npmd

@ everybody

Thank you for the feedback in this discussion. It's really helped to clarify the needs here. I'll make sure that your use cases keep working, and be sure to give you a heads up if you have to change anything.


janl commented Dec 27, 2013

@izs

What's the difference between a "local installation" and a "regular setup"? By "regular setup", do you mean the public registry?

Yes, let’s use “remote” and “local” to mean the public registry and any local/offline copy respectively.

I thought about it some more, and the fact that dist.tarball is a URL makes the code path the same in every case (old npm, npm+manta, or any offline copy), EXCEPT for the logic that transparently follows up a Manta 404 with a GET to the R1 document/attachment, because we can assume that exists (that logic is there to avoid race conditions with the moving (copy & delete) of attachments to Manta). So ignore my earlier comment about different code paths.

The only difference would be that the fallback would not work with an offline copy of R2 (in the original proposal), because there is no attachment that should return 404 (outside of other errors), whereas in my proposal the fallback has no place to fall back to that is also offline. But it could arguably fall back to online, since at that point the registry would be reachable anyway.

I’m not sure if I am making things more or less complicated now.


janl commented Dec 27, 2013

R2 is still required for existing replicators to be minimally impacted. And, in fact, if R2 can be found at the current CouchDB location, then so much the better.

Fair enough :)


konobi commented Dec 27, 2013

How about a plain HTTP/FTP folder, with tarballs having redirects and the json of the couchdb dumped into a 'json' folder? Then it's just a plain sync and script to read in the json (or another non-couch server to serve up the info).


isaacs commented Dec 27, 2013

@janl

The only difference would be that the fallback would not work with an offline copy of R2....
I’m not sure if I am making things more or less complicated now.

Yes, I <3 you, but now you are definitely making things more complicated :)

That fallback is written in VCL, and exists only on the Fastly CDN config. From the outside, if you're not looking at the http headers, you wouldn't know that npm wasn't fetching from Manta, or through a CDN. npm-the-client just fetches $registry/$pkg, and then follows the dist.tarball url in the doc, and expects it to be a tgz file matching the dist.shasum in the doc. It is completely server-architecture agnostic, other than that. (As proof: this is how it already works, and has for days now, and you're totally ok with it :)
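
For reference, the whole client-side contract fits in a few lines (a sketch using the request module, not npm's actual code):

var crypto = require('crypto')
var request = require('request')

var registry = 'https://registry.npmjs.org'
var pkg = 'request'                          // any package name works

request.get({ url: registry + '/' + pkg, json: true }, function (er, res, doc) {
  if (er) throw er
  var dist = doc.versions[doc['dist-tags'].latest].dist
  var sha = crypto.createHash('sha1')        // dist.shasum is a SHA-1 of the tarball
  request.get(dist.tarball)
    .on('data', function (chunk) { sha.update(chunk) })
    .on('end', function () {
      console.log(sha.digest('hex') === dist.shasum ? 'shasum matches' : 'shasum MISMATCH')
    })
})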

The bottom line of this inquiry seems to be that I can do exactly what I want, and it's just going to cost me having a second DB setup so that replication users can keep doing their thing. Also, either the current isaacs.iriscouch.com/registry couch can be the canonical R2, or it can replicate from the canonical one, and no one will even notice the change.


isaacs commented Dec 28, 2013

@konobi

How about a plain HTTP/FTP folder, with tarballs having redirects and the json of the couchdb dumped into a 'json' folder? Then it's just a plain sync and script to read in the json (or another non-couch server to serve up the info).

That's roughly what's sitting in Manta under /isaacs/public/npm. You can even mount this with NFS, I'm told, though I haven't tried myself, and a folder with >52,000 entries might be rough on some systems :)

The format is:

/isaacs/public/npm
+-- $package_name
|   +-- doc.json - The same as the couchdb document
|   `-- _attachments
|       +-- $package_name-$version.tgz - Tarball attachments
|       `-- more tarballs...
`-- more packages...


till commented Dec 30, 2013

@isaacs Along with what @konobi said: how feasible are static files that I can use as a local registry? I'd like to minimize my operational overhead. On my end, something like nginx that serves all the files from a directory. For upstream, people could wget on a regular basis, which would make running mirrors a lot, lot easier as well.

Also, I've never heard of Manta. What exactly is it? Joyent's version of S3? And is there anything to see and try yet?


isaacs commented Jan 2, 2014

@till

Manta is http://www.joyent.com/products/manta. There's an SDK and docs. Because the files are in my public storage location, you can read them, link them to your space, run jobs over them, etc.

This much is already done. As of this moment, tarballs are being served through Fastly from Manta, not from CouchDB. They still are in CouchDB as attachments, but unless you're hitting the CouchDB endpoint directly (or if Manta doesn't have the file yet), you will be getting them from Manta.

It's probably not trivial to use static files as a local registry, at least, not one that you could write to. I haven't explored that.
