
The relationship of IPLD selectors and third-party use of the Pin API

tl;dr: Pins should probably always be derived by application logic for non-human IPFS users. That way, the application layer can solve its higher-level problem using whatever logic and syntax it requires (and it can interact with its end users in whatever words it needs to, without asking them to understand IPFS abstractions such as 'pinning'). Pinning syntax should be a background, lower-level abstraction.
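
A minimal sketch of that idea, assuming a hypothetical `Pinner` interface and `PhotoLibrary` type (neither is a real IPFS API): the application speaks its own language ("save a photo") and derives pins internally, so end users never see the word "pin".

```go
package main

import "fmt"

// Pinner stands in for whatever low-level pin API the node exposes.
type Pinner interface {
	Pin(cid string) error
	Unpin(cid string) error
}

// logPinner is a fake Pinner so the example runs on its own.
type logPinner struct{}

func (logPinner) Pin(cid string) error   { fmt.Println("pin", cid); return nil }
func (logPinner) Unpin(cid string) error { fmt.Println("unpin", cid); return nil }

// PhotoLibrary is the application layer: its users save and delete photos.
type PhotoLibrary struct {
	pins   Pinner
	photos map[string]string // photo name -> root CID
}

// Save records the photo; the pin is derived from application state,
// not requested by the user.
func (p *PhotoLibrary) Save(name, rootCID string) error {
	if err := p.pins.Pin(rootCID); err != nil {
		return err
	}
	p.photos[name] = rootCID
	return nil
}

// Delete forgets the photo; the pin is released as a side effect.
func (p *PhotoLibrary) Delete(name string) error {
	cid, ok := p.photos[name]
	if !ok {
		return fmt.Errorf("no photo named %q", name)
	}
	delete(p.photos, name)
	return p.pins.Unpin(cid)
}

func main() {
	lib := &PhotoLibrary{pins: logPinner{}, photos: map[string]string{}}
	lib.Save("dog.jpg", "QmFoo")
	lib.Delete("dog.jpg")
}
```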

In the short term, we should focus on work in two areas: one, getting pinning abstractions and language right, and two, improving garbage collection performance. For this second item, as traversing a DAG to perform GC is essentially unavoidable, we should intentionally experiment with when and how this happens, and how many other things can happen concurrently, to improve the user experience. Also, we need to address the race condition problem in the API.

In the medium term, we can look forward to UnixFSv2 and selectors landing, which will make our lower-level architecture more effective and eliminate some of our existing user-experience bottlenecks. This work, however, does not block the short-term actions laid out above.


Notes from conversation with @warpfork, early Oct 2019.

Problem space considerations

Traversal efficiency problems

When we pin, we pin recursively, which means we pin everything underneath the target. Here’s the problem -- this is a Big-O-type issue -- if the system pins recursively, then to pin something it has to do a GIANT (complete) traversal of everything beneath it. If you want only a subset of files under a path, that’s not trivial to do right now, especially from the user’s side. And then you run into the separation between how add and pin work. You add, which does this huge traversal to make it happen. Then you stop. Then you pin, which starts all over from the root directory/CID and has to traverse the entire giant thing again. There are opportunities to improve efficiency here.
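
A toy illustration of that double traversal (the DAG and visit counting here are made up; real add and pin are far more involved): "add" walks the whole DAG to store blocks, then "pin" starts again from the root and walks the whole DAG a second time.

```go
package main

import "fmt"

// node is a stand-in for a DAG node identified by a CID-like string.
type node struct {
	cid      string
	children []*node
}

// walk visits every node reachable from n and reports how many it touched.
func walk(n *node, visit func(*node)) int {
	count := 1
	visit(n)
	for _, c := range n.children {
		count += walk(c, visit)
	}
	return count
}

func main() {
	// A small tree standing in for a directory full of files.
	root := &node{cid: "QmRoot", children: []*node{
		{cid: "QmA", children: []*node{{cid: "QmA1"}, {cid: "QmA2"}}},
		{cid: "QmB"},
	}}

	blocks := map[string]bool{}
	pins := map[string]bool{}

	// "add": traverse everything to store blocks.
	added := walk(root, func(n *node) { blocks[n.cid] = true })

	// "pin": traverse everything again, from the root, to mark pins.
	pinned := walk(root, func(n *node) { pins[n.cid] = true })

	fmt.Printf("add visited %d nodes, pin visited %d nodes again\n", added, pinned)
}
```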

Semantic problems

So you’ve got this human-instantiated thing where you want to pin something, and then the machine goes and throws a pin on all the recursively pinned stuff below it. This is confusing. It's even hard to talk about amongst ourselves when we want to troubleshoot our systems -- we just don't have clear words for these things. If I can pin a picture of my dog, which pins a bunch of blocks behind the scenes, both directly and indirectly, when did I do the thing called pinning?
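
A rough model of that vocabulary problem: one user-level "pin" of the dog photo fans out into machine-level pins on everything beneath it. The naming loosely mirrors the recursive/indirect distinction go-ipfs reports, but the DAG and the code are a made-up illustration.

```go
package main

import "fmt"

type node struct {
	cid      string
	children []*node
}

// descendants collects every CID reachable below n.
func descendants(n *node, out map[string]bool) {
	for _, c := range n.children {
		out[c.cid] = true
		descendants(c, out)
	}
}

func main() {
	// "A picture of my dog": one root block linking to a few chunk blocks.
	photo := &node{cid: "QmDogPhoto", children: []*node{
		{cid: "QmChunk1"}, {cid: "QmChunk2"}, {cid: "QmChunk3"},
	}}

	// The human action: "pin my dog photo" -> one recursive pin on the root.
	recursivePins := map[string]bool{photo.cid: true}

	// The machine consequence: every block below is pinned indirectly.
	indirectPins := map[string]bool{}
	descendants(photo, indirectPins)

	fmt.Println("recursive pins:", len(recursivePins)) // what the user asked for
	fmt.Println("indirect pins: ", len(indirectPins))  // the "shadow" of that pin
}
```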

What about de-dupe + pinning + selectors + semantics? We have all these ways you can split directories and files, and the system uses flags to record how the splitting was done. And you can’t de-dupe if the same things were split differently at different points in time. In the future, with selectors, you might have something pinned that includes something that helps fix this performance issue -- but we also need user-defined pins and the machine ‘pinning hashes’ to be clearly different words and/or concepts. This area needs more words and more explanation, because we’re having trouble talking about it -- i.e., the things that are implicitly pinned ‘below’ when I pin a thing. A shadow of a pin.
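
A small sketch of why splitting choices defeat de-dupe (the fixed-size chunker and truncated hashes are simplified stand-ins, not real IPFS chunking or CIDs): the same bytes chunked two different ways produce two disjoint sets of block hashes, so nothing is shared between them.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// chunk splits data into fixed-size pieces and hashes each piece.
func chunk(data []byte, size int) map[string]bool {
	blocks := map[string]bool{}
	for len(data) > 0 {
		n := size
		if n > len(data) {
			n = len(data)
		}
		sum := sha256.Sum256(data[:n])
		blocks[fmt.Sprintf("%x", sum[:8])] = true
		data = data[n:]
	}
	return blocks
}

func main() {
	file := make([]byte, 1000)
	for i := range file {
		file[i] = byte(i)
	}

	a := chunk(file, 256) // added once with one chunk size...
	b := chunk(file, 300) // ...and later with another

	shared := 0
	for h := range a {
		if b[h] {
			shared++
		}
	}
	fmt.Printf("blocks in A: %d, blocks in B: %d, shared: %d\n", len(a), len(b), shared)
}
```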

How and where do we separate the responsibility for what IPFS needs to know for pinning from what an application knows about its own concepts, and how do we make that surface area small and sane?

Race conditions

You can add, GC, and then pin the hash you just created with add, but by the time you pin, the hash is already gone [because you garbage-collected it]. The reason this isn’t a huge problem is that...no one actually runs GC, because it’s too slow. There’s a recursive set of problems here. However, trying to fix the concurrency of add and GC is not where I would start doing anything.
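
A stripped-down illustration of that race (a toy blockstore, not real go-ipfs internals): because the block added in step 1 is not yet pinned, a GC that runs before step 3 removes it, and the later pin has nothing to pin.

```go
package main

import "fmt"

type store struct {
	blocks map[string]bool
	pins   map[string]bool
}

// add stores a block; nothing pins it yet.
func (s *store) add(cid string) { s.blocks[cid] = true }

// gc removes every block that is not pinned.
func (s *store) gc() {
	for cid := range s.blocks {
		if !s.pins[cid] {
			delete(s.blocks, cid)
		}
	}
}

// pin marks a block as pinned, if it still exists.
func (s *store) pin(cid string) error {
	if !s.blocks[cid] {
		return fmt.Errorf("cannot pin %s: block already garbage-collected", cid)
	}
	s.pins[cid] = true
	return nil
}

func main() {
	s := &store{blocks: map[string]bool{}, pins: map[string]bool{}}
	s.add("QmFoo") // step 1: add
	s.gc()         // step 2: GC sneaks in
	if err := s.pin("QmFoo"); err != nil { // step 3: pin fails
		fmt.Println(err)
	}
}
```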

More on selectors

Before selectors

All I can do is ask the network for a CID, say, QmFoo. The network will return blocks to you by traversing the DHT via the Kademlia algorithm, but it doesn't know anything about the data structure that that CID is a part of. It just gets the blocks.
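
A toy picture of that block-by-block model (the "remote" map and DAG are made up): the requester only learns about child CIDs after it has fetched and parsed the parent block, so every level of the DAG costs at least one more request.

```go
package main

import "fmt"

// remote stands in for a peer: CID -> list of child CIDs linked from that block.
var remote = map[string][]string{
	"QmFoo": {"QmX"},
	"QmX":   {"QmY1", "QmY2"},
	"QmY1":  {},
	"QmY2":  {},
}

func main() {
	requests := 0
	queue := []string{"QmFoo"}
	for len(queue) > 0 {
		cid := queue[0]
		queue = queue[1:]

		// One request per block: we cannot ask for children we haven't seen yet.
		children := remote[cid]
		requests++
		fmt.Printf("fetched %s, discovered links: %v\n", cid, children)

		queue = append(queue, children...)
	}
	fmt.Println("requests made:", requests)
}
```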

After selectors

After we have selectors, once you find the peer that has QmFoo via the method above, Graphsync could send them a message that says something to the effect of "plz gimme the selection of data starting at QmFoo, and then everything matching ./x/y/*/{a,b,c}/*" (general idea, not actual syntax of anything).

Now I can identify a DAG subset. This is like regexps but for DAGs. (The audience for this is also the same as the audience for regexps: developers and users of the libraries, or in some cases, power users.) Selectors will allow someone to match patterns in DAGs the way that people can now match patterns in strings with regex. When you have two peers and you’re talking about what you want to transfer, and you have the path via UnixFS, you might be able to ask for a set of things.
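
To make the "regexps for DAGs" idea concrete, here is a toy matcher. The pattern syntax below (literal segment, "*", "{a,b,c}") is an invented stand-in, not real IPLD selector syntax; it just shows what "match a subset of a DAG by shape" means.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a toy DAG node: named links to children, or a leaf value.
type node struct {
	links map[string]*node
	value string
}

// match walks the DAG and collects the values of nodes whose path from the
// root matches every segment of the pattern.
func match(n *node, pattern []string, path string, out *[]string) {
	if len(pattern) == 0 {
		*out = append(*out, path+"="+n.value)
		return
	}
	seg := pattern[0]
	for name, child := range n.links {
		ok := seg == "*" || seg == name
		if strings.HasPrefix(seg, "{") { // e.g. "{a,b,c}"
			for _, alt := range strings.Split(strings.Trim(seg, "{}"), ",") {
				if alt == name {
					ok = true
				}
			}
		}
		if ok {
			match(child, pattern[1:], path+"/"+name, out)
		}
	}
}

func main() {
	leaf := func(v string) *node { return &node{value: v} }
	root := &node{links: map[string]*node{
		"x": {links: map[string]*node{
			"y": {links: map[string]*node{
				"1": {links: map[string]*node{
					"a": {links: map[string]*node{"f": leaf("hit-1a")}},
					"z": {links: map[string]*node{"f": leaf("miss")}},
				}},
				"2": {links: map[string]*node{
					"c": {links: map[string]*node{"g": leaf("hit-2c")}},
				}},
			}},
		}},
	}}

	// Roughly "./x/y/*/{a,b,c}/*" from the example above.
	pattern := []string{"x", "y", "*", "{a,b,c}", "*"}

	var hits []string
	match(root, pattern, "", &hits)
	fmt.Println(hits) // only the matching leaves; order may vary (map iteration)
}
```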

Also, this can provide lots of performance improvements. For example, after sending that request, the peer might stream a bunch of stuff to you, without you having to fully transfer QmFoo, then parse it, then find the links in it, then see that './x' is a link (say, to QmZyzz), then ask that peer (or potentially the DHT again) for QmZyzz, and then wait for them to start sending you QmZyzz...

And these little delays versus streaming can make a big difference, because with the little delays there are periods where you're not using your connection at all, so you can't make full use of your bandwidth.

For technical requirements, P0+P1 stuff in the early design requirements doc is really good: https://github.com/ipld/specs/blob/master/design/history/exploration-reports/2018.10-selectors-design-goals.md#selector-requirements

Graphsync

This might sound a bit like Graphsync. Graphsync uses selectors to describe how to transfer the thing, and then it's responsible for the actual transfer. Graphsync has actually already been implemented on the Filecoin side, but getting it to come back to IPFS will take a while, because UnixFSv1 is in Protobufs, which is a serialization scheme from Google. IPLD schemas are going to be an alternative to Protobufs. Selectors work on IPLD; they don't work on Protobufs.

UnixFSv2 will be IPLD-native -- and hopefully IPLD-schema-native. And the schema work is moving along nicely now!
