@nagydani
Created October 15, 2015 14:12
Integration of ipfs and Ethereum
In this document, I outline the tasks required for storing and
presenting the Ethereum block chain and Web-based Ethereum Đapps in
ipfs. Currently, ipfs is very good at locating and delivering content
using a global, consistent address space, and it has a very well-designed
and well-implemented http gateway. However, Ethereum's use cases require
additional capabilities that ipfs currently does not provide.
Redundancy and persistency
In both important use cases, we need to make sure content is available
under the condition that nodes can come and go. Ipfs, by itself, does
not provide any mechanism to ensure this, though there is a weak
incentive for replication built into their "bitswap" protocol, which
seems not to be implemented completely at this point, with important parts
of the design still not finalized.
Long-term persistency of meaningful pieces of information can be
incentivized by content availability insurance that is largely
independent of the underlying distributed storage solution. The most important
development in this regard is the Swarm Contract at
https://github.com/ethersphere/go-ethereum/blob/bzz-config/bzz/bzzcontract/swarm.sol
However, it is also worth noting that the entire infrastructure for
redundant and secure storage developed for Swarm can be used in the framework
of ipfs thanks to its pluggable hash function. If the Swarm hash is added
as an application-specific hash function to ipfs and Swarm nodes advertise
their content in the ipfs DHT, Swarm can serve as a replication infrastructure
for ipfs.
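
As a rough sketch of the idea (hypothetical Go, not the actual go-ipfs or go-multihash API), an application-specific hash such as the Swarm hash could be plugged into a registry of content-addressing functions:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// HashFn computes a content address over a blob of data.
type HashFn func(data []byte) []byte

// HashRegistry is a hypothetical table of application-specific hash
// functions keyed by codec name, mimicking the idea of a pluggable
// (multihash-style) hash function in ipfs.
type HashRegistry map[string]HashFn

// Register adds a hash function under the given codec name.
func (r HashRegistry) Register(name string, fn HashFn) { r[name] = fn }

// Sum addresses data with the named hash function.
func (r HashRegistry) Sum(name string, data []byte) ([]byte, error) {
	fn, ok := r[name]
	if !ok {
		return nil, fmt.Errorf("unknown hash codec: %s", name)
	}
	return fn(data), nil
}

func main() {
	reg := HashRegistry{}
	reg.Register("sha2-256", func(d []byte) []byte {
		h := sha256.Sum256(d)
		return h[:]
	})
	// A Swarm node would register its own hash here so that
	// swarm-addressed content becomes resolvable through ipfs.
	// (Placeholder only: the real Swarm hash is a chunker-based
	// Merkle hash, not a plain sha256.)
	reg.Register("swarm-hash", func(d []byte) []byte {
		h := sha256.Sum256(d) // stand-in for the real Swarm hash
		return h[:]
	})

	addr, _ := reg.Sum("swarm-hash", []byte("hello, ipfs"))
	fmt.Printf("swarm-hash address: %x\n", addr)
}
```
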
Fair allocation of bandwidth resources
Bitswap defines an API for bandwidth accounting that can be easily extended
to include micropayment transfers to balance otherwise unbalanced bandwidth
use between peers.
The vast majority of these micropayment transactions must happen off the
block chain; otherwise, the use of the block chain itself becomes a significant
transaction cost. Such a micropayment mechanism has been developed for Swarm
and can be used as a plug-in for Bitswap as well as for a multitude of other
purposes not even related to storage. The relevant contract code and Go API
are available at
https://github.com/ethersphere/go-ethereum/tree/bzz-config/common/chequebook
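
To illustrate the intended plug-in point, the sketch below shows how per-peer bandwidth accounting could trigger an off-chain cheque once the traffic imbalance exceeds a threshold. All names here (Ledger, ChequeIssuer, the threshold) are hypothetical and do not reflect the actual Bitswap strategy interface or the chequebook Go API linked above:

```go
package main

import "fmt"

// ChequeIssuer abstracts an off-chain micropayment mechanism such as
// the Swarm chequebook: cheques are handed to the peer and only
// occasionally cashed on the block chain.
type ChequeIssuer interface {
	Issue(beneficiary string, amount uint64) error
}

// Ledger tracks bytes exchanged with a single peer, in the spirit of
// Bitswap's bandwidth accounting.
type Ledger struct {
	Peer      string
	BytesSent uint64
	BytesRecv uint64
}

// Debt is how many more bytes we received than we sent.
func (l *Ledger) Debt() uint64 {
	if l.BytesRecv > l.BytesSent {
		return l.BytesRecv - l.BytesSent
	}
	return 0
}

// Settle issues a cheque whenever the imbalance exceeds the threshold,
// keeping almost all payments off the block chain.
func Settle(l *Ledger, issuer ChequeIssuer, threshold, pricePerByte uint64) error {
	if debt := l.Debt(); debt >= threshold {
		if err := issuer.Issue(l.Peer, debt*pricePerByte); err != nil {
			return err
		}
		l.BytesSent += debt // treat the payment as settling the imbalance
	}
	return nil
}

// printIssuer is a stand-in implementation used for the example run.
type printIssuer struct{}

func (printIssuer) Issue(beneficiary string, amount uint64) error {
	fmt.Printf("cheque for %d wei issued to %s\n", amount, beneficiary)
	return nil
}

func main() {
	l := &Ledger{Peer: "peer-A", BytesSent: 1 << 20, BytesRecv: 5 << 20}
	_ = Settle(l, printIssuer{}, 1<<22, 1) // settle once 4 MiB in debt
}
```

The point of the threshold is that peers settle rarely and cheques are cashed on the block chain even more rarely, so almost all of the accounting stays off-chain.
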
Names and URIs
One design principle of Swarm was to allow for arbitrary names and URIs to
resolve to both static and dynamic content served up by Swarm infrastructure.
Unfortunately, this has not been a design goal for ipfs and in its current form
it does not fulfill it. In particular, static directories with a large number
of entries are handled very inefficiently by ipfs and there is no obvious
way around this limitation.
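
To quantify the concern: in a flat directory object a single change requires re-serializing and re-hashing the whole listing, whereas a trie with a given fanout only touches the ceil(log_fanout(n)) nodes on the path to the root. The rough sketch below (plain Go, no ipfs code) compares the two under that assumption:

```go
package main

import (
	"fmt"
	"math"
)

// hashWorkPerUpdate estimates the hashing work needed when a single
// entry changes, for n entries total.
//
// Flat directory object: the whole listing is one node, so the hash
// is recomputed over all n references.
// Trie with the given fanout: only the nodes on the path from the
// changed leaf to the root change, i.e. ceil(log_fanout(n)) small nodes.
func hashWorkPerUpdate(n, fanout int) (flatRefs, trieNodes int) {
	flatRefs = n
	trieNodes = int(math.Ceil(math.Log(float64(n)) / math.Log(float64(fanout))))
	return
}

func main() {
	for _, n := range []int{1000, 100000, 10000000} {
		flat, trie := hashWorkPerUpdate(n, 16)
		fmt.Printf("n=%8d  flat: rehash over %8d refs   trie (fanout 16): %2d node hashes\n",
			n, flat, trie)
	}
}
```
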
In practice, this makes it very difficult to migrate content like
Wikipedia to our distributed storage, even though it would be one of
the obvious candidates for a high-profile application of such an
infrastructure. Similarly problematic would be implementing commonly
used HTTP APIs for mapping content, such as OpenStreetMap tiles, on top
of ipfs, which would be another obvious candidate.
I believe that for the success of Web3, it is instrumental to retain as
much compatibility with popular and useful Web 2.0 standards and
services as possible. The URI resolution scheme used by ipfs constitutes
a very severe limitation hampering such efforts.
Decentralization
The design of ipfs provides a common abstraction for both centralized
and decentralized storage solutions so that content can be retrieved
from both using the same software; the consumer of the content does not
even need to be aware of the underlying storage architecture and ipfs
does not specify one. The content can come from a workstation with a
temporary address, an individual small server, a large datacenter or a
sophisticated content delivery network. As long as the content conforms
to the ipfs format and is advertised in the ipfs DHT, the consumer will be able
to download it all the same.
Moreover, ipfs solves one of the main problems of the (http(s)-based)
web driving its rapid centralization, which is that the costs of content
distribution borne by the publisher increase with the content's
popularity. Since ipfs content is delivered bittorrent-style, all consumers
automatically contribute their upstream bandwidth towards distribution, at
least for the time of downloading, thus contributing their fair share.
However, as history with Bitcoin shows, enabling decentralization does
not prevent centralization. Economies of scale might result in a
centralization of storage infrastructure; the real question then becomes
to what extent large players can abuse their position.
Censorship resistance
In some ways, ipfs is explicitly censorship-enabling; nodes can decide
what content to store and not to store and they can credibly comply with
take-down notices. At the same time, ipfs also helps keep content
available for all users as long as there are nodes that are willing to
serve it, although it must be noted that it also helps find all
such nodes. This might be a workable compromise.
For this, however, to remain the case, it is important that the DHT
remains decentralized. Unfortunately, at present there are no incentives
built into ipfs for running DHT nodes. DHT nodes cannot be excluded for
not responding to queries, because ipfs DHT attaches very little value
to connections. Consumers are not punished for freeloading (only
querying other DHT nodes, but never responding to queries), while a
cartel providing most of the storage service might decide not to keep
outsider addresses in their Kademlia table and yet provide a pleasant
user experience to freeloading consumers. Over time, this might develop
into a problem.
jbenet commented Oct 18, 2015

@nagydani

This is a terrible characterization of ipfs. This post does not understand what IPFS is actually doing, why certain design constraints exist, or how to use it to best support your use case. You may want to check your stuff again. If you would like, I am happy to speak with you again. You should consider asking questions to our community, instead of making incorrect assumptions.

Redundancy and persistency

In both important use cases, we need to make sure content is available
under the condition that nodes can come and go. Ipfs, by itself, does
not provide any mechanism to ensure this, though there is a weak
incentive for replication built into their "bitswap" protocol, which
seems not to be implemented completely at this point, with important parts
of the design still not finalized.

Do you understand why nodes cannot be required to store things and why the incentivization has to be separate? You cite it here like a design deficiency without addressing that:

a) the IPFS content model strictly establishes that it should be possible for nodes to ONLY retrieve and store content they EXPLICITLY request. this is a REQUIREMENT for a transport protocol to have any chance to be adopted and used in regular companies at all. No sane company would ever run a protocol that may download illegal bits to their machines, no matter how many delusional layers of (im)plausible deniability you want to throw at them.

b) you do not point out the decomposition we have, which is that protocols like Filecoin layer above to ensure persistence, or that it is absolutely trivial to produce this behavior on top.

c) you may want to track ipfs/notes#58

Long-term persistency of meaningful pieces of information can be
incentivized by content availability insurance that is largely
independent of the underlying distributed storage solution. The most important
development in this regard is the Swarm Contract at
https://github.com/ethersphere/go-ethereum/blob/bzz-config/bzz/bzzcontract/swarm.sol

However, it is also worth noting that the entire infrastructure for
redundant and secure storage developed for Swarm can be used in the framework
of ipfs thanks to its pluggable hash function. If the Swarm hash is added
as an application-specific hash function to ipfs and Swarm nodes advertise
their content in the ipfs DHT, Swarm can serve as a replication infrastructure
for ipfs.

The pluggable hash function, while nice, is not really the reason here, at all. The reason you can do this easily is that you can

(a) plug in your own strategies into Bitswap, which may require payment of some kind.
(b) use IPFS programmatically to request and pin things as you need them.
(c) mount other protocols on the IPFS p2p system

Fair allocation of bandwidth resources

Bitswap defines an API for bandwidth accounting that can be easily extended
to include micropayment transfers to balance otherwise unbalanced bandwidth
use between peers.

This is a design constraint for bitswap: to allow strategies involving currencies to be plugged in.

Names and URIs

One design principle of Swarm was to allow for arbitrary names and URIs to
resolve to both static and dynamic content served up by Swarm infrastructure.
Unfortunately, this has not been a design goal for ipfs and in its current form
it does not fulfill it.

What are you talking about? Are you even aware of how naming works? We have both content addressed and key addressed URIs.

In particular, static directories with a large number
of entries are handled very inefficiently by ipfs and there is no obvious
way around this limitation.

unixfs sharded directories are both designed, and have been implemented. They have not been pushed to master as we've yet to confirm this is the right direction. Regardless, this should actually not be a problem at all for you-- you should be creating your own raw IPFS objects-- not using unixfs.

The fact that you suggest using unixfs directories for storing either ethereum content or the ethereum trie demonstrates you have not understood the IPFS data model, and you should probably investigate it further, read more around, or if you want fast answers, just stop by IRC and ask other people to explain it to you.

In practice, this makes it very difficult to migrate content like
Wikipedia to our distributed storage, even though it would be one of
the obvious candidates for a high-profile application of such an
infrastructure. Similarly problematic would be implementing commonly
used HTTP APIs for mapping content, such as OpenStreetMap tiles, on top
of ipfs, which would be another obvious candidate.

Not at all. Surprise: we've migrated Wikipedia content just fine. And OpenStreetMap tiles too, and it works very, very fast.

You may want to look further before you make strong claims like this. You might be incorrect.

I believe that for the success of Web3, it is instrumental to retain as
much compatibility with popular and useful Web 2.0 standards and
services as possible. The URI resolution scheme used by ipfs constitutes
a very severe limitation hampering such efforts.

What limitations are you even talking about? The entire construction of IPFS path resolution over merkle trees is precisely to interface with the web in a sane way. We go even further, bridging the gaps between the web AND unix, making all of IPFS content accessible in clean URIs to users of the web AND standard unix paths for regular filesystem uses.

This line in particular:

The URI resolution scheme used by ipfs constitutes
a very severe limitation hampering such efforts.

shows you have no idea what you're talking about. not only is what we're doing highly compatible, it has been praised by people working on both Firefox AND Chrome as the sanest way to do content addressing AND key addressing on the web they've seen yet.

Decentralization

The design of ipfs provides a common abstraction for both centralized
and decentralized storage solutions so that content can be retrieved
from both using the same software; the consumer of the content does not
even need to be aware of the underlying storage architecture and ipfs
does not specify one. The content can come from a workstation with a
temporary address, an individual small server, a large datacenter or a
sophisticated content delivery network. As long as the content conforms
to the ipfs format and is advertised in the ipfs DHT, the consumer will be able
to download it all the same.

Moreover, ipfs solves one of the main problems of the (http(s)-based)
web driving its rapid centralization, which is that the costs of content
distribution borne by the publisher increase with the content's
popularity. Since ipfs content is delivered bittorrent-style, all consumers
automatically contribute their upstream bandwidth towards distribution, at
least for the time of downloading, thus contributing their fair share.

However, as history with Bitcoin shows, enabling decentralization does
not prevent centralization. Economies of scale might result in a
centralization of storage infrastructure; the real question then becomes
to what extent large players can abuse their position.

This discussion is mostly correct. The one part to be concerned about is the last paragraph, specifically that you are worried about centralization and explicitly want to force all nodes in the network to store equally. Are you somehow suggesting that a server farm and mobile phones should do the same work? Obviously this is not what you want, at all. What you want is for nodes to be able to plug in wherever they are and use their resources as effectively as they can.

In short, grandma's iphone or her laptop should not be expected to do work equal to a powerful server in the backbone. Instead, what you want, is to create a market and allow anyone to plug in. Networks ARE and WILL BE heterogeneous in capacities and roles, not homogeneous.

Censorship resistance

In some ways, ipfs is explicitly censorship-enabling; nodes can decide
what content to store and not to store and they can credibly comply with
take-down notices. At the same time, ipfs also helps keep content
available for all users as long as there are nodes that are willing to
serve it, although it must be noted that it also helps find all
such nodes. This might be a workable compromise.

This IS very much the only way that a transport will be adopted by law abiding citizens and corporations. If you don't see this, you might as well try running things on top of freenet.

What you also do not mention here is that nodes can trivially join from a tor or I2P transport (and there is ongoing work right now to integrate this) to hide their positions in the network. IPFS is designed to layer over tor and i2p just fine, and THAT is the right way to achieve routing privacy.

Be advised that if you expect oblivious routing, or oblivious content storage, and you do not write something provably secure (or ideally just use any of the existing systems), and yet you advertise it as such, you WILL put people in jail.

Although please remember, even tor, i2p, and freenet are not as safe as you might think.

For this, however, to remain the case, it is important that the DHT
remains decentralized. Unfortunately, at present there are no incentives
built into ipfs for running DHT nodes. DHT nodes cannot be excluded for
not responding to queries, because ipfs DHT attaches very little value
to connections. Consumers are not punished for freeloading (only
querying other DHT nodes, but never responding to queries), while a
cartel providing most of the storage service might decide not to keep
outsider addresses in their Kademlia table and yet provide a pleasant
user experience to freeloading consumers. Over time, this might develop
into a problem.

You might think so, but in practice Kademlia DHTs as they are work and scale up just fine. Mainline DHT has no incentive structures and has scaled to 15-30M nodes daily (15M churn) with no problems.

That said, do note that the IPFS DHT as it is today is a simple first step. You might have learned by reading up or asking around that we have plans to:

  • Upgrade towards Sybil-proof DHTs like Whanau
  • Produce an incentivized DHT protocol on top of IPFS (like Filecoin) that explicitly tolerates leeches

This is because in the real world we have millions of mobile devices which:

  • have terrible resources (bandwidth, latency, storage, uptime, etc)
  • CANNOT be expected to be full dht nodes serving queries, only leeching
  • CANNOT be expected to pay for these queries out of pocket (we're talking about simple web browsing for mobile phones which include the poorest regions in the world)

and you have server farms (both in the backbone and in the last mile) which:

  • have great resources (cheap storage, bandwidth, etc)
  • can be organized through an incentivized protocol

In practice, the day that DHT serving becomes a problem for us, we'll address this. We have not yet had this problem at all. if you observe it, please let us know, as we've been waiting to spring on this.

Clarifications regarding directories:

Swarm does not care at all about the structure of the URI; it builds a Merkle-Trie from URIs and addresses content through that. Thus, it can, in theory, branch at most 256 ways at each node, but in practice the range of printable characters encountered in URIs is much smaller.

What you are not understanding is that you can take that merkle trie directly as is, and put it on IPFS, using the paths and path components as walking down the trie.

Also, 256 links is small. Above you claimed large directories; large would be in the thousands.

And again, this is still up to you. You can obviously pick a lower fanout by encoding your trie key in something that yields, say, {64, 32, 16} link fanouts.
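
Concretely, the alphabet of the key encoding bounds the per-node branching, so re-encoding the key picks the fanout; a toy illustration (plain Go, not ipfs or Swarm code):

```go
package main

import (
	"encoding/base32"
	"encoding/base64"
	"encoding/hex"
	"fmt"
)

func main() {
	key := []byte("index.html")

	// A trie keyed on one encoded character per level can branch at
	// most as many ways as the encoding's alphabet has symbols.
	fmt.Println("raw bytes -> fanout <= 256, key:", string(key))
	fmt.Println("base64    -> fanout <= 64,  key:", base64.StdEncoding.EncodeToString(key))
	fmt.Println("base32    -> fanout <= 32,  key:", base32.StdEncoding.EncodeToString(key))
	fmt.Println("hex       -> fanout <= 16,  key:", hex.EncodeToString(key))
}
```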

There is no code-level concept of directory in Swarm at all;

There isn't in raw IPFS either. unixfs is on top. raw IPFS has merkle links. the ethereum blockchain and merkle-trie does too. by definition.

it is merely mapping URIs to content and the number of URIs under one Swarm hash is practically unlimited.

What you're saying then is that your swarm hash is NOT a merkle hash.

Btw, we do that just fine with IPNS, look into it.

Also, swarm matches for the longest prefix of the requested URI, leaving open the possibility for the web-app to interpret the rest in javascript.

This is equivalent to using the fragment part of a URL, and we use it that way already. Look into how people are using IPFS to make webapps which use the fragment to load other content (like video players, etc).

Swarm has an efficient implementation of changing one content object (i.e. file) in a large structure and returning the resulting root hash, as well as a convenient HTTP-based API for it (based on HTTP PUT and DELETE methods).

We also have an HTTP API that supports all these operations. And a trivially nice way to bubble up updates through a dag. Again, look deeper. or just ask!

In contrast, IPFS does have a concept of a directory, splitting URIs at slash characters.

This is a path traversal to allow people access to ANY node or any subgraph. It is extremely useful.

Directory objects are flat lists of object references and if they grow large, they have to be completely re-hashed if anything changes (i.e. one file gets added, deleted or changed).

... this is how merkle dags work. you cannot call your thing a merkle trie and NOT rehash all the nodes up to the root ...

or you must mean something else -- what do you mean?

Furthermore, when requesting a single object from such a large directory, the entire directory needs to be retrieved and hashed to verify integrity and to search for the matching URI.

Again, your "large" problem is solved by adjusting the fanout to suit your use case.

This is, by the way, the way that {git, bittorrent, ZFS, fossil/venti, tahoe LAFS} work and it works just fine.

And IPFS is fast -- this, the directory rehashing on changes -- is not a bottleneck at all. If it is, you're doing something wrong, like not coalescing updates. You can already do this trivially programmatically. And look into ipfs files available in dev0.4.0, landing in master in a couple weeks, for a commandline/HTTP API interface to it.
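
As a toy illustration of coalescing (plain Go with sha256 as a stand-in for the directory hash, not ipfs code): applying a batch of changes and rehashing once costs one full rehash instead of one per change.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// root hashes a flat list of leaf hashes into one digest, standing in
// for recomputing a directory node's hash.
func root(leaves [][32]byte) [32]byte {
	h := sha256.New()
	for _, l := range leaves {
		h.Write(l[:])
	}
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}

func main() {
	// 10,000 entries, 100 of which are about to change.
	leaves := make([][32]byte, 10000)
	for i := range leaves {
		leaves[i] = sha256.Sum256([]byte(fmt.Sprint("entry", i)))
	}
	const updates = 100

	// Naive: rehash the whole object after every single change.
	naiveRehashes := 0
	for i := 0; i < updates; i++ {
		leaves[i] = sha256.Sum256([]byte(fmt.Sprint("changed", i)))
		_ = root(leaves)
		naiveRehashes++
	}

	// Coalesced: apply all the changes, then rehash once.
	for i := 0; i < updates; i++ {
		leaves[i] = sha256.Sum256([]byte(fmt.Sprint("changed again", i)))
	}
	_ = root(leaves)

	fmt.Printf("naive: %d full rehashes; coalesced: 1 full rehash\n", naiveRehashes)
}
```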

nagydani (Author) commented

@jbenet

Thank you for your extensive and very informative response. I believe that there are some misunderstandings between us and I would like to iron them out as quickly as we can.

This is a terrible characterization of ipfs. This post does not understand what IPFS is actually doing, why certain design constraints exist, or how to use it to best support your use case. You may want to check your stuff again. If you would like, i am happy to speak with you again. You should consider asking questions to our community, instead of making incorrect assumptions.

I would be very happy to speak with you again, and I am wondering whether the #ipfs IRC channel, which I used to ask questions about IPFS, is not the best forum to get in touch with the IPFS community.

Do you understand why nodes cannot be required to store things and why the incentivization has to be separate? You cite it here like a design deficiency without addressing that:

I do not cite it as a deficiency, merely as an architectural feature that needs to be taken into account.

a) the IPFS content model strictly establishes that it should be possible for nodes to ONLY retrieve and store content they EXPLICITLY request. this is a REQUIREMENT for a transport protocol to have any chance to be adopted and used in regular companies at all. No sane company would ever run a protocol that may download illegal bits to their machines, no matter how many delusional layers of (im)plausible deniability you want to throw at them.

I understand that.

b) you do not point out the decomposition we have, which is that protocols like Filecoin layer above to ensure persistence, or that it is absolutely trivial to produce this behavior on top.

Perhaps I should have been more explicit about it, but I do understand it and I believe that I have even mentioned it. Anyway, thanks for making it even more clear here.

c) you may want to track ipfs/notes#58

Thank you! Indeed, a virtualized IPFS node may solve some issues that we have.

Long-term persistency of meaningful pieces of information ...

The pluggable hash function, while nice, is not really the reason here, at all. the reason you can do this easily is that you can

Of course. The pluggable hash function is important only insofar as content is addressed by its hash value.

(a) plug in your own strategies into Bitswap, which may require payment of some kind.

Correct. We will have to do that very soon. Would gladly receive any pointers to documentation or relevant interfaces.

(b) use IPFS programmatically to request and pin things as you need them.

That might also be an option.

(c) mount other protocols on the IPFS p2p system

That is actually a somewhat contentious issue as Ethereum has its own p2p system. But that is not a show-stopper either; we might use both.

Fair allocation of bandwidth resources

Bitswap defines an API for bandwidth accounting that can be easily extended
to include micropayment transfers to balance otherwise unbalanced bandwidth
use between peers.

This is a design constraint for bitswap: to allow strategies involving currencies to be plugged in.

Right. Is my characterization of IPFS correct here? Mind you, there is no implied criticism at all; we are ready to use that API.

Names and URIs ...

What are you talking about? Are you even aware of how naming works? We have both content addressed and key addressed URIs.

I am talking about content addressed URIs where a root hash is followed by a path.

In particular, static directories with a large number
of entries are handled very inefficiently by ipfs and there is no obvious
way around this limitation.

unixfs sharded directories are both designed, and have been implemented. They have not been pushed to master as we've yet to confirm this is the right direction. Regardless, this should actually not be a problem at all for you-- you should be creating your own raw IPFS objects-- not using unixfs.

Sorry, I have looked at the code in the master branch and asked people on IRC. Let's discuss this. Is there a way to take control of parsing the URI following the root hash of content addressed URIs?

The fact that you suggest using unixfs directories for storing either ethereum content or the ethereum trie demonstrates you have not understood the IPFS data model, and you should probably investigate it further, read more around, or if you want fast answers, just stop by IRC and ask other people to explain it to you.

This is what I did. I believe that I do understand it, but will be happy to learn more. For blockchain data, we just need the pluggable hash function. However, for web-based dapps, something resembling a filesystem would be essential.

In practice, this makes it very difficult to migrate content like
Wikipedia to our distributed storage, even though it would be one of
the obvious candidates for a high-profile application of such an
infrastructure. Similarly problematic would be implementing commonly
used HTTP APIs for mapping content, such as OpenStreetMap tiles, on top
of ipfs, which would be another obvious candidate.

Not at all. Surprise: we've migrated wikipedia content just fine. And OpenStreetmapTiles too, and it works very, very fast.

Congratulations! Could you post a link here? Are updates fast, too? I am truly curious about this.

You may want to look further before you make strong claims like this. You might be incorrect.

I believe that for the success of Web3, it is instrumental to retain as
much compatibility with popular and useful Web 2.0 standards and
services as possible. The URI resolution scheme used by ipfs constitutes
a very severe limitation hampering such efforts.

What limitations are you even talking about? The entire construction of IPFS path resolution over merkle trees is precisely to interface with the web in a sane way. We go even further, bridging the gaps between the web AND unix, making all of IPFS content accessible in clean URIs to users of the web AND standard unix paths for regular filesystem uses.

That's great. I was told over IRC that the IPFS path resolver over merkle trees uses directories as nodes and that is what I have seen in the code as well.

In particular, as I understand it, if you have a directory with 3 objects named, say, AA, AB and BB, it will be one node inside the merkle tree with a three-way branching, rather than a two-way branching separating BB from the other two beginning with A, followed by another two-way branching for AA and AB (containing only A and B, of course).
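
To make the example concrete, here is a toy per-byte trie (illustrative Go only, not Swarm's actual implementation) that exhibits the branching described above:

```go
package main

import "fmt"

// node is a toy per-byte trie node: children are keyed by the next
// byte of the name, so branching happens only where names diverge.
type node struct {
	children map[byte]*node
	value    string // set on leaf nodes
}

func newNode() *node { return &node{children: map[byte]*node{}} }

// insert walks (and creates) one child per name byte.
func (n *node) insert(name, value string) {
	cur := n
	for i := 0; i < len(name); i++ {
		c := name[i]
		if cur.children[c] == nil {
			cur.children[c] = newNode()
		}
		cur = cur.children[c]
	}
	cur.value = value
}

// dump prints each stored name with its value.
func (n *node) dump(prefix string) {
	if n.value != "" {
		fmt.Printf("%s -> %s\n", prefix, n.value)
	}
	for c, child := range n.children {
		child.dump(prefix + string(rune(c)))
	}
}

func main() {
	t := newNode()
	t.insert("AA", "object-1")
	t.insert("AB", "object-2")
	t.insert("BB", "object-3")
	// The root branches two ways (A, B) and the A subtree branches two
	// ways again (A, B), instead of one flat three-entry directory node.
	// (A path-compressed trie would additionally collapse the B-B chain.)
	t.dump("")
}
```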

This line in particular:

The URI resolution scheme used by ipfs constitutes
a very severe limitation hampering such efforts.

shows you have no idea what you're talking about. not only is what we're doing highly compatible, it has been praised by people working on both Firefox AND Chrome as the sanest way to do content addressing AND key addressing on the web they've seen yet.

Wonderful! Since this is my primary concern and I am the least sure about how IPFS actually does this, let us discuss it separately.

Decentralization ...

I think we're on the same page here.

Censorship resistance ...

This IS very much the only way that a transport will be adopted by law abiding citizens and corporations. If you don't see this, you might as well try running things on top of freenet.

I do see this, and my calling your approach a "workable compromise" is my endorsement of it.

What you also do not mention here is that nodes can trivially join from a tor or I2P transport (and there is ongoing work right now to integrate this) to hide their positions in the network. IPFS is designed to layer over tor and i2p just fine, and THAT is the right way to achieve routing privacy.

Correct.

Be advised that if you expect oblivious routing, or oblivious content storage, and you do not write something provably secure (or ideally just use any of the existing systems), and yet you advertise it as such, you WILL put people in jail.

I have quite a bit of experience with such issues, both legal and technical. What you write here is mostly correct, except that you accuse me of criticizing your approach (I do not) or of being intent on endangering people out of ignorance and arrogance (I am not).

Although please remember, even tor, i2p, and freenet are not as safe as you might think.

I do not think you know what I think.

For this, however, to remain the case, it is important that the DHT
remains decentralized. Unfortunately, at present there are no incentives
built into ipfs for running DHT nodes. DHT nodes cannot be excluded for
not responding to queries, because ipfs DHT attaches very little value
to connections. Consumers are not punished for freeloading (only
querying other DHT nodes, but never responding to queries), while a
cartel providing most of the storage service might decide not to keep
outsider addresses in their Kademlia table and yet provide a pleasant
user experience to freeloading consumers. Over time, this might develop
into a problem.

You might think so, but in practice Kademlia DHTs as they are work and scale up just fine. Mainline DHT has no incentive structures and has scaled to 15-30M nodes daily (15M churn) with no problems.

I was merely pointing out a potential problem, using highly conditional language ("over time", "might", etc.). Sure, I do not expect it to become a real problem anytime soon and you will have plenty of time to think about it and eventually do something about it. No urgency here, but I decided to make this concern explicit. Thank you for sharing your roadmap for a solution!

Clarifications regarding directories: ...

What you are not understanding is that you can take that merkle trie directly as is, and put it on IPFS, using the paths and path components as walking down the trie.

Indeed, I might not be understanding something here. I will need to learn more about this.

Also, 256 links is small. Above you claimed large directories; large would be in the thousands.

I think you also misunderstood what I have written. 256 is the theoretical maximum for the degree (fanout) of our merkle tree nodes. Directories in Swarm can contain tens of millions of entries without any problems.

And again, this is still up to you. You can obviously pick a lower fanout by encoding your trie key in something that yields, say, {64, 32, 16} link fanouts.

Great.

Also, swarm matches for the longest prefix of the requested URI, leaving open the possibility for the web-app to interpret the rest in javascript.

This is equivalent to using the fragment part of a URL, and we use it that way already. Look into how people are using IPFS to make webapps which use the fragment to load other content (like video players, etc).

I have seen it and the two are not exactly equivalent. The difference in browser behavior (between changing only the fragment part vs. other parts of the URL) is subtle, but it is there.

Swarm has an efficient implementation of changing one content object (i.e. file) in a large structure and returning the resulting root hash, as well as a convenient HTTP-based API for it (based on HTTP PUT and DELETE methods).

We also have an HTTP API that supports all these operations. And a trivially nice way to bubble up updates through a dag. Again, look deeper. or just ask!

I have asked on IRC and apparently got the wrong answer. But again, I will be happy to look deeper and ask again.

In contrast, IPFS does have a concept of a directory, splitting URIs at slash characters.

This is a path traversal to allow people access to ANY node or any subgraph. It is extremely useful.

And how would allowing splits at any character, not just slashes, make things worse? Actually, I believe that you could make a fully backwards-compatible change here that would greatly improve things. When I understand your codebase better, I will even be willing to submit a PR.

Directory objects are flat lists of object references and if they grow large, they have to be completely re-hashed if anything changes (i.e. one file gets added, deleted or changed).

... this is how merkle dags work. you cannot call your thing a merkle trie and NOT rehash all the nodes up to the root ...

or you must mean something else -- what do you mean?

What I mean is a merkle dag over arbitrary portions of the URI, not necessarily directories. For an example, consider the AA, AB, BB case above.

Furthermore, when requesting a single object from such a large directory, the entire directory needs to be retrieved and hashed to verify integrity and to search for the matching URI.

Again, your "large" problem is solved by adjusting the fanout to suit your use case.

Can fanout be changed without affecting anything else? If so, that would indeed solve my "large" problem.

And IPFS is fast -- this, the directory rehashing on changes -- is not a bottleneck at all. If it is, you're doing something wrong, like not coalescing updates. You can already do this trivially programmatically. And look into ipfs files available in dev0.4.0, landing in master in a couple weeks, for a commandline/HTTP API interface to it.

I am very eager to see that.
