@liamzebedee
Last active August 31, 2024 02:45
layout: post
title: Comparing IPFS and BitTorrent.
date: 2023-08-10
tags: research ipfs bittorrent p2p p2p-networks dht

In this post we're going to compare two of the largest P2P file-sharing networks: IPFS and ye olde BitTorrent. This research piece is based on my hands-on experience with IPFS, namely integrating it into Dappnet.

To begin with, since a lot of people don't actually know how they work, I'm going to cover a little bit about their technical designs, and then we'll go into their practical differences.

How does BitTorrent work?

BitTorrent was invented in 2001 by Bram Cohen, and you can find the original source code archive here.

BitTorrent lets you download files faster by fetching from multiple nodes serving the file at once. The way this works is that you divide the file into equal-sized pieces, and then you publish a torrent, which lists the hash of each individual piece. When you start a torrent client, it discovers peers and exchanges information about which pieces of the file each of them has. When the client finds peers which have pieces it needs, it requests those pieces and then verifies their integrity using the hashes inside the torrent file.
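The piece-and-hash scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not a torrent client: the 256 KiB piece length and the function names are assumptions chosen for the example, and a real torrent file carries more metadata. BitTorrent v1 does use SHA-1 for piece hashes.

```python
import hashlib

PIECE_LENGTH = 256 * 1024  # assumed piece size; each torrent picks its own


def split_pieces(data: bytes, piece_length: int = PIECE_LENGTH) -> list[bytes]:
    """Cut the payload into equal-sized pieces (the last one may be shorter)."""
    return [data[i:i + piece_length] for i in range(0, len(data), piece_length)]


def piece_hashes(data: bytes, piece_length: int = PIECE_LENGTH) -> list[bytes]:
    """Publisher side: one SHA-1 digest per piece, as listed in the torrent."""
    return [hashlib.sha1(p).digest() for p in split_pieces(data, piece_length)]


def verify_piece(index: int, piece: bytes, hashes: list[bytes]) -> bool:
    """Downloader side: check a received piece against the published hash."""
    return hashlib.sha1(piece).digest() == hashes[index]


# 1 MiB of sample data in which every piece is different.
payload = b"".join(i.to_bytes(4, "big") for i in range(262144))
hashes = piece_hashes(payload)
pieces = split_pieces(payload)
```

Because every piece is verified independently, a client can accept pieces from untrusted peers in any order and discard only the ones that fail the check.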

The BT terminology usually refers to nodes that are sharing a complete file as seeders, and nodes that are still downloading the file as leechers. Seeders and leechers together form the swarm: all the nodes participating in a single torrent. The ratio of your upload to download is called a seed ratio. Uploading generously tends to get you faster download speeds, due to a protocol incentive design called tit-for-tat, in which clients prefer to upload to the peers that upload to them.
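To make tit-for-tat a little more concrete, here is a rough sketch of the kind of selection a client might run periodically: it keeps uploading to the peers that have been sending it data fastest, plus one randomly chosen peer so that newcomers get a chance. The function name, the slot count of 4 and the peer labels are assumptions for illustration; real clients differ in the details.

```python
import random
from typing import Optional


def choose_unchoked(download_rates: dict[str, float], slots: int = 4,
                    rng: Optional[random.Random] = None) -> set[str]:
    """Pick which peers to upload to, given how fast each one uploads to us."""
    rng = rng or random.Random()
    # Rank peers by the rate at which we are receiving data from them.
    ranked = sorted(download_rates, key=download_rates.get, reverse=True)
    regular = ranked[:slots - 1]   # reward the best contributors
    rest = ranked[slots - 1:]
    # One "optimistic" slot goes to a random remaining peer.
    optimistic = [rng.choice(rest)] if rest else []
    return set(regular) | set(optimistic)


rates = {"peer-a": 50.0, "peer-b": 40.0, "peer-c": 30.0, "peer-d": 5.0, "peer-e": 0.0}
unchoked = choose_unchoked(rates, rng=random.Random(0))
```

The three fastest peers are always selected here; the fourth slot rotates among the rest.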

How do we find peers for a torrent? Peer discovery happens via two sources - trackers and the DHT.

  • Trackers are centralized servers which track peers for each torrent. They are very simple - the torrent file can specify a list of tracker URLs, for example udp://tracker.opentrackr.org:1337/announce, and the torrent client will "announce" itself to each tracker. It then requests a list of peers for the torrent and begins sharing data.
  • The DHT is a decentralized alternative to trackers - a DHT is a distributed hash table, meaning it is a distributed data structure which stores a mapping from keys to values. Specifically, the BitTorrent DHT called Mainline stores a torrent -> (peer-list) mapping. It has over 1M nodes, and generally can resolve values in under 1s.
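Mainline is a Kademlia-style DHT, where "closeness" between a key and a node is the XOR of their 160-bit IDs. The sketch below shows only that one idea: given a torrent's key, pick the nodes we know about that are nearest to it. The `node_id` helper, the node names and the routing table are hypothetical, and k = 8 is used as an assumed bucket size; a real lookup is iterative and runs over the network.

```python
import hashlib


def node_id(seed: str) -> int:
    """Hypothetical helper: derive a 160-bit ID from a string, for demo only."""
    return int.from_bytes(hashlib.sha1(seed.encode()).digest(), "big")


def closest_nodes(target: int, known: list[int], k: int = 8) -> list[int]:
    """Return the k known node IDs with the smallest XOR distance to target."""
    return sorted(known, key=lambda n: n ^ target)[:k]


infohash = node_id("example torrent")
routing_table = [node_id(f"node-{i}") for i in range(1000)]
nearest = closest_nodes(infohash, routing_table)
```

A client would ask these nearest nodes for peers, learn about even closer nodes from their replies, and repeat until it reaches the nodes storing the peer list.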

How does IPFS work?

(incomplete)

IPFS was invented in 2014 by Juan Benet; you can read the whitepaper here.

It's a bit different to BitTorrent, in that all files exist in a global namespace that anyone can publish to. It's like one big share drive.

When running a node, you can opt to host anyone's data, or only a subset that you are interested in. The latter is called pinning.

The basic terminology of IPFS:

  • Files are split into fixed-size chunks, like in BitTorrent.
  • Each file is referred to by its hash, called a CID (content identifier).
  • Nodes (called peers) form a global P2P network called the DHT.
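As a rough illustration of content addressing, the sketch below chunks data, names every chunk by its hash, and names the file by the hash of the list of its chunks. It is a simplification under stated assumptions: real CIDs are self-describing encoded identifiers rather than bare hex SHA-256 digests, and `BlockStore`, `content_id` and the chunk size are hypothetical names and values made up for the example.

```python
import hashlib

CHUNK_SIZE = 256 * 1024  # assumed chunk size for this sketch


def content_id(block: bytes) -> str:
    """Stand-in for a CID: the hex SHA-256 digest of a block."""
    return hashlib.sha256(block).hexdigest()


class BlockStore:
    """Hypothetical in-memory store that keeps each unique block once."""

    def __init__(self) -> None:
        self.blocks: dict[str, bytes] = {}

    def add(self, data: bytes, chunk_size: int = CHUNK_SIZE) -> str:
        """Chunk the data, store every chunk by its hash, return a root ID."""
        links = []
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            cid = content_id(chunk)
            self.blocks[cid] = chunk  # identical chunks collapse to one entry
            links.append(cid)
        # The root is a block listing the child IDs, itself addressed by hash.
        root = "\n".join(links).encode()
        root_cid = content_id(root)
        self.blocks[root_cid] = root
        return root_cid


store = BlockStore()
shared, only_a, only_b = b"s" * 4, b"a" * 4, b"b" * 4
root_a = store.add(shared + only_a, chunk_size=4)
root_b = store.add(shared + only_b, chunk_size=4)
```

The two files get different root IDs, but the chunk they have in common is stored only once, which is the property that lets unrelated content reuse the same blocks.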

Rather than .torrent files, IPFS generally operates via the sharing of IPFS CIDs, the equivalent of a magnet link.

Some general observations:

  • a great CLI - ipfs add -r dir/
  • simpler naming schemes: /ipfs/Qm12123 for content, /ipns/1Dk12123 for "mutable content"
  • strong multiplatform support - js-ipfs (browsers), go-ipfs (desktops)
  • fantastic primitives in the form of the libp2p networking stack, which are used by many other projects, such as Ethereum, my old work Keep Network, and new startups like Renegade. libp2p is a seminal contribution to the P2P networking stack.

The differences between BitTorrent and IPFS.

  • Intended use cases:

    • BitTorrent is built for file-sharing.
    • IPFS is built as a global P2P file system. It's inspired by P2P approaches like the Coral CDN.
  • IPFS is quantitatively slower.

    • There is lots of anecdotal evidence to support this, e.g. reports from my users on Dappnet, and the developers of the iroh node describing it as taking "on the order of seconds to resolve DHT queries".
    • BitTorrent peer discovery runs over UDP, whereas IPFS runs over TCP. In theory, BT finds peers much quicker because it avoids the overhead of TCP handshaking and libp2p encryption.
    • Design and Evaluation of IPFS: A Storage Layer for the Decentralized Web

    The content retrieval process across all regions takes 2.90 s, 4.34 s, and 4.74 s in the 50th, 90th, and 95th percentiles

    ~800 ms, ~1.3s, ~1.5s in the 50th, 90th, and 95th percentiles

  • Broadly speaking, IPFS exhibits a federated network architecture, whereas BitTorrent is closer to maximally decentralized.

    • An IPFS node is more resource-intensive in almost every way by default: storage (storing anyone's data), networking (much higher gossip overhead for keeping data live in the DHT), and CPU (anecdotally higher, because of the above).
    • BitTorrent nodes, by comparison, are extremely lightweight.
    • IPFS is being upgraded to rely on centralized operators, like InterPlanetary Network Indexers (IPNI).
    • Both are on the spectrum of decentralization - neither is a client-server architecture. BT is at the far decentralized end, whereas IPFS sits closer to the middle.
  • IPFS's core design is much cleaner than BitTorrent's.

    • A global namespace for files, where independent IPFS content can reuse content from other IPFS folders, is quite easy to use. Compare this with BitTorrent's mutable torrents.
    • Mutable file-sharing is as simple as ipfs add -r dir.
    • The IPFS gateway standard is ubiquitous. IPFS gateways have been deployed around the world, letting you access P2P content in the browser (such as Cloudflare's gateway, via .eth domains using eth.limo, etc.).
  • Censorship and availability.

    • One of the interesting differences is in censorship and availability.
    • In IPFS, you are providing a chunk of your storage to everyone for free.
    • In BT, you are only seeding the content you are interested in. For this reason, there are private communities which host movies and other things (private trackers) which thrive.
    • IPFS's design is more of a public share drive - you can publish images which will be hosted by Cloudflare, for free! And likewise, the IPFS protocol generally leaves it open to operators on how much content they host (pin).
      • In practice, IPFS content gets banned from certain nodes and gateways. e.g. Tornado Cash censored from Cloudflare's IPFS gateway.
      • The general standard which has emerged in this ecosystem is the "bad bits denylist", a curated list of CIDs which have been flagged for various reasons. The terminology here is quite an interesting choice - "We use the term 'bad bits' when discussing topics involving copyright violations, DMCA, GDPR, Code of Conduct, or malware. This is a tactic to facilitate fruitful public discussion of concepts of freedom, censorship, privacy, and safety without slipping into destructive discussion patterns".
      • In practice, there is higher operational expense to running an IPFS gateway - e.g. Publishers Carpet-Bomb IPFS Gateway Operators With DMCA Notices.
  • Implementation-wise:

    • BT is proven for large files - see the 700 GB LLaMA dataset which was recently shared around the world.
    • In practice, IPFS still has glaring pain points:
      • [1] - the IPFS protocol is hugely resource-intensive. Take this example of SciHub, which uses IPFS to distribute academic papers. Each paper is one IPFS CID, so for 1M papers, this involves 1M republications of "we are hosting this content" per 24h. The network gossip is quite intense - the IPFS protocol specifies gossiping the CID to the 20 closest peers, which in practice, due to the uniform distribution of these hashes, means gossiping to every known peer.

        Picture this: 3 million books, at least one CID each (in practice it's often multiple, since the libgen collection uses a chunk size of 256kb). Section 3.1 of the paper talks about content publication - for each CID, a provider record is published on up to 20 different peers. Because the CIDs are derived from a high-quality hash function, they are evenly distributed. So this means that a node with a sufficient number of items ends up connecting to every single node on the network. For 3 million CIDs * 20 publication records, this means sending out 60 million publication records, every 12 hours, i.e. an average of 1388 publication records per second (assuming one CID per file, which is conservative). This is just to announce to the network "hi... just wanted to let you know I still have the same content I did yesterday". And every full replica of libgen is doing this.

      • [2] - some features of IPFS clients are still relatively immature, and not being improved with any priority. For example, according to this GH issue, the mutable file system (MFS) suffers a major performance regression when adding large numbers of files.

        The changes to the MFS are crunching to a hold after a lot of consecutive operations, where single ipfs files cp /ipfs/$CID /path/to/file commands take 1-2 minutes while the IPFS daemon is taking 4-6 cores worth of CPU power.

        I have this same issue. I'm maintaining a package mirror with approximately 400,000 files. ipfs files cp gets progressively slower as files are added to MFS.

    • One team that was building an alternative high-performance IPFS implementation, iroh, has since broken ranks and moved in a new direction for many of these same reasons.
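The republication arithmetic in the libgen quote under [1] is easy to check. The short calculation below just redoes it, using only the figures given in that quote (3 million CIDs, 20 provider records each, a 12-hour interval); the function name is made up for the example.

```python
def announcements_per_second(cids: int, replication: int, interval_hours: float) -> float:
    """Average provider records sent per second to keep `cids` advertised."""
    records_per_interval = cids * replication
    return records_per_interval / (interval_hours * 3600)


# Figures from the quoted passage: 3M CIDs, 20 records each, every 12 hours.
total_records = 3_000_000 * 20
rate = announcements_per_second(3_000_000, 20, 12)
print(f"{total_records:,} records per interval, about {rate:.0f} per second")
```

This gives 60 million records per interval and roughly 1,389 per second on average, in line with the quoted figure of 1388, and it scales linearly with the number of CIDs a node hosts.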