@mait
Last active July 8, 2023 20:27

Thinking about a 'meta' torrent file format.

Let's say I've downloaded a big file using a torrent. Then I add a very small file, like a subtitle, and create a new torrent file.

Now, to a machine, the two torrent files are completely different. Trackers and torrent clients treat them as separate torrents. Of course we don't need to duplicate the original data file to seed both, but the seeders and leechers are split across the two torrents. They have no way of knowing they hold exactly the same file, so clients and trackers cannot connect people sharing identical data. We end up with a split share pool for the same content, which is inefficient: more seeders means more speed.

Let's say the original torrent file is 1.torrent.

[ file1 ]

Now I add a file and make a new torrent file, 2.torrent, that looks like this:

[
    [ file1 ] => This is 1.torrent.
    + file2
] => This is 2.torrent

Another person gets hold of 2.torrent and creates a new torrent file based on it. Now we have 3.torrent:

[ 
    [
        [ file1 ] => This is 1.torrent.
        + file2
    ] => This is 2.torrent
    + file3
] => This is 3.torrent

So if you have 3.torrent, you are in the same share pool as the 1.torrent and 2.torrent people.

What if 4.torrent, 5.torrent, and so on appear in the near future?

We could query a torrent search engine, or the DHT, or PEX:

"Please give me torrent list based on 3.torrent"

If an interesting new torrent appears, we can upgrade 3.torrent -> X.torrent without touching the local files; only the added files are downloaded.

If you know source code management tools like git, this idea is basically 'a git repository in one torrent file':

git init
make 1.torrent
git commit
make 2.torrent
git commit
...
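The versioning workflow above can be sketched with plain dictionaries. The `based_on` key below is purely hypothetical (it is not part of any BitTorrent spec); it stands in for whatever mechanism would embed or reference the parent torrent:

```python
# Hypothetical nested metainfo: each new version embeds its parent
# under a made-up "based_on" key and lists only the files it adds.
def make_version(parent, new_files):
    """Create a new 'torrent' dict that adds files on top of a parent."""
    return {"based_on": parent, "files": list(new_files)}

def all_files(torrent):
    """Walk the chain of parents and collect every file, oldest first."""
    files = []
    if torrent.get("based_on"):
        files.extend(all_files(torrent["based_on"]))
    files.extend(torrent["files"])
    return files

t1 = {"based_on": None, "files": ["file1"]}   # 1.torrent
t2 = make_version(t1, ["file2"])              # 2.torrent
t3 = make_version(t2, ["file3"])              # 3.torrent

print(all_files(t3))  # ['file1', 'file2', 'file3']
```

A client holding t3 would download only `file3` if it already has the files of t2, which is exactly the "upgrade without touching local files" property described above.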

TL;DR

  • A torrent file can contain another torrent file.
  • We can keep the seeder/leecher pool as big as possible: don't split the swarm when the contents are identical.
  • If other torrents are based on a particular torrent, we can discover them.

Those are the key points.

How could this idea become real? Is it possible?

@pips-

pips- commented Dec 17, 2013

What if I want to remove a file? Or change one?

This does not seem to allow modification, only addition.

@the8472

the8472 commented Dec 17, 2013

Possible? Yes. But it has been proposed before and there are several (relevant) edge cases that complicate the design.

Additionally real world requirements don't really consist of incremental file adding but of creating individual torrents and batching them at a later point, which would require a multi-way merge.

Instead of twisting the torrent format itself to support this, an external covered-swarm-discovery facility, orthogonal to the torrent format, might be simpler. But the question would still be whether the gains are worth it. Torrents have a natural lifecycle anyway: the bigger ones live longer, the smaller ones die off, so users already tend to migrate in that direction.

@hansent

hansent commented Dec 17, 2013

Wouldn't it make more sense to keep all the files separate and have the meta files just be bundles of sorts?

e.g.

[
    [ file1 ] => This is 1.torrent.
    [ file2 ] => This is 2.torrent.
] => This is 1and2.torrent

rather than:

[
    [ file1 ] => This is 1.torrent.
    + file2
] => This is 2.torrent

in case I ever want to have, e.g., file3 and file2 bundled together?

@ncanceill

I like it! Here are my two bits.

TL;DR

  • This can be done on top of BitTorrent
  • This would have limitations
  • This would be cool

NB: Torrent files are already called "metainfo files", so this becomes very 'meta'.

Technical stuff

Despite my dislike of legacy, this could easily be added to the BitTorrent protocol. As a reminder, the info dictionary in a torrent file has the fields name, piece length, pieces, and the files list, whose entries are dictionaries each with a path and a length.

Referencing the meta

Any path key could easily reference a Torrent URL instead of the filename, which can be retrieved when the referencing chain ends. URLs offer various ways of specifying the reference, including magnet links. This addresses @the8472 's concern about "multi-way merge", and @hansent 's comment.

A special value of length could work as a flag — eg zero, if that does not break the current implementation in clients. The list type of the path key could also be abused to add more info about the referenced file, like its hash.

Preserving directory structure

The only thing that may be tricky to specify is directory chains. Let's consider two files:

0.torrent {
  name: "zero"
  files: [ {
    length: 1234
    path: [ "subzero", "filezero" ]
  } ]
}

1.torrent {
  name: "one"
  files: [ {
    length: 0
    path: [ "http://ab.com/0.torrent" ]
  }, {
    length: 1234
    path: [ "subone", "fileone" ]
  } ]
}

Then 1.torrent downloads fileone in new directory one/subone/, but what about filezero? To keep consistency, I believe the name of the referenced file should be disregarded — or maybe forced to match the updated file's name.

Pieces hashes

Using a null length for references helps get around the pieces field: do not include the hashes for the reference; they will be retrieved from the referenced file. This way, pieces stays consistent with the local data: twenty bytes per piece, where the piece count is the sum of the length fields divided by the piece length, rounded up.
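The convention proposed above (length zero plus a URL in path marking a reference entry) can be sketched in a few lines. The field names follow the standard metainfo layout, but the reference convention itself is only this proposal, not anything existing clients understand:

```python
import math

PIECE_LEN = 262144  # 256 KiB, a typical metainfo "piece length"

def is_reference(entry):
    """Proposed convention (non-standard): length 0 plus a URL-like
    path mark an entry as a reference to another torrent."""
    return entry["length"] == 0 and entry["path"][0].startswith(("http://", "https://"))

def expected_pieces_bytes(files):
    """Size of the 'pieces' string: 20 bytes (one SHA-1) per piece of
    local data; reference entries contribute nothing."""
    total = sum(f["length"] for f in files if not is_reference(f))
    return 20 * math.ceil(total / PIECE_LEN)

files = [
    {"length": 0, "path": ["http://ab.com/0.torrent"]},  # reference entry
    {"length": 1234, "path": ["subone", "fileone"]},     # ordinary file
]

print(expected_pieces_bytes(files))  # 20: the 1234 bytes fit in one piece
```

An extended client would download the referenced torrent first, then validate only the local entries against pieces, as the text describes.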

Further ideas

I mentioned I like this idea: this is because it got me thinking. Maybe the refinement of git branching would be overkill, but I definitely appreciate what "versioned Torrents" could mean.

I believe the referencing idea could be taken further with some resolving mechanism. This would help avoid loops and double downloads.

The "upgrade" mechanism must be included in clients: when starting 1.torrent, the client needs to know that 0.torrent is already finished, using download history, crawling the download directory, et cetera.

Based on my naïve sketch above, the "search" function will likely become a heavy map-reduce job, because a torrent does not know which torrent files reference it. Maybe there is a way to make back-referencing easier with some clever hashing.

A trailing path could be appended to the referenced URL in order to only designate a specific file from the torrent, which would address @pips- 's concern about removing files — just reference all files but the ones you want removed. However, this would create problems if files are not aligned on the piece size.

@maintheme

Here is a paper about a similar approach to building torrents for continuous data. It might be interesting for you:

Decentralized Hosting And Preservation Of Open Data
http://btcgsa.info/wp-content/uploads/2013/07/Decentralized-Hosting-and-Preservation-of-Open-Data.pdf

@predakanga

It's worth noting that many of your stated goals can be achieved using existing BitTorrent extensions. To wit, BEP0038 provides for finding already-downloaded data so that it only requires a rehash.

That on its own doesn't provide for combined pools of peers, but in combination with BEP0039 you can ensure that, for any regularly updated content (think, for instance, of a torrent containing all the hotfixes for a piece of software), peers on the n-th torrent will automatically connect to the (n+1)-th torrent.

I believe that using these approaches in concert can make for a much improved torrent ecosystem, particularly for periodic content. It also accounts for security through BEP0039's application of BEP0035. As such, I would love to see effort go into implementing these mechanisms in various open source clients, rather than duplicating the work.

That said, there is a certain amount of value in bridging the distinct swarms, and to that end I would suggest a much, much simpler approach.

First, forget about updating the torrents; there's already a standard for that. Second, don't use git or anything anywhere near as complicated; the beauty of the .torrent is that it's a simple, well-defined and extensible structure with no external dependencies. Finally, don't weigh the torrent down by including the full .torrent - your updated .torrent already has the hash of the old files, by necessity.

To that end, I would suggest a structure as follows:

{
    "announce": "http://tracker.microsoft.com:2710",
    "info": {
        "piece length": 262144,
        "pieces": "...",
        "originator": "com.microsoft",
        "collection": "com.microsoft.kb.windows.xp",
        "files": [
            { "path": ["kb1234.exe"], "length": 95840294 },
            { "path": ["kb1235.exe"], "length": 5948573 }
        ]
    },
    "signatures": {
        "com.microsoft": "..."
    },
    "replaces": {
        "00000000000000000000": ["magnet:?...", "http://tracker.microsoft.com/archive/WinXPJuly2013.torrent"]
    }
}

As you can see, this format differs from accepted standards in only one way - the addition of the "replaces" key.

The value of this "replaces" key would be a dictionary of torrents which the current torrent supersedes: the key of the dictionary is the info hash of each old torrent, and its value is a list (optionally empty) of URIs from which to acquire that torrent.

The torrent client would use this key in two ways - first, if it loads a torrent which replaces a torrent already loaded in the client, that would be stopped/deleted in favour of the new torrent. This would help to stop the splitting of resources, unnecessary announces (to each previous swarm), and could significantly reduce load on the tracker. Second, if the torrent client is unable to reach seeds (or is simply searching for more), it knows that it can use data from the replaced torrents, if any seeds are still active on them.

By simply providing URIs to the replaced torrents, instead of embedding the entire torrent, we achieve several purposes - we stop the actual torrent file from growing exponentially (or even linearly) as the chain grows longer, we enable the use of alternate protocols such as magnet links (thereby providing support for any future addressing protocols), and we allow older torrents to be updated to include new trackers, etc. Any key outside of the info hash on the older torrents can be updated and still be used by the current torrent.
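The keys of the proposed "replaces" dictionary are info hashes, and that part is standard BitTorrent: the SHA-1 of the bencoded info dictionary. A minimal sketch of computing such a key follows; the bencoder is reduced to the essentials, and everything outside the info-hash definition (the "replaces" field itself) is only this proposal:

```python
import hashlib

def bencode(obj):
    """Minimal bencoder (integers, bytes/str, lists, dicts), per BEP 3."""
    if isinstance(obj, int):
        return b"i%de" % obj
    if isinstance(obj, str):
        obj = obj.encode()
    if isinstance(obj, bytes):
        return b"%d:%s" % (len(obj), obj)
    if isinstance(obj, list):
        return b"l" + b"".join(bencode(x) for x in obj) + b"e"
    if isinstance(obj, dict):
        # dictionary keys must be sorted as raw byte strings, per the spec
        items = sorted((k.encode() if isinstance(k, str) else k, v)
                       for k, v in obj.items())
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(type(obj))

def info_hash(info):
    """The standard BitTorrent info hash: SHA-1 of the bencoded info dict.
    This is what would key the proposed (non-standard) 'replaces' dict."""
    return hashlib.sha1(bencode(info)).hexdigest()

old_info = {"piece length": 262144, "pieces": "...", "name": "WinXPJuly2013",
            "files": [{"path": ["kb1234.exe"], "length": 95840294}]}
new_torrent = {
    "info": {"piece length": 262144, "pieces": "...",
             "files": [{"path": ["kb1234.exe"], "length": 95840294},
                       {"path": ["kb1235.exe"], "length": 5948573}]},
    "replaces": {info_hash(old_info): ["magnet:?..."]},
}
print(list(new_torrent["replaces"]))  # one 40-char hex SHA-1 of the old info dict
```

Because the key is the info hash rather than a URL, a client can verify any torrent it fetches from the listed URIs before trusting it, which is what makes the "update any key outside the info hash" property safe.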

There are still some flaws with this method, but in the end it comes down to these pros and cons in my view:

Pros:

  • Completely backwards compatible (only one new field added, will be ignored by older clients)
  • Uses existing standards, simplifies the standardisation and adoption of this
  • Avoids unnecessary load on trackers
  • Provides a means to deprecate old torrents
  • Provides for forwards compatibility using proven standard (URI)
  • Puts the onus on the torrent clients, not the users, to migrate swarms to the newer torrent

Cons:

  • Not 100% verifiable - if somewhere in the torrent chain, a data file changes content, but not name or length, the torrent client has no way of knowing this. Recommend supplementary BEP to add "file_hash" field to info->files, or mandatory piece boundary alignment for files
  • Integrity of the old torrents cannot be guaranteed - if a .torrent refers to an older torrent at example.com/old.torrent, a malicious user could hack/acquire example.com, and modify the torrent to point to his own trackers, to harvest IPs. Recommend a signed syntax (i.e. "000000@com.microsoft") or a same-originator policy to protect against this
  • Requires more effort on the part of the torrent client to find seeders: if there aren't many on the most up-to-date swarm, it will have to download each older .torrent and announce to each older swarm in turn. This can be mitigated by using scrapes to test swarm viability.
  • Does not provide for cases where users might legitimately want to obtain part of a collection, but not the whole - forces them to either become a partial seeder on the latest torrent, or deal with the smaller remaining swarm on the original torrent.

This is a problem that I've been considering for some time and am intending to lobby the various open source clients to improve their support for the mentioned BEPs, as I see it as being important to the ecosystem - I'm glad that other people see the problem and want to fix it too!

EDIT: Added another con, updated the first con.

@kmag

kmag commented Dec 17, 2013

"How can a BT client find an earlier version of this torrent?" is probably not the question you're actually trying to solve. "How can a BT client discover more sources for the data represented by this torrent, given that a subset of that data is also present in other torrents?" is probably the problem you're trying to solve.

Torrent file modifications aren't a branchless phenomenon... A may give rise to B, but then one person might modify B to get C while someone else modifies B to get D, and a third person modifies A to get E. All of these could benefit by knowing about each other.

Advertising a cryptographic Merkle tree root (or other cryptographic hash, though Merkle trees have several advantages) for each file in the DHT would allow the downloaders of these files to find seeders or peers from other torrent swarms, if the Merkle tree roots are added to the per-file descriptions in the "files" section of the torrent.

Plain SHA-256, SHA-3-256, or SHA-1 could be used, but the advantage of Merkle trees is that the 64-MB-granularity row of the tree (or any granularity that is a power-of-two number of kilobytes) can be requested from a peer and verified cryptographically without trusting that peer at all. Merkle trees also allow backwards-compatible tweaking of the granularity without changing the torrent file and, more importantly, without changing the Merkle tree root.

Using Merkle trees would also open the door to easily and verifiably advertising on the DHT the availability of sub-ranges of files, say at 64 MB granularity. Disk corruption and network corruption happen, and for large rare files there may well be another copy out there with only a few corrupted bits whose remaining data other downloaders could use. Or a file may have some header metadata changed by one person (maybe correcting the date of an MLK speech or something) while the rest of the data remains identical and identically aligned. If the 64 MB granularity row of the Merkle tree for that file is advertised via the DHT and present in the torrent, then these "large and nearly identical" files can also be used as a source of data.

On a side note, anyone creating a file format should place user-editable metadata at the end of the file whenever possible, and the same goes for creators of metadata-editing tools. That way, metadata differences don't affect the alignment of the data and won't prevent partial cross-sharing of files that differ only in their user-edited metadata. Yes, one can deal with misalignment by using a rolling hash and breaking the file into blocks at bounded-but-irregular lengths, but that is much more complicated than just putting the user-editable metadata at the end of the file.
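The per-file Merkle root described above can be sketched as follows. This is a toy with a 4-byte block size (a real deployment would use something like the 64 MB granularity mentioned), and the tree shape here, duplicating an odd trailing node, is one common convention rather than anything a BEP mandates:

```python
import hashlib

BLOCK = 4  # demo block size in bytes; in practice e.g. 64 MB

def merkle_root(data):
    """Binary Merkle tree over fixed-size blocks, SHA-256 throughout.
    A sketch only: a real design should also domain-separate leaf
    hashes from interior-node hashes."""
    level = [hashlib.sha256(data[i:i + BLOCK]).digest()
             for i in range(0, len(data), BLOCK)] or [hashlib.sha256(b"").digest()]
    while len(level) > 1:
        if len(level) % 2:              # odd count: duplicate the last node
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

a = merkle_root(b"same dataSAME")
b = merkle_root(b"same dataXXXX")
# Files sharing aligned leading blocks share the corresponding subtree
# hashes, so a peer can prove it holds those blocks without being trusted,
# even though the differing tails give the two files different roots.
print(a.hex() != b.hex())  # True
```

Advertising an interior row of such a tree on the DHT is what would let "large and nearly identical" files serve as data sources for each other.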

@predakanga

@kmag That seems like an elegant solution - in the non-DHT sphere you could also have search engines that let you find other sources for a particular file using the Merkle tree.

My instinct was to say that that may be rolled in as a de-facto tracker protocol update (/sources as compared to /announce, /scrape), but it seems that more separation of concerns is appropriate there.

That said, it would be good to have that question solved from the start, and I do think that it should be something that the torrent client should be able to automate, outside of the DHT.

@funklord

I'm very glad to find this discussion, but I've noticed that the discussion has veered off into three distinct directions:

  1. A way for torrents to share swarms in a metadata independent fashion (cryptographic solutions etc.)
    This is a very hard problem, and any solution is likely to be inefficient in some cases.
  2. Torrents that auto-update content.
  3. Patching torrent content by creating a new format.

Obviously all 3 ideas have merit, and should be pursued, but I'm particularly interested in number 3, because there are valid use cases where an existing torrent needs to be changed, and this change needs to be as cheap as possible, also, reasonably easy to add to existing clients.

Say we have an existing torrent with lots of activity, but we want to change a couple of filenames, edit a few small bits of binary data, add some files, and so on.

The previous post by @predakanga seems to be on the right track, so let me get this straight:
We create a new torrent with all the new data but we also add a non-standard extension with another torrent and describe which chunks are available from it.
A new "extended" torrent client can use both swarms for improved performance.
Standard clients will ignore the extension and only see a single torrent, which is still valid, but only has the new, much smaller swarm.
I'm not so sure a URI is a good idea unless it is signed, since in this case the old torrent is still considered "final".
This kind of solution would also be acceptable to private trackers, since it doesn't rely on DHT etc.

You seem to have a much firmer grasp of the exact technical issues, such as how a typical modification affects data alignment (thereby invalidating all subsequent chunks).
Any further insight on this would be greatly appreciated.

And, what would such a feature be called?
Nested torrents?
