jbenet/propose-ipfs-pack.md

## propose-ipfs-pack.md

      
IPFS Tooling for datasets

Background

We need tooling for a set of use cases around archival and dataset management. This tooling should fit how people actually work with large files and large datasets.

Grounding Assumptions

Basic grounding assumptions here:

datasets are "large" (from GB to EB in size)
datasets should not be duplicated in the filesystem (eg into a .ipfs repo)
datasets may have different versions
datasets (at a particular version) are exactly determined (can be hashed)
people prefer to read and manipulate the datasets in a "working directory" style
an HTTP or RPC API is not enough; a POSIX filesystem API is essential
datasets can be represented as a tree of POSIX files and directories
datasets may be moved using non-ipfs tools
it would be useful to easily replicate and back up the content (ipfs, ipfs-cluster)
it would be useful to easily serve the content on the web (ipfs-gateway)
it would be useful (but not necessary) to digitally sign manifests

Why current IPFS tooling is not enough

The current ipfs tooling assumes we can import all data into a .ipfs repository directory. There are ongoing efforts to build filestore to allow referencing content outside of that directory, but this is not yet finalized, and all metadata is still stored in the .ipfs repository, not with the directory in question.
We have often discussed Certified ARchives (.car) as a replacement for tar. This could be a future replacement, along with a reliable way to mount .car files, but it is not yet available either.

Other tooling examples


BagIt - https://tools.ietf.org/html/draft-kunze-bagit-06#section-2.1.3
WARC - https://en.wikipedia.org/wiki/Web_ARChive
BitTorrent's "manifest-like" .torrent file

Proposed Tooling Additions

This document proposes the addition or adjustment of the following tools:

dagger/dagify (or whatever is decided here) - a standalone tool that reads in a file or directory and outputs an (in-order) ipld graph, according to a given format string.
ipfs-pack - a standalone tool that creates an "ipfs pack" (similar to WARCs, BagIt, and .torrent files, but with IPLD and importers magic).
datadex or maybe gx-dataset - a tool to prepare and publish a dataset as an ipfs-pack. It guides the user to add dataset metadata and license info, and publishes to a registry.
car (still only a proposed tool) - a tool that creates certified archives (single-file hash-linked archives, like a hash-linked .tar). It will work closely with ipfs-pack.
The ipfs repo filestore abstractions can leverage ipfs-packs to understand what is being tracked.

dagger/dagify

This tool (name discussion here) reads in a file or directory and outputs an (in-order) ipld graph, according to a given format string.
> dagger -fmt <fmt-string> -r foo/bar/baz
<ipld-object>
<ipld-object>
<ipld-object>
<ipld-object>
<ipld-object>

Where <fmt-string> is a format string that uniquely determines (forever) the whole dag structure, including chunking scheme, index layout, what is tracked in the index, what is left as raw nodes, etc. The idea is that this string (which ideally will be short) uniquely describes a strategy for representing the source content as the output ipld graph, and that it does so repeatably. That is, once a given fmt string produces one output, that output should never change (unless a major bug is found). This is because people must retain the ability to verify their content, and they need some primitive to do so.

dagger/dagify --only-cid --only-root

This tool will have an --only-cid flag that outputs only the cids:
> dagger -fmt <fmt-string> -r foo/bar/baz --only-cid
<ipld-object-cid>
<ipld-object-cid>
<ipld-object-cid>
<ipld-object-cid>
<ipld-object-cid>

And an --only-root flag that returns only the last (root) object or cid.
> dagger -fmt <fmt-string> -r foo/bar/baz --only-root
<last-ipld-object>

> dagger -fmt <fmt-string> -r foo/bar/baz --only-cid --only-root
<last-ipld-cid>
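
To make the in-order output and the two flags concrete, here is a minimal Go sketch. It is not the proposed tool: it assumes a fixed-size chunker, uses hex SHA-256 digests as stand-ins for real CIDs, and the names node, buildDag, onlyCids and onlyRoot are hypothetical. It only illustrates the property the format string is meant to pin down: the same input and the same strategy always yield the same nodes, leaves first and root last.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// node is a stand-in for an IPLD object: an id plus the ids it links to.
type node struct {
	ID    string   // hex SHA-256 digest, used here in place of a real CID
	Links []string // ids of child nodes (empty for leaves)
}

// buildDag splits data into fixed-size chunks and returns the nodes in
// order: every leaf first, then the root that links to them.
// chunkSize must be positive.
func buildDag(data []byte, chunkSize int) []node {
	var nodes []node
	var links []string
	for start := 0; start < len(data); start += chunkSize {
		end := start + chunkSize
		if end > len(data) {
			end = len(data)
		}
		sum := sha256.Sum256(data[start:end])
		id := hex.EncodeToString(sum[:])
		nodes = append(nodes, node{ID: id})
		links = append(links, id)
	}
	// The root id is derived only from the ordered child ids.
	h := sha256.New()
	for _, l := range links {
		h.Write([]byte(l))
	}
	root := node{ID: hex.EncodeToString(h.Sum(nil)), Links: links}
	return append(nodes, root)
}

// onlyCids mirrors the proposed --only-cid flag: ids without object bodies.
func onlyCids(nodes []node) []string {
	ids := make([]string, len(nodes))
	for i, n := range nodes {
		ids[i] = n.ID
	}
	return ids
}

// onlyRoot mirrors the proposed --only-root flag: the last node emitted.
func onlyRoot(nodes []node) node {
	return nodes[len(nodes)-1]
}

func main() {
	nodes := buildDag([]byte("hello ipfs-pack"), 4)
	for _, id := range onlyCids(nodes) {
		fmt.Println(id)
	}
	fmt.Println("root:", onlyRoot(nodes).ID)
}
```

Changing the chunk size changes the root id, which is why the chunking scheme has to be part of the format string.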

ipfs-pack filesystem packing tool

ipfs-pack is a filesystem packing tool that establishes the notion of a bundle, bag, or "pack" of files. We use the term "pack" to avoid confusion with a Bag from BagIt, a very similar format (that ipfs-pack is compatible with). Packs work as follows:

There MUST be a pack root directory that defines the pack (eg at <path-to-pack-root>/). It contains all the pack contents and represents the pack in a filesystem.
There MUST be a pack manifest file that tracks the ipfs hashes of the pack contents (<pack-root>/PackManifest).
There MAY be a pack object database cache file or directory that stores metadata on all the ipld objects in the pack. This is ancillary and can be reconstructed from a pack root at any time.

Subcommands

> ipfs-pack -h
USAGE
    ipfs-pack <subcommand> <arguments>

SUBCOMMANDS
    make     makes the package, overwriting the ipfs-pack manifest file.
    verify   verifies the ipfs-pack manifest file is correct.
    db       creates (or updates) a temporary ipfs object database `.ipfs-pack/db`
    serve    starts an ipfs node serving the pack's contents (to IPFS and/or HTTP).
    bag      create BagIt spec-compliant bag from a pack.
    car      create a `.car` certified archive from a pack.

Usage Example

> pwd
/home/jbenet/myPack

> ls
someJSON.json
someXML.xml
moreData/

> ipfs-pack make
> ipfs-pack make -v
wrote PackManifest

> ls
someJSON.json
someXML.xml
moreData/
PackManifest

> cat PackManifest
QmVP2aaAWFe21QjUujMw5hwYRKD1eGx3yYWEBbMtuxpqXs moreData/0
QmV7eDE2WXuwQnvccsoXSzK5CQGXdFfay1LSadZCwyfbDV moreData/1
QmaMY7h9pmTcA5w9S2dsQT5eGLEQ1CwYQ32HwMTXAev5gQ moreData/2
QmQjYU5PscpCHadDbL1fDvTK4P9eXirSwD8hzJbAyrd5mf moreData/3
QmRErwActoLmffucXq7HPtefBC19MjWUcj1DdBoaAnMm6p moreData/4
QmeWvL929Tdhzw27CS5ZVHD73NQ9TT1xvLvCaXCgi7a9YB moreData/5
QmXbzZeh44jJEUueWjFxEiLcfAfzoaKYEy1fMHygkSD3hm moreData/6
QmYL17nYZrZsAhJut5v7ooD9hmz2rBotC1tqC9ZPxzCfer moreData/7
QmPKkidoUYX12PyCuKzehQuhEJofUJ9PPaX2Gc2iYd4GRs moreData/8
QmQAubXA3Gji5v5oaJhMbvmbGbiuwDf1u9sYsN125mcqrn moreData/9
QmYbYduoHMZAUMB5mjHoJHgJ9WndrdWkTCzuQ6yHkbgqkU someJSON.json
QmeWiZD5cdyiJoS3b7h87Cs9G21uQ1sLmeKrunTae9h5qG someXML.xml
QmVizQ5fUceForgWogbb2m2v5RRrE8xEm8uSkbkyNB4Rdm moreData
QmZ7iEGqahTHdUWGGZMUxYRXPwSM3UjBouneLcCmj9e6q6 .

> ipfs-pack db make
> ipfs-pack db make -v
wrote .ipfs-pack/db

> ls -a
./
../
.ipfs-pack/
someJSON.json
someXML.xml
moreData/
PackManifest

> find .ipfs-pack/
.ipfs-pack/
.ipfs-pack/db
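
The PackManifest shown in the usage example has one "<hash> <path>" pair per line. A minimal Go sketch of reading it follows. The proposal does not define escaping or whitespace rules, so splitting on the first space is an assumption, and entry and parseManifest are hypothetical names.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// entry is one PackManifest line: a content hash and a path relative
// to the pack root.
type entry struct {
	Hash string
	Path string
}

// parseManifest reads "<hash> <path>" lines. It splits on the first
// space only, so paths containing spaces survive (an assumption; the
// proposal does not specify this).
func parseManifest(text string) ([]entry, error) {
	var entries []entry
	sc := bufio.NewScanner(strings.NewReader(text))
	for n := 1; sc.Scan(); n++ {
		line := sc.Text()
		if strings.TrimSpace(line) == "" {
			continue // skip blank lines
		}
		hash, path, ok := strings.Cut(line, " ")
		if !ok || hash == "" || path == "" {
			return nil, fmt.Errorf("line %d: want \"<hash> <path>\", got %q", n, line)
		}
		entries = append(entries, entry{Hash: hash, Path: path})
	}
	return entries, sc.Err()
}

func main() {
	manifest := "QmVizQ5fUceForgWogbb2m2v5RRrE8xEm8uSkbkyNB4Rdm moreData\n" +
		"QmZ7iEGqahTHdUWGGZMUxYRXPwSM3UjBouneLcCmj9e6q6 .\n"
	entries, err := parseManifest(manifest)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// In the usage example the last line ("." path) is the pack root.
	root := entries[len(entries)-1]
	fmt.Println("pack root hash:", root.Hash)
}
```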

ipfs-pack make create (or update) a pack manifest

This command creates (or updates) the pack's manifest file.
ipfs-pack make
# wrote PackManifest

ipfs-pack verify checks whether a pack matches its manifest

This command checks whether a pack matches its PackManifest.
# errors when there is no manifest
> random-files foo
> cd foo
> ipfs-pack verify
error: no PackManifest found

# succeeds when manifest and pack match
> ipfs-pack make
> ipfs-pack verify

# errors when manifest and pack do not match
> echo "QmVizQ5fUceForgWogbb2m2v5RRrE8xEm8uSkbkyNB4Rdm non-existent-file1" >>PackManifest
> echo "QmVizQ5fUceForgWogbb2m2v5RRrE8xEm8uSkbkyNB4Rdm non-existent-file2" >>PackManifest
> touch non-manifest-file3
> ipfs-pack verify
error: in manifest, missing from pack: non-existent-file1
error: in manifest, missing from pack: non-existent-file2
error: in pack, missing from manifest: non-manifest-file3
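
The path comparison behind those messages can be sketched as a two-way set difference. This Go sketch is an illustration only: diffPack and its helpers are hypothetical names, and it checks only which paths are present. A real verify would presumably also re-hash each file and compare against the manifest hash, which is omitted here.

```go
package main

import (
	"fmt"
	"sort"
)

// toSet turns a list of paths into a set.
func toSet(paths []string) map[string]bool {
	set := make(map[string]bool, len(paths))
	for _, p := range paths {
		set[p] = true
	}
	return set
}

// difference returns the sorted paths that are in a but not in b.
func difference(a, b map[string]bool) []string {
	var out []string
	for p := range a {
		if !b[p] {
			out = append(out, p)
		}
	}
	sort.Strings(out)
	return out
}

// diffPack compares the paths listed in the manifest with the paths found
// in the pack directory and returns one message per mismatch.
func diffPack(manifestPaths, packPaths []string) []string {
	m, p := toSet(manifestPaths), toSet(packPaths)
	var problems []string
	for _, path := range difference(m, p) {
		problems = append(problems, "in manifest, missing from pack: "+path)
	}
	for _, path := range difference(p, m) {
		problems = append(problems, "in pack, missing from manifest: "+path)
	}
	return problems
}

func main() {
	manifest := []string{"someJSON.json", "non-existent-file1", "non-existent-file2"}
	pack := []string{"someJSON.json", "non-manifest-file3"}
	for _, msg := range diffPack(manifest, pack) {
		fmt.Println("error:", msg)
	}
}
```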

ipfs-pack db creates (or updates) a temporary ipfs object database

This command creates (or updates) a temporary ipfs object database (eg at .ipfs-pack/db). This object database contains positional metadata for all IPLD objects contained in the pack. (It follows the same metadata concerns as the ipfs repo filestore.) It MAY be a different, simpler object-db format, or be a full-fledged ipfs node repo using filestore.
The db is a simple key-value store that supports:

maps { <ipld-cid> : <filestore-descriptor> }
supports: list() []<ipld-cid> to show all cids in db
supports: put(<ipld-object>) <ipld-cid>
supports: get(<ipld-cid>) <ipld-object>
supports: putDescriptor(<ipld-cid>, <filestore-descriptor>)
supports: getDescriptor(<ipld-cid>) <filestore-descriptor>
supports: delete() to remove itself from disk

Notes:

<filestore-descriptor> is the metadata necessary to reconstruct the entire object from data in the pack.
{get,put} should be able to add or retrieve the objects from db or from the data in the pack.
{get,put}Descriptor should be able to add or retrieve file descriptors for objects stored in the pack.
Intermediate ipld objects (eg intermediate objects in a file, which are not raw data nodes) may need to be stored in the db.

This database basically implements:
type PackObjectDB interface {  
  // Make creates or updates a pack-db at packdbPath, 
  // with data for all the objects in the pack at packPath.
  Make(packPath string, packdbPath string) error

  // Put associates the given filestore.Descriptor with the given ipld.CID.
  // If the descriptor is nil, Put removes the entry for that ipld.CID (rm).
  Put(ipld.CID, filestore.Descriptor) error

  // Get retrieves the filestore.Descriptor associated with the given ipld.CID.
  Get(ipld.CID) (filestore.Descriptor, error)

  // List returns all ipld.CID stored in the database
  List() (<-chan ipld.CID, error)

  // Delete deletes all the database contents and clears all files
  Delete() error
}
It is exposed both through a programmatic interface (some go package) and via cli tooling:
> ipfs-pack-db --help
USAGE
    ipfs-pack-db <subcommand> <arguments>

SUBCOMMANDS
    make     creates (or updates) the pack-db for a pack directory
    list     lists all cids in the pack-db
    put      adds a (cid, filestore-descriptor) entry.
    get      retrieves the filestore-descriptor for a given cid.
    delete   removes all files representing the pack-db (destructive)
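
To show how small the key-value core of the pack-db is, here is an in-memory Go sketch of the Put, Get, List and Delete operations. The ipld.CID and filestore.Descriptor types are not defined in this proposal, so CID and Descriptor below are hypothetical stand-ins (a descriptor is assumed to be a path, offset and size inside the pack). Make is omitted because it depends on the chunking strategy, and List returns a slice instead of a channel for brevity.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// CID and Descriptor are hypothetical stand-ins for ipld.CID and
// filestore.Descriptor.
type CID string

// Descriptor says where an object's bytes live inside the pack.
type Descriptor struct {
	Path   string // file inside the pack, relative to the pack root
	Offset int64  // byte offset within that file
	Size   int64  // number of bytes
}

// ErrNotFound is returned by Get when the cid has no entry.
var ErrNotFound = errors.New("cid not in pack-db")

// memDB keeps the { cid : descriptor } mapping in memory.
type memDB struct {
	entries map[CID]Descriptor
}

func newMemDB() *memDB {
	return &memDB{entries: make(map[CID]Descriptor)}
}

// Put stores d under c. A nil descriptor removes the entry.
func (db *memDB) Put(c CID, d *Descriptor) error {
	if d == nil {
		delete(db.entries, c)
		return nil
	}
	db.entries[c] = *d
	return nil
}

// Get returns the descriptor stored under c, or ErrNotFound.
func (db *memDB) Get(c CID) (Descriptor, error) {
	d, ok := db.entries[c]
	if !ok {
		return Descriptor{}, ErrNotFound
	}
	return d, nil
}

// List returns all cids in sorted order.
func (db *memDB) List() []CID {
	cids := make([]CID, 0, len(db.entries))
	for c := range db.entries {
		cids = append(cids, c)
	}
	sort.Slice(cids, func(i, j int) bool { return cids[i] < cids[j] })
	return cids
}

// Delete drops every entry.
func (db *memDB) Delete() error {
	db.entries = make(map[CID]Descriptor)
	return nil
}

func main() {
	db := newMemDB()
	db.Put("QmExampleLeaf", &Descriptor{Path: "moreData/0", Offset: 0, Size: 1024})
	d, err := db.Get("QmExampleLeaf")
	fmt.Println(d.Path, d.Size, err)
}
```

Because every entry can be rebuilt from the pack contents, losing this database costs only a rescan, which matches the "temporary" framing above.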

ipfs-pack serve starts an ipfs node serving the pack's contents (to IPFS and/or HTTP).

This command starts an ipfs node serving the pack's contents (to IPFS and/or HTTP). It MAY require a full go-ipfs installation. It MAY be a standalone binary (ipfs-pack-serve). It MUST use either an ephemeral node or a one-off node whose id is stored locally, in the pack, at <pack-root>/.ipfs-pack/repo.
> ipfs-pack serve --http
Serving pack at /ip4/0.0.0.0/tcp/1234/http - http://127.0.0.1:1234

> ipfs-pack serve --ipfs
Serving pack at /ip4/0.0.0.0/tcp/1234/ipfs/QmPVUA4rJgckcf1ifrZF5KvwV1Uib5SGjJ7Z5BskEpTaSE

ipfs-pack bag convert to and from BagIt (spec-compliant) bags.

This command converts between packs and BagIt (spec-compliant) bags. BagIt is a commonly used archiving format very similar to ipfs-pack. It works like this:
> ipfs-pack bag --help
USAGE
  ipfs-pack-bag <src-pack> <dst-bag>
  ipfs-pack-bag <src-bag> <dst-pack>

# convert from pack to bag
> ipfs-pack bag path/to/mypack path/to/mybag

# convert from bag to pack
> ipfs-pack bag path/to/mybag path/to/mypack
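
One part of a pack-to-bag conversion is producing the BagIt payload manifest. BagIt keeps payload files under a data/ directory and lists them in manifest-<algorithm>.txt files, one "<checksum> <filepath>" pair per line. The Go sketch below builds such a line for SHA-256. How the proposed subcommand would actually do this is not specified here, and bagManifestLine is a hypothetical name.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"path"
)

// bagManifestLine returns one line for a BagIt manifest-sha256.txt file.
// relPath is the file's path relative to the pack root; BagIt payload
// files live under data/, so that prefix is added.
func bagManifestLine(relPath string, content []byte) string {
	sum := sha256.Sum256(content)
	return hex.EncodeToString(sum[:]) + "  " + path.Join("data", relPath)
}

func main() {
	// The file content here is made up for illustration.
	fmt.Println(bagManifestLine("someJSON.json", []byte("{}")))
}
```

Note that the bag manifest carries plain file checksums while the PackManifest carries ipfs hashes, so a converter has to read each file's bytes rather than translate one hash into the other.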

ipfs-pack car convert to and from a car (certified archive).

This command converts between packs and cars (certified archives). It works like this:
> ipfs-pack car --help
USAGE
  ipfs-pack-car <src-pack> <dst-car>
  ipfs-pack-car <src-car> <dst-pack>

# convert from pack to car
> ipfs-pack car path/to/mypack path/to/mycar.car

# convert from car to pack
> ipfs-pack car path/to/mycar.car path/to/mypack

datadex or maybe gx-dataset

WIP
A tool to prepare and publish a dataset as an ipfs-pack. It guides the user to add dataset metadata and license info, and publishes to a registry.

car - certified archives

WIP
cars would interoperate with packs.

The ipfs repo filestore

WIP
Maybe the ipfs repo filestore abstractions can leverage ipfs-packs to understand what is being tracked in a given directory, particularly if those packs have up-to-date local dbs of all their objects.