Proposal: PyTerrier Artifacts API

Problem: There is currently no common pattern for working with artifacts (indexes, models, etc.).

This proposal aims to make artifacts "first-class citizens" in pyterrier.

Beyond API consistency across packages, there are advantages to having a specific object "own" an artifact. E.g., it can ensure that the object is only opened/loaded once, even if it is shared across a variety of transformers.

Proposal

pt.Artifact - An abstract base class that defines functionality common to all artifacts, namely code for loading/downloading.

As a convention, artifacts should generally function as transformer factories. I.e., you can use an artifact to construct transformers that use that artifact.

# loads an artifact & returns its proper type (here, a TerrierIndex)
index = pt.Artifact.load('/path/to/msmarco_passage.terrier')

# downloads, caches, and returns a PisaIndex object
index = pt.Artifact.from_url('https://data.terrier.org/msmarco_passage/pisa_porter2.tar.lz4')

# it's a factory:
index.bm25() # creates a PisaRetriever for this index (existing functionality) 

# dense indexes too, of course!
index = pt.Artifact.from_url('https://data.terrier.org/msmarco_passage/tasb.flex.tar.lz4')

# if you know the type, you can specify it directly:
index = PisaIndex.from_url('https://data.terrier.org/msmarco_passage/pisa_porter2.tar.lz4')

# shortcuts for well-known locations, like huggingface:
index = pt.Artifact.from_url('hf:pyterrier/msmarco-passage.pisa')
# (translates to URL like https://huggingface.co/datasets/pyterrier/msmarco-passage.pisa/main/artifact.tar.lz4)

# a specific branch/hash/etc
index = pt.Artifact.from_url('hf:pyterrier/msmarco-passage.pisa@v1.5')
# (translates to URL like https://huggingface.co/datasets/pyterrier/msmarco-passage.pisa/v1.5/artifact.tar.lz4)

# from_dataset support
index = pt.Artifact.from_dataset('msmarco_passage', 'terrier_stemmed')
# (perhaps this gets rewritten to a .from_url() call?)
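
To make the hf: shortcut concrete, here is a minimal sketch of the kind of URL rewriting from_url() could perform (the helper name is purely illustrative):

def _expand_hf_url(url: str) -> str:
    # 'hf:org/name' or 'hf:org/name@ref' -> concrete download URL
    repo, _, ref = url[len('hf:'):].partition('@')
    ref = ref or 'main'  # default branch when no @ref is given
    return f'https://huggingface.co/datasets/{repo}/{ref}/artifact.tar.lz4'

_expand_hf_url('hf:pyterrier/msmarco-passage.pisa@v1.5')
# -> 'https://huggingface.co/datasets/pyterrier/msmarco-passage.pisa/v1.5/artifact.tar.lz4'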

Extra utilities (e.g., build_package(), artifact.upload_to_huggingface_hub(), etc.) may also be provided.

Technical details:

Figuring out artifact type:

An abstract method Artifact._load(path) lets an artifact type try to load the artifact at a specific path. In most cases, there are key files that indicate a particular artifact type. For instance, data.properties suggests a TerrierIndex. If it sees that file, it can try to load it. If it doesn't, it can return None (or raise an exception?) to indicate that the path isn't supported.
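
For example, a concrete artifact type could implement it along these lines (a sketch; it assumes _load() is a classmethod, and the details are illustrative):

import os
import pyterrier as pt

class TerrierIndex(pt.Artifact):
    @classmethod
    def _load(cls, path):
        # data.properties is the key file that signals a Terrier index
        if os.path.exists(os.path.join(path, 'data.properties')):
            return cls(path)
        return None  # not a Terrier index; let other artifact types try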

This is a reasonable strategy, but it involves iterating over all artifact types, so we'll only use it as a fallback. We can include a new metadata.json (or similar) file that gives information about the artifact's type (and potentially other information, like what package you'd have to install to use it?). Example:

{"type": "lexical_index", "format": "terrier"}

This convention is already used by some pyterrier extensions, but the naming isn't yet consistent across them.

To identify which artifact type to use, packages can define "entry points", which let them expose what artifact types they contain and their corresponding type/format. Example:

setuptools.setup(
    ...
    entry_points = {
        'pyterrier.artifact': [
            'lexical_index.pisa = pyterrier_pisa:PisaIndex',
        ],
    },
)

In this way, Artifact can read the metadata file, check the entry points, and directly load the correct package and type, all without importing any unnecessary packages.
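
A rough sketch of that resolution flow (names are assumed; uses the Python 3.10+ importlib.metadata API):

import json
import os
from importlib.metadata import entry_points

def _resolve_artifact_class(path):
    with open(os.path.join(path, 'metadata.json')) as f:
        metadata = json.load(f)
    key = f"{metadata['type']}.{metadata['format']}"  # e.g., 'lexical_index.pisa'
    for ep in entry_points(group='pyterrier.artifact'):
        if ep.name == key:
            return ep.load()  # imports only the package that provides this type
    return None  # fall back to probing each type's _load(), as described above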

Artifact package format

The existing from_dataset() method builds a well-known URL, fetches a list of files from it, and downloads each file individually.

I think it makes more sense to package the artifact up as a single tar file (e.g., artifact.tar.lz4), which should make them easier to share. Compression can reduce the time to download. Since compression provides limited value on certain data types (e.g., dense indexes), a compression scheme that's super fast to decompress (e.g., lz4) should be preferred. It also allows checkpointing, so systems could (theoretically) request only some of the files from the archive.

An optional metadata file (e.g., artifact.tar.lz4.json) could provide some high-level information, like the decompressed size, a hash, file contents, etc.
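
A minimal sketch of what the build_package() utility mentioned above might look like, assuming the lz4 Python package (the metadata fields are just examples):

import hashlib
import json
import os
import tarfile
import lz4.frame

def build_package(artifact_dir, out_path='artifact.tar.lz4'):
    # stream the artifact directory into a tar archive, compressed with lz4
    with lz4.frame.open(out_path, 'wb') as f:
        with tarfile.open(fileobj=f, mode='w|') as tar:
            tar.add(artifact_dir, arcname='.')
    # optional sidecar with high-level information about the archive
    sha256 = hashlib.sha256()
    with open(out_path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            sha256.update(chunk)
    meta = {'compressed_size': os.path.getsize(out_path), 'sha256': sha256.hexdigest()}
    with open(out_path + '.json', 'wt') as f:
        json.dump(meta, f)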

Another reason to use a tar archive format instead of separate files is that some providers limit the total size of a single file, which means that large files would need to be split up. A single tar archive can be split across multiple individual files and streamed back together. Of course, this could be done on a per-file basis too, but it adds additional complexity and means that a repository wouldn't necessarily be directly usable without further re-stitching of files anyway.
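
For instance, the split parts could be re-assembled and extracted in one pass (a sketch; the part naming scheme is assumed, and a real implementation would stream rather than buffer everything in memory):

import io
import shutil
import tarfile
import lz4.frame

def extract_split(part_paths, dest_dir):
    buf = io.BytesIO()
    for part in part_paths:  # in order, e.g., artifact.tar.lz4.0, .1, ...
        with open(part, 'rb') as f:
            shutil.copyfileobj(f, buf)
    buf.seek(0)
    with lz4.frame.open(buf, 'rb') as lz, tarfile.open(fileobj=lz, mode='r|') as tar:
        tar.extractall(dest_dir)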

Alternatives considered

The existing from_dataset() -- relies on a single authority (data.terrier.org) to provide the prebuilt indexes, and is risky given the current state of the servers it runs on.

git lfs -- annoying software dependencies. Doesn't provide good compression of individual files when uploading/downloading. Benefits include versioning, branches, etc., but those can be addressed with the above proposal too. See also the note above about maximum file sizes.
