Proposal: PyTerrier Artifacts API

Problem: There is currently no common pattern for working with artifacts (indexes, models, etc.).

This proposal aims to make artifacts "first-class citizens" in pyterrier.

Beyond API consistency across packages, there are advantages to having a specific object "own" an artifact. E.g., it can ensure that the object is only opened/loaded once, even if it is shared across a variety of transformers.

Proposal

pt.Artifact - An abstract base class that defines functionality common to all artifacts, namely code for loading/downloading.

As a convention, artifacts should generally function as transformer factories. I.e., you can use an artifact to construct transformers that use that artifact.

# loads an artifact & returns its proper type (here, a TerrierIndex)
index = pt.Artifact.load('/path/to/msmarco_passage.terrier')

# downloads, caches, and returns a PisaIndex object
index = pt.Artifact.from_url('https://data.terrier.org/msmarco_passage/pisa_porter2.tar.lz4')

# it's a factory:
index.bm25() # creates a PisaRetriever for this index (existing functionality) 

# dense indexes too, of course!
index = pt.Artifact.from_url('https://data.terrier.org/msmarco_passage/tasb.flex.tar.lz4')

# if you know the type, you can specify it directly:
index = PisaIndex.from_url('https://data.terrier.org/msmarco_passage/pisa_porter2.tar.lz4')

# shortcuts for well-known locations, like huggingface:
index = pt.Artifact.from_url('hf:pyterrier/msmarco-passage.pisa')
# (translates to URL like https://huggingface.co/datasets/pyterrier/msmarco-passage.pisa/main/artifact.tar.lz4)

# a specific branch/hash/etc
index = pt.Artifact.from_url('hf:pyterrier/msmarco-passage.pisa@v1.5')
# (translates to URL like https://huggingface.co/datasets/pyterrier/msmarco-passage.pisa/v1.5/artifact.tar.lz4)

# from_dataset support
index = pt.Artifact.from_dataset('msmarco_passage', 'terrier_stemmed')
# (perhaps this gets rewritten to a .from_url() call?)
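
To make the hf: shortcut concrete, here is a minimal sketch of the kind of URL rewriting from_url() could perform (the helper name is purely illustrative):

def _expand_hf_url(url: str) -> str:
    # 'hf:org/name' or 'hf:org/name@ref' -> concrete download URL
    repo, _, ref = url[len('hf:'):].partition('@')
    ref = ref or 'main'  # default branch when no @ref is given
    return f'https://huggingface.co/datasets/{repo}/{ref}/artifact.tar.lz4'

_expand_hf_url('hf:pyterrier/msmarco-passage.pisa@v1.5')
# -> 'https://huggingface.co/datasets/pyterrier/msmarco-passage.pisa/v1.5/artifact.tar.lz4'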

Extra utilities (e.g., build_package(), artifact.upload_to_huggingface_hub(), etc.) may also be provided.

Technical details:

Figuring out artifact type:

An abstract method Artifact._load(path) lets an artifact type try to load the artifact at a specific path. In most cases, there are key files that indicate a particular artifact type. For instance, data.properties suggests a TerrierIndex. If it sees that file, it can try to load it. If it doesn't, it can return None (or raise an exception?) to indicate that the path isn't supported.
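
For example, a concrete artifact type could implement it along these lines (a sketch; it assumes _load() is a classmethod, and the details are illustrative):

import os
import pyterrier as pt

class TerrierIndex(pt.Artifact):
    @classmethod
    def _load(cls, path):
        # data.properties is the key file that signals a Terrier index
        if os.path.exists(os.path.join(path, 'data.properties')):
            return cls(path)
        return None  # not a Terrier index; let other artifact types try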

This is a reasonable strategy, but it involves iterating over all artifact types, so we'll only use it as a fallback. We can include a new metadata.json (or similar) file that gives information about the artifact's type (and potentially other information, like what package you'd have to install to use it?). Example:

{"type": "lexical_index", "format": "terrier"}

This convention is already used by some pyterrier extensions, but the naming isn't yet consistent across them.

To identify which artifact type to use, packages can define "entry points", which let them expose what artifact types they contain and their corresponding type/format. Example:

setuptools.setup(
    ...
    entry_points = {
        'pyterrier.artifact': [
            'lexical_index.pisa = pyterrier_pisa:PisaIndex',
        ],
    },
)

In this way, Artifact can read the metadata file, check the entry points, and directly load the correct package and type, all without importing any unnecessary packages.
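
A rough sketch of that resolution flow (names are assumed; uses the Python 3.10+ importlib.metadata API):

import json
import os
from importlib.metadata import entry_points

def _resolve_artifact_class(path):
    with open(os.path.join(path, 'metadata.json')) as f:
        metadata = json.load(f)
    key = f"{metadata['type']}.{metadata['format']}"  # e.g., 'lexical_index.pisa'
    for ep in entry_points(group='pyterrier.artifact'):
        if ep.name == key:
            return ep.load()  # imports only the package that provides this type
    return None  # fall back to probing each type's _load(), as described above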

Artifact package format

The existing from_dataset() method builds a well-known URL, fetches a list of files from it, and downloads each file individually.

I think it makes more sense to package the artifact up as a single tar file (e.g., artifact.tar.lz4), which should make them easier to share. Compression can reduce the time to download. Since compression provides limited value on certain data types (e.g., dense indexes), a compression scheme that's super fast to decompress (e.g., lz4) should be preferred. It also allows checkpointing, so systems could (theoretically) request only some of the files from the archive.

An optional metadata file (e.g., artifact.tar.lz4.json) could provide some high-level information, like the decompressed size, a hash, file contents, etc.
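
A minimal sketch of what the build_package() utility mentioned above might look like, assuming the lz4 Python package (the metadata fields are just examples):

import hashlib
import json
import os
import tarfile
import lz4.frame

def build_package(artifact_dir, out_path='artifact.tar.lz4'):
    # stream the artifact directory into a tar archive, compressed with lz4
    with lz4.frame.open(out_path, 'wb') as f:
        with tarfile.open(fileobj=f, mode='w|') as tar:
            tar.add(artifact_dir, arcname='.')
    # optional sidecar with high-level information about the archive
    sha256 = hashlib.sha256()
    with open(out_path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            sha256.update(chunk)
    meta = {'compressed_size': os.path.getsize(out_path), 'sha256': sha256.hexdigest()}
    with open(out_path + '.json', 'wt') as f:
        json.dump(meta, f)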

Another reason to use a tar archive format instead of separate files is that some providers limit the total size of a single file, which means that large files would need to be split up. A single tar archive can be split across multiple individual files and streamed back together. Of course, this could be done on a per-file basis too, but it adds additional complexity and means that a repository wouldn't necessarily be directly usable without further re-stitching of files anyway.
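
For instance, the split parts could be re-assembled and extracted in one pass (a sketch; the part naming scheme is assumed, and a real implementation would stream rather than buffer everything in memory):

import io
import shutil
import tarfile
import lz4.frame

def extract_split(part_paths, dest_dir):
    buf = io.BytesIO()
    for part in part_paths:  # in order, e.g., artifact.tar.lz4.0, .1, ...
        with open(part, 'rb') as f:
            shutil.copyfileobj(f, buf)
    buf.seek(0)
    with lz4.frame.open(buf, 'rb') as lz, tarfile.open(fileobj=lz, mode='r|') as tar:
        tar.extractall(dest_dir)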

Alternatives considered

The existing from_dataset() -- relies on a single authority (data.terrier.org) to provide the prebuilt indexes, and is risky given the current state of the servers it runs on.

git lfs -- annoying software dependencies. Doesn't provide good compression of individual files when uploading/downloading. Benefits include versioning, branches, etc., but those can be addressed with the above proposal too. See also the note above about maximum file sizes.
