@bgruening
Last active January 1, 2016 13:09
RFC: download_by_proxy action type

This is an RFC to get the implementation details right for a new action type in tool_dependencies.xml.

For years we have been trying to solve a crucial sustainability problem: links that do not last!

A little bit of history

At first we tried to mirror tarballs from sources with questionable sustainability, like BioC or random FTP servers. Over time we encountered many more places that we cannot trust: Google Code, SourceForge, etc. We tried to mirror the entire BioC history by tracking down the SVN history and creating a tarball for every revision. It was a Herculean task, yet still limited in scope, because there are so many other things that need to be archived to make Galaxy and all its tools sustainable.

In the end we settled on the simplest solution: provide a community archive where everyone can drop tarballs that they want to be sustainable. The Galaxy Project generously funds the storage, and we have plans to mirror the archive and distribute the workload to universities and other institutes that want to help.

The biggest problem we needed to solve was access to the archive. Who can drop tarballs? How do we control access to prevent abuse of the system?

We went ahead and created the Cargo-Port (https://github.com/galaxyproject/cargo-port). Access is controlled by the community via pull requests. Add your package, we will check the content (hopefully automatically), and the tarball will be mirrored to a storage server.

RFC

So far so good. This RFC is about the usage of Cargo-Port inside Galaxy. I would like to propose a new action type that uses the Cargo-Port directly. It should replace <action type="download_by_url" sha256sum="6387238383883..."> and <action type="download_file"> and offer a more transparent and user-friendly solution. The current approach is quite cumbersome, since we need to generate the checksum manually, provide the correct link and get the same information into Cargo-Port. I would like to streamline this a little and use it as a good opportunity to work on and fix galaxyproject/galaxy#896.

Proposal <action type="download_by_proxy">:

  • Attributes for Id, Version, Platform and Architecture.
  • No URL and no checksum in the action itself.
  • An attribute for the URL to cargo-port/urls.tsv, defaulting to the current GitHub repo and configurable via galaxy.ini.
  • The action will more or less trigger this curl command: $ curl https://raw.githubusercontent.com/galaxyproject/cargo-port/master/gsl.py | python - --package_id augustus_3_1
  • This gives us the freedom to change the API, columns, ... in Cargo-Port without updating Galaxy core. The only API that needs to stay stable is gsl.
  • gsl will try to download from the original URL specified in Cargo-Port. If that does not work, it will download our archived copy.
  • Open question: should the action change the current working directory, i.e. automatically uncompress and change cwd like download_by_url does?
  • We will need an attribute to skip uncompressing. A few tools need the tarball left packed.
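
The lookup and fallback order described in the list above could look roughly like the following Python sketch. It is not the actual gsl.py. The column names in SAMPLE_TSV, the ARCHIVE_BASE location and all function names are hypothetical and only serve the illustration; the real urls.tsv layout may differ. The fetch function is passed in so the ordering logic can be tried without network access.

```python
import csv
import io

# Hypothetical column layout; the real urls.tsv in Cargo-Port may differ.
SAMPLE_TSV = (
    "id\tversion\tplatform\tarch\tupstream_url\tsha256sum\n"
    "augustus\t3.1\tlinux\tx64\t"
    "https://upstream.example.org/augustus-3.1.tar.gz\tabc123\n"
)

# Hypothetical archive location, used only for illustration.
ARCHIVE_BASE = "https://archive.example.org/"


def find_package(tsv_text, package_id, version, platform, arch):
    """Return the first row matching all four keys, or None if absent."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        key = (row["id"], row["version"], row["platform"], row["arch"])
        if key == (package_id, version, platform, arch):
            return row
    return None


def candidate_urls(row):
    """List the upstream URL first and the archived copy second."""
    archived = ARCHIVE_BASE + row["sha256sum"]
    return [row["upstream_url"], archived]


def download_with_fallback(urls, fetch):
    """Try each URL in order; return (url, data) for the first success.

    `fetch` is any callable that takes a URL and returns bytes, raising
    OSError on failure (urllib.error.URLError is a subclass of OSError).
    """
    errors = []
    for url in urls:
        try:
            return url, fetch(url)
        except OSError as exc:
            errors.append(f"{url}: {exc}")
    raise OSError("all sources failed: " + "; ".join(errors))
```

In real use, fetch could be something like `lambda url: urllib.request.urlopen(url).read()`. Keeping the URL table lookup separate from the download loop mirrors the goal stated above: the table format can change without touching the caller.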

Single Point of Failure - a small remark

Previously, Galaxy packages relied entirely on the kindness of upstream to maintain existing packages indefinitely, which is obviously not a sustainable practice. Every time a tarball was moved, we had to hope one of us had retained a copy so that we could ensure reproducibility. With the advent of the Cargo Port, we now maintain a complete, redundant copy of every upstream tarball used in IUC and devteam repositories, and we add sha256sums for every file to ensure download integrity. The community is welcome to request that files they use in their packages be added as well. We believe this will help combat the single point of failure by providing at least one level of duplication. The Cargo Port is considering plans to provide mirrors of itself at various universities as another layer of redundancy.
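
As an illustration of the integrity check mentioned above, here is a minimal Python sketch that compares a downloaded file against a recorded sha256sum. The function names are hypothetical and this is not how Galaxy or Cargo-Port necessarily implement it. It only uses hashlib from the standard library and reads the file in chunks so that large tarballs do not have to fit into memory.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=65536):
    """Compute the hex sha256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256):
    """Return True if the file matches the recorded checksum."""
    return sha256_of_file(path) == expected_sha256.strip().lower()


# Usage: write a small stand-in "tarball" and check it against its digest.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example tarball bytes")
    tmp_path = tmp.name

expected = hashlib.sha256(b"example tarball bytes").hexdigest()
ok = verify_download(tmp_path, expected)    # matching checksum
bad = verify_download(tmp_path, "0" * 64)   # deliberately wrong checksum
os.remove(tmp_path)
```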

Thanks for reading and we appreciate any comments.

Eric, Nitesh & Bjoern
