postgresql/mysql Galaxy integration as tool and visualisation

First class postgresql/mysql Galaxy integration

With this document I try to structure my ideas about a solid postgresql/mysql integration in Galaxy, as part of our GIE work and beyond.

A lot of tools rely on an ORDBMS for storing and querying data. This is not so common in NGS or proteomics, but it is in cheminformatics, text-mining and genome annotation. See for example blast2GO, which uses a mysql database as backend, or the ChemicalToolBox, which stores the entire PubChem in a pgchem-indexed postgresql backend. In general I think it makes sense to support an ORDBMS in Galaxy if the data do not change too frequently and if we can share databases as Galaxy datasets. This would enable us to create one DB and let different users query it with different use-cases.

Supported use cases

  1. create database
  2. extend database
  3. query database
  4. view database

The current Docker-based job runner can run Docker containers (in this case the postgres container), mount in the input data and return the result file. My idea (at least for the first, GIE-unrelated part) is to use this mechanism to create DBs.

Scenarios

Inserting

  1. A tool gets a tabular dataset and defines on its own (maybe as a configfile, or as a standalone python script) a sqlalchemy script to populate the database (see the sketch below this list)
  2. This script and the dataset are mounted into the container
  3. The script is executed as a normal process, given the input data, and populates the database
  4. The script then stops the database (but keeps running, so that the container does not terminate)
  5. The script runs pg_dump and returns an interoperable SQL dump as result to Galaxy
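A minimal sketch of what such a populate script could look like. Everything here is illustrative: the `Compound` table, its columns, the connection string and file names are assumptions, and the dump step simply shells out to pg_dump.

```python
import csv
import subprocess

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()


class Compound(Base):
    # hypothetical example table; a real tool would define its own schema
    __tablename__ = 'compound'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    smiles = Column(String)


# assumes the container already runs postgres and provides this database
engine = create_engine('postgresql://galaxy@localhost/galaxy_db')
Base.metadata.create_all(engine)

session = sessionmaker(bind=engine)()
with open('input.tabular') as handle:
    for name, smiles in csv.reader(handle, delimiter='\t'):
        session.add(Compound(name=name, smiles=smiles))
session.commit()

# return an interoperable SQL dump to Galaxy as the tool result
subprocess.check_call(['pg_dump', '--no-owner', '--file=output.sql', 'galaxy_db'])
```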

Imho this should be possible with only slight modifications to the current system. I'm not sure we should define a complex composite datatype for postgres/mysql; for the first round maybe a tarball (not compressed) is enough. This keeps things much simpler, should not create any large overhead and makes it really easy to download/upload databases. I could imagine storing the blast2GO mysql database somewhere online so we can easily plug it in. Interoperability with data-managers to download existing genomic resources, e.g. flybase, and update/populate these databases on the fly is also possible. The versioning is important, and for this we can define a sniff() function that opens only one specific file inside the tarball and extracts the DB version number. This number should be part of the datatype metadata, so that tools can filter by version number, as we do in gemini and snpeff.
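A rough sketch of how this sniff()/metadata idea could look for postgres. It relies on the fact that a postgres data directory contains a PG_VERSION marker file holding the version number; the class name, module paths and method names follow the usual Galaxy datatype layout but are untested assumptions here.

```python
import tarfile

from galaxy.datatypes.binary import Binary
from galaxy.datatypes.metadata import MetadataElement


class PostgresqlArchive(Binary):
    """Uncompressed tarball around a postgres data directory (sketch)."""
    file_ext = 'postgresql'

    # expose the DB version so tools can filter on it, as gemini/snpeff do
    MetadataElement(name='version', default=None, desc='PostgreSQL version',
                    readonly=True, visible=True)

    def sniff(self, filename):
        # open only one specific file inside the tarball: postgres keeps
        # its version in a PG_VERSION file in the data directory
        if tarfile.is_tarfile(filename):
            with tarfile.open(filename) as archive:
                return any(name.endswith('PG_VERSION')
                           for name in archive.getnames())
        return False

    def set_meta(self, dataset, **kwd):
        with tarfile.open(dataset.file_name) as archive:
            for member in archive.getmembers():
                if member.name.endswith('PG_VERSION'):
                    handle = archive.extractfile(member)
                    dataset.metadata.version = handle.read().strip()
                    break
```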

Updating

Updating a database can work in the same way. The only differences are that we need to untar the database on the tool level before mounting it in, and that the script needs to restart postgres with the new, already filled database.
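Roughly, the tool side could do something like the following before handing over to the populate/update script (the file names and data directory are assumptions; pg_ctl is the standard way to start postgres against an existing data directory):

```python
import subprocess
import tarfile

# unpack the database dataset on the tool level before mounting it in
with tarfile.open('input_db.tar') as archive:
    archive.extractall('pgdata')

# restart postgres with the new, already filled data directory
subprocess.check_call(['pg_ctl', '-D', 'pgdata', '-w', 'start'])
```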

Query

Querying should be easy with the same mechanisms as described above.
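A query tool could then be as small as this sketch: run a user-supplied SQL statement against the restarted database and return the result as a tabular dataset (connection string and file names are again assumptions):

```python
import csv

from sqlalchemy import create_engine

engine = create_engine('postgresql://galaxy@localhost/galaxy_db')

with open('query.sql') as handle:
    query = handle.read()

# write the result back to Galaxy as a tabular dataset
result = engine.execute(query)
with open('output.tabular', 'w') as out:
    writer = csv.writer(out, delimiter='\t')
    writer.writerow(result.keys())
    writer.writerows(result)
```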

Visualisation

For an interactive visualisation of the entire database we could use a GIE :)
