postgresql/mysql Galaxy integration as tool and visualisation

First class postgresql/mysql Galaxy integration

With this document I try to structure my ideas about a solid postgresql/mysql integration in Galaxy, as part of our GIE work and beyond.

A lot of tools rely on an ORDBMS for storing and querying data. This is not so common in NGS or proteomics, but it is in cheminformatics, text-mining and genome annotation. See for example blast2GO, which uses a mysql database as backend, or the ChemicalToolBox, which stores the entire PubChem in a pgchem-indexed postgresql backend. In general I think it makes sense to support an ORDBMS in Galaxy if the data do not change too frequently and if we can share databases as Galaxy datasets. This would enable us to create one DB and let different users query it with different use-cases.

Supported use cases

  1. create database
  2. extend database
  3. query database
  4. view database

The current Docker-based job runner can run Docker containers (in this case the postgres container), mount in the input data and return the result file. My idea (at least for the first, GIE-unrelated part) is to use this mechanism to create DBs.

Scenarios

Inserting

  1. A tool gets a tabular dataset and defines on its own (maybe as a configfile, or as a standalone python script) a sqlalchemy script to populate the database (see the sketch below this list)
  2. This script and the dataset are mounted into the container
  3. The script is executed as a normal process, given the input data, and populates the database
  4. The script then stops the database (but keeps running, so that the container does not terminate)
  5. The script runs pg_dump and returns an interoperable SQL dump as result to Galaxy
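A minimal sketch of what such a populate script could look like. Everything here is illustrative: the `Compound` table, its columns, the connection string and file names are assumptions, and the dump step simply shells out to pg_dump.

```python
import csv
import subprocess

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()


class Compound(Base):
    # hypothetical example table; a real tool would define its own schema
    __tablename__ = 'compound'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    smiles = Column(String)


# assumes the container already runs postgres and provides this database
engine = create_engine('postgresql://galaxy@localhost/galaxy_db')
Base.metadata.create_all(engine)

session = sessionmaker(bind=engine)()
with open('input.tabular') as handle:
    for name, smiles in csv.reader(handle, delimiter='\t'):
        session.add(Compound(name=name, smiles=smiles))
session.commit()

# return an interoperable SQL dump to Galaxy as the tool result
subprocess.check_call(['pg_dump', '--no-owner', '--file=output.sql', 'galaxy_db'])
```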

Imho this should be possible with only slight modifications to the current system. I'm not sure we should define a complex composite datatype for postgres/mysql; for the first round maybe a tarball (not compressed) is enough. This keeps things much simpler, should not create any large overhead and makes it really easy to download/upload databases. I could imagine storing the blast2GO mysql database somewhere online so we can easily plug it in. Interoperability with data-managers to download existing genomic resources, e.g. flybase, and update/populate these databases on the fly is also possible. The versioning is important, and for this we can define a sniff() function that opens only one specific file inside the tarball and extracts the DB version number. This number should be part of the datatype metadata, so that tools can filter by version number, as we do in gemini and snpeff.
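A rough sketch of how this sniff()/metadata idea could look for postgres. It relies on the fact that a postgres data directory contains a PG_VERSION marker file holding the version number; the class name, module paths and method names follow the usual Galaxy datatype layout but are untested assumptions here.

```python
import tarfile

from galaxy.datatypes.binary import Binary
from galaxy.datatypes.metadata import MetadataElement


class PostgresqlArchive(Binary):
    """Uncompressed tarball around a postgres data directory (sketch)."""
    file_ext = 'postgresql'

    # expose the DB version so tools can filter on it, as gemini/snpeff do
    MetadataElement(name='version', default=None, desc='PostgreSQL version',
                    readonly=True, visible=True)

    def sniff(self, filename):
        # open only one specific file inside the tarball: postgres keeps
        # its version in a PG_VERSION file in the data directory
        if tarfile.is_tarfile(filename):
            with tarfile.open(filename) as archive:
                return any(name.endswith('PG_VERSION')
                           for name in archive.getnames())
        return False

    def set_meta(self, dataset, **kwd):
        with tarfile.open(dataset.file_name) as archive:
            for member in archive.getmembers():
                if member.name.endswith('PG_VERSION'):
                    handle = archive.extractfile(member)
                    dataset.metadata.version = handle.read().strip()
                    break
```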

Updating

Updating a database can work in the same way. The only differences are that we need to untar the database on the tool level before mounting it in, and that the script needs to restart postgres with the new, already filled database.
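Roughly, the tool side could do something like the following before handing over to the populate/update script (the file names and data directory are assumptions; pg_ctl is the standard way to start postgres against an existing data directory):

```python
import subprocess
import tarfile

# unpack the database dataset on the tool level before mounting it in
with tarfile.open('input_db.tar') as archive:
    archive.extractall('pgdata')

# restart postgres with the new, already filled data directory
subprocess.check_call(['pg_ctl', '-D', 'pgdata', '-w', 'start'])
```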

Query

Querying should be easy with the same mechanisms as described above.
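A query tool could then be as small as this sketch: run a user-supplied SQL statement against the restarted database and return the result as a tabular dataset (connection string and file names are again assumptions):

```python
import csv

from sqlalchemy import create_engine

engine = create_engine('postgresql://galaxy@localhost/galaxy_db')

with open('query.sql') as handle:
    query = handle.read()

# write the result back to Galaxy as a tabular dataset
result = engine.execute(query)
with open('output.tabular', 'w') as out:
    writer = csv.writer(out, delimiter='\t')
    writer.writerow(result.keys())
    writer.writerows(result)
```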

Visualisation

For an interactive visualisation of the entire database we could use a GIE :)
