@stefanches7
Last active November 17, 2018 19:26
GSoC 2017: Ensembl/EnsemblGenomes FTP datafile search API

Organisation: Genes, Genomes and Variation

Service team of Ensembl/EnsemblGenomes projects: Ensembl main site, EnsemblGenomes main site, Ensembl repositories, Ensembl blog

Mentor: Dan Staines

Student: Stefan Dvoretskii

Repository and sources are available here

Plans/work done summary

The initial plan was to create an HTTP interface and a JavaScript interface that would let users search the FTP sites of the Ensembl/EG projects (ftp.ensembl(genomes).org, which hold various biological data in various formats) and get links to the individual files that match the filters they specify. For example, a user could enter "Equus caballus" as the taxonomy branch and "embl" as the data type and get all the relevant links for that combination without manually browsing the FTP sites, saving a lot of time and effort.
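As a rough illustration of the filtered lookup described above, the sketch below builds a search URL from a set of filters. The base URL, the `/search` path, and the parameter names `organism` and `filetype` are hypothetical placeholders, not the project's actual endpoint; the repository documents the real interface.

```python
from urllib.parse import urlencode

# Hypothetical base URL and path: placeholders, not the project's real endpoint.
BASE_URL = "http://localhost:8080/search"


def build_search_url(filters, base_url=BASE_URL):
    """Return a GET URL whose query string encodes all given filters."""
    # Drop empty values so that unset filters do not constrain the search.
    params = {name: value for name, value in filters.items() if value}
    # Sorting keeps the parameter order stable, which makes URLs easy to compare.
    return base_url + "?" + urlencode(sorted(params.items()))


if __name__ == "__main__":
    # The example from the text: all "embl" files for Equus caballus.
    # The parameter names are assumptions made for this sketch.
    print(build_search_url({"organism": "Equus caballus", "filetype": "embl"}))
```

A client would then issue a plain GET request against the resulting URL and read the file links from the response.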

During the work, the project underwent only minimal changes (for the better), and in general it fulfils the initial plan and serves its purpose: users can look up Ens/EG files through either the HTTP interface or the JavaScript UI, using filtered search and getting direct links to the files in response. Furthermore, some features beyond the original plan, such as value suggestions in the JavaScript interface and paging of the results, were implemented to make the product as comfortable as possible for an everyday user. It is now runnable as a standalone application (once again, you'll find the source code here), and there are still some ideas for improving it further, which we are going to pursue even after GSoC ends.

Possible improvements and prospects

  • Host the API and the database on Ensembl/EG server(s)
  • Embed the API into one of the Ensembl project websites using a <script /> tag, or make it embeddable into other websites in the same way.
  • Use eHive, an Ensembl grid computing system, to perform the update job; this would be faster and more efficient.
  • Allow users to download the whole page or result link set in one click in the JavaScript interface. This is pretty challenging, considering that JS has no built-in feature for downloading multiple files or for starting a download on the same page.
  • Allow users to combine filters with logical operators other than "and"
  • Describe controller endpoints using Swagger
  • Polish the looks of the error pages and error handling
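
The idea of combining filters with operators other than "and" can be sketched as small composable predicates over file records. The record fields (`organism`, `filetype`, `url`), the helper names, and the sample URLs are all assumptions made for illustration; this is not how the project implements filtering.

```python
def field_is(name, value):
    """Predicate: the record's field equals the value (case-insensitive)."""
    return lambda record: str(record.get(name, "")).lower() == value.lower()


def all_of(*preds):
    """Logical AND: every predicate must match."""
    return lambda record: all(p(record) for p in preds)


def any_of(*preds):
    """Logical OR: at least one predicate must match."""
    return lambda record: any(p(record) for p in preds)


def negate(pred):
    """Logical NOT: invert a predicate."""
    return lambda record: not pred(record)


def search(records, pred):
    """Return the links of all records accepted by the predicate."""
    return [r["url"] for r in records if pred(r)]


# Made-up sample records; real entries would come from the project's database.
RECORDS = [
    {"organism": "Equus caballus", "filetype": "embl", "url": "ftp://example.org/horse.embl.gz"},
    {"organism": "Equus caballus", "filetype": "fasta", "url": "ftp://example.org/horse.fa.gz"},
    {"organism": "Danio rerio", "filetype": "embl", "url": "ftp://example.org/zebrafish.embl.gz"},
]

if __name__ == "__main__":
    horse = field_is("organism", "Equus caballus")
    # Horse files that are anything except "embl".
    print(search(RECORDS, all_of(horse, negate(field_is("filetype", "embl")))))
```

Representing each filter as a function keeps "and", "or", and "not" symmetric, so new operators could be added without touching the search loop.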

Footnote

Commits and their messages do not really depict the progress of the project and the changes made to it. For that, the adr directory records the most important decisions taken during the implementation; you are also welcome to read the README.md files in the module directories and the in-code documentation to understand the project better.
