
@lpand
Last active January 4, 2016 11:39

Enrichment

Index

  1. What we need
  2. Input representation within BioMart domain
  3. What we want to do with it
  4. Design
  5. Implementation
    5.1 Ensembl IDs Translation
    5.2 Getting All the Annotations
    5.3 Getting Other Results' Attributes and HGNC Translation
  6. Open Issues
  7. Possible Solutions

1

#### INPUT:

  • Background: a list of genes.
  • Sets: a list of genes.
  • Cutoff: a float value.
  • Annotations: one or more annotation terms, not the annotations themselves.

2

Background and Sets are Filter Lists, with a display name and a value.
Cutoff is a plain Filter.
Annotations is a set of Attribute Lists.

An Attribute List is an Attribute that contains and represents a list of Attribute elements.
Same concept for Filter List.
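
The relationship between these elements can be sketched as a small data model. This is only an illustration in Python (BioMart itself is written in Java); the class names mirror the terms above, while the field names and example values are assumptions, not BioMart's actual API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Attribute:
    name: str


@dataclass
class AttributeList:
    """An Attribute that contains and represents a list of Attributes."""
    name: str
    attributes: List[Attribute] = field(default_factory=list)


@dataclass
class Filter:
    name: str
    value: str = ""


@dataclass
class FilterList:
    """A Filter that contains and represents a list of Filters."""
    name: str
    display_name: str
    filters: List[Filter] = field(default_factory=list)


# Hypothetical element names and values, for illustration only.
annotations = AttributeList(
    "annotations", [Attribute("go_term"), Attribute("go_description")]
)
cutoff = Filter("cutoff", "0.05")
background = FilterList(
    "background", "Background", [Filter("ensembl_gene_id", "ENSG00000139618")]
)
```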

3

If requested, translate the lists into human genes/regions.
Then, enrich them and get the results.
Get further information about the results, such as a description.
Finally, translate the results into HGNC symbols.

4

A Dino is the basic computation unit external to the BioMart domain through which we want to extend the BioMart functionalities. It has requirements and dependencies.
With MartConfigurator, the deployer assigns a Dino to a Configuration and specifies the *Attribute*s and *Filter*s that will yield the requirements of the Dino.

When BioMart receives a new Query for a Dataset configured with a Configuration that has been assigned a Dino, it hands off the request to the Dino Handler.

The Dino Handler is the component of our new architecture that takes a Query as input, along with other information. It gathers the requirements and dependencies of the Dino, creates a new instance of the Dino passing the dependencies as parameters, binds the values of *Filter*s and *Attribute*s from the Query to *Field*s of the Dino instance, and finally makes the Dino instance run (you go girl).

A Dino instance also has access to a collection that maps its requirement names to the bound *Query Element*s.
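
The handler's steps (instantiate with dependencies, bind values to fields, keep the requirement-to-element map, run) can be sketched roughly as follows. This is a Python illustration under assumptions: the Query is modelled as a plain dict, and `handle`, `requirements`, and `bound` are hypothetical names rather than the real BioMart classes.

```python
class EnrichmentDino:
    # Hypothetical mapping: requirement name -> field name on the instance.
    requirements = {"background": "background", "sets": "sets", "cutoff": "cutoff"}

    def __init__(self, query_builder):
        self.query_builder = query_builder  # dependency passed in by the handler
        self.bound = {}  # requirement name -> bound Query Element

    def run(self):
        # A real Dino would do its computation here; this one echoes its fields.
        return {req: getattr(self, fld) for req, fld in self.requirements.items()}


def handle(query, dino_class, dependencies):
    """Create a Dino, bind Query values to its fields, then run it."""
    dino = dino_class(**dependencies)
    for requirement, field_name in dino_class.requirements.items():
        element = query[requirement]  # the Filter or Attribute in the Query
        setattr(dino, field_name, element["value"])
        dino.bound[requirement] = element
    return dino.run()


# Hypothetical Query content, for illustration only.
query = {
    "background": {"name": "background_list", "value": "ENSG00000139618"},
    "sets": {"name": "sets_list", "value": "ENSG00000141510"},
    "cutoff": {"name": "cutoff", "value": "0.05"},
}
result = handle(query, EnrichmentDino, {"query_builder": object()})
```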

5

During processing, this Dino needs to fire *Query*s, and it does so through a concrete implementation of a Query Builder that I wrote.

5.1

Before anything else, we decided to translate IDs of any form to Ensembl IDs. For this purpose we need to issue two queries to the BioMart Query Engine: one for Background and one for Sets.

For the translation query we need:

  • The name of the Filter bound to the requirement –– the value is a requirement.
  • The dataset name and configuration name in the user query.
  • To get hold of the first Attribute within the Attribute List coming from the user query.

With this information we can submit a query for ID translation.
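
As a rough sketch, the three ingredients listed above could be assembled like this. The dict layout and the function name `build_translation_query` are assumptions made for illustration; they do not reflect the Query Builder's real interface.

```python
def build_translation_query(dataset, config, filter_name, ids, attribute_list):
    """Assemble a query that translates the given IDs.

    The first Attribute of the Attribute List is the one requested.
    """
    if not attribute_list:
        raise ValueError("the Attribute List is empty")
    return {
        "dataset": dataset,
        "config": config,
        "filters": {filter_name: ",".join(ids)},
        "attributes": [attribute_list[0]],  # first Attribute of the list
    }


# One query for Background and one for Sets (all names are hypothetical).
attribute_list = ["ensembl_gene_id", "go_id"]
background_query = build_translation_query(
    "hsapiens_gene_ensembl", "gene_config", "background_list",
    ["BRCA2", "TP53"], attribute_list,
)
sets_query = build_translation_query(
    "hsapiens_gene_ensembl", "gene_config", "sets_list",
    ["BRCA2"], attribute_list,
)
```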

5.2

#### Getting All the Annotations

In the enrichment phase we need annotation terms. When a new Query comes in, the Dino looks for the proper annotation file and, if there isn't one, builds a Query to retrieve the annotations.

To build this Query we need:

  • The second Attribute within the Attribute List inside the user query.
  • The name of Dataset and Configuration the Attribute List is part of.
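
The file-or-query decision could look roughly like the sketch below. The file naming scheme, the cache directory, and the query layout are all assumptions for illustration; only the use of the second Attribute plus the Dataset and Configuration names comes from the description above.

```python
import os
import tempfile


def annotation_source(cache_dir, dataset, config, attribute_list):
    """Return ("file", path) if an annotation file exists,
    otherwise ("query", query) describing the Query to fire."""
    if len(attribute_list) < 2:
        raise ValueError("expected at least two Attributes in the list")
    annotation_attribute = attribute_list[1]  # second Attribute of the list
    # Hypothetical file naming scheme, not taken from BioMart.
    file_name = f"{dataset}.{config}.{annotation_attribute}.txt"
    path = os.path.join(cache_dir, file_name)
    if os.path.exists(path):
        return ("file", path)
    query = {"dataset": dataset, "config": config,
             "attributes": [annotation_attribute]}
    return ("query", query)


with tempfile.TemporaryDirectory() as cache_dir:
    args = (cache_dir, "hsapiens_gene_ensembl", "gene_config",
            ["ensembl_gene_id", "go_id"])
    first = annotation_source(*args)  # no file yet, so a Query is built
    cached = os.path.join(cache_dir, "hsapiens_gene_ensembl.gene_config.go_id.txt")
    with open(cached, "w") as handle:
        handle.write("placeholder\n")
    second = annotation_source(*args)  # the file is now found
```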

5.3

Omitted for now.

6

  • Inside the Query there's no trace of *Attribute List*s or *Filter List*s because they are flattened in a messy way: the *Attribute*s and *Filter*s that are part of a list take the place of the list itself. The same *Attribute*s and *Filter*s can be part of different lists, making it impossible to tell which belongs to which list.
  • Inside the BioMart Registry, an Attribute List is an Attribute with a list of references to the *Attribute*s in the list. Those *Attribute*s are NOT copied within a Container but only across different *Configuration*s (I'm not 100% sure about this assertion).
  • In many places in BioMart's source code (that is, all the classes that deal in some way with attributes) *Attribute*s are gathered into a single collection, *Attribute List*s are flattened and duplicate *Attribute*s are discarded. This, plus the absence of a test suite for the BioMart source code, means, in my opinion, that redesigning BioMart's code so that an Attribute List holds copies of its *Attribute*s would be a cumbersome effort with a high risk of failure.
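
A toy example makes the first issue concrete: once lists are merged and duplicates dropped, list membership can no longer be recovered from the flat collection alone. The list and attribute names below are hypothetical, and `flatten` only imitates the behaviour described above; it is not BioMart code.

```python
def flatten(attribute_lists):
    """Merge all lists into one collection, discarding duplicates."""
    flat = []
    for attributes in attribute_lists.values():
        for attribute in attributes:
            if attribute not in flat:
                flat.append(attribute)
    return flat


# Two hypothetical Attribute Lists that share one Attribute.
lists = {
    "go": ["gene_id", "go_term"],
    "pathway": ["gene_id", "pathway_name"],
}
flat = flatten(lists)

# With the original lists at hand we can still see who owns what...
owners = {a: [name for name, attrs in lists.items() if a in attrs] for a in flat}
# ...but `flat` alone says nothing about it: "gene_id" appears only once.
```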

7

Enrichment Dino

  • Must process each *Attribute List* separately, using all the flattened *Filter*s,
  • and return all the results at once.

Network (This is not about network but for now it stays here...)

  • Must process all the *Attribute List*s at once, so it uses the flattened *Attribute*s and *Filter*s.

It has no attributes, only filters: one for the threshold, one for the GMT file, and one for the metric to use.

Solution 1
=============

Query

  • Has a list of *Attribute List*s
  • and a list of *Filter List*s.

Enrichment Dino

  • For each Attribute List will
    • Bind *Attribute*s to its *Field*s.
    • For each Filter List will bind *Filter*s to its *Field*s.

This avoids rebinding attributes while binding filters. Otherwise, the Dino would have to know which fields are bound to attributes and which to filters.
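
The binding order proposed in Solution 1 can be sketched as a nested loop. This is a Python illustration with hypothetical names (`SketchDino`, `run_enrichment`, the dict-based Query); it assumes one processing pass per Attribute List, after all Filter Lists have been bound.

```python
class SketchDino:
    def __init__(self):
        self.attributes = []
        self.filters = {}

    def bind_attributes(self, attribute_list):
        self.attributes = list(attribute_list["attributes"])

    def bind_filters(self, filter_list):
        self.filters.update(filter_list["filters"])

    def process(self):
        # Stand-in for the enrichment itself: report what was bound.
        return (tuple(self.attributes), dict(self.filters))


def run_enrichment(query, dino):
    """Process each Attribute List separately; return all results at once."""
    results = []
    for attribute_list in query["attribute_lists"]:
        dino.bind_attributes(attribute_list)  # attributes are bound once per list
        for filter_list in query["filter_lists"]:
            dino.bind_filters(filter_list)    # filters never touch attribute fields
        results.append(dino.process())
    return results


# Hypothetical Query that keeps its lists instead of flattening them.
query = {
    "attribute_lists": [
        {"name": "go", "attributes": ["go_term"]},
        {"name": "pathway", "attributes": ["pathway_name"]},
    ],
    "filter_lists": [
        {"name": "background", "filters": {"background": "ENSG00000139618"}},
        {"name": "sets", "filters": {"sets": "ENSG00000141510"}},
    ],
}
results = run_enrichment(query, SketchDino())
```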

@arekkasp

Since this does not expose attribute/filter list logic to external users, it looks like a perfect solution.
