lpand/Enrichment.md

## Enrichment.md

      
          
    Enrichment

Index


1. What we need
2. Input representation within BioMart domain
3. What we want to do with it
4. Design
5. Implementation
   - 5.1 Ensembl IDs Translation
   - 5.2 Getting All the Annotations
   - 5.3 Getting Other Results' Attributes and HGNC Translation
6. Open Issues
7. Possible Solutions

1

What we need

####INPUT:

Background: a list of genes.
Sets: a list of genes.
Cutoff: a float value.
Annotations: one or more annotation terms, not the annotations themselves.

2

 Input representation within BioMart domain 

Background and Sets are Filter Lists, with a display name and a value.

Cutoff is a plain Filter.

Annotations is a set of Attribute Lists.
An Attribute List is an Attribute that contains and represents a list of Attribute elements.

The same concept applies to a Filter List: it is a Filter that contains and represents a list of Filter elements.
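
As a rough illustration of the representation described above, the following sketch models an Attribute List as an Attribute that holds other Attributes, and a Filter List as a Filter that holds other Filters. The class names, field names, and example values (`go_annotation`, `ensembl_gene_id`, and so on) are assumptions made for this example and are not taken from the BioMart code base.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Attribute:
    name: str


@dataclass
class AttributeList(Attribute):
    # An Attribute List is itself an Attribute that refers to other Attributes.
    elements: List[Attribute] = field(default_factory=list)


@dataclass
class Filter:
    name: str
    display_name: str
    value: str = ""


@dataclass
class FilterList(Filter):
    # A Filter List is itself a Filter that refers to other Filters.
    elements: List[Filter] = field(default_factory=list)


# Hypothetical inputs for an enrichment request.
annotation = AttributeList(
    name="go_annotation",
    elements=[Attribute("ensembl_gene_id"), Attribute("go_term")],
)
background = FilterList(
    name="background",
    display_name="Background",
    value="ENSG1,ENSG2,ENSG3",
    elements=[Filter("ensembl_gene_id", "Ensembl Gene ID")],
)
cutoff = Filter(name="cutoff", display_name="Cutoff", value="0.05")
```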
3

What we want to do with it

If requested, translate the lists into human genes/regions.

Then, enrich them and get the results.

Get further information about the results, such as a description.

Finally, translate the results into HGNC symbols.
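
This document does not say which statistical test the enrichment step uses. Purely as an assumption, the sketch below uses a one-sided hypergeometric test, a common choice for gene set enrichment, and keeps only the terms whose p-value is at or below the cutoff. The function names and the shape of `annotations` (term mapped to a list of genes) are hypothetical.

```python
from math import comb


def hypergeom_pvalue(background_size, annotated_in_background, set_size, annotated_in_set):
    """P(X >= annotated_in_set) when drawing set_size genes without replacement."""
    upper = min(set_size, annotated_in_background)
    total = comb(background_size, set_size)
    tail = sum(
        comb(annotated_in_background, k)
        * comb(background_size - annotated_in_background, set_size - k)
        for k in range(annotated_in_set, upper + 1)
    )
    return tail / total


def enrich(background, gene_set, annotations, cutoff):
    """Return {term: p-value} for terms whose p-value is at or below the cutoff."""
    background = set(background)
    gene_set = set(gene_set) & background  # ignore genes outside the background
    results = {}
    for term, genes in annotations.items():
        annotated = set(genes) & background
        hits = len(annotated & gene_set)
        if hits == 0:
            continue
        p = hypergeom_pvalue(len(background), len(annotated), len(gene_set), hits)
        if p <= cutoff:
            results[term] = p
    return results


# Toy data: ten background genes, three genes in the set.
background = [f"g{i}" for i in range(1, 11)]
gene_set = ["g1", "g2", "g3"]
annotations = {
    "termA": ["g1", "g2", "g3"],
    "termB": ["g1", "g4", "g5", "g6", "g7", "g8"],
}
results = enrich(background, gene_set, annotations, cutoff=0.05)
```

With this toy data only `termA` passes the cutoff, because all three annotated genes fall inside the set.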
4

Design

A Dino is the basic computation unit, external to the BioMart domain, through which we want to extend BioMart's functionality. It has requirements and dependencies.

With MartConfigurator, the deployer assigns a Dino to a Configuration and specifies the *Attribute*s and *Filter*s that will yield the requirements of the Dino.
When BioMart receives a new Query for a Dataset configured with a Configuration that has been assigned a Dino, it hands off the request to the Dino Handler.
The Dino Handler is the component of our new architecture that takes a Query as input, along with other information. It gathers the requirements and dependencies of the Dino, creates a new instance of the Dino with the dependencies passed as parameters, binds the values of *Filter*s and *Attribute*s from the Query to *Field*s of the Dino instance, and finally runs the Dino instance.
A Dino instance also has access to a collection that maps its requirement names to the bound *Query Element*s.
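
The flow described in this section can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the class names, the `requirement_map` structure, the dictionary-based Query, and the filter names are all assumptions made for the example.

```python
class EnrichmentDino:
    """Toy Dino; the field names below are hypothetical."""

    def __init__(self, query_builder):
        self.query_builder = query_builder  # dependency passed in by the handler
        self.background = None
        self.sets = None
        self.cutoff = None
        self.bound_elements = {}  # requirement name -> bound query element name

    def run(self):
        # A real Dino would enrich here; this one only reports what was bound.
        return {
            "background_size": len(self.background),
            "set_size": len(self.sets),
            "cutoff": float(self.cutoff),
        }


class DinoHandler:
    def __init__(self, dino_class, requirement_map, dependencies):
        self.dino_class = dino_class
        self.requirement_map = requirement_map  # chosen by the deployer
        self.dependencies = dependencies

    def handle(self, query):
        # 1. Create the Dino, passing its dependencies as parameters.
        dino = self.dino_class(**self.dependencies)
        # 2. Bind values from the Query to the Dino's fields.
        for requirement, element_name in self.requirement_map.items():
            setattr(dino, requirement, query[element_name])
            dino.bound_elements[requirement] = element_name
        # 3. Run the Dino.
        return dino.run()


handler = DinoHandler(
    EnrichmentDino,
    requirement_map={
        "background": "bg_filter",
        "sets": "sets_filter",
        "cutoff": "cutoff_filter",
    },
    dependencies={"query_builder": object()},  # placeholder dependency
)
result = handler.handle(
    {
        "bg_filter": ["ENSG1", "ENSG2", "ENSG3"],
        "sets_filter": ["ENSG1"],
        "cutoff_filter": "0.05",
    }
)
```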
5

Implementation

During processing, this Dino needs to fire *Query*s, which it does through a concrete implementation of a Query Builder that I wrote.
5.1

Ensembl IDs Translation

First of all, we decided to translate IDs of any form to Ensembl IDs. For this purpose we need to issue two queries to the BioMart Query Engine: one for Background and one for Sets.
For the translation query we need:

The name of the Filter bound to the requirement; the Filter's value is the requirement itself.
The dataset name and configuration name in the user query.
The first Attribute within the Attribute List coming from the user query.

With this information we can submit a query for ID translation.
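
To make the three ingredients concrete, here is a small sketch that assembles one translation query for Background and one for Sets. The dictionary layout and every name used (`build_translation_query`, `hsapiens_gene_ensembl`, `gene_config`, the filter and attribute names) are hypothetical; the real Query Builder interface is not shown in this document.

```python
def build_translation_query(filter_name, filter_value, dataset, config, attribute_list):
    """Describe an ID translation query as a plain dictionary."""
    return {
        "dataset": dataset,
        "config": config,
        "filters": {filter_name: filter_value},
        # Only the first Attribute of the Attribute List is used for translation.
        "attributes": [attribute_list[0]],
    }


attribute_list = ["ensembl_gene_id", "go_term"]

# One query per requirement: Background and Sets.
background_query = build_translation_query(
    "hgnc_symbol", ["TP53", "BRCA1", "EGFR"],
    "hsapiens_gene_ensembl", "gene_config", attribute_list,
)
sets_query = build_translation_query(
    "hgnc_symbol", ["TP53"],
    "hsapiens_gene_ensembl", "gene_config", attribute_list,
)
```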
5.2

####Getting All the Annotations
In the enrichment phase we need annotation terms. When a new Query comes in, the Dino looks for the proper annotation file and, if none exists, builds a Query to retrieve the annotations.
To build this Query we need:

The second Attribute within the Attribute List inside the user query.
The names of the Dataset and Configuration that the Attribute List is part of.
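
The lookup-then-query behaviour can be sketched as below. The file naming scheme, the cache directory, and the dictionary form of the Query are assumptions for illustration only; the document does not specify how annotation files are named or stored.

```python
import tempfile
from pathlib import Path


def annotation_source(cache_dir, dataset, config, attribute_list):
    """Return ("file", path) if a cached annotation file exists, else ("query", query)."""
    annotation_attribute = attribute_list[1]  # second Attribute of the list
    path = Path(cache_dir) / f"{dataset}_{config}_{annotation_attribute}.txt"
    if path.exists():
        return ("file", path)
    query = {
        "dataset": dataset,
        "config": config,
        "attributes": [annotation_attribute],
    }
    return ("query", query)


attribute_list = ["ensembl_gene_id", "go_term"]

with tempfile.TemporaryDirectory() as tmp:
    # No file yet: the Dino would have to build a Query.
    kind_before, payload_before = annotation_source(
        tmp, "hsapiens", "gene_config", attribute_list
    )
    # Simulate a previously saved annotation file.
    (Path(tmp) / "hsapiens_gene_config_go_term.txt").write_text("GO:0008150\tENSG1\n")
    kind_after, payload_after = annotation_source(
        tmp, "hsapiens", "gene_config", attribute_list
    )
```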

5.3

Getting Other Results' Attributes and HGNC Translation

Omitted for now.
6

Open Issues


Inside the Query there is no trace of *Attribute List*s or *Filter List*s because they are flattened in a messy way: the *Attribute*s and *Filter*s that are part of a list take the place of the list itself. The same *Attribute*s and *Filter*s can be part of different lists, which makes it impossible to tell which one belongs to which list.
Inside the BioMart Registry, an Attribute List is an Attribute with a list of references to the *Attribute*s in the list. Those *Attribute*s are NOT copied within a Container, only across different *Configuration*s (I'm not 100% sure about this assertion).
In many places in BioMart's source code (that is, in all the classes that deal in some way with attributes) *Attribute*s are gathered into a single collection, *Attribute List*s are flattened, and duplicate *Attribute*s are discarded. This, plus the absence of a test suite for the BioMart source code, means, in my opinion, that redesigning BioMart's code so that an Attribute List holds copies of its *Attribute*s would be a cumbersome effort with a high risk of failure.
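
The ambiguity caused by flattening can be shown with a small example. The `flatten` function below only imitates the behaviour described above (gather everything into one collection and discard duplicates); it is not BioMart code, and the list and attribute names are made up.

```python
def flatten(attribute_lists):
    """Merge all lists into one collection, dropping duplicate attributes."""
    seen = set()
    flat = []
    for attributes in attribute_lists.values():
        for attribute in attributes:
            if attribute not in seen:
                seen.add(attribute)
                flat.append(attribute)
    return flat


# Two different user requests...
two_lists = {
    "go": ["ensembl_gene_id", "go_term"],
    "pathway": ["ensembl_gene_id", "pathway_id"],
}
one_list = {
    "go": ["ensembl_gene_id", "go_term", "pathway_id"],
}

# ...become indistinguishable once flattened.
flat_two = flatten(two_lists)
flat_one = flatten(one_list)
```

Both calls yield the same three attributes, so nothing in the flattened Query tells the Dino how many lists there were or which attribute belonged to which.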

7

Possible Solutions

Enrichment Dino


Must process each *Attribute List* separately, using all the flattened *Filter*s,
and return all the results at once.


Network (This is not about network but for now it stays here...)


Must process all the *Attribute List*s at once, so it uses the flattened *Attribute*s and *Filter*s.

Has no attributes, only filters: one for the threshold, one for the GMT file, and one
for the metric to use.

Solution 1
=============
Query


Has a list of *Attribute List*s
and a list of *Filter List*s.

Enrichment Dino


For each Attribute List, it will bind *Attribute*s to its *Field*s.
For each Filter List, it will bind *Filter*s to its *Field*s.

This avoids rebinding attributes while binding filters. Otherwise, the Dino
would need to know which fields are bound to attributes and which to filters.
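
A minimal sketch of this solution, assuming the Query keeps its lists intact, might look like the following. The `ListAwareDino` class, the dictionary layout of the Query, and all the names in the example are hypothetical.

```python
class ListAwareDino:
    """Binds attributes and filters in two separate passes, one list at a time."""

    def __init__(self):
        self.attribute_fields = {}  # list name -> attributes bound from that list
        self.filter_fields = {}     # list name -> filters bound from that list

    def bind(self, query):
        # Pass 1: attributes only, grouped by the Attribute List they came from.
        for attribute_list in query["attribute_lists"]:
            self.attribute_fields[attribute_list["name"]] = list(
                attribute_list["attributes"]
            )
        # Pass 2: filters only, so attribute bindings are never touched again.
        for filter_list in query["filter_lists"]:
            self.filter_fields[filter_list["name"]] = dict(filter_list["filters"])


query = {
    "attribute_lists": [
        {"name": "go", "attributes": ["ensembl_gene_id", "go_term"]},
        {"name": "pathway", "attributes": ["ensembl_gene_id", "pathway_id"]},
    ],
    "filter_lists": [
        {"name": "background", "filters": {"hgnc_symbol": ["TP53", "BRCA1"]}},
        {"name": "sets", "filters": {"hgnc_symbol": ["TP53"]}},
    ],
}

dino = ListAwareDino()
dino.bind(query)
```

Because each list keeps its own name, a shared attribute such as `ensembl_gene_id` can appear in both lists without losing track of where it came from.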