DylanDmitri/Intron Database Recommendation.md

## Intron Database Recommendation.md

      
    Raw
  

              Intron Database Recommendation.md
            
          
    Problem Description

We have introns, and clusters of introns.
The challenge is: given a single intron, return all clusters where that intron is a member.
Data is given in the following format:


cluster_id
intron_list


1
"Intron30, Intron54, Intron 55"


2
"Intron45, Intron46"


3
"Intron24, Intron30"


4
"Intron96, Intron199, Intron87, Intron88"


...
...


There are ~600k total clusters, comprised of 2.1 million unique introns.
Cluster size ranges from 2 through 22, with around 50% of all clusters having 4 or fewer introns.
Sidenotes


Users may search for introns that do not exist in any cluster. If this happens, return no results.
After the clusters are returned, a script combines the contents of each cluster to get just the unique values.

Suggested Implementation

Goal is to enable specific intron -> set of groups.
Create a lookup table to store that information directly.
Pseudocode

lookup_table := hashmap linking (all 2.1 million relevant introns -> empty lists)

for each cluster in the database {
    get the cluster id number, ie the primary key for that cluster
    
    for each intron in the cluster {
        find that intron's corresponding list in the lookup_table
        add the cluster id to that list
    }
}

Postgress implementation details

Create a new table, with its primary key as "intron name" and add a column named "groups".
Run the pseduocode above in your favorite programming language,
then for each entry in the hashmap lookup_table add a row to the postgress lookup table.
The data should have this format:


intron_id
groups


"Intron1"
"5"


"Intron2"
"4,7,8"


"Intron3"
"1,3,52"


It's important that the name of the intron is used as the "primary key";
this lets Postgress search by name efficiently.
To search: first query this new lookup table, split the result to find which clusters are needed, look up each cluster in the original table.
Estimated Performance

Processing time to build the lookup table: 30 seconds best case, 5 minutes worst case.
Search time for a single lookup: very fast best case, 2 seconds worst case.
cluster_id	intron_list
1	"Intron30, Intron54, Intron 55"
2	"Intron45, Intron46"
3	"Intron24, Intron30"
4	"Intron96, Intron199, Intron87, Intron88"
...	...