Skip to content

Instantly share code, notes, and snippets.

@DylanDmitri
Last active April 13, 2019 02:50
Show Gist options
  • Save DylanDmitri/d57c6badc37ad50464679425586b442e to your computer and use it in GitHub Desktop.
Save DylanDmitri/d57c6badc37ad50464679425586b442e to your computer and use it in GitHub Desktop.

Problem Description

We have introns, and clusters of introns. The challenge is: given a single intron, return all clusters where that intron is a member.

Data is given in the following format:

cluster_id intron_list
1 "Intron30, Intron54, Intron 55"
2 "Intron45, Intron46"
3 "Intron24, Intron30"
4 "Intron96, Intron199, Intron87, Intron88"
... ...

There are ~600k total clusters, comprised of 2.1 million unique introns. Cluster size ranges from 2 through 22, with around 50% of all clusters having 4 or fewer introns.

Sidenotes

  • Users may search for introns that do not exist in any cluster. If this happens, return no results.
  • After the clusters are returned, a script combines the contents of each cluster to get just the unique values.

Suggested Implementation

Goal is to enable specific intron -> set of groups.

Create a lookup table to store that information directly.

Pseudocode

lookup_table := hashmap linking (all 2.1 million relevant introns -> empty lists)

for each cluster in the database {
    get the cluster id number, ie the primary key for that cluster
    
    for each intron in the cluster {
        find that intron's corresponding list in the lookup_table
        add the cluster id to that list
    }
}

Postgress implementation details

Create a new table, with its primary key as "intron name" and add a column named "groups".

Run the pseduocode above in your favorite programming language, then for each entry in the hashmap lookup_table add a row to the postgress lookup table. The data should have this format:

intron_id groups
"Intron1" "5"
"Intron2" "4,7,8"
"Intron3" "1,3,52"

It's important that the name of the intron is used as the "primary key"; this lets Postgress search by name efficiently.

To search: first query this new lookup table, split the result to find which clusters are needed, look up each cluster in the original table.

Estimated Performance

Processing time to build the lookup table: 30 seconds best case, 5 minutes worst case.

Search time for a single lookup: very fast best case, 2 seconds worst case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment