We have introns, and clusters of introns. The challenge is: given a single intron, return all clusters where that intron is a member.
Data is given in the following format:
cluster_id | intron_list |
---|---|
1 | "Intron30, Intron54, Intron 55" |
2 | "Intron45, Intron46" |
3 | "Intron24, Intron30" |
4 | "Intron96, Intron199, Intron87, Intron88" |
... | ... |
There are ~600k total clusters, comprised of 2.1 million unique introns. Cluster size ranges from 2 through 22, with around 50% of all clusters having 4 or fewer introns.
- Users may search for introns that do not exist in any cluster. If this happens, return no results.
- After the clusters are returned, a script combines the contents of each cluster to get just the unique values.
Goal is to enable specific intron -> set of groups
.
Create a lookup table to store that information directly.
lookup_table := hashmap linking (all 2.1 million relevant introns -> empty lists)
for each cluster in the database {
get the cluster id number, ie the primary key for that cluster
for each intron in the cluster {
find that intron's corresponding list in the lookup_table
add the cluster id to that list
}
}
Create a new table, with its primary key as "intron name" and add a column named "groups".
Run the pseduocode above in your favorite programming language,
then for each entry in the hashmap lookup_table
add a row to the postgress lookup table.
The data should have this format:
intron_id | groups |
---|---|
"Intron1" | "5" |
"Intron2" | "4,7,8" |
"Intron3" | "1,3,52" |
It's important that the name of the intron is used as the "primary key"; this lets Postgress search by name efficiently.
To search: first query this new lookup table, split the result to find which clusters are needed, look up each cluster in the original table.
Processing time to build the lookup table: 30 seconds best case, 5 minutes worst case.
Search time for a single lookup: very fast best case, 2 seconds worst case.