Another point: I think you need a maximum number of iterations.
I feel that with huge data it may never reach complete convergence (but perhaps I'm wrong).
You didn't use scipy or scikit-learn?
I updated the notebook. It now works with a regular version of PBexplore (the latest one); I also added tests. Among them is the one suggested by @alexdb27 with the dummy sequences, and it passes. However, I also looked at the reproducibility of the clustering, and I am not really satisfied with it.
There is no test yet for empty clusters. However, the only case I see where a cluster can become empty is if two initial centers are identical; there is now a test to avoid that.
The profile is updated by recalculating the frequency profile from the sequences in the cluster.
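As a sketch of what that recalculation could look like (this is not the notebook's actual code; the function name and the handling of the 16-letter PB alphabet are my assumptions):

```python
import numpy as np

PB_ALPHABET = "abcdefghijklmnop"  # the 16 Protein Block letters

def frequency_profile(sequences):
    """Recompute a cluster profile as per-position PB frequencies.

    `sequences` is a list of equal-length PB strings assigned to the
    cluster; the returned array has shape (length, 16) and each row
    sums to 1.
    """
    length = len(sequences[0])
    counts = np.zeros((length, len(PB_ALPHABET)))
    for seq in sequences:
        for pos, letter in enumerate(seq):
            counts[pos, PB_ALPHABET.index(letter)] += 1
    return counts / counts.sum(axis=1, keepdims=True)

profile = frequency_profile(["abc", "abd", "abc"])
# position 0 is all 'a'; position 2 is 2/3 'c', 1/3 'd'
```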
@pierrepo I did not see the need. What would you have done with scipy or scikit-learn?
@alexdb27 Also, there already is a maximum number of iterations, and that number must be set by the user.
Impressive.
@jbarnoud, a point about empty clusters. It was a problem in an early R implementation: if the data in fact contain x clusters and you ask for 10x of them, at some point some of the 10x clusters remain unused.
I suppose it will not happen with this kind of data, to be honest, but it would be nice to have a safeguard (just in case).
Concerning the comparison between clusterings, I'm not sure I've understood everything.
In the example, I would like to know the number of frames; we need a view with (a) a large number of snapshots (perhaps even from different MDs, for the same protein of course) and (b) a fairly large number of clusters (at least 10+).
In my mind, I would like to know, if I run the k-means 10 times, which snapshots are always found together (not only across two simulations).
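One way to quantify "always found together" could be a pairwise co-occurrence count over the repeated runs (a sketch, not the notebook's code; the function name is mine):

```python
import numpy as np

def cooccurrence(labelings):
    """Count, for each pair of snapshots, in how many of the repeated
    k-means runs they end up in the same cluster.

    `labelings` is a (n_runs, n_snapshots) array of cluster labels.
    Pairs whose count equals n_runs are always found together.
    """
    labelings = np.asarray(labelings)
    n = labelings.shape[1]
    together = np.zeros((n, n), dtype=int)
    for labels in labelings:
        # boolean same-cluster matrix for this run, accumulated as ints
        together += labels[:, None] == labels[None, :]
    return together

runs = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 1, 1, 0]]
together = cooccurrence(runs)
# snapshots 0 and 1 co-cluster in 2 of the 3 runs
```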
@jbarnoud, there already is a maximum number of iterations, and that number must be set by the user.
OK, does 1 iteration mean a complete pass over all the data?
A default must be provided (=10)
What could be nice for the user is a plot of the "non-change" (NC) count.
- Let me explain: x = 10 clusters, N = 10000 PB sequences
iteration 1 -> random initialisation of the centers (profiles), first association of the N sequences with the x clusters,
update of the profiles
iteration 2 -> association of the N sequences with the x clusters
NC is the number of sequences associated with the same cluster as in the previous iteration.
iteration i+1 -> association of the N sequences with the x clusters
NC should increase... I hope.
Can it be done?
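A minimal sketch of how such an NC count could be recorded inside the k-means loop (the `assign` and `update` callables stand in for the notebook's distance and profile-update steps; none of these names are the actual implementation):

```python
def kmeans_with_nc(data, assign, update, centers, max_iter=10):
    """k-means skeleton recording the 'non-change' (NC) count: the
    number of items keeping the same cluster as in the previous
    iteration.  Assumes no cluster ever becomes empty (the real
    method guards against that via its initial centers)."""
    labels = None
    nc_history = []
    for _ in range(max_iter):
        new_labels = [assign(item, centers) for item in data]
        if labels is not None:
            nc = sum(a == b for a, b in zip(labels, new_labels))
            nc_history.append(nc)
            if nc == len(data):  # nothing moved: converged
                break
        labels = new_labels
        centers = [update([d for d, l in zip(data, labels) if l == k])
                   for k in range(len(centers))]
    return labels, nc_history

# toy usage: 1-D points, nearest-center assignment, mean update
data = [0.0, 0.1, 10.0, 10.1]
assign = lambda x, cs: min(range(len(cs)), key=lambda k: abs(x - cs[k]))
update = lambda items: sum(items) / len(items)
labels, nc_history = kmeans_with_nc(data, assign, update, [0.0, 10.0])
```

Plotting `nc_history` over the iterations would give exactly the NC curve described above.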
A lot of stuff from @alexdb27.
About the empty clusters
It is in theory possible to have empty clusters with the k-means algorithm. The workaround is easy, though: here I choose my initial centers among the records, and I make sure they are not redundant. I think this is enough to avoid empty clusters, since there will be at least one record per cluster at the beginning.
Therefore there is no test for empty clusters, as they should not happen.
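A sketch of that initialisation strategy (function name and signature are mine, not the notebook's):

```python
import random

def initial_centers(records, n_clusters, seed=None):
    """Pick n_clusters distinct records as initial centers.

    Drawing the centers among the records and refusing duplicates
    guarantees each cluster starts with at least one member, which
    avoids empty clusters at the first assignment step.
    """
    rng = random.Random(seed)
    unique = list(dict.fromkeys(records))  # deduplicate, keep order
    if len(unique) < n_clusters:
        raise ValueError("fewer distinct records than clusters")
    return rng.sample(unique, n_clusters)

centers = initial_centers(["aab", "aab", "bbc", "ccd"], 2, seed=0)
# two distinct sequences, never the same one twice
```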
About clustering reproducibility
I am not sure I understood everything you asked. I carried out the clustering of the same 270 PB sequences from an MD trajectory 100 times, and the succession of clusters along the trajectory is not always the same. The figure is difficult to analyze; I will try to come up with something more quantitative and more readable.
About non-change plot
It's easy to get the number of sequences that change group at each iteration. I made a crude prototype, and indeed that number decreases at each iteration until it reaches 0. I'll have the plot available in a future version of the notebook. I may even use it as a convergence criterion, as it is faster to compute than what I currently do.
About the user interface
This notebook is just a prototype to validate the algorithm. Once the method is validated, I will implement it in PBclust. At that point I will set a default value for the number of iterations, and I can make some plots available to the user.
I will come back to you all later to define the most pertinent information and plots to expose to the user. My feeling is that we should expose only what is most useful through the command line, and keep the rest accessible through the API, which will mostly be used by advanced users.
On testing the method
I would like to test the pertinence of the clustering on structure similarity within the clusters. What do you usually use to compute GDT TS and TM-scores?
@jbarnoud, About clustering reproducibility
I am not sure I understood everything you asked. I carried out the clustering of the same 270 PB sequences from an MD trajectory 100 times, and the succession of clusters along the trajectory is not always the same. The figure is difficult to analyze; I will try to come up with something more quantitative and more readable.
-> OK, I've fixed our own confusion point. You cannot look at cluster i in run S(t) and check whether it corresponds to cluster i in run S(t+1). What you need is: (a) a confusion table based only on the data associated with each cluster; the principle is to count the number of data points found both in cluster i of run S(t) and in cluster j of run S(t+1). (b) Take the max of each line (or column), which gives you the correspondence between a cluster of S(t) and a cluster of S(t+1). (c) Sum them all and you have the true overlap. Do that cycle after cycle and you will see whether it is reproducible.
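Steps (a)-(c) above could be sketched like this (a proposal, not existing code; names are mine):

```python
import numpy as np

def confusion_overlap(labels_a, labels_b, n_clusters):
    """Confusion table between two clusterings of the same snapshots.

    (a) count snapshots shared by cluster i of run A and cluster j
    of run B, (b) match each cluster of run A to its best overlap in
    run B (max per line), (c) sum the matched counts.
    """
    table = np.zeros((n_clusters, n_clusters), dtype=int)
    for a, b in zip(labels_a, labels_b):
        table[a, b] += 1
    matched = table.max(axis=1).sum()  # best correspondent per cluster
    return table, matched

run1 = [0, 0, 1, 1, 2, 2]
run2 = [2, 2, 0, 0, 1, 1]  # same partition, clusters relabelled
table, matched = confusion_overlap(run1, run2, 3)
# matched == 6: the two runs agree perfectly despite the relabelling
```

If two lines ever share the same best column, `scipy.optimize.linear_sum_assignment` would give a strict one-to-one matching instead of the per-line max.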
@jbarnoud, on testing the method
I would like to test the pertinence of the clustering on structure similarity within the clusters. What do you usually use to compute GDT TS and TM-scores?
It is mainly RMSD. GDT TS and TM-score will not be very sensitive for such highly similar structures. :-)
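For reference, a minimal RMSD after optimal superposition can be computed with the Kabsch algorithm; this sketch assumes pre-matched (n_atoms, 3) coordinate arrays and is not tied to any particular tool:

```python
import numpy as np

def rmsd(P, Q):
    """RMSD between two conformations after optimal superposition
    (Kabsch algorithm).  P and Q are (n_atoms, 3) arrays with the
    same atoms in the same order."""
    P = P - P.mean(axis=0)           # remove translation
    Q = Q - Q.mean(axis=0)
    V, S, Wt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(V @ Wt))
    D = np.diag([1.0, 1.0, d])       # avoid improper rotations
    R = V @ D @ Wt                   # optimal rotation of P onto Q
    return np.sqrt(((P @ R - Q) ** 2).sum() / len(P))
```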
Impressive indeed and very nice.
About scipy, I was just wondering whether the built-in k-means clustering implemented in scipy would be easier to use / quicker.
Great job Jonathan!
RMSD could be a nice measure for the different clusters. The issue with a regular MD (the 270 sequences you tested) is knowing the right number of clusters. Maybe '4' is not a good one, hence the reproducibility is hard to assess.
The issue, I think, with the built-in k-means is that it is really difficult to have a custom distance metric and a custom representation of the centroids.
Dear @jbarnoud,
Amazing work. I've seen your mail.
For me, the simplest way to check its abilities is to simulate totally fake PB series:
one sequence 'aaaaaa...aaaaaaa' repeated twenty times, 'bbbbbbbbbbbbbbbbb...' forty times, and so on. Then test whether it succeeds in clustering them perfectly, and also how it reacts if the number of clusters is 'wrong'.
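Such dummy data could be generated along these lines (a sketch; the function, its parameters, and the optional noise are my additions):

```python
import random

PB_ALPHABET = "abcdefghijklmnop"  # the 16 Protein Block letters

def fake_pb_series(spec, length=60, noise=0.0, seed=None):
    """Generate dummy PB sequences with a known clustering.

    `spec` maps a PB letter to a number of copies, e.g. {'a': 20,
    'b': 40}; each copy is `length` repetitions of that letter,
    optionally with a fraction `noise` of positions mutated at
    random so the clusters are not trivially identical.
    """
    rng = random.Random(seed)
    sequences = []
    for letter, copies in spec.items():
        for _ in range(copies):
            seq = [letter] * length
            for pos in rng.sample(range(length), int(noise * length)):
                seq[pos] = rng.choice(PB_ALPHABET)
            sequences.append("".join(seq))
    return sequences

seqs = fake_pb_series({"a": 20, "b": 40}, length=30, seed=1)
# 60 sequences: 20 copies of 'aaa...a' and 40 copies of 'bbb...b'
```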
I have two questions: (i) is there a control in case of empty clusters, and (ii) how is the profile updated?