Name : Ayush Suhane
Email : firstname.lastname@example.org
Github : ayushsuhane
Blog : Website
This GIST includes all the work which was done during the period of GSoC 2018 with MDAnalysis - NumFOCUS under the title "Improve Distance Search Methods in MDAnalysis".
During this period, I realized the importance of several things that were not on my radar before GSoC. For me, the highlights of the project were learning how much unit tests and benchmarks matter for important modules (especially when performance is the goal) and trying out fast calculations in Python using Cython (even though it was scary at first). While I am not highly skilled in any of these, the exposure gave me a flavour of the software practices to follow while writing publishable code. Before going into the specifics of my contributions to the project, I wish to thank my super helpful mentors @richardjgowers and @jbarnoud, and the other awesome core developers of MDAnalysis: @orbeckst, @kain88-de, @seb-buch and @zemanj. The community was helpful at the times when I needed it the most.
The first evaluation mostly involved initial benchmarks of the current implementations of different methods in MDAnalysis, as well as of some external libraries, to find the best method to incorporate in MDAnalysis to improve the distance evaluations. During this period, we tested various data structures such as cell lists, KD-trees and octrees for different cases relevant to MD simulations, with and without Periodic Boundary Conditions (PBC). The benchmarks for these cases reside in a separate directory. To assist the reader, a README.md with information on the individual files is also available in the same repository. During this time, one PR (PR #13) was merged into the cellgrid module of MDAnalysis based on the benchmarks.
Once an appropriate method was selected, the second evaluation involved contributions to the MDAnalysis repository, primarily on two fronts.
First was the addition of a function which selects the best method, based on some well-defined rules, for faster computations.
On the second front, some modifications to increase the efficiency of existing methods and to allow easy extension to PBC were implemented in the form of `augment_coordinates`. During this time, a PR (PR #1941) was merged, adding the `capped_distance` functionality for automatic selection of an efficient method, followed by another PR (PR #1977) which creates the relevant duplicate particles to mimic periodic boundary conditions, so that they can be used by any non-PBC-aware algorithm.
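The duplicate-particle idea behind `augment_coordinates` can be illustrated with a minimal sketch. It is plain Python for an orthorhombic box only, not the Cython routine from the PR; the function name `augment_orthorhombic` and its return format are made up for this illustration. Every particle lying within the cutoff of a box face is copied to the opposite side, so that a search which knows nothing about PBC still sees its periodic neighbours.

```python
from itertools import product


def augment_orthorhombic(coords, box, cutoff):
    """Return periodic images lying within `cutoff` outside an orthorhombic box.

    coords : list of (x, y, z) tuples, each with 0 <= x < box[0], etc.
    box    : (Lx, Ly, Lz) box edge lengths
    Returns (images, indices), where indices[k] is the index of the
    original particle that images[k] duplicates.
    """
    images, indices = [], []
    for idx, pos in enumerate(coords):
        # Per axis, collect the shifts (in box lengths) that place a copy
        # within `cutoff` of the box: 0 always, +1 if the particle is near
        # the lower face, -1 if it is near the upper face.
        shifts_per_axis = []
        for x, length in zip(pos, box):
            shifts = [0]
            if x < cutoff:
                shifts.append(1)
            if x >= length - cutoff:
                shifts.append(-1)
            shifts_per_axis.append(shifts)
        for shift in product(*shifts_per_axis):
            if shift == (0, 0, 0):
                continue  # that is the original particle, not an image
            image = tuple(x + s * length for x, s, length in zip(pos, shift, box))
            images.append(image)
            indices.append(idx)
    return images, indices


# Hypothetical usage: one particle near the lower x face, one in the middle.
coords = [(0.5, 5.0, 5.0), (5.0, 5.0, 5.0)]
images, indices = augment_orthorhombic(coords, box=(10.0, 10.0, 10.0), cutoff=1.0)
```

Only the first particle gets an image here, shifted by one box length along x. The `indices` list is what lets a caller map any pair found with an image back to the original particle.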
For the final evaluation, `augment_coordinates`, which was written in Cython, is used to increase the efficiency of KDTree. Furthermore, `augment_coordinates` is used to replace the dependency on `Bio.KDTree` with a more stable alternative (PR #1990). Additional methods like `self_search` were also written to provide fast, accessible functions for different families of distance searches. Simultaneously, a `self_capped_distance` function was introduced, with a signature similar to `lib.distances.self_distance_array`, to automatically choose a method for self searches.
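To make concrete what a capped self search computes, here is a brute-force reference sketch in plain Python. It is not the MDAnalysis code: the name `self_capped_bruteforce`, the orthorhombic-only box handling and the return format are assumptions made for illustration. It returns every unique pair of particles whose separation is within the cutoff, optionally applying the minimum-image convention.

```python
import math


def self_capped_bruteforce(coords, max_cutoff, box=None):
    """Brute-force self search: all unique pairs within `max_cutoff`.

    coords : list of (x, y, z) tuples
    box    : optional (Lx, Ly, Lz) of an orthorhombic box; when given, the
             minimum-image convention is applied along each axis
    Returns (pairs, distances) with pairs as (i, j) tuples, i < j.
    """
    pairs, distances = [], []
    n = len(coords)
    for i in range(n - 1):
        for j in range(i + 1, n):
            sq = 0.0
            for k in range(3):
                d = coords[j][k] - coords[i][k]
                if box is not None:
                    # Wrap the separation into [-L/2, L/2].
                    d -= box[k] * round(d / box[k])
                sq += d * d
            dist = math.sqrt(sq)
            if dist <= max_cutoff:
                pairs.append((i, j))
                distances.append(dist)
    return pairs, distances


# Hypothetical usage: particles 0 and 1 are neighbours only across the boundary.
coords = [(0.5, 5.0, 5.0), (9.5, 5.0, 5.0), (5.0, 5.0, 5.0)]
pairs_open, _ = self_capped_bruteforce(coords, max_cutoff=2.0)
pairs_pbc, dists_pbc = self_capped_bruteforce(coords, max_cutoff=2.0, box=(10.0, 10.0, 10.0))
```

The double loop scales quadratically with the number of particles, which is why tree-based or grid-based searches become attractive for larger systems.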
Alongside increasing the efficiency of the existing methods, a big chunk of work went into including the cell-list algorithm, which showed superior performance in the initial benchmarks. With huge help from @seb-buch, we managed to get the code merged into the MDAnalysis repository (PR #2008). This was followed by the inclusion of the `self_capped_distance` functions in the same PR (PR #2008).
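For readers unfamiliar with cell lists, the following is a minimal pure-Python sketch of the general technique, without PBC. It is not the cythonized implementation from the PR, and the function name `cell_list_pairs` is hypothetical. Particles are binned into cubic cells whose edge equals the cutoff, so each particle only needs to be compared against particles in its own cell and the 26 adjacent ones.

```python
from collections import defaultdict
from itertools import product


def cell_list_pairs(coords, cutoff):
    """Find all unique pairs within `cutoff` using a simple cell list (no PBC)."""
    # Bin every particle into a cubic cell whose edge equals the cutoff, so
    # any neighbour within `cutoff` must sit in the same or an adjacent cell.
    cells = defaultdict(list)
    for idx, (x, y, z) in enumerate(coords):
        key = (int(x // cutoff), int(y // cutoff), int(z // cutoff))
        cells[key].append(idx)

    cutoff_sq = cutoff * cutoff
    pairs = []
    for (cx, cy, cz), members in cells.items():
        # Visit the cell itself and its 26 neighbours.
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            # .get() avoids creating empty cells while iterating.
            neighbours = cells.get((cx + dx, cy + dy, cz + dz))
            if not neighbours:
                continue
            for i in members:
                for j in neighbours:
                    if i >= j:
                        continue  # count each pair once, skip self-pairs
                    sq = sum((a - b) ** 2 for a, b in zip(coords[i], coords[j]))
                    if sq <= cutoff_sq:
                        pairs.append((i, j))
    return sorted(pairs)


# Hypothetical usage: particles 1 and 2 sit in different cells but are close.
coords = [(0.1, 0.1, 0.1), (0.9, 0.1, 0.1), (1.3, 0.1, 0.1), (5.0, 5.0, 5.0)]
pairs = cell_list_pairs(coords, cutoff=1.0)
```

For a roughly uniform density, the number of distance evaluations per particle stays bounded, so the total cost grows about linearly with the number of particles instead of quadratically.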
The next step was to use `capped_distance` in various applications. Using `capped_distance`, we demonstrated improved performance of the Radial Distribution Function, Guess Bonds, and distance-based selections. Furthermore, a test case for guess bonds was also introduced in the benchmarks to assess the improvements in guessing bonds over
Next steps include increased usage of `capped_distance` in distance-based analysis and making it the default function throughout the code.
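As an illustration of why a capped search suits an analysis like the Radial Distribution Function mentioned above, the sketch below turns a list of pair distances into g(r). It is a simplified, single-species version in plain Python, not the MDAnalysis RDF code; the function name `rdf_from_distances` and its parameters are assumptions. Only pairs closer than `r_max` ever contribute to the histogram, so distances beyond the cutoff never need to be computed.

```python
import math


def rdf_from_distances(distances, n_particles, volume, r_max, n_bins):
    """Histogram unique-pair distances into g(r) for a single-species system.

    distances : distances of unique pairs (i < j), all expected below r_max
    Returns (bin_centres, g) as two lists of length n_bins.
    """
    dr = r_max / n_bins
    counts = [0] * n_bins
    for d in distances:
        b = int(d / dr)
        if 0 <= b < n_bins:
            counts[b] += 1

    density = n_particles / volume
    centres, g = [], []
    for b, count in enumerate(counts):
        r_lo, r_hi = b * dr, (b + 1) * dr
        shell_volume = 4.0 / 3.0 * math.pi * (r_hi ** 3 - r_lo ** 3)
        # Pairs expected in this shell for an ideal gas of the same density;
        # the factor 1/2 is there because each pair is listed only once.
        ideal_pairs = 0.5 * n_particles * density * shell_volume
        centres.append(r_lo + 0.5 * dr)
        g.append(count / ideal_pairs)
    return centres, g


# Hypothetical usage with three pair distances from a capped search.
centres, g = rdf_from_distances([0.5, 1.5, 1.6], n_particles=10,
                                volume=1000.0, r_max=2.0, n_bins=2)
```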
Overall, for the benchmark cases, we managed to improve the performance by the following factors:
- Selections : ~10x
- GuessBonds : ~10x
- RDF : ~3x
List of Contributions
Total Pull Requests Created: 11
Repository - MDAnalysis/cellGrid
- PR #13 - Adds a keyword to choose an optimized cell size
Repository - MDAnalysis
- PR #1941 - Introduces Capped function for automatic method selection
- PR #1977 - Augment coordinates to handle PBC
- PR #1990 - Replace Biopython.KDtree with
- PR #2006 - Added `self_capped_function` and modified the function to identify bonds between atoms
- PR #2008 - Fast cythonized cell-list algorithm along with its definition in
- PR #2022 - Reduce memory intensive calculations by allowing KDTree for periodic distance calculations
- PR #2013 - Improvement in Radial distribution function
- PR #2035 - Replaced multiple methods for distance selections with
- PR #2041 - Faster individual methods in `capped_function` irrespective of memory consumption
- PR #2045 - Benchmarks for Guess_bonds [WIP]
Blog Post - Github
- Increase usage of `capped_distance` in other distance-based analysis functions like contact matrix.