GSoC 2018 - MDAnalysis

Information

Name : Ayush Suhane

Email : ayush.suhane@gmail.com

Links

Github : ayushsuhane

Blog : Website

Proposal : Improve Distance Search Methods in MDAnalysis

Introduction

This gist covers all the work done during GSoC 2018 with MDAnalysis (under NumFOCUS) for the project titled "Improve Distance Search Methods in MDAnalysis".

During this period, I realized the importance of several things which were not on my radar before GSoC. For me, the highlights of the project were learning how much unit tests and benchmarks matter for important modules (especially when performance is the goal), and trying fast calculations in Python using Cython (even though it was scary at first). While I am not highly skilled in any of these, the exposure gave me a flavour of the software practices to follow when writing publishable code. Before going into the specifics of my contributions to the project, I wish to thank my super helpful mentors @richardjgowers and @jbarnoud, and the other awesome core developers of MDAnalysis: @orbeckst, @kain88-de, @seb-buch and @zemanj. The community was helpful at the times when I needed it the most.

Summary

The first evaluation mostly involved initial benchmarks of the existing distance search methods in MDAnalysis, as well as of some external libraries, to find the best method to incorporate into MDAnalysis to improve distance evaluations. During this period, we tested various data structures (cell-lists, KD-trees, octrees) for cases relevant to MD simulations, with and without Periodic Boundary Conditions (PBC). The benchmarks for these cases reside in a separate directory. To assist the reader, a README.md with information on the individual files is also available in the same repository. During this time, one PR (PR #13) was merged into the cellgrid module of MDAnalysis based on the benchmarks.
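For context, the baseline that the tree and cell-list structures are measured against is a brute-force search over all pairs. The sketch below is a minimal pure-Python illustration of such a search with and without PBC. It assumes an orthorhombic box, and the function names are hypothetical; it is not the MDAnalysis implementation.

```python
import math


def minimum_image_distance(a, b, box):
    """Distance between points a and b in an orthorhombic periodic box."""
    total = 0.0
    for x, y, length in zip(a, b, box):
        d = x - y
        # Shift the displacement to the nearest periodic image along this axis.
        d -= length * round(d / length)
        total += d * d
    return math.sqrt(total)


def brute_force_capped(reference, configuration, max_cutoff, box=None):
    """Return (pairs, distances) for every pair closer than max_cutoff.

    Cost grows as len(reference) * len(configuration), which is why
    tree or grid based searches pay off for large systems.
    """
    pairs, distances = [], []
    for i, a in enumerate(reference):
        for j, b in enumerate(configuration):
            if box is not None:
                d = minimum_image_distance(a, b, box)
            else:
                d = math.dist(a, b)
            if d < max_cutoff:
                pairs.append((i, j))
                distances.append(d)
    return pairs, distances
```

With a box of edge 10, the points (0.5, 0.5, 0.5) and (9.5, 0.5, 0.5) are 9 apart without PBC but only 1 apart through the periodic boundary, so only the PBC-aware search reports them for a cutoff of 2.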

Once an appropriate method was selected, the second evaluation involved contributions to the MDAnalysis repository, primarily on two fronts. The first was the addition of a function which selects the best method based on some well-defined rules for faster computations. On the second front, modifications to increase the efficiency of existing methods and to make PBC support easy to extend were implemented in the form of augment_coordinates. During this time, a PR (PR #1941) was merged for the capped_distance functionality, which automatically selects an efficient method, followed by another PR (PR #1977) which creates the relevant duplicate particles to mimic periodic boundary conditions, so that any non-PBC-aware algorithm can be used.
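The idea of duplicating particles near the box faces can be sketched as follows. This is a simplified pure-Python illustration, assuming an orthorhombic box, coordinates already wrapped into the box, and a cutoff smaller than the box edge; the function name is hypothetical, and the merged augment_coordinates is written in Cython and is not reproduced here.

```python
import itertools


def augment_orthorhombic(coords, box, cutoff):
    """Return (images, indices) of periodic copies lying within `cutoff` outside the box.

    indices[k] is the index in `coords` of the particle that images[k] duplicates.
    Assumes 0 <= x < L along every axis and cutoff < L (hypothetical helper).
    """
    images, indices = [], []
    for idx, pos in enumerate(coords):
        # For each axis, collect the box shifts that place a copy in the halo region.
        options = []
        for x, length in zip(pos, box):
            shifts = [0]
            if x < cutoff:
                shifts.append(1)   # near the lower face: copy lands beyond the upper face
            if x >= length - cutoff:
                shifts.append(-1)  # near the upper face: copy lands below the lower face
            options.append(shifts)
        # Combine per-axis shifts so that edge and corner images are generated too.
        for shift in itertools.product(*options):
            if any(shift):  # skip the unshifted original particle
                images.append(
                    tuple(x + s * length for x, s, length in zip(pos, shift, box))
                )
                indices.append(idx)
    return images, indices
```

A search that knows nothing about PBC can then be run on the original coordinates plus the images, and any hit on an image is mapped back to its original particle through the returned indices.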

For the final evaluation, augment_coordinates, which was written in Cython, was used to increase the efficiency of KDTree. It was also used to replace the dependency on Bio.KDTree with the more stable scipy.spatial.cKDTree (PR #1990). Additional methods, search and self_search, were written to provide fast, accessible functions for the different families of distance searches. Simultaneously, a self_capped_distance function was introduced, with a signature similar to lib.distances.self_distance_array, to automatically choose a method for self searches. Alongside increasing the efficiency of the existing methods, a big chunk of the work was to include the cell-list algorithm, which showed superior performance in the initial benchmarks. With huge help from @seb-buch, we managed to get the code merged into the MDAnalysis repository (PR #2008). This was followed by the inclusion of nsgrid (the cell-list algorithm) in the capped_distance and self_capped_distance functions in the same PR (PR #2008).
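To give a flavour of why a cell-list search is fast, here is a minimal pure-Python sketch of a self search on a grid. It assumes an orthorhombic periodic box and a cutoff no larger than half the shortest box edge; the function name is hypothetical, and this is not the Cython nsgrid code merged in PR #2008.

```python
import math
from collections import defaultdict
from itertools import product


def cell_list_self_pairs(coords, box, cutoff):
    """Unique pairs (i, j), i < j, closer than cutoff in an orthorhombic periodic box."""
    # Choose the number of cells per axis so that every cell edge is >= cutoff.
    ncells = [max(1, int(length // cutoff)) for length in box]
    sizes = [length / n for length, n in zip(box, ncells)]

    # Bin every particle into its cell.
    cells = defaultdict(list)
    for idx, pos in enumerate(coords):
        key = tuple(int(x // s) % n for x, s, n in zip(pos, sizes, ncells))
        cells[key].append(idx)

    def distance(a, b):
        # Minimum-image distance in the periodic box.
        total = 0.0
        for x, y, length in zip(a, b, box):
            d = x - y
            d -= length * round(d / length)
            total += d * d
        return math.sqrt(total)

    # Only particles in the same or adjacent cells can be within the cutoff.
    pairs = set()
    for key, members in cells.items():
        for offset in product((-1, 0, 1), repeat=3):
            neighbor = tuple((k + o) % n for k, o, n in zip(key, offset, ncells))
            for i in members:
                for j in cells.get(neighbor, ()):
                    if i < j and distance(coords[i], coords[j]) < cutoff:
                        pairs.add((i, j))
    return sorted(pairs)
```

Because each particle is compared only with the contents of its 27 surrounding cells, the cost scales roughly linearly with the number of particles at fixed density, instead of quadratically as in a brute-force search.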

The next step was to use capped_distance in various applications. Using capped_distance, we demonstrated improved performance for the Radial Distribution Function, Guess Bonds, and distance-based selections. Furthermore, a guess-bonds test case was added to the benchmarks to track improvements in bond guessing over time (PR #2045).

Next steps include wider use of capped_distance in distance-based analysis, and making it the default function throughout the code.

Overall, for the benchmark cases, we managed to improve performance by the following factors:

  • Selections : ~10x
  • GuessBonds : ~10x
  • RDF : ~3x

List of Contributions

Total Pull Requests Created: 11

Repository - MDAnalysis/cellGrid

  1. PR #13 - Adds a keyword to choose optimized cell-size

Repository - MDAnalysis

  1. PR #1941 - Introduces Capped function for automatic method selection

  2. PR #1977 - Augment coordinates to handle PBC

  3. PR #1990 - Replace Biopython.KDtree with scipy.spatial + augment_coordinates

  4. PR #2006 - Added self_capped_function and modified topology.guess_bonds, the function that identifies bonds between atoms

  5. PR #2008 - Fast cythonized cell-list algorithm along with its definition in capped_function

  6. PR #2022 - Reduce memory intensive calculations by allowing KDTree for periodic distance calculations

  7. PR #2013 - Improvement in Radial distribution function

  8. PR #2035 - Replaced multiple methods for distance selections with capped_function

  9. PR #2041 - Faster individual methods in capped_function irrespective of memory consumption

  10. PR #2045 - Benchmarks for Guess_bonds [WIP]

Blog Post - Github

Pending Work

  1. Increase usage of capped_distance in other distance-based analysis functions, such as the contact matrix.