GSoC 2017 Final Report
Shogun Detox II: Codebase improvements and finalization of the new Tags and Serialization frameworks
The Shogun Toolbox is a well-established machine learning project that provides efficient algorithm implementations, usable in a wide range of applications and from multiple languages (thanks to SWIG magic). Unfortunately, since it was built by many hands over many years, its code has become hard to maintain and extend, and it does not use many of the programming techniques and components that have appeared since Shogun was founded. The time has come to breathe some fresh air (and some fresh code) into Shogun's depths. This project aims to correct and update the codebase and to complete the integration of many new features that will make it more modular and easier to use. My efforts will be focused on integrating the new Tags and serialization frameworks, replacing old-style macros with C++11 smart pointers, enabling premature stopping of ML algorithms and, last but not least, implementing a useful (and beautiful) progress bar that gives a visual representation of an algorithm's execution.
By the first mid-term evaluation, my mentor and I realized that some of the tasks detailed above (the smart pointer refactor and the Tags finalization) would be impossible to complete before the end of GSoC. So we decided to focus on different objectives that would still be valuable contributions to the Shogun Toolbox.
Table of Contents
- Premature stopping of machine learning algorithms
- List of commits
- Parameter observers
- List of commits
- Progress Bar
- List of commits
- Future Plans
- Other Contributions
Premature stopping of machine learning algorithms
We wanted to be able to prematurely stop or pause Shogun's machine learning algorithms (for example, if the computation is taking too long to complete) and still get a meaningful result. The general idea was to build an infrastructure based on the observer pattern, capable of handling user signals and controlling the algorithms' execution.
I implemented this architecture using the RxCpp framework, which is particularly well suited to reactive programming. I also implemented an interactive prompt that lets the user choose which action to perform (pause, terminate, or prematurely stop the computation). The user can reach this prompt by simply pressing CTRL+C during program execution.
Each algorithm is registered with a global observable (an rxcpp::observable instance), which sends a signal to all registered machines when they have to stop or pause their computations.
Developers can easily build (or extend) an algorithm with premature stopping. If they want to implement custom behaviour, they only have to override methods such as on_complete(), which are defined in CMachine.h. A macro named COMPUTATION_CONTROLLERS is also available to enable premature stopping inside train_machine() methods (this feature also works in multithreaded environments).
See this gist for a practical approach.
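To give a rough idea of the mechanism, here is a minimal, self-contained C++11 sketch. It is not Shogun's implementation: the rxcpp::observable is replaced by a plain list of callbacks, and StopSignal, Machine, ToyMachine, step() and demo_premature_stop() are hypothetical names made up for the illustration. Only on_complete() mirrors a name from the text above, and the way it is called here is an assumption.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for the global rxcpp::observable: a list of
// callbacks that are all invoked when a stop is requested.
class StopSignal
{
public:
    void subscribe(std::function<void()> callback)
    {
        callbacks_.push_back(std::move(callback));
    }

    void emit() const
    {
        for (const auto& callback : callbacks_)
            callback();
    }

private:
    std::vector<std::function<void()>> callbacks_;
};

// Hypothetical base class that registers itself with the signal source.
class Machine
{
public:
    explicit Machine(StopSignal& signal)
    {
        signal.subscribe([this] {
            cancelled_.store(true);
            on_complete();
        });
    }
    virtual ~Machine() = default;

    // Runs at most max_iter steps and returns how many were executed.
    std::size_t train(std::size_t max_iter)
    {
        std::size_t done = 0;
        while (done < max_iter && !cancelled_.load())
        {
            step(done);
            ++done;
        }
        return done;
    }

protected:
    virtual void step(std::size_t iteration) = 0;
    // Hook invoked when a stop signal arrives; override for custom behaviour.
    virtual void on_complete() {}

private:
    std::atomic<bool> cancelled_{false};
};

// Toy algorithm: accumulates a running sum and raises the stop signal
// itself during its third step, to simulate the user interrupting it.
class ToyMachine : public Machine
{
public:
    explicit ToyMachine(StopSignal& signal) : Machine(signal), signal_(signal) {}

    double partial_result = 0.0;
    bool completed_early = false;

protected:
    void step(std::size_t iteration) override
    {
        partial_result += 1.0;
        if (iteration == 2)
            signal_.emit();
    }

    void on_complete() override { completed_early = true; }

private:
    StopSignal& signal_;
};

// Asks for 100 iterations but gets interrupted after three.
inline std::size_t demo_premature_stop()
{
    StopSignal signal;
    ToyMachine machine(signal);
    const std::size_t done = machine.train(100);
    assert(machine.completed_early); // the on_complete() hook ran
    return done;
}
```

The training loop only checks a flag between iterations, so the partial result accumulated so far stays usable after the stop. In a real program the signal would arrive from another thread (for example the CTRL+C handler), which is why the flag in this sketch is a std::atomic.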
List of commits
|#3858||Add premature stopping methods to
Parameter observers
We also wanted to be able to monitor the parameters of Shogun's objects. For example, we might be interested in how the weights of a LARS model change during training, or we might want to retrieve a specific trained model from a cross-validation run.
This was done by extending the CSGObject class with the ability to "emit" parameter values. I also implemented a series of ParameterObservers that can be attached to any Shogun object. When the object emits measured values, the observers catch them and process them according to their purpose.
One cool feature is that some of these parameter observers can be used to serialize information about Shogun objects to TensorFlow event files, which can be rendered with TensorBoard to do some nice data visualization. To achieve this, my mentor built a cool C++ library called TFLogger, which lets us write TensorFlow event files directly from Shogun.
I've made a Python example to show how easy it is to use the observer.
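The emit/observe mechanism can be pictured with the C++11 sketch below. It is a simplified model under my own assumptions, not the CSGObject or ParameterObserver API: ObservedValue, ObservableObject, ToyModel, HistoryObserver and demo_observer() are hypothetical names, and std::function callbacks are used only to keep the example free of dependencies.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// One emitted measurement: which parameter, at which step, with what value.
struct ObservedValue
{
    std::string name;
    std::size_t step;
    double value;
};

// Hypothetical stand-in for an object that can "emit" its parameters.
class ObservableObject
{
public:
    using Observer = std::function<void(const ObservedValue&)>;

    void subscribe(Observer observer)
    {
        observers_.push_back(std::move(observer));
    }

protected:
    // Forwards one measurement to every attached observer.
    void emit(const std::string& name, std::size_t step, double value) const
    {
        const ObservedValue observed{name, step, value};
        for (const auto& observer : observers_)
            observer(observed);
    }

private:
    std::vector<Observer> observers_;
};

// Toy model with a single weight "w" that is halved at every training step.
class ToyModel : public ObservableObject
{
public:
    void train(std::size_t steps)
    {
        double w = 1.0;
        for (std::size_t i = 0; i < steps; ++i)
        {
            w *= 0.5;
            emit("w", i, w);
        }
    }
};

// Observer that records the history of one named parameter.
struct HistoryObserver
{
    std::string watched;
    std::vector<double> history;

    void operator()(const ObservedValue& observed)
    {
        if (observed.name == watched)
            history.push_back(observed.value);
    }
};

// Attaches an observer to the toy model and returns what it recorded.
inline std::vector<double> demo_observer()
{
    ToyModel model;
    HistoryObserver observer{"w", {}};
    model.subscribe(std::ref(observer)); // by reference, so history is kept
    model.train(3);
    assert(observer.history.size() == 3);
    return observer.history;
}
```

The model never needs to know what the observers do with the values: one observer could keep them in memory as above, while another could write them to a file for later visualization.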
List of commits
|#3911, #3912||Add Protobuf and TFLogger to Docker for testing purpose.|
|#3925||Add a way to show which class parameters can be observed.|
|#3929||Convert SGObject observable to observable.timestamp().|
|#3939||First round of refactor of parameter's observers feature.|
|#3953||Apply parameter's observer feature to
|#3967||Polish the new
|#3969||Add cookbook to explain how to use parameter observers.|
Progress Bar
This was a plain and simple task. It was the first thing I developed, during the first two weeks of GSoC. The basic idea was to create a new progress bar to replace the old C-style SG_PROGRESS macro and then to apply it to Shogun's algorithms.
The new progress bar is implemented as a header-only library to simplify its usage. It can be used in a C++11 range-based for loop, as in for (auto i : progress(range(10))), but it also offers methods to update the progress bar manually (this was done to work around a limitation of OpenMP, which does not support the C++11 range-based loop yet).
Please have a look at my first blog post to get more detailed information about how the progress bar works and how to use it.
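Here is a stripped-down sketch of how a wrapper like progress(range(10)) can plug into a range-based for loop. It is an illustration written under assumptions, not the code that lives in Shogun: Range, ProgressRange, print(), demo_progress(), the bar layout and the extra std::ostream parameter are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <iostream>
#include <ostream>
#include <sstream>
#include <string>

// Half-open integer range [first, last); hypothetical stand-in for range().
struct Range
{
    std::size_t first;
    std::size_t last;
};

inline Range range(std::size_t n) { return Range{0, n}; }

// Wraps a Range and redraws a textual bar every time the loop advances.
class ProgressRange
{
public:
    class Iterator
    {
    public:
        Iterator(std::size_t pos, const ProgressRange* owner)
            : pos_(pos), owner_(owner) {}

        std::size_t operator*() const { return pos_; }

        Iterator& operator++()
        {
            ++pos_;
            owner_->print(pos_);
            return *this;
        }

        bool operator!=(const Iterator& other) const
        {
            return pos_ != other.pos_;
        }

    private:
        std::size_t pos_;
        const ProgressRange* owner_;
    };

    ProgressRange(Range r, std::ostream& out) : range_(r), out_(&out) {}

    Iterator begin() const { return Iterator(range_.first, this); }
    Iterator end() const { return Iterator(range_.last, this); }

    // Manual update, for loops that cannot use the range-based syntax.
    void print(std::size_t current) const
    {
        const std::size_t width = 10;
        const std::size_t total = range_.last - range_.first;
        std::size_t done = current > range_.first ? current - range_.first : 0;
        if (done > total)
            done = total;
        const std::size_t filled = total == 0 ? width : done * width / total;
        const std::size_t percent = total == 0 ? 100 : done * 100 / total;
        *out_ << '\r' << '[' << std::string(filled, '=')
              << std::string(width - filled, ' ') << "] " << percent << '%';
    }

private:
    Range range_;
    std::ostream* out_;
};

inline ProgressRange progress(Range r, std::ostream& out = std::cout)
{
    return ProgressRange(r, out);
}

// Loops over four items and returns everything the bar printed.
inline std::string demo_progress()
{
    std::ostringstream out;
    std::size_t sum = 0;
    for (auto i : progress(range(4), out))
        sum += i;
    assert(sum == 6); // 0 + 1 + 2 + 3
    return out.str();
}
```

The bar is redrawn from the iterator's increment operator, so the loop body stays untouched. The public print() method plays the role of the manual update: a loop that cannot use the range-based syntax can call it directly with its own counter.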
List of commits
|#3828||Add multithreaded progress bar (works fine inside OpenMP environments).|
|#3829||Add boolean flag to the progress bar.|
|#3831||Refactor and polish progress bar code and add documentation.|
Future Plans
The features are almost complete: the main functionality is in place and works as intended. I've already started to apply them to Shogun's codebase, but there are still many things that need to be done. In fact, the toolbox is huge and messy and there are many places where it needs to be changed. After GSoC I also plan to work on the tasks I could not complete during these months, because I think they will improve Shogun's quality.
Other Contributions
|#3813||Modify check_format.sh so that it can be used on local environments.|
|#3814||Add information about the newly introduced style checks inside DEVELOPING.md.|
|#3816||Refactor Some<> to make it a bit more standalone.|
|#3821||Add new constructor and get() method to Some<>.|
|#3827||Fix check_format.sh script when the destination branch is not develop.|
|#3919||Replaced rx.hpp headers with rx-lite.hpp and fix some RxCpp memory leaks.|
|#3927||Fix OSX and FreeBSD build.|
|#3934||Fix clang-format script when dealing with deleted files.|
|#3840||Port SGObject and its unit tests to Some<> (not merged yet).|
|#3959||Add premature stopping features to some classes.|
I've also added some documentation pages which describe the features I built and show how to use them: