GSoC 2017 Final Report

Shogun Detox II: Codebase improvements and finalization of the new Tags and Serialization frameworks

Student: Giovanni De Toni
Organization: The Shogun Toolbox
Mentors: Viktor Gal, lambday

Abstract

The Shogun Toolbox is a well-established machine learning project that provides efficient algorithm implementations for a wide range of applications, with multi-language support (thanks to SWIG magic). Unfortunately, because it was built by many hands over many years, its code has become hard to maintain and extend, and it does not use many of the programming techniques and components that have appeared since Shogun was founded. The time has come to blow some fresh air (and some fresh code) into Shogun's depths. This project aims to correct and update the codebase and to complete the integration of several new features that will make it more modular and easier to use. My efforts will focus on: integrating the new Tags and serialization frameworks, replacing old-style macros with C++11 smart pointers, enabling premature stopping of ML algorithms and, last but not least, implementing a useful (and beautiful) progress bar that shows a visual representation of an algorithm's execution.

By the first mid-term evaluation, my mentor and I realized that some of the tasks detailed above (the smart pointers refactor and the Tags finalization) could not be completed before the end of GSoC. So we decided to focus on different objectives that would still be valuable contributions to the Shogun Toolbox.

Premature stopping of machine learning algorithms

Description

We wanted to be able to prematurely stop or pause Shogun's machine learning algorithms (for example, if a computation is taking too long to complete) and still get a meaningful result. The general idea was to build an infrastructure based on the observer pattern that could manage both the user's signals and the algorithms' execution.

I implemented this architecture with the RxCpp framework, which is particularly suited to reactive programming. I also implemented an interactive prompt that lets the user select which action to perform (pause, terminate, or prematurely stop the computation). The user can reach this prompt by simply pressing CTRL+C during program execution.

Each algorithm is registered with a global observable (an rxcpp::observable instance) that sends a signal to all the registered machines when they have to stop or pause their computations. Developers can easily build (or extend) an algorithm with premature stopping. To implement custom behaviour, they only have to override the methods on_next(), on_pause() and on_complete(), which are defined in CMachine.h. A macro named COMPUTATION_CONTROLLERS is also available to enable premature stopping inside train_machine() methods (this feature also works in multithreaded environments).
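To give a rough feel for the mechanism, here is a minimal sketch that uses only the C++ standard library. It is not Shogun's code and does not use RxCpp: ToySignalHub, ToyMachine, StopAfter and the Signal enum are hypothetical names made up for this illustration. The hook names on_next(), on_pause() and on_complete() mirror the ones mentioned above, but which user action triggers which hook is an assumption of the sketch.

```cpp
#include <atomic>
#include <cassert>
#include <string>
#include <vector>

// Actions the user can request from the prompt (hypothetical names).
enum class Signal { None, PrematureStop, Pause, Terminate };

// Stand-in for the global observable: holds at most one pending signal.
class ToySignalHub
{
public:
    void emit(Signal s) { m_pending.store(s); }
    // Fetch the pending signal and clear it in one atomic step.
    Signal take() { return m_pending.exchange(Signal::None); }

private:
    std::atomic<Signal> m_pending{Signal::None};
};

// Hypothetical machine that polls the hub before every training step.
class ToyMachine
{
public:
    explicit ToyMachine(ToySignalHub& hub) : m_hub(hub) {}
    virtual ~ToyMachine() = default;

    // Run up to max_iter steps; return how many were completed.
    int train(int max_iter)
    {
        assert(max_iter >= 0);
        m_cancel = false;
        int done = 0;
        for (int i = 0; i < max_iter; ++i)
        {
            dispatch(m_hub.take());
            if (m_cancel)
                break; // keep the partial result computed so far
            step(i);
            ++done;
        }
        return done;
    }

    std::vector<std::string> events; // record of the hooks that fired

protected:
    // Assumed mapping: premature stop -> on_next, pause -> on_pause,
    // terminate -> on_complete. Subclasses may override any of them.
    virtual void on_next() { events.push_back("stop"); m_cancel = true; }
    virtual void on_pause() { events.push_back("pause"); }
    virtual void on_complete() { events.push_back("terminate"); m_cancel = true; }
    virtual void step(int /*iteration*/) {}

    ToySignalHub& m_hub;

private:
    void dispatch(Signal s)
    {
        switch (s)
        {
        case Signal::PrematureStop: on_next(); break;
        case Signal::Pause: on_pause(); break;
        case Signal::Terminate: on_complete(); break;
        case Signal::None: break;
        }
    }

    bool m_cancel = false;
};

// Simulates a user who requests a premature stop after n steps.
class StopAfter : public ToyMachine
{
public:
    StopAfter(ToySignalHub& hub, int n) : ToyMachine(hub), m_n(n) {}

protected:
    void step(int i) override
    {
        if (i + 1 == m_n)
            m_hub.emit(Signal::PrematureStop);
    }

private:
    int m_n;
};
```

With this sketch, a StopAfter machine built with n = 3 and trained for at most 10 iterations stops once the third step has finished, and the signal is only noticed at the next poll. Polling once per iteration is the design choice that keeps the partial result usable: the loop is never interrupted in the middle of a step.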

See this gist for a practical approach.

List of commits

Pull Request Description
#3845 Add RxCpp to Shogun's CMake.
#3848 Refactor CSignal class to use RxCpp.
#3855 Add RxCpp to Docker image for testing purposes.
#3858 Add premature stopping methods to CMachine.
#3875 Replace old CSignal::cancel_computations with the new cancel_computation().

Parameter observers

Description

We also wanted to be able to watch and monitor the parameters of Shogun's objects. For example, we might be interested in how the weights of a LARS model change during training, or we might want to retrieve a specific trained model from a cross-validation run.

This was done by extending the CSGObject class with the ability to "emit" parameter values. I also implemented a series of ParameterObservers that can be attached to any Shogun object. When the object emits measured values, the observers catch them and process them according to their purpose.
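The emit/observe idea can be sketched in a few lines of standard C++. This is only an illustration under stated assumptions: ToyObservable, ToyModel and HistoryObserver are hypothetical names, the model is a one-parameter gradient descent rather than LARS, and none of this reflects the actual CSGObject or ParameterObserver interfaces.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical observable base class (not Shogun's CSGObject).
class ToyObservable
{
public:
    using Observer =
        std::function<void(int step, const std::string& name, double value)>;

    // Attach an observer; every emitted value is forwarded to it.
    void subscribe(Observer obs) { m_observers.push_back(std::move(obs)); }

protected:
    // Called by subclasses whenever a parameter value should be published.
    void emit(int step, const std::string& name, double value) const
    {
        for (const auto& obs : m_observers)
            obs(step, name, value);
    }

private:
    std::vector<Observer> m_observers;
};

// Toy model: gradient descent on f(w) = (w - 3)^2, emitting w at each step.
class ToyModel : public ToyObservable
{
public:
    double train(int steps, double lr = 0.25)
    {
        double w = 0.0;
        for (int t = 0; t < steps; ++t)
        {
            w -= lr * 2.0 * (w - 3.0); // gradient of (w - 3)^2 is 2 (w - 3)
            emit(t, "w", w);
        }
        return w;
    }
};

// Observer that stores the history of every parameter it receives.
struct HistoryObserver
{
    std::map<std::string, std::vector<double>> history;

    void operator()(int /*step*/, const std::string& name, double value)
    {
        history[name].push_back(value);
    }
};
```

An observer is attached with a lambda that forwards to it, for example `model.subscribe([&hist](int s, const std::string& n, double v) { hist(s, n, v); });`. After training, `hist.history["w"]` holds one entry per step, which is the kind of series an observer could then write out to a log or event file.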

One cool feature is that some of these parameter observers can serialize information about Shogun objects to TensorFlow event files, which can then be rendered with TensorBoard for some nice data visualization. To achieve this, my mentor built a cool C++ library called TFLogger, which lets us write TensorFlow event files directly from Shogun.

I've made a Python example to show how easy it is to use the observers.

List of commits

Pull Request Description
#3877 Add ParameterObserverInterface class and implementations for Tensorboard. Add also SGObject observable.
#3911, #3912 Add Protobuf and TFLogger to Docker for testing purposes.
#3925 Add a way to show which class parameters can be observed.
#3929 Convert SGObject observable to observable.timestamp().
#3939 First round of refactoring of the parameter observers feature.
#3953 Apply the parameter observers feature to CrossValidation.
#3967 Polish the new CrossValidation class.
#3969 Add cookbook to explain how to use parameter observers.

Progress bar

Description

This was a plain and simple task, and the first thing I developed, during the first two weeks of GSoC. The basic idea was to create a new progress bar to replace the old C-style SG_PROGRESS macro and then apply it to Shogun's algorithms.

The new progress bar is implemented as a header-only library to simplify its usage. It can be used in a C++11 range-based loop, for (auto i : progress(range(10))), but it also offers methods to update the progress bar manually (this was done to work around an OpenMP limitation, since it does not support the C++11 range-based loop yet).

Please have a look at my first blog post to get more detailed information about how the progress bar works and how to use it.

List of commits

Pull Request Description
#3745 Add PRange class which will substitute the old SG_PROGRESS.
#3828 Add multithreaded progress bar (works fine inside OpenMP environments).
#3829 Add boolean flag to the progress bar.
#3831 Refactor and polish progress bar code and add documentation.
#3836 Replace old SG_PROGRESS code with the new progress bar.

Future Plans

The features are almost complete: the main functionality is in place and works as intended. I've already started to apply them to Shogun's codebase, but there is still a lot to do. The toolbox is huge and messy, and there are many places where it needs to be changed. After GSoC I also plan to work on the tasks I could not complete during these months, because I think they will improve Shogun's quality.

Other Contributions

Pull Request Description
#3813 Modify check_format.sh so that it can be used on local environments.
#3814 Add information about the newly introduced style checks inside DEVELOPING.md.
#3816 Refactor Some<> to make it a bit more standalone.
#3821 Add new constructor and get() method to Some<>.
#3827 Fix check_format.sh script when the destination branch is not develop.
#3919 Replace rx.hpp headers with rx-lite.hpp and fix some RxCpp memory leaks.
#3927 Fix OSX and FreeBSD build.
#3934 Fix clang-format script when dealing with deleted files.
#3840 Port SGObject and its unit tests to Some<> (not merged yet).
#3959 Add premature stopping features to some classes.

I've also added some documentation pages which describe the features I built and show how to use them:
