@v0dro
Last active October 9, 2015 04:32
Proposal for Ruby Association Grant 2015

Name & Contact

Name: Sameer Deshmukh

Email: sameer.deshmukh93@gmail.com

Twitter: @v0dro

Github: www.github.com/v0dro

Blog: www.v0dro.github.io

Biography

I am a fourth year Computer Engineering student at Pune University, Pune, India.

I am daru's author and have been contributing to the Ruby Science Foundation for the past year. I was selected as a Google Summer of Code 2015 student under SciRuby to further develop daru and integrate it with other important SciRuby projects like statsample, statsample-glm, statsample-timeseries, rb-gsl, nmatrix etc.

I also gave a talk about daru at DeccanRubyConf 2015, Pune, India, which is one of the three major Ruby conferences in India. You can see the video of my talk here. The talk was well received.

In my third year at college I co-authored a research paper on 'Automatic Speech Recognition of Marathi Consonants' (Marathi is my mother tongue). The paper has been published by the IEEE. You can see it here.

I have been contributing to SciRuby for a full year now and am well versed with most of their projects.

Project Title

Time series, categorical data and other improvements for daru and statsample.

Theme - Scientific libraries for mathematics, science & engineering.

Project Details

Background

daru (Data Analysis in RUby) is a ruby gem for analysis, manipulation and visualization of data. Its ultimate aim is to make data analysis easy as pie by providing a Ruby-like interface for complex data analysis tasks. I have been developing daru for the past year or so in co-ordination with members of the Ruby Science Foundation.

Over the past year it has become a robust library with many useful features. Thanks to GSOC 2015, it is also now the de facto data container solution for many other important SciRuby gems like statsample, statsample-glm and statsample-timeseries (with nyaplot integration en route). Two gems from GSOC 2015, mixed_models and gnuplotrb, were built using daru's features. It is also the first library in Ruby to provide a robust API for analysis and manipulation of data indexed on time stamps, a functionality that is important and useful in data analysis.

In just about a year, daru has become a fairly popular gem, with people using it and providing feedback. The latest version received over 500 downloads within a few days of its release.

To gain a better overview of what daru is capable of, please browse through the tutorials created in the form of these iruby notebooks and blog posts.

While daru has grown to become a pretty mature data analysis solution over the past year, a lot of work still remains, some of which I hope to accomplish with this grant.

The main areas of focus over the grant period will be support for categorical data, better support for time series and more robust functions for dealing with missing data. I elaborate on each of these below.

Specific details

The first part of this project will consist of categorical data implementation for daru, statsample and statsample-glm.

Categorical data is a very important concept in statistical analysis. Its primary use is to assign discrete categories to certain data, for example 'Male' and 'Female'. Both R and Python have robust built-in support for categorical data. Currently no Ruby scientific library supports categorical data. Because of its importance in statistical analysis, it is imperative that categorical data be supported in the major Ruby scientific gems (viz. statsample and statsample-glm).

A categorical variable is currently treated like a nominal/ordinal variable in daru, statsample and statsample-glm and thus calculations involving categorical data are not performed accurately.

Support for categorical data is very important, and the need for it is strongly felt in the Ruby community. This mailing list discussion and the issues open here, here and here support this claim.

In this sub-project I will do two things - implement a new index called CategoricalIndex for Daru similar to that in pandas for supporting categorical data and change the regression methods in statsample and statsample-glm so that they support categorical data supplied by daru.
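To illustrate why regression methods need to know that a variable is categorical, here is a minimal sketch of indicator (dummy) coding in plain Ruby. The helper name dummy_code and its return shape are hypothetical and are not part of daru, statsample or statsample-glm; it only shows the kind of transformation that categorical support would perform before a regression is fitted.

```ruby
# Hypothetical helper, not part of daru or statsample.
# Turns a categorical column with k categories into k-1 indicator (0/1)
# columns, using the first category in sorted order as the reference level.
def dummy_code(values)
  categories = values.uniq.sort
  reference, *rest = categories
  columns = rest.each_with_object({}) do |category, cols|
    cols[category] = values.map { |v| v == category ? 1 : 0 }
  end
  { reference: reference, columns: columns }
end

gender = %w[Male Female Female Male Female]
coded  = dummy_code(gender)
# coded[:reference] is the baseline category;
# coded[:columns] maps every other category to its 0/1 column.
```

Coding the categories as indicators, instead of as arbitrary integers, avoids implying an order or distance between categories that the data does not have.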

The second sub-project will consist of better time series support.

Time series data is very common in applications such as Internet of Things (IoT) devices, stock market prices and user data on the web. It mainly involves data that is indexed on a time stamp. Since Ruby is widely used on the web, it is very important to have an interactive and friendly tool for working with this type of data. Currently most developers rely on databases or, worse, delegate these functions to another language. Neither approach is straightforward, and a better tool for this purpose is clearly needed.

The current time series functionality in daru was implemented over the GSOC period, and it currently supports time series with the Daru::DateTimeIndex index. The current functions are usable, but not very robust, i.e. you still cannot perform the full range of time series manipulation activities in daru like you can in pandas. Some examples demonstrating daru's current time series support can be found here.

Most other languages like Python, R and Julia have very good support for this purpose but Ruby is still lagging behind.

This task aims to fill that void.

Implementation overview

Categorical data support will involve two stages:

  • Support categorical data with a new 'categorical' data type and CategoricalIndex index class in daru.
  • Support operations on categorical data from daru on statsample and statsample-glm as well.

Time series support will mainly involve:

  • Better time zone handling
  • More computational tools:
    • Percent change.
    • Generic rolling apply.
    • Window functions like hamming window, hanning, etc. This will require the creation of a new gem, since these are not available in Ruby currently.
    • Binary rolling moments (cov and corr).
    • Exponentially weighted moment functions.
    • Time series resampling.
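
As a rough illustration of some of the computational tools listed above, the following plain-Ruby sketch shows percent change, a generic rolling apply, Hamming window coefficients and a simple recursive exponentially weighted moving average. All method names here are hypothetical and operate on plain arrays; they are not daru's API, and the eventual implementation may look quite different.

```ruby
# All method names below are hypothetical; none of them are daru's API.

# Percent change between consecutive observations.
# The first observation has no predecessor, so its result is nil.
def percent_change(series)
  [nil] + series.each_cons(2).map { |prev, curr| (curr - prev) / prev.to_f }
end

# Generic rolling apply: passes every full window of `size` elements to the
# block. Positions that do not yet have a full window are nil.
def rolling_apply(series, size)
  Array.new(size - 1) + series.each_cons(size).map { |window| yield window }
end

# Hamming window coefficients: w(n) = 0.54 - 0.46 * cos(2*pi*n / (N - 1)).
def hamming_window(length)
  return [1.0] if length == 1
  Array.new(length) { |n| 0.54 - 0.46 * Math.cos(2 * Math::PI * n / (length - 1)) }
end

# Exponentially weighted moving average in its simple recursive form:
# y[0] = x[0], y[t] = alpha * x[t] + (1 - alpha) * y[t - 1].
def ewma(series, alpha)
  series.each_with_object([]) do |x, out|
    out << (out.empty? ? x.to_f : alpha * x + (1 - alpha) * out.last)
  end
end

prices = [10.0, 11.0, 12.1, 11.0]

changes       = percent_change(prices)
rolling_means = rolling_apply(prices, 2) { |w| w.sum / w.size }
window        = hamming_window(5)
smoothed      = ewma(prices, 0.5)
```

Because rolling_apply takes a block, the same method covers rolling mean, rolling sum or any other user-defined statistic.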

Better support for 'wild' data:

  • Sorting with missing data present in Vector and DataFrame.
  • More methods for handling missing data: fill_na, drop_na, etc.
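
The intended behaviour for missing data can be sketched in plain Ruby, using nil to stand in for a missing value. Calling sort on an array that mixes numbers and nil raises an ArgumentError in Ruby, which is why sorting needs explicit handling. The helper names below are hypothetical and only illustrate the idea; they are not the proposed fill_na or drop_na signatures.

```ruby
# Hypothetical helpers; nil stands in for a missing value.

# Sort the present values and keep the missing ones at the end.
def sort_with_missing(values)
  present, missing = values.partition { |v| !v.nil? }
  present.sort + missing
end

# Replace every missing value with a fixed fill value.
def fill_missing(values, fill)
  values.map { |v| v.nil? ? fill : v }
end

# Drop missing values entirely.
def drop_missing(values)
  values.compact
end

readings = [3, nil, 1, nil, 2]

sorted  = sort_with_missing(readings)  # present values in order, nils last
filled  = fill_missing(readings, 0)
dropped = drop_missing(readings)
```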

Documentation overview: As I did for GSOC, I will closely document every feature written during the grant period with the help of iruby notebooks and blog posts. I have already written many for daru during the GSOC period; you can see them here and here. The new ones will follow a similar format.

Project Deliverables

Mid term 30 October - 26 December

Support for categorical data in daru, statsample and statsample-glm.

Better support for 'wild' data.

Lots of documentation in the form of blogs, iruby notebooks and screencasts to make it easier for people to use categorical data with daru and statsample.

Deliverables - New statsample, statsample-glm and daru supporting categorical data. Documentation for all new features.

End term 26 December - 29 February

Implement time series functions.

Lots of documentation showing the latest support for time series.

Deliverables - New daru with lots of new additions for better support for time-based data.

Optional deliverables

One of the leading reasons for using time series analysis is the analysis of data from stock markets. This data is frequently scraped from the relevant website, but I think a better way would be to provide a function in daru that retrieves this data and loads it directly into a daru data structure.

To implement this kind of functionality, I will optionally add Daru::DataFrame.from_yahoo (for getting data from Yahoo finance) and Daru::DataFrame.from_google (for getting data from Google finance). Users will be able to directly specify the names of the companies whose stocks they are looking for.

For example, Daru::DataFrame.from_yahoo(['AAPL', 'IBM', 'GOOG'], from: '2012-2-4', to: '2012-2-8') will retrieve all the stock data of Apple, IBM and Google from 2012-2-4 to 2012-2-8 in a Daru::DataFrame. Users can then use any of daru's functions for analysis or visualization of the time series.

Mentors and support

My GSOC mentor, Carlos Agarie, has kindly volunteered to mentor me during the grant period. Carlos was the admin for SciRuby during GSOC and has been associated with SciRuby for many years now. He has a very good understanding of all the SciRuby gems and is very keen on developing them further.

You can contact Carlos on his email at carlos.agarie@gmail.com.
