lokeshh/gsoc_blog.md

## gsoc_blog.md

      
GSoC 2016 Summary: Adding categorical data support

Support for categorical data is important for any data analysis tool. This summer I worked on categorical data support so that one can:

easily analyze categorical data in Daru
visualize categorical data
run regression with categorical variables in Statsample and Statsample-GLM

Here's my project page.
Let's talk about each of them in detail.
Analyzing categorical data with Daru

Daru now recognizes categorical data and provides the procedures needed to work with it.
To analyze a categorical variable, simply convert an ordinary vector to a categorical one and you are ready to go.
# Load the dataset
shelter_data = Daru::DataFrame.from_csv '../data/animal_shelter_train.csv'
(This is animal shelter data taken from a Kaggle competition.)

# Tell Daru which variables are categorical
shelter_data.to_category 'OutcomeType', 'AnimalType', 'SexuponOutcome', 'Breed', 'Color'

# Or quantize a numerical variable to categorical
shelter_data['AgeuponOutcome'] = shelter_data['AgeuponOutcome(Weeks)'].cut [0, 1, 4, 52, 260, 1500],
    labels: [:less_than_week, :less_than_month, :less_than_year, :one_to_five_years, :more_than__five_years]

# Do your operations on categorical data
shelter_data['AgeuponOutcome'].frequencies.sort ascending: false
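
As a rough picture of what the quantize-and-count steps above do, here is a minimal plain-Ruby sketch that does not use Daru. The helper names `bin_label` and `frequencies`, the sample ages, and the left-closed, right-open interval convention are assumptions made for illustration; Daru's own boundary handling may differ.

```ruby
# Plain-Ruby sketch of binning a numeric variable and counting the bins.
# Helper names and the [low, high) interval convention are illustrative
# assumptions, not Daru's implementation.
EDGES  = [0, 1, 4, 52, 260, 1500].freeze
LABELS = %i[less_than_week less_than_month less_than_year
            one_to_five_years more_than_five_years].freeze

# Return the label of the interval [edges[i], edges[i + 1]) containing value.
def bin_label(value, edges = EDGES, labels = LABELS)
  index = edges.each_cons(2).find_index { |low, high| value >= low && value < high }
  raise ArgumentError, "#{value} is outside the partitions" if index.nil?

  labels[index]
end

# Count how often each label occurs, most frequent first.
def frequencies(labels)
  counts = labels.each_with_object(Hash.new(0)) { |label, acc| acc[label] += 1 }
  counts.sort_by { |_label, count| -count }.to_h
end

ages_in_weeks = [0, 2, 3, 30, 52, 104, 300] # made-up sample values
binned = ages_in_weeks.map { |age| bin_label(age) }
p frequencies(binned)
```

With these sample values, two ages fall into `:less_than_month` and two into `:one_to_five_years`, while the remaining bins get one each.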

shelter_data['Breed'].categories.size
#=> 1380
# Merge infrequent categories to make data analysis easy
other_cats = shelter_data['Breed'].categories.select { |i| shelter_data['Breed'].count(i) < 10 }
other_cats_hash = other_cats.zip(['other']*other_cats.size).to_h
shelter_data['Breed'].rename_categories other_cats_hash
shelter_data['Breed'].frequencies
# View the data
shelter_data['Breed'].frequencies.sort(ascending: false).head(10)
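
The merge step above follows a simple pattern: count each category, collect the ones below a threshold, and map them all to one shared label. A minimal plain-Ruby sketch of that idea, assuming a hypothetical helper `merge_rare` and made-up breed values (this is not Daru code):

```ruby
# Plain-Ruby sketch of merging rare categories into a single 'other' level.
# `merge_rare` is a hypothetical helper; the data and threshold are made up.
def merge_rare(values, min_count, other_label = 'other')
  # Count how often each category occurs.
  counts = values.each_with_object(Hash.new(0)) { |value, acc| acc[value] += 1 }
  # Categories seen fewer than min_count times are considered rare.
  rare = counts.select { |_value, count| count < min_count }.keys
  # Build the same kind of rename hash as in the post: rare category => 'other'.
  rename_map = rare.zip([other_label] * rare.size).to_h
  values.map { |value| rename_map.fetch(value, value) }
end

breeds = ['Pit Bull Mix', 'Siamese Mix', 'Pit Bull Mix', 'Beagle',
          'Siamese Mix', 'Pit Bull Mix', 'Pug']
p merge_rare(breeds, 2).uniq
# => ["Pit Bull Mix", "Siamese Mix", "other"]
```

Fewer categories usually means fewer model coefficients later on, which is why this kind of merging tends to help before regression.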

Please refer to this blog post to learn more.
Visualizing categorical data

With the help of Nyaplot, Gnuplot and Gruff, Daru can now visualize categorical data just as it does ordinary data.
To plot a vector with Nyaplot, call the method #plot.
# dv is a categorical vector
dv = Daru::Vector.new ['III']*10 + ['II']*5 + ['I']*5, type: :category, categories: ['I', 'II', 'III']

dv.plot(type: :bar, method: :fraction) do |p, d|
  p.x_label 'Categories'
  p.y_label 'Fraction'
end
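
To make the plotted quantity concrete, the sketch below computes per-category shares in plain Ruby. It assumes that `method: :fraction` means "share of observations in each category"; `category_fractions` is a hypothetical helper, not part of Daru or Nyaplot.

```ruby
# Sketch of the per-category fractions a bar plot of shares would display.
# `category_fractions` is a hypothetical helper, not part of Daru.
def category_fractions(values, categories)
  total = values.size.to_f
  categories.map { |cat| [cat, values.count(cat) / total] }.to_h
end

values = ['III'] * 10 + ['II'] * 5 + ['I'] * 5
p category_fractions(values, %w[I II III])
# 'I' and 'II' each get 0.25, 'III' gets 0.5
```

Passing the categories explicitly keeps the bars in a fixed order, in the same spirit as the `categories:` option used when building `dv`.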

Given a dataframe, one can draw a scatter plot in which the color, shape and size of the points vary according to a categorical variable.
# df is a dataframe with categorical variable :c
df = Daru::DataFrame.new({
  a: [1, 2, 4, -2, 5, 23, 0],
  b: [3, 1, 3, -6, 2, 1, 0],
  c: ['I', 'II', 'I', 'III', 'I', 'III', 'II']
  })
df.to_category :c

df.plot(type: :scatter, x: :a, y: :b, categorized: {by: :c, method: :color}) do |p, d|
  p.xrange [-10, 10]
  p.yrange [-10, 10]
end
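
Conceptually, categorizing a scatter plot by color amounts to splitting the points into one series per category and giving each series its own color. The plain-Ruby sketch below illustrates that grouping; `series_by_category` and `PALETTE` are hypothetical names, and a real plotting backend chooses its own colors.

```ruby
# Hypothetical palette; a real plotting backend picks its own colors.
PALETTE = %w[red green blue].freeze

# Group (x, y) points by category and attach one color per category.
def series_by_category(xs, ys, cats)
  points = xs.zip(ys, cats)
  groups = points.group_by { |_x, _y, cat| cat }
  groups.each_with_index.map do |(cat, pts), i|
    { category: cat,
      color: PALETTE[i % PALETTE.size],
      points: pts.map { |x, y, _cat| [x, y] } }
  end
end

a = [1, 2, 4, -2, 5, 23, 0]
b = [3, 1, 3, -6, 2, 1, 0]
c = %w[I II I III I III II]
series_by_category(a, b, c).each { |series| p series }
```

The same grouping would apply if shape or size were varied instead of color.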

In a similar manner, Gnuplot and Gruff also support plotting of categorical variables.
As additional work, I integrated Gruff with Daru, so vectors and dataframes can now be plotted using Gruff as well.
See more notebooks on visualizing categorical data with Daru here.
Regression support for categorical data

Categorical data is now supported in regression in Statsample and Statsample-GLM.
A formula language (like the ones used in R and Patsy) has also been introduced to ease the task of specifying a regression.
There is no longer any need to manually create a dataframe for regression.
require 'statsample-glm'

formula = 'OutcomeType_Adoption~AnimalType+Breed+AgeuponOutcome(Weeks)+Color+SexuponOutcome'
glm_adoption = Statsample::GLM::Regression.new formula, train, :logistic
glm_adoption.model.coefficients :hash

#=> {:AnimalType_Cat=>0.8376443692275163, :"Breed_Pit Bull Mix"=>0.28200753488859803, :"Breed_German Shepherd Mix"=>1.0518504638731023, :"Breed_Chihuahua Shorthair Mix"=>1.1960242033878856, :"Breed_Labrador Retriever Mix"=>0.445803000000512, :"Breed_Domestic Longhair Mix"=>1.898703165797653, :"Breed_Siamese Mix"=>1.5248210169271197, :"Breed_Domestic Medium Hair Mix"=>-0.19504965010288533, :Breed_other=>0.7895601504638325, :"Color_Blue/White"=>0.3748263925801828, :Color_Tan=>0.11356334165122918, :"Color_Black/Tan"=>-2.6507089126322114, :"Color_Blue Tabby"=>0.5234717706465536, :"Color_Brown Tabby"=>0.9046099720184905, :Color_White=>0.07739310267363662, :Color_Black=>0.859906249787038, :Color_Brown=>-0.003740755055106689, :"Color_Orange Tabby/White"=>0.2336674067343927, :"Color_Black/White"=>0.22564205490196415, :"Color_Brown Brindle/White"=>-0.6744314269278774, :"Color_Orange Tabby"=>2.063785952843677, :"Color_Chocolate/White"=>0.6417921901449108, :Color_Blue=>-2.1969040091451704, :Color_Calico=>-0.08386525532631824, :"Color_Brown/Black"=>0.35936722899161305, :Color_Tricolor=>-0.11440457799048752, :"Color_White/Black"=>-2.3593561796090383, :Color_Tortie=>-0.4325130799770577, :"Color_Tan/White"=>0.09637439333330515, :"Color_Brown Tabby/White"=>0.12304448360566177, :"Color_White/Brown"=>0.5867441296328475, :Color_other=>0.08821407092892847, :"SexuponOutcome_Spayed Female"=>0.32626712478395975, :"SexuponOutcome_Intact Male"=>-3.971505056680895, :"SexuponOutcome_Intact Female"=>-3.619095491410668, :SexuponOutcome_Unknown=>-102.73807712615843, :"AgeuponOutcome(Weeks)"=>-0.006959545305620043}
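
Coefficient names such as `:AnimalType_Cat` suggest that each categorical variable is expanded into 0/1 indicator columns, with one category left out as the baseline. The following plain-Ruby sketch shows that kind of dummy coding; `dummy_code` is a hypothetical helper, and dropping the first category as the reference level is an assumption rather than a statement about Statsample-GLM's internals.

```ruby
# Plain-Ruby sketch of dummy (indicator) coding for one categorical variable.
# `dummy_code` is a hypothetical helper; it drops the first category as the
# reference level, which is an assumption about the encoding used.
def dummy_code(name, values)
  categories = values.uniq
  categories.drop(1).map do |cat|
    ["#{name}_#{cat}".to_sym, values.map { |v| v == cat ? 1 : 0 }]
  end.to_h
end

animal_type = %w[Dog Cat Dog Cat Cat] # made-up sample
p dummy_code('AnimalType', animal_type)
# one column, :AnimalType_Cat, with values [0, 1, 0, 1, 1]
```

A variable with k categories yields k - 1 indicator columns this way, so each coefficient is read relative to the omitted baseline category.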
Thanks to the work of Alexej Gossmann, one can also predict using the model.
predict = glm_adoption.predict test
predict.map! { |i| i < 0.5 ? 0 : 1 }
predict.head 5
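
The 0.5 cut-off above makes sense if the predictions are probabilities, which is what a logistic model usually produces. As a small plain-Ruby illustration under that assumption (the helpers `logistic` and `classify` and the input values are made up, not part of Statsample-GLM):

```ruby
# Logistic function: maps a linear predictor to a probability in (0, 1).
def logistic(x)
  1.0 / (1.0 + Math.exp(-x))
end

# Turn probabilities into 0/1 class labels, mirroring the 0.5 cut-off above.
def classify(probabilities, threshold = 0.5)
  probabilities.map { |prob| prob < threshold ? 0 : 1 }
end

linear_predictors = [-2.0, 0.0, 1.5] # made-up values
probabilities = linear_predictors.map { |x| logistic(x) }
p classify(probabilities)
# => [0, 1, 1]
```

A different threshold can be passed to `classify` when false positives and false negatives are not equally costly.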

This, I believe, makes Statsample-GLM very convenient to use.
See this for a complete example.
Other

In addition to the changes mentioned above, there are some other notable changes:

Improved the overall structure of indexing in Daru and added more capabilities. See this and this.
Added CategoricalIndex to handle the case where the index column holds categorical data. More about it here.
Improved the missing value API in Daru. Read more about it here.
Configured Guard to enable automatic testing. More info here.
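
To give a feel for the idea behind a categorical index, where the same label can point at many rows, here is a toy plain-Ruby sketch. `ToyCategoricalIndex` and its methods are hypothetical and only illustrate the concept; they are not Daru's CategoricalIndex API.

```ruby
# Minimal sketch of an index whose labels are categories and may repeat.
# `ToyCategoricalIndex` is a hypothetical class, not Daru's CategoricalIndex.
class ToyCategoricalIndex
  def initialize(labels)
    @positions = Hash.new { |hash, key| hash[key] = [] }
    labels.each_with_index { |label, i| @positions[label] << i }
  end

  # All row positions that carry the given category (empty if unknown).
  def pos(category)
    @positions.fetch(category, [])
  end

  # Distinct categories in order of first appearance.
  def categories
    @positions.keys
  end
end

index = ToyCategoricalIndex.new(%i[a b a c b a])
p index.pos(:a)
# => [0, 2, 5]
```

Storing positions per category makes "give me every row labelled `:a`" a single hash lookup instead of a scan over all labels.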

Documentation

You can read about all my work in detail here.
I hope with these additions one will be able to see data more clearly with Daru :)