zippeurfou/rsqlite.md

## rsqlite.md

      
    Raw
  

              rsqlite.md
            
          
    There are multiple questions:


The operating regions are indicated by region_id. Generate a report of the average hourly_charge in each operating region as well as the overall average.


Assuming that a booking is completed if it is not cancelled by the customer and has no reschedule events, generate a report based on the calendar week (running Sun-Sat) of the number of bookings done, number of bookings done using coupons, total hours booked, and number of bookings which were cancelled by the customer.


Recurring bookings are bookings which happen on a regularly scheduled basis and are indicated by recurring_id and a frequency (freq) indicating how many weeks pass between each booking in the series. Determine the distribution of bookings based on the frequency of the recurring booking to which they belong across the days of the week on which they were completed.


Say we have a problem with customers canceling and rescheduling bookings. Assuming all the bookings are from different users, pull metrics which you believe would give a general profile of these problem users.


Input:

Question asked before
handy_bookings_train.csv and handy_booking_test.csv

Output:

This document

#Set up
setwd("~/R projects/handy")
#load csv sql to be able to do sql queries on the csv files
#the R library use sqlite
library(sqldf)
#Some graph
library(ggplot2)
#load data
ds.train<-read.csv(file="data/handy_bookings_train.csv",header=T,sep=",",na.strings=c("NULL",""))
ds.test<-read.csv(file="data/handy_bookings_test.csv",header=T,sep=",",na.strings=c("NULL",""))
#Question 1
The operating regions are indicated by region_id. Generate a report of the average hourly_charge in each operating region as well as the overall average.
#average per region
avg.region<-sqldf("SELECT region_id as region,avg(hourly_charge) as average_charge  FROM 'ds.train' GROUP BY region_id")
## Loading required package: tcltk

#overall average
#I wasn't sure if you wanted the overall average weighted by region or just the overall average so I made both query
avg.overall<-sqldf("select avg(hourly_charge) from 'ds.train'")
#average overall weighted by region
avg.region.overall<-sqldf("select avg(t.average_charge) from (SELECT avg(hourly_charge) as average_charge from 'ds.train' group by region_id) t")
#This output the same result but it's kind of cheating since I'm using R and not pure sql
#avg.region.overall<-sqldf("select avg(average_charge) FROM 'avg.region'")
#If I had to pick one, I think it makes more sense to use the avg.overall one.

avg.region
##    region average_charge
## 1       4       2981.250
## 2       5       2853.342
## 3       6       2838.129
## 4       8       2969.583
## 5       9       2886.047
## 6      10       2936.000
## 7      11       2928.125
## 8      12       2945.614
## 9      13       3000.000
## 10     14       3146.032
## 11     15       3008.125
## 12     16       3108.333
## 13     17       3088.298
## 14     18       3077.778
## 15     20       3165.385
## 16     21       3003.175
## 17     22       3144.000
## 18     23       3053.704
## 19     24       3108.730
## 20     25       3068.831
## 21     26       3108.511
## 22     27       3047.826
## 23     28       2929.592
## 24     29       3095.455
## 25     31       3032.895
## 26     32       2906.061
## 27     33       3013.115
## 28     34       2868.750
## 29     35       3059.677
## 30     36       1007.380
## 31     38       3098.485
## 32     41       1000.000
## 33     42       1000.000

ggplot(data=avg.region,aes(x=region,y=average_charge))+geom_histogram(stat="identity")+scale_x_continuous(breaks=seq(min(avg.region$region),max(avg.region$region),1))+xlab("Region ID")+ylab("Average Charge")+ggtitle("Average charge per region")

avg.overall
##   avg(hourly_charge)
## 1           2674.819

avg.region.overall
##   avg(t.average_charge)
## 1              2832.674

#Question 2
Assuming that a booking is completed if it is not cancelled by the customer and has no reschedule events, generate a report based on the calendar week (running Sun-Sat) of the number of bookings done, number of bookings done using coupons, total hours booked, and number of bookings which were cancelled by the customer.
#The way I understood the question was to ask for the day of the week (sun-sat) of the week number (in the year). I am not sure it is exactly what you meant.
#Additionally, it wasn't clear to me if you wanted the report to be for the day of the week it was added or it was starting. I figured that added made more sense.
#This question is kind of tricky because I did not know exactly one row entry mean and what the column mean. I guess it was part of the exercise that's why I didn't ask.
#I have noticed that some date cancelation were after the start date and date start was always after date added. It made me assume that each row is for each customer, date added is when they added their order, date start is when they first start and frequencies indicate the  frequency of the task. Therefore, you can cancel after the start if your frequency is more than one.
#I assumed that an event would be canceled when we had a cancel date and not reschuduled when  reschedule_events_count was 0
#I also assumed this report was to give information of "what have been done" which mean that the booking hours should be only to the one who didn't cancel.


report<-sqldf("select cast (strftime('%W', date_start) as integer) as weeknumb, case cast (strftime('%w', date_added) as integer)
  when 0 then 'Sunday'
  when 1 then 'Monday'
  when 2 then 'Tuesday'
  when 3 then 'Wednesday'
  when 4 then 'Thursday'
  when 5 then 'Friday'
  when 6 then 'Saturday'
  else 'Unknwon' end as dow,
  count(case when (reschedule_events_count=0 OR customer_cancelation_date is null) then 1 end) as booking_done,
  count(case when (reschedule_events_count=0 OR customer_cancelation_date is null AND coupon is not null) then 1 end) as booking_done_coupon,
  sum(case when (reschedule_events_count=0 OR customer_cancelation_date is null) then hrs else 0 end) as booking_hours,
  count(case when (customer_cancelation_date is not null AND reschedule_events_count>0) then 1 end) as booking_cancelled
  from 'ds.train'
  group by weeknumb,cast (strftime('%w', date_added) as integer) order by weeknumb, cast (strftime('%w', date_start) as integer)")

##Please note that I could definitely have breaked down this query into string vars in order to not repeat myself. However, it was saying to use SQL. Therefore, I assumed I should do as if I could only do SQL in a terminal.
report
##    weeknumb       dow booking_done booking_done_coupon booking_hours
## 1        48    Sunday            7                   7          19.0
## 2        48  Thursday          108                  97         287.5
## 3        48  Saturday           57                  51         145.0
## 4        48    Monday          291                 275         771.0
## 5        48 Wednesday          141                 128         390.0
## 6        48   Tuesday          259                 238         658.5
## 7        48    Friday           89                  72         236.5
## 8        49    Monday          261                 240         716.5
## 9        49  Saturday          163                 148         449.5
## 10       49    Sunday          162                 149         448.0
## 11       49   Tuesday          316                 286         844.5
## 12       49  Thursday          243                 215         695.5
## 13       49    Friday          221                 202         617.5
## 14       49 Wednesday          250                 224         688.0
## 15       50    Sunday          151                 139         422.5
## 16       50    Monday          295                 273         815.0
## 17       50   Tuesday          267                 248         742.5
## 18       50 Wednesday          215                 192         596.5
## 19       50  Thursday          148                 125         431.0
## 20       50  Saturday          106                  96         314.0
## 21       50    Friday          119                 107         344.0
##    booking_cancelled
## 1                  0
## 2                  7
## 3                  2
## 4                  5
## 5                  4
## 6                  6
## 7                  4
## 8                  6
## 9                  3
## 10                 4
## 11                11
## 12                 6
## 13                18
## 14                10
## 15                 4
## 16                 9
## 17                13
## 18                 9
## 19                 3
## 20                 1
## 21                 6

#Question3
Recurring bookings are bookings which happen on a regularly scheduled basis and are indicated by recurring_id and a frequency (freq) indicating how many weeks pass between each booking in the series. Determine the distribution of bookings based on the frequency of the recurring booking to which they belong across the days of the week on which they were completed.
#On this question, I am really confused. Do you want me to give you the  classification model using machine learning (I see there are train and test data)?
#Do you want me to give you the shape of the data (normal distribution, linear...)
#Do you want me to give you some descriptive informations such as the range, mean, variance..
#The first line says SQL, I will just give you some general information but please tell me if you want more. I can definitely do it if that's what you are looking for.


distribution<-sqldf("select cast (avg(freq) as integer) as freq,case cast (strftime('%w', date_start) as integer)
  when 0 then 'Sunday'
  when 1 then 'Monday'
  when 2 then 'Tuesday'
  when 3 then 'Wednesday'
  when 4 then 'Thursday'
  when 5 then 'Friday'
  when 6 then 'Saturday'
  else 'Unknwon' end as dow,
  count(*) as freq_count
  from 'ds.train' where (recurring_id is not null AND freq is not NULL) AND (reschedule_events_count=0 OR customer_cancelation_date is null)
  group by cast (strftime('%w', date_start) as integer),freq order by cast (strftime('%w', date_added) as integer)")


ggplot(data=distribution,aes(x=as.factor(freq),y=freq_count))+geom_histogram(stat="identity")+facet_wrap(~dow)+xlab("Frequence")+ylab("Number of completed booking")+ggtitle("Number of completed booking per frequency accross the day of the week")

##Question 4
Say we have a problem with customers canceling and rescheduling bookings. Assuming all the bookings are from different users, pull metrics which you believe would give a general profile of these problem users.
#We can do many things to try to find some correlation between the data
#One thing we could look at is the disperssion of the cancelation date
#Did they usually cancel at the day of the week
#We could use machine learning again with a binary response which would be canceled and reshudeled against normal or find a linear correlation using pearson moment correlation for example.
#However, the way I understood the challenge was to only use pure SQL (without help from any programming language as if you where in a sql terminal) for each question which highly limit the possibilities.
#What we want to know is why these users are different than the one who don't cancel or reschedule
#Therefore, our metric should be able to compare both type of users.
#Moreover, there is the notion of time. Was there a special event that made customers leave. Therefore, a timeline of percentage of user leaving vs total user would be interesting.
#Also, it would be interesting to see the acquisition vs retention rate. Do people who cancel are first time user or people who have had services every week (looking at freq)

metric.retention<-sqldf("select (cast (strftime('%W', date_added) as integer) +(cast (strftime('%w', date_added) as integer))/7.0) as date_event, (100.0*(count(case when(reschedule_events_count>0 AND customer_cancelation_date is not null) then 1 end))/count(*)) as perc,
             (avg(case when(reschedule_events_count>0 AND customer_cancelation_date is not null) then freq  end)) as freq_leaving,
           (avg(case when(reschedule_events_count=0 OR customer_cancelation_date is null) then freq end)) as freq_not_leaving
                        
            from 'ds.train' group by cast (strftime('%W', date_added) as integer),cast (strftime('%w', date_added) as integer) order by cast (strftime('%W', date_added) as integer), cast (strftime('%w', date_added) as integer)")

ggplot(data=metric.retention,aes(x=date_event,y=perc))+geom_point()+scale_x_continuous(breaks=seq(min(metric.retention$date_event),max(metric.retention$date_event),0.5))+geom_smooth()+xlab("Time")+ylab("Percentage of uncompleted booking")+ggtitle("Percentage of uncompleted booking accross time")
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.


#We can see that there is a peak during the week 49 (saturday)
#It would be interesting to see if we had a problem of in our server or some reason that the data itself wouldn't tell us


ggplot(data=metric.retention,aes(x=date_event,y=freq_not_leaving))+geom_point()+scale_x_continuous(breaks=seq(min(metric.retention$date_event),max(metric.retention$date_event),0.5))+geom_line()+geom_point(aes(x=date_event,y=freq_leaving,color="red"))+geom_line(aes(x=date_event,y=freq_leaving,color="red"))+xlab("Time")+ylab("Average frequency")+ggtitle("Average frequency of uncompleted (red) vs completed booking")
## Warning: Removed 1 rows containing missing values (geom_point).

## Warning: Removed 1 rows containing missing values (geom_path).

## Warning: Removed 3 rows containing missing values (geom_point).

## Warning: Removed 2 rows containing missing values (geom_path).


#We can see that while the average frequency for the staying has a small variation, the variation of the one for the one leaving is really higher. We can conclude that the one staying have usually a freq of 3 while the one leaving have other frequencies. Therefore, we should promote having frequency of 3 to keep them.

#Now, let's look at other interesting variables such as wether or not they have coupon (a deeper analysis of which coup trigger retention or not could be done as well),if they clicked for extra home cleaning, total chargethe region,if there was a campaign,number of provider, if provider is requested, if it was peak priced.
#If I had to do it only with sql I would look at the average and the variance of each variable for the group leaving and staying. This would give me a good global idea if there are some odd behavior and which metric makes more sense.

#However, I believe that the best way to know if someone is going to leave or not is to do preddictive analysis. Therefore, I'll be using a bit of machine learning.
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.

library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

#Cleaning data removing meaningless variables
rf.train<- ds.train %>% mutate(has_coupon=as.factor(as.numeric(is.na(coupon))),total_charge=as.numeric(hrs*hourly_charge),cancel=as.factor(as.numeric(!is.na(customer_cancelation_date) & reschedule_events_count>0))) %>% select_("-row_id","-customer_cancelation_date","-date_start","-date_added","-hrs","-hourly_charge","-coupon","-recurring_id","-reschedule_events_count","-user_bookings_count","-user_cancelled_bookings_count","-region_id")

rf.test<- ds.test %>% mutate(has_coupon=as.factor(as.numeric(is.na(coupon))),total_charge=as.numeric(hrs*hourly_charge),cancel=as.factor(as.numeric(!is.na(customer_cancelation_date) & reschedule_events_count>0))) %>% select_("-row_id","-customer_cancelation_date","-date_start","-date_added","-hrs","-hourly_charge","-coupon","-recurring_id","-reschedule_events_count","-user_bookings_count","-user_cancelled_bookings_count","-region_id")

rf.test[is.na(rf.test$freq),]<-0
rf.train[is.na(rf.train$freq),]<-0


rf.train$num_providers<-as.factor(rf.train$num_providers)
rf.train$has_campaign<-as.factor(rf.train$has_campaign)
rf.train$clicked_extra_home_cleaning<-as.factor(rf.train$clicked_extra_home_cleaning)
rf.train$peak_priced<-as.factor(rf.train$peak_priced)

rf.test$num_providers<-as.factor(rf.test$num_providers)
rf.test$has_campaign<-as.factor(rf.test$has_campaign)
rf.test$clicked_extra_home_cleaning<-as.factor(rf.test$clicked_extra_home_cleaning)
rf.test$peak_priced<-as.factor(rf.test$peak_priced)


clf<-randomForest(cancel~.,data=rf.train,na.action=na.omit,importanc=T)
clf$confusion
##      0 1 class.error
## 0 3948 0           0
## 1   52 0           1

#We can see that the confusion matrix is really bad, it is unecessary to look at the prediction, the clasifier is just bad
#pred<-predict(clf,rf.test %>% select_("-cancel"))
#varImpPlot(clf)
#Let's try with adaboost
############      AdaBoost   ###############
#I picked adaboost because it is supposed to be a great mix between accuracy and speed for high dim predictors with NA.
library(ada)
## Loading required package: rpart

#Helper function that allow to know what are the dependencies between vars
#Input: dataframe, name of y in dataframe, type of boost you want to apply, number of iteration and depth
#Output: classifier
clfAda<-function(data=ds,y,adatype="gentle",adaiter=70,cdepth=14){
  n <- nrow(data)
  indy<-which(names(data)==y)
  ind <- sample(1:n)
  xnam <- paste(names(data[-indy]), sep="")
  fmla <- as.formula(paste(y," ~ ", paste(xnam, collapse= "+")))
  trainval <- ceiling(n * .5)
  testval <- ceiling(n * .3)
  train <- data[ind[1:trainval],]
  test <- data[ind[(trainval + 1):(trainval + testval)],]
  valid <- data[ind[(trainval + testval + 1):n],]
  control <- rpart.control(cp = -1, maxdepth = cdepth,maxcompete = 1,xval = 0)
  clf <- ada(fmla, data = train, test.x = test[,-indy], test.y = test[,indy], type =adatype, control = control, iter = adaiter)
  clf <- addtest(clf, valid[,-indy], valid[,indy])
  clf
}

##Now we compute our adaboost and do our prediction and confusion matrix
#Take labels and assign variables after doing adaboost, prediction, confusion matrix and the variable importance
  xtest<-rf.test %>% select_("-cancel")
  ytest<-rf.test %>% select_("cancel")
  #classifier
  ada.clf<-clfAda(data=rf.train,y="cancel")
  #prediction
  ada.pred<-predict(ada.clf,xtest)
  #confusion matrix
  ada.conf<-table(ada.pred,ytest[,1])
  #var importance
  ada.vip<-varplot(ada.clf,plot.it=FALSE,type='scores')
ada.clf$confusion
##           Final Prediction
## True value    0
##          0 1969
##          1   31

#Again the confusion matrix is really bad, we can't predict if it is canceled
#We can't easily predict it
#Even if we could, the testing set is actually consisting of only completed booking which give us an unbalanced set
#I will then just look at the classification tree
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Loading required package: sandwich

rf.train$freq<-as.factor(rf.train$freq)
plot(ctree(cancel~.,rf.train))

#We can see that the best metric to know if we complete or not a booking is if we have a provider or not. Other than that
#We could go further and look at the total charge, frequency, has_campaign, clicked_extra_home_cleaning, peak_priced, is_provider_requested, has_coupon.
#However, this is mostly descriptive analysis which trigger less accurate result than the classification done.