Skip to content

Instantly share code, notes, and snippets.

Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
@timabe
timabe / statcast_columns.R
Created September 2, 2019 17:39
Column object for statcast data with correct data types
statcast_cols <- cols(
pitch_type = col_character(),
game_date = col_date(format = ""),
release_speed = col_double(),
release_pos_x = col_double(),
release_pos_z = col_double(),
player_name = col_character(),
batter = col_double(),
pitcher = col_double(),
events = col_character(),
@timabe
timabe / yelp_hclust
Created November 12, 2013 06:18
Hierarchical Clustering in R with the Yelp-Kaggle dataset
load('busCats.Rdata') # load this data from github.com/timabe
catSums <- colSums(bus.cats) # get summary data, with just the category sums
catSums[order(-catSums)]->catSums # order the categories ahead of plotting them
plot(log(catSums))
# the plot shows a skewed set. Many of these will be useless in the hierarchical clustering
# as there are only a handful of observations of them. It's unlikely they will produce an
# interesting cluster membership
# lets concentrate on the top 60 since there's a slight kink there
@timabe
timabe / yelp_hclust.R
Last active December 28, 2015 02:09
Hierarchical Clustering in R for the Yelp Kaggle Dataset
load('busCats.Rdata') # download this from https://dl.dropboxusercontent.com/u/6132890/busCats.Rdata
catSums <- colSums(bus.cats) # get summary data, with just the category sums
catSums[order(-catSums)]->catSums # order the categories ahead of plotting them
plot(log(catSums))
# the plot shows a skewed set. Many of these will be useless in the hierarchical clustering
# as there are only a handful of observations of them. It's unlikely they will produce an
# interesting cluster membership
# lets concentrate on the top 60 since there's a slight kink there
# and for similarity purposes we only want businesses that have multiple categories
@timabe
timabe / predictOverFullGames.R
Created April 1, 2013 04:48
Aggregates votes from RandomForest
predictOverFullGames <- function(name) {
player <- subset(pitch.test, pitcher_name == name)
# genrate predictions as votes
player$prediction <- predict(rf, player, type = 'vote')
# aggregate over his individual games
playerAgg <- aggregate(. ~ V1, data = data.frame(cbind( player$url, player$prediction)), sum)
# melt data frame to extract winner of majority vote
melt.player <- melt(playerAgg, id = 'V1')
melt.player$V1 <- as.character(melt.player$V1)
melt.player <- ddply(melt.player, .(as.character(V1)), summarize, pitcher = variable[value == max(value)])
@timabe
timabe / predictPitcherRF.R
Created April 1, 2013 04:00
OaklandAsRandomForest
# Generate Training and Testing data from full pitch dataframe
train.vec2 <- sample(nrow(pitch), nrow(pitch) * 0.7)
pitch.train <- pitch[train.vec2, ]
pitch.test <- pitch[-train.vec2, ]
# Change pitcher name and pitch type from character variable to factor
pitch.train$pitcher_name <- as.factor(pitch.train$pitcher_name)
pitch.train$pitch_type <- as.factor(pitch.train$pitch_type)
pitch.test$pitcher_name <- as.factor(pitch.test$pitcher_name)
pitch.test$pitch_type <- as.factor(pitch.test$pitch_type)
@timabe
timabe / reduceDimensions.R
Last active December 15, 2015 15:28
Oakland A's pitchFx: reduce dimensions
# these variables aren't useful for analysis
badVars <- c('des', 'on_1b', 'on_2b', 'on_3b', 'type', 'type_confidence',
'z0', 'zone', 'url', 'num', 'count', 'pitcher', 'batter_name',
'sv_id', 'nasty')
varsToOmit = c(badVars, 'pitcher_name', 'pitch_type')
# let's explore this data . . . use 3 pitchers and see if some dimentions of
# their fastballs can differentiate them well.
@timabe
timabe / hot_hand_ex.R
Created April 28, 2012 20:00
Hot Hand Example
### HOT HAND
###
### EXAMPLE: Anthony Morrow
# first source code from https://github.com/timabe/Hot-Hand/blob/master/hot_hand.R
streaks('Anthony Morrow')
### outputs it as data frame to streak_count
historicals('Anthony Morrow')
### outputs it as a list to historical.list