A data scientist with a focus on health care and program evaluation. Also a baseball fan (so many statistics!) and someone who learns and teaches best with simulation; there's some of each here. A summary:
From the Riddler Express:
The Major League Baseball playoffs are about to begin. Based on the current playoff format, what is the best possible winning percentage a team can have in the playoffs without winning the World Series? And what is the worst possible winning percentage a team can have in the playoffs and still win the World Series?
The current format is:
- one play-in game for two non-division-winning wild card teams
- best 3-out-of-5 division series
- best 4-out-of-7 league championship series
- best 4-out-of-7 World Series
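The arithmetic behind both questions can be laid out in a few lines of R. This is only a sketch under the format listed above: it assumes the best non-champion is a wild card team that wins every game until losing the World Series in seven, and that the worst champion is a division winner pushed to the limit in every series. The helper `win_pct` and the variable names are made up for illustration.

```r
# Winning percentage from a win/loss record
win_pct <- function(wins, losses) wins / (wins + losses)

# Best record without a title (assumed path): wild card team wins the
# play-in (1-0), sweeps the division series (3-0) and the LCS (4-0),
# then loses the World Series in seven games (3-4).
best_pct <- win_pct(1 + 3 + 4 + 3, 0 + 0 + 0 + 4)        # 11-4

# Same path for a division winner, which skips the play-in game
best_pct_division <- win_pct(3 + 4 + 3, 0 + 0 + 4)       # 10-4

# Worst record with a title (assumed path): division winner goes the
# distance in every series (3-2, 4-3, 4-3).
worst_pct <- win_pct(3 + 4 + 4, 2 + 3 + 3)               # 11-8

# Same path for a wild card team, which adds a play-in win
worst_pct_wildcard <- win_pct(1 + 3 + 4 + 4, 0 + 2 + 3 + 3)  # 12-8

round(c(best = best_pct, worst = worst_pct), 3)
```

Under these assumptions the extra play-in win helps the best non-champion (11-4 beats 10-4) and hurts the worst champion (12-8 is better than 11-8), which is why the two extremes come from different kinds of teams.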
This week's Riddler Express:
Abby and Beatrix are playing a game with two six-sided dice. Rather than having numbers on the sides like normal dice, however, the sides of these dice are either red or blue. In the game they're playing, Abby wins if the two dice land with the same color on top. Beatrix wins if the colors are not the same. One of the dice has five blue sides and one red side. If Abby and Beatrix have equal chances of winning the game, how many red and blue sides does the other die have?
While this wouldn't be too difficult to reason out, it's also a pretty straightforward simulation, and a good excuse for a stacked barchart. We start by creating the seven dice needed to compare.
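df <- data.frame(id = 1:6)
for (i in 0:6) {
  # die i has i red sides and 6 - i blue sides
  die <- c(rep('red', i), rep('blue', 6 - i))
  df[[paste0('die', i)]] <- die  # assumed: the excerpt cuts off here
}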
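As a cross-check on the simulation approach, the same comparison can be done exactly. The sketch below is not the post's code: it enumerates the seven candidate dice (0 through 6 red sides) and computes Abby's chance of a color match against the die with one red and five blue sides. The names `reds`, `p_same`, and `fair` are illustrative.

```r
# Number of red sides on the second die, one entry per candidate die
reds <- 0:6

# Abby wins on red-red or blue-blue; the first die is 1 red, 5 blue
p_same <- (1 / 6) * (reds / 6) + (5 / 6) * ((6 - reds) / 6)

# Candidate dice that give Abby and Beatrix equal chances
fair <- reds[abs(p_same - 0.5) < 1e-9]

data.frame(red = reds, blue = 6 - reds, p_abby = round(p_same, 3))
fair  # 3, i.e. three red and three blue sides
```

The match probability simplifies to (30 - 4r) / 36 for r red sides, which equals one half only at r = 3.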
One of my favorite baseball websites, baseballmusings.com, recently had a post about the chances of a player winning the triple crown (leading the league in home runs, RBIs and batting average).
The author provided calculations for these, and estimated the probability of achieving the triple crown by multiplying the individual probabilities together. This led me to assess the association between these statistics, as it seemed that there was a fairly strong association between leading the league in home runs and RBIs (both generally signs of power hitters, who probably get lineup spots with opportunities to drive in runners), and that perhaps they shouldn't be considered as independent. The analysis using RStudio and interpretation is below.
Using the Lahman package for baseball data and sqldf for data manipulation:
library(Lahman)
library(sqldf)
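One rough way to look at the independence question is to count how often the home run leader in a league-season is also the RBI leader. The sketch below uses base R rather than sqldf, and the function `share_leaders` and the toy data are hypothetical; the column names (`playerID`, `yearID`, `lgID`, `HR`, `RBI`) follow Lahman's `Batting` table. It ignores playing-time qualifications and counts a tie for the lead as leading.

```r
# Share of league-seasons in which an HR leader is also an RBI leader
share_leaders <- function(batting) {
  # Combine multiple stints so each player has one row per season and league
  totals <- aggregate(cbind(HR, RBI) ~ playerID + yearID + lgID,
                      data = batting, FUN = sum)
  seasons <- split(totals, list(totals$yearID, totals$lgID), drop = TRUE)
  same <- sapply(seasons, function(s) {
    hr_leaders  <- s$playerID[s$HR == max(s$HR)]
    rbi_leaders <- s$playerID[s$RBI == max(s$RBI)]
    length(intersect(hr_leaders, rbi_leaders)) > 0
  })
  mean(same)
}

# Toy data (made up): player "a" leads both in 2000, the leaders differ in 2001
toy <- data.frame(playerID = c("a", "b", "a", "b"),
                  yearID   = c(2000, 2000, 2001, 2001),
                  lgID     = "AL",
                  HR       = c(40, 30, 20, 25),
                  RBI      = c(100, 90, 80, 70))
toy_share <- share_leaders(toy)

# With the Lahman package loaded, the same call would be:
# share_leaders(Batting)
```

If the two titles were independent, this share would be close to the chance of any one player leading a single category, so a much larger share points toward the association described above.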
Take a standard deck of cards, and pull out the numbered cards from one suit (the cards 2 through 10). Shuffle them, and then lay them face down in a row. Flip over the first card. Now guess whether the next card in the row is bigger or smaller. If you’re right, keep going. If you play this game optimally, what’s the probability that you can get to the end without making any mistakes? Extra credit: What if there were more cards — 2 through 20, or 2 through 100? How do your chances of getting to the end change?
R code is here. It starts with a function that determines success in a single 10-card trial (0 or 1); the success rate comes out to ~ .17.
The function is then used to simulate repetitions of the 10-card game, and it can be extended to more cards or repetitions. The probability of success rapidly drops to near zero with ~ 35 cards:
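A minimal sketch of such a simulation follows. It is not the linked code: the function `play_once` is hypothetical, and the guessing rule is an assumption, namely to guess whichever direction has more unseen cards (guessing "higher" on a tie).

```r
# One trial: returns 1 if every guess is right, 0 otherwise
play_once <- function(cards = 2:10) {
  deck <- sample(cards)  # shuffled row of face-down cards
  for (k in seq_len(length(deck) - 1)) {
    current <- deck[k]
    unseen <- deck[(k + 1):length(deck)]
    # Assumed rule: guess the direction with more unseen cards
    guess_higher <- sum(unseen > current) >= sum(unseen < current)
    next_is_higher <- deck[k + 1] > current
    if (guess_higher != next_is_higher) return(0)
  }
  1
}

set.seed(538)
est <- mean(replicate(20000, play_once()))
est  # estimated probability of reaching the end with cards 2 through 10
```

Passing a longer vector, for example `play_once(cards = 2:20)`, covers the extra-credit versions of the question.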
The included code simulates an answer to a classic Riddler challenge from 538:
Let’s call this game rock-paper-scissors-hop. Here is an idealized list of its rules:
- Kids stand at either end of N hoops.
- At the start of the game, one kid from each end starts hopping at a speed of one hoop per second until they run into each other, either in adjacent hoops or in the same hoop.
- At that point, they play rock-paper-scissors at a rate of one game per second until one of the kids wins.
- The loser goes back to their end of the hoops, a new kid immediately steps up at that end, and the winner and the new player hop until they run into each other.
- This process continues until someone reaches the opposing end. That player’s team wins!
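One building block of any simulation of these rules is the rock-paper-scissors bout itself. The sketch below is not the post's code: the function `rps_bout` is hypothetical, and it assumes both kids throw rock, paper, or scissors uniformly at random, so each one-second game is a tie with probability 1/3.

```r
# Play rock-paper-scissors once per second until someone wins.
# Throws are coded 0 = rock, 1 = paper, 2 = scissors.
rps_bout <- function() {
  seconds <- 0
  repeat {
    seconds <- seconds + 1
    a <- sample(0:2, 1)
    b <- sample(0:2, 1)
    outcome <- (a - b) %% 3  # 0 = tie, 1 = A wins, 2 = B wins
    if (outcome != 0) {
      return(list(seconds = seconds,
                  winner = if (outcome == 1) "A" else "B"))
    }
  }
}

set.seed(538)
bout_lengths <- replicate(20000, rps_bout()$seconds)
mean(bout_lengths)  # average bout length in seconds
```

Under the uniform-throw assumption the bout length is geometric with success probability 2/3, so the average should sit near 1.5 seconds. The hopping and win conditions would still need to be layered on top of this.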
Data science is a rapidly growing field, and its programs often work to provide exposure to leading companies in marketing, banking, consulting, research, technology, insurance, and many other areas with a need for analytic services. Students are encouraged to apply to a wide array of companies, to develop relationships with people in the field, get an understanding of different fields in analytics, and increase the chances of getting one or (hopefully) more offers.
During the interview process, analytics students are always running the numbers (consciously or not) and calculating the likelihood of an offer. Where should I put my energy? How many interviews are too many? Too few? I wrote an interactive tool that lets applicants estimate their personal numbers. It includes the following variables a