@jurgen-dejager
Created July 22, 2016 20:42
Airbnb in NYC
# http://rpubs.com/jdejager/197077
# Introduction
> Airbnb boasts over a million listings in 34,000 cities, and according to data from Inside Airbnb, an independent data analysis website, Airbnb listed about 36,000 apartments in New York as of July 5, 2016. Airbnb's presence in NYC has been clouded in controversy from the beginning, with lawmakers arguing that Airbnb drives up rent for New York residents and facilitates a lot of illegal hosting activity, all while paying none of the fees hotels are subject to. Rent is driven up when landlords choose to rent apartments to short-term guests at higher rates rather than sign tenants to yearlong leases. Less supply and increased demand almost always lead to higher prices. In a study conducted in 2014, the New York State Attorney General concluded that **72%** of all units used as private short-term rentals on Airbnb from 2010 through mid-2014 appeared to violate both state and local New York laws. A more recent study by BJH Advisors LLC puts this number closer to 55%, still a staggering amount. Most of these violations arise because the minimum required stay for **entire-home** Airbnb rentals has to be at least 30 days. A more recent law states that landlords can't have multiple listings. This data exploration sets out to visualize how Airbnb operates in New York City. I'm going to investigate how prices and activity differ across the five boroughs, and how neighborhood median income is correlated with prices. I'm also taking a closer look at whether the high figure of illegal activity reported is accurate, and finally whether the concern that Airbnb potentially decreases rental supply is warranted.
## My analysis is structured as follows:
* Visualization of:
    + Prices
    + Reviews per Month
    + Availability
    + Minimum Stay
* Correlation Plots
    + Median Income vs Median Private Price
    + Median Income vs Median Entire Price
    + % Entire vs. Median Income
    + Median Income vs Reviews
* Geo-Spatial Analysis
    + Prices
    + Reviews per Month
    + Availability
    + Minimum Stay
---
```{r, echo = FALSE, warning = FALSE, include=FALSE}
library(dplyr)
library(zipcode)
library(sp)
library(rgdal)
library(leaflet)
library(GGally)
library(ggthemes)
library(ggplot2)
library(plotly)
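# Read the Airbnb data: abnb feeds the correlation plots, airbnb.data holds the listings
abnb = read.csv("abnb.csv")
airbnb.data = read.csv("~/Desktop/RDocs/airbnb.data.csv")
# Read the neighbourhood shapefile
nyc2 = readOGR(dsn = "/Users/jdejager/Desktop/RDocs/Neighborhood.Tabulation.Areas-1", layer="geo_export_d282a06b-5edb-4a8a-a038-32d5494cd78e")
airbnb.data = airbnb.data %>%
  mutate(
    # Flag entire-home listings with a minimum stay under 30 nights
    illegal = ifelse(min_nights < 30 & room_type == "Entire home/apt", "Illegal", "Legal")) %>%
  group_by(nhood) %>%
  mutate(
    # Note: despite its name, med.price holds the neighbourhood mean price
    med.price = as.numeric(mean(price))
  ) %>%
  ungroup()
# Drop listings whose minimum stay exceeds one year
airbnb.data = airbnb.data %>%
  filter(min_nights <= 365)
# Bin availability with explicit 30-day break points so the bins match the labels exactly
airbnb.data$Occupany = cut(airbnb.data$availability_365,
  breaks = c(seq(0, 330, by = 30), 365), right = FALSE, include.lowest = TRUE,
  labels=c("0-29","30-59","60-89", "90-119","120-149", "150-179", "180-209", "210-239", "240-269", "270-299", "300-329", "330-365"))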
```
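> The availability bar chart later in this analysis groups `availability_365` into twelve 30-day bins. As a minimal sketch of how `cut()` assigns values to such labels when it is given explicit break points, here is a toy example; the values in `availability` are invented for illustration and are not taken from the data.

```r
# Toy availability values (invented for illustration only)
availability <- c(0, 29, 30, 179, 180, 329, 330, 365)

# Twelve labels, one per 30-day bin
labels <- c("0-29", "30-59", "60-89", "90-119", "120-149", "150-179",
            "180-209", "210-239", "240-269", "270-299", "300-329", "330-365")

# Thirteen break points give twelve intervals; right = FALSE makes each
# interval [lower, upper), and include.lowest = TRUE closes the last one at 365
breaks <- c(seq(0, 330, by = 30), 365)
bins <- cut(availability, breaks = breaks, labels = labels,
            right = FALSE, include.lowest = TRUE)

print(data.frame(availability, bin = bins))
```

> Passing a single number such as `breaks = 12` instead would split the observed range into twelve equal-width intervals, so the edges would only approximately line up with 30-day labels.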
# Distributions
> To start things off, let's look at the distributions of Price, Reviews, Availability and Minimum Nights. These plots primarily investigate how the five boroughs differ when we compare the prices, reviews, and availability of each. It makes sense to expect Manhattan to be the most expensive borough, followed by Brooklyn or Staten Island.
## 1. Prices
> First up, let's take a look at how prices are distributed across the five boroughs.
```{r, warning = FALSE, echo=FALSE}
#prices
ggplot(airbnb.data, aes(x = price, color = boro)) +
geom_density() +
xlim(0,2000) + ggtitle("Distribution of Prices") + theme_fivethirtyeight()
```
> We can clearly see that there are some outliers. The next plot excludes them and takes a closer look.
```{r, echo=FALSE, warning= FALSE}
#zoomed in
ggplot(airbnb.data, aes(x = price, color = boro)) +
geom_density() +
xlim(0,500) + ggtitle("Distribution of Prices - Without Outliers") + theme_fivethirtyeight()
```
> Next, I thought it was worth looking at the outliers themselves. We can see that Staten Island has the largest share of price outliers.
```{r, echo=FALSE, warning= FALSE}
#outliers
ggplot(airbnb.data, aes(x = price, color = boro)) +
geom_density() + xlim(1000,6000) + ggtitle("Outliers") + theme_fivethirtyeight()
```
> Finally, because prices are heavily right-skewed, the clearest way to compare them is on a logarithmic scale. This shows that Manhattan is the priciest, followed by Brooklyn, then Staten Island, then Queens, and lastly the Bronx.
```{r, echo=FALSE, warning= FALSE}
# Density of log prices
ggplot(airbnb.data, aes(x = log(price), color = boro)) +
  geom_density() + xlim(2,8) +
  ggtitle("Distribution of Prices - Log Scale") + theme_fivethirtyeight()
# Violin and box plot of log prices; alpha is a fixed transparency, so it sits outside aes()
ggplot(airbnb.data, aes(x = boro, y = log(price), color = boro)) +
  ggtitle("Distribution of Price - Boxplot") + theme_fivethirtyeight() +
  geom_violin() + geom_boxplot(aes(fill = boro), alpha = 0.2)
```
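> To see why a log scale helps when ranking the boroughs, here is a minimal sketch on a toy data frame. All values are invented for illustration. It compares the median, the mean, and the geometric mean (the back-transformed mean of `log(price)`) per borough.

```r
# Toy listings (invented values, for illustration only)
toy <- data.frame(
  boro  = c("Manhattan", "Manhattan", "Manhattan",
            "Brooklyn", "Brooklyn", "Brooklyn",
            "Bronx", "Bronx", "Bronx"),
  price = c(150, 200, 2500, 90, 120, 1000, 50, 60, 70)
)

# The median ignores how extreme the most expensive listing is
medians <- tapply(toy$price, toy$boro, median)

# The arithmetic mean is pulled upward by a single very expensive listing
means <- tapply(toy$price, toy$boro, mean)

# The geometric mean is the mean taken on the log scale, then back-transformed
geo_means <- tapply(toy$price, toy$boro, function(p) exp(mean(log(p))))

# Boroughs ordered from most to least expensive by median price
ranking <- sort(medians, decreasing = TRUE)
print(round(cbind(median = medians, mean = means, geo_mean = geo_means), 1))
print(names(ranking))
```

> In this toy data the one 2500-dollar listing lifts Manhattan's mean far above its median, while the geometric mean stays much closer to the typical listing.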
---
## 2. Reviews per month
> The *reviews_per_month* variable gives us a good indication of how much activity goes on in each Borough. The more reviews a listing has per month, the more it gets rented out.
```{r, echo = FALSE, warning = FALSE}
ggplot(airbnb.data, aes(x = boro, reviews_per_month, color = boro)) +
ggtitle("Reviews Per Month") + theme_fivethirtyeight() + geom_boxplot() + ylim(0,5) + geom_violin(aes(fill = boro), alpha = 0.2)
```
> When we zoom into the outlier region, we can see that only a handful of listings have more than 8 reviews per month.
```{r, echo = FALSE, warning = FALSE}
ggplot(airbnb.data, aes(x = boro, reviews_per_month, color = boro)) +
ggtitle("Reviews Per Month") + theme_fivethirtyeight() + geom_boxplot() + ylim(5,12) + geom_violin(aes(fill = boro), alpha = 0.2)
```
---
## 3. Availability
> The next distribution I investigate is how availability differs across the five boroughs. What struck me as quite interesting is how many days per year these listings are available. The distribution shows that listings are either available for a very short period or for essentially all 365 days of the year.
> A quick look at estimated occupancy rates for Entire Home rentals in New York City (as of July 2, 2016) shows that more than 6,000 entire homes are being rented for more than half the year and are most likely no longer available on the rental or owner-occupied housing markets.
```{r, echo = FALSE, warning = FALSE}
# Histogram and frequency polygon share the same 10-day bin width
ggplot(airbnb.data, aes(x = availability_365, fill = boro)) +
  geom_histogram(binwidth = 10, position = "dodge") + ggtitle("Distribution of Availability") + theme_fivethirtyeight() + geom_freqpoly(aes(color = boro), binwidth = 10)
```
```{r, echo = FALSE, warning = FALSE}
ggplot(airbnb.data, aes(x = availability_365, color = boro)) +
geom_density() + ggtitle("Distribution of Availability") + theme_fivethirtyeight()
```
```{r, echo = FALSE, warning = FALSE}
# Restrict to entire homes, the listings discussed above, and plot that subset
entire.homes = airbnb.data %>% filter(room_type == "Entire home/apt")
ggplot(entire.homes, aes(x = Occupany)) + geom_bar() + theme_fivethirtyeight() + ggtitle("Availability - Entire Homes")
```
---
## 4. Minimum Nights
> Now let's test the claim that 72% of listings are illegal. Under state law, it is illegal to lease most homes (with the exception of one- and two-family residences) for periods of less than 30 days when the owner or tenant is not present. This means that if a listing is an "entire home" and its minimum stay is less than 30 days, it is most likely an illegal listing. The data shows that there is definitely something fishy going on.
```{r, echo = FALSE, warning = FALSE}
# alpha is a fixed transparency, so it is set outside aes()
ggplot(airbnb.data, aes(x = min_nights, fill = illegal)) + geom_area(stat = "bin", alpha = 0.8) + xlim(0,100) + theme_fivethirtyeight() +
  ggtitle("Minimum Stay")
```
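> To put a number next to the 72% claim, the flagged listings can be expressed as a share. The following is a minimal sketch on a toy data frame with invented values; it applies the same rule as the `illegal` column (entire home with a minimum stay under 30 nights) and reports the share both among all listings and among entire homes only.

```r
# Toy listings (invented values, for illustration only)
toy <- data.frame(
  room_type  = c("Entire home/apt", "Entire home/apt", "Private room",
                 "Entire home/apt", "Shared room"),
  min_nights = c(2, 30, 1, 7, 3)
)

# Same rule as the analysis: entire home with a minimum stay under 30 nights
toy$illegal <- ifelse(toy$min_nights < 30 & toy$room_type == "Entire home/apt",
                      "Illegal", "Legal")

# Share of all listings that the rule flags
share_all <- mean(toy$illegal == "Illegal")

# Share of entire-home listings that the rule flags
entire <- toy[toy$room_type == "Entire home/apt", ]
share_entire <- mean(entire$illegal == "Illegal")

print(c(all_listings = share_all, entire_homes = share_entire))
```

> Which denominator to use matters: the share among all listings is always lower than or equal to the share among entire homes, so the two figures should not be compared with each other directly.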
---
# Correlation
```{r, warning = FALSE, echo=FALSE}
ggplot(aes(x = medprivateprice, y = medinc), data = abnb) +
geom_point(aes(colour = boroname)) + geom_smooth(method = lm) +
xlab("Median Private Room Price") + ylab("Median Income") +
ggtitle("Median Private Price vs. Median Income") + theme_fivethirtyeight()
```
> The plot above shows that the Median Income is correlated with the Median Price of Private Room Listings. As the Median Income increases, starting from the bottom left of the plot, so does the Median Price of Private Room Listings. The correlation coefficient is 0.69.
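> A correlation coefficient like the one quoted above can be computed with `cor()`. This is a minimal sketch on invented neighbourhood values, reusing the column names `medprivateprice` and `medinc` from the plot. It is not the computation that produced the reported 0.69.

```r
# Toy neighbourhood-level values (invented, for illustration only)
toy <- data.frame(
  medprivateprice = c(45, 55, 60, 75, 90, 110),
  medinc          = c(30000, 42000, 51000, 68000, 90000, 120000)
)

# Pearson correlation between median private-room price and median income
r <- cor(toy$medprivateprice, toy$medinc, use = "complete.obs")

# The straight line drawn by geom_smooth(method = lm) is a fit of this form
fit <- lm(medinc ~ medprivateprice, data = toy)

print(r)
print(coef(fit))
```

> The argument `use = "complete.obs"` drops neighbourhoods with a missing value in either column before the correlation is computed.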
```{r, warning = FALSE, echo=FALSE}
ggplot(aes(x = medentireprice, y = medinc), data = abnb) +
geom_point(aes(colour = boroname)) + geom_smooth(method = lm) +
xlab('Median Entire Room Price') +
ylab("Median Income") + ggtitle("Median Entire vs. Median Income") + theme_fivethirtyeight()
```
> The relationship between Median Entire Room Price and the Median Income for each neighbourhood showed a positive correlation of 0.61. Wealthier neighbourhoods list more expensive apartments. We again see that Brooklyn and Manhattan have the most expensive listings.
```{r, warning = FALSE, echo=FALSE}
ggplot(aes(x = log(medinc), y = percententire), data = abnb) +
  geom_point(aes(colour = boroname)) + geom_smooth(method = 'lm') +
  # Axis labels match the mapped variables: log income on x, percent entire on y
  xlab("Log of Median Income") +
  ylab("Percent Entire Home Listings") +
  ggtitle("% Entire vs. Median Income") + theme_fivethirtyeight()
```
> After a log transformation of median income, we observe a linear relationship, with a correlation of 0.632 between the log of median income and the percentage of listings that are entire homes. I then coloured the points by borough. We see that Brooklyn and Manhattan have the highest percentage of entire-apartment listings.
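> To illustrate why the log transformation can straighten the relationship, here is a minimal sketch on invented values. The toy `percententire` is built to rise with the log of income, so the example only shows the mechanism and says nothing about the real data.

```r
# Toy neighbourhood values (invented, for illustration only)
medinc <- c(20000, 30000, 45000, 70000, 110000, 200000)

# Percent entire-home listings, constructed to rise with log income,
# plus small fixed offsets so the points do not sit exactly on a line
percententire <- 10 * log(medinc) - 60 + c(1, -1, 0.5, -0.5, 1, -1)

# Correlation on the raw income scale versus the log income scale
r_raw <- cor(medinc, percententire)
r_log <- cor(log(medinc), percententire)

print(c(raw = r_raw, log = r_log))
```

> When the underlying relationship is closer to linear in log income, the correlation on the log scale comes out higher than on the raw scale, which is the pattern this toy example reproduces.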
---
# Geo-Spatial Analysis of Airbnb Listings in New York City
> This section of my data exploration spatially visualizes Airbnb's listings in the city. I'm going to look at prices, activity levels, and which rental units are illegal under New York law.
---
## Prices
```{r, echo = FALSE, warning = FALSE}
state_popup <- paste(airbnb.data$price)
qpal_price <- colorQuantile("YlOrRd", airbnb.data$price, n = 10)
leaflet(nyc2) %>%
addProviderTiles("CartoDB.DarkMatter") %>%
addCircleMarkers(
lng=airbnb.data$long,
lat=airbnb.data$lat,
group = "nhood",
radius = 1,
stroke = FALSE,
opacity = 1,
color = ~qpal_price(airbnb.data$price),
popup = ~state_popup) %>%
addLegend("topleft", pal = qpal_price, values = ~airbnb.data$price,
title = "Price of Listing",
opacity = 0.5
) %>% setView(lng = -73.98928, lat = 40.75042, zoom = 11)
```
> Red on the map shows the expensive listings, and yellow denotes the cheaper ones. As expected, Manhattan has the highest concentration of expensive listings, especially in areas like Soho, Chelsea and the Financial District. The next plot shows how prices are distributed across neighbourhoods.
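> The map colours come from a quantile palette with ten bins, so each colour covers roughly a tenth of the listings rather than an equal price range. As a rough sketch of that idea in base R, with invented prices and a hand-picked yellow-to-red ramp standing in for the real palette, decile binning looks like this. The leaflet implementation may differ in details such as how interval edges are closed.

```r
# Toy prices (invented, for illustration only)
price <- c(40, 55, 60, 75, 80, 95, 110, 150, 200, 250, 400, 900)

# Decile break points: 11 cut points define 10 bins
breaks <- quantile(price, probs = seq(0, 1, length.out = 11), names = FALSE)

# Assign each price to its decile (1 = cheapest tenth, 10 = most expensive tenth)
decile <- cut(price, breaks = breaks, labels = FALSE, include.lowest = TRUE)

# One colour per decile, from pale yellow to dark red (an assumed stand-in ramp)
ramp <- colorRampPalette(c("#FFFFCC", "#FD8D3C", "#800026"))(10)
colour <- ramp[decile]

print(data.frame(price, decile, colour))
```

> Because the bins are quantiles, a 900-dollar listing and a 400-dollar listing can land in neighbouring colours even though their prices are far apart.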
---
## Price by Neighbourhood
```{r, echo = FALSE, warning = FALSE}
##### BY NEIGHBORHOOD ######
nhood_popup <- paste(round(airbnb.data$med.price), airbnb.data$nhood)
qpal_price <- colorQuantile("YlOrRd", airbnb.data$med.price, n = 10)
leaflet(nyc2) %>%
addProviderTiles("CartoDB.DarkMatter") %>%
addCircleMarkers(
lng=airbnb.data$long,
lat=airbnb.data$lat,
group = "nhood",
radius = 0.1,
stroke = TRUE,
opacity = 0.2,
color = ~qpal_price(airbnb.data$med.price),
popup = ~nhood_popup) %>%
addLegend("topleft", pal = qpal_price, values = ~airbnb.data$med.price,
title = "Price of Listing",
opacity = 0.5
) %>% setView(lng = -73.98928, lat = 40.75042, zoom = 11)
```
> The plot shows exactly what we could have guessed: Manhattan has the more expensive listings on average, and the most expensive neighborhoods are concentrated around Chelsea, Soho and the Financial District. The Bronx and the outskirts of Brooklyn have the cheapest neighborhoods on average.
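> The neighbourhood value on this map is a grouped summary that is attached to every listing in the neighbourhood. Here is a minimal sketch of that pattern using base R's `ave()` on invented data. The column names `mean.price` and `median.price` are hypothetical and only serve to show how the mean and the median of a group can diverge.

```r
# Toy listings in two hypothetical neighbourhoods (invented values)
toy <- data.frame(
  nhood = c("A", "A", "A", "B", "B"),
  price = c(100, 120, 800, 60, 80)
)

# Attach the group-level mean and median to every row of the group
toy$mean.price   <- ave(toy$price, toy$nhood, FUN = mean)
toy$median.price <- ave(toy$price, toy$nhood, FUN = median)

print(toy)
```

> In neighbourhood A the single 800-dollar listing pulls the mean to 340 while the median stays at 120, so the choice of summary changes which neighbourhoods look expensive.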
---
## Type of Listing
```{r, echo = FALSE, warning = FALSE}
###### TYPE OF LISTING
state_popup <- paste(airbnb.data$room_type, airbnb.data$illegal)
# Take the domain from the data so the level names match exactly, including capitalisation
qpal_type <- colorFactor(c("blue", "red", "green"), domain = airbnb.data$room_type)
leaflet(nyc2) %>%
addProviderTiles("CartoDB.DarkMatter") %>%
addCircleMarkers(
lng=airbnb.data$long,
lat=airbnb.data$lat,
group = "nhood",
radius = 1,
stroke = FALSE,
opacity=0.2,
color = ~qpal_type(airbnb.data$room_type),
popup = state_popup) %>% setView(lng = -73.98928, lat = 40.75042, zoom = 11) %>%
addLegend("topleft", pal = qpal_type, values = ~airbnb.data$room_type,
title = "Type of Listing",
opacity = 0.5
)
```
> Above we can see how the type of listing is distributed across NYC. We can clearly see that the more expensive regions are also the regions with more entire-apartment listings. Shared rooms are rarely seen.
---
## Illegal Listings
```{r,echo = FALSE, warning = FALSE}
state_popup = ifelse(airbnb.data$illegal == "Illegal", paste("ARREST", airbnb.data$host_name), paste(airbnb.data$host_name))
airbnb.data$illegal = as.factor(airbnb.data$illegal)
airbnb.data = airbnb.data %>% group_by(illegal) %>% mutate(
legality = n()/nrow(airbnb.data))
qpal_type <- colorFactor(c("blue","green"), domain = c("Illegal", "Legal"))
leaflet(nyc2) %>%
addProviderTiles("CartoDB.DarkMatter") %>%
addCircleMarkers(
lng=airbnb.data$long,
lat=airbnb.data$lat,
group = "illegal",
radius = 1,
stroke = FALSE,
opacity=0.2,
color = ~qpal_type(airbnb.data$illegal),
popup = state_popup) %>%
addLegend("topleft", pal = qpal_type, values = ~airbnb.data$illegal,
title = "Illegal Activity",
opacity = 0.5
) %>% setView(lng = -73.98928, lat = 40.75042, zoom = 11)
```