You’ll need to set up YouTube OAuth authentication (with an application ID and secret). If you have questions, check out the package documentation or the tutorial on Storybench.
# Load packages below
library(magrittr) # Pipes %>%, %T>% and equals(), extract().
library(tidyverse) # all tidyverse packages
library(hrbrthemes) # themes for graphs
library(tuber) # youtube API
Great! Now that you have your YouTube authentication all set up, we’ll download some data into RStudio. Be sure to check out the tuber reference page and the YouTube API reference docs on how to access various metadata from YouTube videos.
We’re basing this documentation on a previous project implemented in Python, and we want to create two datasets identical to the .csv files we import in the code chunk below:
amy_destfile <- paste0("data/",
                       base::noquote(lubridate::today()),
                       "-amy.csv")
kap_destfile <- paste0("data/",
                       base::noquote(lubridate::today()),
                       "-kap.csv")
download.file(url = "https://raw.githubusercontent.com/richardcornish/sketch-comedy-data/master/csvs/amy.csv",
              destfile = amy_destfile)
download.file(url = "https://raw.githubusercontent.com/richardcornish/sketch-comedy-data/master/csvs/kap.csv",
              destfile = kap_destfile)
trying URL 'https://raw.githubusercontent.com/richardcornish/sketch-comedy-data/master/csvs/amy.csv'
Content type 'text/plain; charset=utf-8' length 19693 bytes (19 KB)
==================================================
downloaded 19 KB
trying URL 'https://raw.githubusercontent.com/richardcornish/sketch-comedy-data/master/csvs/kap.csv'
Content type 'text/plain; charset=utf-8' length 30913 bytes (30 KB)
==================================================
downloaded 30 KB
Import the datasets.
KeyAndPeeleRaw <- readr::read_csv(kap_destfile)
AmyRaw <- readr::read_csv(amy_destfile)
Parsed with column specification:
cols(
publishedAt = col_datetime(format = ""),
dislikeCount = col_double(),
url = col_character(),
id = col_character(),
commentCount = col_double(),
title = col_character(),
viewCount = col_double(),
likeCount = col_double()
)
Parsed with column specification:
cols(
publishedAt = col_datetime(format = ""),
dislikeCount = col_double(),
url = col_character(),
id = col_character(),
commentCount = col_double(),
title = col_character(),
viewCount = col_double(),
likeCount = col_double()
)
Take a look at the variables in both datasets to see what we’ll need.
# more elaborate than needed, but wanted to see how well this works
intersect(x = names(KeyAndPeeleRaw),
          y = names(AmyRaw))
[1] "publishedAt" "dislikeCount" "url" "id" "commentCount" "title" "viewCount" "likeCount"
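The complementary check is setdiff(), which returns the columns present in the first dataset but missing from the second. With the column names above (hard-coded here so no files are needed), it should come back empty:

```r
# columns from the imported .csv files (copied from the intersect() output)
kap_names <- c("publishedAt", "dislikeCount", "url", "id",
               "commentCount", "title", "viewCount", "likeCount")
amy_names <- kap_names

# columns in KeyAndPeeleRaw that are missing from AmyRaw
setdiff(x = kap_names, y = amy_names)
#> character(0)
```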
After doing a little reading in the YouTube API documentation, we discover we will have to adapt the tuber package functions to get us these columns. We’ll use the Drunk History playlist from Comedy Central’s YouTube channel as a case study.
We will be using the playlist id from the url to access the content of the videos. Be sure to check the documentation on the playlist data available from the YouTube API.
playlistId = The playlistId parameter specifies the unique ID of the playlist for which you want to retrieve playlist items. Note that even though this is an optional parameter, every request to retrieve playlist items must specify a value for either the id parameter or the playlistId parameter.
Let’s split the url for this playlist, get the playlistId, and store it in drunk_history_playlist_id.
drunk_history_playlist_id <- stringr::str_split(
  string = "https://www.youtube.com/playlist?list=PLD7nPL1U-R5pSwKIcVaIQrG5BnGMbHI5H",
  pattern = "=",
  n = 2,
  simplify = TRUE)[ , 2]
drunk_history_playlist_id
[1] "PLD7nPL1U-R5pSwKIcVaIQrG5BnGMbHI5H"
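One caveat: splitting on "=" assumes the playlist id is everything after the first "=". If a url ever carries extra query parameters (e.g. &index=5), a slightly more defensive base-R sketch might look like this (get_playlist_id() is a hypothetical helper, not part of tuber):

```r
# extract the value of the "list=" query parameter, stopping at "&" if
# other parameters follow (assumes the id itself contains no "&" or "=")
get_playlist_id <- function(url) {
  sub(".*[?&]list=([^&]+).*", "\\1", url)
}

get_playlist_id("https://www.youtube.com/playlist?list=PLD7nPL1U-R5pSwKIcVaIQrG5BnGMbHI5H")
#> [1] "PLD7nPL1U-R5pSwKIcVaIQrG5BnGMbHI5H"
```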
We now take the playlistId (stored in the vector drunk_history_playlist_id) and pass it to the tuber::get_playlist_items() function, adding the part and max_results arguments. The documentation for tuber::get_playlist_items() gives the following description and arguments:
Description: Get Playlist Items.

filter = string; Required. Named vector of length 1. Potential names of the entry in the vector: item_id (comma-separated list of one or more unique playlist item IDs) or playlist_id (YouTube playlist ID).

video_id = Comma-separated list of IDs of the videos for which details are requested. Required.

part = Required. Comma-separated string including one or more of the following: contentDetails, id, snippet, status. Default: contentDetails.

max_results = Maximum number of items that should be returned. Integer. Optional. Default is 50. If over 50, all the results are returned.

simplify = Returns a data.frame rather than a list.
We’ll set part to "snippet", max_results to the number of videos in the playlist, and simplify to TRUE. We’ll store all of this in a data.frame named DrunkHistRawSnippets.
DrunkHistRawSnippets <- tuber::get_playlist_items(
  filter = c(playlist_id = drunk_history_playlist_id),
  # get snippets
  part = "snippet",
  # set this to the number of videos
  max_results = 150,
  # return a data frame
  simplify = TRUE)
DrunkHistRawSnippets %>% dplyr::glimpse(78)
Observations: 110
Variables: 28
$ .id <chr> "items1", "items2", "items3", "i…
$ kind <fct> youtube#playlistItem, youtube#pl…
$ etag <fct> "p4VTdlkQv3HQeTEaXgvLePAydmU/Jjk…
$ id <fct> UExEN25QTDFVLVI1cFN3S0ljVmFJUXJH…
$ snippet.publishedAt <fct> 2019-09-26T18:34:54.000Z, 2019-0…
$ snippet.channelId <fct> UCUsN5ZwHx2kILm84-jPDeXw, UCUsN5…
$ snippet.title <fct> "Alexander Hamilton’s Salacious …
$ snippet.description <fct> "Alexander Hamilton is mired in …
$ snippet.thumbnails.default.url <fct> https://i.ytimg.com/vi/3yi0FNwOY…
$ snippet.thumbnails.default.width <int> 120, 120, 120, 120, 120, 120, 12…
$ snippet.thumbnails.default.height <int> 90, 90, 90, 90, 90, 90, 90, 90, …
$ snippet.thumbnails.medium.url <fct> https://i.ytimg.com/vi/3yi0FNwOY…
$ snippet.thumbnails.medium.width <int> 320, 320, 320, 320, 320, 320, 32…
$ snippet.thumbnails.medium.height <int> 180, 180, 180, 180, 180, 180, 18…
$ snippet.thumbnails.high.url <fct> https://i.ytimg.com/vi/3yi0FNwOY…
$ snippet.thumbnails.high.width <int> 480, 480, 480, 480, 480, 480, 48…
$ snippet.thumbnails.high.height <int> 360, 360, 360, 360, 360, 360, 36…
$ snippet.thumbnails.standard.url <fct> https://i.ytimg.com/vi/3yi0FNwOY…
$ snippet.thumbnails.standard.width <int> 640, 640, 640, 640, 640, 640, 64…
$ snippet.thumbnails.standard.height <int> 480, 480, 480, 480, 480, 480, 48…
$ snippet.thumbnails.maxres.url <fct> https://i.ytimg.com/vi/3yi0FNwOY…
$ snippet.thumbnails.maxres.width <int> 1280, 1280, 1280, 1280, 1280, 12…
$ snippet.thumbnails.maxres.height <int> 720, 720, 720, 720, 720, 720, 72…
$ snippet.channelTitle <fct> Comedy Central, Comedy Central, …
$ snippet.playlistId <fct> PLD7nPL1U-R5pSwKIcVaIQrG5BnGMbHI…
$ snippet.position <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10…
$ snippet.resourceId.kind <fct> youtube#video, youtube#video, yo…
$ snippet.resourceId.videoId <fct> 3yi0FNwOYiI, agsiFcRGAQo, 4iRwHL…
We can see the columns here aren’t exactly what I wanted, but I’m getting closer! I can use the snippet.resourceId.videoId variable to get the statistics for each video (likeCount, dislikeCount, viewCount, etc.).

First I put the ids into a vector and call it drunk_hist_videoIds.
# Get the ids
drunk_hist_videoIds <- as.vector(DrunkHistRawSnippets$snippet.resourceId.videoId)
dplyr::glimpse(drunk_hist_videoIds)
chr [1:110] "3yi0FNwOYiI" "agsiFcRGAQo" "4iRwHLRS8Qw" "fD_zMzwa3VE" ...
Now that I have the Drunk History playlist video ids in a character vector (drunk_hist_videoIds), we can create a function that extracts the statistics for each video.
# Function to scrape stats for all vids
get_all_stats <- function(id) {
tuber::get_stats(video_id = id)
}
Now I can apply the get_all_stats function to the vector of video ids (drunk_hist_videoIds).
# Get stats and convert results to data frame
DrunkHistAllStatsRaw <- purrr::map_df(.x = drunk_hist_videoIds,
                                      .f = get_all_stats)
DrunkHistAllStatsRaw %>% dplyr::glimpse(78)
Observations: 109
Variables: 6
$ id <chr> "3yi0FNwOYiI", "agsiFcRGAQo", "4iRwHLRS8Qw", "fD_zMzw…
$ viewCount <chr> "374141", "152978", "182906", "110198", "106631", "15…
$ likeCount <chr> "13935", "2951", "3738", "1985", "3875", "3241", "145…
$ dislikeCount <chr> "127", "81", "138", "269", "59", "63", "332", "179", …
$ favoriteCount <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0"…
$ commentCount <chr> "602", "222", "249", "240", "291", "157", "679", "537…
The purrr::map_df() function “maps” get_all_stats() over all the video ids in the vector and returns a data.frame we call DrunkHistAllStatsRaw.
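If map_df() is new to you, here is a minimal, API-free sketch with a made-up fake_stats() function standing in for get_all_stats() (no YouTube calls needed):

```r
library(purrr)

# toy stand-in for get_all_stats(): returns a one-row data.frame per id,
# with a made-up statistic instead of a real API response
fake_stats <- function(id) {
  data.frame(id = id,
             viewCount = nchar(id) * 100,
             stringsAsFactors = FALSE)
}

ids <- c("3yi0FNwOYiI", "agsiFcRGAQo", "4iRwHLRS8Qw")

# map_df() applies fake_stats() to each id and row-binds the results
Stats <- purrr::map_df(.x = ids, .f = fake_stats)
# Stats now has one row per video id
```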
But now we notice we don’t have the name of the video.
Well, I can join the new DrunkHistAllStatsRaw back to the DrunkHistRawSnippets data.frame using the snippet.resourceId.videoId column from DrunkHistRawSnippets and the id column from DrunkHistAllStatsRaw.
DrunkHistRawStatsSnips <- DrunkHistAllStatsRaw %>%
  inner_join(x = .,
             y = DrunkHistRawSnippets,
             by = c("id" = "snippet.resourceId.videoId"))
DrunkHistRawStatsSnips %>% dplyr::glimpse(78)
Observations: 111
Variables: 33
$ id <chr> "3yi0FNwOYiI", "agsiFcRGAQo", "4…
$ viewCount <chr> "392821", "155224", "183979", "1…
$ likeCount <chr> "14912", "3010", "3769", "1993",…
$ dislikeCount <chr> "142", "82", "140", "269", "59",…
$ favoriteCount <chr> "0", "0", "0", "0", "0", "0", "0…
$ commentCount <chr> "637", "224", "249", "240", "292…
$ .id <chr> "items1", "items2", "items3", "i…
$ kind <fct> youtube#playlistItem, youtube#pl…
$ etag <fct> "p4VTdlkQv3HQeTEaXgvLePAydmU/Jjk…
$ id.y <fct> UExEN25QTDFVLVI1cFN3S0ljVmFJUXJH…
$ snippet.publishedAt <fct> 2019-09-26T18:34:54.000Z, 2019-0…
$ snippet.channelId <fct> UCUsN5ZwHx2kILm84-jPDeXw, UCUsN5…
$ snippet.title <fct> "Alexander Hamilton’s Salacious …
$ snippet.description <fct> "Alexander Hamilton is mired in …
$ snippet.thumbnails.default.url <fct> https://i.ytimg.com/vi/3yi0FNwOY…
$ snippet.thumbnails.default.width <int> 120, 120, 120, 120, 120, 120, 12…
$ snippet.thumbnails.default.height <int> 90, 90, 90, 90, 90, 90, 90, 90, …
$ snippet.thumbnails.medium.url <fct> https://i.ytimg.com/vi/3yi0FNwOY…
$ snippet.thumbnails.medium.width <int> 320, 320, 320, 320, 320, 320, 32…
$ snippet.thumbnails.medium.height <int> 180, 180, 180, 180, 180, 180, 18…
$ snippet.thumbnails.high.url <fct> https://i.ytimg.com/vi/3yi0FNwOY…
$ snippet.thumbnails.high.width <int> 480, 480, 480, 480, 480, 480, 48…
$ snippet.thumbnails.high.height <int> 360, 360, 360, 360, 360, 360, 36…
$ snippet.thumbnails.standard.url <fct> https://i.ytimg.com/vi/3yi0FNwOY…
$ snippet.thumbnails.standard.width <int> 640, 640, 640, 640, 640, 640, 64…
$ snippet.thumbnails.standard.height <int> 480, 480, 480, 480, 480, 480, 48…
$ snippet.thumbnails.maxres.url <fct> https://i.ytimg.com/vi/3yi0FNwOY…
$ snippet.thumbnails.maxres.width <int> 1280, 1280, 1280, 1280, 1280, 12…
$ snippet.thumbnails.maxres.height <int> 720, 720, 720, 720, 720, 720, 72…
$ snippet.channelTitle <fct> Comedy Central, Comedy Central, …
$ snippet.playlistId <fct> PLD7nPL1U-R5pSwKIcVaIQrG5BnGMbHI…
$ snippet.position <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10…
$ snippet.resourceId.kind <fct> youtube#video, youtube#video, yo…
Beautiful! We have all the columns we originally needed (and more!), but we should also address the formatting of the column names.

The janitor package lets us standardize the column names using clean_names() with case = "snake".
DrunkHistRawStatsSnips <- DrunkHistRawStatsSnips %>%
janitor::clean_names(dat = ., case = "snake")
DrunkHistRawStatsSnips %>% glimpse(78)
Observations: 111
Variables: 33
$ id <chr> "3yi0FNwOYiI", "agsiFcRGAQo", "4…
$ view_count <chr> "392821", "155224", "183979", "1…
$ like_count <chr> "14912", "3010", "3769", "1993",…
$ dislike_count <chr> "142", "82", "140", "269", "59",…
$ favorite_count <chr> "0", "0", "0", "0", "0", "0", "0…
$ comment_count <chr> "637", "224", "249", "240", "292…
$ id_2 <chr> "items1", "items2", "items3", "i…
$ kind <fct> youtube#playlistItem, youtube#pl…
$ etag <fct> "p4VTdlkQv3HQeTEaXgvLePAydmU/Jjk…
$ id_y <fct> UExEN25QTDFVLVI1cFN3S0ljVmFJUXJH…
$ snippet_published_at <fct> 2019-09-26T18:34:54.000Z, 2019-0…
$ snippet_channel_id <fct> UCUsN5ZwHx2kILm84-jPDeXw, UCUsN5…
$ snippet_title <fct> "Alexander Hamilton’s Salacious …
$ snippet_description <fct> "Alexander Hamilton is mired in …
$ snippet_thumbnails_default_url <fct> https://i.ytimg.com/vi/3yi0FNwOY…
$ snippet_thumbnails_default_width <int> 120, 120, 120, 120, 120, 120, 12…
$ snippet_thumbnails_default_height <int> 90, 90, 90, 90, 90, 90, 90, 90, …
$ snippet_thumbnails_medium_url <fct> https://i.ytimg.com/vi/3yi0FNwOY…
$ snippet_thumbnails_medium_width <int> 320, 320, 320, 320, 320, 320, 32…
$ snippet_thumbnails_medium_height <int> 180, 180, 180, 180, 180, 180, 18…
$ snippet_thumbnails_high_url <fct> https://i.ytimg.com/vi/3yi0FNwOY…
$ snippet_thumbnails_high_width <int> 480, 480, 480, 480, 480, 480, 48…
$ snippet_thumbnails_high_height <int> 360, 360, 360, 360, 360, 360, 36…
$ snippet_thumbnails_standard_url <fct> https://i.ytimg.com/vi/3yi0FNwOY…
$ snippet_thumbnails_standard_width <int> 640, 640, 640, 640, 640, 640, 64…
$ snippet_thumbnails_standard_height <int> 480, 480, 480, 480, 480, 480, 48…
$ snippet_thumbnails_maxres_url <fct> https://i.ytimg.com/vi/3yi0FNwOY…
$ snippet_thumbnails_maxres_width <int> 1280, 1280, 1280, 1280, 1280, 12…
$ snippet_thumbnails_maxres_height <int> 720, 720, 720, 720, 720, 720, 72…
$ snippet_channel_title <fct> Comedy Central, Comedy Central, …
$ snippet_playlist_id <fct> PLD7nPL1U-R5pSwKIcVaIQrG5BnGMbHI…
$ snippet_position <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10…
$ snippet_resource_id_kind <fct> youtube#video, youtube#video, yo…
This looks much better: no more periods or capital letters in the variable names.
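For the curious, the “snake” conversion is roughly: put an underscore before each capital letter, swap periods for underscores, and lowercase everything. A rough base-R approximation (to_snake() is a toy helper; janitor handles many more edge cases):

```r
# rough approximation of janitor::clean_names(case = "snake")
to_snake <- function(x) {
  x <- gsub("([a-z0-9])([A-Z])", "\\1_\\2", x)  # camelCase -> camel_Case
  x <- gsub("\\.", "_", x)                      # periods -> underscores
  tolower(x)
}

to_snake(c("snippet.publishedAt", "viewCount", "id"))
# e.g. "snippet.publishedAt" becomes "snippet_published_at"
```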
Ok, I was able to create a data.frame with the same columns as AmyRaw and KeyAndPeeleRaw, but wouldn’t it be nice if I could put the entire thing together into a custom function?

Ideally, I would just provide the url for a playlist (playlist_url) and the number of videos (video_count), and have the function return a data set with the columns I wanted. Let’s give it a try!
youtube_video_data <- function(playlist_url, video_count) {
  # packages -----------------------------------------------------------------
  require(stringr)
  require(tuber)
  require(purrr)
  require(dplyr)
  require(janitor)
  # step 1) split url into playlistId ----------------------------------------
  raw_url <- playlist_url
  # get playlist id
  playlist_id <- stringr::str_split(
    string = raw_url,
    pattern = "=",
    n = 2,
    simplify = TRUE)[ , 2]
  # get raw snippet df -------------------------------------------------------
  RawSnippets <- tuber::get_playlist_items(
    filter = c(playlist_id = playlist_id),
    # get snippets
    part = "snippet",
    # set this to the number of videos
    max_results = video_count,
    # return a data frame
    simplify = TRUE)
  # extract the videoId ------------------------------------------------------
  snip_videoIds <- base::as.vector(RawSnippets$snippet.resourceId.videoId)
  # create custom stats function ---------------------------------------------
  get_all_stats <- function(id) {
    tuber::get_stats(video_id = id)
  }
  # apply stats function to vector of video ids ------------------------------
  YouTubeStatsRaw <- purrr::map_df(.x = snip_videoIds,
                                   .f = get_all_stats) %>%
    # join to RawSnippets data.frame -----------------------------------------
    dplyr::inner_join(x = .,
                      y = RawSnippets,
                      by = c("id" = "snippet.resourceId.videoId"))
  # clean names --------------------------------------------------------------
  YouTubeStatsRaw <- YouTubeStatsRaw %>%
    janitor::clean_names(dat = ., case = "snake")
  # return df ----------------------------------------------------------------
  return(YouTubeStatsRaw)
}
Ok, now we should test our youtube_video_data() function with a fresh playlist, just to make sure there’s nothing fishy going on.

Paste in the url from Funny or Die’s “FOD has Tech” playlist. There are 146 videos in this playlist, so we will specify video_count = 146.
youtube_video_data(playlist_url = "https://www.youtube.com/playlist?list=PLRcB4n4CGcy8_3l1f6G1DGQCC6Jt3JEtZ",
video_count = 146) %>% dplyr::glimpse(78)
Observations: 137
Variables: 33
$ id <chr> "70YAFPoMCQc", "vbdosqKMJxA", "l…
$ view_count <chr> "200793", "3512", "1960", "6169"…
$ like_count <chr> "6414", "26", "17", "19", "466",…
$ dislike_count <chr> "152", "19", "9", "12", "35", "8…
$ favorite_count <chr> "0", "0", "0", "0", "0", "0", "0…
$ comment_count <chr> "907", NA, NA, NA, "36", "55", "…
$ id_2 <chr> "items1", "items2", "items3", "i…
$ kind <fct> youtube#playlistItem, youtube#pl…
$ etag <fct> "p4VTdlkQv3HQeTEaXgvLePAydmU/YQv…
$ id_y <fct> UExSY0I0bjRDR2N5OF8zbDFmNkcxREdR…
$ snippet_published_at <fct> 2019-06-19T14:54:07.000Z, 2019-0…
$ snippet_channel_id <fct> UCzS3-65Y91JhOxFiM7j6grg, UCzS3-…
$ snippet_title <fct> Jeff Goldblum Dishes With Regina…
$ snippet_description <fct> "Secrets don’t make friends, but…
$ snippet_thumbnails_default_url <fct> https://i.ytimg.com/vi/70YAFPoMC…
$ snippet_thumbnails_default_width <int> 120, 120, 120, 120, 120, 120, 12…
$ snippet_thumbnails_default_height <int> 90, 90, 90, 90, 90, 90, 90, 90, …
$ snippet_thumbnails_medium_url <fct> https://i.ytimg.com/vi/70YAFPoMC…
$ snippet_thumbnails_medium_width <int> 320, 320, 320, 320, 320, 320, 32…
$ snippet_thumbnails_medium_height <int> 180, 180, 180, 180, 180, 180, 18…
$ snippet_thumbnails_high_url <fct> https://i.ytimg.com/vi/70YAFPoMC…
$ snippet_thumbnails_high_width <int> 480, 480, 480, 480, 480, 480, 48…
$ snippet_thumbnails_high_height <int> 360, 360, 360, 360, 360, 360, 36…
$ snippet_thumbnails_standard_url <fct> https://i.ytimg.com/vi/70YAFPoMC…
$ snippet_thumbnails_standard_width <int> 640, 640, 640, 640, 640, 640, 64…
$ snippet_thumbnails_standard_height <int> 480, 480, 480, 480, 480, 480, 48…
$ snippet_thumbnails_maxres_url <fct> https://i.ytimg.com/vi/70YAFPoMC…
$ snippet_thumbnails_maxres_width <int> 1280, 1280, 1280, 1280, 1280, 12…
$ snippet_thumbnails_maxres_height <int> 720, 720, 720, 720, 720, 720, 72…
$ snippet_channel_title <fct> Funny Or Die, Funny Or Die, Funn…
$ snippet_playlist_id <fct> PLRcB4n4CGcy8_3l1f6G1DGQCC6Jt3JE…
$ snippet_position <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10…
$ snippet_resource_id_kind <fct> youtube#video, youtube#video, yo…
Awesome! Now this is totally reproducible: we can create new YouTube datasets with just a playlist url and the number of videos.
Just to make the test data look identical to the imported AmyRaw and KeyAndPeeleRaw .csv files, I’ll check the names and reduce the columns to only those in the original datasets.
# KeyAndPeeleRaw %>% dplyr::glimpse(78)
DrunkHistData <- DrunkHistRawStatsSnips %>%
  dplyr::select(
    published_at = snippet_published_at,
    dislike_count,
    url = snippet_thumbnails_default_url,
    id,
    comment_count,
    title = snippet_title,
    view_count,
    like_count)
# check
DrunkHistData %>% glimpse(78)
Observations: 111
Variables: 8
$ published_at <fct> 2019-09-26T18:34:54.000Z, 2019-09-19T19:37:40.000Z, 2…
$ dislike_count <chr> "142", "82", "140", "269", "59", "63", "332", "179", …
$ url <fct> https://i.ytimg.com/vi/3yi0FNwOYiI/default.jpg, https…
$ id <chr> "3yi0FNwOYiI", "agsiFcRGAQo", "4iRwHLRS8Qw", "fD_zMzw…
$ comment_count <chr> "637", "224", "249", "240", "292", "157", "679", "537…
$ title <fct> "Alexander Hamilton’s Salacious Sex Scandal (feat. Li…
$ view_count <chr> "392821", "155224", "183979", "110611", "106962", "15…
$ like_count <chr> "14912", "3010", "3769", "1993", "3892", "3248", "146…
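Note that the glimpse above shows the count columns are still stored as character. If you plan to summarize or plot them, one sketch of the conversion (using dplyr’s mutate_at, with a toy two-row data.frame standing in for DrunkHistData) would be:

```r
library(dplyr)

# toy rows standing in for DrunkHistData (values copied from the glimpse)
SampleData <- data.frame(view_count    = c("392821", "155224"),
                         like_count    = c("14912", "3010"),
                         dislike_count = c("142", "82"),
                         stringsAsFactors = FALSE)

# convert every *_count column from character to numeric
SampleNumeric <- SampleData %>%
  dplyr::mutate_at(.vars = dplyr::vars(dplyr::ends_with("_count")),
                   .funs = as.numeric)
```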
Great! Now I can export this and start analyzing!
# timestamp
readr::write_csv(base::as.data.frame(DrunkHistData),
                 path = base::paste0("data/",
                                     base::noquote(lubridate::today()),
                                     "-DrunkHistData.csv"))
fs::dir_tree("data")
data
├── 2019-09-30-DrunkHistData.csv
├── 2019-09-30-amy.csv
├── 2019-09-30-kap.csv
└── README.md