@bkamapantula
Created March 27, 2021 11:58
Get live chat content from YouTube videos

Setup

pytchat

Install pytchat, which does the bulk of the work. Open a terminal and run the command below.

pip install pytchat

ray

Install ray, a distributed processing utility, used here to fetch the chats of several videos in parallel. Open a terminal and run the command below.

pip install ray

CSV file

The Python script reads AllScraped.csv as its input file. Despite its name, the current_urls column holds YouTube video IDs, not full URLs. The first few rows look like this:

current_urls,Session Name
YnK_V8Sxteo,ParaView Tutorial
Sj-YhKszTGw,MLUI 2020: Machine Learning from User Interaction for Visualization and Analytics
RiG1Rn0Acn0,Color Basics for Creating Visualizations
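
As an illustration of the input format above, here is a minimal sketch that pulls the video IDs out of such a file by column name instead of by position. The helper name `read_video_ids` is hypothetical and not part of the original script; the sketch assumes the header row is spelled exactly as shown above.

```python
import csv
import io


def read_video_ids(file_obj, column="current_urls"):
    """Return the non-empty values of `column` from a CSV that has a header row.

    `read_video_ids` is an illustrative helper, not part of the original script.
    """
    reader = csv.DictReader(file_obj)
    # DictReader fills missing fields with None, hence the `or ""` guard
    return [
        (row.get(column) or "").strip()
        for row in reader
        if (row.get(column) or "").strip()
    ]


# in-memory sample using rows from the file shown above
sample = io.StringIO(
    "current_urls,Session Name\n"
    "YnK_V8Sxteo,ParaView Tutorial\n"
    "RiG1Rn0Acn0,Color Basics for Creating Visualizations\n"
)
print(read_video_ids(sample))  # ['YnK_V8Sxteo', 'RiG1Rn0Acn0']
```

With a real file, pass an open file handle instead, for example `open('AllScraped.csv', newline='')`.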

Execute

Open a terminal and run the command below to fetch the live chat for every video listed in AllScraped.csv.

python youtube-livechat.py
"""Retrieve live chat content from YouTube videos."""
import json
import csv
import pytchat
import ray
# initialize ray
ray.init()
urls = []
# AllScraped.csv has several rows. Each row has two columns.
# first column: YouTube video ID, second column: video title.
with open('AllScraped.csv', mode='r') as file_in:
    reader = csv.reader(file_in)
    for _id, row in enumerate(reader):
        # skip the header row
        if _id != 0:
            urls.append(row[0])
# after the above for loop completes, urls holds the list of video IDs.
# this list is used at the end of this script to retrieve the live chat for each ID.
@ray.remote
def fetch_live_chat(video_id):
    """Fetch live chat messages for a YouTube video ID.

    Args:
        video_id (str): video ID, i.e. the ID part of
            https://www.youtube.com/watch?v=ID

    Returns:
        None

    Writes output to `video_id.json` and `video_id.csv` files.

    Usage: fetch_live_chat.remote('XQhBHnPIsRk')
    """
    print("current ID:", video_id)
    chat = pytchat.create(video_id=video_id)
    chats = []
    keys = ['author', 'message']
    # keep polling until pytchat reports that the chat has ended
    while chat.is_alive():
        print("chat is alive...")
        for c in chat.get().sync_items():
            print(f"{c.datetime} [{c.author.name}]- {c.message}")
            chats.append({'author': c.author.name, 'message': c.message})
    # write the collected messages once, after the chat has ended
    with open(f"{video_id}.json", "w") as fout:
        json.dump(chats, fout)
    with open(f"{video_id}.csv", "w", newline='') as file_out:
        dict_writer = csv.DictWriter(file_out, keys)
        dict_writer.writeheader()
        dict_writer.writerows(chats)
def read_from_json(video_id):
    """Get comments from a JSON file."""
    with open(f"{video_id}.json") as file_in:
        return json.load(file_in)
# scrape all videos
# this invocation (function.remote(variable)) follows ray's convention.
# ray.get() blocks until every task has finished, so the script does not exit
# while the tasks are still running.
ray.get([fetch_live_chat.remote(url) for url in urls])
# after the above completes, two files exist for each video: 1) VIDEO_ID.json, 2) VIDEO_ID.csv
# if you have a set of videos that are already scraped, save their IDs in the ignore_urls list
# and run the two lines below instead of the ray.get() call above
# ignore_urls = []  # fill in the video IDs to skip
# ray.get([fetch_live_chat.remote(url) for url in urls if url not in ignore_urls])
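
The ignore_urls list mentioned above can also be built from the output files that already exist, rather than typed by hand. The sketch below is one possible way to do that. The helper names `already_scraped` and `pending` are hypothetical, and it assumes every .json file in the directory was written by this script as VIDEO_ID.json.

```python
from pathlib import Path


def already_scraped(directory="."):
    """Return the set of video IDs that already have a VIDEO_ID.json file.

    Assumption: every .json file in `directory` was written by the scraper.
    """
    return {path.stem for path in Path(directory).glob("*.json")}


def pending(urls, ignore_urls):
    """Return the video IDs in `urls` that are not in `ignore_urls`, in order."""
    return [url for url in urls if url not in ignore_urls]


# example: skip whatever has already been written to the current directory
ignore_urls = already_scraped(".")
print(pending(["YnK_V8Sxteo", "Sj-YhKszTGw"], ignore_urls))
```

The result of `pending(urls, ignore_urls)` can then be passed to the list comprehension that calls `fetch_live_chat.remote(url)`.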