@JosephRedfern
Last active August 12, 2021 02:22
Scrapes the youtube video IDs for the youtube-8m data set. Probably buggy. Could be threaded.
import requests
from collections import defaultdict

csv_prefix = "https://research.google.com/youtube8m/csv"

# Fetch the mapping of category -> list of block file URLs.
r = requests.get("{0}/verticals.json".format(csv_prefix))
verticals = r.json()

block_urls = defaultdict(list)
count = 0
for cat, urls in verticals.items():
    for url in urls:
        jsurl = "{0}/j/{1}.js".format(csv_prefix, url.split("/")[-1])
        block_urls[cat[1:]].append(jsurl)  # cat[1:] drops the first character of the key
        count += 1  # lazy.

ids_by_cat = defaultdict(list)
downloaded = 0.0
for cat_name, block_file_urls in block_urls.items():
    for block_file_url in block_file_urls:
        print("[{0}%] Downloading block file: {1} {2}".format((100.0*downloaded/count), block_file_url, cat_name))
        try:
            r = requests.get(block_file_url)
            # Use r.text (str) rather than r.content (bytes) so split() works on Python 3.
            idlist = r.text.split("\"")[3]
            ids = [n for n in idlist.split(";") if len(n) > 3]
            ids_by_cat[cat_name] += ids
        except (IndexError, IOError):  # tuple form; "except A, B:" is Python 2 only
            print("Failed to download or process block at {0}".format(block_file_url))
        downloaded += 1  # increment even if we've failed.

    # Write one file of video IDs per category.
    with open("{0}.txt".format(cat_name), "w") as idfile:
        print("Writing ids to {0}.txt".format(cat_name))
        for vid in ids_by_cat[cat_name]:
            idfile.write("{0}\n".format(vid))

print("Done.")

JosephRedfern commented Jan 12, 2021

@naveenv2 It would be fairly easy to script up a tool that requested the verticals list, pulled out the links to the different pages for each category, then made a request to the URL that provides the translation to non-anonymised Video ID (https://research.google.com/youtube8m/video_id_conversion.html).

However, unlike the previous method, doing it this way would require a request for every URL, which would take a while and feels a bit abusive.

As for the issue of video ids in the tfrecords file -- is this the case for all videos, or just some of them? As noted in the video id conversion page, "When a video gets deleted, or made private by its uploader, the lookup URL becomes invalid", so I'd expect at least some lookups to return an error.

@naveenv2

Hi @JosephRedfern,

I don't think it's a case of missing videos. I checked a couple of tfrecords. The IDs are 8 characters long (as compared to the mentioned 4-character ID), something like this:

>>> jlist[483]['features']['feature']['id']['bytesList']['value']
['bklKdg==']
>>> jlist[123]['features']['feature']['id']['bytesList']['value']
['bGVKdg==']
>>> jlist[892]['features']['feature']['id']['bytesList']['value']
['eFVKdg==']
>>> jlist[928]['features']['feature']['id']['bytesList']['value']
['TW1Kdg==']

(jlist is a list of JSON outputs extracted from a tfrecord file using this)

I suspect that the URL format mentioned on the website (/AB/ABCD.js) isn't compatible with these IDs. I also tried various combinations (like dropping the recurring '==' and 'dg==' text from the ID), but none of them got a hit.

I hope I'm looking at the right values though. Please correct me if I've missed anything.


JosephRedfern commented Jan 12, 2021

Hi @naveenv2,

Ahh, these are base64-encoded strings, and need decoding first. For example, using the base64 utility (https://linux.die.net/man/1/base64):

(base) ~ ❯❯❯ echo "bklKdg==" | base64 -d

nIJv%

This yields nIJv (the % represents the lack of newline at the end of the string), which is a valid video: https://data.yt8m.org/2/j/i/nI/nIJv.js

In Python you can use the base64 module's b64decode function (https://docs.python.org/3/library/base64.html), though there may be some method on TFRecord that can do this for you.
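A minimal sketch of the Python route, using the four encoded IDs posted above. The helper names `decode_video_id` and `lookup_url` are made up for illustration, and the URL pattern is only an assumption generalised from the single nIJv example (first two characters as the directory, full ID as the file name); it hasn't been checked against other IDs.

```python
import base64

# Assumption: lookup URLs follow the shape of the nIJv example,
# i.e. <prefix>/<first two characters of the ID>/<ID>.js
LOOKUP_PREFIX = "https://data.yt8m.org/2/j/i"


def decode_video_id(encoded_id):
    """Decode a base64-encoded ID from a tfrecord into its 4-character form."""
    return base64.b64decode(encoded_id).decode("ascii")


def lookup_url(video_id):
    """Build the (assumed) lookup URL for a decoded 4-character ID."""
    return "{0}/{1}/{2}.js".format(LOOKUP_PREFIX, video_id[:2], video_id)


# The encoded IDs quoted earlier in this thread.
for encoded in ["bklKdg==", "bGVKdg==", "eFVKdg==", "TW1Kdg=="]:
    video_id = decode_video_id(encoded)
    print(encoded, "->", video_id, lookup_url(video_id))
```

The first entry should decode to nIJv, matching the output of the base64 utility above. Lookups can still fail for videos that have been deleted or made private, so any request made to the resulting URL needs error handling.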

@naveenv2

Ah yes. Thanks for pointing it out.

This is precisely what I was looking for.

Thanks a lot! :)

@JosephRedfern

Glad I could help!
