@JosephRedfern
Last active August 12, 2021 02:22
Scrapes the YouTube video IDs for the YouTube-8M data set. Probably buggy. Could be threaded.
import requests
from collections import defaultdict

csv_prefix = "https://research.google.com/youtube8m/csv"

# Fetch the mapping of category -> list of page URLs.
r = requests.get("{0}/verticals.json".format(csv_prefix))
verticals = r.json()

# Build the list of block file URLs for each category.
block_urls = defaultdict(list)
count = 0
for cat, urls in verticals.items():
    for url in urls:
        jsurl = "{0}/j/{1}.js".format(csv_prefix, url.split("/")[-1])
        block_urls[cat[1:]].append(jsurl)  # cat[1:] drops the first character of the key
        count += 1  # lazy.

ids_by_cat = defaultdict(list)
downloaded = 0.0
for cat_name, block_file_urls in block_urls.items():
    for block_file_url in block_file_urls:
        print("[{0}%] Downloading block file: {1} {2}".format((100.0*downloaded/count), block_file_url, cat_name))
        try:
            r = requests.get(block_file_url)
            # Use r.text (str) rather than r.content (bytes) so split() works on Python 3.
            idlist = r.text.split("\"")[3]
            ids = [n for n in idlist.split(";") if len(n) > 3]
            ids_by_cat[cat_name] += ids
        except (IndexError, IOError):  # a tuple is needed to catch both exception types
            print("Failed to download or process block at {0}".format(block_file_url))
        downloaded += 1  # increment even if we've failed.

    # Write one file of IDs per category.
    with open("{0}.txt".format(cat_name), "w") as idfile:
        print("Writing ids to {0}.txt".format(cat_name))
        for vid in ids_by_cat[cat_name]:
            idfile.write("{0}\n".format(vid))

print("Done.")
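
The description notes that the script could be threaded. The following is only a rough sketch of that idea, not part of the original gist: the names parse_block, fetch_ids and fetch_all are hypothetical, and fetch is an assumed callable that takes a URL and returns the response body as a string. The parsing mirrors what the script does (fourth quote-delimited field, split on ";").

```python
from concurrent.futures import ThreadPoolExecutor


def parse_block(text):
    """Extract IDs as the script does: 4th quote-delimited field, split on ';'."""
    idlist = text.split('"')[3]
    return [n for n in idlist.split(";") if len(n) > 3]


def fetch_ids(url, fetch):
    """Download one block file via fetch(url) -> str; return [] on failure."""
    try:
        return parse_block(fetch(url))
    except (IndexError, IOError):
        print("Failed to download or process block at {0}".format(url))
        return []


def fetch_all(urls, fetch, max_workers=8):
    """Fetch many block files concurrently; returns {url: [ids]}."""
    urls = list(urls)  # iterated twice below, so materialise it first
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda u: fetch_ids(u, fetch), urls)
        return dict(zip(urls, results))
```

With requests installed, fetch could be something like lambda u: requests.get(u).text. A small max_workers keeps the load on the server modest.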
@jason51285128

curl: (6) Could not resolve host: www.yt8m.org

Has the host changed?

@JosephRedfern
Author

@zhangchenwhu It looks like it, yes -- I've updated the host, can you try again?

@jason51285128

@JosephRedfern Yes, it works! But the prefix URL in line 13 should be: https://storage.googleapis.com/data.yt8m.org/2/j/v/

@Queequeg92

Hi, does this script return the anonymised ID or the real YouTube URL?

@naveenv2

naveenv2 commented Jan 12, 2021

Hi, it seems the script doesn't currently work.

It does get the verticals list (L6), but every subsequent request for a block file URL (L24) fails with the following message:

"Failed to download or process block at https://research.google.com/youtube8m/csv/j/0py27.js"

I've also tried @jason51285128's solution of replacing the prefix in line 13, but it shows the same message.

Moreover, the video ID translation shown on the YouTube-8M website doesn't seem to work for the IDs given in the tfrecords files downloaded from the website. This has been reported in this Kaggle thread.

Any fix yet?

@JosephRedfern
Author

Hi @naveenv2,

Apologies, I've not used this in a long time. I think they must have changed the format of the IDs, or the locations at which they are stored, so I'm not sure this method will work any more.

I can try to find some time to look at it again, but I'm afraid I don't have time at the moment.

Thanks,
Joe

@naveenv2

naveenv2 commented Jan 12, 2021

Thanks for the quick reply. Seems like the links have been updated.

I found a possible solution here. However, this doesn't correlate with the IDs given in the video level features on the website.

Nevertheless, good work! Thanks! :)

@JosephRedfern
Author

JosephRedfern commented Jan 12, 2021

@naveenv2 It would be fairly easy to script up a tool that requested the verticals list, pulled out the links to the different pages for each category, then made a request to the URL that provides the translation to the non-anonymised video ID (https://research.google.com/youtube8m/video_id_conversion.html).

However, unlike the previous method, doing it this way would require a separate request for every video ID, which would take a while and feels a bit abusive.

As for the issue of video ids in the tfrecords file -- is this the case for all videos, or just some of them? As noted in the video id conversion page, "When a video gets deleted, or made private by its uploader, the lookup URL becomes invalid", so I'd expect at least some lookups to return an error.
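
A minimal sketch of the per-ID lookup described above, under stated assumptions. The URL pattern is taken from the example lookup URL that appears later in this thread (https://data.yt8m.org/2/j/i/nI/nIJv.js). The response shape i("<anonymised id>","<YouTube id>"); is an assumption, as are the hypothetical names lookup_url, parse_lookup and resolve. A delay between requests is included because one request per ID adds up quickly.

```python
import re
import time
import urllib.request

# Prefix taken from the example lookup URL in this thread.
LOOKUP_PREFIX = "https://data.yt8m.org/2/j/i"

# ASSUMPTION: the lookup file looks like  i("<anonymised id>","<YouTube id>");
RESPONSE_RE = re.compile(r'i\("([^"]+)","([^"]+)"\)')


def lookup_url(anon_id):
    """Build the lookup URL: first two characters, then the full ID."""
    return "{0}/{1}/{2}.js".format(LOOKUP_PREFIX, anon_id[:2], anon_id)


def parse_lookup(text):
    """Return the second quoted field, or None if the text does not match."""
    match = RESPONSE_RE.search(text)
    return match.group(2) if match else None


def resolve(anon_ids, delay=0.5):
    """Look up each ID in turn, pausing between requests."""
    resolved = {}
    for anon_id in anon_ids:
        try:
            with urllib.request.urlopen(lookup_url(anon_id), timeout=10) as resp:
                resolved[anon_id] = parse_lookup(resp.read().decode("utf-8"))
        except OSError:
            # Covers HTTP errors and timeouts, e.g. a deleted or private video.
            resolved[anon_id] = None
        time.sleep(delay)
    return resolved
```

IDs whose lookup fails map to None, which matches the expectation that lookups for deleted or private videos return an error.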

@naveenv2

Hi @JosephRedfern,

I don't think it's a case of missing videos. I checked a couple of tfrecords. The IDs are 8 characters long (compared with the documented 4-character IDs), something like this:

>>> jlist[483]['features']['feature']['id']['bytesList']['value']
['bklKdg==']
>>> jlist[123]['features']['feature']['id']['bytesList']['value']
['bGVKdg==']
>>> jlist[892]['features']['feature']['id']['bytesList']['value']
['eFVKdg==']
>>> jlist[928]['features']['feature']['id']['bytesList']['value']
['TW1Kdg==']

(jlist is a list of json outputs extracted for a tfrecord file using this)

I suspect that the URL format mentioned on the website (/AB/ABCD.js) isn't compatible with these IDs. I also tried various combinations (like dropping the recurring '==' and 'dg==' text from the ID), but none of them got a hit.

I hope I'm looking at the right values, though. Please correct me if I've missed anything.

@JosephRedfern
Author

JosephRedfern commented Jan 12, 2021

Hi @naveenv2,

Ahh, these are base64 encoded strings, and need decoding first. For example, using the base64 utility (https://linux.die.net/man/1/base64):

(base) ~ ❯❯❯ echo "bklKdg==" | base64 -d

nIJv%

This yields nIJv (the % represents the lack of newline at the end of the string), which is a valid video: https://data.yt8m.org/2/j/i/nI/nIJv.js

In Python you can use the base64 module's b64decode function (https://docs.python.org/3/library/base64.html), though there may be some method on TFRecord that can do this for you.
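
A short sketch of the decoding step in Python, using the ID values quoted earlier in the thread. The helper names decode_record_id and lookup_url are hypothetical, and the URL pattern is inferred from the single example URL above rather than from any documented API.

```python
import base64


def decode_record_id(encoded):
    """Decode a base64 'id' value from a tfrecord dump into the short video ID."""
    return base64.b64decode(encoded).decode("ascii")


def lookup_url(video_id):
    """Build the lookup URL, following the pattern of the example URL above."""
    return "https://data.yt8m.org/2/j/i/{0}/{1}.js".format(video_id[:2], video_id)


# The encoded values reported from the tfrecords file.
for encoded in ["bklKdg==", "bGVKdg==", "eFVKdg==", "TW1Kdg=="]:
    vid = decode_record_id(encoded)
    print(encoded, "->", vid, lookup_url(vid))
```

For "bklKdg==" this gives nIJv, matching the output of the base64 utility shown above.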

@naveenv2

Ah yes. Thanks for pointing it out.

This is precisely what I was looking for.

Thanks a lot! :)

@JosephRedfern
Author

Glad I could help!
