import requests
from collections import defaultdict

csv_prefix = "https://research.google.com/youtube8m/csv"

# Fetch the mapping of category name -> list of page URLs.
r = requests.get("{0}/verticals.json".format(csv_prefix))
verticals = r.json()

block_urls = defaultdict(list)
count = 0

# Build the block-file URL for every page in every category.
for cat, urls in verticals.items():
    for url in urls:
        jsurl = "{0}/j/{1}.js".format(csv_prefix, url.split("/")[-1])
        block_urls[cat[1:]].append(jsurl)  # cat[1:] drops the first character of the category key
        count += 1  # lazy.

ids_by_cat = defaultdict(list)
downloaded = 0.0

for cat_name, block_file_urls in block_urls.items():
    for block_file_url in block_file_urls:
        print("[{0}%] Downloading block file: {1} {2}".format((100.0 * downloaded / count), block_file_url, cat_name))
        try:
            r = requests.get(block_file_url)
            # The ID list is the second quoted string in the response body.
            idlist = r.text.split("\"")[3]
            ids = [n for n in idlist.split(";") if len(n) > 3]
            ids_by_cat[cat_name] += ids
        except (IndexError, IOError):  # requests' exceptions derive from IOError
            print("Failed to download or process block at {0}".format(block_file_url))
        downloaded += 1  # increment even if we've failed.

    # Write one file of IDs per category.
    with open("{0}.txt".format(cat_name), "w") as idfile:
        print("Writing ids to {0}.txt".format(cat_name))
        for vid in ids_by_cat[cat_name]:
            idfile.write("{0}\n".format(vid))

print("Done.")
Hi @JosephRedfern,
I don't think it's a case of missing videos. I checked a couple of tfrecords. The IDs are 8 characters long (rather than the 4-character IDs mentioned), something like this:
>>> jlist[483]['features']['feature']['id']['bytesList']['value']
['bklKdg==']
>>> jlist[123]['features']['feature']['id']['bytesList']['value']
['bGVKdg==']
>>> jlist[892]['features']['feature']['id']['bytesList']['value']
['eFVKdg==']
>>> jlist[928]['features']['feature']['id']['bytesList']['value']
['TW1Kdg==']
(jlist is a list of JSON outputs extracted from a tfrecord file using this)
I suspect that the URL format mentioned on the website (/AB/ABCD.js) isn't compatible with these IDs. I also tried various combinations (like dropping the recurring '==' and 'dg==' text from the ID), but none of them got a hit.
I hope I'm looking at the right values though. Please correct me if I missed anything.
Hi @naveenv2,
Ahh, these are base64-encoded strings and need decoding first. For example, using the base64 utility (https://linux.die.net/man/1/base64):
(base) ~ ❯❯❯ echo "bklKdg==" | base64 -d
nIJv%
This yields nIJv (the % represents the lack of a newline at the end of the string), which is a valid video: https://data.yt8m.org/2/j/i/nI/nIJv.js
In Python you can use the base64 module's b64decode function (https://docs.python.org/3/library/base64.html), though there may be some method on TFRecord that can do this for you.
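As a rough sketch of the Python route: the snippet below decodes one of the ID values quoted above and builds the lookup URL following the pattern of the example link. The helper names decode_record_id and lookup_url are made up for illustration, and the URL layout (first two characters as the directory) is inferred from that single example rather than from any documented API.

```python
import base64


def decode_record_id(encoded):
    """Decode a base64 'id' value from the TFRecord JSON dump into the short ID."""
    return base64.b64decode(encoded).decode("ascii")


def lookup_url(short_id):
    """Build the lookup URL for a short ID.

    Assumption: the directory is the first two characters of the ID,
    as in the nIJv example above.
    """
    return "https://data.yt8m.org/2/j/i/{0}/{1}.js".format(short_id[:2], short_id)


short_id = decode_record_id("bklKdg==")
print(short_id, lookup_url(short_id))
```

The same call can be applied to each jlist[i]['features']['feature']['id']['bytesList']['value'][0] entry.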
Ah yes. Thanks for pointing it out.
This is precisely what I was looking for.
Thanks a lot! :)
Glad I could help!
@naveenv2 It would be fairly easy to script up a tool that requested the verticals list, pulled out the links to the different pages for each category, then made a request to the URL that provides the translation to non-anonymised Video ID (https://research.google.com/youtube8m/video_id_conversion.html).
However, unlike the previous method, doing it this way would require a request for every URL, which would take a while and feels a bit abusive.
As for the issue of video ids in the tfrecords file -- is this the case for all videos, or just some of them? As noted in the video id conversion page, "When a video gets deleted, or made private by its uploader, the lookup URL becomes invalid", so I'd expect at least some lookups to return an error.
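For what it's worth, a per-ID lookup loop might look something like the sketch below. It assumes the lookup response has the shape i("ABCD","<YouTube ID>"); which should be checked against the conversion page before relying on it. The names parse_lookup and resolve_ids, the caller-supplied fetch function, and the delay value are all hypothetical. The pause between requests is there to keep the load polite, and IDs whose lookups fail (for example deleted or private videos) are collected rather than treated as fatal.

```python
import re
import time

# Assumed response shape: i("ABCD","<YouTube ID>");
LOOKUP_PATTERN = re.compile(r'i\("([^"]+)","([^"]+)"\)')


def parse_lookup(body):
    """Return (short_id, youtube_id), or None if the body does not match."""
    match = LOOKUP_PATTERN.search(body)
    return (match.group(1), match.group(2)) if match else None


def resolve_ids(short_ids, fetch, delay=0.5):
    """Resolve short IDs one at a time, pausing between requests.

    fetch is a caller-supplied function that takes a short ID and returns
    the response body as text, or None if the request failed.
    """
    resolved, failed = {}, []
    for short_id in short_ids:
        body = fetch(short_id)
        parsed = parse_lookup(body) if body else None
        if parsed:
            resolved[short_id] = parsed[1]
        else:
            failed.append(short_id)  # e.g. deleted or private video
        time.sleep(delay)
    return resolved, failed
```

In practice fetch would wrap requests.get on the lookup URL and return None on a non-200 status; keeping it as a parameter makes the loop easy to try out without hitting the server.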