Analysis of the RIAA claims against youtube-dl

Technical analysis of the RIAA claim against youtube-dl

This write-up follows the code paths in youtube-dl that are relevant to the claims the RIAA has put forward. This is a technical analysis, not a legal one.

Note: This analysis is based on one of the many unofficial copies of youtube-dl that have popped up during the last few days. I do not have a copy downloaded from the original repository and can in no way guarantee that these findings hold true for the original. However, given the complexity of the code and the short time frame, it seems unlikely that someone went to the trouble of injecting the code examined in this analysis.

Credits: This analysis would not have been possible without the help of Jan Wildeboer.

As a reminder, the RIAA uses the following argument to support its claim that youtube-dl is a tool primarily made for copyright infringement:

We also note that the source code prominently includes as sample uses of the source code the downloading of copies of our members’ copyrighted sound recordings and music videos, as noted in Exhibit A hereto. For example, as shown on Exhibit A, the source code expressly suggests its use to copy and/or distribute the following copyrighted works owned by our member companies:

  • Icona Pop – I Love It (feat. Charli XCX) [Official Video], owned by Warner Music Group
  • Justin Timberlake – Tunnel Vision (Explicit), owned by Sony Music Group
  • Taylor Swift – Shake it Off, owned/exclusively licensed by Universal Music Group

The RIAA claims seem to be based on the fact that the file youtube_dl/extractor/youtube.py contains the following lines:

class YoutubeIE(YoutubeBaseInfoExtractor):
    #...
    _TESTS = [
        #...
        {
            'url': 'https://www.youtube.com/watch?v=UxxajLWwzqY',
            'note': 'Test generic use_cipher_signature video (#897)',
            'info_dict': {
                'id': 'UxxajLWwzqY',
                'ext': 'mp4',
                'upload_date': '20120506',
                'title': 'Icona Pop - I Love It (feat. Charli XCX) [OFFICIAL VIDEO]',
                'alt_title': 'I Love It (feat. Charli XCX)',
                'description': 'md5:19a2f98d9032b9311e686ed039564f63',
                'tags': ['Icona Pop i love it', 'sweden', 'pop music', 'big beat records', 'big beat', 'charli',
                         'xcx', 'charli xcx', 'girls', 'hbo', 'i love it', "i don't care", 'icona', 'pop',
                         'iconic ep', 'iconic', 'love', 'it'],
                'duration': 180,
                'uploader': 'Icona Pop',
                'uploader_id': 'IconaPop',
                'uploader_url': r're:https?://(?:www\.)?youtube\.com/user/IconaPop',
                'creator': 'Icona Pop',
                'track': 'I Love It (feat. Charli XCX)',
                'artist': 'Icona Pop',
            }
        },

As you can already see, the data is assigned to a variable called _TESTS. If we search the repository for the name _TESTS we come across the README, which states the following:

  1. Start with this simple template and save it to youtube_dl/extractor/yourextractor.py:

[...]

class YourExtractorIE(InfoExtractor):
    _TEST = {
        # ...

[...]

  1. Run python test/test_download.py TestDownload.test_YourExtractor. This should fail at first, but you can continually re-run it until you're done. If you decide to add more than one test, then rename _TEST to _TESTS and make it into a list of dictionaries. The tests will then be named TestDownload.test_YourExtractor, TestDownload.test_YourExtractor_1, TestDownload.test_YourExtractor_2, etc. Note that tests with only_matching key in test's dict are not counted in.

This section of the README is intended for developers, not end users. The README seems to suggest that these files are downloaded in full during the test run, but this is not the case, as we will see below.
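The naming scheme the README describes can be sketched in a few lines. This is only an illustration of the quoted rule, not the project's actual test generator; the function name testcase_names is hypothetical.

```python
def testcase_names(extractor_name, tests):
    """Yield (test_name, test_dict) pairs following the README's naming rule.

    Hypothetical helper: tests carrying an 'only_matching' key are skipped
    and do not consume a number, as the README states.
    """
    counted = [t for t in tests if not t.get('only_matching')]
    for index, test in enumerate(counted):
        # The first test has no suffix, later ones get _1, _2, ...
        suffix = '' if index == 0 else '_%d' % index
        yield 'test_%s%s' % (extractor_name, suffix), test


example_tests = [
    {'url': 'https://example.com/a'},
    {'url': 'https://example.com/b', 'only_matching': True},
    {'url': 'https://example.com/c'},
]
names = [name for name, _ in testcase_names('YourExtractor', example_tests)]
```

The point is that every entry in _TESTS becomes one generated unit test, which is a developer-facing mechanism.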

First attempt

As a first attempt, let's look at the video IDs in question:

  • Icona Pop: UxxajLWwzqY
  • Justin Timberlake: 07FYdnEawAQ
  • Taylor Swift: nfWlot6h_JM

If we search the source code for the first ID (Icona Pop), we don't find anything apart from the original listing. If we search for the second ID (Justin Timberlake) we find the following piece of code:

class TestAgeRestriction(unittest.TestCase):
    def _assert_restricted(self, url, filename, age, old_age=None):
        self.assertTrue(_download_restricted(url, filename, old_age))
        self.assertFalse(_download_restricted(url, filename, age))

    def test_youtube(self):
        self._assert_restricted('07FYdnEawAQ', '07FYdnEawAQ.mp4', 10)

This is a test case for age-restricted content. Again, the naming seems to suggest that it downloads something. Stepping through the code with a debugger, we land in youtube_dl/YoutubeDL.py at line 2018, in the download() method of the class YoutubeDL. This method calls the following code for every URL passed to it:

res = self.extract_info(
    url, force_generic_extractor=self.params.get('force_generic_extractor', False))

This sounds like it is only fetching information. Further stepping into the code we end up in YoutubeIE > _real_extract() with the following stack trace:

  1. _real_extract, youtube.py:2070
  2. extract, common.py:532
  3. extract_info, YoutubeDL.py:797
  4. download, YoutubeDL.py:2018
  5. _download_restricted, test_age_restriction.py:29
  6. _assert_restricted, test_age_restriction.py:37
  7. test_youtube, test_age_restriction.py:41

Following this method to the end we get a very large data structure that details the metadata of the video, as well as the various download URLs for the video. (These are the URLs that contain the actual video data.)

A few steps further we end up in the method process_ie_result in YoutubeDL.py. The documentation says:

Take the result of the ie(may be modified) and resolve all unresolved
references (URLs, playlist items).

It will also download the videos if 'download'.
Returns the resolved ie_result.

The download flag is set to True for this test case, but YoutubeDL.py on line 1757 in process_info() has the following code:

if self.params.get('simulate', False):
    return

This piece of code aborts the download when simulate mode is engaged. Some test cases enable simulate mode. Others enable skip_download instead, which is caught by this check further down in the same method:

if not self.params.get('skip_download', False):
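Taken together, the two checks act as a gate in front of the actual download. The following is a minimal sketch of that logic under the assumption that nothing else influences the decision; would_download is a hypothetical helper, not a youtube-dl function.

```python
def would_download(params):
    """Return True only if neither 'simulate' nor 'skip_download' is set.

    Hypothetical helper that mirrors the two checks quoted above.
    """
    if params.get('simulate', False):
        # Corresponds to the early return in process_info().
        return False
    # Corresponds to the later 'if not skip_download' guard.
    return not params.get('skip_download', False)


results = {
    'default': would_download({}),
    'simulate': would_download({'simulate': True}),
    'skip': would_download({'skip_download': True}),
}
```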

What's with the _TESTS?

Back to the original question: what's up with the _TESTS variable? Where is it used? The easiest way to find out is to implement a wrapper class:

class ArrayLike:
    def __init__(self, data):
        self._DATA = data

    def __repr__(self):
        return f"{self.__class__.__name__}"

    def __getitem__(self, key):
        return self._DATA[key]

Then we wrap the test cases:

_TESTS = ArrayLike([...])

We can then set a breakpoint in __getitem__, which is hit from the function get_testcases in the extractor's common.py at line 2901. This method is used when running python -m unittest discover to run all unit tests. The test generator in test_download.py at line 250 then creates test methods dynamically from this dataset. The template for the test method is located at line 92 (the generator function).
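If no debugger is at hand, the same question can be answered by recording the calling frame inside __getitem__. This is a minimal sketch; TracingList and consume are hypothetical names and not part of youtube-dl, and consume merely stands in for whatever code iterates over the test list.

```python
import traceback


class TracingList:
    """List-like wrapper that records who indexes into it."""

    def __init__(self, data):
        self._data = list(data)
        self.callers = []

    def __getitem__(self, key):
        # With limit=2 the result is [caller frame, this frame].
        caller = traceback.extract_stack(limit=2)[0]
        self.callers.append((caller.name, caller.lineno))
        return self._data[key]


def consume(tests):
    # A plain for loop falls back to __getitem__ with 0, 1, 2, ...
    # until IndexError, because the wrapper defines no __iter__.
    seen = []
    for test in tests:
        seen.append(test)
    return seen


wrapped = TracingList([{'url': 'a'}, {'url': 'b'}])
seen = consume(wrapped)
caller_names = {name for name, _ in wrapped.callers}
```

Afterwards, wrapped.callers lists the function names and line numbers that touched the data, which is the same information the breakpoint reveals.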

These code paths end up in the same download function mentioned above. From there, execution reaches the download() method of the class FileDownloader in the downloader's common.py. This method invokes the actual downloader:

return self.real_download(filename, info_dict)

The invoked real_download function is implemented in a variety of classes depending on the video stream type:

  • RtspFD
  • RtmpFD
  • IsmFD
  • HttpFD
  • HlsFD
  • F4mFD
  • ExternalFD
  • DashSegmentsFD
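The relationship between download() and real_download() follows a common pattern: the base class handles the shared steps and delegates the transfer to a subclass. Below is a simplified sketch of that pattern; the class names DownloaderSketch and HttpSketch are hypothetical and the bodies are placeholders, not the project's code.

```python
class DownloaderSketch:
    """Hypothetical stand-in for a base downloader class."""

    def download(self, filename, info_dict):
        # Shared bookkeeping (checks, progress hooks, ...) would live here.
        return self.real_download(filename, info_dict)

    def real_download(self, filename, info_dict):
        raise NotImplementedError('subclasses must implement real_download')


class HttpSketch(DownloaderSketch):
    """Hypothetical stand-in for a protocol-specific downloader."""

    def real_download(self, filename, info_dict):
        # A real implementation would fetch info_dict['url'] into filename.
        return 'http -> %s' % filename


result = HttpSketch().download('video.mp4', {'url': 'https://example.com/v'})
```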

We can also let the code step into the specific cases. To make the debugging less tedious we can add a filter to test_download.py:

for n, test_case in enumerate(defs):
    if "youtube.com" in test_case['url']:
        ...

The YouTube test cases step into HttpFD to download files from googlevideo.com, the domain hosting the actual video files. The HTTP downloader honors the test flag and only downloads roughly the first 10 kB of the file. This is done by setting the chunk size in the real_download() method of the HttpFD class in http.py:

is_test = self.params.get('test', False)
chunk_size = self._TEST_FILE_SIZE if is_test else (
    info_dict.get('downloader_options', {}).get('http_chunk_size')
    or self.params.get('http_chunk_size') or 0)

The download() function inside real_download() then caps the amount of data to fetch:

# Range HTTP header may be ignored/unsupported by a webserver
# (e.g. extractor/scivee.py, extractor/bambuser.py).
# However, for a test we still would like to download just a piece of a file.
# To achieve this we limit data_len to _TEST_FILE_SIZE and manually control
# block size when downloading a file.
if is_test and (data_len is None or int(data_len) > self._TEST_FILE_SIZE):
    data_len = self._TEST_FILE_SIZE

The function then breaks out of the download loop:

if data_len is not None and byte_counter == data_len:
    break
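The effect of the cap and the loop exit can be reproduced with an in-memory stream. This is a sketch under stated assumptions: TEST_FILE_SIZE is an assumed stand-in of 10 * 1024 bytes (the actual constant is not quoted in this write-up), and capped_read is a hypothetical function, not the project's code.

```python
import io

TEST_FILE_SIZE = 10 * 1024  # assumed stand-in for _TEST_FILE_SIZE


def capped_read(stream, data_len=None, is_test=False, block_size=1024):
    """Read from stream, stopping at TEST_FILE_SIZE bytes when is_test is set."""
    if is_test and (data_len is None or int(data_len) > TEST_FILE_SIZE):
        data_len = TEST_FILE_SIZE
    byte_counter = 0
    chunks = []
    while True:
        # Never request more than what is left under the cap.
        if data_len is None:
            to_read = block_size
        else:
            to_read = min(block_size, data_len - byte_counter)
        block = stream.read(to_read)
        if not block:
            break
        chunks.append(block)
        byte_counter += len(block)
        if data_len is not None and byte_counter == data_len:
            break
    return b''.join(chunks)


payload = b'x' * 50000
test_bytes = capped_read(io.BytesIO(payload), is_test=True)
full_bytes = capped_read(io.BytesIO(payload), is_test=False)
```

With the test flag set, only the first TEST_FILE_SIZE bytes of the 50000-byte payload are read; without it, the whole payload is read.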

To summarize, during these tests youtube-dl only downloads the first ~10 kB of the video file to check that the download functionality is actually working.

Summary

  • This is a technical analysis, not a legal one.
  • The referenced songs in the RIAA takedown are only mentioned in the code for automatically testing the youtube-dl functionality in a development/test setup, not for general use.
  • The reason for adding metadata for these copyrighted videos is interoperability: only very specific videos (e.g. VEVO) have the specific mechanisms that youtube-dl aims to decode, therefore no other videos are suitable for testing against.
  • youtube-dl only downloads the first ~10 kB of the videos from YouTube during the tests, not the full video.

Disclaimer: This is the observation of one individual and has not yet been verified independently. I have also not been able to run all tests as a lot of tests seem to be broken or contain hard-coded paths specific to the environment of the original authors. Please handle accordingly.
