@jeremycline
A collection of options for the Lazy download tasks.

The Plan

After a meeting where we hashed everything out, we decided on a plan that uses the "Deferred Download" method. Beyond being correct and efficient, we also wanted to be able to track content that has been downloaded. This is accomplished by having Pulp download content at the ContentUnit level, rather than at the file level.

Overview

download_one goes away. Instead, the streamer inserts a document into a new collection which will be referred to as the deferred_downloads collection in this write-up. A new Celerybeat task is added that downloads everything in this deferred_downloads collection, deleting the entries as it goes.

The ContentUnit model is modified to include a new Boolean field, downloaded. This will require a migration that sets downloaded to True for all existing units, since any unit that already exists must already have its content.
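
A minimal sketch of what that migration might look like, assuming the unit collections are reachable through pymongo; the collection names here are illustrative, not a complete list:

    from pymongo import MongoClient

    def migrate(*args, **kwargs):
        """Set downloaded to True on every pre-existing content unit."""
        db = MongoClient()['pulp_database']
        # Illustrative subset; the real migration would cover every unit type.
        for collection_name in ('units_rpm', 'units_iso'):
            db[collection_name].update_many(
                {'downloaded': {'$exists': False}},
                {'$set': {'downloaded': True}},
            )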

download_repo is a new Pulp task. It uses the catalog to download every content unit that is part of the repository and is not flagged as downloaded on the ContentUnit.

Deferred Downloads

This portion of the plan replaces what was originally the download_one task and introduces a new task, download_deferred, and a new collection, deferred_downloads, which has the following fields:

  • unit_id
  • unit_type_id
  • Potentially more?

There will be a unique index on (unit_id, unit_type_id).
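
As a sketch, the collection could be defined as a mongoengine document; the field names come from the list above, while the class name and everything else are illustrative:

    from mongoengine import Document, StringField

    class DeferredDownload(Document):
        """A content unit the streamer has queued for background download."""
        unit_id = StringField(required=True)
        unit_type_id = StringField(required=True)

        meta = {
            'collection': 'deferred_downloads',
            # At most one entry per (unit_id, unit_type_id) pair.
            'indexes': [
                {'fields': ('unit_id', 'unit_type_id'), 'unique': True},
            ],
        }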

When the streamer receives a file request, it fetches the file using the lazy catalog. Once it has successfully served the client, the streamer inserts a record into deferred_downloads using the information found in the lazy catalog. The unique index ensures that at most one entry per content unit exists in the collection at any one time. In this way we avoid downloading a content unit many times, except in a few rare cases; see the "Known Flaws" section for more information.
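
Using the DeferredDownload sketch above, the streamer's insert might look like this; a duplicate-key error simply means the unit is already queued:

    from mongoengine import NotUniqueError

    def queue_deferred_download(unit_id, unit_type_id):
        """Record that a lazily served unit still needs a real download."""
        try:
            DeferredDownload(unit_id=unit_id, unit_type_id=unit_type_id).save()
        except NotUniqueError:
            # The unique index rejected a second entry; nothing more to do.
            pass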

The download_deferred task is dispatched at regular intervals using Celerybeat, defaulting to every thirty minutes (this value may be tuned after some real-world testing). This task does not lock and does the following (a sketch follows the list):

  1. Check to see if there are any entries in the deferred_downloads collection. If there aren't any, the task is done.
  2. Read and remove an entry from deferred_downloads. Get all entries in the lazy_content_catalog collection for the referenced unit. For each file, stat the filesystem to see if another task has already downloaded it. If not, make a DownloadRequest for each file and download everything in the unit.
  3. When a ContentUnit is completely downloaded, call set_content(), which will set ContentUnit.downloaded to True, and save the unit.
  4. GOTO 1.
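
A sketch of that loop, assuming mongoengine's QuerySet.modify for the atomic read-and-remove; load_unit, lazy_catalog_entries, and download_file are hypothetical helpers standing in for the unit lookup, the lazy catalog query, and the Nectar download covered later in this write-up:

    import os

    def download_deferred():
        """Drain the deferred_downloads collection, one unit at a time."""
        while True:
            # Steps 1 and 2: atomically read and remove an entry
            # (mongoengine's findAndModify wrapper).
            entry = DeferredDownload.objects.modify(remove=True)
            if entry is None:
                return  # The collection is empty; the task is done.
            unit = load_unit(entry.unit_type_id, entry.unit_id)
            for catalog_entry in lazy_catalog_entries(unit):
                # Stat the filesystem first: another task may have won the race.
                if not os.path.exists(catalog_entry.storage_path):
                    download_file(catalog_entry)
            # Step 3: in the real plan this happens inside set_content().
            unit.downloaded = True
            unit.save()
            # Step 4: GOTO 1 is the while loop.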

The deferred_downloads collection is purged during orphan purge.

Download Repo

This portion of the plan is meant to cover the lazy=active case, as well as to cover the case where a repository is composed of several lazy=passive repositories and the user wants Pulp to download all the content (to perform an export or similar).

This task can be triggered in several ways:

  • An importer is configured with ACTIVE lazy mode and has performed a sync. This means users want this content downloaded into Pulp in the background.
  • A user wants to ensure that all content associated with a repository is downloaded into Pulp storage. This would be required to export the repository.

The download_repo task does not make use of the deferred_downloads workflow. It downloads all units associated with the repository that are not already marked as downloaded. The task is similar to download_deferred, except that its work list is based on unit associations to the repository rather than entries in the deferred_downloads collection.
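
A sketch of how download_repo might select its work; units_in_repo stands in for whatever repository-association query Pulp provides, and download_unit is the per-unit download shown in the "Downloading Content" section below:

    def download_repo(repo_id):
        """Download every unit in the repository that is not yet downloaded."""
        for unit in units_in_repo(repo_id):  # Hypothetical association query.
            if unit.downloaded:
                continue  # Already fetched by the streamer or an earlier task.
            download_unit(unit)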

Modification to ContentUnit

The ContentUnit model must be modified to change when the storage_path is set. Currently, this is None until the set_content method is called, but for lazy repos set_content isn't called until the content is downloaded. This is problematic because the value of storage_path is needed to create the catalog entries and to publish the repository. Therefore, the value should be populated when a ContentUnit is created, and the downloaded flag will indicate whether the content is present or not.

Furthermore, the importers will need to be able to provide the ContentUnit with a relative path (which will either be a file name or a directory name in the case of a distribution) which is joined to the storage_path. This is necessary to construct the final path location for the lazy catalog.
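
A sketch of composing the path at creation time; the storage root and directory layout are illustrative:

    import os

    STORAGE_ROOT = '/var/lib/pulp/content'  # Illustrative base directory.

    def init_storage_path(unit, relative_path):
        """Set storage_path at unit creation, before any content is on disk.

        relative_path comes from the importer: a file name, or a directory
        name in the case of a distribution.
        """
        unit.storage_path = os.path.join(
            STORAGE_ROOT, unit.unit_type_id, relative_path)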

Downloading Content

There was some discussion of having the importers provide a callable to the download tasks, but it was decided that for now we will stick with using Nectar to perform the download of units.

A Nectar downloader will be configured for each importer (in the future it may be necessary to configure a downloader for each content type an importer supports). A ContentUnit will be considered downloaded when every file path in the catalog that corresponds to that unit_id is downloaded successfully.
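
A sketch of that per-importer setup, assuming Nectar's threaded HTTP downloader and event-listener interface; lazy_catalog_entries is the hypothetical catalog query used earlier:

    from nectar.config import DownloaderConfig
    from nectar.downloaders.threaded import HTTPThreadedDownloader
    from nectar.listener import DownloadEventListener
    from nectar.request import DownloadRequest

    class UnitListener(DownloadEventListener):
        """Remember failures so the unit is only flagged on full success."""
        def __init__(self):
            self.failed = False

        def download_failed(self, report):
            self.failed = True

    def download_unit(unit):
        """Download every catalog file for a unit with one Nectar downloader."""
        listener = UnitListener()
        downloader = HTTPThreadedDownloader(DownloaderConfig(), listener)
        requests = [DownloadRequest(entry.url, entry.storage_path)
                    for entry in lazy_catalog_entries(unit)]
        downloader.download(requests)
        if not listener.failed:
            unit.downloaded = True  # Every file landed, so the unit is complete.
            unit.save()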

Although this introduces some complexity to the download tasks, it was judged to be better than introducing the overhead (and complexity in its own right) of having the importer produce a callable that the download tasks managed in a process or thread pool. For example, Nectar reuses TCP connections whereas the callable approach would not be able to.

Known Flaws

This plan, like all the others, has a few known efficiency problems. There are several cases, outlined below, where content is downloaded multiple times by Pulp from the Squid proxy. Although this does not touch an external network, it is still undesirable since it consumes disk I/O unnecessarily.

Multiple Downloads

A content unit could be downloaded multiple times if a client requests a file in that unit, and then a download_repo task for a repository that contains that unit and the Celerybeat deferred_downloads task run at the same time and happen to process that content unit simultaneously.

A content unit could be downloaded multiple times if the deferred_downloads task is set to run often enough that a new task is dispatched before the old one is finished. If those tasks select the same units at the same time, they could download the same content twice. This is a fairly narrow window as each task should be reading and then removing the document from MongoDB, but it is by no means impossible.

A content unit could be downloaded multiple times if a client is actively requesting content from a multi-file ContentUnit. This occurs if the deferred_downloads task removes an entry to process, and the client then asks for a new file (one that isn't cached in Squid). The streamer is able to add another entry for that ContentUnit because there is no longer an entry for that (unit_id, unit_type_id).

Mitigation: Have both download_repo and deferred_downloads regularly check the ContentUnit.downloaded flag on the units they are processing. This way they can detect that another task has already downloaded the unit and quit.
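
As a sketch, the check could look like this, using mongoengine's Document.reload() to refresh the unit from the database:

    def should_download(unit):
        """Re-read the unit so a flag set by another task becomes visible."""
        unit.reload()
        return not unit.downloaded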

Lost Downloads

Since the deferred_downloads task removes entries from the collection, it is possible for a lazy=passive download to be lost by Pulp if the worker is killed before it finishes the download, but after it has removed the database record(s).

Mitigations: Have the deferred_downloads task remove relatively few entries at a time. This is a matter of balancing the performance of parallelizing downloads against the risk of losing entries and having to wait for the Squid cache to expire so that the streamer adds the entry back to the deferred_downloads collection. A user can also dispatch a download_repo task if they want these lost units to be downloaded by Pulp.

@bmbouter

Is more input needed to reach a decision on using Nectar vs. a callable from the importer? If a decision is reached, can the one we're not doing be removed from the write-up? I don't have a clear recommendation on which to do, but I slightly lean towards Nectar since it manages downloads already, versus writing new code to drive multiprocessing.pool.ThreadPool. I'm fine with either method, as long as it works.

@bmbouter

After thinking about this more, we could avoid the migration altogether and introduce an @property on ContentUnit named downloaded instead of a field. The property would look like this:

    @property
    def downloaded(self):
        """
        Indicates if a unit has been downloaded or not by checking if storage_path has been set.

        :returns: True if the unit has been downloaded.
        :rtype: bool
        """
        return self.storage_path is not None

If we do this, it may obsolete this comment altogether.

@bmbouter

Responding to this comment: yes, storage_path is only set when set_content() is called. Otherwise it's None.

@bmbouter

Note the following in-progress field renaming being done (but not yet merged) to ContentUnit:

  • source_location is being renamed to _source_location
  • unit_type_id is being renamed to _content_type_id

@mhrivnak

I think we decided that the download_all task will not lock. We want the ability to publish repos without waiting on downloads.

@jortel

jortel commented Oct 28, 2015

The current behavior, where the storage_path is set on the ContentUnit only when the content is saved along with the unit, will need to be changed to support lazy. Currently, FileContentUnit.set_content() causes two things to happen:

  1. The value of storage_path is determined.
  2. The content is copied from a temporary location into the platform storage (at storage_path)

The problem with this behavior is that the storage_path needs to be set before content is downloaded and copied into platform storage, in order to support:

  1. Publishing, which needs the path as a target for symbolic links.
  2. Building catalog entries.

One solution would be to set the storage_path when the unit is created, using relative path information provided by the importer. The relative path would, at least, include the file name. For RPMs it may also include the checksum, for example: checksum/zoo.rpm. The set_content() method would then only cause the content to be copied into the already known storage_path.
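
Under this proposal set_content() shrinks to a copy. A sketch, using shutil and ignoring the directory (distribution) case:

    import shutil

    def set_content(self, source_location):
        """Copy content from a temporary location into the already known storage_path."""
        shutil.copy(source_location, self.storage_path)
        self.downloaded = True
        self.save()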

Having the storage_path set on unit creation means that ContentUnit.downloaded needs to be a real field.
