@jeremycline
Last active October 28, 2015 20:24
A collection of options for the Lazy download tasks.

The Plan

After a meeting where we hashed everything out, we decided on a plan that uses the "Deferred Download" method. Beyond being correct and efficient, we also want to be able to track which content has been downloaded. This is done by having Pulp download content at the ContentUnit level, rather than at the file level.

Overview

download_one goes away. Instead, the streamer inserts a document into a new collection which will be referred to as the deferred_downloads collection in this write-up. A new Celerybeat task is added that downloads everything in this deferred_downloads collection, deleting the entries as it goes.

The ContentUnit model is modified to include a new Boolean field, downloaded. This will require a migration which sets the ContentUnit.downloaded boolean to True for all existing units.

download_repo is a new Pulp task. This task uses the catalog to download every content unit in the repository that is not flagged as downloaded on the ContentUnit.

Deferred Downloads

This portion of the plan replaces what was originally the download_one task and introduces a new task, download_deferred, and a new collection, deferred_downloads, which has the following fields:

  • unit_id
  • unit_type_id
  • Potentially more?

There will be a unique index on (unit_id, unit_type_id).
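
As a concrete illustration, a mongoengine document for this collection might look roughly like the sketch below. Only unit_id, unit_type_id, and the unique compound index come from the plan; the class name and collection wiring are assumptions.

    from mongoengine import Document, StringField


    class DeferredDownload(Document):
        """One pending deferred download for a single content unit."""
        unit_id = StringField(required=True)
        unit_type_id = StringField(required=True)

        meta = {
            'collection': 'deferred_downloads',
            # At most one entry per (unit_id, unit_type_id).
            'indexes': [
                {'fields': ('unit_id', 'unit_type_id'), 'unique': True},
            ],
        }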

When the streamer receives a file request, it fetches the file using the lazy catalog. Once it has successfully served the client, the streamer inserts a record into deferred_downloads using the information found in the lazy catalog. The unique index on deferred_downloads ensures that at most one entry per content unit exists in the collection at any time. In this way we avoid downloading a content unit many times, except in a few rare cases. See the "Known Flaws" section for more information.

The download_deferred task is dispatched at regular intervals using Celerybeat. This will default to every thirty minutes (this value may be tweaked after some real-world testing). This task does not lock and does the following:

  1. Check to see if there are any entries in the deferred_downloads collection. If there aren't any, the task is done.
  2. Read and remove an entry from deferred_downloads. Get all entries in the lazy_content_catalog collection for the referenced unit. For each file, stat the filesystem to see if another task has already downloaded the file. If not, make a DownloadRequest for each file and download everything in the unit.
  3. When a ContentUnit is completely downloaded, call set_content(), which will set ContentUnit.downloaded to True, and save the unit.
  4. GOTO 1.

The deferred_downloads collection is purged during orphan purge.
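
A rough sketch of the download_deferred loop (steps 1-4 above) is shown below. It assumes the DeferredDownload document sketched earlier and hypothetical helpers (get_catalog_entries, download_files, get_unit); it is illustrative only, not the actual Pulp task code.

    import os


    def download_deferred():
        while True:
            # 1. If the collection is empty, the task is done.
            entry = DeferredDownload.objects.first()
            if entry is None:
                return
            # 2. Remove the entry, gather its lazy catalog entries, and skip any
            #    files another task has already placed on the filesystem.
            entry.delete()
            catalog_entries = get_catalog_entries(entry.unit_id, entry.unit_type_id)
            missing = [e for e in catalog_entries if not os.path.exists(e.path)]
            download_files(missing)
            # 3. Every file in the unit is now downloaded; set_content() marks the
            #    unit downloaded and the unit is saved.
            unit = get_unit(entry.unit_id, entry.unit_type_id)
            unit.set_content()
            unit.save()
            # 4. GOTO 1.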

Download Repo

This portion of the plan is meant to cover the lazy=active case, as well as to cover the case where a repository is composed of several lazy=passive repositories and the user wants Pulp to download all the content (to perform an export or similar).

This task can be triggered in several ways:

  • An importer is configured with ACTIVE lazy mode and has performed a sync. This means users want this content downloaded into Pulp in the background.
  • A user wants to ensure that all content associated with a repository is downloaded into Pulp storage. This would be required to export the repository.

The download_repo task does not make use of the deferred_downloads workflow. This task will download all units associated with the repository that are not already marked as downloaded. It is similar to the download_deferred task, except its unit list is based on unit associations to repositories rather than the deferred_downloads collection.
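
A minimal sketch of how download_repo might build its unit list from repository associations is shown below; the helper names (units_in_repo, get_catalog_entries, download_files) are hypothetical.

    def download_repo(repo_id):
        for unit in units_in_repo(repo_id):
            if unit.downloaded:
                continue
            download_files(get_catalog_entries(unit.id, unit.type_id))
            unit.set_content()
            unit.save()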

Modification to ContentUnit

The ContentUnit model must be modified to change when the storage_path is set. Currently, this is None until the set_content method is called, but for lazy repos set_content isn't called until the content is downloaded. This is problematic because the value of storage_path is needed to create the catalog entries and to publish the repository. Therefore, the value should be populated when a ContentUnit is created, and the downloaded flag will indicate whether the content is downloaded or not.

Furthermore, the importers will need to be able to provide the ContentUnit with a relative path (which will either be a file name or a directory name in the case of a distribution) which is joined to the storage_path. This is necessary to construct the final path location for the lazy catalog.
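
A sketch of these ContentUnit changes is shown below. It assumes the relative path from the importer is joined onto a base storage directory to produce the storage_path at creation time; the base directory constant and method name are assumptions, not the actual Pulp model code.

    import os

    from mongoengine import BooleanField, Document, StringField

    STORAGE_ROOT = '/var/lib/pulp/content'  # assumed base directory


    class ContentUnit(Document):
        meta = {'abstract': True}

        downloaded = BooleanField(default=False)
        storage_path = StringField()

        def set_storage_path(self, relative_path):
            """Populate storage_path at unit creation, before any content is downloaded."""
            self.storage_path = os.path.join(STORAGE_ROOT, relative_path)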

Downloading Content

There was some discussion of having the importers provide a callable to the download tasks, but it was decided that for now we will stick with using Nectar to perform the download of units.

A Nectar downloader will be configured for each importer (in the future it may be necessary to configure a downloader for each content type an importer supports). A ContentUnit will be considered downloaded when every file path in the catalog that corresponds to that unit_id is downloaded successfully.

Although this introduces some complexity to the download tasks, it was judged to be better than introducing the overhead (and complexity in its own right) of having the importer produce a callable that the download tasks managed in a process or thread pool. For example, Nectar reuses TCP connections whereas the callable approach would not be able to.
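
A minimal sketch of driving Nectar for these download tasks is shown below, assuming the files are fetched over HTTP and that the lazy catalog supplies a URL and destination path for each file; the helper function and catalog attribute names are assumptions.

    from nectar.config import DownloaderConfig
    from nectar.downloaders.threaded import HTTPThreadedDownloader
    from nectar.request import DownloadRequest


    def download_unit_files(catalog_entries):
        """Download every file in the catalog for a single content unit."""
        config = DownloaderConfig()
        downloader = HTTPThreadedDownloader(config)
        requests = [DownloadRequest(entry.url, entry.path) for entry in catalog_entries]
        downloader.download(requests)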

Known Flaws

This plan, like all the others, has a few known efficiency problems. There are several cases, outlined below, where content is downloaded multiple times by Pulp from the Squid proxy. Although this does not access an external network, it is still considered undesirable since it consumes disk I/O unnecessarily.

Multiple Downloads

A content unit could be downloaded multiple times if a client requests a file in that unit, and then a download_repo task for a repository that contains that unit and the Celerybeat-dispatched download_deferred task run at the same time and happen to process that content unit simultaneously.

A content unit could be downloaded multiple times if the download_deferred task is set to run often enough that a new task is dispatched before the old one is finished. If those tasks select the same units at the same time, they could download the same content twice. This is a fairly narrow window, as each task should be reading and then removing the document from MongoDB, but it is by no means impossible.

A content unit could be downloaded multiple times if a client is actively requesting content from a multi-file ContentUnit. This occurs if the download_deferred task removes an entry to process, and the client then asks for a new file (one that isn't cached in Squid). Since there is no longer an entry for that (unit_id, unit_type_id), the streamer is able to add another entry for that ContentUnit.

Mitigation: Have both download_repo and download_deferred regularly check the ContentUnit.downloaded flag on the units they are processing. This way a task can detect that another task has already downloaded the unit and quit.
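
A tiny sketch of that check, with get_unit() as a hypothetical lookup helper:

    def already_downloaded(unit_id, unit_type_id):
        """Return True if another task has already downloaded this unit."""
        unit = get_unit(unit_id, unit_type_id)
        return unit is not None and unit.downloaded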

Lost Downloads

Since the download_deferred task removes entries from the collection, it is possible for a lazy=passive download to be lost by Pulp if the worker is killed before it finishes the download, but after it has removed the database record(s).

Mitigation: Have the download_deferred task remove relatively few entries at a time. This is a matter of balancing the performance of parallelizing downloads against the risk of losing entries and having to wait for the Squid cache to expire so that the streamer adds the entry back to the deferred_downloads collection. A user can also dispatch a download_repo task if they want these lost units to be downloaded by Pulp.

@jortel

jortel commented Oct 23, 2015

The collection will need both: unit_id and unit_type_id.

@jeremycline
Author

Re: "Read and remove an entry from deferred_downloads.": That was one bit that I wasn't very confident about. My understanding was we would have a single task that is responsible for downloading everything, but that it would remove the entries one at a time as it handled them so that if that task died an untimely death, we would only lose a few deferred_downloads rather than the entire collection.

That being said, either way would work fine. I was also wondering where we landed on the importer returning a callable vs dealing with nectar ourselves. If it returns a callable the task would probably have a multiprocessing Pool, right?

@jortel

jortel commented Oct 23, 2015

Not using nectar for concurrency and going with a generic downloader (one per file):

Pros:

  • Makes no assumptions about how a file is downloaded (HTTP, BitTorrent, etc.).
  • Less complex support for downloading units imported by different importers concurrently. That is, we don't need to aggregate downloaders, etc.
  • More easily manage concurrency across all downloads.
  • May provide a cleaner opportunity to stat the filesystem before actually downloading the file. Although, this could probably be done in the download_started() nectar callback.

Cons:

  • May not take advantage of nectar connection reuse?
  • The writing certs to files thing nectar does.

If going with generic downloader approach (which I favor), I'd start with the multiprocessing.pool.ThreadPool.

@jortel

jortel commented Oct 23, 2015

The FileContentUnit.set_content() should be updated to set downloaded=True and the download task could just call this. This would ensure that downloaded is always updated when content is stored.

@jeremycline
Author

Not that I'm necessarily advocating this, just a thought/question: storage_path is only set on a unit when set_content is called, correct? We could just check to see if that's not None. I do like the explicit flag, though.

@jortel

jortel commented Oct 23, 2015

We need a migration, right?
Just adds downloaded=<storage_path != None> on all existing units. I don't think it needs to check if the storage_path exists. Thoughts?

Also, need to update the ContentManager.add_content_unit(). If the unit has the storage_path != None and the path exists, set downloaded=True. I think this would handle new units imported/uploaded by non-mongoengine code.

@jeremycline
Author

That sounds reasonable to me. I shall add it to the plan.

@jortel

jortel commented Oct 23, 2015

What should be the default for metadata only (no files) content units? True? False?
False seems more correct but means we'd need queries like:

  • get_units_not_downloaded(repo_id)
  • has_all_units_downloaded(repo_id)

These queries would also need to test whether storage_path != None.

Thoughts?

@jortel

jortel commented Oct 23, 2015

Oh, interesting. Yeah, I forgot that in new-style units, the storage_path is set when the unit's content is stored. We may not need the downloaded field added to the ContentUnit after all, or at least not explicitly managed.

Although, I do like the explicit (vs implicit) nature of an explicit downloaded even though it's more work.

@jortel

jortel commented Oct 23, 2015

After discussion, I think we can consider metadata-only units to be downloaded=True since the metadata is part of what needs to be downloaded. With this in mind, the migration can just blindly set the downloaded=True on existing units.

@jeremycline
Author

This write-up needs to go into more detail about the situations when a unit can be downloaded multiple times, and more generally what happens in different failure scenarios.

@jortel

jortel commented Oct 26, 2015

Further mitigation for duplicate downloads scenarios would be to check the filesystem to see if a file exists immediately before downloading it. This will be very important for units with multiple large files.

@jortel

jortel commented Oct 26, 2015

A further mitigation for lost downloads is the user just manually pushing the [download] button on a repo.

@jeremycline
Author

I don't think checking the filesystem will help us further mitigate duplicate downloads, since we will only call set_content once (when we have every file in the ContentUnit). Until that's called, the files will be in a working directory and not in their final location.

@jortel

jortel commented Oct 26, 2015

Oh, right. That's very unfortunate. Especially since the units with multiple large files have the greatest risk of duplicate download.

@jortel

jortel commented Oct 26, 2015

Might be good to mention that the deferred_downloads collection will be purged during orphan purge.

@bmbouter

When a save() call which writes an entry into DeferredDownloads is rejected due to the uniqueness constraint, we need to catch that exception explicitly. We should have a test case for this exception catching behavior.
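
A sketch of that explicit handling, assuming the DeferredDownload model sketched earlier; mongoengine raises NotUniqueError when the unique index rejects the save:

    from mongoengine import NotUniqueError


    def record_deferred_download(unit_id, unit_type_id):
        try:
            DeferredDownload(unit_id=unit_id, unit_type_id=unit_type_id).save()
        except NotUniqueError:
            # An entry for this unit already exists; nothing more to do.
            pass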

@bmbouter

Regarding the migration, yes, I think there needs to be one. Remember that ContentUnit isn't a single collection, but a base class that is inherited and used by many collections. This migration will need to add the downloaded boolean to each collection that inherits from ContentUnit.

One other problem is that since ContentUnit is only used by mongoengine-converted plugin types, we can't do it for "all content unit types", just the mongoengine-converted ones. Since only RPM will get the lazy feature to start with, maybe we make the migration only work on the RPM-based mongoengine collections. This would leave other collections needing migrations in the future when they become lazy-aware.

If we try to include all content unit types installed on the system and any of them are not converted to mongoengine, then when those records are modified the downloaded field will get removed by the old-style manager upon a re-save. This further motivates a design where the migration only adds downloaded to RPM units.

NOTE: the ContentUnit field downloaded should default to False. I think that needs to be added to the plan. Only the migration will specifically set it to True for RPM content units at upgrade time to 2.8.0.
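
A hedged sketch of what such a migration could look like, using pymongo directly; the database and collection names here are assumptions, and only RPM-family collections are touched:

    from pymongo import MongoClient


    def migrate():
        db = MongoClient()['pulp_database']  # assumed database name
        for name in ('units_rpm', 'units_srpm', 'units_drpm'):  # assumed collections
            # Existing units already have their content on disk, so flag them.
            db[name].update_many({}, {'$set': {'downloaded': True}})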

@bmbouter

Can some clarification be added regarding download_repo and whether it will use locking or not? I assume it will use locking and the same lock name that will prevent a concurrent publish of that repo. Is that right?

Also, will the name of the periodic task that Celerybeat dispatches be deferred_downloads? It could be referred to by name earlier in the write-up for clarity. I assume deferred_downloads will not use locking. If that is true, can that detail be added to the plan also?

@bmbouter

Regarding the DeferredDownload algorithm steps, there isn't a step that determines if the unit has already been downloaded. It's not stated that the filesystem will be checked, and it's not stated that the downloaded field for the content unit will be checked. One of these two mechanisms should be a step to prevent re-downloading of already-downloaded content, right?

I'll suggest that the downloaded field be checked, for the same reason as suggested in this comment. In either case the algorithm probably needs a step added to it.

@bmbouter

One great behavior that comes out of this design is that set_content() moves content into place as part of a pre_save signal, and the downloaded field is set as part of the save. This ensures that by the time the DB is told downloaded=True the content is already there. This is great.

@bmbouter

Is more input needed to reach a decision on using Nectar vs a Callable from the Importer? If a decision is reached, can the one we're not doing be removed from the write-up? I don't have a clear recommendation on which to do, but I slightly lean towards Nectar since it manages downloads already, versus writing new code to drive multiprocessing.pool.ThreadPool. I'm fine with either method, as long as it works.

@bmbouter

After thinking about this more, we could avoid the migration altogether and introduce an @property on ContentUnit named downloaded instead of a field. This property would be like this:

    @property
    def downloaded(self):
        """
        Indicates if a unit has been downloaded or not by checking if storage_path has been set.

        :returns: True if the unit has been downloaded.
        :rtype: bool
        """
        return self.storage_path is not None

If we do this, it may obsolete this comment altogether.

@bmbouter

Responding to this comment, yes, storage_path is only set when set_content() is called. Otherwise it's None.

@bmbouter

Note the following in-progress field name renaming being done (but not yet merged) to ContentUnit.

  • source_location is being renamed to _source_location
  • unit_type_id is being renamed to _content_type_id

@mhrivnak

I think we decided that the download_all task will not lock. We want the ability to publish repos without waiting on downloads.

@jortel

jortel commented Oct 28, 2015

The current behavior, where the storage_path is set on the ContentUnit only when content is saved along with the unit, will need to be changed to support Lazy. Currently, FileContentUnit.set_content() causes two things to happen:

  1. The value of storage_path is determined.
  2. The content is copied from a temporary location into platform storage (at storage_path).

The problem with this behavior is that the storage_path needs to be set before content is downloaded and copied into platform storage to support:

  1. publishing - need it to create symbolic links to something
  2. building catalog entries.

One solution would be to set the storage_path when the unit is created using relative path information provided by the importer. The relative path would, at least, include the file name. For RPMs this may also include the checksum. For example: checksum/zoo.rpm. The set_content() method would then only cause the content to be copied into the already-known storage_path.

Having the storage_path set on unit creation means that ContentUnit.downloaded needs to be a real field.
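
A rough sketch of that revised flow is below: set_content() only records where the downloaded content currently lives, and a pre_save hook copies it into the already-known storage_path and flips downloaded before the save completes (the ordering noted earlier). The signal wiring and attribute names are assumptions, building on the ContentUnit sketched above.

    import shutil

    from mongoengine import signals


    class FileContentUnit(ContentUnit):

        def set_content(self, source_location):
            """Record the temporary location of the downloaded content."""
            self._source_location = source_location

        @classmethod
        def pre_save_signal(cls, sender, document, **kwargs):
            """Copy content into the already-known storage_path, then mark it downloaded."""
            if getattr(document, '_source_location', None):
                shutil.copy(document._source_location, document.storage_path)
                document.downloaded = True


    signals.pre_save.connect(FileContentUnit.pre_save_signal, sender=FileContentUnit)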
