@jeremycline
Last active October 28, 2015 20:24
A collection of options for the Lazy download tasks.

The Plan

After a meeting where we hashed everything out, we decided on a plan that uses the "Deferred Download" method. Beyond being correct and efficient, we also want to be able to track which content has been downloaded. This is done by having Pulp download content at the ContentUnit level, rather than at the file level.

Overview

download_one goes away. Instead, the streamer inserts a document into a new collection which will be referred to as the deferred_downloads collection in this write-up. A new Celerybeat task is added that downloads everything in this deferred_downloads collection, deleting the entries as it goes.

The ContentUnit model is modified to include a new Boolean field, downloaded. This will require a migration which sets the ContentUnit.downloaded boolean to True for all existing units.

download_repo is a new Pulp task. This task uses the catalog to download every content unit in the repository that is not flagged as downloaded on the ContentUnit.

Deferred Downloads

This portion of the plan replaces what was originally the download_one task and introduces a new task, download_deferred, and a new collection, deferred_downloads, which has the following fields:

  • unit_id
  • unit_type_id
  • Potentially more?

There will be a unique index on (unit_id, unit_type_id).
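
As a concrete illustration, a mongoengine document for this collection might look roughly like the sketch below. Only unit_id, unit_type_id, and the unique compound index come from the plan; the class name and collection wiring are assumptions.

    from mongoengine import Document, StringField


    class DeferredDownload(Document):
        """One pending deferred download for a single content unit."""
        unit_id = StringField(required=True)
        unit_type_id = StringField(required=True)

        meta = {
            'collection': 'deferred_downloads',
            # At most one entry per (unit_id, unit_type_id).
            'indexes': [
                {'fields': ('unit_id', 'unit_type_id'), 'unique': True},
            ],
        }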

When the streamer receives a file request, it fetches the file using the lazy catalog. Once it has successfully served the client, the streamer inserts a record into deferred_downloads using the information found in the lazy catalog. The unique index on deferred_downloads ensures that at most one entry per content unit exists in the collection at any time. In this way we avoid downloading a content unit many times, except in a few rare cases. See the "Known Flaws" section for more information.

The download_deferred task is dispatched at regular intervals using Celerybeat. This will default to every thirty minutes (this value may be tweaked after some real-world testing). This task does not lock and does the following:

  1. Check to see if there are any entries in the deferred_downloads collection. If there aren't any, the task is done.
  2. Read and remove an entry from deferred_downloads. Get all entries in the lazy_content_catalog collection for the referenced unit. For each file, stat the filesystem to see if another task has already downloaded the file. If not, make a DownloadRequest for each file and download everything in the unit.
  3. When a ContentUnit is completely downloaded, call set_content(), which will set ContentUnit.downloaded to True, and save the unit.
  4. GOTO 1.

The deferred_downloads collection is purged during orphan purge.
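
A rough sketch of the download_deferred loop (steps 1-4 above) is shown below. It assumes the DeferredDownload document sketched earlier and hypothetical helpers (get_catalog_entries, download_files, get_unit); it is illustrative only, not the actual Pulp task code.

    import os


    def download_deferred():
        while True:
            # 1. If the collection is empty, the task is done.
            entry = DeferredDownload.objects.first()
            if entry is None:
                return
            # 2. Remove the entry, gather its lazy catalog entries, and skip any
            #    files another task has already placed on the filesystem.
            entry.delete()
            catalog_entries = get_catalog_entries(entry.unit_id, entry.unit_type_id)
            missing = [e for e in catalog_entries if not os.path.exists(e.path)]
            download_files(missing)
            # 3. Every file in the unit is now downloaded; set_content() marks the
            #    unit downloaded and the unit is saved.
            unit = get_unit(entry.unit_id, entry.unit_type_id)
            unit.set_content()
            unit.save()
            # 4. GOTO 1.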

Download Repo

This portion of the plan is meant to cover the lazy=active case, as well as to cover the case where a repository is composed of several lazy=passive repositories and the user wants Pulp to download all the content (to perform an export or similar).

This task can be triggered in several ways:

  • An importer is configured with ACTIVE lazy mode and has performed a sync. This means users want this content downloaded into Pulp in the background.
  • A user wants to ensure that all content associated with a repository is downloaded into Pulp storage. This would be required to export the repository.

The download_repo task does not make use of the deferred_downloads workflow. This task will download all units associated with the repository that are not already marked as downloaded. It is similar to the download_deferred task, except its unit list is based on unit associations to repositories rather than the deferred_downloads collection.
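
A minimal sketch of how download_repo might build its unit list from repository associations is shown below; the helper names (units_in_repo, get_catalog_entries, download_files) are hypothetical.

    def download_repo(repo_id):
        for unit in units_in_repo(repo_id):
            if unit.downloaded:
                continue
            download_files(get_catalog_entries(unit.id, unit.type_id))
            unit.set_content()
            unit.save()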

Modification to ContentUnit

The ContentUnit model must be modified to change when the storage_path is set. Currently, this is None until the set_content method is called, but for lazy repos set_content isn't called until the content is downloaded. This is problematic because the value of storage_path is needed to create the catalog entries and to publish the repository. Therefore, the value should be populated when a ContentUnit is created, and the downloaded flag will indicate whether the content is downloaded or not.

Furthermore, the importers will need to be able to provide the ContentUnit with a relative path (which will either be a file name or a directory name in the case of a distribution) which is joined to the storage_path. This is necessary to construct the final path location for the lazy catalog.
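
A sketch of these ContentUnit changes is shown below. It assumes the relative path from the importer is joined onto a base storage directory to produce the storage_path at creation time; the base directory constant and method name are assumptions, not the actual Pulp model code.

    import os

    from mongoengine import BooleanField, Document, StringField

    STORAGE_ROOT = '/var/lib/pulp/content'  # assumed base directory


    class ContentUnit(Document):
        meta = {'abstract': True}

        downloaded = BooleanField(default=False)
        storage_path = StringField()

        def set_storage_path(self, relative_path):
            """Populate storage_path at unit creation, before any content is downloaded."""
            self.storage_path = os.path.join(STORAGE_ROOT, relative_path)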

Downloading Content

There was some discussion of having the importers provide a callable to the download tasks, but it was decided that for now we will stick with using Nectar to perform the download of units.

A Nectar downloader will be configured for each importer (in the future it may be necessary to configure a downloader for each content type an importer supports). A ContentUnit will be considered downloaded when every file path in the catalog that corresponds to that unit_id is downloaded successfully.

Although this introduces some complexity to the download tasks, it was judged to be better than introducing the overhead (and complexity in its own right) of having the importer produce a callable that the download tasks managed in a process or thread pool. For example, Nectar reuses TCP connections whereas the callable approach would not be able to.
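
A minimal sketch of driving Nectar for these download tasks is shown below, assuming the files are fetched over HTTP and that the lazy catalog supplies a URL and destination path for each file; the helper function and catalog attribute names are assumptions.

    from nectar.config import DownloaderConfig
    from nectar.downloaders.threaded import HTTPThreadedDownloader
    from nectar.request import DownloadRequest


    def download_unit_files(catalog_entries):
        """Download every file in the catalog for a single content unit."""
        config = DownloaderConfig()
        downloader = HTTPThreadedDownloader(config)
        requests = [DownloadRequest(entry.url, entry.path) for entry in catalog_entries]
        downloader.download(requests)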

Known Flaws

This plan, like all the others, has a few known efficiency problems. There are several cases, outlined below, where content is downloaded multiple times by Pulp from the Squid proxy. Although this does not access an external network, it is still considered undesirable since it consumes disk I/O unnecessarily.

Multiple Downloads

A content unit could be downloaded multiple times if a client requests a file in that unit, and then a download_repo task for a repository that contains that unit and the Celerybeat-dispatched download_deferred task run at the same time and happen to process that content unit simultaneously.

A content unit could be downloaded multiple times if the download_deferred task is set to run often enough that a new task is dispatched before the old one is finished. If those tasks select the same units at the same time, they could download the same content twice. This is a fairly narrow window, as each task should be reading and then removing the document from MongoDB, but it is by no means impossible.

A content unit could be downloaded multiple times if a client is actively requesting content from a multi-file ContentUnit. This occurs if the download_deferred task removes an entry to process, and the client then asks for a new file (one that isn't cached in Squid). Since there is no longer an entry for that (unit_id, unit_type_id), the streamer is able to add another entry for that ContentUnit.

Mitigation: Have both download_repo and download_deferred regularly check the ContentUnit.downloaded flag on the units they are processing. This way a task can detect that another task has already downloaded the unit and quit.
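
A tiny sketch of that check, with get_unit() as a hypothetical lookup helper:

    def already_downloaded(unit_id, unit_type_id):
        """Return True if another task has already downloaded this unit."""
        unit = get_unit(unit_id, unit_type_id)
        return unit is not None and unit.downloaded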

Lost Downloads

Since the download_deferred task removes entries from the collection, it is possible for a lazy=passive download to be lost by Pulp if the worker is killed before it finishes the download, but after it has removed the database record(s).

Mitigation: Have the download_deferred task remove relatively few entries at a time. This is a matter of balancing the performance of parallelizing downloads against the risk of losing entries and having to wait for the Squid cache to expire so that the streamer adds the entry back to the deferred_downloads collection. A user can also dispatch a download_repo task if they want these lost units to be downloaded by Pulp.

@jortel

jortel commented Oct 23, 2015

The collection will need both: unit_id and unit_type_id.

@jeremycline
Author

Re: "Read and remove an entry from deferred_downloads.": That was one bit that I wasn't very confident about. My understanding was we would have a single task that is responsible for downloading everything, but that it would remove the entries one at a time as it handled them so that if that task died an untimely death, we would only lose a few deferred_downloads rather than the entire collection.

That being said, either way would work fine. I was also wondering where we landed on the importer returning a callable vs dealing with nectar ourselves. If it returns a callable the task would probably have a multiprocessing Pool, right?

@jortel

jortel commented Oct 23, 2015

Not using nectar for concurrency and going with a generic downloader (one per file):

Pros:

  • Makes no assumptions about how a file is downloaded (HTTP, BitTorrent, etc.).
  • Less complex support for downloading units imported by different importers concurrently. That is, we don't need to aggregate downloaders, etc.
  • More easily manage concurrency across all downloads.
  • May provide a cleaner opportunity to stat the filesystem before actually downloading the file. Although, this could probably be done in the download_started() nectar callback.

Cons:

  • May not take advantage of nectar connection reuse?
  • The writing certs to files thing nectar does.

If going with generic downloader approach (which I favor), I'd start with the multiprocessing.pool.ThreadPool.

@jortel

jortel commented Oct 23, 2015

The FileContentUnit.set_content() should be updated to set downloaded=True and the download task could just call this. This would ensure that downloaded is always updated when content is stored.

@jeremycline
Author

Not that I'm necessarily advocating this, just a thought/question: storage_path is only set on a unit when set_content is called, correct? We could just check to see if that's not None. I do like the explicit flag, though.

@jortel

jortel commented Oct 23, 2015

We need a migration, right?
Just adds downloaded=<storage_path != None> on all existing units. I don't think it needs to check if the storage_path exists. Thoughts?

Also, need to update the ContentManager.add_content_unit(). If the unit has the storage_path != None and the path exists, set downloaded=True. I think this would handle new units imported/uploaded by non-mongoengine code.

@jeremycline
Author

That sounds reasonable to me. I shall add it to the plan.

@jortel

jortel commented Oct 23, 2015

What should be the default for metadata only (no files) content units? True? False?
False seems more correct but means we'd need queries like:

  • get_units_not_downloaded(repo_id)
  • has_all_units_downloaded(repo_id)

These queries would also need to test whether storage_path != None.

Thoughts?

@jortel

jortel commented Oct 23, 2015

Oh, interesting. Yeah, I forgot that in new-style units, the storage_path is set when the unit's content is stored. We may not need the downloaded field added to the ContentUnit after all, or at least not explicitly managed.

Although, I do like the explicit (vs implicit) nature of an explicit downloaded even though it's more work.

@jortel

jortel commented Oct 23, 2015

After discussion, I think we can consider metadata-only units to be downloaded=True since the metadata is part of what needs to be downloaded. With this in mind, the migration can just blindly set the downloaded=True on existing units.

@jeremycline
Author

This write-up needs to go into more detail about the situations when a unit can be downloaded multiple times, and more generally what happens in different failure scenarios.

@jortel

jortel commented Oct 26, 2015

Further mitigation for duplicate downloads scenarios would be to check the filesystem to see if a file exists immediately before downloading it. This will be very important for units with multiple large files.

@jortel

jortel commented Oct 26, 2015

A further mitigation for lost downloads is the user just manually pushing the [download] button on a repo.

@jeremycline
Author

I don't think checking the filesystem will help us further mitigate duplicate downloads, since we will only call set_content once (when we have every file in the ContentUnit). Until that's called, the files will be in a working directory and not in their final location.

@jortel

jortel commented Oct 26, 2015

Oh, right. That's very unfortunate. Especially since the units with multiple large files have the greatest risk of duplicate download.

@jortel

jortel commented Oct 26, 2015

Might be good to mention that the deferred_downloads collection will be purged during orphan purge.

@bmbouter

When a save() call which writes an entry into DeferredDownloads is rejected due to the uniqueness constraint, we need to catch that exception explicitly. We should have a test case for this exception catching behavior.
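
A sketch of that explicit handling, assuming the DeferredDownload model sketched earlier; mongoengine raises NotUniqueError when the unique index rejects the save:

    from mongoengine import NotUniqueError


    def record_deferred_download(unit_id, unit_type_id):
        try:
            DeferredDownload(unit_id=unit_id, unit_type_id=unit_type_id).save()
        except NotUniqueError:
            # An entry for this unit already exists; nothing more to do.
            pass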

@bmbouter

Regarding the migration, yes, I think there needs to be one. Remember that ContentUnit isn't a single collection, but a base class that is inherited and used by many collections. This migration will need to add the downloaded boolean to each collection that inherits from ContentUnit.

One other problem is that since ContentUnit is only used by mongoengine-converted plugin types, we can't do it for "all content unit types", just the mongoengine-converted ones. Since only RPM will get the lazy feature to start with, maybe we make the migration only work on the RPM-based mongoengine collections. This would leave other collections needing migrations in the future when they become lazy-aware.

If we try to include all content unit types installed on the system and any of them are not converted to mongoengine, then when those records are modified the downloaded field will get removed by the old-style manager upon a re-save. This further motivates a design where the migration only adds downloaded to RPM units.

NOTE: the ContentUnit field downloaded should default to False. I think that needs to be added to the plan. Only the migration will specifically set it to True for RPM content units at upgrade time to 2.8.0.
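
A hedged sketch of what such a migration could look like, using pymongo directly; the database and collection names here are assumptions, and only RPM-family collections are touched:

    from pymongo import MongoClient


    def migrate():
        db = MongoClient()['pulp_database']  # assumed database name
        for name in ('units_rpm', 'units_srpm', 'units_drpm'):  # assumed collections
            # Existing units already have their content on disk, so flag them.
            db[name].update_many({}, {'$set': {'downloaded': True}})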

@bmbouter

Can some clarification be added regarding download_repo and whether it will use locking or not? I assume it will use locking and the same lock name that will prevent a concurrent publish of that repo. Is that right?

Also, will the name of the periodic task that Celerybeat dispatches be deferred_downloads? It could be referred to by name earlier in the write-up for clarity. I assume deferred_downloads will not use locking. If that is true, can that detail be added to the plan also?

@bmbouter

Regarding the DeferredDownload algorithm steps, there isn't a step that determines if the unit has already been downloaded. It's not stated that the filesystem will be checked, and it's not stated that the downloaded field for the content unit will be checked. One of these two mechanisms should be a step to prevent re-downloading of already-downloaded content, right?

I'll suggest that the downloaded field be checked, for the same reason as suggested in this comment. In either case the algorithm probably needs a step added to it.

@bmbouter

One great behavior that comes out of this design is that set_content() moves content into place as part of a pre_save signal, and the downloaded field is set as part of the save. This ensures that by the time the DB is told downloaded=True the content is already there. This is great.

@bmbouter

Is more input needed to reach a decision on using Nectar vs a Callable from the Importer? If a decision is reached, can the one we're not doing be removed from the write-up? I don't have a clear recommendation on which to do, but I slightly lean towards Nectar since it manages downloads already, versus writing new code to drive multiprocessing.pool.ThreadPool. I'm fine with either method, as long as it works.

@bmbouter

After thinking about this more, we could avoid the migration altogether and introduce an @property on ContentUnit named downloaded instead of a field. This property would be like this:

    @property
    def downloaded(self):
        """
        Indicates if a unit has been downloaded or not by checking if storage_path has been set.

        :returns: True if the unit has been downloaded.
        :rtype: bool
        """
        return self.storage_path is not None

If we do this, it may obsolete this comment altogether.

@bmbouter

Responding to this comment, yes, storage_path is only set when set_content() is called. Otherwise it's None.

@bmbouter

Note the following in-progress field name renaming being done (but not yet merged) to ContentUnit.

  • source_location is being renamed to _source_location
  • unit_type_id is being renamed to _content_type_id

@mhrivnak

I think we decided that the download_all task will not lock. We want the ability to publish repos without waiting on downloads.

@jortel

jortel commented Oct 28, 2015

The current behavior, where the storage_path is set on the ContentUnit only when content is saved along with the unit, will need to be changed to support Lazy. Currently, FileContentUnit.set_content() causes two things to happen:

  1. The value of storage_path is determined.
  2. The content is copied from a temporary location into platform storage (at storage_path).

The problem with this behavior is that the storage_path needs to be set before content is downloaded and copied into platform storage to support:

  1. publishing - need it to create symbolic links to something
  2. building catalog entries.

One solution would be to set the storage_path when the unit is created using relative path information provided by the importer. The relative path would, at least, include the file name. For RPMs this may also include the checksum. For example: checksum/zoo.rpm. The set_content() method would then only cause the content to be copied into the already-known storage_path.

Having the storage_path set on unit creation means that ContentUnit.downloaded needs to be a real field.
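
A rough sketch of that revised flow is below: set_content() only records where the downloaded content currently lives, and a pre_save hook copies it into the already-known storage_path and flips downloaded before the save completes (the ordering noted earlier). The signal wiring and attribute names are assumptions, building on the ContentUnit sketched above.

    import shutil

    from mongoengine import signals


    class FileContentUnit(ContentUnit):

        def set_content(self, source_location):
            """Record the temporary location of the downloaded content."""
            self._source_location = source_location

        @classmethod
        def pre_save_signal(cls, sender, document, **kwargs):
            """Copy content into the already-known storage_path, then mark it downloaded."""
            if getattr(document, '_source_location', None):
                shutil.copy(document._source_location, document.storage_path)
                document.downloaded = True


    signals.pre_save.connect(FileContentUnit.pre_save_signal, sender=FileContentUnit)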
