After a meeting where we hashed everything out, we decided on a plan that uses the "Deferred Download" method. Beyond being correct and efficient, we also wanted to be able to track which content has been downloaded. This is done by having Pulp download content at the ContentUnit level, rather than at the file level.
download_one goes away. Instead, the streamer inserts a document into a new collection, which will be referred to as the deferred_downloads collection in this write-up. A new Celerybeat task is added that downloads everything in this deferred_downloads collection, deleting the entries as it goes.
The ContentUnit model is modified to include a new Boolean field, downloaded. This will require a migration which sets the ContentUnit.downloaded boolean to True for all existing units.
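A minimal sketch of such a migration, assuming pymongo and Pulp's per-type units_* collection naming (that naming is an assumption here):

```python
def migrate(db):
    """Flag all pre-existing content units as downloaded.

    `db` is a pymongo Database; units created after this migration
    default to downloaded=False until their content is fetched.
    """
    for name in db.list_collection_names():
        if not name.startswith('units_'):
            continue
        db[name].update_many(
            {'downloaded': {'$exists': False}},
            {'$set': {'downloaded': True}},
        )
```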
download_repo is a new Pulp task. This task uses the catalog to download every content unit that is part of the repository and is not flagged as downloaded on the ContentUnit.
This portion of the plan replaces what was originally the download_one task and introduces a new task, download_deferred, and a new collection, deferred_downloads, which has the following fields:
- unit_id
- unit_type_id
- Potentially more?
There will be a unique index on (unit_id, unit_type_id).
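A minimal sketch of the collection using mongoengine; the class name and exact field types are illustrative, not final:

```python
from mongoengine import Document, StringField


class DeferredDownload(Document):
    """One queued background download per content unit."""
    unit_id = StringField(required=True)
    unit_type_id = StringField(required=True)

    meta = {
        'collection': 'deferred_downloads',
        # The compound unique index: at most one entry per unit.
        'indexes': [
            {'fields': ['unit_id', 'unit_type_id'], 'unique': True},
        ],
    }
```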
When the streamer receives a file request, it fetches the file using the lazy catalog. Once it has successfully served the client, the streamer inserts a record into deferred_downloads using the information found in the lazy catalog. The unique index on deferred_downloads ensures at most one entry per content unit exists in the collection at a time. In this way we avoid downloading a content unit many times, except in a few rare cases. See the "Known Flaws" section for more information.
The download_deferred task is dispatched at regular intervals using Celerybeat. This will default to every thirty minutes (this value may be tweaked after some real-world testing). This task does not lock and does the following:
1. Check to see if there are any entries in the deferred_downloads collection. If there aren't any, the task is done.
2. Read and remove an entry from deferred_downloads. Get all entries in the lazy_content_catalog collection for the referenced unit. For each file, stat the filesystem to see if another task has already downloaded the file. If not, make a DownloadRequest for each file and download everything in the unit.
3. When a ContentUnit is completely downloaded, call set_content(), which will set ContentUnit.downloaded to True, and save the unit.
4. GOTO 1.
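A sketch of this loop and its Celerybeat schedule, assuming the DeferredDownload document above; LazyCatalogEntry, get_downloader(), and get_unit() are hypothetical stand-ins for the lazy catalog model, the importer's Nectar downloader, and a unit lookup:

```python
import os

from celery import shared_task
from nectar.request import DownloadRequest


@shared_task
def download_deferred():
    while True:
        # Steps 1 and 2: atomically read and remove one entry
        # (findAndModify); an empty collection means the task is done.
        entry = DeferredDownload.objects.modify(remove=True)
        if entry is None:
            return
        # Step 2: collect the unit's catalog entries and download only
        # the files another task has not already written to disk.
        catalog = LazyCatalogEntry.objects(
            unit_id=entry.unit_id, unit_type_id=entry.unit_type_id)
        requests = [DownloadRequest(e.url, e.storage_path)
                    for e in catalog if not os.path.exists(e.storage_path)]
        if requests:
            get_downloader(entry.unit_type_id).download(requests)
        # Step 3: set_content() flips ContentUnit.downloaded to True
        # and saves the unit, per the plan above.
        unit = get_unit(entry.unit_type_id, entry.unit_id)
        unit.set_content()
        # Step 4: GOTO 1 (the while loop claims the next entry).


# Illustrative Celerybeat wiring for the thirty-minute default:
CELERYBEAT_SCHEDULE = {
    'download-deferred': {
        'task': 'download_deferred',
        'schedule': 30 * 60,  # seconds; tune after real-world testing
    },
}
```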
The deferred_downloads collection is purged during orphan purge.
This portion of the plan is meant to cover the lazy=active case, as well as the case where a repository is composed of several lazy=passive repositories and the user wants Pulp to download all the content (to perform an export or similar).
This task can be triggered in several ways:
- An importer is configured with ACTIVE lazy mode and has performed a sync. This means users want this content downloaded into Pulp in the background.
- A user wants to ensure that all content associated with a repository is downloaded into Pulp storage. This would be required to export the repository.
The download_repo task does not make use of the deferred_downloads workflow. This task will download all units associated with the repository that are not already marked as downloaded. It is similar to the download_deferred task, except its list is based on unit association to repositories.
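A sketch of download_repo under the same assumptions; units_in_repo() is a hypothetical helper over Pulp's repository-to-unit association data, and download_unit() reuses the per-unit logic shown for download_deferred:

```python
from celery import shared_task


@shared_task
def download_repo(repo_id):
    for unit in units_in_repo(repo_id):
        # Only units not yet fetched through the catalog need work.
        if unit.downloaded:
            continue
        download_unit(unit)
```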
The ContentUnit model must be modified to change when the storage_path is set. Currently, this is None until the set_content method is called, but for lazy repos set_content isn't called until the content is downloaded. This is problematic because the value of storage_path is needed to create the catalog entries and to publish the repository. Therefore, the value should be populated when a ContentUnit is created, and the downloaded flag will indicate whether the content is downloaded or not.
Furthermore, the importers will need to be able to provide the ContentUnit with a relative path (which will either be a file name, or a directory name in the case of a distribution) which is joined to the storage_path. This is necessary to construct the final path location for the lazy catalog.
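Illustrative only: one way the final catalog path could be composed; the directory layout and helper name are assumptions, not Pulp's actual scheme:

```python
import os


def final_unit_path(storage_dir, unit_type_id, unit_id, relative_path):
    # storage_path is now populated at ContentUnit creation time...
    storage_path = os.path.join(storage_dir, 'units', unit_type_id, unit_id)
    # ...and the importer-supplied relative path (a file name, or a
    # directory name for a distribution) is joined onto it.
    return os.path.join(storage_path, relative_path)
```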
There was some discussion of having the importers provide a callable to the download tasks, but it was decided that for now we will stick with using Nectar to perform the download of units.
A Nectar downloader will be configured for each importer (in the future it may be necessary to configure a downloader for each content type an importer supports). A ContentUnit will be considered downloaded when every file path in the catalog that corresponds to that unit_id has been downloaded successfully.
Although this introduces some complexity to the download tasks, it was judged to be better than introducing the overhead (and complexity in its own right) of having the importer produce a callable that the download tasks managed in a process or thread pool. For example, Nectar reuses TCP connections whereas the callable approach would not be able to.
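A sketch of that "considered downloaded" check, reusing the hypothetical LazyCatalogEntry stand-in from above:

```python
import os


def unit_fully_downloaded(unit_id, unit_type_id):
    """True when every catalog path for the unit exists on disk."""
    entries = LazyCatalogEntry.objects(
        unit_id=unit_id, unit_type_id=unit_type_id)
    return all(os.path.exists(e.storage_path) for e in entries)
```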
This plan, like all the others, has a few known efficiency problems. There are several cases, outlined below, where content is downloaded multiple times by Pulp from the Squid proxy. Although this does not access an external network, it is still considered undesirable since it consumes disk I/O unnecessarily.
A content unit could be downloaded multiple times if a client requests a file in that unit and then a download_repo task for a repository that contains that unit and the Celerybeat download_deferred task run at the same time, and they happen to process that content unit at the same time.
A content unit could be downloaded multiple times if the download_deferred task is set to run often enough that a new task is dispatched before the old one has finished. If those tasks select the same units at the same time, they could download the same content twice. This is a fairly narrow window, as each task should be reading and then removing the document from MongoDB, but it is by no means impossible.
A content unit could be downloaded multiple times if a client is actively requesting content from a multi-file ContentUnit. This occurs if the download_deferred task removes an entry to process, and then the client asks for a new file (that isn't cached in Squid). The streamer will be able to add another entry for that ContentUnit because there is no longer an entry for that (unit_id, unit_type_id).
Mitigation: Have both download_repo and download_deferred regularly check the ContentUnit.downloaded flag on the units they are processing. This way a task can detect that another task has already downloaded the unit and quit.
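A sketch of that check; reload() is mongoengine's re-fetch from the database:

```python
def another_task_finished(unit):
    """Re-read the unit and report whether it is already downloaded."""
    unit.reload()
    return unit.downloaded
```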
Since the download_deferred task removes entries from the collection, it is possible for a lazy=passive download to be lost by Pulp if the worker is killed before it finishes the download, but after it has removed the database record(s).
Mitigations: Have the download_deferred task remove relatively few entries at a time. This is a matter of balancing the performance of parallelizing downloads against losing entries and having to wait for the Squid cache to expire, which causes the streamer to add the entry back to the deferred_downloads collection. A user can also dispatch a download_repo task if they want these lost units to be downloaded by Pulp.
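A sketch of that batching, again assuming the DeferredDownload document above; the batch size is illustrative:

```python
BATCH_SIZE = 10  # a killed worker loses at most this many queued entries


def claim_batch():
    """Atomically claim up to BATCH_SIZE entries for download."""
    batch = []
    for _ in range(BATCH_SIZE):
        entry = DeferredDownload.objects.modify(remove=True)
        if entry is None:
            break
        batch.append(entry)
    return batch
```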