A resource file can be archived without breaking the file hierarchy that CKAN builds under /var/lib/files/resources. The actual data for a resource 7896ea0c-504d-401c-818f-065430419695 is stored as a regular file in a 3-level hierarchy, at 789/6ea/0c-504d-401c-818f-065430419695.
For example:
tree resources
...
├── 66f
│   └── 738
│       └── 1d-296c-4f85-9186-3b608d05fff0
├── 789
│   └── 6ea
│       └── 0c-504d-401c-818f-065430419695
├── 99a
│   └── 9c0
│       └── b8-6276-4f44-8fd2-0e763c46da0e
├── aec
│   └── bc8
│       └── 38-93f9-4450-900d-1bf54d1d0cf6
├── ce9
│   └── fa9
│       └── 36-5dc2-4671-8e5f-adf06fc544a8
...
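The mapping from a resource identifier to its path in this hierarchy can be sketched as a small shell function. This is only an illustration: the name resource_path is hypothetical and not part of CKAN, and the split (two 3-character directory names, then the remainder as the file name) is inferred from the listing above.

```shell
# Hypothetical helper (not part of CKAN): print the path of a resource,
# relative to the storage root, for a given resource identifier.
resource_path() (
    id=$1
    d1=$(printf '%s' "$id" | cut -c1-3)    # first directory level
    d2=$(printf '%s' "$id" | cut -c4-6)    # second directory level
    rest=$(printf '%s' "$id" | cut -c7-)   # file name
    printf '%s/%s/%s\n' "$d1" "$d2" "$rest"
)

resource_path 7896ea0c-504d-401c-818f-065430419695
# prints 789/6ea/0c-504d-401c-818f-065430419695
```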
We must atomically replace the file with a symbolic link. The only way to do this is to create the link under a temporary name and then rename it over the original file.
For example:
# Prepare directory structure for to-be archived resource
mkdir -p /mnt/data-1/resources/789/6ea
# Archive
cp resources/789/6ea/0c-504d-401c-818f-065430419695 \
/mnt/data-1/resources/789/6ea/0c-504d-401c-818f-065430419695
# Create link to archived resource
ln -s /mnt/data-1/resources/789/6ea/0c-504d-401c-818f-065430419695 \
resources/789/6ea/0c-504d-401c-818f-065430419695-link
# Move atomically to destination
mv -v resources/789/6ea/0c-504d-401c-818f-065430419695-link resources/789/6ea/0c-504d-401c-818f-065430419695
After this, tree resources will give something like the following:
├── resources
...
│   ├── 66f
│   │   └── 738
│   │       └── 1d-296c-4f85-9186-3b608d05fff0
│   ├── 789
│   │   └── 6ea
│   │       └── 0c-504d-401c-818f-065430419695 -> /mnt/data-1/resources/789/6ea/0c-504d-401c-818f-065430419695
│   ├── 99a
│   │   └── 9c0
│   │       └── b8-6276-4f44-8fd2-0e763c46da0e
...
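The manual steps above can be wrapped in a function so that they are applied the same way to every resource. The following is a minimal sketch under some assumptions: the name archive_resource and its arguments are hypothetical, the archive root is given as an absolute path (otherwise the link would not resolve), and the cmp check before the rename is an extra safety step that the manual procedure does not include.

```shell
# Hypothetical helper: archive one resource file and leave a symlink behind.
# Usage: archive_resource <resource-id> <storage-root> <absolute-archive-root>
archive_resource() (
    id=$1 storage=$2 archive=$3

    # Location inside the 3-level hierarchy, e.g. 789/6ea/0c-504d-...
    dir=$(printf '%s' "$id" | cut -c1-3)/$(printf '%s' "$id" | cut -c4-6)
    name=$(printf '%s' "$id" | cut -c7-)
    src=$storage/$dir/$name
    dst=$archive/$dir/$name

    # Skip resources that have already been replaced by a link
    [ -L "$src" ] && exit 0

    # Prepare the directory structure and copy the data
    mkdir -p "$archive/$dir" || exit 1
    cp -p "$src" "$dst" || exit 1

    # Added safety check: do not touch the original unless the copy is identical
    cmp -s "$src" "$dst" || exit 1

    # Create the link under a temporary name, then rename it over the original
    ln -s "$dst" "$src-link" || exit 1
    mv "$src-link" "$src"
)
```

With the paths used in this document, the call would be archive_resource 7896ea0c-504d-401c-818f-065430419695 /var/lib/files/resources /mnt/data-1/resources. The rename is only atomic because the temporary link and the original file live in the same directory, and therefore on the same filesystem.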
This is quite the opposite of (1). When a resource path is updated from the web application, CKAN also atomically moves the new (regular) file onto the previous one. So the old path is replaced by the new file, even if it was a symbolic link.
Nothing more needs to be done here.
We must keep track of what's happening on the resources under a package, and act accordingly.
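The archiving application must keep track of known package/resource identifiers and the last-seen revision of each. It needs a database schema that records, for each resource, its package, its last-seen revision and its state (active or deleted).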
A sketch of the process:
1. List identifiers for public datasets (packages).
2. For each identifier in (1), fetch package details (package_show).
   - If the package has not been seen before, add it to the database along with all its resources, marked as active.
   - If the package has been seen, compare its resources to their last-seen revisions.
     - If a resource is new: add it, mark it as active and record the seen revision.
     - If a resource is known but has a new revision, either its metadata or its actual data were replaced: nothing to do, just record the seen revision.
     - If a resource is missing (i.e. it is known but not found in the resources from package_show): mark it as deleted. The actual resource must be archived (as it is no longer accessible from the CKAN web application).
3. For all missing packages (i.e. known but not found in (1)), we can assume they are deleted. All their resources can be archived.
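The per-package comparison in the process above can be sketched with awk. This is an illustration only, under some assumptions: the function name classify_resources is hypothetical, and both inputs are plain files with one "<resource-id> <revision-id>" pair per line, the first holding the last-seen state from the database and the second the resources returned by package_show. Fetching the package details and updating the database are not shown.

```shell
# Hypothetical helper: compare last-seen resources with the current ones.
# Usage: classify_resources <known-file> <current-file>
# Prints one line per change: "new <id>", "updated <id>" or "deleted <id>".
classify_resources() {
    awk '
        # First file: remember the last-seen revision of each known resource
        FILENAME == ARGV[1] { known[$1] = $2; next }

        # Second file: resources currently listed for the package
        {
            seen[$1] = 1
            if (!($1 in known))        print "new", $1
            else if (known[$1] != $2)  print "updated", $1
        }

        # Known resources that are no longer listed must be archived
        END {
            for (id in known)
                if (!(id in seen)) print "deleted", id
        }
    ' "$1" "$2"
}
```

Resources reported as deleted are the ones to mark as deleted and archive. Those reported as new or updated only need their state and revision recorded. A package that has not been seen before corresponds to an empty first file, in which case every resource is reported as new.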