@drmalex07
Last active November 13, 2019 15:25

Archive resources from CKAN catalogue

1. Replace resource file with symbolic link

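A resource file can be archived without breaking the file hierarchy that CKAN builds under /var/lib/files/resources. The actual data for a resource with identifier 7896ea0c-504d-401c-818f-065430419695 is stored as a regular file in a 3-level hierarchy at 789/6ea/0c-504d-401c-818f-065430419695: the first three characters of the identifier name the top-level directory, the next three name the subdirectory, and the remainder is the file name.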

For example:

tree resources
...
├── 66f
│   └── 738
│       └── 1d-296c-4f85-9186-3b608d05fff0
├── 789
│   └── 6ea
│       └── 0c-504d-401c-818f-065430419695
├── 99a
│   └── 9c0
│       └── b8-6276-4f44-8fd2-0e763c46da0e
├── aec
│   └── bc8
│       └── 38-93f9-4450-900d-1bf54d1d0cf6
├── ce9
│   └── fa9
│       └── 36-5dc2-4671-8e5f-adf06fc544a8
...
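
The mapping from identifier to path can be sketched as a small shell helper. This is only an illustration derived from the layout shown above; `resource_relpath` is a hypothetical name, not something CKAN provides.

```shell
# Map a resource identifier to its path relative to the resources directory.
# Hypothetical helper for illustration; not part of CKAN.
resource_relpath() (
    id=$1
    # First 3 characters / next 3 characters / remainder of the identifier
    printf '%s/%s/%s\n' \
        "$(printf '%s\n' "$id" | cut -c1-3)" \
        "$(printf '%s\n' "$id" | cut -c4-6)" \
        "$(printf '%s\n' "$id" | cut -c7-)"
)

resource_relpath 7896ea0c-504d-401c-818f-065430419695
# prints 789/6ea/0c-504d-401c-818f-065430419695
```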

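We must atomically replace the file with a symbolic link. The way to do this is to create the link under a temporary name and then rename it onto the original file, so that the resource path always points at either the original file or the finished link.

For example: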

# Prepare directory structure for to-be archived resource
mkdir -p /mnt/data-1/resources/789/6ea
# Archive
cp resources/789/6ea/0c-504d-401c-818f-065430419695 \
   /mnt/data-1/resources/789/6ea/0c-504d-401c-818f-065430419695
# Create link to archived resource
ln -s /mnt/data-1/resources/789/6ea/0c-504d-401c-818f-065430419695 \
   resources/789/6ea/0c-504d-401c-818f-065430419695-link
# Move atomically to destination
mv -v resources/789/6ea/0c-504d-401c-818f-065430419695-link resources/789/6ea/0c-504d-401c-818f-065430419695

After this, tree resources will give something like the following:

├── resources
....
│   ├── 66f
│   │   └── 738
│   │       └── 1d-296c-4f85-9186-3b608d05fff0
│   ├── 789
│   │   └── 6ea
│   │       └── 0c-504d-401c-818f-065430419695 -> /mnt/data-1/resources/789/6ea/0c-504d-401c-818f-065430419695
│   ├── 99a
│   │   └── 9c0
│   │       └── b8-6276-4f44-8fd2-0e763c46da0e
...
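
The commands from section 1 could be wrapped in a function so that they can be repeated for any resource. The following is a minimal sketch, not a tested archiving tool: `archive_resource` and its arguments are hypothetical names, the archive root is assumed to be an absolute path (otherwise the link target would not resolve), and the `cmp` check before the swap is an extra safety step that the original commands do not include.

```shell
# archive_resource LIVE_ROOT ARCHIVE_ROOT RELPATH
# Hypothetical helper: copy one resource to the archive, verify the copy,
# then replace the original with a symbolic link using a single rename.
# ARCHIVE_ROOT is assumed to be an absolute path.
archive_resource() (
    live=$1/$3
    archived=$2/$3
    # Only act on regular files; refuse paths that are already links
    [ -f "$live" ] && [ ! -L "$live" ] || exit 1
    # Prepare the directory structure for the to-be-archived resource
    mkdir -p "$(dirname "$archived")" &&
    # Archive, preserving mode and timestamps
    cp -p "$live" "$archived" &&
    # Make sure the copy is identical before touching the original
    cmp -s "$live" "$archived" &&
    # Create the link under a temporary name next to the original
    ln -s "$archived" "${live}-link" &&
    # Rename the link onto the original path
    mv -f "${live}-link" "$live"
)

# Example call, using the paths from section 1:
# archive_resource /var/lib/files/resources /mnt/data-1/resources \
#     789/6ea/0c-504d-401c-818f-065430419695
```

Because every step is chained with `&&`, a failed copy or comparison stops the function before the original file is replaced.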

2. Discard symbolic links when a resource is updated

This is the reverse of (1). When a resource is updated from the web application, CKAN also atomically moves the new (regular) file onto the previous one. The old path is therefore replaced by the new file, even if it was a symbolic link.

Nothing more needs to be done for this case.
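
This behaviour can be checked in a scratch directory. The sketch below uses `mv` as a stand-in for the rename that CKAN performs; the file names are made up for the demonstration.

```shell
# Renaming a regular file onto a symbolic link replaces the link itself
# and leaves the archived copy untouched. All names here are illustrative.
demo=$(mktemp -d)
printf 'old data\n' > "$demo/archived"      # plays the archived resource
ln -s "$demo/archived" "$demo/resource"     # plays the link left by step 1
printf 'new data\n' > "$demo/upload"        # plays the newly uploaded file

mv -f "$demo/upload" "$demo/resource"       # stand-in for CKAN's rename

[ ! -L "$demo/resource" ] && echo "resource is a regular file again"
cat "$demo/resource"    # new data
cat "$demo/archived"    # old data
```

The archived copy stays where it is, so whether to keep or remove it remains a decision for the archiving application.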

3. Keep track of resources under a package

We must keep track of what's happening on the resources under a package, and act accordingly.

The archiving application must keep track of known package/resource identifiers and last-seen revisions for them. It needs a database schema like:

[Diagram: database schema (image not reproduced here)]

A sketch of the process:

  1. List the identifiers of all public datasets (packages).

  2. For each identifier in (1), fetch the package details (package_show).

    1. If the package has not been seen before, add it to the database along with all of its resources, marked as active.

    2. If the package has been seen before, compare its resources to their last-seen revisions.

      1. If a resource is new: add it, mark it as active, and record the seen revision.
      2. If a resource is known but has a new revision, either its metadata or its actual data were replaced: nothing to archive, just record the seen revision.
      3. If a resource is missing (i.e. it is known but not found among the resources returned by package_show): mark it as deleted. The actual resource file must be archived, as it is no longer accessible from the CKAN web application.
  3. All missing packages (i.e. known but not found in (1)) can be assumed to be deleted, and all of their resources can be archived.
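
The comparison in step 2.2 can be sketched with standard text tools. This is only an illustration under stated assumptions: it stores state in plain text files instead of a database, assumes one `<resource_id> <revision_id>` pair per line, and uses the hypothetical name `classify_resources`. Fetching the current list from CKAN is left out.

```shell
# classify_resources KNOWN CURRENT
# Both files hold one "<resource_id> <revision_id>" pair per line (an assumed
# format for this sketch). KNOWN is what the archiver recorded on its last
# run; CURRENT is what package_show reports now for the same package.
classify_resources() (
    LC_ALL=C; export LC_ALL     # sort and join must agree on collation
    tmp=$(mktemp -d) || exit 1
    sort -k1,1 "$1" > "$tmp/known"
    sort -k1,1 "$2" > "$tmp/current"
    # 2.2.1: only in CURRENT, so add as active and record the revision
    join -v 2 "$tmp/known" "$tmp/current" | awk '{ print "new", $1 }'
    # 2.2.2: in both but with a different revision, so record the new revision
    join "$tmp/known" "$tmp/current" | awk '$2 != $3 { print "changed", $1 }'
    # 2.2.3: only in KNOWN, so mark as deleted and archive the file
    join -v 1 "$tmp/known" "$tmp/current" | awk '{ print "missing", $1 }'
    rm -rf "$tmp"
)
```

The same set difference applies one level up: comparing the known package identifiers against the list from step 1 yields the missing packages of step 3.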
