Skip to content

Instantly share code, notes, and snippets.

@mark-cooper
Last active June 3, 2016 05:56
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save mark-cooper/96892ab8734cf96a5a6ab0268107ab45 to your computer and use it in GitHub Desktop.
Save mark-cooper/96892ab8734cf96a5a6ab0268107ab45 to your computer and use it in GitHub Desktop.
Top container migration -- what happens?

Top container migration

Reminder: regarding top containers you may have an issue if you don't use barcodes and do reuse numbering sequences within the same resource or accession. Otherwise, review the upgrade documentation and relax =)

What happens during the top container migration? Example (fairly) typical tree with parallel sequencing (reused box numbering sequences):

Record 1
-
-- Series 1
---- Subseries 1
------ Object 1
-------- Box 1
------ Object 2
-------- Box 1
---- Subseries 2
------ Object 1
-------- Box 1
------ Object 2
-------- Box 1
-
-- Series 2
---- Subseries 1
------ Object 1
-------- Box 1
------ Object 2
-------- Box 1
---- Subseries 2
------ Object 1
-------- Box 1
------ Object 2
-------- Box 1

Barcodes

If containers have barcodes, one top container is created for each unique Container 1 barcode.

Record 1 -> Series 1 -> Subseries 1 -> Object 1 -> Instance -> Box 1 123456 (barcode) [Folder 1]
Record 1 -> Series 1 -> Subseries 1 -> Object 2 -> Instance -> Box 1 123456 (barcode) [Folder 2]

Record 1 -> Series 1 -> Subseries 2 -> Object 1 -> Instance -> Box 1 123456 (barcode) [Folder 1]
Record 1 -> Series 1 -> Subseries 2 -> Object 2 -> Instance -> Box 1 123456 (barcode) [Folder 2]

Record 1 -> Series 2 -> Subseries 1 -> Object 1 -> Instance -> Box 1 987654 (barcode) [Folder 1]
Record 1 -> Series 2 -> Subseries 1 -> Object 2 -> Instance -> Box 1 987654 (barcode) [Folder 2]

Record 1 -> Series 2 -> Subseries 2 -> Object 1 -> Instance -> Box 1 987654 (barcode) [Folder 1]
Record 1 -> Series 2 -> Subseries 2 -> Object 2 -> Instance -> Box 1 987654 (barcode) [Folder 2]

Result = 2 Top Containers.

  • Box 1 123456 (all Series 1 objects linked)
  • Box 1 987654 (all Series 2 objects linked)

No barcodes

If containers do not have barcodes, one top container is created for each unique combination of container 1 indicator and container type 1 within a resource or accession.

Once a top container is created, additional instance records for the same container within an accession or resource will be linked to that top container record.

Record 1 -> Series 1 -> Subseries 1 -> Object 1 -> Instance -> Box 1 [Folder 1]
Record 1 -> Series 1 -> Subseries 1 -> Object 2 -> Instance -> Box 1 [Folder 2]

Record 1 -> Series 1 -> Subseries 2 -> Object 1 -> Instance -> Box 1 [Folder 1]
Record 1 -> Series 1 -> Subseries 2 -> Object 2 -> Instance -> Box 1 [Folder 2]

Record 1 -> Series 2 -> Subseries 1 -> Object 1 -> Instance -> Box 1 [Folder 1]
Record 1 -> Series 2 -> Subseries 1 -> Object 2 -> Instance -> Box 1 [Folder 2]

Record 1 -> Series 2 -> Subseries 2 -> Object 1 -> Instance -> Box 1 [Folder 1]
Record 1 -> Series 2 -> Subseries 2 -> Object 2 -> Instance -> Box 1 [Folder 2]

Result = 1 Top Container.

  • Box 1 (all objects linked)

Depending on your data that result may be ok, and if Box 1 is always the same Box 1 there is no problem. However, it can also be the case that each Box 1 is expected to be distinct between series (or "equivalent" level).

Root archival object solution

The barcoder plugin developed by cfitz will generate barcodes for containers based on the "root" archival object id (often the uppermost "Series"). It can be run per resource, or per repository.

Record 1 -> Series 1 [id123] -> Subseries 1 -> Object 1 -> Instance -> Box 1 id123.Box.1 (barcode) [Folder 1]
Record 1 -> Series 1 [id123] -> Subseries 1 -> Object 2 -> Instance -> Box 1 id123.Box.1 (barcode) [Folder 2]

Record 1 -> Series 1 [id123] -> Subseries 2 -> Object 1 -> Instance -> Box 1 id123.Box.1 (barcode) [Folder 1]
Record 1 -> Series 1 [id123] -> Subseries 2 -> Object 2 -> Instance -> Box 1 id123.Box.1 (barcode) [Folder 2]

Record 1 -> Series 2 [id456] -> Subseries 1 -> Object 1 -> Instance -> Box 1 id456.Box.1 (barcode) [Folder 1]
Record 1 -> Series 2 [id456] -> Subseries 1 -> Object 2 -> Instance -> Box 1 id456.Box.1 (barcode) [Folder 2]

Record 1 -> Series 2 [id456] -> Subseries 2 -> Object 1 -> Instance -> Box 1 id456.Box.1 (barcode) [Folder 1]
Record 1 -> Series 2 [id456] -> Subseries 2 -> Object 2 -> Instance -> Box 1 id456.Box.1 (barcode) [Folder 2]

Result = 2 Top Containers.

  • Box 1 id123.Box.1 (all Series 1 objects linked)
  • Box 1 id456.Box.1 (all Series 2 objects linked)

Review the plugin documentation to see if this approach will work for you.

For a SQL implementation equivalent to barcoder see end of document ("Root Archival Object SQL").

Immediate parent solution

The root archival object id may not represent the ideal "base" barcode for your data. An alternative is to use the immediate parent id (which may be the root archival object or resource record):

Record 1 -> Series 1 [id123] -> Subseries 1 [idabc] -> Object 1 -> Instance -> Box 1 idabc.Box.1 (barcode) [Folder 1]
Record 1 -> Series 1 [id123] -> Subseries 1 [idabc] -> Object 2 -> Instance -> Box 1 idabc.Box.1 (barcode) [Folder 2]

Record 1 -> Series 1 [id123] -> Subseries 2 [iddef] -> Object 1 -> Instance -> Box 1 iddef.Box.1 (barcode) [Folder 1]
Record 1 -> Series 1 [id123] -> Subseries 2 [iddef] -> Object 2 -> Instance -> Box 1 iddef.Box.1 (barcode) [Folder 2]

Record 1 -> Series 2 [id456] -> Subseries 1 [idghi] -> Object 1 -> Instance -> Box 1 idghi.Box.1 (barcode) [Folder 1]
Record 1 -> Series 2 [id456] -> Subseries 1 [idghi] -> Object 2 -> Instance -> Box 1 idghi.Box.1 (barcode) [Folder 2]

Record 1 -> Series 2 [id456] -> Subseries 2 [idjkl] -> Object 1 -> Instance -> Box 1 idjkl.Box.1 (barcode) [Folder 1]
Record 1 -> Series 2 [id456] -> Subseries 2 [idjkl] -> Object 2 -> Instance -> Box 1 idjkl.Box.1 (barcode) [Folder 2]

Result = 4 Top Containers.

  • Box 1 idabc.Box.1 (all Series 1 Subseries 1 objects linked)
  • Box 1 iddef.Box.1 (all Series 1 Subseries 2 objects linked)
  • Box 1 idghi.Box.1 (all Series 2 Subseries 1 objects linked)
  • Box 1 idjkl.Box.1 (all Series 2 Subseries 2 objects linked)

See end of document for SQL example to generate such barcodes ("Immediate Parent SQL").

Intermediate or mixed / mutli level solution

For even more deeply nested trees where the "ideal" base barcode might be some level between the root and immediate parent, or at different levels of the tree depending on other factors, or with elaborate levels of sequencing, there is no current one-size-fits-all solution. Manual data preparation or post upgrade cleanup will be required.

Top Container report

To see Top Container -> Sub Container -> Instance -> Archival Object -> Resource refer to "Top Container Report" below (this is not all-encompassing so you may want to expand it).

Finding parallel sequences

This is not perfect but for an indication of where problematic parallel sequencing may be present refer to "Parallel Sequences Report" below.


Mark Cooper, Team Lead, LYRASIS DTS Hosting and Support


All SQL is used at your own risk! This is not considered production ready as additional testing is required, but it may provide a useful starting point for you to use with your data. The barcode generating example SQL is unscoped (i.e. not per repository or resource), but you will probably want to be more selective when running for "real" (by reviewing data or by targeting with a report like "Parallel Sequences Report"). Always backup and test in a non-production environment first.

Root Archival Object SQL

See "Immediate Parent SQL" below for how to apply barcodes after generating the aspace_topcontainer_barcode table.

DROP FUNCTION IF EXISTS get_ao_root_parent_ao_id;
DELIMITER $$
CREATE FUNCTION get_ao_root_parent_ao_id(_ao_id INT)
  # RETURN ID OF ANCESTOR ARCHIVAL OBJECT (ROOT AO ID) OR SELF IF PARENT NULL
  RETURNS INT
  READS SQL DATA
BEGIN
  DECLARE _ao_parent_id INT;

  SET @ret = 0;
  WHILE _ao_id IS NOT NULL do

    SELECT parent_id
    INTO _ao_parent_id
    FROM archival_object 
    WHERE id = _ao_id;

    IF(found_rows() != 1) THEN
      SIGNAL SQLSTATE '45000'
      SET MESSAGE_TEXT = 'id not found in table archival_object';
    END IF;

    SET @ret := _ao_id; 
    SET _ao_id := _ao_parent_id;

  END WHILE;

  RETURN @ret;
END;
$$
DELIMITER ;

DROP TEMPORARY TABLE IF EXISTS aspace_topcontainer_barcode;
CREATE TEMPORARY TABLE aspace_topcontainer_barcode AS (
  # CONTAINER -> INSTANCE -> AO [ROOT AO PARENT ID]
  SELECT
    c.id as c_id,
    ev1.value as c_type_1,
    c.indicator_1 as c_ind_1,
    a.id as ao_id,
    'ao' as parent_type,
    (SELECT get_ao_root_parent_ao_id(a.id) as parent_id),
    a.root_record_id as resource_id,
    concat(
      'aspace.',
      'ao',
      '.',
      (SELECT get_ao_root_parent_ao_id(a.id)),
      '.',
      ev1.value,
      '.',
      c.indicator_1
    ) as barcode
  FROM container c
  JOIN enumeration_value ev1 ON c.type_1_id = ev1.id
  JOIN instance i ON c.instance_id = i.id
  JOIN enumeration_value ev2 ON i.instance_type_id = ev2.id
  JOIN archival_object a ON i.archival_object_id = a.id
  WHERE c.barcode_1 IS NULL AND ev2.value <> 'digital_object'
);

You can now query aspace_topcontainer_barcode to see the generated barcodes.


Immediate Parent SQL

# BARCODE VIA AO / RESOURCE BY IMMEDIATE PARENT AO / RESOURCE SQL
DROP TEMPORARY TABLE IF EXISTS aspace_topcontainer_barcode;
CREATE TEMPORARY TABLE aspace_topcontainer_barcode AS (
SELECT * FROM (
  # CONTAINER -> INSTANCE -> AO [PARENT ID]
  SELECT
    c.id c_id,
    ev1.value as c_type_1,
    c.indicator_1 as c_ind_1,
    a.id as ao_id,
    # IF YOU DON'T WANT TO USE THE RESOUCE ID USE 'ao' string AS 'parent_type'
    # AND USE 'ifnull(a.parent_id, a.id) as parent_id' TO SCOPE BY AO ONLY
    (case when a.parent_id is null then 'r' else 'ao' end) as parent_type,
    ifnull(a.parent_id, a.root_record_id) as parent_id,
    # sanity check: when parent_type = 'r' parent_id and resource_id should match
    a.root_record_id as resource_id,
    concat(
      'aspace.',
      # UPDATE IF CHANGED PER ABOVE: 'ao'
      (case when a.parent_id is null then 'r' else 'ao' end),
      '.',
      # UPDATE IF CHANGED PER ABOVE: 'ifnull(a.parent_id, a.id) as parent_id'
      ifnull(a.parent_id, a.root_record_id),
      '.',
      ev1.value,
      '.',
      c.indicator_1
    ) as barcode
  FROM container c
  JOIN enumeration_value ev1 ON c.type_1_id = ev1.id
  JOIN instance i ON c.instance_id = i.id
  JOIN enumeration_value ev2 ON i.instance_type_id = ev2.id
  JOIN archival_object a ON i.archival_object_id = a.id
  WHERE c.barcode_1 IS NULL AND ev2.value <> 'digital_object'
  UNION ALL
  # CONTAINER -> INSTANCE -> RESOURCE [RESOURCE ID]
  SELECT
    c.id c_id,
    ev1.value as c_type_1,
    c.indicator_1 as c_ind_1,
    NULL as ao_id,
    'r' as parent_type,
    i.resource_id as parent_id,
    # sanity check: when parent_type = 'r' parent_id and resource_id should match
    r.id as resource_id,
    concat(
      'aspace.',
      'r',
      '.',
      r.id,
      '.',
      ev1.value,
      '.',
      c.indicator_1
    ) as barcode
  FROM container c
  JOIN enumeration_value ev1 ON c.type_1_id = ev1.id
  JOIN instance i ON c.instance_id = i.id
  JOIN enumeration_value ev2 ON i.instance_type_id = ev2.id
  JOIN resource r ON i.resource_id = r.id
  WHERE c.barcode_1 IS NULL AND ev2.value <> 'digital_object'
) barcode );

# APPLY GENERATED BARCODES TO CONTAINERS
UPDATE container c
JOIN aspace_topcontainer_barcode atcb ON c.id = atcb.c_id  
SET c.barcode_1 = atcb.barcode;

# REINDEX ARCHIVAL OBJECTS
UPDATE archival_object ao
JOIN aspace_topcontainer_barcode atcb ON ao.id = atcb.ao_id
SET ao.system_mtime = now();

# REINDEX RESOURCES
UPDATE resource r
JOIN aspace_topcontainer_barcode atcb ON r.id = atcb.resource_id
SET r.system_mtime = now()
WHERE atcb.ao_id IS NULL;

Now review to check that the results are what you expect with your data.


Removing generated barcodes

After the upgrade you may want to delete any generated barcodes that were used:

# CHECK THIS IS OK FOR YOUR DATA! (remove barcodes and reindex top containers)
SELECT * FROM top_container WHERE barcode LIKE 'aspace.%'; # REVIEW UPDATES
UPDATE top_container SET barcode = NULL WHERE barcode LIKE 'aspace.%';
UPDATE top_container SET system_mtime = now();

To do this more efficiently you may want to capture the top container ids that will be updated and only update those records. As-is all top containers will be reindexed, which can be slow with larger data sets.


Top Container Report

SELECT
 tc.id as tc_id,
 ev1.value as tc_type,
 tc.indicator as tc_indicator,
 tc.barcode,
 sc.id as sc_id,
 ev2.value as sc_type,
 sc.indicator_2 as sc_indicator2,
 i.id as i_id,
 ao.id as ao_id,
 ao.display_string as ao_title,
 r.id as resource_id,
 r.title
FROM top_container tc
JOIN enumeration_value ev1 ON ev1.id = tc.type_id
JOIN top_container_link_rlshp tclr ON tc.id = tclr.top_container_id
JOIN sub_container sc ON sc.id = tclr.sub_container_id
LEFT JOIN enumeration_value ev2 ON ev2.id = sc.type_2_id
JOIN instance i ON i.id = sc.instance_id
JOIN archival_object ao ON ao.id = i.archival_object_id
JOIN resource r ON r.id = ao.root_record_id
-- WHERE r.id = 2 # use to specify a particular resource
;

Parallel Sequences Report

This provides something to review. It does not disambiguate repeat Box numbers that have a Series (Box 1) -> Subseries (Box 1 Folder 1, Box 1 Folder 2 etc.) hierarchical relationship, which is generally not a problem. If your data uses this pattern frequently you will have many false hits.

SELECT
  r.id as resource_id,
  r.title,
  count(DISTINCT ao.parent_id) as parents,
  ev1.value as ind1_type,
  c.indicator_1 as ind1_value,
  count(*) as ind1_occurences
FROM resource r
JOIN archival_object ao ON r.id = ao.root_record_id
JOIN instance i ON ao.id = i.archival_object_id
JOIN container c ON i.id = c.instance_id
JOIN enumeration_value ev1 ON ev1.id = c.type_1_id
WHERE c.barcode_1 IS NULL
GROUP BY r.id, c.type_1_id, c.indicator_1
HAVING parents > 1
ORDER BY r.id;

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment