@ross-spencer
Forked from cmharlow/pcdm_filesets_testing.md
Created December 2, 2021 09:00
Testing PCDM modeling with Fileset options (PCDM PRs #68, #69) for existing Cornell use cases

My Context

My understanding of PCDM so far:

  • PCDM models repository objects, including the RWO/"intellectual work", a digital resource (surrogate or born digital), and binaries.
  • PCDM does not model descriptive domains, i.e. it is different from BIBFRAME, CIDOC-CRM, FRBR-* or RDA, etc.
  • PCDM does not model preservation activities (but could be extended to do so), i.e. it is different from PREMIS.
  • Ignoring Collection here because it is not requested to be changed.
  • Object:
    • Intellectual entity or RWO.
    • Can have descriptive metadata and access metadata.
    • Cannot have technical metadata.
    • Each level of a "work" is an Object instance that can be standalone or can be linked to other Objects.
    • Object that is a "work component" (aka a part) is created as needed by the implementation (e.g., the creation of an Object for a Book doesn't mean you have to create a bunch of Object instances for the Pages. You do so if you have a reason for describing a Page separately.)
    • Are ordered via Proxy resources and ordering relationships in PCDM.
  • Fileset:
    • Grouping of Files / binaries that come from the same "source".
    • Not an intellectual "work".
    • Source could be same binary or same digital resource:
      • e.g. for binary: image file derivatives
      • e.g. for same digital resource: parts of a website (HTML, CSS), derived versions of a dataset (dataset as CSV, dataset visualizations), image of text with OCR, etc.
    • Has digitization/capture technical or access metadata.
    • Does not have descriptive metadata beyond some core fields (i.e. it could have a label, could have publisher or creator, but these relate to the digitization or capture activity, not the intellectual work).
    • Fileset must only contain Files. Can contain 0:n Files (I guess you could have a Fileset without a File, but that seems pointless. Must contain at least 1 File if using the logic shortcuts, below).
    • FileSet should be a member of 1 and only 1 Object (not proposed in the model RDF changes, so is more a recommendation than a requirement). This is the RWO <=> Digitization or Capture output relationship. If you have a Fileset that could be contained by 2 Objects, you are missing an Object RWO for that combined work.
    • Filesets are not ordered (they are attached to Objects that are ordered).
  • File:
    • Binary of a digital resource (surrogate or born digital item).
    • Has digitization or capture technical or access metadata.
    • Does not have descriptive metadata.
    • Are not ordered (but can use class typing to meet at least some needs for distinguishing files from each other).
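
The "1 and only 1 Object" recommendation for a Fileset is not enforced by the model, so an implementation would have to check it itself. A minimal sketch of such a check follows, assuming triples are held as plain `(subject, predicate, object)` string tuples. The `TRIPLES` data, the function name, and the `pcdm:FileSet` spelling are illustrative assumptions, not part of either PR.

```python
from collections import defaultdict

# Hypothetical instance data: triples as (subject, predicate, object) strings.
TRIPLES = {
    (":page1", "rdf:type", "pcdm:Object"),
    (":page2", "rdf:type", "pcdm:Object"),
    (":fs1", "rdf:type", "pcdm:FileSet"),
    (":file1", "rdf:type", "pcdm:File"),
    (":page1", "pcdm:hasFileSet", ":fs1"),
    (":page2", "pcdm:hasFileSet", ":fs1"),  # second owner: breaks the recommendation
    (":fs1", "pcdm:hasFile", ":file1"),
}


def fileset_owner_violations(triples):
    """Return {fileset: sorted owners} for every FileSet not attached to exactly one Object."""
    owners = defaultdict(set)
    for s, p, o in triples:
        if p == "pcdm:hasFileSet":
            owners[o].add(s)
    filesets = {s for s, p, o in triples if p == "rdf:type" and o == "pcdm:FileSet"}
    return {fs: sorted(owners[fs]) for fs in filesets if len(owners[fs]) != 1}


print(fileset_owner_violations(TRIPLES))
```

A Fileset reported here with two owners is the case described above: it suggests an Object for the combined work is missing.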

Currently needed use cases:

  1. Book has Pages, Pages have multiple digital resources from different digitization efforts, some Pages contain Images we capture in detail for sake of discovery.
  2. Journal has Issues, Issues have Articles, Issue has Pages, Articles have Pages, each Page is digitized once but has derivatives, OCR.

Proposed Changes (PR #68):

  • Fileset is added, becomes a subclass of Object
  • hasFileSet is added. Domain: Object, Range: Fileset
  • isFileSetOf is added. Domain: Fileset, Range: Object, inverseOf: hasFileSet

Understanding of Changes:

  • Fileset becomes something that can have descriptive metadata and/or represent an intellectual "work" (through being a subclass of Object) - though this is not recommended or a requested change.
  • Ostensibly, with this, a Fileset could contain a Fileset, but this is not recommended or a requested change.
  • Fileset can be ignored as needed - i.e., (non-FS) Object hasFile File directly is possible.
  • This says nothing about when you would use (non-FS) Object hasFile File versus when you would use (non-FS) Object hasFileSet FS , FS hasFile File.
  • In my estimation (i.e. the model changes say nothing about the following, but this needs to be discussed):
    • You have a direct non-FS Object hasFile File relationship when there are not multiple Files requiring Fileset groupings.
    • You create and use Filesets when there are multiple Files requiring groupings.
    • "Requirement" will be hard to pin down and implementation-specific.
    • This leaves open whether each File/binary should be (it can be) a member of 1 and only 1 resource.
    • This creates possibly problematic variance for query or update paths for all Files of a resource (i.e., modeling to have two paths for the same information).
    • This creates uncertainty about what metadata goes where due to subclassing (see first bullet point) and due to possible use of just a File or a Fileset with that File (need to explore this second part more - it could be that you just lose some metadata options if not using Fileset).
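
To make the "two paths for the same information" concern concrete, here is a rough sketch of what a consumer would need to do to find all Files of a resource under #68. It assumes triples as plain string tuples; the data and the `files_of` helper are hypothetical.

```python
# Hypothetical data: :page1 links to :file1 directly and again through :fs1.
TRIPLES = {
    (":page1", "pcdm:hasFileSet", ":fs1"),
    (":page1", "pcdm:hasFileSet", ":fs2"),
    (":page1", "pcdm:hasFile", ":file1"),
    (":fs1", "pcdm:hasFile", ":file1"),
    (":fs1", "pcdm:hasFile", ":file2"),
}


def files_of(resource, triples):
    """Return (direct, via_fileset): Files found on each of the two possible paths."""
    direct = {o for s, p, o in triples if s == resource and p == "pcdm:hasFile"}
    filesets = {o for s, p, o in triples if s == resource and p == "pcdm:hasFileSet"}
    via_fileset = {o for s, p, o in triples if s in filesets and p == "pcdm:hasFile"}
    return direct, via_fileset


direct, via_fileset = files_of(":page1", TRIPLES)
print(sorted(direct), sorted(via_fileset))
```

Any query or update that wants every File has to union both sets, and the overlap (`:file1` here) has to be kept in sync on both paths.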

Examples

1 : Book has Pages, Pages have multiple digital surrogates from different digitization efforts, some Pages contain Images we capture in detail for sake of discovery.

:book1 a pcdm:Object ;
  dct:title "Bringer of the mystery dog" ;
  dct:creator <http://id.loc.gov/rwo/agents/n79056779> ;
  dct:type dcmitype:Text ;
  dct:format <http://vocab.getty.edu/aat/300028051> ;
  pcdm:hasMember :page1, :page2, ... .
:page1 a pcdm:Object ;
  dct:title "Page 32" ;
  bibo:locator "32"^^xsd:integer ;
  pcdm:hasFileSet :fs1 , :fs2 ;
  pcdm:hasFile :file1 . # wondering if we can use this to represent the preferred rep when there are multiple filesets.
:fs1 a pcdm:Fileset, pcdm:Object ; # including superclass for sake of clarity in example
  dct:publisher :ag1 ;
  dct:rights <http://creativecommons.org/licenses/by/3.0/> ; # digital rights, not intellectual work rights. should maybe be on file? this is more access?
  pcdm:hasFile :file1, :file2.
:file1 a pcdm:File , pcdmfft:RasterImage ;
  ebucore:filename "1234.jpg" ;
  ebucore:fileSize "9087656" .
:file2 a pcdm:File , pcdmfft:RasterImage ;
  ebucore:filename "1234.png" ;
  ebucore:fileSize "8909087656" .
:page2 a pcdm:Object ;
  dct:title "Page 33" ;
  bibo:locator "33"^^xsd:integer ;
  pcdm:hasMember :detail1 ;
  pcdm:hasFileSet :fs3 , :fs4 ;
  pcdm:hasFile :file3 .
# skipping :fs3, :fs4, :file3 as follows above
:detail1 a pcdm:Object ;
  dct:title "Detail of Ishtakhaba" ; # illustration caption
  foaf:depicts <http://www.wikidata.org/entity/Q3155283> ; # including for Agent to depiction convo
  pcdm:hasFileSet :fs5 ;
  pcdm:hasFile :file5 .
# skipping :fs5, :file5 as follows above
:ag1 a foaf:Organization ;
  foaf:name "Cornell University. Library. Digital Consulting and Production Services"@en .

2 : Journal has Issues, Issues have Articles, Issue has Pages, Articles have Pages, each Page is digitized once but has derivatives, OCR.

:journal1 a pcdm:Object ;
  dct:title "Agricultural History" ;
  dct:publisher <http://id.loc.gov/rwo/agents/n79119036> ;
  dct:type dcmitype:Text ;
  dct:format <http://vocab.getty.edu/aat/300215390> ;
  pcdm:hasMember :issue1, :issue2, ... :issue19 ;
  iana:first :proxyIssue1 ;
  iana:last :proxyIssue19 .
:issue1 a pcdm:Object ;
  dct:title "Agricultural History" ;
  dct:issued "1945-01"^^dct:W3CDTF ;
  bibo:volume "19"^^xsd:integer ;
  bibo:issue "1"^^xsd:integer ;
  dct:type dcmitype:Text ;
  dct:format <http://vocab.getty.edu/aat/300048715> ;
  pcdm:hasMember :article1, :article2, :page1, :page2, ... :page50 ;
  iana:first :proxyPage1 ;
  iana:last :proxyPage50 .
:article1 a pcdm:Object ;
  dct:title "Factors Influencing the Distribution of the German Pioneer Population in Minnesota" ;
  dct:creator :ag2 ;
  dct:type dcmitype:Text ;
  dct:format <http://vocab.getty.edu/aat/300048715> ;
  pcdm:hasMember :page1, :page2 ;
  iana:first :proxyPageArt1 ;
  iana:last :proxyPageArt2 .
:page1 a pcdm:Object ;
  dct:title "Page 1" ;
  bibo:locator "1"^^xsd:integer ;
  pcdm:hasFileSet :fs1, :fs2 ;
  pcdm:hasFile :file1 . # wondering if we can use this to represent the preferred rep when there are multiple filesets.
:fs1 a pcdm:Fileset, pcdm:Object ; # including superclass for sake of clarity in example
  dct:publisher :ag1 ;
  dct:description "Digitization funded by Cornell University Class of 1956."@en ;
  dct:rights <http://creativecommons.org/licenses/by/3.0/> ; # digital rights, not intellectual work rights. should maybe be on file? this is more access?
  pcdm:hasFile :file1, :file2.
:file1 a pcdm:File , pcdmfft:RasterImage ;
  ebucore:filename "1234.jpg" ;
  ebucore:fileSize "9087656" .
:file2 a pcdm:File , pcdmfft:Text ; # Could include tool/event information used to perform OCR here
  ebucore:filename "1234.txt" ;
  ebucore:fileSize "8909087656" .
:page2 a pcdm:Object ;
  dct:title "Page 2" ;
  bibo:locator "2"^^xsd:integer ;
  pcdm:hasFileSet :fs3 , :fs4 ;
  pcdm:hasFile :file3 .
# skipping :fs3, :fs4, :file3 as follows above
:ag1 a foaf:Organization ;
  foaf:name "Cornell University. Library. Digital Consulting and Production Services"@en .
:ag2 a foaf:Person ;
  foaf:name "Johnson, Hildegard Binder" .
:proxyPage1 a ore:Proxy ;
  ore:proxyFor :page1 ;
  ore:proxyIn :issue1 ;
  iana:next :proxyPage2 .
# ... (on through :proxyPage50)
:proxyPageArt1 a ore:Proxy ;
  ore:proxyFor :page1 ;
  ore:proxyIn :article1 ;
  iana:next :proxyPageArt2 .
:proxyPageArt2 a ore:Proxy ;
  ore:proxyFor :page2 ;
  ore:proxyIn :article1 ;
  iana:prev :proxyPageArt1 .
# proxies for issue ordering in a Journal resource skipped as follows above pattern
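
Since the same Page sits in both an Issue and an Article with separate proxies, it may help to see how an ordered member list could be read back out. This is only a sketch: it assumes triples as plain string tuples and that every proxy except the last carries an `iana:next` link. The data and the `ordered_members` helper are hypothetical.

```python
# Hypothetical data mirroring the :article1 proxies in the example above.
TRIPLES = [
    (":article1", "iana:first", ":proxyPageArt1"),
    (":article1", "iana:last", ":proxyPageArt2"),
    (":proxyPageArt1", "ore:proxyFor", ":page1"),
    (":proxyPageArt1", "iana:next", ":proxyPageArt2"),
    (":proxyPageArt2", "ore:proxyFor", ":page2"),
    (":proxyPageArt2", "iana:prev", ":proxyPageArt1"),
]


def ordered_members(container, triples):
    """Walk iana:first / iana:next from the container and return the proxied members in order."""
    first = [o for s, p, o in triples if s == container and p == "iana:first"]
    next_of = {s: o for s, p, o in triples if p == "iana:next"}
    proxy_for = {s: o for s, p, o in triples if p == "ore:proxyFor"}
    ordered, seen = [], set()
    proxy = first[0] if first else None
    while proxy is not None and proxy not in seen:  # 'seen' guards against cycles
        seen.add(proxy)
        if proxy in proxy_for:
            ordered.append(proxy_for[proxy])
        proxy = next_of.get(proxy)
    return ordered


print(ordered_members(":article1", TRIPLES))
```

Note that the ordering lives entirely on the Objects' proxies; Filesets and Files never appear in the walk.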

Proposed Changes (PR #69):

  • Fileset is added, but it is not a subclass of Object
  • hasFileSet is added. Domain: Object, Range: Fileset
  • isFileSetOf is added. Domain: Fileset, Range: Object, inverseOf: hasFileSet
  • managesFile is added. Domain: Fileset, Range: File
  • managedBy is added. Domain: File, Range: Fileset, inverseOf: managesFile

Understanding of Changes:

  • Fileset remains separate from Object. Definition is as above.
  • Fileset should not be an "intellectual work"/RWO, but that is not explicitly stated (but we have a firmer separation than with option #68, as Fileset is not a subclass of Object here).
  • Fileset can be ignored as needed - i.e., non-FS Object hasFile File directly is possible.
    • In fact, hasFile is used only to link an Object to a File. Use managesFile / managedBy to link a Fileset to a File.
    • This could help navigate one option versus the other in queries/update paths, but doesn't address (like 68) when you would use which option.
  • In my estimation (i.e. the changes say nothing about the following, but this needs to be discussed):
    • You have a direct Object hasFile File relationship when there are not multiple Files requiring Fileset groupings.
    • You create and use Filesets when there are multiple Files requiring groupings.
    • "Requirement" will be hard to pin down and implementation-specific.
    • This leaves open if each File/binary should be a member of 1 and only 1 resource.
    • This creates possibly problematic variance for query or update paths for all Files of a resource (i.e., modeling to have two possible paths for the same information).
    • This creates uncertainty about what metadata goes where due to possible use of just a File or a Fileset with that File. It does avoid the lack of clarity around the Object / Fileset combination in #68, though.
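
The hasFile / managesFile split in #69 can be linted in instance data. The sketch below is a closed-world check, assuming triples as plain string tuples and explicit `rdf:type` statements; RDFS domain and range on their own would infer types rather than reject data. The data, the function name, and the `pcdm:FileSet` spelling are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical data: two correct links and two that mix up the properties.
TRIPLES = {
    (":page1", "rdf:type", "pcdm:Object"),
    (":fs1", "rdf:type", "pcdm:FileSet"),
    (":page1", "pcdm:hasFile", ":file1"),      # ok: Object -> File
    (":fs1", "pcdm:managesFile", ":file1"),    # ok: FileSet -> File
    (":fs1", "pcdm:hasFile", ":file2"),        # wrong: a FileSet should use managesFile
    (":page1", "pcdm:managesFile", ":file3"),  # wrong: only a FileSet manages Files
}


def misused_links(triples):
    """Return the hasFile / managesFile triples whose subject has the wrong type under #69."""
    types = defaultdict(set)
    for s, p, o in triples:
        if p == "rdf:type":
            types[s].add(o)
    problems = []
    for s, p, o in sorted(triples):
        if p == "pcdm:hasFile" and "pcdm:FileSet" in types[s]:
            problems.append((s, p, o))
        if p == "pcdm:managesFile" and "pcdm:FileSet" not in types[s]:
            problems.append((s, p, o))
    return problems


print(misused_links(TRIPLES))
```

Under #68 the first check would not be possible, because a Fileset is itself an Object and may legitimately use hasFile.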

Examples

1 : Book has Pages, Pages have multiple digital surrogates from different digitization efforts, some Pages contain Images we capture in detail for sake of discovery.

:book1 a pcdm:Object ;
  dct:title "Bringer of the mystery dog" ;
  dct:creator <http://id.loc.gov/rwo/agents/n79056779> ;
  dct:type dcmitype:Text ;
  dct:format <http://vocab.getty.edu/aat/300028051> ;
  pcdm:hasMember :page1, :page2, ... .
:page1 a pcdm:Object ;
  dct:title "Page 32" ;
  bibo:locator "32"^^xsd:integer ;
  pcdm:hasFileSet :fs1 , :fs2 ;
  pcdm:hasFile :file1 . # wondering if we can use this to represent the preferred rep when there are multiple filesets.
:fs1 a pcdm:Fileset ;
  dct:publisher :ag1 ;
  dct:rights <http://creativecommons.org/licenses/by/3.0/> ; # digital rights, not intellectual work rights. should maybe be on file? this is more access?
  pcdm:managesFile :file1, :file2.
:file1 a pcdm:File , pcdmfft:RasterImage ;
  ebucore:filename "1234.jpg" ;
  ebucore:fileSize "9087656" .
:file2 a pcdm:File , pcdmfft:RasterImage ;
  ebucore:filename "1234.png" ;
  ebucore:fileSize "8909087656" .
:page2 a pcdm:Object ;
  dct:title "Page 33" ;
  bibo:locator "33"^^xsd:integer ;
  pcdm:hasMember :detail1 ;
  pcdm:hasFileSet :fs3 , :fs4 ;
  pcdm:hasFile :file3 .
# skipping :fs3, :fs4, :file3 as follows above
:detail1 a pcdm:Object ;
  dct:title "Detail of Ishtakhaba" ; # illustration caption
  foaf:depicts <http://www.wikidata.org/entity/Q3155283> ; # including for Agent to depiction convo
  pcdm:hasFileSet :fs5 ;
  pcdm:hasFile :file5 .
# skipping :fs5, :file5 as follows above
:ag1 a foaf:Organization ;
  foaf:name "Cornell University. Library. Digital Consulting and Production Services"@en .

2 : Journal has Issues, Issues have Articles, Issue has Pages, Articles have Pages, each Page is digitized once but has derivatives, OCR.

:journal1 a pcdm:Object ;
  dct:title "Agricultural History" ;
  dct:publisher <http://id.loc.gov/rwo/agents/n79119036> ;
  dct:type dcmitype:Text ;
  dct:format <http://vocab.getty.edu/aat/300215390> ;
  pcdm:hasMember :issue1, :issue2, ... :issue19 ;
  iana:first :proxyIssue1 ;
  iana:last :proxyIssue19 .
:issue1 a pcdm:Object ;
  dct:title "Agricultural History" ;
  dct:issued "1945-01"^^dct:W3CDTF ;
  bibo:volume "19"^^xsd:integer ;
  bibo:issue "1"^^xsd:integer ;
  dct:type dcmitype:Text ;
  dct:format <http://vocab.getty.edu/aat/300048715> ;
  pcdm:hasMember :article1, :article2, :page1, :page2, ... :page50 ;
  iana:first :proxyPage1 ;
  iana:last :proxyPage50 .
:article1 a pcdm:Object ;
  dct:title "Factors Influencing the Distribution of the German Pioneer Population in Minnesota" ;
  dct:creator :ag2 ;
  dct:type dcmitype:Text ;
  dct:format <http://vocab.getty.edu/aat/300048715> ;
  pcdm:hasMember :page1, :page2 ;
  iana:first :proxyPageArt1 ;
  iana:last :proxyPageArt2 .
:page1 a pcdm:Object ;
  dct:title "Page 1" ;
  bibo:locator "1"^^xsd:integer ;
  pcdm:hasFileSet :fs1, :fs2 ;
  pcdm:hasFile :file1 . # wondering if we can use this to represent the preferred rep when there are multiple filesets.
:fs1 a pcdm:Fileset ;
  dct:publisher :ag1 ;
  dct:description "Digitization funded by Cornell University Class of 1956."@en ;
  dct:rights <http://creativecommons.org/licenses/by/3.0/> ; # digital rights, not intellectual work rights. should maybe be on file? this is more access?
  pcdm:managesFile :file1, :file2.
:file1 a pcdm:File , pcdmfft:RasterImage ;
  ebucore:filename "1234.jpg" ;
  ebucore:fileSize "9087656" .
:file2 a pcdm:File , pcdmfft:Text ; # Could include tool/event information used to perform OCR here
  ebucore:filename "1234.txt" ;
  ebucore:fileSize "8909087656" .
:page2 a pcdm:Object ;
  dct:title "Page 2" ;
  bibo:locator "2"^^xsd:integer ;
  pcdm:hasFileSet :fs3 , :fs4 ;
  pcdm:hasFile :file3 .
# skipping :fs3, :fs4, :file3 as follows above
:ag1 a foaf:Organization ;
  foaf:name "Cornell University. Library. Digital Consulting and Production Services"@en .
:ag2 a foaf:Person ;
  foaf:name "Johnson, Hildegard Binder" .
:proxyPage1 a ore:Proxy ;
  ore:proxyFor :page1 ;
  ore:proxyIn :issue1 ;
  iana:next :proxyPage2 .
# ... (on through :proxyPage50)
:proxyPageArt1 a ore:Proxy ;
  ore:proxyFor :page1 ;
  ore:proxyIn :article1 ;
  iana:next :proxyPageArt2 .
:proxyPageArt2 a ore:Proxy ;
  ore:proxyFor :page2 ;
  ore:proxyIn :article1 ;
  iana:prev :proxyPageArt1 .
# proxies for issue ordering in a Journal resource skipped as follows above pattern

Idea for moving forward

Option #69 appears to be the better choice: we do want flexibility for the sake of the multiple groups using PCDM, and #69 provides it while avoiding some of the possible conflation of Class / property definitions.

Additionally, I wonder about the possibility to use some sort of either modeling language logic (OWL property chain axiom?) or external conversion spec (SWRL? Generic conversion scripts?) to map:

pcdm:Object pcdm:hasFileSet pcdm:FileSet pcdm:managesFile pcdm:File

to

pcdm:Object pcdm:hasFile pcdm:File
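
As a rough illustration of the "generic conversion script" route, the mapping above can be materialized over plain triples. This is a sketch under assumptions: triples are string tuples, and the data and `infer_has_file` helper are hypothetical. The comment shows how the same rule might be written as an OWL 2 property chain axiom in Turtle; neither PR proposes that axiom.

```python
from collections import defaultdict

# Possible OWL 2 form of the same rule (not part of PR #68 or #69):
#   pcdm:hasFile owl:propertyChainAxiom ( pcdm:hasFileSet pcdm:managesFile ) .

# Hypothetical data using the #69 properties.
TRIPLES = {
    (":page1", "pcdm:hasFileSet", ":fs1"),
    (":page1", "pcdm:hasFile", ":file1"),
    (":fs1", "pcdm:managesFile", ":file1"),
    (":fs1", "pcdm:managesFile", ":file2"),
}


def infer_has_file(triples):
    """Return the hasFile triples implied by hasFileSet + managesFile that are not already stated."""
    manages = defaultdict(set)
    for s, p, o in triples:
        if p == "pcdm:managesFile":
            manages[s].add(o)
    inferred = {
        (obj, "pcdm:hasFile", f)
        for obj, p, fs in triples
        if p == "pcdm:hasFileSet"
        for f in manages.get(fs, ())
    }
    return inferred - set(triples)


print(infer_has_file(TRIPLES))
```

Adding the inferred triples and dropping the Fileset layer gives the flattened view; anything stated only on the Fileset (publisher, rights) is lost in that direction, which ties into the profiling question raised at the end of this document.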

So, here is how I'd hope use case 1 would be rendered by an institution / application not using Filesets (via a possible OWL property chain axiom or other mapping):

:book1 a pcdm:Object ;
  dct:title "Bringer of the mystery dog" ;
  dct:creator <http://id.loc.gov/rwo/agents/n79056779> ;
  dct:type dcmitype:Text ;
  dct:format <http://vocab.getty.edu/aat/300028051> ;
  pcdm:hasMember :page1, :page2, ... .
:page1 a pcdm:Object ;
  dct:title "Page 32" ;
  bibo:locator "32"^^xsd:integer ;
  pcdm:hasFile :file1, :file2.
:file1 a pcdm:File , pcdmfft:RasterImage ;
  ebucore:filename "1234.jpg" ;
  ebucore:fileSize "9087656" .
:file2 a pcdm:File , pcdmfft:RasterImage ;
  ebucore:filename "1234.png" ;
  ebucore:fileSize "8909087656" .
:page2 a pcdm:Object ;
  dct:title "Page 33" ;
  bibo:locator "33"^^xsd:integer ;
  pcdm:hasMember :detail1 ;
  pcdm:hasFile :file3 .
:detail1 a pcdm:Object ;
  dct:title "Detail of Ishtakhaba" ; # illustration caption
  foaf:depicts <http://www.wikidata.org/entity/Q3155283> ; # including for Agent to depiction convo
  pcdm:hasFile :file5 .
# skipping :file5 as follows above
:ag1 a foaf:Organization ;
  foaf:name "Cornell University. Library. Digital Consulting and Production Services"@en .

2 : Journal has Issues, Issues have Articles, Issue has Pages, Articles have Pages, each Page is digitized once but has derivatives, OCR.

:journal1 a pcdm:Object ;
  dct:title "Agricultural History" ;
  dct:publisher <http://id.loc.gov/rwo/agents/n79119036> ;
  dct:type dcmitype:Text ;
  dct:format <http://vocab.getty.edu/aat/300215390> ;
  pcdm:hasMember :issue1, :issue2, ... :issue19 ;
  iana:first :proxyIssue1 ;
  iana:last :proxyIssue19 .
:issue1 a pcdm:Object ;
  dct:title "Agricultural History" ;
  dct:issued "1945-01"^^dct:W3CDTF ;
  bibo:volume "19"^^xsd:integer ;
  bibo:issue "1"^^xsd:integer ;
  dct:type dcmitype:Text ;
  dct:format <http://vocab.getty.edu/aat/300048715> ;
  pcdm:hasMember :article1, :article2, :page1, :page2, ... :page50 ;
  iana:first :proxyPage1 ;
  iana:last :proxyPage50 .
:article1 a pcdm:Object ;
  dct:title "Factors Influencing the Distribution of the German Pioneer Population in Minnesota" ;
  dct:creator :ag2 ;
  dct:type dcmitype:Text ;
  dct:format <http://vocab.getty.edu/aat/300048715> ;
  pcdm:hasMember :page1, :page2 ;
  iana:first :proxyPageArt1 ;
  iana:last :proxyPageArt2 .
:page1 a pcdm:Object ;
  dct:title "Page 1" ;
  bibo:locator "1"^^xsd:integer ;
  pcdm:hasFile :file1, :file2.
:file1 a pcdm:File , pcdmfft:RasterImage ;
  ebucore:filename "1234.jpg" ;
  ebucore:fileSize "9087656" .
:file2 a pcdm:File , pcdmfft:Text ; # Could include tool/event information used to perform OCR here
  ebucore:filename "1234.txt" ;
  ebucore:fileSize "8909087656" .
:page2 a pcdm:Object ;
  dct:title "Page 2" ;
  bibo:locator "2"^^xsd:integer ;
  pcdm:hasFile :file3, :file4 .
# skipping :file3, :file4 as follows above
:ag1 a foaf:Organization ;
  foaf:name "Cornell University. Library. Digital Consulting and Production Services"@en .
:ag2 a foaf:Person ;
  foaf:name "Johnson, Hildegard Binder" .
:proxyPage1 a ore:Proxy ;
  ore:proxyFor :page1 ;
  ore:proxyIn :issue1 ;
  iana:next :proxyPage2 .
# ... (on through :proxyPage50)
:proxyPageArt1 a ore:Proxy ;
  ore:proxyFor :page1 ;
  ore:proxyIn :article1 ;
  iana:next :proxyPageArt2 .
:proxyPageArt2 a ore:Proxy ;
  ore:proxyFor :page2 ;
  ore:proxyIn :article1 ;
  iana:prev :proxyPageArt1 .
# proxies for issue ordering in a Journal resource skipped as follows above pattern

However, I'd like to see the following questions answered with some kind of consensus for documenting, possibly making restrictions in the ontology, etc. before moving forward:

  • Implementation decision makers for the two application stacks interested in this discussion should talk through best practices, i.e.:
    • How are we defining FileSet as a community?
    • When do we recommend using FileSet versus not? (again, best practices I'd hope community sourced/supported/documented, but possibly not in the ontology itself)
      • Also, for me: when would we expect to see both used in the same instance data / application?
      • Included in this, also, membership expectations.
    • Updating profiling of FileSet / File to understand what gets captured / lost when using FileSet or not (or during conversion, same caveat as the above point).
  • Clearly documented examples showcasing how the ontology doc would need to be updated in future iterations to support the above decisions (as/if needed, such as adding property chains), as well as expected usage, data conversion or sharing inputs and outputs.