@ruthtillman
Last active October 7, 2015 18:46
Notes from the DC Fedora Users Group, 2015-10-07

Fedora 4 DC User Group Meeting 2015-10-07

Andrew Woods presentation.

[Image: Fedora 4 architecture stack, top to bottom]

Access & Preservation Services: REST Framework, Fedora Services

Repository Services: ModeShape (JCR implementation)

Caching, Clustering, & Storage Services: Infinispan, Storage (Objects & Datastreams)

Presentation

Attempting to make the Fedora part at the bottom of the stack as minimal as possible. Translates into RDF and other standards that relate to that. RESTful services to expose that.

He brings up ModeShape's responsiveness. Notes that one can vote on their tickets and there may be requests at times. The ModeShape project implements the JCR specification. They do other things too, but Fedora interacts solely with the JCR spec in order to avoid getting messy. They also have their own back-ends. Seems to work with Infinispan.

He believes Infinispan is a "hidden gem" for clustering and scalability. This is the part that touches the disk. Below that is the actual disk storage.

The goal is to implement standards that extend far beyond our community. They've defined the services that people wanted from Fedora 4, the goal then is to align implementation with those standards. They don't want to make people have to build applications against something super customized. They want people to be able to use existing libraries with larger support communities.

Core Features and Standards

CRUD - Linked Data Platform (LDP)

Obviously this is your big one.

Versioning - Memento?

Being considered by W3C as possible standard. It's good for accessing versions of objects. There's more we might need from it (missed this exactly while looking it up, creating versions perhaps?), but ideally we can participate in this community and make this kind of thing happen.

Authorization - WebAC

Fedora expects the requests it's getting are pre-authenticated.

Transactions - ??

Arguments that transactions shouldn't be RESTful. So they have a model for implementing transactions and it's kind of what you'd expect. Nothing gets committed until you tell it to.

Fixity - http://tools.ietf.org/html/rfc3230#section-4.3.2

There are two flavors. First flavor - when you ingest you have an opportunity of providing a checksum with your request and you can get an exception if it doesn't match. Second flavor - Fixity on demand. Make the system read, calculate checksum, and compare to stored value. This isn't 100% implemented yet?
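The two flavors can be illustrated with a small sketch. This is not Fedora's code: `ingest`, `check_fixity`, and the in-memory `store` dict are hypothetical stand-ins, and SHA-1 is assumed only because the fixity output further down in these notes reports a `urn:sha1:` digest.

```python
import hashlib
from typing import Optional


class ChecksumMismatch(Exception):
    """Raised when the supplied and computed checksums differ."""


def sha1_urn(data: bytes) -> str:
    """Return the SHA-1 of data in urn:sha1:<hex> form."""
    return "urn:sha1:" + hashlib.sha1(data).hexdigest()


def ingest(store: dict, path: str, data: bytes,
           expected: Optional[str] = None) -> str:
    """Flavor 1: verify an optional client-supplied checksum at ingest."""
    actual = sha1_urn(data)
    if expected is not None and expected != actual:
        raise ChecksumMismatch(f"{path}: expected {expected}, got {actual}")
    store[path] = {"data": data, "digest": actual}
    return actual


def check_fixity(store: dict, path: str) -> bool:
    """Flavor 2: re-read the bytes, recompute, compare to the stored digest."""
    record = store[path]
    return sha1_urn(record["data"]) == record["digest"]
```

Calling `ingest` with a wrong `expected` value raises before anything is stored, while `check_fixity` only reports a mismatch after the stored bytes have changed.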

Andrew Woods: Hands-On Portion

Notes may or may not be sparse because it's gonna be tricky.

CRUD Work

On our side, we've got Fedora 4 with LDP, WebAC, and ??Memento??. Every time you do something, a little envelope, as it were, goes out in the world. What's happening on our virtual installations is that Apache Camel is listening for those messages. It sends things into Solr and into a triplestore (Fuseki on the VM)... there's a lot more we could be doing with it, that's just what's on the machine right now. Camel is your oyster.

Question: any particular triplestore becoming popular?

Andrew: slight push toward Blazegraph. [convo] We've certainly plugged it in here and it populates.

Creating a resource

First we have to create a "cover" container.

(Note on the demo, if you see "Toggle Actions" it's a responsive design aspect to handle smaller screens, toggles left sidebar.)

Two options, to create container (think F3 object) or a binary (think F3 datastream).

Create New Child Resource. Name it cover. Recommendation long-term is that you don't assign semantics to your resources. The system can create identifiers for each object. These will also be related to your overall URI.

Question about PIDs in this context.

Answer: Recommendation to store PID as property and use a lookup table, probably cached in some way. Possibly as dc:identifier, possibly in a custom Fedora 3 namespace which refers to the fact that it's legacy Fedora 3 data.
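A minimal sketch of the lookup-table idea, assuming the legacy PID has been stored on each resource as `dc:identifier`. `PidLookup` and the dict of resources are hypothetical; a real deployment would query the repository or an index rather than scan a dict.

```python
from typing import Dict, Optional


class PidLookup:
    """Map legacy Fedora 3 PIDs to Fedora 4 paths, with a simple cache."""

    def __init__(self, resources: Dict[str, Dict[str, str]]):
        # resources: Fedora 4 path -> properties stored on that resource
        self._resources = resources
        self._cache: Dict[str, str] = {}

    def path_for(self, pid: str, prop: str = "dc:identifier") -> Optional[str]:
        """Return the path whose `prop` equals `pid`, or None if unknown."""
        if pid in self._cache:
            return self._cache[pid]
        for path, props in self._resources.items():
            if props.get(prop) == pid:
                self._cache[pid] = path  # remember for later lookups
                return path
        return None


lookup = PidLookup({"/colloquia/1471": {"dc:identifier": "yul:328697"}})
print(lookup.path_for("yul:328697"))
```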

Make resource a pcdm:Object

SPARQL-update:

PREFIX pcdm: <http://pcdm.org/models#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

INSERT {
	<http://localhost:8080/fcrepo/rest/cover> rdf:type pcdm:Object.
}
WHERE {}

Or more concisely (ok, I was making the top one concise like this but I'll expand and show the concise one too)

PREFIX pcdm: <http://pcdm.org/models#>

INSERT {
	<> a pcdm:Object.
}
WHERE {}

Put it into the update command.
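Outside the HTML UI, the same update can be sent over HTTP. The sketch below only builds the request and does not send it; it assumes the repository accepts a PATCH with content type `application/sparql-update` at the resource URL, and `build_sparql_patch` is a hypothetical helper name.

```python
import urllib.request

SPARQL_UPDATE = """\
PREFIX pcdm: <http://pcdm.org/models#>

INSERT {
  <> a pcdm:Object .
}
WHERE {}
"""


def build_sparql_patch(resource_url: str, update: str) -> urllib.request.Request:
    """Build (but do not send) a PATCH request carrying a SPARQL update."""
    return urllib.request.Request(
        resource_url,
        data=update.encode("utf-8"),
        headers={"Content-Type": "application/sparql-update"},
        method="PATCH",
    )


request = build_sparql_patch("http://localhost:8080/fcrepo/rest/cover",
                             SPARQL_UPDATE)
# urllib.request.urlopen(request) would send it; omitted so this runs offline.
```

Because the update uses `<>`, it applies to whichever resource URL the request targets.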

Question: Is it possible to update the sidebar prefixes? Add our own?

Answer: (Ruth's thought...I can't actually make it go away when I add one wrong? Unless that was changed in a newer version.) Andrew: Initial concerns about messing with the configuration file, but you could add prefixes there. Shouldn't be too much of an issue but may have to replace after upgrade. However also people have noticed (ok yeah this is me), if you put in what's basically N-Triples, it'll create its own prefixes for that. They may be prefixes you don't want, though.

Question: What would we see if we queried ModeShape directly?

Answer: You're right, ModeShape wouldn't be seeing it as RDF, but if you do the Export, you'll get the JCR/XML and can see how it looks in ModeShape. You'll see less than at the Fedora layer because we've been adding things into the mix.

Note: If you create and delete a resource, you'll get a Tombstone and you won't be able to recreate it in the HTML interface. You can get rid of the Tombstone with cURL, but it's more of a pain. The information is stored in Fedora just like the resource itself was.

Break

David Wilcox: Migrating from Fedora 3 to Fedora 4

Differences between Fedora 3 and Fedora 4

F3: Content Model Architecture. FOXML. Objects (collection of bytestreams & properties), Datastreams(bytestreams in context of object with some properties)

F4: LDP. You have RDF resources (objects & containers), and then you have non-RDF resources (former datastreams, but slightly more complex now), each of which has a binary and a description (a non-RDF description?) which holds the properties. Your resources are not fundamentally composed of XML.

Therefore part of the migration process is mapping FOXML to these resources.

You'll have to make decisions whether to move existing XML to XML bitstreams or to RDF depending on what's supported.

You can then make decisions how you're going to organize your objects. PCDM introduced. Three types of resources: Collections, Objects, and Files. This is only going to briefly touch on it. Check out the GitHub wiki. Ideally Hydra/Islandora will be able to understand each other better using this. And maybe we can understand external resources too. This is not a Fedora standard or a Prescribed standard. Many people agreeing that it makes sense and they'd like to implement.

Andrew brings up: People talking about putting a Hydra or Islandora repo on top of Fedora 4 that was already running the other one and seeing how easily it'd be able to pick up on those F4 resources. Very basic interoperability. Still in planning and experimental phases.

Organizational differences: Fedora 3 is Flat--everything is top level. Trees are built using RELS, pretty much. Fedora 4 has lots of hierarchy. Containers and binaries are in a hierarchy and all resources descend from a root resource. But, just like Fedora 3, you don't need to show your users the hierarchical location.

File system: Newer versions of Fedora 3 use PairTree and the Akubra system. Older versions have a legacy file system. Fedora 4 uses Infinispan and ModeShape. Containers are stored by default in a database, default LevelDB. The bitstreams, however, are stored in a PairTree directory. You can try to store the containers in other DBs but nothing has been tested much yet--if someone wants to try testing and performance to suggest another database, that'd be great. You can also try storing the containers in PairTree...

Question: What about the feature of being able to support connecting to another filesystem?

Answer: ModeShape & Infinispan both allow you to connect to another filesystem with federated connectors. You can READ in another filesystem and act like it's been ingested into Fedora. So if you want, for example, to use a bunch of videos or research data that you don't want to move, you can have Fedora read and understand those files. Whatever the hierarchy is in your filesystem, that's how it'll be structured in Fedora. Look for "Federation" in the wiki, which describes how to set up the connection; then you create the objects or containers in Fedora proper and link between the two. It could also become Read/Write, but that depends on community need and action.

Identification of Repository Resource: In Fedora 3 you have the PID, which can never be altered. Fedora 4 has a Path. You may use an ID minter to change the default. You can still have the PIDs and store them as an RDF property. They just won't be internally meaningful.

Data Mapping

One of the first things you'll want to do is the up-front data mapping. Assessing what you have.

You may want to use something like PREMIS hasDateCreatedByApplication to record the date the object was created in Fedora 3. There will also be a variety of things like how to maintain the record of who first created the object vs. who brought it into the Fedora 4 repository.

| fcrepo3          | fcrepo4                            | Example                  |
|------------------|------------------------------------|--------------------------|
| PID              | fedora3model:PID+                  | yul:328697               |
| state            | fedoraaccess:objState              | Active                   |
| label            | fedora3model:label+                | Elvis Presley            |
| createDate       | premis:hasDateCreatedByApplication | 2015-03-16T20:11:06.683Z |
| lastModifiedDate | metadataModification               | 2015-03-16T20:11:06.683Z |
| ownerId          | fedora3model:ownerId+              | nruest                   |

And

| fcrepo3       | fcrepo4                            | Example                                              |
|---------------|------------------------------------|------------------------------------------------------|
| DSID          | dcterms:identifier                 | OBJ                                                  |
| Label         | dcterms:title*                     | ASC19109.tif                                         |
| MIME Type     | ebucore:hasMimeType+               | image/tiff                                           |
| State         | fedoraaccess:objState              | Active                                               |
| Created       | premis:hasDateCreatedByApplication | 2015-03-16T20:11:06.683Z                             |
| Versionable   | fedora:hasVersions*                | true                                                 |
| Format URI    | premis:formatDesignation*          | info:pronom/fmt/156                                  |
| Alternate IDs | dcterms:identifier*                |                                                      |
| Access URL    | dcterms:identifier*                |                                                      |
| Checksum      | cryptofunc:hashalgorithm*          | cryptofunc:sha1"c91342b705b15cb4f6ac5362cc6a47d942" |

+ = Fedora 3-based? * = Uncertain
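As a rough sketch of how the object-level mapping could be applied, the dict below copies the first table (with the + and * markers dropped) and translates a Fedora 3 property dict into predicate/value pairs. `map_object_properties` is a hypothetical helper, not part of migration-utils, and the predicates marked uncertain in the table remain uncertain here.

```python
# Object-level mapping copied from the first table; markers dropped.
OBJECT_PROPERTY_MAP = {
    "PID": "fedora3model:PID",
    "state": "fedoraaccess:objState",
    "label": "fedora3model:label",
    "createDate": "premis:hasDateCreatedByApplication",
    "lastModifiedDate": "metadataModification",
    "ownerId": "fedora3model:ownerId",
}


def map_object_properties(f3_props: dict) -> tuple:
    """Return (mapped, unmapped): predicate/value pairs and leftover keys."""
    mapped = []
    unmapped = []
    for name, value in f3_props.items():
        predicate = OBJECT_PROPERTY_MAP.get(name)
        if predicate is None:
            unmapped.append(name)  # needs a human decision before migrating
        else:
            mapped.append((predicate, value))
    return mapped, unmapped


mapped, unmapped = map_object_properties(
    {"PID": "yul:328697", "label": "Elvis Presley", "ownerId": "nruest"}
)
```

Returning the unmapped keys separately makes the up-front assessment visible: anything in that list is a property nobody has decided how to carry over.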

There's a Hydra-based Fedora 3 -> Fedora 4 interest group. Something to keep an eye on. More alignment as we realize our problems are largely the same.

Purposes of Data Migration: Access vs. Preservation. What are your repository's goals? How much legacy data do you need to preserve or is it less important? You won't be entirely in one camp or the other most likely, but figure out how much of each is your concern.

What kinds of loss are tolerable? Is transformation of the metadata serialization tolerable?

What about access? This group has a pretty small number of restrictions. That'll make migration easier.

Service migration: What kinds of services are you using? Not using Disseminators is going to make our lives easier.

What content is being used by external services and what expectations are out there for continued and consistent access? It's hard to make overall statements because we all have different issues. But these are all things we're going to have to think through.

Using migration-utils: a starting point

There are 2 utilities. There's migration-utils. Hydra has a gem fedora-migrate.

migration-utils is open source, java-based, command line, works 100% from the FOXML so you don't have to worry about front-end stuff.

migration-utils has a /src with test data. /conf has configuration files.

Experimenting in /conf Akubra file.

In line 80, we get the Fedora 4 URL. Tries to dynamically find port, etc., but we might as well hardcode it. There are other things we can change like the Ingest Limit. Ensure it's pulling from the right folder, right now we're pulling from test folder.

Question: When pulling in all your objects, how to get a proper hierarchy from these flat objects?

Answer: Default ID minter will do some kind of hierarchical sorting? It's also possible that a ModeShape change will help. 3k/4k resources under a single parent? Does that include root? The default path minter in Fedora takes the approach of using 2 hex characters per level, 4 levels deep, which works out to about 4 billion resources, assuming random distribution. A useful reason to use the internal minter.
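The "about 4 billion" figure follows from the arithmetic: 2 hex characters give 256 values per level, and 4 levels give 256^4 = 4,294,967,296 lowest-level buckets. The sketch below reproduces the path shape seen in the Solr output later in these notes (`3c/94/1a/cb/3c941acb-...`). It assumes the identifier is a UUID, and `pairtree_path` is a hypothetical name, not Fedora's minter.

```python
import uuid


def pairtree_path(identifier: str, levels: int = 4, width: int = 2) -> str:
    """Prefix an identifier with `levels` segments of `width` hex characters."""
    hex_chars = identifier.replace("-", "")
    segments = [hex_chars[i * width:(i + 1) * width] for i in range(levels)]
    return "/".join(segments + [identifier])


def leaf_bucket_count(levels: int = 4, width: int = 2) -> int:
    """Number of distinct lowest-level buckets: (16 ** width) ** levels."""
    return (16 ** width) ** levels


print(pairtree_path(str(uuid.uuid4())))
print(leaf_bucket_count())
```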

Question: You can't do this migration utility against a currently-running instance of Fedora, right?

Answer: No, you can't run it against a live thing. You'll need to export all your Fedora 3 objects or run it against a copy of the Fedora 3 disk.

It'll copy over all local binaries. But your external references it won't copy in.

Question: Containers are still in LevelDB but they're also in ModeShape?

Answer: No, ModeShape is a level above LevelDB. ModeShape is pulling from LevelDB.

Question: How big?

Answer: Datastreams up to a TB in size in testing. Not a lot of people with it deployed right now. Tested on quite a lot of objects.

Question: Something about backups?

Answer: There is a restore function but just from the backup. Right now your files land on disk. Your containers/objects end up in the LevelDB.

Question: What does a container look like in LevelDB?

Answer: Look for the backup/restore section on the wiki page which should actually show you what things look like in the database. Not pretty. Lots of Java serialization. Java objects serialized to disk. Not very usable w/o the architecture.

break

Continuation of Andrew Woods Hands-On Portion

Transactions. The most time gets taken up when you touch a disk. Multiple actions can be bundled together into a single repository event/transaction. Transactions can be rolled back or they can be committed and actually touch disk.

From header, select Transactions, Start Transaction.

All your resources are now displayed as suffixed (parents)/prefixed (children) with a transaction ID, e.g. http://localhost:8080/fcrepo/rest/tx:d521a323-067f-430e-85d0-122a5ada267a/

As long as you navigate around within the transaction, it's working in the transaction. You can choose to add children, update properties, do whatever you need to do. Then you work in it. No messages are emitted. No clients, etc. can see it? But ...hmm, questions about transactions through other methods of interacting?
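A toy model of that behavior, for intuition only: changes are buffered inside the transaction, nothing reaches the committed state, and no messages go out until commit. `ToyRepository` and `ToyTransaction` are hypothetical classes and say nothing about how Fedora or ModeShape implement this internally.

```python
class ToyRepository:
    """In-memory stand-in for a repository; not Fedora's implementation."""

    def __init__(self):
        self.resources = {}  # committed state
        self.messages = []   # events emitted to listeners

    def begin(self):
        return ToyTransaction(self)


class ToyTransaction:
    def __init__(self, repo):
        self._repo = repo
        self._pending = {}   # changes visible only inside the transaction

    def put(self, path, properties):
        self._pending[path] = properties

    def get(self, path):
        # Reads inside the transaction see pending changes first.
        if path in self._pending:
            return self._pending[path]
        return self._repo.resources.get(path)

    def commit(self):
        # Only now does committed state change and do messages go out.
        self._repo.resources.update(self._pending)
        self._repo.messages.extend(f"updated {p}" for p in self._pending)
        self._pending = {}

    def rollback(self):
        self._pending = {}
```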

Question: Where are the temporary changes stored?

Answer: Unsure offhand. In memory.

Question: Are the objects locked during transactions?

Answer: No. What's supported is the HTTP request e-tag to make sure you're sending the commit to the same object you started editing. Last person in wins.
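The difference between an ETag-guarded write and "last person in wins" can be sketched as follows. This models standard HTTP `If-Match` semantics on an in-memory object; `ToyResource` and `etag_for` are hypothetical, and the exact status code or ETag format Fedora uses is not confirmed by these notes.

```python
import hashlib
from typing import Optional


class PreconditionFailed(Exception):
    """Stands in for an HTTP 412 Precondition Failed response."""


def etag_for(body: str) -> str:
    """Derive a toy ETag from the current representation."""
    return hashlib.sha1(body.encode("utf-8")).hexdigest()


class ToyResource:
    def __init__(self, body: str):
        self.body = body

    def update(self, new_body: str, if_match: Optional[str] = None) -> None:
        # With If-Match, reject the write if the resource changed meanwhile.
        if if_match is not None and if_match != etag_for(self.body):
            raise PreconditionFailed("resource changed since it was read")
        # Without If-Match nothing is checked: the last writer wins.
        self.body = new_body
```

A client that reads the ETag, edits, and sends it back with the write gets an error instead of silently overwriting someone else's change.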

Authorizations/WebAC

Created container Files under container Cover.

Created container my-acls / acl / authorization.

cover must point to its ACL

And the ACL must have one or more authorizations. Authorizations define agent(s), mode(s), and resource(s); resources can be specific URIs or class(es?) or RDF types.

Using semantic names for these because of presentation but not necessary.

Authorization in Fedora (4?) works like:

  • When request comes in, all the attributes (user, etc.) get passed down...
  • And gets handled at the ModeShape level.
  • When the request hits ModeShape, ModeShape kicks off to the authorization implementation and passes that attribute information off to it.
  • The authorization implementation uses whatever its decision-making factors are and then sends back True/False to ModeShape.

You'll get whichever response grants the most permissions for your user type.

ACL gets inherited. If an object doesn't have one, it'll look to the parents' ACL. If you're using the WebAC and you have no ACLs, you'll get denied.
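The inheritance rule can be sketched as a walk up the path: use the nearest ACL, deny if none is found, and take the union of modes across matching authorizations. This is a simplified model with hypothetical names (`effective_acl`, `is_allowed`); it ignores `accessToClass` and is not the WebAC module's actual algorithm.

```python
from typing import Dict, List, Optional, Set


def effective_acl(path: str, acl_links: Dict[str, str]) -> Optional[str]:
    """Walk from `path` toward the root; return the first ACL found."""
    current = path.rstrip("/") or "/"
    while True:
        if current in acl_links:
            return acl_links[current]
        if current == "/":
            return None
        current = current.rsplit("/", 1)[0] or "/"


def allowed_modes(agent: str, acl: str,
                  authorizations: Dict[str, List[dict]]) -> Set[str]:
    """Union of modes from every authorization in `acl` naming `agent`."""
    modes: Set[str] = set()
    for auth in authorizations.get(acl, []):
        if agent in auth["agents"]:
            modes |= set(auth["modes"])
    return modes


def is_allowed(agent: str, mode: str, path: str,
               acl_links: Dict[str, str],
               authorizations: Dict[str, List[dict]]) -> bool:
    acl = effective_acl(path, acl_links)
    if acl is None:
        return False  # no ACL anywhere up the tree: deny
    return mode in allowed_modes(agent, acl, authorizations)
```

With `/cover` linked to `/my-acls/acl`, a request for `/cover/files` resolves to the same ACL, while a path with no ACL on it or any ancestor is denied.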

So, in /my-acls/acl/authorization:

PREFIX acl: <http://www.w3.org/ns/auth/acl#>
PREFIX pcdm: <http://pcdm.org/models#>
INSERT {
<> a acl:Authorization ;
acl:accessToClass pcdm:Object ;
acl:mode acl:Read, acl:Write;
acl:agent "adminuser" .
} WHERE { }

(adminuser has password2 on the test system we're using, testuser password1. You can use these to test what you can see after doing these.)

Uncertain if acl:Write includes acl:Read; it probably does. Something to look up when setting this up ourselves.

You can do accessToClass or just to accessTo the specific resource URI(s) from Fedora.

Next, need to link the protected resource to its ACL.

Make this update on the Cover resource:

PREFIX acl: <http://www.w3.org/ns/auth/acl#>
INSERT {
<> acl:accessControl </fcrepo/rest/my-acls/acl>
} WHERE { }

Versioning

The HTML UI is just a subset of the repository functionality. Everything else you can do in the REST-API. You can do so much more in the REST-API directly. Versioning exposed through the HTML UI in previous, heh, versions, but it became overwhelming.

curl -ufedoraAdmin:secret3 -i -XPOST -H "slug: v0" localhost:8080/fcrepo/rest/cover/fcr:versions

Now let's try adding a property:

PREFIX dc: <http://purl.org/dc/elements/1.1/>
INSERT {
<> dc:publisher "The Press"
}
WHERE { }

And then run: curl -ufedoraAdmin:secret3 -i -XPOST -H "slug: v1" localhost:8080/fcrepo/rest/cover/fcr:versions

And now the second version, v1, differs from the first, v0.

Question: Is there any way to version along a hierarchy?

Answer: Ruth tested. It appears that versioning affects all children too. Worth looking into more, for sure, before depending on it. But it appears the answer is yes, versioning a parent versions the children and...importantly...reverting a parent reverts a child!

One, you definitely can't just set and forget versioning. Think of it more like snapshotting or release versioning.

Fixity

Fixity is run on binaries/datastream. There isn't yet fixity on the containers themselves.

Create the binary & upload the file. Then you can see it as, for example, http://localhost:8080/fcrepo/rest/colloquia/1313/cover.jpg/fcr:fixity

Properties
premis: hasFixity
	http://localhost:8080/fcrepo/rest/colloquia/1313/cover.jpg#fixity/1444240554538
Fixity Properties 
	http://localhost:8080/fcrepo/rest/colloquia/1313/cover.jpg#fixity/1444240554538
fedora: status
	SUCCESS
premis: hasContentLocation
	/colloquia/1313/cover.jpg/jcr:content/jcr:data
premis: hasMessageDigest
	urn:sha1:187f6de43474f4f9b34561848f7e98ee5f36e8b5
premis: hasSize
	304305
rdf: type
	http://www.loc.gov/premis/rdf/v1#Fixity

If you wanted to, you could SSH out from vagrant to ubuntu and then go to /var/lib/tomcat7/fcrepo4-data, find the file, and rename it to something else, which would then cause a problem for the fixity check.

Non-Core Features

There are going to be other use cases and things that aren't Core, and they're working on those so that everyone doesn't have to do their own home work. J. Westgard brings up Camel API extension option. In theory you could do all kinds of triggers and responses through Camel.

OSGi (Open Services Gateway initiative). You can do hot deployment, automatic reloading of configuration, sophisticated dependency resolution, XML scripting for complex deployments.

Hot deployment: You can do stuff at runtime, i.e. you don't have to restart your server to update code/config!

Hands-On: Into the Vagrant

> vagrant ssh

or:

> ssh -p 2222 vagrant@localhost

(password = vagrant)

Type straight up (don't cd) /opt/karaf/bin/client

This gets you straight into the OSGi

Within the client

feature:list | grep fcrepo

Takes a second to display, but then you can see all your features (6 for this installation) and they're all started (right?)

The indexing components in camel...you can just wipe out and reindex.

You could get logs by doing:

sudo tail -f /opt/karaf/data/log/karaf.log

Into Fuseki!

Go to: http://localhost:8080/fuseki/

Try: select * where { <http://localhost:8080/fcrepo/rest/cover> ?p ?o }

Try: PREFIX ldp: <http://www.w3.org/ns/ldp#> PREFIX ebucore: <http://www.ebu.ch/metadata/ontologies/ebucore/ebucore#> select * where { ?s ldp:contains ?o . ?o ebucore:hasMimeType ?m }

Try:

prefix premis: <http://www.loc.gov/premis/rdf/v1#>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
select ?s ?d where {
?s ?p <http://fedora.info/definitions/v4/audit#InternalEvent> .
?s premis:hasEventRelatedObject <http://localhost:8080/fcrepo/rest/cover> .
?s premis:hasEventDateTime ?d .
FILTER (?d > "2015-10-06T04:21:14Z"^^xsd:dateTime)
}

These events use example.com so your back-end events aren't necessarily exposed. Was messing with something else and missed part of discussion. You can change it in your settings though.

Into Solr

Go to http://localhost:8080/solr

Select Collection in sidebar and click Query. You can just search for *:*

Response I got searching for *:*

{
  "responseHeader": {
    "status": 0,
    "QTime": 34,
    "params": {
      "q": "*:*",
      "indent": "true",
      "wt": "json",
      "_": "1444243072993"
    }
  },
  "response": {
    "numFound": 12,
    "start": 0,
    "docs": [
      {
        "created": "1444237869001",
        "has_parent": "http://localhost:8080/fcrepo/rest/",
        "id": "http://localhost:8080/fcrepo/rest/authors",
        "title": [
          "Authorities"
        ],
        "last_modified": "1444237869001",
        "_version_": 1514393169089790000
      },
      {
        "created": "1444229986532",
        "has_parent": "http://localhost:8080/fcrepo/rest/colloquia",
        "id": "http://localhost:8080/fcrepo/rest/colloquia/3c/94/1a/cb/3c941acb-188b-4601-b257-1a9d7e6ee612",
        "last_modified": "1444230001159",
        "_version_": 1514384918952542200
      },
      {
        "created": "1444238212955",
        "has_parent": "http://localhost:8080/fcrepo/rest/my-alcs/acl",
        "id": "http://localhost:8080/fcrepo/rest/my-alcs/acl/authorization",
        "last_modified": "1444238837082",
        "_version_": 1514394184287518700
      },
      {
        "created": "1444224543732",
        "has_parent": "http://localhost:8080/fcrepo/rest/colloquia",
        "id": "http://localhost:8080/fcrepo/rest/colloquia/1471",
        "title": [
          "Magnetic Mars"
        ],
        "last_modified": "1444224543732",
        "_version_": 1514379199308103700
      },
      {
        "created": "1444238206422",
        "has_parent": "http://localhost:8080/fcrepo/rest/my-alcs",
        "id": "http://localhost:8080/fcrepo/rest/my-alcs/acl",
        "last_modified": "1444239112404",
        "_version_": 1514394473371533300
      },
      {
        "created": "1444238141515",
        "has_parent": "http://localhost:8080/fcrepo/rest/cover",
        "id": "http://localhost:8080/fcrepo/rest/cover/files",
        "last_modified": "1444238141515",
        "_version_": 1514393455561801700
      },
      {
        "created": "1444238202283",
        "has_parent": "http://localhost:8080/fcrepo/rest/",
        "id": "http://localhost:8080/fcrepo/rest/my-alcs",
        "last_modified": "1444238206422",
        "_version_": 1514393522728337400
      },
      {
        "created": "1444228319388",
        "has_parent": "http://localhost:8080/fcrepo/rest/",
        "id": "http://localhost:8080/fcrepo/rest/cover",
        "title": [
          "C"
        ],
        "last_modified": "1444239759544",
        "_version_": 1514395169068155000
      },
      {
        "created": "1444224357217",
        "has_parent": "http://localhost:8080/fcrepo/rest/",
        "id": "http://localhost:8080/fcrepo/rest/colloquia",
        "last_modified": "1444239919748",
        "_version_": 1514395366367166500
      },
      {
        "created": "1444240005968",
        "has_parent": "http://localhost:8080/fcrepo/rest/colloquia/1313",
        "id": "http://localhost:8080/fcrepo/rest/colloquia/1313/215",
        "title": [
          "Test 1"
        ],
        "last_modified": "1444240021234",
        "_version_": 1514395662923333600
      }
    ]
  }
}
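The response is easy to post-process. A small sketch, assuming `created` and `last_modified` are epoch milliseconds (the values are consistent with the meeting date) and using two documents copied from the response; `children_by_parent` and `millis_to_datetime` are hypothetical helpers.

```python
from collections import defaultdict
from datetime import datetime, timezone


def children_by_parent(docs):
    """Group document ids under their has_parent value."""
    tree = defaultdict(list)
    for doc in docs:
        tree[doc["has_parent"]].append(doc["id"])
    return dict(tree)


def millis_to_datetime(millis: str) -> datetime:
    """Convert an epoch-milliseconds string to an aware UTC datetime."""
    return datetime.fromtimestamp(int(millis) / 1000, tz=timezone.utc)


docs = [
    {"created": "1444228319388",
     "has_parent": "http://localhost:8080/fcrepo/rest/",
     "id": "http://localhost:8080/fcrepo/rest/cover"},
    {"created": "1444238141515",
     "has_parent": "http://localhost:8080/fcrepo/rest/cover",
     "id": "http://localhost:8080/fcrepo/rest/cover/files"},
]

tree = children_by_parent(docs)
created = millis_to_datetime(docs[0]["created"])
```

Grouping by `has_parent` rebuilds the container hierarchy from the flat index, which is handy for checking that everything under a parent was indexed.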

Reindexing

In SSH

> sudo service tomcat7 stop
> sudo rm -rf /etc/fuseki/databases/test_data/*
> sudo service tomcat7 start

So that wipes out the Fuseki.

curl -XPOST localhost:9080/reindexing/cover -H"Content-Type: application/json" -d '["activemq:queue:triplestore.reindex"]'

Huh, that really is the right port. So it's reindexing starting on cover and puts it in the listening reindexing triplestore.
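For scripting, the curl call can be mirrored in Python. The sketch only builds the request; the port, path, and queue name are copied from the command above, plain `http://` is assumed, and `build_reindex_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_reindex_request(
    path: str,
    queues: list,
    base: str = "http://localhost:9080/reindexing",
) -> urllib.request.Request:
    """Build (but do not send) the POST that the curl command issues."""
    url = base.rstrip("/") + "/" + path.strip("/")
    return urllib.request.Request(
        url,
        data=json.dumps(queues).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_reindex_request("cover", ["activemq:queue:triplestore.reindex"])
# urllib.request.urlopen(request) would send it; omitted so this runs offline.
```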

Fixity:

But now we're rushing out of the room...

> curl -XPOST localhost:9080/reindexing/cover -H"Content-Type: application/json" -d '["activemq:queue:fixity"]'
> less /tmp/fixityErrors.log

It looks for any binaries and runs fixity checks. On success, do nothing. On failure, write it to /tmp/fixityErrors.log. Current feature?? is that everything writes there right now.
