Scripts and processes description

Git

Git branch: https://github.com/atmire/DSpace/tree/w2p-64334_scripts-prototype

REST Contract: https://github.com/DSpace/Rest7Contract/pull/17

Documentation

Viewing the available scripts

GET /api/system/scripts

Current response:

{
  "_embedded": {
    "scripts": [
      {
        "id": "index-discovery",
        "name": "index-discovery",
        "description": "Update Discovery Solr Search Index",
        "type": "script",
        "parameters": [
          {
            "name": "-r",
            "description": "remove an Item, Collection or Community from index based on its handle",
            "type": "String"
          },
          {
            "name": "-i",
            "description": "add or update an Item, Collection or Community based on its handle or uuid",
            "type": "boolean"
          },
          {
            "name": "-c",
            "description": "clean existing index removing any documents that no longer exist in the db",
            "type": "boolean"
          },
          {
            "name": "-b",
            "description": "(re)build index, wiping out current one if it exists",
            "type": "boolean"
          },
          {
            "name": "-s",
            "description": "Rebuild the spellchecker, can be combined with -b and -f.",
            "type": "boolean"
          },
          {
            "name": "-f",
            "description": "if updating existing index, force each handle to be reindexed even if uptodate",
            "type": "boolean"
          },
          {
            "name": "-h",
            "description": "print this help message",
            "type": "boolean"
          }
        ],
        "_links": {
          "self": {
            "href": "http://localhost:8080/server/api/system/scripts/index-discovery"
          }
        }
      }
    ]
  },
  "_links": {
    "self": {
      "href": "http://localhost:8080/server/api/system/scripts"
    }
  },
  "page": {
    "size": 20,
    "totalElements": 1,
    "totalPages": 1,
    "number": 0
  }
}

Current features:

Starting a process

How to use the endpoint

Starting a process from REST requires a POST call to {dspace.restUrl}/api/system/scripts/{scriptname}/processes with form-data containing a key 'properties'. This key holds a JSON array that represents the parameters and looks as follows:

[
  {
    "name": "-i",
    "value": "123456789/9"
  },
  {
    "name": "-f",
    "value": "true"
  }
]

A curl example for this request could be:

curl -X POST \
  "{dspace.restUrl}/api/system/scripts/index-discovery/processes" \
  -H 'Authorization: TOKEN' \
  -H 'content-type: multipart/form-data' \
  -F 'properties=[
    {"name": "-i", "value": "b829ee28-3579-4273-a1f7-7857801be34c"},
    {"name": "-f", "value": "true"}
  ]'

Keep in mind that, for now, we need to be logged in as an admin to do this. Other permissions are to be defined later.
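A client may prefer to assemble the 'properties' value programmatically rather than by hand. The following is a minimal sketch of one way to do that; the ScriptPropertiesJson class and its methods are hypothetical helpers for illustration and are not part of DSpace.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of DSpace): serializes script parameters
// into the JSON array expected in the 'properties' form field.
class ScriptPropertiesJson {

    // Escapes backslashes and double quotes only; control characters are
    // not handled, which is enough for simple flags, handles and UUIDs.
    static String escape(String raw) {
        return raw.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Turns an ordered name -> value map into [{"name":...,"value":...},...].
    static String toJson(Map<String, String> parameters) {
        StringBuilder json = new StringBuilder("[");
        boolean first = true;
        for (Map.Entry<String, String> entry : parameters.entrySet()) {
            if (!first) {
                json.append(',');
            }
            json.append("{\"name\":\"").append(escape(entry.getKey()))
                .append("\",\"value\":\"").append(escape(entry.getValue()))
                .append("\"}");
            first = false;
        }
        return json.append(']').toString();
    }

    public static void main(String[] args) {
        Map<String, String> parameters = new LinkedHashMap<>();
        parameters.put("-i", "123456789/9");
        parameters.put("-f", "true");
        System.out.println(toJson(parameters));
    }
}
```

A LinkedHashMap is used so that the parameters keep the order in which they were added. A production client would more likely rely on a JSON library than on manual string building.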

Underlying logic

This call is handled by the startProcess method of the ScriptRestController (https://github.com/atmire/DSpace/blob/w2p-64334_scripts-prototype/dspace-server-webapp/src/main/java/org/dspace/app/rest/ScriptRestController.java#L47), which retrieves the script name from the URL. The ScriptRestRepository then starts the process. ScriptRestRepository#startProcess takes the script name from the URL as a parameter, reads the properties from the request, and converts them into a list of ParameterValueRest objects so that they are easy to use within our code. Each ParameterValueRest object holds a name and a value, both filled in from the properties JSON.

The result is a List of ParameterValueRest objects. We do not want our scripts to have to deal with REST representations of their parameters, so we convert these into DSpaceCommandLineParameter objects, which gives scripts one uniform way of handling parameters.
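The conversion step can be pictured roughly as follows. This is only a sketch: the classes below are simplified, hypothetical stand-ins for ParameterValueRest and DSpaceCommandLineParameter, and the real classes in the branch may have different fields and constructors.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for ParameterValueRest: a name and a value
// filled in from the 'properties' JSON.
class ParameterValueRestSketch {
    final String name;
    final String value;

    ParameterValueRestSketch(String name, String value) {
        this.name = name;
        this.value = value;
    }
}

// Simplified stand-in for DSpaceCommandLineParameter; the real class
// may carry different fields or behaviour.
class CommandLineParameterSketch {
    final String name;
    final String value;

    CommandLineParameterSketch(String name, String value) {
        this.name = name;
        this.value = value;
    }

    // Renders the parameter as it would be typed on a command line.
    @Override
    public String toString() {
        return value == null ? name : name + " " + value;
    }
}

class ParameterConverterSketch {
    // Copies each REST parameter into its script-facing counterpart,
    // preserving order.
    static List<CommandLineParameterSketch> convert(List<ParameterValueRestSketch> restParameters) {
        List<CommandLineParameterSketch> converted = new ArrayList<>();
        for (ParameterValueRestSketch rest : restParameters) {
            converted.add(new CommandLineParameterSketch(rest.name, rest.value));
        }
        return converted;
    }

    public static void main(String[] args) {
        List<ParameterValueRestSketch> fromRequest = Arrays.asList(
            new ParameterValueRestSketch("-i", "123456789/9"),
            new ParameterValueRestSketch("-f", "true"));
        System.out.println(convert(fromRequest));
    }
}
```

Keeping the two types separate means a script only ever sees the command-line flavoured object, regardless of how it was launched.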

The RestDSpaceRunnableHandler is set up with the current user as eperson, the script name, and the list of DSpaceCommandLineParameter objects. The script with the given name is looked up in the configuration located in scripts-and-processes.xml. If a suitable script is found, it is started.

Where is this process saved?

Creating the RestDSpaceRunnableHandler object also creates a Process object in the database. When we then call the RestDSpaceRunnableHandler#schedule method, the process is actually scheduled to run and its status is set to SCHEDULED. The domain class for this process object is org.dspace.content.Process, which is the entity persisted by Hibernate. The responsible DAO is the ProcessDAO and the responsible service is the ProcessService.

If files are required for the process, bitstreams are created and linked to the process using the process2bitstream database table. These bitstreams are not part of a bundle and item, but part of a process. They can be compared to collection logos, which are also not part of a bundle and item. Only the user who started the process receives permissions on these bitstreams.

Process queue

This initial implementation doesn't use a process queue yet, but it is designed so that it can easily be extended with one. The plan is to use the Spring ThreadPoolTaskExecutor for queueing purposes.
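To give an idea of what such a queue could look like, here is a minimal sketch that uses the JDK's java.util.concurrent executor as a stand-in for the planned Spring ThreadPoolTaskExecutor. The ProcessQueueSketch class and its method names are hypothetical and do not exist in the branch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical queue sketch using java.util.concurrent as a stand-in
// for the planned Spring ThreadPoolTaskExecutor.
class ProcessQueueSketch {
    private final ExecutorService executor;
    private final List<String> completed =
        Collections.synchronizedList(new ArrayList<String>());

    // maxConcurrent limits how many scripts run at once; extra
    // submissions wait in the executor's internal FIFO queue.
    ProcessQueueSketch(int maxConcurrent) {
        this.executor = Executors.newFixedThreadPool(maxConcurrent);
    }

    // Queues a script; its name is recorded once it finishes normally.
    void schedule(final String scriptName, final Runnable script) {
        executor.submit(() -> {
            script.run();
            completed.add(scriptName);
        });
    }

    // Stops accepting work, waits for queued scripts, and returns the
    // names of those that completed, in completion order.
    List<String> drain(long timeoutSeconds) {
        executor.shutdown();
        try {
            executor.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        synchronized (completed) {
            return new ArrayList<>(completed);
        }
    }

    public static void main(String[] args) {
        ProcessQueueSketch queue = new ProcessQueueSketch(1);
        queue.schedule("index-discovery", () -> System.out.println("indexing"));
        queue.schedule("filter-media", () -> System.out.println("filtering"));
        System.out.println(queue.drain(10));
    }
}
```

With a pool size of 1, scripts run strictly one after another in submission order. A larger pool trades that ordering for concurrency.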

Command line vs REST call

Calling a script through the REST API differs from calling it through the command line. This is why we created the DSpaceRunnableHandler implementations. The CommandLineDSpaceRunnableHandler deals with command line execution, whereas the RestDSpaceRunnableHandler deals with REST requests.

Processes started via REST are executed as a thread within Tomcat (no separate process on the server). Processes started via the command line remain a separate process on the server.

The biggest difference between the two is that the RestDSpaceRunnableHandler persists database objects when we start a process, whereas the CommandLineDSpaceRunnableHandler does not. The process table and corresponding classes are therefore only used for processes started from REST. The command line version of a script behaves the same as in DSpace 6: it can be executed directly from the command line and does not depend on REST. This implies process queues would also only be used in REST.

If there are reasons to start command line scripts from the server using the process table, queue, …, this is possible by starting the process with a REST call. This ensures that a process launched from the command line in this way is treated identically to any other process started via REST.

The goal is of course to replace the https://github.com/atmire/DSpace/blob/w2p-64334_scripts-prototype/dspace-api/src/main/java/org/dspace/discovery/IndexClient.java script with https://github.com/atmire/DSpace/blob/w2p-64334_scripts-prototype/dspace-api/src/main/java/org/dspace/scripts/impl/IndexClient.java. The two will not exist at the same time.

Error handling

If the script fails, it will throw an exception. Handling this exception is part of the applicable DSpaceRunnableHandler.handleException implementation.

If this is a REST script, the process table is updated to mark the process as failed, and the exception is logged.

If this is a command line script, the script exits with a non-zero exit status so that the failure is visible to the caller, and the exception is logged.

If the entire java process is killed abruptly, the DSpaceRunnableHandler.handleException will most likely not be reached. A subsequent implementation will also ensure the process table can identify such errors (once the java process is restarted).
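The split in error handling between the two handlers could look roughly like the sketch below. All types here are hypothetical simplifications of DSpaceRunnableHandler and its implementations, and the "FAILED" status name is an assumption (the text above only names SCHEDULED).

```java
// Hypothetical, simplified stand-in for DSpaceRunnableHandler; only the
// exception hook is modelled here.
interface RunnableHandlerSketch {
    void handleException(Exception e);
}

// Command line flavour: logs and remembers a non-zero exit status.
class CommandLineHandlerSketch implements RunnableHandlerSketch {
    int exitStatus = 0;

    @Override
    public void handleException(Exception e) {
        System.err.println("Script failed: " + e.getMessage());
        exitStatus = 1; // a real handler would end the JVM with this status
    }
}

// REST flavour: logs and marks the tracked process as failed. The
// "FAILED" status name is an assumption; the source only names SCHEDULED.
class RestHandlerSketch implements RunnableHandlerSketch {
    String processStatus = "SCHEDULED";

    @Override
    public void handleException(Exception e) {
        System.err.println("Process failed: " + e.getMessage());
        processStatus = "FAILED";
    }
}

class ScriptRunnerSketch {
    // Runs the script and routes any runtime exception to the handler.
    static void run(Runnable script, RunnableHandlerSketch handler) {
        try {
            script.run();
        } catch (RuntimeException e) {
            handler.handleException(e);
        }
    }

    public static void main(String[] args) {
        RestHandlerSketch rest = new RestHandlerSketch();
        run(() -> { throw new IllegalStateException("index unavailable"); }, rest);
        System.out.println(rest.processStatus);
    }
}
```

The script itself stays unaware of how it was launched. It simply throws, and the handler decides what a failure means in its environment.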

Converting the scripts to support REST

The actual changes between the current command line scripts and the updated versions for REST compatibility boil down to:

  • Moving the command line options to a separate file
  • Migrating the main method to an internalRun method
  • Replacing System.exit with an exit code
  • Moving the setup of the context
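As an illustration of the first three points, a converted script might be shaped like the sketch below. The class names, the two modelled flags, and the messages are hypothetical; the setup of the DSpace context is left out entirely.

```java
// Options parsed outside the script body, in their own class.
// Only two of the index-discovery flags are modelled here.
class IndexOptionsSketch {
    boolean rebuild;      // -b
    String indexTarget;   // -i <handle or uuid>

    static IndexOptionsSketch parse(String[] args) {
        IndexOptionsSketch options = new IndexOptionsSketch();
        for (int i = 0; i < args.length; i++) {
            if ("-b".equals(args[i])) {
                options.rebuild = true;
            } else if ("-i".equals(args[i]) && i + 1 < args.length) {
                options.indexTarget = args[++i];
            }
        }
        return options;
    }
}

class IndexScriptSketch {
    // Former body of main: reports the outcome as a return value
    // instead of calling System.exit.
    int internalRun(IndexOptionsSketch options) {
        if (!options.rebuild && options.indexTarget == null) {
            System.err.println("Nothing to do: pass -b or -i <handle>");
            return 1;
        }
        System.out.println(options.rebuild
            ? "Rebuilding index"
            : "Indexing " + options.indexTarget);
        return 0;
    }

    // Thin command line entry point; a real launcher would pass the
    // returned code to System.exit.
    public static void main(String[] args) {
        int code = new IndexScriptSketch().internalRun(IndexOptionsSketch.parse(args));
        System.out.println("exit code " + code);
    }
}
```

Returning an exit code matters for REST because the script runs as a thread inside Tomcat, where a call to System.exit would shut down the whole server rather than just the script.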

The scripts from launcher.xml that would benefit most from REST support are those offering features required from the UI:

  • metadata-import
  • metadata-export
  • import (SAF)
  • export (SAF)
  • curate
  • harvest

Other scripts relevant to a repository administrator, who should not need SSH access to run them, are:

  • index-discovery
  • filter-media
  • oai (updating the index)
  • structure-builder
  • community-filiator
  • doi-organiser
  • dsprop
  • itemupdate
  • packager
  • registry-loader
  • initialize-entities

Scripts which can remain in their current state:

  • bitstore-migrate
  • healthcheck
  • checker
  • checker-emailer
  • classpath
  • cleanup
  • create-administrator
  • database
  • embargo-lifter
  • generate-sitemaps
  • index-authority
  • make-handle-config
  • migrate-embargo
  • rdfizer
  • read
  • solr-export-statistics
  • solr-import-statistics
  • solr-reindex-statistics
  • stat-general
  • stat-initial
  • stat-monthly
  • stat-report-general
  • stat-report-initial
  • stat-report-monthly
  • stats-log-converter
  • stats-log-importer
  • stats-util
  • sub-daily
  • test-email
  • update-handle-prefix
  • user
  • validate-date
  • version

Next steps

This prototype already covers a large part of the functionality, but more features are useful or even necessary:

  • Converting more scripts to support REST
  • Handling the REST scripts in a queue using the Spring ThreadPoolTaskExecutor
  • Updating the status of the process when the Tomcat process was killed abruptly
  • An Angular UI for executing the scripts

Discussion points for next steps

  • Should we use the long or short format for the parameters?
  • What is the added value of a curation task vs performing scripts directly from REST?