@otuoma
Created November 27, 2018 10:04
Scheduled Tasks in DSpace using Cronjobs
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Scheduled Tasks in DSpace using Cronjobs\n",
"\n",
"A number of tasks in DSpace need to be carried out automatically and repeatedly. This can be achieved using the **crontab** program in Ubuntu.\n",
"\n",
"**Cron** is a system daemon (a program that runs in the background) that executes defined tasks at defined times; **crontab** is the command used to view and edit its schedule. Cron comes preinstalled on most Linux distributions.\n",
"\n",
"Each user on the system can have a crontab file. The file lists which tasks to execute and when. The cron daemon runs these tasks regardless of whether the owner is logged in to the system at that time.\n",
"\n",
"The time of execution in a crontab file is defined in five columns, in the order shown below:\n",
"\n",
"1. Minute (0-59)\n",
"2. Hour (0-23)\n",
"3. Day of month (1-31)\n",
"4. Month (1-12)\n",
"5. Day of week (0-6 where Sunday = 0 and Saturday = 6)\n",
"\n",
"A crontab file can be opened using the crontab command as shown below:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"crontab -e"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This opens the crontab file for the currently logged-in user. To open the crontab file for a different user, use the command below:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sudo crontab -e -u username"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If asked which editor to use, select _nano_; it is the easiest to use."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Cronjob scheduling Guide:\n",
"\n",
".---------------- minute (0 - 59)\n",
"| .------------- hour (0 - 23)\n",
"| | .---------- day of month (1 - 31)\n",
"| | | .------- month (1 - 12)\n",
"| | | | .---- day of week (0 - 6) (Sunday=0 or 7)\n",
"| | | | |\n",
"* * * * * /command/to/be/executed\n"
]
},
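{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration of how the five fields combine, the entries below are a minimal sketch for the cron shipped with Ubuntu. The script path `/path/to/script.sh` is a hypothetical placeholder, not part of DSpace.\n",
"\n",
"```\n",
"# Every day at 02:30\n",
"30 2 * * * /path/to/script.sh\n",
"\n",
"# Every 15 minutes\n",
"*/15 * * * * /path/to/script.sh\n",
"\n",
"# At 06:00 on weekdays (Monday to Friday)\n",
"0 6 * * 1-5 /path/to/script.sh\n",
"\n",
"# At midnight on the first day of every month\n",
"0 0 1 * * /path/to/script.sh\n",
"```\n",
"\n",
"A list (`0,8,16`), a range (`1-5`) or a step (`*/15`) can be used in any of the five fields."
]
},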
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Commands that normally need sudo privileges should be added to root's crontab, which is opened by running the crontab command with sudo, i.e.:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sudo crontab -e"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To find out whether a command requires sudo privileges, run it in the shell without sudo. If you get an error about permissions or insufficient privileges, it most likely needs sudo."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Scheduled tasks in DSpace\n",
"The following sample crontab file, sourced from [wiki.lib.sun.ac.za](http://wiki.lib.sun.ac.za/index.php?title=SUNScholar/Daily_Admin/5.X), contains the cronjobs DSpace needs, each with an explanation and a preferred time of execution."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## SAMPLE CRONTAB FOR A PRODUCTION DSPACE\n",
"## You obviously may wish to tweak this for your own installation,\n",
"## but this should give you an idea of what you likely wish to schedule via cron.\n",
"##\n",
"## NOTE: You may also need to add additional sysadmin related tasks to your crontab\n",
"## (e.g. zipping up old log files, or even removing old logs, etc).\n",
" \n",
"####################\n",
"# GLOBAL VARIABLES #\n",
"####################\n",
"# Deliver cron email to the system administrator\n",
"MAILTO=\"root\"\n",
"\n",
"# Set the dspace installation directory\n",
"DSPACE_DIR=\"put here your dspace folder\"\n",
"\n",
"################\n",
"# HOURLY TASKS #\n",
"################\n",
"# (Recommended to be run multiple times per day, if possible)\n",
"# At a minimum these tasks should be run daily.\n",
"\n",
"# Regenerate DSpace Sitemaps every 8 hours (12AM, 8AM, 4PM).\n",
"# SiteMaps ensure that your content is more findable in Google, Google Scholar, and other major search engines.\n",
"0 0,8,16 * * * $DSPACE_DIR/bin/dspace generate-sitemaps > /dev/null\n",
" \n",
"###############\n",
"# DAILY TASKS #\n",
"###############\n",
"# (Recommended to be run once per day. Feel free to tweak the scheduled times below.)\n",
" \n",
"# Update the OAI-PMH index with the newest content (and re-optimize that index) at midnight every day\n",
"# NOTE: ONLY NECESSARY IF YOU ARE RUNNING OAI-PMH\n",
"# (This ensures new content is available via OAI-PMH and ensures the OAI-PMH index is optimized for better performance)\n",
"0 0 * * * $DSPACE_DIR/bin/dspace oai import -o > /dev/null\n",
" \n",
"# Clean and Update the Discovery indexes at midnight every day\n",
"# (This ensures that any deleted documents are cleaned from the Discovery search/browse index)\n",
"0 0 * * * $DSPACE_DIR/bin/dspace index-discovery > /dev/null\n",
" \n",
"# Re-Optimize the Discovery indexes at 00:30 every day\n",
"# (This ensures that the Discovery Solr Index is re-optimized for better performance)\n",
"30 0 * * * $DSPACE_DIR/bin/dspace index-discovery -o > /dev/null\n",
" \n",
"# Cleanup Web Spiders from DSpace Statistics Solr Index at 01:00 every day\n",
"# NOTE: ONLY NECESSARY IF YOU ARE RUNNING SOLR STATISTICS\n",
"# (This removes any known web spiders from your usage statistics)\n",
"0 1 * * * $DSPACE_DIR/bin/dspace stats-util -i > /dev/null\n",
" \n",
"# Re-Optimize DSpace Statistics Solr Index at 01:30 every day\n",
"# NOTE: ONLY NECESSARY IF YOU ARE RUNNING SOLR STATISTICS\n",
"# (This ensures that the Statistics Solr Index is re-optimized for better performance)\n",
"30 1 * * * $DSPACE_DIR/bin/dspace stats-util -o > /dev/null\n",
" \n",
"# Send out subscription e-mails at 02:00 every day\n",
"# (This sends an email to any users who have \"subscribed\" to a Collection, notifying them of newly added content.)\n",
"0 2 * * * $DSPACE_DIR/bin/dspace sub-daily > /dev/null\n",
" \n",
"# Run the media filter at 03:00 every day.\n",
"# (This task ensures that thumbnails are generated for newly added images,\n",
"# and also ensures full text search is available for newly added PDF/Word/PPT/HTML documents)\n",
"0 3 * * * $DSPACE_DIR/bin/dspace filter-media -q > $DSPACE_DIR/log/media-filter.log 2>&1\n",
"\n",
"# Run any Curation Tasks queued from the Admin UI at 04:00 every day\n",
"# (Ensures that any curation task that an administrator \"queued\" from the Admin UI is executed\n",
"# asynchronously behind the scenes)\n",
"0 4 * * * $DSPACE_DIR/bin/dspace curate -q admin_ui > /dev/null\n",
"\n",
"# Check for items to release from embargo in DSpace at 05:00 every day.\n",
"# (This applies to embargoes created with DSpace versions <= 3.2)\n",
"0 5 * * * $DSPACE_DIR/bin/dspace embargo-lifter > $DSPACE_DIR/log/embargo-release.log 2>&1\n",
"\n",
"# Update the local ORCID database with the latest information from the external ORCID database at 06:00 every day.\n",
"# (This only applies to DSpace versions >= 5.2, if you enable ORCID lookups)\n",
"0 6 * * * $DSPACE_DIR/bin/dspace dsrun org.dspace.authority.UpdateAuthorities > $DSPACE_DIR/log/update-orcid-info.log 2>&1\n",
"\n",
"################\n",
"# WEEKLY TASKS #\n",
"################\n",
"# (Recommended to be run once per week, but can be run more or less frequently, based on your local needs/policies)\n",
"\n",
"# Run the checksum checker at 04:00 every Sunday\n",
"# By default it runs through every file (-l) and also prunes old results (-p)\n",
"# (This re-verifies the checksums of all files stored in DSpace. If any files have been changed/corrupted, checksums will differ.)\n",
"#0 4 * * 0 $DSPACE_DIR/bin/dspace checker -l -p > /dev/null\n",
"#\n",
"# NOTE: LARGER SITES MAY WISH TO USE DIFFERENT OPTIONS. The \"-l\" option in the commented-out task above tells DSpace to check *everything*.\n",
"# If your site is very large, you may need to only check a portion of your content per week. The active task below\n",
"# instead checks all the content it can within *one hour*. The next week it starts again where it left off.\n",
"0 4 * * 0 $DSPACE_DIR/bin/dspace checker -d 1h -p > /dev/null\n",
" \n",
"# Mail the results of the checksum checker (see above) to the configured \"mail.admin\" at 05:00 every Sunday.\n",
"# (This ensures the system administrator is notified whether any checksums were found to be different.)\n",
"0 5 * * 0 $DSPACE_DIR/bin/dspace checker-emailer > /dev/null\n",
"\n",
"# Run DSpace statistical analysis tools (12months takes approx 40secs)\n",
"30 0 * * 0 $DSPACE_DIR/bin/dspace stat-general > /dev/null\n",
"35 0 * * 0 $DSPACE_DIR/bin/dspace stat-monthly > /dev/null\n",
"\n",
"# Generate DSpace statistical analysis reports\n",
"00 1 * * 0 $DSPACE_DIR/bin/dspace stat-report-general > /dev/null\n",
"05 1 * * 0 $DSPACE_DIR/bin/dspace stat-report-monthly > /dev/null\n",
" \n",
"#################\n",
"# MONTHLY TASKS #\n",
"#################\n",
"# (Recommended to be run once per month, but can be run more or less frequently, based on your local needs/policies)\n",
"\n",
"# Permanently delete any bitstreams flagged as \"deleted\" in DSpace, on the first of every month at 01:00\n",
"# (This ensures that any files which were deleted from DSpace are actually removed from your local filesystem.\n",
"# By default they are just marked as deleted, but are not removed from the filesystem.)\n",
"0 1 1 * * $DSPACE_DIR/bin/dspace cleanup > /dev/null\n",
"\n",
"# Remove all log files which are more than 30 days old\n",
"# on the first of every month at 00:01\n",
"01 0 1 * * find $DSPACE_DIR/log/*.log.* -mtime +30 -exec rm {} \\;\n",
" \n",
"################\n",
"# YEARLY TASKS #\n",
"################\n",
"# (Recommended to be run once per year)\n",
"\n",
"# At 2:00AM every January 1, \"shard\" the DSpace Statistics Solr index.\n",
"# This ensures each year has its own Solr index, which improves performance.\n",
"# NOTE: ONLY NECESSARY IF YOU ARE RUNNING SOLR STATISTICS\n",
"# NOTE: This is scheduled here for 2:00AM so that it happens *after* the daily cleaning & re-optimization of this index.\n",
"0 2 1 1 * $DSPACE_DIR/bin/dspace stats-util -s > /dev/null\n",
"\n",
"################\n",
"# HOUSEKEEPING #\n",
"################\n",
"# (Recommended to be run regularly; scheduled here for 02:00 on the first of every month)\n",
"\n",
"# Delete any ~/config/*/*.old files more than 30 days old (created by \"ant update\")\n",
"0 2 1 * * find $DSPACE_DIR/config -name \"*-*-*.old\" -mtime +30 -exec rm {} \\;\n",
"# Delete any ~/*.bak-*-*/ directories more than 30 days old (created by \"ant update\")\n",
"0 2 1 * * find $DSPACE_DIR/*.bak-*-* -maxdepth 0 -type d -mtime +30 -exec rm -rf {} \\;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To troubleshoot cronjobs, enable cron logging: in the file **_/etc/rsyslog.d/50-default.conf_**, uncomment the line beginning with cron.\\* (remove the leading #), as shown:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Default rules for rsyslog.\n",
"#\n",
"# For more information see rsyslog.conf(5) and /etc/rsyslog.conf\n",
"\n",
"#\n",
"# First some standard log files. Log by facility.\n",
"#\n",
"auth,authpriv.* /var/log/auth.log\n",
"*.*;auth,authpriv.none -/var/log/syslog\n",
"cron.* /var/log/cron.log\n",
"#daemon.* -/var/log/daemon.log\n",
"kern.* -/var/log/kern.log\n",
"#lpr.* -/var/log/lpr.log\n",
"mail.* -/var/log/mail.log\n",
"#user.* "
]
},
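{
"cell_type": "markdown",
"metadata": {},
"source": [
"After editing the file, rsyslog has to be restarted (for example with `sudo service rsyslog restart`) before cron messages start appearing in `/var/log/cron.log`. That log records that a job was started, not what the job printed, so while debugging it can help to capture a job's output in a file instead of discarding it in `/dev/null`. The entry below is a sketch only: it assumes `DSPACE_DIR` is set as in the sample crontab, and the log file name `cron-debug.log` is a hypothetical name chosen for illustration.\n",
"\n",
"```\n",
"# Append both normal output and errors of the job to a log file\n",
"0 0 * * * $DSPACE_DIR/bin/dspace index-discovery >> $DSPACE_DIR/log/cron-debug.log 2>&1\n",
"```\n",
"\n",
"Once the job behaves as expected, the redirection can be switched back to `> /dev/null`."
]
},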
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Bash",
"language": "bash",
"name": "bash"
}
},
"nbformat": 4,
"nbformat_minor": 2
}