The best way to get data out of DSpace is to use a protocol called OAI-PMH.
It's a rather old protocol and slightly awkward to work with, but MIT Libraries maintains a command-line tool, which we use internally to harvest records from various sources, that wraps most of the odd bits so you can focus on just asking for the data you want. The tool is publicly available so you can use it too.
https://github.com/MITLibraries/oai-pmh-harvester
You can install a specific version of Python with a version manager such as asdf (or any other solution you prefer), or you can build and use the Docker container we provide. If you aren't familiar with Docker, it may be easier to work with Python natively. We generally use native Python for development and use the Docker version when we deploy to AWS for automated daily harvests.
Once `make install`, `make test`, and `pipenv run oai --help` all succeed without errors, you are ready to harvest from OAI-PMH sources (including DSpace@MIT).
We won't go into all the details of OAI-PMH here, but one important feature to understand is that it was designed to support regular update harvests from a source. For example, if you were interested in a set of millions of records and wanted to be sure you had the latest version every day, re-harvesting the full set would mean pulling in a lot of unchanged records.
OAI-PMH instead lets you ask for changes since the last time you harvested by providing it with a start date. So once you've harvested those million records once... you should never need to grab the full set again, and instead can just remember when the last time you harvested was and ask for any records that have changed since then. So while the protocol is a bit awkward in some ways, it allows for very efficient harvesting of records over time.
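Since the tool's `-f` flag takes that start date, one pattern is to record the date of each successful harvest and feed it back in on the next run. Below is a minimal sketch of that idea; the state-file name and the way the argument list is assembled are our own invention for illustration, not a feature of the harvester tool itself.

```python
import datetime
import pathlib

# Hypothetical state file -- not part of the oai-pmh-harvester tool.
STATE_FILE = pathlib.Path("last_harvest.txt")

def build_harvest_args(endpoint: str, out_file: str, set_spec: str) -> list[str]:
    """Assemble arguments for `pipenv run oai`, adding a -f date when
    a previous harvest timestamp has been recorded."""
    args = ["pipenv", "run", "oai", "-h", endpoint, "-o", out_file,
            "harvest", "-m", "xoai", "-s", set_spec]
    if STATE_FILE.exists():
        # Only ask for records changed since the last successful run.
        args += ["-f", STATE_FILE.read_text().strip()]
    return args

def record_harvest_date() -> None:
    """Call after a successful harvest so the next run is incremental."""
    STATE_FILE.write_text(datetime.date.today().isoformat())
```

The first run harvests everything; every run after that only asks for records changed since the recorded date.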
As an example, let's harvest all thesis records that were added or modified in June 2022. The thesis collection set is `com_1721.1_7582`. If you are interested in a subset of theses, such as those from a specific department, the easiest way to identify the set is to browse to it in the DSpace@MIT user interface and then convert the URL into a set spec. For example, https://dspace.mit.edu/handle/1721.1/7582 is the main thesis collection in the user interface, and the corresponding OAI-PMH set is `com_1721.1_7582`. You can also see a full list of sets via OAI-PMH itself at: https://dspace.mit.edu//oai/request?verb=ListSets
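That URL-to-set conversion is mechanical, so it can be scripted. The helper below is a hypothetical sketch: it assumes the DSpace convention of replacing `/` in the handle with `_` and prefixing `com_` for a community set (collections, as we understand it, use `col_` instead).

```python
def handle_to_set_spec(handle_url: str, kind: str = "com") -> str:
    """Convert a DSpace handle URL into an OAI-PMH setSpec.

    Assumes DSpace names its OAI sets by replacing '/' in the handle
    with '_' and prefixing 'com_' (community) or 'col_' (collection).
    """
    # e.g. "https://dspace.mit.edu/handle/1721.1/7582" -> "1721.1/7582"
    handle = handle_url.rstrip("/").split("/handle/")[-1]
    return f"{kind}_{handle.replace('/', '_')}"
```

If in doubt, cross-check the result against the ListSets response linked above.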
pipenv run oai -h https://dspace.mit.edu/oai/request -o dspace_theses_2022_06.xml harvest -f 2022-06-01 -u 2022-06-30 -m xoai -s com_1721.1_7582
- `-h` is the public OAI endpoint of the DSpace@MIT server
- `-o` is a local filename in which to store the extracted records
- `-f` is the `from` (start) date of the harvest
- `-u` is the `until` (end) date of the harvest
- `-m` is the metadata format; `xoai` is a good choice for getting the links to binary files, but other options are available (see: https://dspace.mit.edu//oai/request?verb=ListMetadataFormats)
- `-s` is the set, in this case the entire thesis collection in DSpace@MIT
Note: if a record from that harvest is modified after you have harvested it, it will show up in later harvests as an updated record with its new datestamp. This is important to understand: the command above will not return all of the theses published in June 2022, even though it feels like it should. Instead, think of it as asking for all theses that were added or updated in June 2022 and not updated since. For some use cases this is frustrating, but if your goal is to always have the most recent records it is essential, and that is the core use case for OAI-PMH.
Using our harvester tool, adjusting the start and end dates, and supplying the set you are interested in, you should be able to extract the metadata for the records you want. However, it is often desirable to also have the binary files described by that metadata. There is no bulk access for those files, but you can extract each file's direct HTTP link from the metadata and download it.
So the general approach could be:
- identify the set of interest
- harvest metadata for that set
- loop over the resultant metadata and extract the location of the pdf (using xpath queries or other techniques)
- request the binary files (slowly! We highly recommend adding a sleep between requests, as our DSpace@MIT server is hosted by a third party that may block your IP if you request too many files at once. If that happens, let us know and we'll do our best to work with the vendor)
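To make the last two steps concrete, here is a hedged Python sketch. The exact element structure of xoai records can vary, so rather than hard-coding a bundle path it simply collects every `<field name="url">` value; the sample namespace, filenames, and the five-second delay are assumptions you should adjust for your own use.

```python
import time
import urllib.request
import xml.etree.ElementTree as ET

def extract_file_urls(xoai_xml: str) -> list[str]:
    """Pull candidate file links out of an xoai metadata document.

    The xoai format nests <element>/<field> pairs; rather than hard-code
    the full bundle path, this grabs every <field name="url"> value.
    Tighten the filter if you only want PDFs from the ORIGINAL bundle.
    """
    root = ET.fromstring(xoai_xml)
    return [
        node.text
        for node in root.iter()
        if node.tag.endswith("field") and node.get("name") == "url" and node.text
    ]

def download_files(urls: list[str], delay_seconds: float = 5.0) -> None:
    """Fetch each file with a pause between requests -- be kind to the server."""
    for url in urls:
        filename = url.rsplit("/", 1)[-1] or "download"
        urllib.request.urlretrieve(url, filename)
        time.sleep(delay_seconds)  # avoid tripping the host's rate limiting
```

Run `extract_file_urls` over each record in the harvested file, then pass the collected links to `download_files`.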
```mermaid
graph LR
    id[(identify collection)] --> harvest(harvest metadata) --> f{filter} --> retrieve(retrieve files)
    subgraph loop very slowly
        f
        retrieve
    end
```