Skip to content

Instantly share code, notes, and snippets.

@kmcelwee
Last active March 12, 2021 18:19
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save kmcelwee/ad4510fb15d3a349e6f193903fbb0562 to your computer and use it in GitHub Desktop.
Save kmcelwee/ad4510fb15d3a349e6f193903fbb0562 to your computer and use it in GitHub Desktop.

PostgreSQL & other queries in Dataspace

To enter the Postgres command line, you need to be the dspace user (sudo su - dspace). The command is psql. Here is a link to the database diagram for DSpace 5.

It sometimes may be quicker to use the REST API than creating a complicated query. And the JRuby DSpace wrapper (documentation) may be simpler as well.

Useful commands:

  • \dt: describe all tables
  • \d {TABLE}: describe the given table
  • \copy ({query}) to '{filename}' as CSV HEADER: saves the query to a CSV with a header
    • The query needs to be one line. It may be easier to save from less (just press s in less and it'll prompt you to create a log file). This won't be a CSV, but you can use '|' as a delimiter, and it'll get you close.

Get large bitstreams

Get bitstreams over 1GB.

SELECT ib.item_id, bb.bitstream_id, size_bytes
FROM item2bundle ib
INNER JOIN bundle2bitstream bb ON ib.bundle_id = bb.bundle_id
INNER JOIN bitstream b ON bb.bitstream_id = b.bitstream_id
WHERE size_bytes > 1000000000 AND deleted = 'f'

Connect large bitstreams with their download stats

dspace-development#147

Download stats and view stats are collected in Solr. Atmire created command-line tools to help collect this information.

Given some resource_id:

# total downloads
solr-count statistics "type:0 AND owningItem:$id"

# downloads by ip
solr-facet statistics "type:0 AND owningItem:$id" ip

# total page views
solr-count statistics "type:2 AND id:$id"

# page views by ip
solr-facet statistics "type:2 AND id:$id" ip

When we wanted to access this data across many IDs this is how it was implemented:

# Get all items with large bitstreams (from above)
psql -c "COPY(
SELECT DISTINCT h.handle, ib.item_id
FROM handle h
INNER JOIN item2bundle ib ON h.resource_id = ib.item_id
INNER JOIN bundle2bitstream bb ON ib.bundle_id = bb.bundle_id
INNER JOIN bitstream b ON bb.bitstream_id = b.bitstream_id
WHERE size_bytes > 1000000000 AND deleted = 'f')
to STDOUT with CSV;" > item_ids.csv

# select the item_id and extract the solr stats for that item and
#   place it into a new csv
awk -F "\"*,\"*" '{print $2}' item_ids.csv | 
    while read id; 
         do solr-count statistics "type:0 AND owningItem:$id"; 
    done > item_stats.csv

# combine the two one-column CSVs into a new CSV
paste -d, item_ids.csv item_stats.csv > item_ids_and_stats.csv

Get empty filenames

dspace-development#284, dspace-development#289

Collect the collection, community, and handle data for all unnamed bitstreams. This is stored in DSpace as the dc.title metadata on Bitstreams, which is stored in the metadata_field_id as 64 though this could certainly change. This does not check if dc.title is NULL or an empty string.

SELECT (
    SELECT text_value as community_name
    FROM metadatavalue
    WHERE resource_type_id = 4 AND resource_id = community_id AND metadata_field_id = 64
),(
    SELECT text_value as collection_name
    FROM metadatavalue 
    WHERE resource_type_id = 3 AND resource_id = community2collection.collection_id AND metadata_field_id = 64
),handle,bitstream_id
FROM handle 
INNER JOIN item ON (item.item_id = handle.resource_id)
INNER JOIN item2bundle ON (item.item_id = item2bundle.item_id)
INNER JOIN bundle2bitstream ON (item2bundle.bundle_id = bundle2bitstream.bundle_id)
INNER JOIN collection2item on collection2item.item_id = item.item_id
INNER JOIN community2collection on community2collection.collection_id = collection2item.collection_id
WHERE bitstream_id IN (
    SELECT resource_id
    FROM metadatavalue WHERE resource_type_id = 0 
    GROUP BY resource_id 
    HAVING NOT (64 = ANY(ARRAY_AGG(metadata_field_id)))
);

Get all metadata field registries

\copy (select * from metadatafieldregistry where metadata_schema_id=1) to 'dc-metadata.csv' with csv header;
\copy (select * from metadatafieldregistry where metadata_schema_id=2) to 'pu-metadata.csv' with csv header;
\copy (select * from metadatafieldregistry where metadata_schema_id=3) to 'dcterms-metadata.csv' with csv header;
\copy (select * from metadatafieldregistry where metadata_schema_id=4) to 'eperson-metadata.csv' with csv header;
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment