To enter the Postgres command line, you need to be the dspace
user (sudo su - dspace
). The command is psql
. Here is a link to the database diagram for DSpace 5.
It sometimes may be quicker to use the REST API than creating a complicated query. And the JRuby DSpace wrapper (documentation) may be simpler as well.
Useful commands:
\dt
: describe all tables\d {TABLE}
: describe the given table\copy ({query}) to '{filename}' as CSV HEADER
: saves the query to a CSV with a header- The query needs to be one line. It may be easier to save from
less
(just presss
inless
and it'll prompt you to create a log file). This won't be a CSV, but you can use '|' as a delimiter, and it'll get you close.
- The query needs to be one line. It may be easier to save from
Get bitstreams over 1GB.
SELECT ib.item_id, bb.bitstream_id, size_bytes
FROM item2bundle ib
INNER JOIN bundle2bitstream bb ON ib.bundle_id = bb.bundle_id
INNER JOIN bitstream b ON bb.bitstream_id = b.bitstream_id
WHERE size_bytes > 1000000000 AND deleted = 'f'
Download stats and view stats are collected in Solr. Atmire created command-line tools to help collect this information.
Given some resource_id
:
# total downloads
solr-count statistics "type:0 AND owningItem:$id"
# downloads by ip
solr-facet statistics "type:0 AND owningItem:$id" ip
# total page views
solr-count statistics "type:2 AND id:$id"
# page views by ip
solr-facet statistics "type:2 AND id:$id" ip
When we wanted to access this data across many IDs this is how it was implemented:
# Get all items with large bitstreams (from above)
psql -c "COPY(
SELECT DISTINCT h.handle, ib.item_id
FROM handle h
INNER JOIN item2bundle ib ON h.resource_id = ib.item_id
INNER JOIN bundle2bitstream bb ON ib.bundle_id = bb.bundle_id
INNER JOIN bitstream b ON bb.bitstream_id = b.bitstream_id
WHERE size_bytes > 1000000000 AND deleted = 'f')
to STDOUT with CSV;" > item_ids.csv
# select the item_id and extract the solr stats for that item and
# place it into a new csv
awk -F "\"*,\"*" '{print $2}' item_ids.csv |
while read id;
do solr-count statistics "type:0 AND owningItem:$id";
done > item_stats.csv
# combine the two one-column CSVs into a new CSV
paste -d, item_ids.csv item_stats.csv > item_ids_and_stats.csv
dspace-development#284, dspace-development#289
Collect the collection, community, and handle data for all unnamed bitstreams. This is stored in DSpace as the dc.title
metadata on Bitstreams, which is stored in the metadata_field_id
as 64
though this could certainly change. This does not check if dc.title
is NULL
or an empty string.
SELECT (
SELECT text_value as community_name
FROM metadatavalue
WHERE resource_type_id = 4 AND resource_id = community_id AND metadata_field_id = 64
),(
SELECT text_value as collection_name
FROM metadatavalue
WHERE resource_type_id = 3 AND resource_id = community2collection.collection_id AND metadata_field_id = 64
),handle,bitstream_id
FROM handle
INNER JOIN item ON (item.item_id = handle.resource_id)
INNER JOIN item2bundle ON (item.item_id = item2bundle.item_id)
INNER JOIN bundle2bitstream ON (item2bundle.bundle_id = bundle2bitstream.bundle_id)
INNER JOIN collection2item on collection2item.item_id = item.item_id
INNER JOIN community2collection on community2collection.collection_id = collection2item.collection_id
WHERE bitstream_id IN (
SELECT resource_id
FROM metadatavalue WHERE resource_type_id = 0
GROUP BY resource_id
HAVING NOT (64 = ANY(ARRAY_AGG(metadata_field_id)))
);
\copy (select * from metadatafieldregistry where metadata_schema_id=1) to 'dc-metadata.csv' with csv header;
\copy (select * from metadatafieldregistry where metadata_schema_id=2) to 'pu-metadata.csv' with csv header;
\copy (select * from metadatafieldregistry where metadata_schema_id=3) to 'dcterms-metadata.csv' with csv header;
\copy (select * from metadatafieldregistry where metadata_schema_id=4) to 'eperson-metadata.csv' with csv header;