@kspurgin
kspurgin / MARC_upsides.txt
Created March 31, 2017 20:34
MARC upsides
"MARC is concise as a physical format (something that is less important today)"
^--- I know it used to be a lot MORE important, but it doesn't feel unimportant when I'm extracting 6.5 million records from one system and transferring them to another server! (And I'm aware there are more concise ways to express the data that we often see expressed in XML.)
-=-=-=-=-=-=-=-=-=-
Ease of transforming/exporting/analyzing
-=-=-=-=-=-=-=-=-=-
I work with big batches of MARC (in MARC-binary and MARC-XML) and other metadata formats (DC, EAD, DDI, MODS, Oracle Endeca format for indexed data) expressed in XML.
I whip up XSLT to do stuff when I have to, but Terry Reese gave us MarcEdit, which makes it easy to do fairly complex transformations across a set of MARC records, or just to get an overview of what fields are in a record set and with what frequency.
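For a sense of that kind of quick analysis (a minimal sketch using the ruby-marc gem rather than MarcEdit itself; the filename is a placeholder), counting how often each field tag occurs across a file of binary MARC records:

require 'marc'  # ruby-marc gem

# Tally how many times each MARC tag appears across the whole file
counts = Hash.new(0)
MARC::Reader.new('records.mrc').each do |record|
  record.fields.each { |field| counts[field.tag] += 1 }
end

counts.sort.each { |tag, n| puts "#{tag}\t#{n}" }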
**Default 'usable date' range is 500 to current year + 6**
MarcToArgot
MarcToArgot::Macros::Shared::PublicationYear
usable_date? - determines whether a given date value is usable for deriving a reasonable single value for sorting/filtering (a rough sketch follows the examples below)
1997 is usable - 4 digits, in usable range
688 is usable - fewer than 4 digits, but in usable range
9999 is usable - gets translated into current year + 1
499 is NOT usable - out of usable range
6754 is NOT usable - 4 digits, but out of usable range
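A rough Ruby sketch of that rule (my paraphrase of the behavior described above, not the actual MarcToArgot code; the names here are made up):

USABLE_RANGE_START = 500

def usable_date?(value)
  year = value.to_i
  year = Time.now.year + 1 if value == '9999'  # 9999 is treated as "ongoing"
  (USABLE_RANGE_START..(Time.now.year + 6)).cover?(year)
end

usable_date?('1997')  #=> true
usable_date?('688')   #=> true
usable_date?('9999')  #=> true (translated to current year + 1)
usable_date?('499')   #=> false
usable_date?('6754')  #=> false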
@kspurgin
kspurgin / OpenRefine_format_verified_WD_recon_results
Created July 25, 2019 20:08
Initial formatting of ead_creators_wikidata_recon_verify.xlsx for use in batch editing Wikidata
[
  {
    "op": "core/column-removal",
    "description": "Remove column date of birth",
    "columnName": "date of birth"
  },
  {
    "op": "core/column-removal",
    "description": "Remove column date of death",
    "columnName": "date of death"
@kspurgin
kspurgin / CCT_q_1.md
Last active October 21, 2019 17:11
CSpace converter tool question

Context -- I'm looking at OHC's skeletal data spreadsheet and wondering how it gets imported.

This question specifically refers to the Cataloging > Other Number field.

We have (this is fake data – IRL it looks like only 2 columns at a time are filled in):

@kspurgin
kspurgin / person_spec_doc.xml
Created November 21, 2019 16:53
person_spec output
<?xml version="1.0" encoding="UTF-8"?>
<document name="persons">
  <persons_common>
    <personTermGroupList>
      <personTermGroup>
        <termDisplayName>John Mellon; J. T. Mellon</termDisplayName>
        <termType>urn:cspace:core.collectionspace.org:vocabularies:name(persontermtype):item:name(descriptor)'descriptor'</termType>
        <termSourceID>123</termSourceID>
        <termSourceDetail>detail text</termSourceDetail>
        <surName>Mellon</surName>
@kspurgin
kspurgin / concept_spec_docs.txt
Created November 21, 2019 17:04
concept_spec output
rake spec SPEC=spec/collectionspace/converter/core/concept_spec.rb
/Users/spurgin/.rvm/rubies/ruby-2.5.3/bin/ruby -I/Users/spurgin/.rvm/gems/ruby-2.5.3/gems/rspec-core-3.9.0/lib:/Users/spurgin/.rvm/gems/ruby-2.5.3/gems/rspec-support-3.9.0/lib /Users/spurgin/.rvm/gems/ruby-2.5.3/gems/rspec-core-3.9.0/exe/rspec spec/collectionspace/converter/core/concept_spec.rb
CORE CONCEPT:
<?xml version="1.0" encoding="UTF-8"?>
<document name="concepts">
  <ns2:concepts_common xmlns:ns2="http://collectionspace.org/services/concept" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <conceptTermGroupList>
      <conceptTermGroup>
@kspurgin
kspurgin / boris.xml
Created March 4, 2020 20:06
Complicated Test MODS (Boris the Guinea remix)
<?xml version="1.0" encoding="UTF-8"?>
<mods xmlns="http://www.loc.gov/mods/v3" xmlns:drs="info://lyrasis/drs-admin/v1" xmlns:dc="http:://purl.org/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dwc="http://rs.tdwg.org/dwc/terms/" xmlns:edm="http://pro.europeana.eu/edm-documentation" xmlns:mods="http://www.loc.gov/mods/v3" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <titleInfo usage="primary">
    <title>Boris the Guineafowl</title>
    <subTitle>Portrait of a birb</subTitle>
  </titleInfo>
  <titleInfo type="abbreviated">
    <title>Boris</title>
  </titleInfo>
  <titleInfo>
@kspurgin
kspurgin / core_sd_fields.json
Created May 4, 2020 20:30
Core structured date field config
{
  "structuredDate": {
    "fields": {
      "dateDisplayDate": {
        "[config]": {
          "extensionName": "structuredDate"
        }
      },
      "dateAssociation": {
        "[config]": {
@kspurgin
kspurgin / accessionDateGroup_field_config.json
Created May 4, 2020 20:56
accessionDateGroup field config
{ "accessionDateGroup": {
"[config]": {
"dataType": "DATA_TYPE_STRUCTURED_DATE",
"messages": {
"name": {
"id": "field.acquisitions_common.accessionDateGroup.name",
"defaultMessage": "Accession date"
}
},
"searchView": {
@kspurgin
kspurgin / csv_column_report
Created September 30, 2020 21:14
csv_column_splitting_headache
I use a little awk oneliner derived from https://www.datafix.com.au/cookbook/structure1.html
to verify the structure of client-supplied CSVs (that I convert to TSVs) or TSVs. One client's
table of object data provided as TSV used CRLF row endings, AND included TAB, CRLF, CR, and LF
characters inside individual fields to format multiline notes.
The result of my check on this ONE FILE was as follows:
292 rows are broken into 82 columns
606 rows are broken into 1 columns
486 rows are broken into 0 columns
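The oneliner itself isn't pasted here, so here is an equivalent check sketched in Ruby rather than awk (the filename is a placeholder): split each line on tabs, count the fields, and tally how many rows have each count.

# Tally rows by the number of tab-separated fields they contain
counts = Hash.new(0)
File.foreach('client_objects.tsv') do |line|
  counts[line.chomp.split("\t", -1).length] += 1
end

counts.sort.each do |columns, rows|
  puts "#{rows} rows are broken into #{columns} columns"
end

Embedded LFs and CRLFs get treated as row endings by a line-by-line check like this, and embedded TABs inflate the field count, which is why the numbers above come out so scattered.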