@kspurgin
kspurgin / MARC_upsides.txt
Created March 31, 2017 20:34
MARC upsides
"MARC is concise as a physical format (something that is less important today)"
^--- I know it used to be a lot MORE important, but it doesn't feel unimportant when I'm extracting 6.5 million records from one system and transferring them to another server! (And I'm aware there are more concise ways to express the data that we often see expressed in XML.)
-=-=-=-=-=-=-=-=-=-
Ease of transforming/exporting/analyzing
-=-=-=-=-=-=-=-=-=-
I work with big batches of MARC (in MARC-binary and MARC-XML) and other metadata formats (DC, EAD, DDI, MODS, Oracle Endeca format for indexed data) expressed in XML.
I whip up XSLT to do stuff when I have to, but Terry Reese gave us MarcEdit, which makes it easy to do fairly complex transformations across a set of MARC records, or just to get an overview of what fields are in a record set and with what frequency.
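For a sense of that kind of quick analysis (a minimal sketch using the ruby-marc gem rather than MarcEdit itself; the filename is a placeholder), counting how often each field tag occurs across a file of binary MARC records:

require 'marc'  # ruby-marc gem

# Tally how many times each MARC tag appears across the whole file
counts = Hash.new(0)
MARC::Reader.new('records.mrc').each do |record|
  record.fields.each { |field| counts[field.tag] += 1 }
end

counts.sort.each { |tag, n| puts "#{tag}\t#{n}" }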
**Default 'usable date' range is 500 to current year + 6**
MarcToArgot
MarcToArgot::Macros::Shared::PublicationYear
usable_date? - determines whether a given date value is usable for deriving a reasonable single value for sorting/filtering (a rough sketch follows the examples below)
1997 is usable - 4 digits, in usable range
688 is usable - fewer than 4 digits, but in usable range
9999 is usable - gets translated into current year + 1
499 is NOT usable - out of usable range
6754 is NOT usable - 4 digits, but out of usable range
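A rough Ruby sketch of that rule (my paraphrase of the behavior described above, not the actual MarcToArgot code; the names here are made up):

USABLE_RANGE_START = 500

def usable_date?(value)
  year = value.to_i
  year = Time.now.year + 1 if value == '9999'  # 9999 is treated as "ongoing"
  (USABLE_RANGE_START..(Time.now.year + 6)).cover?(year)
end

usable_date?('1997')  #=> true
usable_date?('688')   #=> true
usable_date?('9999')  #=> true (translated to current year + 1)
usable_date?('499')   #=> false
usable_date?('6754')  #=> false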
@kspurgin
kspurgin / OpenRefine_format_verified_WD_recon_results
Created July 25, 2019 20:08
Initial formatting of ead_creators_wikidata_recon_verify.xlsx for use in batch editing Wikidata
[
  {
    "op": "core/column-removal",
    "description": "Remove column date of birth",
    "columnName": "date of birth"
  },
  {
    "op": "core/column-removal",
    "description": "Remove column date of death",
    "columnName": "date of death"
@kspurgin
kspurgin / CCT_q_1.md
Last active October 21, 2019 17:11
CSpace converter tool question

Context -- I'm looking at OHC's skeletal data spreadsheet and wondering how it gets imported.

This question specifically refers to the Cataloging > Other Number field.

We have (this is fake data – IRL it looks like only 2 columns at a time are filled in):

@kspurgin
kspurgin / person_spec_doc.xml
Created November 21, 2019 16:53
person_spec output
<?xml version="1.0" encoding="UTF-8"?>
<document name="persons">
  <persons_common>
    <personTermGroupList>
      <personTermGroup>
        <termDisplayName>John Mellon; J. T. Mellon</termDisplayName>
        <termType>urn:cspace:core.collectionspace.org:vocabularies:name(persontermtype):item:name(descriptor)'descriptor'</termType>
        <termSourceID>123</termSourceID>
        <termSourceDetail>detail text</termSourceDetail>
        <surName>Mellon</surName>
@kspurgin
kspurgin / concept_spec_docs.txt
Created November 21, 2019 17:04
concept_spec output
rake spec SPEC=spec/collectionspace/converter/core/concept_spec.rb
/Users/spurgin/.rvm/rubies/ruby-2.5.3/bin/ruby -I/Users/spurgin/.rvm/gems/ruby-2.5.3/gems/rspec-core-3.9.0/lib:/Users/spurgin/.rvm/gems/ruby-2.5.3/gems/rspec-support-3.9.0/lib /Users/spurgin/.rvm/gems/ruby-2.5.3/gems/rspec-core-3.9.0/exe/rspec spec/collectionspace/converter/core/concept_spec.rb
CORE CONCEPT:
<?xml version="1.0" encoding="UTF-8"?>
<document name="concepts">
  <ns2:concepts_common xmlns:ns2="http://collectionspace.org/services/concept" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <conceptTermGroupList>
      <conceptTermGroup>
@kspurgin
kspurgin / boris.xml
Created March 4, 2020 20:06
Complicated Test MODS (Boris the Guinea remix)
<?xml version="1.0" encoding="UTF-8"?>
<mods xmlns="http://www.loc.gov/mods/v3" xmlns:drs="info://lyrasis/drs-admin/v1" xmlns:dc="http:://purl.org/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dwc="http://rs.tdwg.org/dwc/terms/" xmlns:edm="http://pro.europeana.eu/edm-documentation" xmlns:mods="http://www.loc.gov/mods/v3" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <titleInfo usage="primary">
    <title>Boris the Guineafowl</title>
    <subTitle>Portrait of a birb</subTitle>
  </titleInfo>
  <titleInfo type="abbreviated">
    <title>Boris</title>
  </titleInfo>
  <titleInfo>
@kspurgin
kspurgin / core_sd_fields.json
Created May 4, 2020 20:30
Core structured date field config
{
  "structuredDate": {
    "fields": {
      "dateDisplayDate": {
        "[config]": {
          "extensionName": "structuredDate"
        }
      },
      "dateAssociation": {
        "[config]": {
@kspurgin
kspurgin / accessionDateGroup_field_config.json
Created May 4, 2020 20:56
accessionDateGroup field config
{ "accessionDateGroup": {
"[config]": {
"dataType": "DATA_TYPE_STRUCTURED_DATE",
"messages": {
"name": {
"id": "field.acquisitions_common.accessionDateGroup.name",
"defaultMessage": "Accession date"
}
},
"searchView": {
@kspurgin
kspurgin / csv_column_report
Created September 30, 2020 21:14
csv_column_splitting_headache
I use a little awk oneliner derived from https://www.datafix.com.au/cookbook/structure1.html
to verify the structure of client-supplied CSVs (that I convert to TSVs) or TSVs. One client's
table of object data provided as TSV used CRLF row endings, AND included TAB, CRLF, CR, and LF
characters inside individual fields to format multiline notes.
The result of my check on this ONE FILE was as follows:
292 rows are broken into 82 columns
606 rows are broken into 1 columns
486 rows are broken into 0 columns
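The oneliner itself isn't pasted here, so here is an equivalent check sketched in Ruby rather than awk (the filename is a placeholder): split each line on tabs, count the fields, and tally how many rows have each count.

# Tally rows by the number of tab-separated fields they contain
counts = Hash.new(0)
File.foreach('client_objects.tsv') do |line|
  counts[line.chomp.split("\t", -1).length] += 1
end

counts.sort.each do |columns, rows|
  puts "#{rows} rows are broken into #{columns} columns"
end

Embedded LFs and CRLFs get treated as row endings by a line-by-line check like this, and embedded TABs inflate the field count, which is why the numbers above come out so scattered.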