@hectorcorrea
Last active January 29, 2021 14:33
Basic example of how to process a MARC XML file with Traject. Note that processing MARC XML files without namespaces (like those produced by Alma) requires a trick: a custom reader that bypasses the namespace check.
# A very simple config file for Traject to parse a MARC XML file.
#
# Usage:
# traject -t xml -c traject_config_marc_xml.rb -w Traject::DebugWriter xml_tiny.xml
to_field 'title', extract_marc('245a', first: true)
# A very simple config file for Traject to parse a MARC XML file
# produced by Alma (i.e. one without namespaces)
#
# Usage:
# traject -t xml -c traject_config_xml_alma.rb -w Traject::DebugWriter xml_tiny_no_ns.xml
#
# AlmaReader stolen from https://github.com/pulibrary/marc_liberation/blob/alma/marc_to_solr/lib/alma_reader.rb
#
# Alma MARC-XML files have no namespace defined, so we have to bypass that
# requirement in Nokogiri.
class ::AlmaReader < Traject::MarcReader
  def internal_reader
    @modified_internal_reader ||=
      begin
        result = super
        result.singleton_class.alias_method :old_start_element_namespace, :start_element_namespace
        # Redefine start_element_namespace to set the @ns to just be whatever
        # the URI is. For MARC records it's always the same anyways.
        result.singleton_class.define_method(:start_element_namespace) do |name, attributes = [], prefix = nil, uri = nil, ns = {}|
          @ns = uri
          old_start_element_namespace(name, attributes, prefix, uri, ns)
        end
        result
      end
  end
end
# ----------------------------------------------------------
settings do
  provide "reader_class_name", "AlmaReader"
end
to_field 'title', extract_marc('245a', first: true)
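The override in AlmaReader relies on a plain Ruby pattern: alias the original method on one object's singleton class, then define a replacement that adjusts state before delegating to the alias. The sketch below shows that pattern in isolation. `FakeParser` and `relax_namespace_check` are hypothetical names made up for illustration; they are not part of Traject or ruby-marc, and the sketch assumes (as the comments in the config suggest) that the real parser ignores elements whose namespace URI differs from its stored `@ns`.

```ruby
# Hypothetical stand-in for the parser object that `super` returns in
# AlmaReader#internal_reader. Not part of Traject or ruby-marc.
class FakeParser
  attr_reader :ns, :seen

  def initialize
    @ns = "http://www.loc.gov/MARC21/slim"
    @seen = []
  end

  # Assumption: the real parser only handles elements whose URI equals @ns.
  def start_element_namespace(name, attributes = [], prefix = nil, uri = nil, ns = {})
    @seen << name if uri == @ns
  end
end

# Hypothetical helper that applies the same override as AlmaReader.
def relax_namespace_check(parser)
  # Keep the original method reachable under a new name, on this object only.
  parser.singleton_class.alias_method :old_start_element_namespace, :start_element_namespace
  # Replace it with a version that first sets @ns to the incoming URI,
  # so the comparison inside the original method always succeeds.
  parser.singleton_class.define_method(:start_element_namespace) do |name, attributes = [], prefix = nil, uri = nil, ns = {}|
    @ns = uri
    old_start_element_namespace(name, attributes, prefix, uri, ns)
  end
  parser
end

# An element with no namespace URI, as in an Alma export.
strict = FakeParser.new
strict.start_element_namespace("record", [], nil, nil)

relaxed = relax_namespace_check(FakeParser.new)
relaxed.start_element_namespace("record", [], nil, nil)

puts "strict saw:  #{strict.seen.inspect}"
puts "relaxed saw: #{relaxed.seen.inspect}"
```

Because the change is made on the singleton class, only the one patched object behaves differently; other instances of the same class keep the strict check.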
<?xml version="1.0" encoding="UTF-8"?>
<collection xmlns="http://www.loc.gov/MARC21/slim" xmlns:marc="http://www.loc.gov/MARC21/slim">
<record>
<leader>01352cam a2200349 a 4500</leader>
<datafield tag="245" ind1="0" ind2="0">
<subfield code="6">880-01</subfield>
<subfield code="a">Kazoku kankei no shakai shinrigaku /</subfield>
<subfield code="c">Osada Masayoshi hen.</subfield>
</datafield>
</record>
<record>
<leader>01121ccm a2200289z 4500</leader>
<datafield tag="245" ind1="1" ind2="0">
<subfield code="a">Powhatan&#39;s daughter :</subfield>
<subfield code="b">march</subfield>
</datafield>
<datafield tag="100" ind1="1" ind2=" ">
<subfield code="a">Sousa, John Philip,</subfield>
<subfield code="d">1854-1932,</subfield>
<subfield code="e">composer.</subfield>
</datafield>
</record>
<record>
<leader>01137cam a2200301 a 4500</leader>
<datafield tag="245" ind1="1" ind2="0">
<subfield code="a">Two pieces /</subfield>
<subfield code="c">by Frank O&#39;Hara.</subfield>
</datafield>
<datafield tag="100" ind1="1" ind2=" ">
<subfield code="a">O&#39;Hara, Frank,</subfield>
<subfield code="d">1926-1966.</subfield>
<subfield code="0">http://id.loc.gov/authorities/names/n79042130</subfield>
</datafield>
</record>
</collection>
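To make the `to_field 'title', extract_marc('245a', first: true)` rule concrete, here is a rough stand-in that pulls the first 245 subfield a from each record of a two-record excerpt of the file above. It uses REXML from Ruby's bundled gems rather than Traject, so it is only an approximation of what the rule selects, not how Traject implements it. `first_245a` is a hypothetical helper name.

```ruby
require "rexml/document"

# Excerpt of the sample file above (first and third records, 245 fields only).
xml = <<~XML
  <collection xmlns="http://www.loc.gov/MARC21/slim">
    <record>
      <datafield tag="245" ind1="0" ind2="0">
        <subfield code="6">880-01</subfield>
        <subfield code="a">Kazoku kankei no shakai shinrigaku /</subfield>
        <subfield code="c">Osada Masayoshi hen.</subfield>
      </datafield>
    </record>
    <record>
      <datafield tag="245" ind1="1" ind2="0">
        <subfield code="a">Two pieces /</subfield>
        <subfield code="c">by Frank O&#39;Hara.</subfield>
      </datafield>
    </record>
  </collection>
XML

# Child elements only, skipping whitespace text nodes.
def child_elements(node)
  node.children.grep(REXML::Element)
end

# Hypothetical helper: text of the first subfield "a" in the first 245 field
# that has one, or nil when the record has none.
def first_245a(record)
  child_elements(record).each do |field|
    next unless field.name == "datafield" && field.attributes["tag"] == "245"
    sub = child_elements(field).find { |s| s.name == "subfield" && s.attributes["code"] == "a" }
    return sub.text if sub
  end
  nil
end

doc = REXML::Document.new(xml)
records = child_elements(doc.root).select { |e| e.name == "record" }
titles = records.map { |record| first_245a(record) }

titles.each { |title| puts title }
```

The trailing " /" is kept because it is part of the subfield value; in Traject, passing `trim_punctuation: true` to `extract_marc` is the option intended for stripping it.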
<?xml version="1.0" encoding="UTF-8"?>
<collection>
<record>
<leader>01352cam a2200349 a 4500</leader>
<datafield tag="245" ind1="0" ind2="0">
<subfield code="6">880-01</subfield>
<subfield code="a">Kazoku kankei no shakai shinrigaku /</subfield>
<subfield code="c">Osada Masayoshi hen.</subfield>
</datafield>
</record>
<record>
<leader>01121ccm a2200289z 4500</leader>
<datafield tag="245" ind1="1" ind2="0">
<subfield code="a">Powhatan&#39;s daughter :</subfield>
<subfield code="b">march</subfield>
</datafield>
<datafield tag="100" ind1="1" ind2=" ">
<subfield code="a">Sousa, John Philip,</subfield>
<subfield code="d">1854-1932,</subfield>
<subfield code="e">composer.</subfield>
</datafield>
</record>
<record>
<leader>01137cam a2200301 a 4500</leader>
<datafield tag="245" ind1="1" ind2="0">
<subfield code="a">Two pieces /</subfield>
<subfield code="c">by Frank O&#39;Hara.</subfield>
</datafield>
<datafield tag="100" ind1="1" ind2=" ">
<subfield code="a">O&#39;Hara, Frank,</subfield>
<subfield code="d">1926-1966.</subfield>
<subfield code="0">http://id.loc.gov/authorities/names/n79042130</subfield>
</datafield>
</record>
</collection>
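The only difference between the two sample files is the `xmlns` declaration on `<collection>`, and that is what a namespace-aware parser sees on every element. The sketch below compares the namespace reported for a `<record>` element in each variant. It uses REXML rather than the Nokogiri-based reader Traject uses, so it illustrates the concept and is not a reproduction of Traject's behavior. `record_namespace` is a hypothetical helper name.

```ruby
require "rexml/document"

# Minimal versions of the two sample files: with and without the MARC namespace.
with_ns    = REXML::Document.new('<collection xmlns="http://www.loc.gov/MARC21/slim"><record/></collection>')
without_ns = REXML::Document.new("<collection><record/></collection>")

# Hypothetical helper: namespace URI that applies to the first <record> element.
def record_namespace(doc)
  record = doc.root.children.grep(REXML::Element).first
  record.namespace
end

puts "with namespace:    #{record_namespace(with_ns).inspect}"
puts "without namespace: #{record_namespace(without_ns).inspect}"
```

In the first document the child inherits the default namespace from its parent; in the second there is nothing to inherit, which is why a reader that expects the MARC21 slim URI skips those records unless the check is bypassed.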