Last active January 29, 2021 14:33
Basic example of how to process a MARC XML file with Traject. Note that MARC XML files without namespaces (like those produced by Alma) need a workaround: a custom reader that bypasses the namespace check.
# A very simple config file for Traject to parse a MARC XML file.
#
# Usage:
#   traject -t xml -c traject_config_marc_xml.rb -w Traject::DebugWriter xml_tiny.xml

to_field 'title', extract_marc('245a', first: true)
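To make the `'245a'` spec and the `first: true` option more concrete, here is a minimal plain-Ruby sketch of the selection they describe: take fields with tag 245, keep subfield `a`, and return only the first match. This is not Traject's implementation. `DataField`, `parse_spec`, and `extract_values` are hypothetical names invented for illustration, and the model ignores indicators, linked 880 fields, and the other options the real macro supports.

```ruby
# Hypothetical in-memory model of one MARC data field: a tag plus an
# ordered list of [code, value] subfield pairs.
DataField = Struct.new(:tag, :subfields)

# Split a spec such as "245a" into its tag ("245") and subfield codes (["a"]).
def parse_spec(spec)
  [spec[0, 3], spec[3..-1].chars]
end

# Return one string per matching field, joining the selected subfields with
# a space. With first: true, keep only the first match.
def extract_values(fields, spec, first: false)
  tag, codes = parse_spec(spec)
  values = fields.select { |field| field.tag == tag }.map do |field|
    wanted = field.subfields.select { |code, _value| codes.empty? || codes.include?(code) }
    wanted.map { |_code, value| value }.join(" ")
  end
  values = values.reject(&:empty?)
  first ? values.first(1) : values
end

# Data fields of the second record in the sample file.
fields = [
  DataField.new("245", [["a", "Powhatan's daughter :"], ["b", "march"]]),
  DataField.new("100", [["a", "Sousa, John Philip,"], ["d", "1854-1932,"], ["e", "composer."]])
]

p extract_values(fields, "245a", first: true)
p extract_values(fields, "245ab")
```

In this simplified model, trailing punctuation such as the ` :` in the title is kept as-is, since nothing in the sketch trims it.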
# A very simple config file for Traject to parse a MARC XML file
# produced by Alma (i.e. one without namespaces)
#
# Usage:
#   traject -t xml -c traject_config_xml_alma.rb -w Traject::DebugWriter xml_tiny_no_ns.xml
#
# AlmaReader stolen from https://github.com/pulibrary/marc_liberation/blob/alma/marc_to_solr/lib/alma_reader.rb
#
# Alma MARC-XML files have no namespace defined, so we have to bypass that
# requirement in Nokogiri.
class ::AlmaReader < Traject::MarcReader
  def internal_reader
    @modified_internal_reader ||=
      begin
        result = super
        # Keep the original method reachable under a new name.
        result.singleton_class.alias_method :old_start_element_namespace, :start_element_namespace
        # Redefine start_element_namespace to set the @ns to just be whatever
        # the URI is. For MARC records it's always the same anyway.
        result.singleton_class.define_method(:start_element_namespace) do |name, attributes = [], prefix = nil, uri = nil, ns = {}|
          @ns = uri
          old_start_element_namespace(name, attributes, prefix, uri, ns)
        end
        result
      end
  end
end

# ----------------------------------------------------------
settings do
  provide "reader_class_name", "AlmaReader"
end

to_field 'title', extract_marc('245a', first: true)
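The `AlmaReader` trick relies on a general Ruby pattern: alias a method on one object's singleton class, then redefine it so it adjusts state before delegating to the original. The sketch below shows that pattern in isolation with no Traject or Nokogiri dependency. `StrictHandler` and `relax_namespace_check` are hypothetical stand-ins, and the sketch assumes (as the comments in the config suggest) that the underlying handler compares each element's namespace URI against an expected value stored in `@ns`.

```ruby
# Hypothetical stand-in for a SAX-style handler that only accepts elements
# whose namespace URI matches the expected MARC namespace.
class StrictHandler
  MARC_NS = "http://www.loc.gov/MARC21/slim"
  attr_reader :seen

  def initialize
    @ns = MARC_NS
    @seen = []
  end

  def start_element_namespace(name, attributes = [], prefix = nil, uri = nil, ns = {})
    @seen << name if uri == @ns
  end
end

# Patch a single instance: before delegating to the original method, set @ns
# to whatever URI the element carries so the comparison always succeeds.
def relax_namespace_check(handler)
  handler.singleton_class.alias_method :old_start_element_namespace, :start_element_namespace
  handler.singleton_class.define_method(:start_element_namespace) do |name, attributes = [], prefix = nil, uri = nil, ns = {}|
    @ns = uri
    old_start_element_namespace(name, attributes, prefix, uri, ns)
  end
  handler
end

strict = StrictHandler.new
strict.start_element_namespace("record")   # uri is nil, so the element is ignored

relaxed = relax_namespace_check(StrictHandler.new)
relaxed.start_element_namespace("record")  # @ns becomes nil first, so it is accepted

p strict.seen
p relaxed.seen
```

Because the change is made on the singleton class, only the patched instance behaves differently; other instances of the class keep the strict check.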
<?xml version="1.0" encoding="UTF-8"?>
<collection xmlns="http://www.loc.gov/MARC21/slim" xmlns:marc="http://www.loc.gov/MARC21/slim">
  <record>
    <leader>01352cam a2200349 a 4500</leader>
    <datafield tag="245" ind1="0" ind2="0">
      <subfield code="6">880-01</subfield>
      <subfield code="a">Kazoku kankei no shakai shinrigaku /</subfield>
      <subfield code="c">Osada Masayoshi hen.</subfield>
    </datafield>
  </record>
  <record>
    <leader>01121ccm a2200289z 4500</leader>
    <datafield tag="245" ind1="1" ind2="0">
      <subfield code="a">Powhatan's daughter :</subfield>
      <subfield code="b">march</subfield>
    </datafield>
    <datafield tag="100" ind1="1" ind2=" ">
      <subfield code="a">Sousa, John Philip,</subfield>
      <subfield code="d">1854-1932,</subfield>
      <subfield code="e">composer.</subfield>
    </datafield>
  </record>
  <record>
    <leader>01137cam a2200301 a 4500</leader>
    <datafield tag="245" ind1="1" ind2="0">
      <subfield code="a">Two pieces /</subfield>
      <subfield code="c">by Frank O'Hara.</subfield>
    </datafield>
    <datafield tag="100" ind1="1" ind2=" ">
      <subfield code="a">O'Hara, Frank,</subfield>
      <subfield code="d">1926-1966.</subfield>
      <subfield code="0">http://id.loc.gov/authorities/names/n79042130</subfield>
    </datafield>
  </record>
</collection>
<?xml version="1.0" encoding="UTF-8"?>
<collection>
  <record>
    <leader>01352cam a2200349 a 4500</leader>
    <datafield tag="245" ind1="0" ind2="0">
      <subfield code="6">880-01</subfield>
      <subfield code="a">Kazoku kankei no shakai shinrigaku /</subfield>
      <subfield code="c">Osada Masayoshi hen.</subfield>
    </datafield>
  </record>
  <record>
    <leader>01121ccm a2200289z 4500</leader>
    <datafield tag="245" ind1="1" ind2="0">
      <subfield code="a">Powhatan's daughter :</subfield>
      <subfield code="b">march</subfield>
    </datafield>
    <datafield tag="100" ind1="1" ind2=" ">
      <subfield code="a">Sousa, John Philip,</subfield>
      <subfield code="d">1854-1932,</subfield>
      <subfield code="e">composer.</subfield>
    </datafield>
  </record>
  <record>
    <leader>01137cam a2200301 a 4500</leader>
    <datafield tag="245" ind1="1" ind2="0">
      <subfield code="a">Two pieces /</subfield>
      <subfield code="c">by Frank O'Hara.</subfield>
    </datafield>
    <datafield tag="100" ind1="1" ind2=" ">
      <subfield code="a">O'Hara, Frank,</subfield>
      <subfield code="d">1926-1966.</subfield>
      <subfield code="0">http://id.loc.gov/authorities/names/n79042130</subfield>
    </datafield>
  </record>
</collection>
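The two sample files differ only in the namespace declaration on `<collection>`. A quick way to tell which config a given file needs is to check whether its root element sits in the MARC namespace. The sketch below does this with REXML (swapped in here only because it ships with standard Ruby installs; Traject itself reads the files through Nokogiri). `marc_namespaced?` is a hypothetical helper name, not part of any library.

```ruby
require "rexml/document"

MARC_NS = "http://www.loc.gov/MARC21/slim"

# True when the root element of the given XML string is in the MARC namespace.
def marc_namespaced?(xml)
  REXML::Document.new(xml).root.namespace.to_s == MARC_NS
end

# Trimmed-down versions of the two sample files.
with_ns    = %(<collection xmlns="#{MARC_NS}"><record/></collection>)
without_ns = "<collection><record/></collection>"

puts marc_namespaced?(with_ns)
puts marc_namespaced?(without_ns)
```

A file for which the helper returns false is the kind that needs the `AlmaReader` config rather than the plain one.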