@jprante
Last active August 23, 2022 09:47
Ingest Harvard Library Bibliographic Dataset into Elasticsearch (as raw unmapped MARC21 fields)

HOWTO

This HOWTO is written for Linux systems; the steps on Windows are very similar.

  • install Java 8 into /usr/java/jdk1.8.0

  • install Elasticsearch 1.1.0

  • assign 50% of RAM to the Elasticsearch heap and enable the G1 garbage collector by placing this file at ${ES_HOME}/bin/elasticsearch.in.sh. This example is for a machine with 8 GB of RAM:

      #!/bin/sh
      ES_CLASSPATH=$ES_CLASSPATH:$ES_HOME/lib/elasticsearch-1.1.0.jar:$ES_HOME/lib/*:$ES_HOME/lib/sigar/*
      JAVA_OPTS="$JAVA_OPTS -Xms4g"
      JAVA_OPTS="$JAVA_OPTS -Xmx4g"
      JAVA_OPTS="$JAVA_OPTS -XX:+UseG1GC"
    
  • place this file in ${ES_HOME}/config/elasticsearch.yml

      cluster:
        name: xbib
      index:
        codec:
          bloom:
            load: false
        merge:
          scheduler:
            type: concurrent
            max_thread_count: 4
          policy:
            type: tiered
            max_merged_segment: 1gb
            segments_per_tier: 4
            max_merge_at_once: 4
            max_merge_at_once_explicit: 4
      indices:
        memory:
          index_buffer_size: 33%
        store:
          throttle:
            type: none
      threadpool:
        merge:
          type: fixed
          size: 4
          queue_size: 32
        bulk:
          type: fixed
          size: 8
          queue_size: 32
    
  • export a TOOLS folder to your environment (e.g. export TOOLS=$HOME/xbib), and create a $TOOLS folder structure:

      mkdir -p $TOOLS/lib
      mkdir -p $TOOLS/bin
      mkdir -p $TOOLS/logs
      mkdir -p $TOOLS/import
    
  • download the MARC21 records from http://openmetadata.lib.harvard.edu/bibdata to $TOOLS/import and unpack the tar.gz file into *.mrc files. The result should look like this:

      $ cd $TOOLS/import
      $ find .
      .
      ./20140408
      ./20140408/data
      ./20140408/data/hlom
      ./20140408/data/hlom/ab.bib.00.20140404.full.mrc
      ./20140408/data/hlom/ab.bib.12.20140404.full.mrc
      ./20140408/data/hlom/ab.bib.06.20140404.full.mrc
      ./20140408/data/hlom/ab.bib.09.20140404.full.mrc
      ./20140408/data/hlom/ab.bib.10.20140404.full.mrc
      ./20140408/data/hlom/ab.bib.11.20140404.full.mrc
      ./20140408/data/hlom/ab.bib.02.20140404.full.mrc
      ./20140408/data/hlom/ab.bib.01.20140404.full.mrc
      ./20140408/data/hlom/ab.bib.13.20140404.full.mrc
      ./20140408/data/hlom/ab.bib.07.20140404.full.mrc
      ./20140408/data/hlom/ab.bib.08.20140404.full.mrc
      ./20140408/data/hlom/ab.bib.05.20140404.full.mrc
      ./20140408/data/hlom/ab.bib.03.20140404.full.mrc
      ./20140408/data/hlom/ab.bib.04.20140404.full.mrc
      ./20140408/harvard.tar.gz
    
  • download http://xbib.org/repository/org/xbib/tools/1.0.0.Beta2/tools-1.0.0.Beta2-feeder.jar to $TOOLS/lib

  • create a logging configuration file in $TOOLS/bin/log4j.properties

      log4j.rootLogger=INFO, file, out
      log4j.appender.out=org.apache.log4j.ConsoleAppender
      log4j.appender.out.layout=org.apache.log4j.PatternLayout
      log4j.appender.out.layout.ConversionPattern=[%d{ABSOLUTE}][%-5p][%-25c][%t] %m%n
      log4j.appender.file=org.apache.log4j.FileAppender
      log4j.appender.file.layout=org.apache.log4j.PatternLayout
      log4j.appender.file.layout.ConversionPattern=[%d{ABSOLUTE}][%-5p][%-25c][%t] %m%n
      log4j.appender.file.append=false
      log4j.appender.file.file=logs/xbib.log
      log4j.logger.org.xbib.elasticsearch=DEBUG
    
  • create an ingest script in $TOOLS/bin/harvard2es and make it executable (chmod +x $TOOLS/bin/harvard2es)

      #!/bin/sh
      java="/usr/java/jdk1.8.0/bin/java"
      echo '
      {
      	"path" : "'${TOOLS}'/import/",
      	"pattern" : "*.mrc",
      	"elements" : "marc/bib",
      	"concurrency" : 8,
      	"elasticsearch" : "es://localhost:9300?es.cluster.name=xbib&es.sniff=true",
      	"index" : "harvard",
      	"type" : "title",
      	"shards" : 1,
      	"replica" : 0,
      	"maxbulkactions" : 3000,
      	"maxconcurrentbulkrequests" : 10,
      	"maxtimewait" : "180s",
      	"mock" : false,
      	"client" : "bulk",
      	"direct" : true
      }
      ' | ${java} \
      	 -cp $(pwd)/bin:$(pwd)/lib/tools-1.0.0.Beta2-feeder.jar \
      	 org.xbib.tools.Runner org.xbib.tools.feed.elasticsearch.harvard.FromMARC
    
  • change directory to $TOOLS

  • run $TOOLS/bin/harvard2es

  • wait ~70 minutes (with a single Elasticsearch node on commodity hardware)
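
The heap step above hard-codes 4g for a machine with 8 GB of RAM. For other machines, the same 50% rule can be sketched as a small helper. This is only an illustration: the function names heap_opts_for_kb and total_ram_kb are hypothetical, and reading /proc/meminfo assumes a Linux system.

```shell
#!/bin/sh
# Sketch: derive -Xms/-Xmx flags equal to 50% of physical RAM.
# heap_opts_for_kb and total_ram_kb are hypothetical helper names,
# not part of Elasticsearch.

# Print heap flags for half of the given total memory (argument in kB).
heap_opts_for_kb() {
    total_kb=$1
    half_mb=$(( total_kb / 2 / 1024 ))
    echo "-Xms${half_mb}m -Xmx${half_mb}m"
}

# Print MemTotal (reported in kB) from /proc/meminfo; Linux only.
total_ram_kb() {
    awk '/^MemTotal:/ { print $2 }' /proc/meminfo
}

# Example: 8 GiB = 8388608 kB
heap_opts_for_kb 8388608   # prints: -Xms4096m -Xmx4096m
```

On a real machine, something like JAVA_OPTS="$JAVA_OPTS $(heap_opts_for_kb "$(total_ram_kb)")" in elasticsearch.in.sh would replace the two fixed -Xms/-Xmx lines.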

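Because the ingest runs for over an hour, it can help to check the $TOOLS layout before starting it. The following is a minimal sketch of such a check; preflight is a hypothetical helper name, and the paths it tests are the ones used in the steps above.

```shell
#!/bin/sh
# Sketch: verify the $TOOLS layout from this HOWTO before a long ingest.
# preflight is a hypothetical helper; it only checks that paths exist.

preflight() {
    tools=$1
    status=0
    # Folders created with mkdir -p in the HOWTO.
    for dir in lib bin logs import; do
        if [ ! -d "$tools/$dir" ]; then
            echo "missing directory: $tools/$dir" >&2
            status=1
        fi
    done
    # Files downloaded or created in the HOWTO.
    for file in lib/tools-1.0.0.Beta2-feeder.jar bin/log4j.properties bin/harvard2es; do
        if [ ! -f "$tools/$file" ]; then
            echo "missing file: $tools/$file" >&2
            status=1
        fi
    done
    # Count unpacked MARC files anywhere below import/.
    count=$(find "$tools/import" -type f -name '*.mrc' 2>/dev/null | wc -l | tr -d ' ')
    if [ "$count" -eq 0 ]; then
        echo "no *.mrc files found under $tools/import" >&2
        status=1
    fi
    return $status
}
```

A possible use is cd "$TOOLS" && preflight "$TOOLS" && bin/harvard2es, so the ingest only starts when every check passes.
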
cleydyr commented Aug 23, 2022

Unfortunately the link http://openmetadata.lib.harvard.edu/bibdata is now dead. :(
