This HOWTO is for Linux systems (the procedure on Windows is very similar).
-
install Java 8 into
/usr/java/jdk1.8.0
-
install Elasticsearch 1.1.0
-
assign 50% of RAM to the Elasticsearch heap and enable the G1 GC by placing this file at ${ES_HOME}/bin/elasticsearch.in.sh. This example is for 8G RAM:

    #!/bin/sh
    ES_CLASSPATH=$ES_CLASSPATH:$ES_HOME/lib/elasticsearch-1.1.0.jar:$ES_HOME/lib/*:$ES_HOME/lib/sigar/*
    # fixed heap size: 4g is 50% of 8G RAM
    JAVA_OPTS="$JAVA_OPTS -Xms4g"
    JAVA_OPTS="$JAVA_OPTS -Xmx4g"
    # use the G1 garbage collector
    JAVA_OPTS="$JAVA_OPTS -XX:+UseG1GC"
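The 4g values above are half of the 8G example machine. As a minimal sketch, assuming a Linux host that reports `MemTotal` in kB in `/proc/meminfo`, the heap size for a different machine could be derived as follows. The function name `half_ram_mb` is made up for this illustration and is not part of Elasticsearch.

```shell
#!/bin/sh
# half_ram_mb is a hypothetical helper, not part of Elasticsearch.
# It prints half of the physical RAM in whole megabytes, read from a
# meminfo-style file (default: /proc/meminfo, which reports MemTotal in kB).
half_ram_mb() {
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 2 }' "${1:-/proc/meminfo}"
}

# Only print suggestions when /proc/meminfo is readable (i.e. on Linux).
if [ -r /proc/meminfo ]; then
    heap="$(half_ram_mb)m"
    echo "JAVA_OPTS=\"\$JAVA_OPTS -Xms${heap}\""
    echo "JAVA_OPTS=\"\$JAVA_OPTS -Xmx${heap}\""
fi
```

The two printed lines can be pasted into elasticsearch.in.sh in place of the -Xms4g and -Xmx4g lines. Real machines usually report slightly less than their nominal RAM, so the result will be a little below the round number.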
-
place this file in ${ES_HOME}/config/elasticsearch.yml:

    cluster:
      name: xbib
    index:
      codec:
        bloom:
          load: false
      merge:
        scheduler:
          type: concurrent
          max_thread_count: 4
        policy:
          type: tiered
          max_merged_segment: 1gb
          segments_per_tier: 4
          max_merge_at_once: 4
          max_merge_at_once_explicit: 4
    indices:
      memory:
        index_buffer_size: 33%
      store:
        throttle:
          type: none
    threadpool:
      merge:
        type: fixed
        size: 4
        queue_size: 32
      bulk:
        type: fixed
        size: 8
        queue_size: 32
-
export a TOOLS variable pointing to a working folder (e.g. export TOOLS=$HOME/xbib), and create the folder structure below $TOOLS:

    mkdir -p $TOOLS/lib
    mkdir -p $TOOLS/bin
    mkdir -p $TOOLS/logs
    mkdir -p $TOOLS/import
-
download MARC21 records from http://openmetadata.lib.harvard.edu/bibdata to $TOOLS/import and unpack the tar.gz file to *.mrc files. The result should look like this:

    $ cd $TOOLS/import
    $ find .
    .
    ./20140408
    ./20140408/data
    ./20140408/data/hlom
    ./20140408/data/hlom/ab.bib.00.20140404.full.mrc
    ./20140408/data/hlom/ab.bib.12.20140404.full.mrc
    ./20140408/data/hlom/ab.bib.06.20140404.full.mrc
    ./20140408/data/hlom/ab.bib.09.20140404.full.mrc
    ./20140408/data/hlom/ab.bib.10.20140404.full.mrc
    ./20140408/data/hlom/ab.bib.11.20140404.full.mrc
    ./20140408/data/hlom/ab.bib.02.20140404.full.mrc
    ./20140408/data/hlom/ab.bib.01.20140404.full.mrc
    ./20140408/data/hlom/ab.bib.13.20140404.full.mrc
    ./20140408/data/hlom/ab.bib.07.20140404.full.mrc
    ./20140408/data/hlom/ab.bib.08.20140404.full.mrc
    ./20140408/data/hlom/ab.bib.05.20140404.full.mrc
    ./20140408/data/hlom/ab.bib.03.20140404.full.mrc
    ./20140408/data/hlom/ab.bib.04.20140404.full.mrc
    ./20140408/harvard.tar.gz
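A minimal sketch of the unpack-and-check step, assuming the archive was saved as $TOOLS/import/20140408/harvard.tar.gz as in the listing above and that it unpacks into data/hlom next to itself. `count_mrc` is a hypothetical helper name; the listing above shows 14 .mrc files, so that is the count to expect for this particular dump.

```shell
#!/bin/sh
# count_mrc is a hypothetical helper: it prints how many *.mrc files
# exist anywhere below the given directory.
count_mrc() {
    find "$1" -type f -name '*.mrc' | wc -l | tr -d ' '
}

# Assumed archive location, taken from the listing above.
archive="$TOOLS/import/20140408/harvard.tar.gz"
if [ -f "$archive" ]; then
    # Unpack next to the archive; assumed to create data/hlom/*.mrc.
    tar -xzf "$archive" -C "$(dirname "$archive")"
    echo "MARC files found: $(count_mrc "$TOOLS/import")"
fi
```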
-
download
http://xbib.org/repository/org/xbib/tools/1.0.0.Beta2/tools-1.0.0.Beta2-feeder.jar
to $TOOLS/lib
-
create a logging configuration file in $TOOLS/bin/log4j.properties:

    # the appender names listed here must match the appenders defined below
    log4j.rootLogger=INFO, file, out
    log4j.appender.out=org.apache.log4j.ConsoleAppender
    log4j.appender.out.layout=org.apache.log4j.PatternLayout
    log4j.appender.out.layout.ConversionPattern=[%d{ABSOLUTE}][%-5p][%-25c][%t] %m%n
    log4j.appender.file=org.apache.log4j.FileAppender
    log4j.appender.file.layout=org.apache.log4j.PatternLayout
    log4j.appender.file.layout.ConversionPattern=[%d{ABSOLUTE}][%-5p][%-25c][%t] %m%n
    log4j.appender.file.append=false
    # relative path, resolved against the directory the script is run from
    log4j.appender.file.file=logs/xbib.log
    log4j.logger.org.xbib.elasticsearch=DEBUG
-
create the ingest script in $TOOLS/bin/harvard2es:

    #!/bin/sh

    java="/usr/java/jdk1.8.0/bin/java"

    # es.cluster.name must match cluster.name in elasticsearch.yml (xbib)
    echo '
    {
        "path" : "'${TOOLS}'/import/",
        "pattern" : "*.mrc",
        "elements" : "marc/bib",
        "concurrency" : 8,
        "elasticsearch" : "es://localhost:9300?es.cluster.name=xbib&es.sniff=true",
        "index" : "harvard",
        "type" : "title",
        "shards" : 1,
        "replica" : 0,
        "maxbulkactions" : 3000,
        "maxconcurrentbulkrequests" : 10,
        "maxtimewait" : "180s",
        "mock" : false,
        "client" : "bulk",
        "direct" : true
    }
    ' | ${java} \
        -cp $(pwd)/bin:$(pwd)/lib/tools-1.0.0.Beta2-feeder.jar \
        org.xbib.tools.Runner org.xbib.tools.feed.elasticsearch.harvard.FromMARC
-
change directory to
$TOOLS
-
run
$TOOLS/bin/harvard2es
-
wait ~70 minutes (with a single Elasticsearch node on commodity hardware)
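While waiting, the number of indexed documents can be checked over HTTP. This is a minimal sketch, assuming the node's HTTP interface listens on the default localhost:9200 and that curl is installed; `count_url` is a hypothetical helper name, and "harvard" is the index name from the ingest script. It uses the Elasticsearch `_count` endpoint.

```shell
#!/bin/sh
# count_url is a hypothetical helper that builds the URL of the
# Elasticsearch _count endpoint for a given host:port and index.
count_url() {
    echo "http://${1}/${2}/_count?pretty"
}

# Assumes the default HTTP port 9200 on localhost; adjust if your node differs.
if command -v curl >/dev/null 2>&1; then
    curl -s --max-time 5 "$(count_url localhost:9200 harvard)" \
        || echo "Elasticsearch not reachable"
fi
```

Running it repeatedly during the ingest should show the "count" value growing until the feeder finishes.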
Unfortunately the link http://openmetadata.lib.harvard.edu/bibdata is now dead. :(