Bill Dueber billdueber

View rjgit_snippets.rb
require 'jgit-3.1.0.jar'
 
dirpath = '.'
 
 
# get the repo
 
 
dir = java.io::File.new(dirpath)
frb = org.eclipse.jgit.storage.file.FileRepositoryBuilder.new
View gist:6784727

Subject: ANNOUNCEMENT: Traject MARC->Solr indexer beta release

Jonathan Rochkind (Johns Hopkins), along with Bill Dueber (University of Michigan), is happy to announce a first beta release of "traject," a framework for indexing MARC data to Solr.

traject, in the vein of solrmarc, allows you to define your indexing rules using simple macro and translation files. However, traject runs under JRuby and is "ruby all the way down," so you can easily provide additional logic by simply requiring ruby files.

traject is currently in a beta release, but is already being used in production to generate the HathiTrust Catalog (http://www.hathitrust.org/). traject was developed under a test-first mentality and has undergone both continuous integration and an extensive benchmarking/profiling period to keep it fast.

You can view the code[1] on github, and easily install it as a (jruby) gem using "gem install traject".

View extractor_spec.rb
module Traject
class MarcExtractor
# A set of Spec objects, with knowledge about the collection as a whole
class SpecSet
attr_reader :interesting_tags, :options
 
def initialize(opts = {})
@specs = {}
View _UPDATE: MY BAD_
indy slowdown was due to me having JRUBY_OPTS include -J-XX:+TieredCompilation and
-J-XX:TieredStopAtLevel=1, supposedly to make startup faster. Removing them
removes the performance issue. Jira ticket closed.
View marc4j_jruby.rb
# I just nabbed the source of marc4j and built it with "ant jar"
 
require 'marc4j-2.5.1-beta.jar'
 
# Conveniently add Enumerable to the reader interface so I can get #each, #each_with_index, etc.
# This would be automatic if MarcReader were specified as an iterable, as per a recent github issue
# on the marc4j repo (https://github.com/marc4j/marc4j/issues/11)
 
module org.marc4j::MarcReader
include Enumerable
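
The preview above is cut off, so here is a rough, pure-Ruby sketch of the same idea: mix Enumerable into an iterator-style interface and define #each in terms of "has next / next". `IteratorStyleReader` and `FakeReader` are made-up stand-ins, not marc4j classes, and the real gist may well do it differently.

```ruby
# Hypothetical module with the same hasNext/next shape as a Java-style reader.
# Defining #each is all Enumerable needs to supply #each_with_index, #map, etc.
module IteratorStyleReader
  include Enumerable

  def each
    return enum_for(:each) unless block_given?
    # `self.next` calls the method; a bare `next` would be the Ruby keyword.
    yield self.next while has_next?
  end
end

# Hypothetical reader backed by an array, standing in for a real MARC reader.
class FakeReader
  include IteratorStyleReader

  def initialize(items)
    @items = items.dup
  end

  def has_next?
    !@items.empty?
  end

  def next
    @items.shift
  end
end

pairs = FakeReader.new(%w[rec1 rec2 rec3]).each_with_index.to_a
```

Like the real thing, the reader is consumed as it is iterated, so a second pass over the same object yields nothing.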
View marc2solr_sample_log.txt
# Not saying this is optimal, just what I currently do.
 
# Start out by logging what the hell we're doing: what config files are loaded, where we're sending documents, etc.
 
INFO 08:48:20 1252 ROOT Loading files in /l/solr-vufind/apps/marc2solr_example/umich/lib
INFO 08:48:23 4113 MARC2Solr.Conf Set suss url to http://localhost:8024/solr/biblio
INFO 08:48:23 4114 MARC2Solr.Conf Using 3 threads for the suss
INFO 08:48:24 4334 ROOT Using 4 threads; activiating threach
INFO 08:48:24 4335 ROOT Indexing file /l/solr-vufind/data/vufind_full_20130715.seq.gz
INFO 08:48:24 4335 MARC2Solr.Conf Sniffed marc file type as seq
View marc2solr_lessons.adoc

Things I did wrong

These are the things off the top of my head that drive me crazy and/or that I’ve had to work around. I’m sure there are more that I’ll come up with later.

The fundamental problem, it feels to me, is that the (equivalent of the) MARC::Reader.each loop is hidden. Pretty much all the rest of these problems flow from that. Basically, I want to give up on the idea of hiding the primary loop from the user, and just assume the user is both a programmer and a non-idiot.
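
To make "don't hide the primary loop" concrete, here is a minimal sketch of what a visible loop might look like. Everything in it is hypothetical: `Record`, `ArrayReader`, and `index_all` are stand-ins for illustration, not marc2solr or ruby-marc code.

```ruby
# Hypothetical record type; a real one would be a MARC record.
Record = Struct.new(:id, :title)

# Hypothetical reader: anything that responds to #each will do.
class ArrayReader
  include Enumerable

  def initialize(records)
    @records = records
  end

  def each(&block)
    @records.each(&block)
  end
end

# The caller owns the loop, so skipping records and building
# documents is plain Ruby rather than framework configuration.
def index_all(reader)
  docs = []
  reader.each do |record|
    next if record.title.nil? # skip logic lives right in the loop
    docs << { 'id' => record.id, 'title' => record.title.strip }
  end
  docs
end

reader = ArrayReader.new([
  Record.new('1', ' First title '),
  Record.new('2', nil),
  Record.new('3', 'Third title')
])
docs = index_all(reader)
```

With the loop in the open, adding per-record error handling or a progress counter is a one-line change inside the block instead of a hook the framework has to provide.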

View marcquery.rb
require 'parslet'
 
# A complex field-selection syntax to get MARC fields. Something I'm messing around
# with for a marc indexing process I'm thinking of building to replace marcspec
#
#
# spec := <tag>
# <tag>!<ind><ind>
# tag := '245' # literal string
# := '6##' # use hashes to mean "any character"
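
The preview stops before the parser itself, so as a rough illustration of just the tag part of the grammar, here is a sketch that treats each '#' as "any single character". It uses a plain Regexp rather than parslet, and `tag_pattern_to_regexp` and `tag_matches?` are hypothetical helpers, not names from marcquery.rb.

```ruby
# Hypothetical helper: turn a tag pattern such as '6##' into an anchored
# Regexp, with each '#' standing for any single character.
def tag_pattern_to_regexp(pattern)
  body = pattern.each_char.map { |ch| ch == '#' ? '.' : Regexp.escape(ch) }.join
  Regexp.new("\\A#{body}\\z")
end

# True when the tag fits the pattern exactly.
def tag_matches?(pattern, tag)
  tag_pattern_to_regexp(pattern).match?(tag)
end

subject_tags = %w[245 600 650 700].select { |tag| tag_matches?('6##', tag) }
```

A literal tag like '245' goes through the same path and only matches itself, which keeps the two forms in the grammar on one code path.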
View marc_xml_test.rb
require 'marc'
require 'marc4j4r'
require 'benchmark'
 
iterations = 1
xmlsourcefile = 'topics.xml' # 18k records as a MARC-XML collection
 
puts RUBY_DESCRIPTION
View gist:5824804
<fieldtype name="text" class="solr.TextField" positionIncrementGap="1000">
<analyzer type="index">
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="&amp;" replacement=" and " />
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="\b([A-Ga-g])[\#♯](\s+|\Z)" replacement="$1 sharp$2" />
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="\b([A-Ga-g])\s*[b♭](\s+|\Z)" replacement="$1 flat$2" />
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="\b[Cc]\+\+" replacement="cplusplus" />