Bill Dueber (billdueber), GitHub gists
use strict;
use warnings;
use JSON;
use Data::Dumper;

my $file = 'gistfile1.txt';
open(my $infile, '<', $file) or die "Can't open '$file': $!";
my $i = 0;
my %elements;

Timing JRuby-based MARC indexing vs solrmarc v2.0

The data and processing

My MARC indexing does the following processing (working with marc4j records):

  • normal fields: 29 fields that can be described using nothing but the normal solrmarc syntax. This might be extensive (15-20 tags, lookups in either hashes or sets of regexp matches for transformation) but doesn't require custom code. This generic processor is written in Ruby.
  • custom fields: 10 (or 14 if it's a serial) fields that require custom code. These are also all Ruby.
  • all fields: the single "get all the text in all the fields numbered 10 to 999" method. Ruby.
  • xml: turn the record into a MARC-XML string. This uses JRuby to get a java.io.StringWriter and javax.xml.transform.stream, and then calls marc4j's MarcXmlWriter to write to it. Which, I just looked, isn't exactly how solrmarc does it; I'll have to benchmark it both ways. Both Ruby and Java.
# Example of pushing stuff to solr with solrj in jruby
require 'rubygems'
# Load any .jar files you want with "require '../path/to/jarfile.jar'"
# For both marc4j4r and jruby_streaming_update_solr_server, if you load the
# appropriate jar first that's the version that will be used. If not,
# we fall back on the one shipped with the gem
# require '../jars/myjavacode.jar'
Just a quick place to put this that's better than IRC:

Suppose mm="<2 -1". As implemented now, the search

    dog cat  =>  dog AND cat

Likewise

    dog cat -mouse  =>
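For reference, Solr derives the number of required optional clauses from the mm spec; the sketch below is a simplified pure-Ruby rendering of that logic, not Solr's actual code. Solr's conditional clauses use the form n<m ("when there are more than n clauses, require m; negative m means all but that many"), so the spec above is assumed to be meant in that family.

```ruby
# Simplified sketch of Solr-style min-should-match evaluation.
# Handles a bare integer ("2"), a percentage ("75%"), and
# space-separated conditionals ("2<-1 9<75%").
# Not Solr's actual implementation.
def min_should_match(spec, clause_count)
  calc = lambda do |value|
    if value.end_with?('%')
      pct = value.to_i
      n = clause_count * pct.abs / 100
      pct < 0 ? clause_count - n : n
    else
      v = value.to_i
      v < 0 ? clause_count + v : v
    end
  end

  return calc.call(spec) unless spec.include?('<')

  # Conditionals: "n<m" applies when clause_count > n; default is "all"
  required = clause_count
  spec.split(/\s+/).each do |part|
    threshold, value = part.split('<', 2)
    required = calc.call(value) if clause_count > threshold.to_i
  end
  required
end

min_should_match('2<-1', 2)  # both clauses required
min_should_match('2<-1', 4)  # all but one: 3 of 4 required
```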
$:.unshift 'lib'
require 'marc'
require 'benchmark'
require 'profiler'
tags = ['001','005', '100','110','111','240','243','245', /^6[0-9][0-9]$/, '700', '710', '711']
rec = MARC::Reader.new('batch.dat').first
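Since the tags list mixes literal strings with a Regexp, any code that walks it needs a matcher that handles both. Something along these lines (a hypothetical helper, not ruby-marc's API):

```ruby
# Match a field tag against a spec that is either a literal
# string ('245') or a Regexp (/^6[0-9][0-9]$/).
def tag_match?(spec, tag)
  spec.is_a?(Regexp) ? !!(spec =~ tag) : spec == tag
end

tags = ['001', '245', /^6[0-9][0-9]$/, '700']
tags.any? { |t| tag_match?(t, '650') }  # the regexp catches 6xx tags
```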
# The current version. Using self.index(field) makes this O(n^2)!
# Rebuild the HashWithChecksumAttribute with the current
# values of the fields Array
def reindex
  @tags = {}
  self.each do |field|
    @tags[field.tag] ||= []
    @tags[field.tag] << self.index(field) ##### AAAAAAAHHHHHHHH ####
  end
end
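The quadratic behavior comes from calling Array#index (an O(n) scan) once per field. Recording the position with each_with_index keeps the rebuild linear. A minimal sketch with a stand-in Field struct, not ruby-marc's actual classes:

```ruby
# Stand-in for a MARC field; only the tag matters here
Field = Struct.new(:tag)

class FieldList < Array
  # O(n) rebuild: each_with_index already knows each field's
  # position, so there is no per-field Array#index scan
  def reindex
    @tags = {}
    each_with_index do |field, i|
      (@tags[field.tag] ||= []) << i
    end
    @tags
  end
end

fields = FieldList[Field.new('245'), Field.new('650'), Field.new('650')]
fields.reindex  # => {"245"=>[0], "650"=>[1, 2]}
```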
module MARC
  # Simply what the class name says.
  # The checksum is used to see if the FieldMap's array has changed.
  class HashWithChecksumAttribute < Hash
    attr_accessor :checksum
  end

  # The FieldMap is an Array of DataFields and Controlfields.
  # It also contains a HashWithChecksumAttribute with a Hash-based
# Code to benchmark various serializations of MARC records using ruby-marc.
# Not included is XML -- serialization using ruby-marc is ridiculously slow and
# the filesizes are bigger than anything else. Even with the libxml reader,
# deserialization is also relatively slow.
#
# I didn't bother to benchmark json/pure in later runs because it's just so damn
# slow that it would never be a good choice.
#
# My results can be found at http://robotlibrarian.billdueber.com/sizespeed-of-various-marc-serializations-using-ruby-marc/
require 'marc'
billdueber / autoload.rb
Created October 25, 2010 17:41
Problem with autoload and threading
require 'rubygems'
require 'rdf'
require 'threach'
(1..10).threach(3) do |c|
  u = RDF::URI.new("http://example.org/#{c}/")
  puts u.to_s
end
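For context, threach is roughly "each, run on a pool of n threads"; conceptually it behaves like this queue-based sketch (not the gem's actual implementation). The autoload problem above comes from several worker threads touching RDF's autoloaded constants at the same time:

```ruby
# Conceptual sketch of what threach(n) does: n worker threads
# pulling items off a shared queue. Not the threach gem's code.
module Enumerable
  def naive_threach(n, &block)
    queue = SizedQueue.new(n * 2)
    workers = Array.new(n) do
      Thread.new do
        # :done is a sentinel telling this worker to exit
        while (item = queue.pop) != :done
          block.call(item)
        end
      end
    end
    each { |item| queue.push(item) }
    n.times { queue.push(:done) }
    workers.each(&:join)
  end
end

# Usage: collect squares from three concurrent workers
results = Queue.new
(1..10).naive_threach(3) { |c| results << c * c }
```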
billdueber / marc_deserialization_bench.rb
Created October 25, 2010 20:26
Deserialization speed of marc-in-json vs marcxml under ruby-marc with fastest available libraries
require 'rubygems'
require 'marc'
require 'yajl'
require 'benchmark'
iterations = 5
xmlsourcefile = 'topics.xml' # 18k records as a MARC-XML collection
jsonsourcefile = 'topics.ndj' # Same records as newline-delimited marc-in-json
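The harness presumably then runs Benchmark.bmbm over the two readers, repeated `iterations` times. In this skeleton the reader calls are stubbed out, since the real ones depend on the marc and yajl gems; a sketch, not the gist's full code.

```ruby
require 'benchmark'

iterations = 5

# Stubs standing in for the real deserializers (MARC::XMLReader with
# the fastest available parser, and a yajl-backed marc-in-json reader).
# Hypothetical placeholders doing token work so the timing loop runs.
read_xml  = lambda { 1_000.times { |i| i.to_s } }
read_json = lambda { 1_000.times { |i| i.to_s } }

# bmbm runs a rehearsal pass first, then reports the timed pass
timings = Benchmark.bmbm do |b|
  b.report('marcxml')      { iterations.times { read_xml.call } }
  b.report('marc-in-json') { iterations.times { read_json.call } }
end
```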