Lessons learned from marc2solr

Things I did wrong

These are the things off the top of my head that drive me crazy and/or that I’ve had to work around. I’m sure there are more that I’ll come up with later.

The fundamental problem, it feels to me, is that the (equivalent of the) MARC::Reader.each loop is hidden. Pretty much all the rest of these problems flow from that. Basically, I want to give up on the idea of hiding the primary loop from the user, and just assume the user is both a programmer and a non-idiot.
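
To make that concrete: ruby-marc's reader is just an enumerable, so exposing the loop means all the per-record decisions sit in plain sight. A minimal sketch (the field logic here is illustrative, not marc2solr code):

exposing the loop
require 'marc'

solr_docs = []
MARC::Reader.new('records.mrc').each do |record|
  # any per-record logic lives right here, in plain view
  doc = { 'id' => record['001'].value }
  doc['title'] = record['245'].subfields.map(&:value).join(' ') if record['245']
  solr_docs << doc
end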

Using a DSL with instance_eval was a mistake

So, I have my nice little DSL and so on, and there are several things I’d like to do over and over again. instance_eval evaluates the block in a different scope, which makes this impossible:

instance_eval fail
title_codes = 'abdefghknp'
field('title') do
  function(:getTitle) {
    args: title_codes  # => FAIL! title_codes undefined
  }
end

…because title_codes isn’t defined during the instance_eval. This makes for lots of repetition, or the need for a whole separate dictionary-type instance variable within your specification object (which is then operated on mostly under the covers). Uck.
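
One way out (not what marc2solr does, just a sketch of the alternative) is to yield a builder object into the block instead of instance_eval'ing it; the block stays an ordinary closure, so locals like title_codes remain visible. FieldSpec here is an invented stand-in for the real specification object.

yield instead of instance_eval
class FieldSpec
  attr_reader :name, :args

  def initialize(name)
    @name = name
  end

  def function(sym, args: nil)
    @function = sym
    @args = args
  end
end

def field(name)
  spec = FieldSpec.new(name)
  yield spec          # a plain block call, not instance_eval
  spec
end

title_codes = 'abdefghknp'

spec = field('title') do |f|
  f.function(:getTitle, args: title_codes)  # title_codes is visible here
end
spec.args  # => "abdefghknp"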

No place to put intermediate results

The second upshot of this scope issue is that you’ve got nowhere to stash stuff. I’ve got something like five slightly-different title fields, and all of them find the 245 each and every time. Ditto with a bunch of custom HathiTrust data we’ve stuffed into some 9XX fields. Lots of repeated work, including finding the fields, trimming of punctuation, etc. All because I don’t have a good place to stash them.

But even if I did have a place to stash partial results from custom methods (which I experimented with, by throwing stuff in a hash in the record object that gets passed around), it’s totally hidden. Which means that deleting or changing one field rule — or even changing their order in the file — could make another fail (and probably fail silently, since it’ll just look for data and not find any), and the data chain isn’t at all obvious. So, to keep things clear, I end up doing the same work over and over again, which makes me feel dirty.
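
What I want is something like this: an explicit per-record cache that every title-ish rule shares, so the 245 gets found and de-punctuated exactly once. The cached_245 helper and the punctuation regex are made up for the sketch.

a per-record stash
require 'marc'

# hypothetical helper: find and clean the 245 once, then reuse it
def cached_245(record, cache)
  cache[:f245] ||= begin
    f = record['245']
    f ? f.subfields.map { |sf| sf.value.sub(/[\s\/.,;:]+\z/, '') } : []
  end
end

MARC::Reader.new('records.mrc').each do |record|
  cache = {}
  title      = cached_245(record, cache).join(' ')
  title_sort = cached_245(record, cache).first  # reuses the cached work
  # ...the other three or four title fields hit the cache, not the record
end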

Fixed-use key/value mapping based on solrmarc’s translation files

So, I’ve got match_map, which I like (and which actually post-dates my code, so it’s not being used), and it does its job well. But my DSL has a tight coupling between a solr field definition and a particular map, which means I can’t (easily) do things like chaining match_maps together or combining them with other arbitrary filters (e.g., I spend lots of time removing trailing punctuation from stored values, since solr won’t muck with stored values for me at all).

Having it all done in actual ruby code, with #map and #filter and such, would be much clearer and more flexible.
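
Something along these lines is what I have in mind: the translation map is just a Hash, and it chains with whatever other filters I need using ordinary #map and #reject. The location map and the punctuation strip are invented for the example.

maps and filters in plain ruby
# a translation map is just a Hash with a default
location_map = Hash.new('UNKNOWN').merge(
  'BUHR'  => 'Buhr Shelving Facility',
  'HATCH' => 'Hatcher Graduate Library'
)

raw = ['BUHR.', 'HATCH', '']

values = raw
  .map    { |v| v.sub(/[\s.,;:\/]+\z/, '') }  # strip trailing punctuation
  .reject(&:empty?)                           # drop empties
  .map    { |v| location_map[v] }             # then translate

values  # => ["Buhr Shelving Facility", "Hatcher Graduate Library"]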

BTW, this calls back to the intermediate results issue; I’ve got a bunch of rule pairs that are, essentially, raw_value and raw_value_with_translation (e.g., I store both location-in-english and location-as-four-letter-code). For the super-expensive ones, I’ve made custom functions that return multiple solrfield/value pairs, but it’s hard to follow the code when I do that. So again, I do the same work two or three times, only to return one value before translation and one after.

Tight coupling between solr field names and the code that produces the values

Right now, I’ve got tight coupling between solr field names and the code/specification that produces the values. For just me, this is fine, but it makes sharing code more difficult than a more generic interface would (e.g., people just expose methods that take in a record and return a (possibly empty) array of values). I like the idea of being able to include a gem and get access to field definitions that I can then stuff into whatever solrfield I want.
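
The interface I’m imagining is nothing fancier than this (the module and method names are invented for the sketch): a shared gem exposes plain record-in, array-of-values-out methods, and my index file decides which solr field each one lands in.

a generic extractor
require 'marc'

# in a shared gem: plain record -> array-of-values methods
module MARCExtractors
  # every 260$c / 264$c value, or an empty array
  def self.publication_dates(record)
    record.fields(['260', '264']).flat_map do |f|
      f.subfields.select { |sf| sf.code == 'c' }.map(&:value)
    end
  end
end

# in my index file: I pick the solr field names
MARC::Reader.new('records.mrc').each do |record|
  doc = {}
  doc['pub_date']       = MARCExtractors.publication_dates(record)
  doc['pub_date_first'] = MARCExtractors.publication_dates(record).first
end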

No chance to preprocess

So, something I do in my alephsequential marc reader is automatically turn illegal indicators into spaces as they come in (and throw a warning). I also look for poorly-formed values (e.g., embedded newlines), that sort of thing. Not sure exactly what I’m going to do once I start using a stock MARC reader, since my marc2solr code doesn’t give me a good way to preprocess the records. What I’m doing now is calling a custom function as I get the record ID, and side-effecting the shit out of things in that method. Talk about feeling dirty…
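
With the loop exposed, preprocessing would just be the first thing inside it instead of a side effect hidden in a get-the-record-ID function. A sketch of the indicator cleanup (assuming, for the illustration, that digits and blank are the only legal indicator values):

preprocessing in the open
require 'marc'

LEGAL_INDICATOR = /\A[0-9 ]\z/

def clean_indicators!(record)
  record.fields.each do |f|
    next unless f.respond_to?(:indicator1)  # skip control fields
    unless f.indicator1 =~ LEGAL_INDICATOR
      warn "bad indicator1 in #{f.tag}; replacing with a space"
      f.indicator1 = ' '
    end
    unless f.indicator2 =~ LEGAL_INDICATOR
      warn "bad indicator2 in #{f.tag}; replacing with a space"
      f.indicator2 = ' '
    end
  end
end

MARC::Reader.new('records.mrc').each do |record|
  clean_indicators!(record)  # preprocess first, then index as usual
end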

marcspec doesn’t deal with indicators or 880s

This was just dumb copying of solrmarc on my part. Every time something needs to take an indicator into account, I need to use a custom function. Idiot. And it’s coming up more and more now that the RDA-in-MARC fields are appearing.

I also think it might be a good idea to either allow automatic matching of 880s or provide a way to specify them explicitly (e.g., 880[245] or something). Dealing with that stuff in a preprocessor or whatever is a pain in the butt.
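
The 880 linkage is already sitting in the record: each 880 carries a $6 like "245-01" pointing back at the field it’s the vernacular version of, so a spec like 880[245] could just mean "every 880 whose $6 starts with 245". A rough sketch:

finding linked 880s
require 'marc'

# all 880 fields linked (via subfield 6) to a given tag, e.g. '245'
def linked_880s(record, tag)
  record.fields('880').select do |f|
    f['6'] && f['6'].start_with?("#{tag}-")
  end
end

MARC::Reader.new('records.mrc').each do |record|
  vernacular_titles = linked_880s(record, '245').map do |f|
    f.subfields.reject { |sf| sf.code == '6' }.map(&:value).join(' ')
  end
end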

Using StreamingUpdateSolrServer was a mistake

The suss (and its solr4 equivalent, whose name escapes me at the moment) eats errors. If something goes wrong with a record, you have no way of knowing. The right way to do it would be to send each document individually, using a thread pool to parallelize it, and deal with errors as they come up.
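
Roughly what I mean, as a plain-Ruby sketch (solr.add and the docs list are stand-ins; the real thing runs under JRuby against a SolrJ client): a queue, a handful of worker threads, one document per request, and a rescue that tells me exactly which record failed.

one doc at a time, with errors surfaced
require 'thread'

queue = Queue.new

workers = 4.times.map do
  Thread.new do
    while (doc = queue.pop)
      begin
        solr.add(doc)                      # stand-in: one doc per request
      rescue => e
        warn "#{doc['id']}: #{e.message}"  # error names the record, not a whole batch
      end
    end
  end
end

docs.each { |doc| queue << doc }     # docs come out of the main indexing loop
workers.size.times { queue << nil }  # shut the workers down
workers.each(&:join)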

I can’t branch

This is another side effect of hiding the #each loop. If I want to do more detailed indexing based on format (music is the canonical example), I’ve got no way to do different indexing per type (except, again, by hiding it all in custom functions).

I put some custom functions in the marc2solr distribution itself

…which means that instead of looking in the ./lib folder of my project, I have to dig into the gem source to see what the hell is happening. I would instead ship a standard project skeleton with the basics already in a ./lib directory.

Things I did right

I didn’t get it all wrong.

Multiple configuration files as well as command-line overrides

Allowing multiple config files makes things nice. I can have separate config files that specify machine information, indexing code, etc. Makes it easy to mix and match.

Automatically load up everything in the ./lib directory

I put custom crap in there — ruby files and .jar files — and load them all up. If I want to expose a new custom function, I can just throw it in a file and drop it into ./lib and it’ll be available. Including the ability to throw in a ruby file that basically just requires gems.
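
The loading itself is only a couple of lines; roughly like this (the directory is whatever --customdir points at, and the jar requires assume JRuby):

loading ./lib
libdir = File.expand_path('./lib')

# jars first (JRuby puts them on the classpath), then any ruby files
Dir.glob(File.join(libdir, '*.jar')).sort.each { |jar| require jar }
Dir.glob(File.join(libdir, '*.rb')).sort.each  { |rb|  require rb  }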

dryrun and printdoc flags

dryrun goes through the whole process but disables all communication with solr.

printdoc spits out all the solrfield:value pairs along with the record ID. See other options below.

Logging and Benchmarking

marc2solr does pretty damn good logging, so it’s pretty easy for me to find stuff. Turning it up to the debug level spits out the record ID for every record read, making it easy to figure out exactly which record things are blowing up on.

I also set up a switch to benchmark how long each solrfield took to produce during a run. Good for tuning.
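
Conceptually, the benchmark switch is just a realtime timer wrapped around each field rule and accumulated per field name; a sketch, not the actual marc2solr code (title_values is a made-up field rule):

per-field benchmarking
require 'benchmark'

field_times = Hash.new(0.0)

# inside the per-record loop, around each field rule
field_times['title'] += Benchmark.realtime do
  doc['title'] = title_values(record)  # made-up field rule
end

# at the end of the run, dump the totals to the log, slowest first
field_times.sort_by { |_, secs| -secs }.each do |field, secs|
  puts format('%-20s %8.2fs', field, secs)
end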

Options to delete and commit

marc2solr takes subcommands: index, obviously, plus commit, ping, and delete. I use all of them, because it’s just damn convenient.

Seamless use of .gz files

marc2solr transparently sets up a gzip stream if the filename ends in .gz. I know it’s simple, but for me this is nice every day. I don’t want to have to pipe my ginormous files through zcat or whatever, esp. with MARC serializations that have meaningful line numbers for error reporting.
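
The trick is nothing more than checking the extension and wrapping the file handle before the MARC reader ever sees it; a minimal sketch:

transparent gunzip
require 'zlib'
require 'marc'

def open_marc_source(filename)
  if filename.end_with?('.gz')
    Zlib::GzipReader.open(filename)  # transparent gunzip stream
  else
    File.open(filename, 'r')
  end
end

reader = MARC::Reader.new(open_marc_source('records.mrc.gz'))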

Lots of options

marc2solr options
[vufind@mojito ~]$ marc2solr index --help
Options:
             --custom, -C <s>:   Any custom value you want. In a config file,
                                 use two String arguments (custom key value);
                                 on the command line use (--custom key=value)
                                 or (--custom key="three word value")
  --config, -c <filename/uri>:   Configuration file specifying options.
                                 Repeatable. Command-line arguments always
                                 override the config file(s)
              --benchmark, -B:   Benchmark production of each solr field
            --NObenchmark, -N:   Turn off benchmarking of each solr field
                 --dryrun, -y:   Don't send anything to solr
               --NOdryrun, -O:   Disable a previous 'dryrun' directive
              --printmarc, -r:   Print MARC Record (as text) to --debugfile
            --NOprintmarc, -i:   Turn off printing MARC Record (as text) to
                                 --debugfile
               --printdoc, -d:   Print each completed document to --debugfile
             --NOprintdoc, -n:   Turn off printing each completed document to
                                 --debugfile
          --debugfile, -e <s>:   Where to send output from --printmarc and
                                 --printdoc (takes filename, 'STDERR',
                                 'STDOUT', or 'NONE') (repeatable) (default:
                                 STDOUT)
              --clearsolr, -l:   Clean out Solr by deleting everything in it
                                 (DANGEROUS)
            --NOclearsolr, -a:   Disable a previous --clearsolr command
             --skipcommit, -S:   DON'T send solr a 'commit' afterwards
            --threads, -h <i>:   Number of threads to use to process MARC
                                 records (>1 => use 'threach') (default: 1)
        --sussthreads, -s <i>:   Number of threads to send completed docs to
                                 Solr (default: 1)
           --susssize, -u <i>:   Size of the document queue for sending to
                                 Solr (default: 128)
            --machine, -m <s>:   Name of solr machine (e.g., solr.myplace.org)
               --port, -p <i>:   Port of solr machine (e.g., '8088')
           --solrpath, -P <s>:   URL path to solr
                --javabin, -j:   Use javabin (presumes /update/bin is
                                 configured in schema.xml)
              --NOjavabin, -v:   Don't use javabin
            --logfile, -o <s>:   Name of the logfile (filename, 'STDERR',
                                 'DEFAULT', or 'NONE'). 'DEFAULT' is a file
                                 based on input file name (default: DEFAULT)
           --loglevel, -L <s>:   Level at which to log (DEBUG, INFO, WARN,
                                 ERROR, OFF) (default: INFO)
       --logbatchsize, -b <i>:   Write progress information to logfile after
                                 every N records (default: 25000)
          --indexfile, -x <s>:   The index file describing your specset
                                 (usually index.dsl)
                --tmapdir <s>:   Directory that contains any translation maps
              --customdir <s>:   The directory containing custom routine
                                 libraries (usually the 'lib' next to
                                 index.rb). Repeatable
           --marctype, -t <s>:   Type of marc file ('bestguess', 'strictmarc',
                                 'marcxml', 'alephsequential',
                                 'permissivemarc') (default: bestguess)
           --encoding, -g <s>:   Encoding of the MARC file ('bestguess',
                                 'utf8', 'marc8', 'iso') (default: bestguess)
                --gzipped, -z:   Is the input gzipped? An extension of .gz
                                 will always force this to true
                       --help:   Show this message