@jrochkind
Last active December 19, 2015 07:39
Notes on a jruby solr marc indexing api/dsl

Just an initial sketch; I'm SURE that as I actually started to implement, some details would change -- can't spec it all out in advance iron-clad.

I don't know what to call it; let's say sindexer.

From the command line, you'd run:

sindex config_file.rb config_file2.rb -i input_marc.mrc

You can have multiple config files listed. Config files are plain old ruby (a DSL), although they don't necessarily need to look like that. If there's no -i argument, it reads from STDIN.

The sindex command line is just a light wrapper over an SIndex::Indexer class; maybe it does something like:

indexer = SIndex::Indexer.new
config_files.each do |config_file|
  # eval the file's source in the context of the indexer instance;
  # a bare `load` inside an instance_eval block would actually run
  # the file at top-level, not against the indexer
  indexer.instance_eval(File.read(config_file), config_file, 1)
end
indexer.index!

That is, the config files are just instance_eval'd in the context of the Indexer, so their 'dsl' is just methods defined in Indexer.
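As a sketch of what such an Indexer might look like (the class name SIndex::Indexer is from above, but the internals here, like the @index_steps array and a map_record method, are just my assumptions, not a settled design):

```ruby
# Hypothetical sketch of the Indexer that config files are
# instance_eval'd against. Context is left out of this sketch.
module SIndex
  class Indexer
    def initialize
      @index_steps = [] # [solr_field_name, block] pairs, in definition order
    end

    # the DSL method config files call
    def index_to(field_name, &block)
      @index_steps << [field_name, block]
    end

    # map one record to an output hash, running every step in order
    def map_record(record)
      output_hash = {}
      @index_steps.each do |field_name, block|
        accumulator = []
        block.call(accumulator, record, nil)
        (output_hash[field_name] ||= []).concat(accumulator)
      end
      output_hash
    end
  end
end
```

Because index_to just appends to an array, definition order is preserved for free.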

So, the actual important part, the config file:

# The first thing that's in the config file is key/value configuration; we'll
# look at that first.

# A FLAT hash of configuration key/values. FLAT, not nested hashes.
# Use dots for hierarchy.

configuration do
  set "solr.url", "http://somewhere.tld/solr"
  # TRANSPARENCY in referring to things that affect SolrJ,
  # there's probably a default here, but....
  set "solrj.solr_server_class", "StreamingUpdateSolrServer"
  
  # There could sometimes be lambdas in config too?
end  

You can also set/override config from the command line:

sindex config_file.rb -c solrj.solr_server_class=StreamingUpdateSolrServer

Although obviously that doesn't work with lambdas. But that's part of why we allow multiple config files on the command line: you can mix and match them. There might also be other command line flags that are shortcuts for certain -c keys.
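Parsing those -c overrides could be done with plain stdlib OptionParser; a minimal sketch (the flag shape is from above, the variable names are mine):

```ruby
require 'optparse'

overrides = {}
parser = OptionParser.new do |opts|
  # each -c KEY=VALUE becomes one entry in the flat config hash
  opts.on("-c KEY=VALUE", "set/override a configuration key") do |pair|
    key, value = pair.split("=", 2)
    overrides[key] = value
  end
end

# parse returns whatever's left over -- the config file names
config_files = parser.parse(["config_file.rb", "-c", "solr.url=http://somewhere.tld/solr"])
```

The overrides hash would then just be merged over whatever the config files set.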

Okay, but the actual meat: how to do the indexing mapping! The base operation is a method that takes a block (but there can be DSL on top of this):

#in a sindex config file

index_to "solr_field_name" do |accumulator, record, context|
  # I'm not paying attention to the actual api of marc4j/ruby-marc here, just pseudocoding it
  
  accumulator <<  record['245']['h'].upcase  
  accumulator <<  record['700']['a']  
  accumulator.uniq!  # it's just ruby, mutate that array!
end

The index_to method gives the block an accumulator array; at the end of the block it just takes everything in the accumulator and adds it to its own internal hash['solr_field_name_provided_as_argument'].

You can call index_to multiple times with the same solr field name; they just get added to each other. index_to's end up called in order of definition, guaranteed.
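The accumulate-and-merge behavior can be shown in a few lines of plain ruby (just a demonstration of the semantics, not the implementation):

```ruby
# each index_to call's accumulator gets concat'd onto the field's
# existing values, so repeated calls for one field just append
output_hash = Hash.new { |hash, key| hash[key] = [] }

append = lambda do |field, accumulator|
  output_hash[field].concat(accumulator)
end

append.call("author", ["Melville, Herman"])
append.call("author", ["Hawthorne, Nathaniel"])
# output_hash["author"] now holds both values, in call order
```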

The third 'context' parameter to the block is an SIndex::Context object that gives you a bunch of stuff. It's cleared out for each new record iterated through; it's the record-specific context:

  • context.output_hash --> the hash keyed by solr field names of what will eventually get sent to solr; you can look at what's there already and base your logic on it if you want
  • context.accumulator, context.record --> Same as the first and second block args.
  • context.clipboard --> A hash that's initially empty, for your own use. For storing expensive-to-compute things that you might need to use in multiple index_tos. Say, a mapping to your internal 'content type' vocabulary.
  • Maybe more stuff, I dunno. Part of the point of putting a Context as the third arg is that it's extensible without changing the number of args.
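A Context could be nothing more than a small value object; this sketch follows the bullets above, but the attribute set isn't settled:

```ruby
# Hypothetical per-record context; a fresh one (or a cleared one)
# would be handed to the index_to blocks for each record.
module SIndex
  class Context
    attr_accessor :record, :accumulator
    attr_reader :output_hash, :clipboard

    def initialize
      @output_hash = {} # solr_field_name => array of values headed to solr
      @clipboard   = {} # scratch space for the config file's own use
    end
  end
end
```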

Okay, so the basic logical building block is that method that takes a block like that. But we can provide re-usable "macros", by providing existing lambdas as the index_to block.

For instance, yeah, it's inconvenient to deal with marc records manually like that.

There are a few different ways we could do that, I'm still thinking through it.

module MarcMacros
  def extract_marc(marc_spec)
    # you know marc_spec is like "245abh:700abc"
    # the magic of lambda: we _return_ a lambda that takes
    # those three args from the `index_to` block
    return lambda do |accumulator, record, context|
      # it's a closure, we can use `marc_spec` here, just do it
      marc_spec.split(":").each do |field|
        accumulator << record[field] # yeah, this is an over-simplification
      end
    end
  end
end

# Now somehow MarcMacros gets mixed in to the SIndex::Indexer, and
# in the config file you can do this; it's just plain old ruby now,
# since extract_marc returns a lambda

index_to 'title', &extract_marc("245abc:240abc")

## Or maybe we make `index_to` take an optional second parameter
# that's a lambda, to avoid the confusing `&` stuff:

index_to 'title', extract_marc("245abc:240ab")

# With that second way of doing it, you can even combine-em to do
# some post-processing:

index_to 'title', extract_marc("245ab") do |acc, record, context|
  # some post processing of the accumulator, why not!
end
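With that second form, index_to would just run the lambda and then the block over the same accumulator. A sketch of that step-chaining (the run_steps helper is my invention, purely to show the calling convention):

```ruby
# run a list of lambda/proc steps in order, all mutating one accumulator
def run_steps(steps, record, context)
  accumulator = []
  steps.compact.each { |step| step.call(accumulator, record, context) }
  accumulator
end

extract = lambda { |acc, rec, ctx| acc << rec[:title] }
post    = lambda { |acc, rec, ctx| acc.map!(&:upcase) } # the post-processing "block"

result = run_steps([extract, post], { title: "moby dick" }, nil)
# result => ["MOBY DICK"]
```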

Some of these kinds of 'macros' can be built in. Others can be distributed in gems. Others can be defined locally just for your local use.

Macros can keep simple common use cases not much more complicated to write than they were in SolrMarc, even for people who aren't rubyists. But since the macros are just built on the basic building block of passing around lambdas, it's easy for the rubyist to break out to full ruby to make it do exactly what you want.

In general, keep as much of it plain ruby as possible, just modules extend'd into the Indexer.
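Mixing a macro module into an indexer instance is one-line ruby; a sketch with throwaway demo names (UpcaseMacros, DemoIndexer):

```ruby
module UpcaseMacros
  # returns a lambda in the same shape index_to expects
  def upcase_all
    lambda { |accumulator, record, context| accumulator.map!(&:upcase) }
  end
end

class DemoIndexer; end

indexer = DemoIndexer.new
# after extend, config files instance_eval'd against this indexer
# can call upcase_all as a bare method
indexer.extend(UpcaseMacros)

step = indexer.upcase_all
values = ["moby dick"]
step.call(values, nil, nil)
# values => ["MOBY DICK"]
```

Object#extend adds the module to just that one instance, which fits "other macros defined locally just for your local use".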

Other methods the indexer might support in a config file (remember the config file is just instance_eval'd in an Indexer):

before_record do |record, context|
  # before any of the index_to's are run, you can set things up if you want, pre-calc,
  # whatever. Note this is also a place you could put logic in that needs to do
  # calculations that determine _multiple_ output fields, you can just
  # write em directly to context.output_hash
end

after_record do |record, context|
  # after all the index_to's
end
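How the indexer might store and run those hooks around the index_to steps (HookDemo is a stand-in name; the before_record/after_record methods are from the notes above):

```ruby
class HookDemo
  def initialize
    @before_hooks = []
    @after_hooks  = []
  end

  def before_record(&block); @before_hooks << block; end
  def after_record(&block);  @after_hooks  << block; end

  def process(record, context)
    @before_hooks.each { |hook| hook.call(record, context) }
    # ... all the index_to steps would run here, in between ...
    @after_hooks.each  { |hook| hook.call(record, context) }
  end
end

demo = HookDemo.new
order = []
demo.before_record { |rec, ctx| order << :before }
demo.after_record  { |rec, ctx| order << :after }
demo.process(nil, nil)
# order => [:before, :after]
```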

There would be macros for 'translation map' kind of stuff.
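A translation-map macro could follow the same return-a-lambda pattern as extract_marc. A sketch, with the map passed inline as a hash (in real life it would be loaded from a source file found on the load path):

```ruby
# returns a lambda that maps each value already in the accumulator
# through the translation hash; unmapped values pass through unchanged
def translate_map(mapping)
  lambda do |accumulator, record, context|
    accumulator.map! { |value| mapping.fetch(value, value) }
  end
end

step = translate_map("eng" => "English", "fre" => "French")
values = ["eng", "ger"]
step.call(values, nil, nil)
# values => ["English", "ger"]
```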

Just use the ordinary ruby LOAD_PATH to look up translation map source files, as well as macros? Use ordinary ruby require for loading macro modules? Perhaps a command line flag, as well as a dsl method in config, for modifying the ruby load path, so you don't have to deal with weird ruby punctuation variables, or so you can add paths relative to the config file, etc.

Logging

Would probably use an ordinary ruby Logger, with the usual levels (debug, info, etc.). Can change logging level in config/command line, as well as set a log file. (With an option for sending logging just to stderr; for integrating in unix pipelines that's a LOT more convenient. Maybe send to stderr AND a log file?)
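The stderr case is just stock stdlib Logger (the level chosen here is arbitrary; it would come from config or a command line flag):

```ruby
require 'logger'

# stderr keeps stdout clean for pipeline output
logger = Logger.new($stderr)
logger.level = Logger::WARN

logger.debug("suppressed at WARN level")
logger.warn("this one is emitted")
```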

Output

Some abstraction over output, so it's easy to output to a file instead of Solr -- a file of exact Solr XML, or of json of our output maps, or whatever.
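One way that abstraction could look: each writer implements put(output_hash) and close, and the indexer doesn't care which one it has. A sketch of a json-lines writer (class name and interface are my guesses):

```ruby
require 'json'
require 'stringio'

class JsonWriter
  def initialize(io)
    @io = io
  end

  # one line of JSON per record's output hash
  def put(output_hash)
    @io.puts JSON.generate(output_hash)
  end

  def close
    @io.close
  end
end

# a StringIO stands in for a real file here
io = StringIO.new
writer = JsonWriter.new(io)
writer.put("title" => ["Moby Dick"])
# io.string => %({"title":["Moby Dick"]}) plus a trailing newline
```

A SolrWriter with the same two methods would send the hash to SolrJ instead.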
