@billdueber
Created May 5, 2010 13:16

Timing JRuby-based MARC indexing vs solrmarc v2.0

The data and processing

My MARC indexing does the following processing (working with marc4j records):

  • normal fields. 29 fields that can be described using nothing but the normal solrmarc syntax. This might be extensive (15-20 tags, lookups in either hashes or sets of regexp matches for transformation) but doesn't require custom code. This generic processor is written in Ruby.
  • custom fields. 10 (or 14 if it's a serial) fields that require custom code. These are also all Ruby.
  • all fields. The single "get all the text in all the fields numbered 10 to 999" method. Ruby.
  • xml. Turn the record into a MARC-XML string. This uses JRuby to get a java.io.StringWriter and a javax.xml.transform.stream.StreamResult, then calls marc4j's MarcXmlWriter to write to it (see the sketch after this list). Which, I just looked, isn't exactly how solrmarc does it; I'll have to benchmark it both ways. Both Ruby and Java.
  • HLB. Take a record, find all the callnumbers, normalize them, do a lookup against a set of callnumber ranges, and return all ranges in which each callnumber falls. All in Java (exact same .jar file used in solrmarc and called from JRuby).
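
Here's roughly what that xml step looks like from JRuby. This is a minimal sketch, assuming the marc4j jar is already loaded; the method name record_to_xml is just for illustration:

    require 'java'

    java_import 'java.io.StringWriter'
    java_import 'javax.xml.transform.stream.StreamResult'
    java_import 'org.marc4j.MarcXmlWriter'

    # Turn a single marc4j Record into a MARC-XML string
    def record_to_xml(record)
      sw     = StringWriter.new
      writer = MarcXmlWriter.new(StreamResult.new(sw)) # write via a javax Result
      writer.write(record)
      writer.close # close() finishes the document (closing collection element)
      sw.to_s
    end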

The test case is 150,000 records that I just pulled out of a recent dump.

A word on "Single"-threading

I can't actually run my code with only one thread, due to the way solrj.StreamingUpdateSolrServer (SUSS) works. These numbers represent a single thread doing the read-marc-from-file (Aleph Sequential MARC) and process-into-a-solr-doc work, and a second thread whose job is to send stuff from the SUSS queue to solr itself.

Because of this, the cost of sending the documents to solr will be masked (total time added will likely drop to near zero) in my base case with two threads.

Indexing all my fields

solrmarc is running with the direct-to-disk indexing (not via http). JRuby is using StreamingUpdateSolrServer over HTTP. The indexing process runs on the same machine the solr process is running on.

The 8-thread run is 6 for processing and 2 for sending stuff to Solr (i.e., I passed the number "2" in the number-of-threads slot to the SUSS constructor).
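
In code, that constructor call looks something like this (a sketch; the URL, queue size, and document values are placeholders):

    require 'java'

    java_import 'org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer'
    java_import 'org.apache.solr.common.SolrInputDocument'

    # (url, queueSize, threadCount): 2 background threads drain the queue
    suss = StreamingUpdateSolrServer.new('http://localhost:8983/solr', 100, 2)

    doc = SolrInputDocument.new
    doc.addField('id', '000000001')
    suss.add(doc) # returns quickly; the background threads do the HTTP work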

Original code

Threads      1     2     8
solrmarc   310     -     -
jruby        -   240   617

Results are in records/second. Higher is better.

JRuby with 2 threads runs here about 75% the speed of solrmarc with a single thread (with, again, the caveat that the send-stuff-to-solr cost is probably almost completely masked).

For the record, when doing a full run, solrmarc generally reports speeds more in the 275 records/second range.

After optimizing Array methods

I had an extra call to .flatten.compact which ran on every field as it was derived. I removed it and changed the basic code to use #uniq!, #flatten!, and #compact! instead of their non-bang counterparts.
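
The difference is mostly allocation: the non-bang methods each return a brand-new array, while the bang versions mutate in place. One gotcha, though: the bang versions return nil when there's nothing to do, so they can't be chained. An illustrative sketch (vals stands in for whatever array of field values is being built):

    # Before: every call in the chain allocates a new array
    vals = vals.flatten.compact.uniq

    # After: mutate in place, one statement per call, since e.g.
    # [1, 2, 3].uniq! returns nil when the array is already unique
    vals.flatten!
    vals.compact!
    vals.uniq!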

Threads      1     2     8
solrmarc   310     -     -
jruby        -   312   803

Now the JRuby code with two threads is on par with the single-threaded solrmarc.

Removing HLB from the indexing

Because the HLB code is (a) all Java, and (b) very expensive, it will tend to mask the differences between the two systems (and, because I'm the only one doing HLB, will make the numbers less valuable to non-me people).

Here's the same run but without any HLB processing

Original code

Threads      1     2     8
solrmarc   384     -     -
jruby        -   254   684

Here JRuby is running at 66% the speed of solrmarc when using just two threads.

It looks like I'm maxing out the jruby speed in some way -- either running out of threads (we have several solr processes running on that machine), maxing out how fast the two threads can push stuff to solr, or maybe even hitting the limit of how fast solr can ingest the stuff (since I'm making Solr do a fair bit of processing during the indexing phase via pattern filters and such).

How about for a full run?

I indexed a recent full dump of 6,917,324 records and got an overall pace of 838 records/second, with the run taking just under 2.5 hours. That's about 50K records/minute, or a little over 3 million records/hour.

Where does the time go?

I ran the 2-thread JRuby version and benchmarked how long things took (see above for what each type of processing step entails):
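
One way to gather per-step numbers like these is plain wall-clock accumulation, something like this sketch (the step labels and process_* methods are stand-ins, not my actual code):

    # Accumulate wall-clock seconds per processing step
    timings = Hash.new(0.0)

    def timed(timings, label)
      start = Time.now
      result = yield
      timings[label] += Time.now - start
      result
    end

    reader.each do |record|
      timed(timings, 'Normal fields') { process_normal_fields(record) }
      timed(timings, 'Custom fields') { process_custom_fields(record) }
      # ...and likewise for allfields, HLB, and to_xml
    end

    total = timings.values.inject(0.0) { |sum, secs| sum + secs }
    timings.each do |label, secs|
      printf("%-25s %7.0f  %5.1f%%\n", label, secs, 100.0 * secs / total)
    end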

Original code

Processing                 Seconds   %total
Normal fields                  302    56.2%
Custom fields                   73    13.6%
Single "allfields" field        36     6.7%
HLB                             73    13.6%
to_xml                          53     9.9%

After optimizing Array methods

Processing                 Seconds   %total
Normal fields                  166    41.7%
Custom fields                   72    18.1%
Single "allfields" field        34     8.5%
HLB                             72    18.1%
to_xml                          54    13.6%

HLB is already pretty damn efficient; I'm not sure how much I could gain there. But the to_xml and the allfields calls are probably ripe for a little optimization.
