[13:14] == SolrUser [6020d824@gateway/web/freenode/ip.96.32.216.36] has joined #solr
[13:14] == shellac [~textual@vpn-user-249-054.nomadic.bris.ac.uk] has joined #solr
[13:14] == clyon_ [~clyon@69-12-169-18.dedicated.static.sonic.net] has joined #solr
[13:15] <SolrUser> Hi, so I am trying to index documents using the ConcurrentUpdateSolrClient, but even with different thread counts I do not notice any change in indexing time?
[13:16] <SolrUser> is there something i am not doing right?
[13:20] <@hoss> SolrUser: it's possible that the code you have feeding docs to your SolrClient is itself the bottleneck
[13:20] <@hoss> in general, i'm not really a fan of ConcurrentUpdateSolrClient -- it doesn't do what most people expect, and its error handling mechanics are confusing
[13:20] <SolrUser> okay
[13:21] <@hoss> i haven't done extensive testing to prove it, but i'm pretty sure that with a well configured HttpClient using aggressive keep-alive options, you can get HttpSolrClient to be just as efficient
[13:21] <@hoss> where CUSC is *supposed* to be helpful is that you can "throw documents" at it w/o worrying about your client threads waiting for the response from solr
[13:22] == Guest4 [~textual@77.16.69.51.tmi.telenormobil.no] has joined #solr
[13:22] <@hoss> but as its javadocs note, it buffers docs and sends them in a single connection -- arguably to reduce network overhead of lots of connections
[13:22] <SolrUser> what would you recommend I use for threaded indexing?
[13:22] == Brig_ [4227a68a@gateway/web/freenode/ip.66.39.166.138] has joined #solr
[13:22] <@elyograg> CUSC is good for initial bulk loading where you don't care about knowing whether the indexing worked.
[13:23] <@elyograg> SolrUser: handling multiple threads yourself and using either HttpSolrClient or CloudSolrClient.
[13:23] <SolrUser> okay
[13:23] == bauruine [~bauruine@2a01:4f8:130:8285:fefe::36] has quit [Max SendQ exceeded]
[13:24] <@hoss> like i said: i suspect if you had an HttpSolrClient, and you made sure the underlying HttpClient had a high max connections per host (or whatever it's called these days) and long keep-alive options, you could probably ramp up the number of threads using a single instance and see faster indexing (until either your client CPUs or solr CPUs get saturated)
[13:24] <SolrUser> okay
[13:25] <@elyograg> Something I've been working on is a design with a central queue with multiple consumer threads pulling things off the queue, building document lists, and sending requests ... with one or more threads adding to the queue.
[13:25] <SolrUser> that would sure be great
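Editor's note: the central-queue design elyograg describes could look roughly like the sketch below. It uses only the JDK. The class name `QueueIndexer`, the `index` method, the queue capacity, and the `sender` callback are all hypothetical, invented for illustration; in a real indexer the callback would presumably wrap a call on one shared HttpSolrClient or CloudSolrClient. Error handling is deliberately minimal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

public class QueueIndexer {
    // Sentinel object; compared by identity, so it can never collide with a real document.
    private static final String POISON = new String("__END__");

    /**
     * Feeds docs through a bounded queue to `consumers` threads. Each thread
     * groups documents into lists of up to `batchSize` and hands every list to
     * `sender` (a stand-in for the code that sends a request to Solr).
     * Returns the number of documents handed to `sender`.
     */
    public static int index(List<String> docs, int consumers, int batchSize,
                            Consumer<List<String>> sender) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1000); // capacity is an arbitrary choice
        AtomicInteger sent = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(consumers);

        for (int i = 0; i < consumers; i++) {
            pool.submit(() -> {
                List<String> batch = new ArrayList<>(batchSize);
                try {
                    while (true) {
                        String doc = queue.take();      // blocks until a document is available
                        if (doc == POISON) break;       // producer is finished
                        batch.add(doc);
                        if (batch.size() == batchSize) {
                            sender.accept(new ArrayList<>(batch));
                            sent.addAndGet(batch.size());
                            batch.clear();
                        }
                    }
                    if (!batch.isEmpty()) {             // flush the final partial batch
                        sender.accept(batch);
                        sent.addAndGet(batch.size());
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                // Note: an exception thrown by sender ends this consumer silently;
                // real code would catch it and record or retry the failed batch.
            });
        }

        try {
            for (String doc : docs) queue.put(doc);                   // producer side
            for (int i = 0; i < consumers; i++) queue.put(POISON);    // one sentinel per consumer
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            pool.shutdownNow();
        }
        return sent.get();
    }
}
```

A call such as `QueueIndexer.index(docs, 4, 500, batch -> sendToSolr(batch))`, where `sendToSolr` is a placeholder you would write yourself, runs four consumer threads with batches of up to 500 documents. Because each thread sees the outcome of its own request, failures can be detected per batch.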
[13:27] == Brig_ [4227a68a@gateway/web/freenode/ip.66.39.166.138] has quit [Ping timeout: 260 seconds]
[13:27] <@elyograg> like hoss says, if the HttpClient object is configured right, one SolrClient object can handle many threads.
[13:27] <SolrUser> One more question: Should indexing time be constant or should it vary? I am observing varying indexing times for the exact same data set on the same machine, and the variation is huge.
[13:28] <@elyograg> the defaults for HttpClient objects only allow two connections per host, so only two threads can be sending at the same time.
[13:29] <SolrUser> hmm, interesting, i didn't know that
[13:30] <@elyograg> if you're not committing as part of the update, times should be somewhat similar, but variation can happen. If you're committing, it could be extremely variable.
[13:30] == bauruine [~bauruine@2a01:4f8:130:8285:fefe::36] has joined #solr
[13:31] <SolrUser> Humm.