Configuring blacklight_dynamic_sitemap

For pdc_discovery, a Blacklight application at Princeton that improves findability for open access data sets, we want to publish a sitemap so that search engine crawlers can more easily index our content. Orangelight, Princeton's library catalog and also a Blacklight application, uses an older gem, blacklight-sitemap. However, blacklight-sitemap hasn't been updated in a while, and regenerating very large sitemaps via rake tasks is less than ideal: generation takes time, and the sitemaps go stale quickly. Given those drawbacks to our existing approach, I was excited to try the more recent solution in use at Stanford and Penn State (among others): blacklight_dynamic_sitemap.

Jack Reed, one of the authors of this solution, has a good blog post describing the strategy behind the gem; it's the place to start for a high-level overview of what's happening. To summarize: a single sitemap file may contain at most 50,000 URLs, so the gem splits the Solr index into chunks. The top-level sitemap stays within that 50k limit, and each of its entries links to a sub-sitemap representing one chunk of the overall document collection. See Jack's article for a much more thorough explanation.
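
Since the hash Solr stores is a hexadecimal string, the first character yields 16 evenly sized chunks, the first two yield 256, and so on. Here's a back-of-the-envelope sketch (my own illustration, not the gem's code) of how many prefix characters keep every chunk under the 50,000-URL limit:

  # Each additional hex character multiplies the number of chunks by 16.
  SITEMAP_LIMIT = 50_000

  def prefix_length_for(total_docs)
    length = 1
    length += 1 while total_docs.fdiv(16**length) > SITEMAP_LIMIT
    length
  end

  puts prefix_length_for(600_000)    # => 1 (16 chunks of ~37,500 docs each)
  puts prefix_length_for(5_000_000)  # => 2 (256 chunks of ~19,500 docs each)

The point is just that a one-character prefix already covers indexes of up to 800,000 documents.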

Implementing a dynamic sitemap this way requires, as one might expect, adding the blacklight_dynamic_sitemap gem to one's Blacklight application. That part of the process went smoothly and behaved exactly as described in the gem's README. However, getting it to work also requires configuring Solr to calculate a hash value for each document, and that part gave me some trouble. I'm documenting it here for my own future reference and in case it helps anyone else.
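
For reference, the basic install is just the gem plus a mounted Rails engine. A minimal sketch, assuming the engine is mounted at the application root (the exact mount point is an assumption here; check the gem's README for the recommended setup in your version):

  # Gemfile
  gem 'blacklight_dynamic_sitemap'

  # config/routes.rb -- mounting at the root is an assumption
  Rails.application.routes.draw do
    mount BlacklightDynamicSitemap::Engine, at: '/'
  end

After a bundle install, the engine serves the /sitemap routes without any further application code.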

I don't know exactly why, but the Solr configuration recommended in the blacklight_dynamic_sitemap README did not work for me. A slightly different configuration, adapted from Penn State's implementation of the same solution, did. The relevant stanza of my solrconfig.xml now looks like this:

  <!-- Hash each document's id with Lookup3 and store the result in the
       hashed_id_ssi field; overwriteDupes=false means documents with
       duplicate hashes are kept rather than deleted as duplicates. -->
  <updateProcessor class="solr.processor.SignatureUpdateProcessorFactory" name="add_hash_id">
    <bool name="enabled">true</bool>
    <str name="signatureField">hashed_id_ssi</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">id</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </updateProcessor>

  <!-- Wire the hashing processor into the default chain so it runs on
       every update request. -->
  <updateRequestProcessorChain name="cloud" processor="add_hash_id" default="true">
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.DistributedUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

After putting that Solr config in place and re-indexing my content, each record now has a field called hashed_id_ssi (the _ssi suffix maps onto a stored, indexed string dynamic field in a stock Blacklight schema, so no schema changes were needed). The first character of the hash determines which sitemap chunk the record appears in. A top-level sitemap is available at https://MY_APPLICATION_NAME/sitemap, and we're ready to set those indexing spiders loose on our data sets!
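
One way to confirm the hash is being generated is to pull a document back and inspect the field. A quick sketch using RSolr (the Solr URL and core name are placeholders for your own):

  require 'rsolr'

  # Placeholder URL -- point this at your own Solr core
  solr = RSolr.connect(url: 'http://localhost:8983/solr/my-core')

  response = solr.get('select', params: { q: '*:*', fl: 'id,hashed_id_ssi', rows: 1 })
  doc = response['response']['docs'].first
  puts doc['hashed_id_ssi'] # e.g. "a3f91c2e..."; the first character picks the chunk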

Many thanks to my excellent colleague Hector Correa for helping me solve this.
