Created August 7, 2020 00:37


2019.05.29 TUL Cob Cob Prod Solr 3 Production Issues

Response Doc


Date

May 29-30, 2019, multiple incidents:

  • Wednesday May 29th, 1:04 PM-1:18 PM (13m 23s outage)
  • Wednesday May 29th, 9:24 PM-9:35 PM (12m 47s outage)
  • Thursday May 30th, 9:56 AM-10:03 AM (6m 41s outage)
  • Thursday May 30th, 11:36 AM-11:41 AM (4m 46s outage)
  • Thursday May 30th, 3:00 PM-3:09 PM (8m 43s outage)
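
For reference, the combined downtime across the five incidents can be tallied from the durations listed above. This is a minimal sketch; the variable names are illustrative, and the durations are copied from the list as written.

```python
from datetime import timedelta

# Outage durations as listed above (minutes, seconds).
outages = [
    timedelta(minutes=13, seconds=23),  # May 29, 1:04 PM
    timedelta(minutes=12, seconds=47),  # May 29, 9:24 PM
    timedelta(minutes=6, seconds=41),   # May 30, 9:56 AM
    timedelta(minutes=4, seconds=46),   # May 30, 11:36 AM
    timedelta(minutes=8, seconds=43),   # May 30, 3:00 PM
]

total = sum(outages, timedelta())  # combined downtime over both days
longest = max(outages)             # longest single outage

print(f"total downtime: {total}, longest outage: {longest}")
```

Each individual outage stayed under 15 minutes, while the combined downtime over the two days was 46m 20s.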


Authors

  • Christina


Status

  • Partial fix in place
  • Process & infrastructure cleanups queued


Summary

Because teams did not fully use the TUL COB Stage environment to review infrastructure & codebases staged for Production, a bad Solr box was put into Production.


Impact

Production LibrarySearch (TUL Cob) experienced multiple outages (see dates above), directly impacting users.

Root Causes

  • PMs & developers did not consistently follow the process of using Stage before Prod to test infrastructure & codebases queued for release to Production
  • The JVM on Solr Prod Box 3 had issues with swap space, leading to slow Solr responses & network response issues on the box
  • Private networking was turned on for the new box
  • TUL Cob has no elasticity for slow Solr responses
  • TUL Cob depends strongly on Solr, so we need to know right away when Solr has problems; managing this dependency well will also help as we add other Solr cores / dependencies
  • Infrastructure & application deployment pre-Christina followed no clear guidelines or specifications, so replicating it for new infrastructure resources was trial & error
    • Solr cores issue in playbook builds (how were the cores built? Was a mistake introduced so that these cores weren’t created afterwards?)


Trigger

  • Solr Prod 3 VM overwhelmed (?); network response slows down / freezes; TUL Cob app gets no response; Passenger backs up; outage occurs.


Resolution

  • Switch Production to Cob Prod Solr 1 or Cob Prod Solr 2
  • Full rebuild & try new operations parameters on Cob Prod Solr 3 box
  • Add timeout handling for Solr queries within Cob Prod Rails Application
  • Get TUL Cob team properly using Stage environment
  • Also test on QA or Stage
    • Replay Solr searches against Solr, based on the access logs
    • Test that it wasn’t a search pattern issue
    • This also starts to create a load test
  • Get full Dev Team to document infrastructure, Ops, DevOps specs & use Terraform going forward
  • Add HealthChecks for Solr services
  • Longterm: Change to SolrCloud cluster
  • Longterm: Networking & connectivity monitoring to help inform logs analysis & error reporting
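
The timeout-handling item above can be illustrated with a minimal sketch. It is written in Python rather than in the Rails application's own code, and the function name `solr_select`, the 2-second default, the `opener` parameter, and the choice to return `None` on failure are all illustrative assumptions, not the team's implementation. The idea is only that a slow or unreachable Solr should produce a quick, handled failure instead of a request that hangs until Passenger backs up.

```python
import json
import socket
import urllib.error
import urllib.parse
import urllib.request


def solr_select(base_url, params, timeout_s=2.0, opener=urllib.request.urlopen):
    """Query a Solr core's /select handler with a hard timeout.

    Returns the parsed JSON response, or None if Solr is slow or unreachable.
    base_url, timeout_s and the None fallback are hypothetical choices.
    """
    url = f"{base_url}/select?{urllib.parse.urlencode(params)}"
    try:
        # The timeout bounds how long the caller waits on the connection.
        with opener(url, timeout=timeout_s) as response:
            return json.load(response)
    except (socket.timeout, urllib.error.URLError, ConnectionError):
        # Degrade gracefully: the caller can render an error page right away.
        return None
```

A caller would check for `None` and show a "search is temporarily unavailable" page, which keeps application workers free while Solr recovers.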


Detection

  • Honeybadger health checks on TUL Cob
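
A health check aimed at the Solr services themselves, as proposed in the resolution list, might look roughly like the sketch below. It assumes each core exposes Solr's ping handler at `/admin/ping` and that a healthy core answers with a JSON body whose `status` is `"OK"`; the function names, the 2-second timeout, and the core URL passed in are hypothetical.

```python
import json
import urllib.request


def ping_status(payload):
    """Return True when a Solr ping response body reports status OK."""
    try:
        return json.loads(payload).get("status") == "OK"
    except (ValueError, AttributeError):
        # Not JSON, or JSON that is not an object: treat as unhealthy.
        return False


def check_core(core_url, timeout_s=2.0):
    """Ping one core; any network error or timeout counts as unhealthy."""
    try:
        with urllib.request.urlopen(
            f"{core_url}/admin/ping?wt=json", timeout=timeout_s
        ) as response:
            return ping_status(response.read())
    except OSError:
        # Covers timeouts, refused connections and HTTP errors.
        return False
```

Running `check_core` on a schedule against each Solr box and alerting on `False` would surface a struggling box before the application starts timing out.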

Action Items

  • Tim will reindex .60 solr (cob-prod-solr-2) with airflow-stage; (in progress)
  • David will create a PR to swap Prod Solr back to cob-prod-solr-1
    • Manually fix bashrc errors in QA + Stage
    • PR on QA with Prod Solr swaps + bash repairs
    • After that, deploy to Stage via main, then to Prod (tul_cob)
  • Christina will rebuild cob-prod-solr-3 after above swap & after grabbing logs from that machine

Lessons Learned

What went well:

  • Paired during the day on the fix - dkinzer / cmharlow / tb
  • Knew where to target fixes / issue info gathering / dkinzer
  • Honeybadger monitoring for tul cob / cmharlow

What went wrong:

  • Swapping Solr boxes was hard with the deployment process (because partially present) / dkinzer
  • Didn’t test the newly staged infrastructure / chad
  • Communication issue(s) around new patterns & stage environments / group discussion
  • Lost vacation day / christina
  • Still uncertain what happened in guts of solr around this / steven

Where we got lucky:

  • Occurred in summer & during a move, so traffic isn’t as high / jennifer
  • Outages were relatively short (less than 15 minutes each)

Questions or Comments overall

  • When to break process / dkinzer
  • Crystallized discussion on what on-call means in our team, what SLAs exist & are enforced / christina


  • Christina will take the resolutions above + make sure they are ticketed in relevant projects (if not already)

Supporting information

  • Team Values & Practices Discussion
    • Protocol for documentation PRs
      • Review PRs in meetings as group; then merge after
      • Decided to do group review of this
      • Make sure there are PR updates
    • Make sure relevant stuff is communicated out to PMs, managers, etc.
      • Also, GrittyOps can be used by others who want to open issues to be reviewed by the dev team

Title (incident number)






Root Causes




Action Items

Lessons Learned

What went well:

What went wrong:

Where we got lucky:


Supporting information

Template from: Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy. "Site Reliability Engineering."
