Title

2019.05.29 TUL Cob Cob Prod Solr 3 Production Issues

Response Doc

Date(s)

May 29-30, 2019, multiple incidents:

  • Wednesday May 29th, 1:04 PM-1:18 PM (13m 23s outage)
  • Wednesday May 29th, 9:24 PM-9:35 PM (12m 47s outage)
  • Thursday May 30th, 9:56 AM-10:03 AM (6m 41s outage)
  • Thursday May 30th, 11:36 AM-11:41 AM (4m 46s outage)
  • Thursday May 30th, 3:00 PM-3:09 PM (8m 43s outage)

Authors

  • Christina

Status

  • Partial fix in place
  • Process & infrastructure cleanups queued

Summary

Because teams did not make full use of the TUL COB Stage environment to review infrastructure & codebases staged for Production, a bad Solr box was put into Production.

Impact

Production LibrarySearch (tul cob) experienced multiple outages (see dates above), directly impacting users.

Root Causes

  • PMs & developers did not consistently follow the process of testing infrastructure & codebases in Stage before releasing them to Production
  • The JVM on Solr Prod Box 3 had issues with SWAP space, leading to slow Solr responses & network response issues on the box (see the swap-check sketch after this list)
  • Private networking was turned on for the new box
  • TUL Cob has no resilience to slow Solr responses (no timeouts around Solr requests)
  • The Solr dependency is strong, so we need to know about Solr problems right away; this will be even more important to manage as we add other Solr cores / dependencies
  • Infrastructure & application deployment that predates Christina follows no clear guidelines or specifications, so replicating it on new infrastructure resources involves trial & error
    • Solr cores issue in playbook builds (how were they built? Was a mistake introduced so that these cores weren’t created afterwards?)
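
To make the swap-pressure root cause concrete, below is a minimal sketch (not part of our tooling) of the kind of check that would have surfaced the problem on the box. It assumes Python 3 with the psutil package installed and that Solr runs under a process named `java`; the warning threshold is arbitrary.

```python
#!/usr/bin/env python3
"""Rough check for swap pressure on a Solr box (illustrative sketch only)."""
import psutil

def report_memory():
    vm = psutil.virtual_memory()
    sw = psutil.swap_memory()
    print(f"RAM used: {vm.percent}%  swap used: {sw.percent}%")

    # Report resident memory for JVM processes (assumes Solr runs as a `java` process).
    for proc in psutil.process_iter(["name", "memory_info"]):
        if proc.info["name"] == "java":
            rss_mb = proc.info["memory_info"].rss / 1024 ** 2
            print(f"java pid={proc.pid} resident memory: {rss_mb:.0f} MB")

    # Crude threshold just for illustration; real alerting belongs in monitoring.
    if sw.percent > 10:
        print("WARNING: significant swap usage -- the JVM heap may exceed available RAM")

if __name__ == "__main__":
    report_memory()
```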

Trigger

  • Solr Prod 3 VM becomes overwhelmed (exact mechanism unconfirmed); network responses slow down / freeze; the TUL Cob app gets no response from Solr; Passenger backs up; an outage occurs.

Resolution

  • Switch Production to Cob Prod Solr 1 or Cob Prod Solr 2
  • Fully rebuild the Cob Prod Solr 3 box & try new operational parameters
  • Add timeout handling for Solr queries within the Cob Prod Rails application (see the timeout sketch after this list)
  • Get the TUL Cob team properly using the Stage environment
  • Also test on QA or Stage (see the replay sketch after this list)
    • Recreate Solr searches against Solr based on access logs
    • Test that it wasn’t a search pattern issue
    • This also starts to create a load test
  • Get the full Dev Team to document infrastructure, Ops, & DevOps specs & to use Terraform going forward
  • Add health checks for the Solr services (see the health-check sketch after this list)
  • Long term: change to a SolrCloud cluster
  • Long term: add networking & connectivity monitoring to help inform log analysis & error reporting
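
A minimal sketch of the timeout-handling idea above. The production app is Rails, so this Python version only illustrates the pattern (bound the time spent waiting on Solr and fail gracefully instead of letting Passenger back up); the Solr URL, core name, and timeout value are placeholders.

```python
import requests

# Placeholder URL & core name -- not our real configuration.
SOLR_SELECT_URL = "http://cob-prod-solr-1.example.edu:8983/solr/example-core/select"

def solr_search(query, timeout_seconds=3):
    """Query Solr, but give up quickly instead of letting requests pile up."""
    try:
        resp = requests.get(
            SOLR_SELECT_URL,
            params={"q": query, "wt": "json"},
            timeout=timeout_seconds,  # applied to both the connect and the read phase
        )
        resp.raise_for_status()
        return resp.json()
    except requests.exceptions.Timeout:
        # This is where the Rails app would render a friendly
        # "search is temporarily unavailable" response instead of hanging.
        return None
    except requests.exceptions.RequestException:
        return None
```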
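
For the access-log replay item, a rough sketch of what such a replay / load-test script could look like. It assumes the Solr request paths have already been extracted from the access logs into a plain text file (one path with query string per line); the QA host name and file name are hypothetical.

```python
import time
import requests

SOLR_BASE = "http://cob-qa-solr-1.example.edu:8983"  # hypothetical QA/Stage host

def replay(paths_file, timeout_seconds=10):
    """Replay previously captured Solr request paths and record how long each takes."""
    timings = []
    with open(paths_file) as fh:
        for line in fh:
            path = line.strip()
            if not path:
                continue
            start = time.monotonic()
            try:
                requests.get(SOLR_BASE + path, timeout=timeout_seconds)
                elapsed = time.monotonic() - start
            except requests.exceptions.RequestException:
                elapsed = None  # timed out or failed outright
            timings.append((path, elapsed))

    slow = [t for t in timings if t[1] is None or t[1] > 1.0]
    print(f"{len(timings)} requests replayed, {len(slow)} slow or failed")
    return timings

if __name__ == "__main__":
    replay("solr_request_paths.txt")  # hypothetical extract from the access logs
```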
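
For the Solr health-check item, a small sketch that polls Solr's standard ping handler (`/solr/<core>/admin/ping`) on each box; the host names and core name below are placeholders, not our real configuration.

```python
import requests

SOLR_HOSTS = [  # placeholders, not our real hostnames
    "http://cob-prod-solr-1.example.edu:8983",
    "http://cob-prod-solr-2.example.edu:8983",
    "http://cob-prod-solr-3.example.edu:8983",
]
CORE = "example-core"  # placeholder core name

def solr_is_healthy(host, core=CORE, timeout_seconds=5):
    """Return True if the core's ping handler answers OK within the timeout."""
    url = f"{host}/solr/{core}/admin/ping"
    try:
        resp = requests.get(url, params={"wt": "json"}, timeout=timeout_seconds)
        return resp.ok and resp.json().get("status") == "OK"
    except (requests.exceptions.RequestException, ValueError):
        return False

if __name__ == "__main__":
    for host in SOLR_HOSTS:
        print(host, "OK" if solr_is_healthy(host) else "FAILING")
```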

Detection

  • Honeybadger health checks on TUL Cob

Action Items

  • Tim will reindex the .60 Solr box (cob-prod-solr-2) with airflow-stage (in progress; see the reindex DAG sketch after this list)
  • David will create a PR to swap Prod Solr back to cob-prod-solr-1
    • Manually fix the bashrc errors in QA + Stage
    • PR on QA with the Prod Solr swap + bash repairs
    • After that, deploy to Stage via main, then to Prod (tul_cob)
  • Christina will rebuild cob-prod-solr-3 after the above swap & after grabbing logs from that machine
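
For the Airflow reindex action item, a heavily simplified sketch of a manually triggered reindex DAG (Airflow 2.x style). The DAG id and the bash command are hypothetical placeholders and not the actual airflow-stage pipeline.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical reindex DAG, run only when triggered by hand;
# the real airflow-stage pipeline differs.
with DAG(
    dag_id="cob_solr_reindex_sketch",
    start_date=datetime(2019, 5, 30),
    schedule_interval=None,  # no schedule: manual trigger only
    catchup=False,
) as dag:
    reindex = BashOperator(
        task_id="reindex_cob_prod_solr_2",
        # Placeholder command; swap in the real indexing invocation.
        bash_command="echo 'reindex cob-prod-solr-2 here'",
    )
```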

Lessons Learned

What went well:

  • Paired on the fix on the day of the incident - dkinzer / cmharlow / tb
  • Knew where to target fixes / issue info gathering / dkinzer
  • Honeybadger monitoring for tul cob / cmharlow

What went wrong:

  • The Solr swap was hard with the deployment process (because it is only partially in place) / dkinzer
  • Didn’t test the new infrastructure in Stage / chad
  • Communication issue(s) around new patterns & stage environments / group discussion
  • Lost vacation day / christina
  • Still uncertain what happened in guts of solr around this / steven

Where we got lucky:

  • Occurred in summer & during a move, so traffic wasn’t as high as usual / jennifer
  • Outages were relatively short (each under 15 minutes)

Questions or Comments overall

  • When to break process / dkinzer
  • Crystallized discussion of what on-call means on our team, & which SLAs exist & are enforced / christina

Timeline

  • Christina will take the resolutions above + make sure they are ticketed in the relevant projects (if not already)

Supporting information

  • Team Values & Practices Discussion
    • Protocol for documentation PRs
      • Review PRs in meetings as a group, then merge after
      • Decided to do a group review of this doc
      • Make sure there are PR updates
    • Make sure relevant stuff is communicated out to PMs, managers, etc.
      • Also, GrittyOps can be used by others who want to open issues to be reviewed by the dev team

Title (incident number)

Date

Authors

Status

Summary

Impact

Root Causes

Trigger

Resolution

Detection

Action Items

Lessons Learned

What went well:

What went wrong:

Where we got lucky:

Timeline

Supporting information

Template from: Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy, "Site Reliability Engineering."
