Title

2019.05.29 TUL Cob Cob Prod Solr 3 Production Issues

Response Doc

Date(s)

May 29-30, 2019, multiple incidents:

  • Wednesday May 29th, 1:04 PM-1:18 PM (13m 23s outage)
  • Wednesday May 29th, 9:24 PM-9:35 PM (12m 47s outage)
  • Thursday May 30th, 9:56 AM-10:03 AM (6m 41s outage)
  • Thursday May 30th, 11:36 AM-11:41 AM (4m 46s outage)
  • Thursday May 30th, 3:00 PM-3:09 PM (8m 43s outage)

Authors

  • Christina

Status

  • Partial fix in place
  • Process & infrastructure cleanups queued

Summary

Because teams did not make full use of the TUL COB Stage environment to review infrastructure & codebases staged for Production, a bad Solr box was put into Production.

Impact

Production LibrarySearch (tul cob) experienced multiple outages (see dates above), directly impacting users.

Root Causes

  • PMs & developers did not consistently follow the process of testing infrastructure & codebases in Stage before releasing them to Production
  • The JVM on Solr Prod Box 3 had issues with SWAP space, leading to slow Solr responses & network response issues on the box (see the swap-check sketch after this list)
  • Private networking was turned on for the new box
  • TUL Cob has no resilience to slow Solr responses (no timeouts around Solr requests)
  • The Solr dependency is strong, so we need to know about Solr problems right away; this will be even more important to manage as we add other Solr cores / dependencies
  • Infrastructure & application deployment that predates Christina follows no clear guidelines or specifications, so replicating it on new infrastructure resources involves trial & error
    • Solr cores issue in playbook builds (how were they built? Was a mistake introduced so that these cores weren’t created afterwards?)
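
To make the swap-pressure root cause concrete, below is a minimal sketch (not part of our tooling) of the kind of check that would have surfaced the problem on the box. It assumes Python 3 with the psutil package installed and that Solr runs under a process named `java`; the warning threshold is arbitrary.

```python
#!/usr/bin/env python3
"""Rough check for swap pressure on a Solr box (illustrative sketch only)."""
import psutil

def report_memory():
    vm = psutil.virtual_memory()
    sw = psutil.swap_memory()
    print(f"RAM used: {vm.percent}%  swap used: {sw.percent}%")

    # Report resident memory for JVM processes (assumes Solr runs as a `java` process).
    for proc in psutil.process_iter(["name", "memory_info"]):
        if proc.info["name"] == "java":
            rss_mb = proc.info["memory_info"].rss / 1024 ** 2
            print(f"java pid={proc.pid} resident memory: {rss_mb:.0f} MB")

    # Crude threshold just for illustration; real alerting belongs in monitoring.
    if sw.percent > 10:
        print("WARNING: significant swap usage -- the JVM heap may exceed available RAM")

if __name__ == "__main__":
    report_memory()
```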

Trigger

  • Solr Prod 3 VM becomes overwhelmed (exact mechanism unconfirmed); network responses slow down / freeze; the TUL Cob app gets no response from Solr; Passenger backs up; an outage occurs.

Resolution

  • Switch Production to Cob Prod Solr 1 or Cob Prod Solr 2
  • Fully rebuild the Cob Prod Solr 3 box & try new operational parameters
  • Add timeout handling for Solr queries within the Cob Prod Rails application (see the timeout sketch after this list)
  • Get the TUL Cob team properly using the Stage environment
  • Also test on QA or Stage (see the replay sketch after this list)
    • Recreate Solr searches against Solr based on access logs
    • Test that it wasn’t a search pattern issue
    • This also starts to create a load test
  • Get the full Dev Team to document infrastructure, Ops, & DevOps specs & to use Terraform going forward
  • Add health checks for the Solr services (see the health-check sketch after this list)
  • Long term: change to a SolrCloud cluster
  • Long term: add networking & connectivity monitoring to help inform log analysis & error reporting
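
A minimal sketch of the timeout-handling idea above. The production app is Rails, so this Python version only illustrates the pattern (bound the time spent waiting on Solr and fail gracefully instead of letting Passenger back up); the Solr URL, core name, and timeout value are placeholders.

```python
import requests

# Placeholder URL & core name -- not our real configuration.
SOLR_SELECT_URL = "http://cob-prod-solr-1.example.edu:8983/solr/example-core/select"

def solr_search(query, timeout_seconds=3):
    """Query Solr, but give up quickly instead of letting requests pile up."""
    try:
        resp = requests.get(
            SOLR_SELECT_URL,
            params={"q": query, "wt": "json"},
            timeout=timeout_seconds,  # applied to both the connect and the read phase
        )
        resp.raise_for_status()
        return resp.json()
    except requests.exceptions.Timeout:
        # This is where the Rails app would render a friendly
        # "search is temporarily unavailable" response instead of hanging.
        return None
    except requests.exceptions.RequestException:
        return None
```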
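
For the access-log replay item, a rough sketch of what such a replay / load-test script could look like. It assumes the Solr request paths have already been extracted from the access logs into a plain text file (one path with query string per line); the QA host name and file name are hypothetical.

```python
import time
import requests

SOLR_BASE = "http://cob-qa-solr-1.example.edu:8983"  # hypothetical QA/Stage host

def replay(paths_file, timeout_seconds=10):
    """Replay previously captured Solr request paths and record how long each takes."""
    timings = []
    with open(paths_file) as fh:
        for line in fh:
            path = line.strip()
            if not path:
                continue
            start = time.monotonic()
            try:
                requests.get(SOLR_BASE + path, timeout=timeout_seconds)
                elapsed = time.monotonic() - start
            except requests.exceptions.RequestException:
                elapsed = None  # timed out or failed outright
            timings.append((path, elapsed))

    slow = [t for t in timings if t[1] is None or t[1] > 1.0]
    print(f"{len(timings)} requests replayed, {len(slow)} slow or failed")
    return timings

if __name__ == "__main__":
    replay("solr_request_paths.txt")  # hypothetical extract from the access logs
```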
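
For the Solr health-check item, a small sketch that polls Solr's standard ping handler (`/solr/<core>/admin/ping`) on each box; the host names and core name below are placeholders, not our real configuration.

```python
import requests

SOLR_HOSTS = [  # placeholders, not our real hostnames
    "http://cob-prod-solr-1.example.edu:8983",
    "http://cob-prod-solr-2.example.edu:8983",
    "http://cob-prod-solr-3.example.edu:8983",
]
CORE = "example-core"  # placeholder core name

def solr_is_healthy(host, core=CORE, timeout_seconds=5):
    """Return True if the core's ping handler answers OK within the timeout."""
    url = f"{host}/solr/{core}/admin/ping"
    try:
        resp = requests.get(url, params={"wt": "json"}, timeout=timeout_seconds)
        return resp.ok and resp.json().get("status") == "OK"
    except (requests.exceptions.RequestException, ValueError):
        return False

if __name__ == "__main__":
    for host in SOLR_HOSTS:
        print(host, "OK" if solr_is_healthy(host) else "FAILING")
```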

Detection

  • Honeybadger health checks on TUL Cob

Action Items

  • Tim will reindex the .60 Solr box (cob-prod-solr-2) with airflow-stage (in progress; see the reindex DAG sketch after this list)
  • David will create a PR to swap Prod Solr back to cob-prod-solr-1
    • Manually fix the bashrc errors in QA + Stage
    • PR on QA with the Prod Solr swap + bash repairs
    • After that, deploy to Stage via main, then to Prod (tul_cob)
  • Christina will rebuild cob-prod-solr-3 after the above swap & after grabbing logs from that machine
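
For the Airflow reindex action item, a heavily simplified sketch of a manually triggered reindex DAG (Airflow 2.x style). The DAG id and the bash command are hypothetical placeholders and not the actual airflow-stage pipeline.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical reindex DAG, run only when triggered by hand;
# the real airflow-stage pipeline differs.
with DAG(
    dag_id="cob_solr_reindex_sketch",
    start_date=datetime(2019, 5, 30),
    schedule_interval=None,  # no schedule: manual trigger only
    catchup=False,
) as dag:
    reindex = BashOperator(
        task_id="reindex_cob_prod_solr_2",
        # Placeholder command; swap in the real indexing invocation.
        bash_command="echo 'reindex cob-prod-solr-2 here'",
    )
```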

Lessons Learned

What went well:

  • Paired on the fix on the day of the incident - dkinzer / cmharlow / tb
  • Knew where to target fixes / issue info gathering / dkinzer
  • Honeybadger monitoring for tul cob / cmharlow

What went wrong:

  • The Solr swap was hard with the deployment process (because it is only partially in place) / dkinzer
  • Didn’t test the new infrastructure in Stage / chad
  • Communication issue(s) around new patterns & stage environments / group discussion
  • Lost vacation day / christina
  • Still uncertain what happened in guts of solr around this / steven

Where we got lucky:

  • Occurred in summer & during a move, so traffic wasn’t as high as usual / jennifer
  • Outages were relatively short (each under 15 minutes)

Questions or Comments overall

  • When to break process / dkinzer
  • Crystallized discussion of what on-call means on our team, & which SLAs exist & are enforced / christina

Timeline

  • Christina will take the resolutions above + make sure they are ticketed in the relevant projects (if not already)

Supporting information

  • Team Values & Practices Discussion
    • Protocol for documentation PRs
      • Review PRs in meetings as a group, then merge after
      • Decided to do a group review of this doc
      • Make sure there are PR updates
    • Make sure relevant stuff is communicated out to PMs, managers, etc.
      • Also, GrittyOps can be used by others who want to open issues to be reviewed by the dev team

Title (incident number)

Date

Authors

Status

Summary

Impact

Root Causes

Trigger

Resolution

Detection

Action Items

Lessons Learned

What went well:

What went wrong:

Where we got lucky:

Timeline

Supporting information

Template from: Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy, "Site Reliability Engineering."
