2019.05.29 TUL Cob - Cob Prod Solr 3 Production Issues
- https://github.com/tulibraries/grittyOps/wiki/2019.05.29-TUL-Cob---Cob-Prod-Solr-3-Production-Issues
May 29-30, 2019, multiple incidents:
- Wednesday May 29th, 1:04 PM-1:18 PM (13m 23s outage)
- Wednesday May 29th, 9:24 PM-9:35 PM (12m 47s outage)
- Thursday May 30th, 9:56 AM-10:03 AM (6m 41s outage)
- Thursday May 30th, 11:36 AM-11:41 AM (4m 46s outage)
- Thursday May 30th, 3:00 PM-3:09 PM (8m 43s outage)
- Lead: Christina
- Status: partial fix in place; process & infrastructure cleanups queued
Because teams were not making full use of the TUL COB Stage environment to review infrastructure & codebases staged for Production, a bad Solr box was put into Production.
Production LibrarySearch (tul_cob) experienced multiple outages (see dates above), directly impacting users.
Root causes:
- Incomplete usage by PMs & developers of the process for using Stage before Prod to test infrastructure & codebases queued for release to Production
- Solr Prod Box 3's JVM had issues with swap space, leading to slow Solr responses & network response issues on the box
  - private networking was turned on for the new box
- TUL Cob has no resilience to slow Solr responses
  - the Solr dependency is a hard one, so we need to know about problems right away; handling this well also helps as we add other Solr cores / dependencies
- Infrastructure & application deployment from before Christina's tenure followed no clear guidelines or specifications, so replicating it for new infrastructure resources was trial & error
  - Solr cores issue in playbook builds (how were they built? was a mistake introduced such that these cores weren't created after this?)
- Failure chain: Solr Prod 3 VM overwhelmed (?) → network response slows down / freezes → TUL Cob app gets no response → Passenger backs up → outage occurs
Resolutions:
- Switch to Cob Prod Solr 1 or Cob Prod Solr 2
- Full rebuild & try new operations parameters on the Cob Prod Solr 3 box
- Add timeout handling for Solr queries within the Cob Prod Rails application (see the timeout sketch after this list)
- Get the TUL Cob team properly using the Stage environment
- Test also on QA or Stage:
  - recreate Solr searches against Solr based on access logs (see the log-replay sketch after this list)
  - test that it wasn't a search-pattern issue
  - this also starts to create a load test
- Get the full Dev Team to document infrastructure, Ops, & DevOps specs & use Terraform going forward
- Add health checks for Solr services (see the ping sketch after this list)
- Long-term: change to a SolrCloud cluster
- Long-term: networking & connectivity monitoring to help inform log analysis & error reporting
- Honeybadger health checks on TUL Cob
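A minimal sketch of the timeout-handling idea, assuming the app talks to Solr through RSolr (the client Blacklight uses); the URL and timeout values here are placeholders, not the app's real config:

```ruby
require "rsolr"

# Hypothetical Solr URL; the real one lives in the app's config.
SOLR_URL = "http://cob-prod-solr-1.example.edu:8983/solr/blacklight"

# RSolr passes :open_timeout and :timeout through to its Faraday
# connection, so a hung Solr box fails fast instead of tying up
# Passenger workers while requests queue behind it.
solr = RSolr.connect(url: SOLR_URL, open_timeout: 2, timeout: 5)

begin
  response = solr.get("select", params: { q: "*:*", rows: 0 })
  puts response["response"]["numFound"]
rescue RSolr::Error::Http, Faraday::TimeoutError, Faraday::ConnectionFailed => e
  # Surface the failure (e.g. to Honeybadger) rather than letting the
  # app hang on a frozen Solr node.
  warn "Solr unavailable: #{e.class}"
end
```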
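A sketch of the log-replay idea: re-run past Solr traffic pulled from access logs, which both checks the search-pattern theory and doubles as the start of a load test. The log file name, host, and combined-log format are assumptions; point this at QA/Stage, never Prod:

```ruby
require "net/http"
require "uri"

# Hypothetical target; replay against a QA/Stage Solr box.
SOLR_HOST = "http://cob-qa-solr-1.example.edu:8983"

# Pull the request path out of each combined-format access log line
# and keep only Solr select queries (query strings ride along in the path).
queries = File.foreach("access.log").filter_map do |line|
  path = line[/"(?:GET|POST) (\S+)/, 1]
  path if path&.include?("/solr/") && path.include?("/select")
end

queries.each do |path|
  uri = URI.join(SOLR_HOST, path)
  started = Time.now
  res = Net::HTTP.get_response(uri)
  # Per-query latency makes slow responses / freezes visible.
  printf("%5.2fs %s %s\n", Time.now - started, res.code, path[0, 80])
end
```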
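And a sketch of a Solr health check using Solr's built-in ping handler (`/solr/<core>/admin/ping`); the host names and core name are assumptions to adjust for the real cob-prod-solr boxes:

```ruby
require "net/http"
require "json"
require "uri"

# Hypothetical hosts/core for the three Prod Solr boxes.
SOLR_NODES = %w[
  http://cob-prod-solr-1.example.edu:8983
  http://cob-prod-solr-2.example.edu:8983
  http://cob-prod-solr-3.example.edu:8983
]
CORE = "blacklight"

SOLR_NODES.each do |node|
  uri = URI("#{node}/solr/#{CORE}/admin/ping?wt=json")
  status =
    begin
      res = Net::HTTP.start(uri.host, uri.port, open_timeout: 2, read_timeout: 5) do |http|
        http.get(uri.request_uri)
      end
      # Solr's ping handler answers {"status":"OK"} when the core is healthy.
      JSON.parse(res.body).fetch("status", "unknown")
    rescue StandardError => e
      "DOWN (#{e.class})"
    end
  puts "#{node}: #{status}"
end
```

Short timeouts matter here: a check that hangs on a frozen box is exactly the failure mode that caused these outages.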
Next steps:
- Tim will reindex .60 solr (cob-prod-solr-2) with airflow-stage (in progress)
- David will create a PR to swap Prod Solr back to cob-prod-solr-1, manually update bashrc errors in QA + Stage, and open a PR on QA with the Prod Solr swaps + bash repairs; from there, move to deploy to Stage via main, then to Prod (tul_cob)
- Christina will rebuild cob-prod-solr-3 after the above swap & after grabbing logs from that machine
Retro notes:
- Paired during the day on the fix - dkinzer / cmharlow / tb
- Knew where to target fixes / issue info gathering / dkinzer
- Honeybadger monitoring for tul cob / cmharlow
- Solr swap was hard with the deployment process (because it was only partially present) / dkinzer
- Didn't test the new infrastructure on Stage / chad
- Communication issue(s) around new patterns & stage environments / group discussion
- Lost vacation day / christina
- Still uncertain what happened in the guts of Solr around this / steven
- Occurred in summer & during a move, so traffic wasn't as high / jennifer
- Outages were relatively short (each less than 15 minutes)
- When to break process / dkinzer
- Crystallized discussion on what on-call means in our team & what SLAs exist & are enforced / christina
Follow-ups:
- Christina will take the resolutions above + make sure they are ticketed in relevant projects (if not already)
- Team Values & Practices Discussion
- Protocol for documentation PRs
  - Review PRs in meetings as a group, then merge after
  - Decided to do a group review of this one
  - Make sure there are PR updates
  - Make sure relevant stuff is communicated out to PMs, managers, etc.
- Also, GrittyOps can be used by others who want to open issues to be reviewed by the dev team