@scumola
Created July 2, 2012 17:50
Alex C IM chat log from June 29, 2012 re: AWS/DOL outage
9:13:42 PM Alex Cook: We are back up
9:13:49 PM Alex Cook: I think it was WRA2 controller
9:13:51 PM steveholly051802: Ahh.
9:13:53 PM steveholly051802: ok
9:13:56 PM Alex Cook: https://rpm.newrelic.com/accounts/47839/applications/177265/transactions#id=52084693
9:15:35 PM steveholly051802: So we were ok. The WRA2 service was hanging or something?
9:15:55 PM Alex Cook: it looks like it, not 100% sure yet
9:16:29 PM Alex Cook: is mapi going down now??
9:16:34 PM steveholly051802: I'm seeing mapi errors now too.
9:16:39 PM steveholly051802: It could be amazon.
9:17:02 PM Alex Cook: checking the console
9:17:08 PM steveholly051802: yea, me too. (
9:17:26 PM steveholly051802: The console is taking a long time for me.
9:18:01 PM steveholly051802: Seeing RDS issues too.
9:18:06 PM Alex Cook: mobile looks like some sort of memcache error possibly?
9:19:25 PM steveholly051802: If it can't contact the memcache servers, that would be a problem. I'm guessing that it's amazon since it's so widespread.
9:19:39 PM Alex Cook: yeah I think so too
9:20:34 PM Alex Cook: the cpu's on RDS were low
9:20:39 PM Alex Cook: connections just drop
9:22:35 PM steveholly051802: It's AWS: 8:21 PM PDT We are investigating connectivity issues for a number of instances in the US-EAST-1 Region.
9:22:40 PM steveholly051802: http://status.aws.amazon.com/
9:23:18 PM steveholly051802: I took most of the affected hosts out of nagios for a 2-hour downtime
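[For reference, a minimal sketch of how a 2-hour downtime like this is typically scheduled through Nagios' external command file; the command-file path, author, and host list below are assumptions.]
  now=$(date +%s); end=$((now + 7200))
  for host in prod-appserv13 prod-dj12; do   # hypothetical host list
    # Format: SCHEDULE_HOST_DOWNTIME;<host>;<start>;<end>;<fixed>;<trigger_id>;<duration>;<author>;<comment>
    printf '[%s] SCHEDULE_HOST_DOWNTIME;%s;%s;%s;1;0;7200;steve;AWS us-east-1 outage\n' \
      "$now" "$host" "$now" "$end" >> /usr/local/nagios/var/rw/nagios.cmd
    printf '[%s] SCHEDULE_HOST_SVC_DOWNTIME;%s;%s;%s;1;0;7200;steve;AWS us-east-1 outage\n' \
      "$now" "$host" "$now" "$end" >> /usr/local/nagios/var/rw/nagios.cmd
  done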
9:23:28 PM Alex Cook: cool well at least that's an easy explanation
9:23:47 PM steveholly051802: Yea.
9:24:09 PM steveholly051802: When it's a huge thing of unrelated things, it's usually AWS.
9:24:25 PM steveholly051802: to use the technical terms. :)
9:24:56 PM Alex Cook: lol
9:26:31 PM Alex Cook: it looks like origin is down again
9:27:19 PM steveholly051802: I'm getting data from origin.dishonline.com
9:27:25 PM Alex Cook: hrmm
9:27:40 PM steveholly051802: but it's possible that some machines are checking out or hanging on back-end resources like RDS and memcache
9:27:54 PM steveholly051802: haproxy thinks all apps are healthy.
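[A sketch of that haproxy check; "show stat" over the stats socket dumps per-backend health as CSV. The socket path is an assumption and column positions can vary slightly by HAProxy version.]
  # Print proxy name, server name, and status (UP/DOWN) for every backend server.
  echo "show stat" | socat stdio /var/run/haproxy.sock | cut -d, -f1,2,18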
9:29:20 PM steveholly051802: AWS can't load the list of instances that are failing network checks in US-East.
9:29:35 PM steveholly051802: The status checks thing is hanging on the console.
9:30:01 PM Alex Cook: yeah same for me
9:30:26 PM steveholly051802: Not much we can do but to sit it out, I think.
9:30:34 PM Alex Cook: "Unable to process request, please retry shortly?
9:30:40 PM Alex Cook: without the ?
9:31:11 PM steveholly051802: on the main EC2 dashboard, I'm seeing "EBS Volumes: An error occurred", and same with EBS snapshots and Key Pairs.
9:31:18 PM steveholly051802: so they're having internal issues as well.
9:31:35 PM Alex Cook: yeah freaking amazon
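[When the console hangs, the same information is available from the EC2 API; a sketch using the modern AWS CLI (the 2012-era ec2-describe-instance-status tool was the equivalent).]
  # List instances in us-east-1 whose status checks are currently failing.
  aws ec2 describe-instance-status --region us-east-1 \
    --filters Name=instance-status.status,Values=impaired \
    --query 'InstanceStatuses[].[InstanceId,InstanceStatus.Status,SystemStatus.Status]' \
    --output table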
9:32:02 PM Alex Cook: ok I'm going to afk and check back in a few
9:32:14 PM steveholly051802: ok
Changed status to Away: Away (9:33:28 PM)
9:33:33 PM steveholly051802: (back in a sec)
Changed status to Online (9:36:41 PM)
Changed status to Away: Away (9:37:43 PM)
9:40:41 PM steveholly051802: (back)
Changed status to Idle (9:47:43 PM)
Changed status to Available (10:11:13 PM)
Changed status to Online (10:11:17 PM)
10:11:37 PM Alex Cook: same
10:12:17 PM steveholly051802: Amazon has updated that it was power and network, and they've restored power and we're seeing machines come back now.
10:12:33 PM Alex Cook: awesome
10:12:39 PM steveholly051802: Not everything yet though.
10:12:47 PM steveholly051802: Pingdom says that the site is up now.
10:13:16 PM Alex Cook: yeah my prod logs have been whizzing by
10:13:24 PM steveholly051802: nagios has 8 hosts down and 108 critical services.
10:13:33 PM steveholly051802: normal is 0 hosts down and about 20 services.
10:13:57 PM Alex Cook: yeah must have been pretty serious
10:14:00 PM steveholly051802: Almost all of the appserver boxes are back
10:18:47 PM Alex Cook: yeah I just checked Netflix is down too
10:18:56 PM Alex Cook: and comcast
10:19:16 PM steveholly051802: What? The Chaos monkey is supposed to fix all of their outage issues!
10:19:24 PM steveholly051802: Nobody is watching TV tonight, I guess.
10:19:26 PM Alex Cook: haha
10:40:33 PM steveholly051802: Can you get to prod-appserv13 and check to see if things are healthy?
10:40:48 PM steveholly051802: the machine is up, but the healthcheck is still failing. The app might need kicking.
10:41:17 PM Alex Cook: sure
10:41:22 PM steveholly051802: Thanks.
10:41:31 PM steveholly051802: RDS replication has failed almost everywhere, I think.
10:41:42 PM steveholly051802: I may have to re-spin all of the slaves.
10:41:54 PM Alex Cook: crap
10:42:20 PM Alex Cook: yeah once the masters go down or are unavailable the slaves are screwed
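[A sketch of how the broken replication was likely confirmed on each slave. The slave list is hypothetical, and the full endpoint names are assumed to follow the cuuvfxxjlwmd.us-east-1.rds.amazonaws.com pattern that appears later in this log.]
  for slave in x-prod-slave-app1a x-prod-slave-app2a; do   # hypothetical slave list
    echo "== $slave =="
    mysql -h "$slave.cuuvfxxjlwmd.us-east-1.rds.amazonaws.com" -u admin -p"$MYSQL_PW" \
      -e 'SHOW SLAVE STATUS\G' \
      | egrep 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Last_Error'
  done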
10:43:26 PM steveholly051802: many RDS instances are unavailable altogether.
10:43:34 PM steveholly051802: mostly solr
10:44:14 PM Alex Cook: nginx is having trouble connecting to the host
10:44:19 PM Alex Cook: on that prod server
10:44:33 PM Alex Cook: tomcat could have crashed on solr
10:44:39 PM steveholly051802: I will restart nginx.
10:46:04 PM steveholly051802: Can you check solr on anything that's red in nagios with the word solr in the hostname?
10:46:17 PM Alex Cook: yeah
10:47:14 PM steveholly051802: wait.
10:47:28 PM steveholly051802: It's probably just RDS. I can reboot that RDS node and see if things come back.
10:47:41 PM steveholly051802: nginx looks happy to me on prod-appserv13 now after a restart
10:47:50 PM Alex Cook: excellent
10:47:53 PM Alex Cook: yeah try a restart
10:47:57 PM Alex Cook: that might wake it up
10:48:00 PM steveholly051802: rebooting the prod-solr RDS instance.
10:53:02 PM steveholly051802: RDS is taking its sweet time rebooting
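[A sketch of the reboot using the modern AWS CLI; the 2012-era rds-reboot-db-instance tool did the same thing, and the identifier is a guess based on the names used in this log.]
  aws rds reboot-db-instance --db-instance-identifier prod-solr-slave
  aws rds wait db-instance-available --db-instance-identifier prod-solr-slave
  # Confirm the instance actually came back to "available".
  aws rds describe-db-instances --db-instance-identifier prod-solr-slave \
    --query 'DBInstances[0].DBInstanceStatus' --output text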
11:01:25 PM steveholly051802: If you want, you can probably continue your evening. The site's back up. I just need to wait until the rest of the instances return to normal again. The solr RDS instance isn't rebooting, so it's possible that it's still an affected EC2 instance
11:01:59 PM Alex Cook: Ok, well I'll stay on IM and what not and glance at it
11:02:13 PM Alex Cook: let me know if there are any issues after the reboots
11:02:23 PM steveholly051802: ok. Will do.
11:02:31 PM steveholly051802: Search seems to be ok on the site.
11:03:06 PM steveholly051802: Can you just spot-test basic functionality on the site to make sure that things are as expected?
11:03:15 PM Alex Cook: sure
11:05:42 PM Alex Cook: everything looks ok
11:08:05 PM steveholly051802: Excellent.
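[A sketch of the kind of spot-check being asked for; only origin.dishonline.com appears in this log, the other URLs are hypothetical.]
  for url in http://origin.dishonline.com/ \
             http://www.dishonline.com/ \
             https://secure.dishonline.com/; do   # hypothetical endpoints
    code=$(curl -sk -o /dev/null -w '%{http_code}' --max-time 10 "$url")
    echo "$url -> HTTP $code"
  done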
11:08:21 PM steveholly051802: "You may resume your napping …" (what movie was that)?
11:08:37 PM Alex Cook: lol not sure
11:09:11 PM steveholly051802: Empire Strikes Back - on of the commanders talking to Darth Vader - when they were chasing the M Falcon.
11:09:19 PM steveholly051802: s/on/one/
11:09:29 PM Alex Cook: lol really? I'll have to check that out
11:39:26 PM steveholly051802: Shit all of the app servers are checking out again.
11:41:12 PM Alex Cook: crap
11:41:24 PM steveholly051802: Same thing. NewRelic thinks it's WRA2
11:41:45 PM Alex Cook: could be because the servers are down though
11:41:47 PM steveholly051802: Damn Amazon probably broke again.
11:42:06 PM Alex Cook: I'm also seeing prod-slave-app1a as deleting
11:42:16 PM steveholly051802: The servers are up. I'm spinning down 2 RDS slaves now (one mobile, one WWW)
11:42:22 PM Alex Cook: ah ok
11:42:22 PM steveholly051802: Yea, I'm doing that on purpose.
11:42:27 PM steveholly051802: to re-start replication
11:42:42 PM steveholly051802: but the app should fail over - there are 3 more slaves available in RDS for the WWW app
11:43:00 PM steveholly051802: so I'm thinking that this is something else.
11:43:15 PM steveholly051802: Can you check one of the apps to make sure that it's not RDS?
11:43:20 PM Alex Cook: yeah
11:43:24 PM steveholly051802: Thanks
11:44:26 PM steveholly051802: I thought that we fixed haproxy to handle 502's.
11:44:27 PM Alex Cook: looks like app server 10 is having problems connecting to the host
11:44:31 PM Alex Cook: like 13 was
11:44:36 PM steveholly051802: so nginx is unhappy?
11:45:31 PM Alex Cook: it's happy now
11:45:40 PM steveholly051802: I restarted nginx on 10
11:45:48 PM steveholly051802: weird.
11:46:08 PM steveholly051802: AWS didn't reboot it
11:46:32 PM steveholly051802: I'll restart nginx on all app servers
11:46:37 PM Alex Cook: ok
11:47:29 PM steveholly051802: they look happy now.
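[A sketch of the rolling nginx restart; the host count, numbering, and use of sudo/service are assumptions, only the prod-appservNN naming comes from this log.]
  for i in $(seq 1 14); do                # appserver count is a guess
    host="prod-appserv$i"
    ssh "deploy@$host" 'sudo service nginx restart' \
      && echo "$host: nginx restarted" \
      || echo "$host: restart FAILED"
  done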
11:49:49 PM steveholly051802: re-creating the first two RDS instances again now.
11:51:21 PM steveholly051802: nginx isn't being very smart about something.
11:53:58 PM Alex Cook: indeed, it doesn't seem to be recovering well
11:54:43 PM steveholly051802: I'm spinning down the x-prod-slave-app2a instance too since it never has recovered from the outage to begin with.
11:54:53 PM steveholly051802: I'll re-build it though.
11:55:06 PM steveholly051802: The apps are happy again.
11:55:38 PM Alex Cook: yep looks like we are back for now
12:16:17 AM steveholly051802: (back in a minute)
12:21:34 AM steveholly051802: (back)
12:21:42 AM Alex Cook: k
12:22:25 AM steveholly051802: The RDS slaves are taking *forever* to rebuild. Everyone in the world must be doing the same thing and hammering the system now.
12:23:07 AM Alex Cook: more than likely
12:23:16 AM Alex Cook: I'm sure a lot of people went down
12:24:29 AM steveholly051802: Yup. We'll hear about it on reddit and HN tomorrow.
12:24:48 AM Alex Cook: lol
12:24:55 AM Alex Cook: I'm sure amazon will get some crap for this
12:25:06 AM steveholly051802: yup
12:48:50 AM steveholly051802: Logging in, I get sent to ProcessLogin.do (or something like that) which isn't showing up for me in prod. Can you check? I get logged in ok, but I get an error page just after login.
12:49:18 AM Alex Cook: sure, might be a CSA issue
12:49:24 AM Alex Cook: or something with secure servers
12:49:35 AM steveholly051802: nevermind. That time it worked fine for me.
12:49:51 AM Alex Cook: yeah worked for me too
12:50:36 AM steveholly051802: k.. Thanks.
12:50:50 AM steveholly051802: Still spinning-up the first pair of slaves. Still not done yet. Uugh.
12:51:14 AM Alex Cook: yeah slow...
12:51:19 AM steveholly051802: If you want to go to bed, I think that would be fine.
12:51:25 AM steveholly051802: When does ingest happen? 8am?
12:51:40 AM steveholly051802: I'd like to get as many slaves as I can before ingest happens.
12:51:46 AM Alex Cook: it will kick off between 2-4 I think
12:51:51 AM steveholly051802: Oh, crap.
12:52:17 AM steveholly051802: If it doesn't replicate and just has stale data will it be ok?
12:52:33 AM Alex Cook: it should be ok if it's a day behind
12:52:45 AM steveholly051802: ok. I can continue to re-spin slaves tomorrow then.
12:52:49 AM Alex Cook: cool
12:53:04 AM Alex Cook: alright I'll keep my phone on
12:53:07 AM steveholly051802: k
12:53:10 AM steveholly051802: g'night
12:53:13 AM Alex Cook: later
Changed status to Offline (12:53:26 AM)
[SLEEP]
8:19:04 AM Alex Cook: Looks like prod-solr-slave is still rebooting, unless you just did that
8:19:12 AM Alex Cook: DJ is not happy
9:33:29 AM Alex Cook: what's up
9:33:34 AM steveholly051802: Yea, the slave-solr RDS instance never was rebooted, so I'm spinning up another one (prod-solr-slave2) and I changed DNS appropriately.
9:33:47 AM steveholly051802: I'm also re-spinning mapi-slave04 now.
9:34:13 AM Alex Cook: cool, that's what I was thinking the solution was; just didn't want to pull the trigger on prod without knowing 100%
9:34:33 AM Alex Cook: sounds like AWS is still trying to recover from the outage
9:37:42 AM steveholly051802: Yea, we still have that one instance that's in permanent reboot state.
9:37:58 AM steveholly051802: so I'm guessing that things are not all right yet with AWS.
9:38:52 AM steveholly051802: I should contact them about the rebooting one.
9:39:44 AM steveholly051802: The spinning up of instances is already faster this morning, I can tell.
9:40:14 AM Alex Cook: nice, yeah I was reading a lot of forum posts about the looping reboot state
9:40:22 AM Alex Cook: https://forums.aws.amazon.com/thread.jspa?messageID=359904
9:40:34 AM Alex Cook: we aren't the only ones...
9:41:17 AM steveholly051802: Ahh, so it's been reported then. Good.
9:43:38 AM steveholly051802: prod-dj12 doesn't seem to be sick. Is the app happy on that machine?
9:43:48 AM steveholly051802: The rest are complaining, but dj12 isn't.
9:44:20 AM Alex Cook: hrmmm it looks happy
9:45:29 AM steveholly051802: Is it taking traffic?
9:45:45 AM Alex Cook: 137k jobs in queue, checking the logs
9:47:53 AM steveholly051802: Now the AWS console is hanging for me.
9:48:01 AM Alex Cook: where dj10 only has 4 jobs in queue
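[Assuming DJ here is Rails delayed_job backed by MySQL, a sketch of the queue-depth check; the DB host, credentials, and schema name are hypothetical.]
  mysql -h "$DB_HOST" -u deploy -p"$DB_PW" dol_production -e \
    "SELECT count(*)                   AS queued,
            sum(locked_at IS NOT NULL) AS locked,
            sum(failed_at IS NOT NULL) AS failed
       FROM delayed_jobs;"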
9:48:21 AM Alex Cook: I can't ssh to dj12
9:48:25 AM Alex Cook: timeout
9:48:36 AM Alex Cook: it could be in la la land
9:48:50 AM steveholly051802: I'm in as user deploy.
9:49:06 AM Alex Cook: oh there it goes
9:49:12 AM Alex Cook: timed out the first time
9:49:52 AM steveholly051802: I'm going to head out for an hour. When I get back, I'll check if the RDS instances have completed re-spinning and fire up another batch.
9:50:09 AM steveholly051802: If something (else) breaks, give me a call on my cell, ok?
9:50:45 AM Alex Cook: sure before you go, it looks like the DJ ELB may be down?
9:50:59 AM Alex Cook: nvm it's just a warn
9:51:02 AM steveholly051802: Can you look into DJ? It should come back when the solr RDS instance comes back, right? Can you make sure that DJ is pointing to the DOL DNS name instead of directly at the RDS DNS name?
9:51:15 AM Alex Cook: sure
9:51:39 AM steveholly051802: Thanks.
9:52:02 AM steveholly051802: The prod-apidj ELB is the one for the new DJ. It looks good to me.
9:52:18 AM Alex Cook: ok cool, and yeah we are pointing to the DNS for DJ
9:52:30 AM steveholly051802: Excellent, so that should come back when the re-spin is done.
9:52:43 AM steveholly051802: DNS has been changed to point to the new one when it comes up.
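[One way to verify what DJ is actually pointing at: look for raw RDS endpoints in its config, where ideally only *.dishonline.com CNAMEs should appear. The deploy path is an assumption based on a standard Capistrano layout.]
  ssh deploy@prod-dj12 \
    "grep -rn 'rds\.amazonaws\.com' /var/www/dol/current/config/ || echo 'no raw RDS endpoints found'"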
9:53:05 AM Alex Cook: cool I'll keep an eye on it while you're gone
9:53:10 AM steveholly051802: ok
9:53:26 AM steveholly051802: We're going to go get breakfast and hit a few garage sales, but we'll be back.
10:23:55 AM Alex Cook: Actually DJ is not pointing to the DNS; I was looking in the wrong place
11:57:18 AM steveholly051802: I'm back now. Sorry, it was more like 2 hours.
11:57:25 AM steveholly051802: Gonna re-spin the last two RDS machines.
11:57:26 AM Alex Cook: cool
11:57:38 AM Alex Cook: Looks like the secure server is having issues now too
11:57:45 AM steveholly051802: Aack
11:58:44 AM steveholly051802: spinning up x-prod-slave-app2a now.
11:58:53 AM steveholly051802: (RDS)
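[If these slaves are RDS read replicas, a sketch of the re-spin with the modern AWS CLI (the 2012-era rds-* tools had equivalent commands); the source master identifier and instance class are assumptions.]
  aws rds delete-db-instance --db-instance-identifier x-prod-slave-app2a --skip-final-snapshot
  aws rds wait db-instance-deleted --db-instance-identifier x-prod-slave-app2a
  aws rds create-db-instance-read-replica \
    --db-instance-identifier x-prod-slave-app2a \
    --source-db-instance-identifier x-prod-master-app2a \
    --db-instance-class db.m1.large
  aws rds wait db-instance-available --db-instance-identifier x-prod-slave-app2a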
11:59:58 AM Alex Cook: also are you sure prod-slave-solr2 is using DNS name x-prod-slave-solr.dishonline.com?
12:00:29 PM Alex Cook: We are still seeing errors trying to connect to x-prod-slave-solr.cuuvfxxjlwmd.us-east-1.rds.amazonaws.com
12:00:43 PM Alex Cook: can't figure out why
12:01:30 PM steveholly051802: $ ./find_ip.rb solr | grep CNAME
7592097 CNAME prod-slavedb02 prod-solr-search
8631099 CNAME x-prod-slave-solr2.cuuvfxxjlwmd.us-east-1.rds.amazonaws.com. x-prod-solr-slave
8631100 CNAME x-prod-slave-solr2.cuuvfxxjlwmd.us-east-1.rds.amazonaws.com. x-prod-slave-solr
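[To rule out stale DNS on the app side, the same CNAMEs can be checked from an app host; a sketch, assuming both records live under dishonline.com as the first one does.]
  for name in x-prod-slave-solr.dishonline.com x-prod-solr-slave.dishonline.com; do
    echo "== $name =="
    dig +short CNAME "$name"   # should show x-prod-slave-solr2.cuuvfxxjlwmd.us-east-1.rds.amazonaws.com.
    dig +short "$name"         # the addresses it ultimately resolves to
  done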
12:01:56 PM Alex Cook: ok I think NR just hasn't had enough time
12:02:08 PM Alex Cook: the errors have gone down significantly
12:02:14 PM steveholly051802: I changed it last night, so the old one should not be cached anymore.
12:02:33 PM Alex Cook: could be queued jobs
12:03:37 PM steveholly051802: Can we flush the queue or something to free things up?
12:03:48 PM Alex Cook: yeah I could go in the DB's and clear them
12:04:42 PM steveholly051802: You don't have to, but we could do that if DJ doesn't come back.
12:05:22 PM Alex Cook: meh, there aren't a lot of q'd up jobs
12:05:26 PM Alex Cook: not sure what's up there
12:05:46 PM Alex Cook: but they seem to be trailing off
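[If the queue did need flushing, a conservative sketch, again assuming delayed_job on MySQL with a hypothetical schema name; it only clears failed jobs and jobs locked by long-dead workers rather than everything.]
  mysql -h "$DB_HOST" -u deploy -p"$DB_PW" dol_production -e \
    "DELETE FROM delayed_jobs
      WHERE failed_at IS NOT NULL
         OR (locked_at IS NOT NULL AND locked_at < now() - INTERVAL 4 HOUR);"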
12:06:19 PM steveholly051802: So what's not working anymore?
12:06:30 PM steveholly051802: Are logins ok?
12:06:35 PM Alex Cook: Brendan is looking into it
12:09:00 PM Alex Cook: Brendan thinks the LB might be having problems
12:09:19 PM steveholly051802: for DJ or secure?
12:09:40 PM Alex Cook: secure
12:10:37 PM steveholly051802: None of the secure instances were in the affected zone from last night.
12:10:49 PM steveholly051802: The LB shows that all 4 secure boxes are passing their healthchecks
12:11:05 PM Alex Cook: hrmm
12:11:07 PM steveholly051802: We had to restart nginx on the app servers. Perhaps we need to bump nginx on the secure boxes.
12:11:52 PM Alex Cook: could be, prod-dj12 was acting really weird this morning too, found out it was in limbo
12:11:56 PM Alex Cook: but a deploy fixed it
12:12:51 PM steveholly051802: I restarted nginx on all 4 secure servers.
12:28:00 PM Alex Cook: brb grabbing power cable
12:28:04 PM steveholly051802: ok
12:31:06 PM Alex Cook: can you hop in the #dol channel?
12:31:13 PM Alex Cook: then we can group chat
12:31:21 PM steveholly051802: Sure.
1:10:46 PM steveholly051802: Gonna get some appserv problems because the 4th RDS instance went away for the re-spin.
1:11:00 PM Alex Cook: ok
1:11:06 PM Alex Cook: was just going to start looking at it lol
1:34:24 PM steveholly051802: ok, now re-spinning the x-prod-slave-app1a instance (again) because replication is still broken on that one, but that seems to be the last one. All others are done and look good.
1:34:51 PM Alex Cook: cool, John and I just restarted all the thin servers on DJ
1:34:56 PM steveholly051802: ok
1:35:08 PM Alex Cook: it seems that my deploy wasn't killing the thin server so we weren't running new code
1:35:34 PM Alex Cook: still looking at it though
1:41:48 PM steveholly051802: ok
2:05:26 PM steveholly051802: Is DJ happy at this time?
2:05:48 PM Alex Cook: yep
2:05:53 PM steveholly051802: ok
2:06:05 PM steveholly051802: I thought so, just making sure.
2:07:21 PM steveholly051802: solr-search-slave01 is still lagging with solr replication. Can you or John look at that?
2:07:31 PM Alex Cook: yeah
2:08:54 PM Alex Cook: john says we might be able to wait to check on that till monday
2:09:19 PM steveholly051802: ok
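[A sketch of how the solr-search-slave01 replication lag can be checked: compare index versions through Solr's replication handler. The port and core path are assumptions, and prod-solr-search is a guess at the master taken from the find_ip.rb output above.]
  curl -s 'http://prod-solr-search:8983/solr/replication?command=indexversion&wt=json'
  curl -s 'http://solr-search-slave01:8983/solr/replication?command=details&wt=json' \
    | python -m json.tool \
    | egrep -i 'indexversion|generation|isreplicating|timesfailed'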
2:34:51 PM Alex Cook: ok I'm going stealth mode for a while, I'll jump back on if I get some alerts
Changed status to Offline (2:35:25 PM)
2:51:22 PM steveholly051802: ok