Created July 2, 2012 17:50
Alex C IM chat log from June 29, 2012 re: AWS/DOL outage
9:13:42 PM Alex Cook: We are back up
9:13:49 PM Alex Cook: I think it was WRA2 controller
9:13:51 PM steveholly051802: Ahh.
9:13:53 PM steveholly051802: ok
9:13:56 PM Alex Cook: https://rpm.newrelic.com/accounts/47839/applications/177265/transactions#id=52084693
9:15:35 PM steveholly051802: So we were ok. The WRA2 service was hanging or something?
9:15:55 PM Alex Cook: it looks like it, not 100% sure yet
9:16:29 PM Alex Cook: is mapi going down now??
9:16:34 PM steveholly051802: I'm seeing mapi errors now too.
9:16:39 PM steveholly051802: It could be amazon.
9:17:02 PM Alex Cook: checking the console
9:17:08 PM steveholly051802: yea, me too. (
9:17:26 PM steveholly051802: The console is taking a long time for me.
9:18:01 PM steveholly051802: Seeing RDS issues too.
9:18:06 PM Alex Cook: mobile looks like some sort of memcache error possibly?
9:19:25 PM steveholly051802: If it can't contact the memcache servers, that would be a problem. I'm guessing that it's amazon since it's so widespread.
9:19:39 PM Alex Cook: yeah I think so too
9:20:34 PM Alex Cook: the cpu's on RDS were low
9:20:39 PM Alex Cook: connections just drop
9:22:35 PM steveholly051802: It's AWS: 8:21 PM PDT We are investigating connectivity issues for a number of instances in the US-EAST-1 Region.
9:22:40 PM steveholly051802: http://status.aws.amazon.com/
9:23:18 PM steveholly051802: I took most of the affected hosts out of nagios for a 2-hour downtime
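(For reference, scheduling that kind of 2-hour Nagios downtime can be scripted through the external command file; a minimal sketch, in which the host name, author, comment, and command-file path are assumptions rather than anything taken from this log:)

$ now=$(date +%s); end=$((now + 7200))
$ # downtime for the host itself, fixed window of 7200 seconds
$ printf '[%s] SCHEDULE_HOST_DOWNTIME;prod-appserv13;%s;%s;1;0;7200;steve;AWS us-east-1 outage\n' "$now" "$now" "$end" >> /usr/local/nagios/var/rw/nagios.cmd
$ # and the same window for every service on that host
$ printf '[%s] SCHEDULE_HOST_SVC_DOWNTIME;prod-appserv13;%s;%s;1;0;7200;steve;AWS us-east-1 outage\n' "$now" "$now" "$end" >> /usr/local/nagios/var/rw/nagios.cmd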
9:23:28 PM Alex Cook: cool well at least that's an easy explanation
9:23:47 PM steveholly051802: Yea.
9:24:09 PM steveholly051802: When it's a huge thing of unrelated things, it's usually AWS.
9:24:25 PM steveholly051802: to use the technical terms. :)
9:24:56 PM Alex Cook: lol
9:26:31 PM Alex Cook: it looks like origin is down again
9:27:19 PM steveholly051802: I'm getting data from origin.dishonline.com
9:27:25 PM Alex Cook: hrmm
9:27:40 PM steveholly051802: but it's possible that some machines are checking out or hanging on back-end resources like RDS and memcache
9:27:54 PM steveholly051802: haproxy thinks all apps are healthy.
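(A quick way to see what haproxy itself thinks of each backend is the stats socket; a sketch, assuming haproxy is configured with an admin socket at /var/run/haproxy.sock, which is an assumption, not something shown in the log:)

$ # "show stat" emits CSV; fields 1, 2, and 18 are proxy name, server name, and status (UP/DOWN)
$ echo "show stat" | socat stdio /var/run/haproxy.sock | cut -d, -f1,2,18 | column -s, -t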
9:29:20 PM steveholly051802: AWS can't load the list of instances that are failing network checks in US-East.
9:29:35 PM steveholly051802: The status checks thing is hanging on the console.
9:30:01 PM Alex Cook: yeah same for me
9:30:26 PM steveholly051802: Not much we can do but to sit it out, I think.
9:30:34 PM Alex Cook: "Unable to process request, please retry shortly?
9:30:40 PM Alex Cook: without the ?
9:31:11 PM steveholly051802: on the main EC2 dashboard, I'm seeing "EBS Volumes: An error occurred", and same with EBS snapshots and Key Pairs.
9:31:18 PM steveholly051802: so they're having internal issues as well.
9:31:35 PM Alex Cook: yeah freaking amazon
9:32:02 PM Alex Cook: ok I'm going to afk and check back in a few
9:32:14 PM steveholly051802: ok
Changed status to Away: Away (9:33:28 PM)
9:33:33 PM steveholly051802: (back in a sec)
Changed status to Online (9:36:41 PM)
Changed status to Away: Away (9:37:43 PM)
9:40:41 PM steveholly051802: (back)
Changed status to Idle (9:47:43 PM)
Changed status to Available (10:11:13 PM)
Changed status to Online (10:11:17 PM)
10:11:37 PM Alex Cook: same
10:12:17 PM steveholly051802: Amazon has updated that it was power and network, and they've restored power and we're seeing machines come back now.
10:12:33 PM Alex Cook: awesome
10:12:39 PM steveholly051802: Not everything yet though.
10:12:47 PM steveholly051802: Pingdom says that the site is up now.
10:13:16 PM Alex Cook: yeah my prod logs have been whizzing by
10:13:24 PM steveholly051802: nagios has 8 hosts down and 108 critical services.
10:13:33 PM steveholly051802: normal is 0 hosts down and about 20 services.
10:13:57 PM Alex Cook: yeah must have been pretty serious
10:14:00 PM steveholly051802: Almost all of the appserver boxes are back
10:18:47 PM Alex Cook: yeah I just checked Netflix is down too
10:18:56 PM Alex Cook: and comcast
10:19:16 PM steveholly051802: What? The Chaos monkey is supposed to fix all of their outage issues!
10:19:24 PM steveholly051802: Nobody is watching TV tonight, I guess.
10:19:26 PM Alex Cook: haha
10:40:33 PM steveholly051802: Can you get to prod-appserv13 and check to see if things are healthy?
10:40:48 PM steveholly051802: the machine is up, but the healthcheck is still failing. The app might need kicking.
10:41:17 PM Alex Cook: sure
10:41:22 PM steveholly051802: Thanks.
10:41:31 PM steveholly051802: RDS replication has failed almost everywhere, I think.
10:41:42 PM steveholly051802: I may have to re-spin all of the slaves.
10:41:54 PM Alex Cook: crap
10:42:20 PM Alex Cook: yeah once the masters go down or are unavailable the slaves are screwed
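(Re-spinning a dead RDS slave generally means deleting the broken replica and creating a fresh read replica off the master. A rough sketch using the modern unified AWS CLI, which postdates this chat; the slave identifier appears later in the log, but the source/master identifier and instance class here are purely illustrative:)

$ # drop the broken replica
$ aws rds delete-db-instance --db-instance-identifier x-prod-slave-app1a --skip-final-snapshot
$ # create a fresh read replica from the (assumed) master
$ aws rds create-db-instance-read-replica --db-instance-identifier x-prod-slave-app1a --source-db-instance-identifier x-prod-master-app1 --db-instance-class db.m1.large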
10:43:26 PM steveholly051802: many RDS instances are unavailable altogether.
10:43:34 PM steveholly051802: mostly solr
10:44:14 PM Alex Cook: nginx is having trouble connecting to the host
10:44:19 PM Alex Cook: on that prod server
10:44:33 PM Alex Cook: tomcat could have crashed on solr
10:44:39 PM steveholly051802: I will restart nginx.
10:46:04 PM steveholly051802: Can you check solr on anything that's red in nagios with the word solr in the hostname?
10:46:17 PM Alex Cook: yeah
10:47:14 PM steveholly051802: wait.
10:47:28 PM steveholly051802: It's probably just RDS. I can reboot that RDS node and see if things come back.
10:47:41 PM steveholly051802: nginx looks happy to me on prod-appserv13 now after a restart
10:47:50 PM Alex Cook: excellent
10:47:53 PM Alex Cook: yeah try a restart
10:47:57 PM Alex Cook: that might wake it up
10:48:00 PM steveholly051802: rebooting the prod-solr RDS instance.
10:53:02 PM steveholly051802: RDS is taking its sweet time rebooting
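(When the console hangs, the same reboot can be issued from the command line. A sketch with the modern unified AWS CLI, which postdates this chat; in 2012 the standalone RDS command-line tools would have been used instead, and the identifier is assumed from context:)

$ aws rds reboot-db-instance --db-instance-identifier prod-solr
$ # poll until the status leaves "rebooting"
$ watch -n 30 'aws rds describe-db-instances --db-instance-identifier prod-solr --query "DBInstances[0].DBInstanceStatus" --output text'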
11:01:25 PM steveholly051802: If you want, you can probably continue your evening. The site's back up. I just need to wait until the rest of the instances return to normal again. The solr RDS instance isn't rebooting, so it's possible that it's still an affected EC2 instance
11:01:59 PM Alex Cook: Ok, well I'll stay on IM and what not and glance at it
11:02:13 PM Alex Cook: let me know if there are any issues after the reboots
11:02:23 PM steveholly051802: ok. Will do.
11:02:31 PM steveholly051802: Search seems to be ok on the site.
11:03:06 PM steveholly051802: Can you just spot-test basic functionality on the site to make sure that things are as expected?
11:03:15 PM Alex Cook: sure
11:05:42 PM Alex Cook: everything looks ok
11:08:05 PM steveholly051802: Excellent.
11:08:21 PM steveholly051802: "You may resume your napping …" (what movie was that)?
11:08:37 PM Alex Cook: lol not sure
11:09:11 PM steveholly051802: Empire Strikes Back - on of the commanders talking to Darth Vader - when they were chasing the M Falcon.
11:09:19 PM steveholly051802: s/on/one/
11:09:29 PM Alex Cook: lol really? I'll have to check that out
11:39:26 PM steveholly051802: Shit, all of the app servers are checking out again.
11:41:12 PM Alex Cook: crap
11:41:24 PM steveholly051802: Same thing. NewRelic thinks it's WRA2
11:41:45 PM Alex Cook: could be because the servers are down though
11:41:47 PM steveholly051802: Damn Amazon probably broke again.
11:42:06 PM Alex Cook: I'm also seeing prod-slave-app1a as deleting
11:42:16 PM steveholly051802: The servers are up. I'm spinning down 2 RDS slaves now (one mobile, one WWW)
11:42:22 PM Alex Cook: ah ok
11:42:22 PM steveholly051802: Yea, I'm doing that on purpose.
11:42:27 PM steveholly051802: to re-start replication
11:42:42 PM steveholly051802: but the app should fail over - there are 3 more slaves available in RDS for the WWW app
11:43:00 PM steveholly051802: so I'm thinking that this is something else.
11:43:15 PM steveholly051802: Can you check one of the apps to make sure that it's not RDS?
11:43:20 PM Alex Cook: yeah
11:43:24 PM steveholly051802: Thanks
11:44:26 PM steveholly051802: I thought that we fixed haproxy to handle 502's.
11:44:27 PM Alex Cook: looks like app server 10 is having problems connecting to the host
11:44:31 PM Alex Cook: like 13 was
11:44:36 PM steveholly051802: so nginx is unhappy?
11:45:31 PM Alex Cook: it's happy now
11:45:40 PM steveholly051802: I restarted nginx on 10
11:45:48 PM steveholly051802: weird.
11:46:08 PM steveholly051802: AWS didn't reboot it
11:46:32 PM steveholly051802: I'll restart nginx on all app servers
11:46:37 PM Alex Cook: ok
11:47:29 PM steveholly051802: they look happy now.
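(Restarting nginx across the fleet is usually just an ssh loop; a minimal sketch, assuming the appserver host list and that the deploy user can run the init script with sudo, both of which are assumptions:)

$ # substitute the real list of app servers
$ for h in prod-appserv10 prod-appserv11 prod-appserv12 prod-appserv13; do ssh deploy@$h 'sudo /etc/init.d/nginx restart'; done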
11:49:49 PM steveholly051802: re-creating the first two RDS instances again now.
11:51:21 PM steveholly051802: nginx isn't being very smart about something.
11:53:58 PM Alex Cook: indeed, it doesn't seem to be recovering well
11:54:43 PM steveholly051802: I'm spinning down the x-prod-slave-app2a instance too since it never has recovered from the outage to begin with.
11:54:53 PM steveholly051802: I'll re-build it though.
11:55:06 PM steveholly051802: The apps are happy again.
11:55:38 PM Alex Cook: yep looks like we are back for now
12:16:17 AM steveholly051802: (back in a minute)
12:21:34 AM steveholly051802: (back)
12:21:42 AM Alex Cook: k
12:22:25 AM steveholly051802: The RDS slaves are taking *forever* to rebuild. Everyone in the world must be doing the same thing and hammering the system now.
12:23:07 AM Alex Cook: more than likely
12:23:16 AM Alex Cook: I'm sure a lot of people went down
12:24:29 AM steveholly051802: Yup. We'll hear about it on reddit and HN tomorrow.
12:24:48 AM Alex Cook: lol
12:24:55 AM Alex Cook: I'm sure amazon will get some crap for this
12:25:06 AM steveholly051802: yup
12:48:50 AM steveholly051802: Logging in, I get sent to ProcessLogin.do (or something like that) which isn't showing up for me in prod. Can you check? I get logged in ok, but I get an error page just after login.
12:49:18 AM Alex Cook: sure, might be a CSA issue
12:49:24 AM Alex Cook: or something with secure servers
12:49:35 AM steveholly051802: nevermind. That time it worked fine for me.
12:49:51 AM Alex Cook: yeah worked for me too
12:50:36 AM steveholly051802: k.. Thanks.
12:50:50 AM steveholly051802: Still spinning-up the first pair of slaves. Still not done yet. Uugh.
12:51:14 AM Alex Cook: yeah slow...
12:51:19 AM steveholly051802: If you want to go to bed, I think that would be fine.
12:51:25 AM steveholly051802: When does ingest happen? 8am?
12:51:40 AM steveholly051802: I'd like to get as many slaves as I can before ingest happens.
12:51:46 AM Alex Cook: it will kick off between 2-4 I think
12:51:51 AM steveholly051802: Oh, crap.
12:52:17 AM steveholly051802: If it doesn't replicate and just has stale data will it be ok?
12:52:33 AM Alex Cook: it should be ok if it's a day behind
12:52:45 AM steveholly051802: ok. I can continue to re-spin slaves tomorrow then.
12:52:49 AM Alex Cook: cool
12:53:04 AM Alex Cook: alright I'll keep my phone on
12:53:07 AM steveholly051802: k
12:53:10 AM steveholly051802: g'night
12:53:13 AM Alex Cook: later
Changed status to Offline (12:53:26 AM)
[SLEEP]
8:19:04 AM Alex Cook: Looks like prod-solr-slave is still rebooting, unless you just did that
8:19:12 AM Alex Cook: DJ is not happy
9:33:29 AM Alex Cook: what's up
9:33:34 AM steveholly051802: Yea, the slave-solr RDS instance never was rebooted, so I'm spinning up another one (prod-solr-slave2) and I changed DNS appropriately.
9:33:47 AM steveholly051802: I'm also re-spinning mapi-slave04 now.
9:34:13 AM Alex Cook: cool, that's what I was thinking was the solution, just didn't want to pull the trigger on prod without knowing 100%
9:34:33 AM Alex Cook: sounds like AWS is still trying to recover from the outage
9:37:42 AM steveholly051802: Yea, we still have that one instance that's in permanent reboot state.
9:37:58 AM steveholly051802: so I'm guessing that things are not all right yet with AWS.
9:38:52 AM steveholly051802: I should contact them about the rebooting one.
9:39:44 AM steveholly051802: The spinning up of instances is already faster this morning, I can tell.
9:40:14 AM Alex Cook: nice, yeah I was reading a lot of forum posts about the looping reboot state
9:40:22 AM Alex Cook: https://forums.aws.amazon.com/thread.jspa?messageID=359904
9:40:34 AM Alex Cook: we aren't the only ones...
9:41:17 AM steveholly051802: Ahh, so it's been reported then. Good.
9:43:38 AM steveholly051802: prod-dj12 doesn't seem to be sick. Is the app happy on that machine?
9:43:48 AM steveholly051802: The rest are complaining, but dj12 isn't.
9:44:20 AM Alex Cook: hrmmm it looks happy
9:45:29 AM steveholly051802: Is it taking traffic?
9:45:45 AM Alex Cook: 137k jobs in queue, checking the logs
9:47:53 AM steveholly051802: Now the AWS console is hanging for me.
9:48:01 AM Alex Cook: where dj10 only has 4 jobs in queue
9:48:21 AM Alex Cook: I can't ssh to dj12
9:48:25 AM Alex Cook: timeout
9:48:36 AM Alex Cook: it could be in la la land
9:48:50 AM steveholly051802: I'm in as user deploy.
9:49:06 AM Alex Cook: oh there it goes
9:49:12 AM Alex Cook: timed out the first time
9:49:52 AM steveholly051802: I'm going to head out for an hour. When I get back, I'll check if the RDS instances have completed re-spinning and fire up another batch.
9:50:09 AM steveholly051802: If something (else) breaks, give me a call on my cell, ok?
9:50:45 AM Alex Cook: sure, before you go, it looks like the DJ ELB may be down?
9:50:59 AM Alex Cook: nvm it's just a warn
9:51:02 AM steveholly051802: Can you look into DJ? It should come back when the solr RDS instance comes back, right? Can you make sure that DJ is pointing to the DOL DNS name instead of directly at the RDS DNS name?
9:51:15 AM Alex Cook: sure
9:51:39 AM steveholly051802: Thanks.
9:52:02 AM steveholly051802: The prod-apidj ELB is the one for the new DJ. It looks good to me.
9:52:18 AM Alex Cook: ok cool, and yeah we are pointing to the DNS for DJ
9:52:30 AM steveholly051802: Excellent, so that should come back when the re-spin is done.
9:52:43 AM steveholly051802: DNS has been changed to point to the new one when it comes up.
9:53:05 AM Alex Cook: cool I'll keep an eye on it while you're gone
9:53:10 AM steveholly051802: ok
9:53:26 AM steveholly051802: We're going to go get breakfast and hit a few garage sales, but we'll be back.
10:23:55 AM Alex Cook: Actually dj is not pointing to the DNS, I was looking in the wrong place
11:57:18 AM steveholly051802: I'm back now. Sorry, it was more like 2 hours.
11:57:25 AM steveholly051802: Gonna re-spin the last two RDS machines.
11:57:26 AM Alex Cook: cool
11:57:38 AM Alex Cook: Looks like the secure server is having issues now too
11:57:45 AM steveholly051802: Aack
11:58:44 AM steveholly051802: spinning up x-prod-slave-app2a now.
11:58:53 AM steveholly051802: (RDS)
11:59:58 AM Alex Cook: also are you sure prod-slave-solr2 is using DNS name x-prod-slave-solr.dishonline.com?
12:00:29 PM Alex Cook: We are still seeing errors trying to connect to x-prod-slave-solr.cuuvfxxjlwmd.us-east-1.rds.amazonaws.com
12:00:43 PM Alex Cook: can't figure out why
12:01:30 PM steveholly051802: $ ./find_ip.rb solr | grep CNAME
7592097 CNAME prod-slavedb02 prod-solr-search
8631099 CNAME x-prod-slave-solr2.cuuvfxxjlwmd.us-east-1.rds.amazonaws.com. x-prod-solr-slave
8631100 CNAME x-prod-slave-solr2.cuuvfxxjlwmd.us-east-1.rds.amazonaws.com. x-prod-slave-solr
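(An easy independent check of that CNAME chain is dig from one of the app servers; the record name and expected target come from the find_ip.rb output above, the rest is illustrative:)

$ dig +short CNAME x-prod-slave-solr.dishonline.com
x-prod-slave-solr2.cuuvfxxjlwmd.us-east-1.rds.amazonaws.com.
$ # the TTL in the full answer also shows how long stale resolvers could keep returning the old target
$ dig +noall +answer x-prod-slave-solr.dishonline.com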
12:01:56 PM Alex Cook: ok I think NR just hasn't had enough time
12:02:08 PM Alex Cook: the errors have gone down significantly
12:02:14 PM steveholly051802: I changed it last night, so the old one should not be cached anymore.
12:02:33 PM Alex Cook: could be queued jobs
12:03:37 PM steveholly051802: Can we flush the queue or something to free things up?
12:03:48 PM Alex Cook: yeah I could go in the DB's and clear them
12:04:42 PM steveholly051802: You don't have to, but we could do that if DJ doesn't come back.
12:05:22 PM Alex Cook: meh, there aren't a lot of q'd up jobs
12:05:26 PM Alex Cook: not sure what's up there
12:05:46 PM Alex Cook: but they seem to be trailing off
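(If DJ here is Rails delayed_job backed by MySQL, which is an assumption, inspecting and flushing the queue straight from the DB looks roughly like this; the master hostname, credentials, database name, and the choice to clear only permanently-failed jobs are all illustrative:)

$ # how deep is the queue, and how much of it is permanently failed?
$ mysql -h prod-master-db.example.com -u dol -p dol_production -e "SELECT COUNT(*) AS total, SUM(failed_at IS NOT NULL) AS failed FROM delayed_jobs;"
$ # flush only jobs that have already exhausted their retries
$ mysql -h prod-master-db.example.com -u dol -p dol_production -e "DELETE FROM delayed_jobs WHERE failed_at IS NOT NULL;"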
12:06:19 PM steveholly051802: So what's not working anymore?
12:06:30 PM steveholly051802: Are logins ok?
12:06:35 PM Alex Cook: Brendan is looking into it
12:09:00 PM Alex Cook: Brendan thinks the LB might be having problems
12:09:19 PM steveholly051802: for DJ or secure?
12:09:40 PM Alex Cook: secure
12:10:37 PM steveholly051802: None of the secure instances were in the affected zone from last night.
12:10:49 PM steveholly051802: The LB shows that all 4 secure boxes are passing their healthchecks
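(That per-instance view is also available from the command line for a classic ELB; a sketch using the modern unified AWS CLI, which postdates this chat, and with the load balancer name assumed:)

$ aws elb describe-instance-health --load-balancer-name prod-secure --query 'InstanceStates[].[InstanceId,State,ReasonCode]' --output table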
12:11:05 PM Alex Cook: hrmm
12:11:07 PM steveholly051802: We had to restart nginx on the app servers. Perhaps we need to bump nginx on the secure boxes.
12:11:52 PM Alex Cook: could be, prod-dj12 was acting really weird this morning too, found out it was in limbo
12:11:56 PM Alex Cook: but a deploy fixed it
12:12:51 PM steveholly051802: I restarted nginx on all 4 secure servers.
12:28:00 PM Alex Cook: brb grabbing power cable
12:28:04 PM steveholly051802: ok
12:31:06 PM Alex Cook: can you hop in the #dol channel?
12:31:13 PM Alex Cook: then we can group chat
12:31:21 PM steveholly051802: Sure.
1:10:46 PM steveholly051802: Gonna get some appserv problems because the 4th RDS instance went away for the re-spin.
1:11:00 PM Alex Cook: ok
1:11:06 PM Alex Cook: was just going to start looking at it lol
1:34:24 PM steveholly051802: ok, now re-spinning the x-prod-slave-app1a instance (again) because replication is still broken on that one, but that seems to be the last one. All others are done and look good.
1:34:51 PM Alex Cook: cool, John and I just restarted all the thin servers on DJ
1:34:56 PM steveholly051802: ok
1:35:08 PM Alex Cook: it seems that my deploy wasn't killing the thin server so we weren't running new code
1:35:34 PM Alex Cook: still looking at it though
1:41:48 PM steveholly051802: ok
2:05:26 PM steveholly051802: Is DJ happy at this time?
2:05:48 PM Alex Cook: yep
2:05:53 PM steveholly051802: ok
2:06:05 PM steveholly051802: I thought so, just making sure.
2:07:21 PM steveholly051802: solr-search-slave01 is still lagging with solr replication. Can you or John look at that?
2:07:31 PM Alex Cook: yeah
2:08:54 PM Alex Cook: john says we might be able to wait to check on that till monday
2:09:19 PM steveholly051802: ok
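(Solr's replication handler can report the slave's state directly, which is usually the quickest way to see whether solr-search-slave01 is catching up; a sketch in which the port and core path are assumptions:)

$ curl -s 'http://solr-search-slave01:8983/solr/replication?command=details' | grep -i -A2 indexVersion
$ # compare the master's index version to see how far behind the slave is
$ curl -s 'http://prod-solr-search:8983/solr/replication?command=indexversion'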
2:34:51 PM Alex Cook: ok I'm going stealth mode for a while, I'll jump back on if I get some alerts
Changed status to Offline (2:35:25 PM)
2:51:22 PM steveholly051802: ok