Instantly share code, notes, and snippets.

Embed
What would you like to do?
Investigating Performance Issue
0:00:00-0:01:00 Thinking
0:01:00-0:03:15 Wrote query to get 100 queries after a given point in time that took a long time.
0:03:15-0:03:30 I try running the query and get an error. There's an issue with how I'm using TIMESTAMP AT TIME ZONE in Postgres.
0:03:30-0:04:15 I look up the docs for TIMESTAMP. I discover the issue is I wrote TIMESTAMP AT TIME ZONE instead of TIMESTAMP WITH TIME ZONE. I then run the query and it works.
0:04:15-0:05:45 I scroll through the results and look at them.
0:05:45-0:06:45 I modify the query I have to display how much time is spent in different parts of our stack. I'm looking at how much time was spent in our master database compared to our worker databases.
0:06:45-0:07:00 I get a few errors from ambigious fields in the SQL query. This is because I'm joining two tables that both have columns named start_time and both have columns named data. I modify the SQL query to refer explicitly to one of the tables.
0:07:00-07:30:00 The query I wrote is taking a long time. I kill it and take a look at the query plan. I see that for some reason the query is performing a sequential scan over all historical queries. This indicates an issue with my query.
0:07:30-0:08:15 I look at the query and realize the problem. Since I'm joining two tables, then sorting the result by time, then limiting the queries, the join needs to be completed first. This causes the join need to be computed over all queries since the time I'm looking at.
0:08:15-0:09:00 I modify the query so it only looks at queries between the time I care about and one hour after the time I care about. The query now returns quickly.
0:09:00-0:10:00 I scroll through the results and see that almost all of the time is being spent in our master database. Looking through the queries ran at the time, it looks like a lot of them are a specifc type of query. One that is particularly computationally expensive.
0:10:00-0:12:15 I scroll through the queries and confirm almost all the queries around the time of the incident were the computationally expensive queries.
0:12:15-0:12:30 I lower my threshold for slow queries to look at more data. I see pretty much the same exact thing.
0:12:30-0:13:15 I look up the CPU usage of our master database on an internal dashboard and see that it had a large spike in CPU around the time I'm looking at.
0:13:15-0:14:30 I'm trying to look at what queries were running when the CPU spike first started.
14:30-17:30 I dig through the data, but I'm having hard time spotting what queries actually caused the issue.
17:30-18:45 I modify the query I'm writing to omit parts of the data I don't care about. This makes it easier to go through larger portions of the data.
18:45-20:15 I go through the data and see that there were a bunch of fast queries, then a bunch of the computationally expensive queries, then things started to be slow. I think this is a good enough depth in the outage and start to write up a doc.
20:15-20:30 I create a new document in Dropbox Paper.
20:30-20:45 I write the title of the document.
20:45-22:45 I dig through a couple of the graphs I found and take screen shots of them.
22:45-24:45 I write a quick summary of what I found and my hypothesis on what was causing queries to be slow.
24:45-25:30 I post the queries I ran and the results I got into the doc.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment