@jgagne33
Created March 24, 2009 20:20
[11:50am] Bino: hi
[11:51am] tpol: hi... so I just noticed a long running query as I was logging in to check things over...
[11:52am] Bino: ok, great, just what we're looking for
[11:53am] tpol: the db was basically reading like crazy... whatever the query was that cleared the cache... it probably didn't impact you previously because your cache was so large
[11:59am] Bino: so the query cleared the cache?
[11:59am] Bino: can you find the query?
[12:02pm] tpol: hmm... I'm not seeing it yet
[12:08pm] You left the chat by being disconnected from the server.
[12:08pm] You rejoined the room.
[12:08pm] Bino left the chat room. (Read error: 104 (Connection reset by peer))
[12:16pm] You left the chat by being disconnected from the server.
[12:17pm] You rejoined the room.
[12:17pm] Bino_ left the chat room. (Read error: 104 (Connection reset by peer))
[12:22pm] Bino: sorry, connection probs here tpol
[12:22pm] Bino: any luck?
[12:23pm] tpol: well I see a couple of queries running that are using derived tables and look odd... but I'm not seeing one particularly ugly query
[12:25pm] tpol: there is a sphinx indexing query hitting the db that is in #2 of your top queries
[12:26pm] Bino: yeah, looking over that with Erik Jones in a ticket
[12:26pm] tpol: what is the schedule for sphinx indexing... could they be hitting at the same time
[12:26pm] Bino: yeah, we want to stagger them, I've actually asked a bunch of times for that
[12:27pm] Bino: Thought it was already done but I can see if our other dev has time
[12:28pm] Bino: was supposed to be staggered when the fractional cluster was set-up
[12:28pm] Bino: there are also some frequent jobs that start at the same time that I thought could be staggered also
[12:28pm] tpol: you indicated a spike at 10:45 was that EST?
[12:28pm] Bino: no pacific
[12:29pm] Bino: sorry, shoulda mentioned
[12:29pm] tpol: and you were looking at newrelic?
[12:30pm] Bino: yeah
[12:31pm] tpol: were you looking at the Active Record activity chart or something else?
[12:41pm] Bino: sorry tpol I was looking at the overview response time and throughput and AR activity
[12:41pm] tpol: ok
[12:42pm] tpol: # Query_time: 43.609536 Lock_time: 0.000000 Rows_sent: 1 Rows_examined: 200194
[12:42pm] tpol: SELECT `login` FROM `users` WHERE (LOWER(`users`.login) = 'ines');
[12:42pm] Bino: basically it looks like performance is great outside of few sticking points where things spike and then it comes right back in-line
[12:42pm] Bino: if we could pick those things off, I think we'd have good performance
[12:42pm] Bino: woah!
[12:42pm] Bino: 43 seconds for a login
[12:43pm] tpol: thats doing a full table scan
[12:43pm] tpol: it can't use an index because of the use of LOWER
[12:43pm] Bino: jeeze
[12:43pm] Bino: so if we take out that lower, it will be able to?
[12:44pm] tpol: yes... users table is utf8 which is case-insensitive
[12:44pm] Bino: could that have been waiting on another query? is that why it took so long? Logging in seems to be super fast in general (although a full-table scan needs to be fixed anyway)
[12:44pm] tpol: and you have an index on login so that would use an index if you drop the lower
[12:45pm] Bino: ok, cool I'll do that
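
A sketch of the fix discussed above, assuming the table, column, and index names are as they appear in the slow-log entry: because the utf8 collation on users.login already compares case-insensitively, the LOWER() call can be dropped, which lets the optimizer use the existing index instead of scanning the table.

    SELECT `login` FROM `users` WHERE `users`.`login` = 'ines';

    -- Confirm the index is actually used (the "key" column should name the login index):
    EXPLAIN SELECT `login` FROM `users` WHERE `users`.`login` = 'ines';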
[12:46pm] tpol: your database is doing a lot of disk reads... it could be that there just isn't enough memory to adequately cache your database
[12:47pm] tpol: that seems to be the common theme in the slow logs... there isn't really one query that slows it down... they are all slower than they should be because just about everything is reading from disk
[12:48pm] tpol: and because they are all contending for the disk resource they are slowing each other down
[12:48pm] Bino: I don't really understand what it means, "reading from disk"
[12:48pm] Bino: as opposed to reading from cache?
[12:50pm] tpol: posts is heavily indexed but an index on repost_token would help
[12:50pm] tpol: # Query_time: 48.569710 Lock_time: 0.000000 Rows_sent: 1 Rows_examined: 225509
[12:50pm] tpol: SELECT * FROM `posts` WHERE (`posts`.`repost_token` = '5e7dbad189d62e8ee375925e9f47a0dd2c9c094d') LIMIT 1;
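
A minimal sketch of the index tpol suggests here, using the table and column names from the query above; the index name itself is an assumption.

    ALTER TABLE `posts` ADD INDEX `index_posts_on_repost_token` (`repost_token`);

    -- The repost_token lookup above should then stop examining ~225k rows and use the index instead.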
[12:50pm] tpol: Bino: sure... so how mysql works is
[12:51pm] tpol: data is accessed from disk... once data has been accessed once it is stored in a memory cache...
[12:51pm] tpol: for innodb tables it is stored in the innodb_buffer_pool
[12:52pm] tpol: for MyISAM the indexes are stored in the key_buffer and the data is stored by the OS in OS cache
[12:52pm] tpol: as new data is read from disk it overwrites the oldest contents of the memory caches
[12:53pm] tpol: these caches are what enables mysql to return large amounts of data really quickly
[12:54pm] tpol: when the "active data set" (the data you actively request in your application) is larger than the cache you will see almost constant disk reads on the database
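
One rough way to check whether the active data set fits in the caches tpol describes above, using standard MySQL status counters (what counts as "too many" disk reads depends on the workload):

    SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

    -- Innodb_buffer_pool_reads counts reads that had to go to disk; if it climbs
    -- quickly relative to Innodb_buffer_pool_read_requests, the buffer pool is
    -- too small for the active data set.
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';

    -- Same idea for MyISAM indexes: compare Key_reads against Key_read_requests.
    SHOW GLOBAL STATUS LIKE 'Key_read%';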
[12:55pm] tpol: Bino: here is the output of vmstat
[12:55pm] tpol: 0 4 8768 286276 25836 72148 0 0 53 326 4 2 10 0 87 2
[12:55pm] tpol: 1 2 8768 285912 25876 72164 0 0 4756 228 2087 2155 3 1 47 50
[12:55pm] tpol: 2 1 8768 286020 25892 72176 0 0 5618 36 1881 2347 16 0 34 47
[12:55pm] tpol: 0 2 8768 286028 25916 72172 0 0 3416 58 1584 1945 3 1 48 48
[12:55pm] tpol: 0 5 8768 280228 25932 78372 0 0 5476 26 1957 2160 6 1 44 49
[12:55pm] tpol: sorry hang on
[12:55pm] tpol: procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
[12:55pm] tpol: r b swpd free buff cache si so bi bo in cs us sy id wa
[12:55pm] tpol: 0 4 8768 286276 25836 72148 0 0 53 326 4 2 10 0 87 2
[12:55pm] tpol: 1 2 8768 285912 25876 72164 0 0 4756 228 2087 2155 3 1 47 50
[12:55pm] tpol: 2 1 8768 286020 25892 72176 0 0 5618 36 1881 2347 16 0 34 47
[12:55pm] tpol: 0 2 8768 286028 25916 72172 0 0 3416 58 1584 1945 3 1 48 48
[12:55pm] tpol: 0 5 8768 280228 25932 78372 0 0 5476 26 1957 2160 6 1 44 49
[12:55pm] tpol: there... so under IO there are two columns... bi and bo
[12:56pm] tpol: bi = blocks read from disk into memory
[12:56pm] tpol: bo = blocks written to disk
[12:56pm] tpol: the first line is the average
[12:56pm] tpol: the average is based on when the server started running
[12:56pm] tpol: the server has been up for 54 days
[12:57pm] tpol: the second line (and each line after it) is the average for the item over the previous 2 seconds
[12:57pm] Bino: ok, looking, sorry
[12:57pm] Bino: got a call
[12:57pm] tpol: so you can see that the long-running average is very low... not a lot of bi at 53
[12:58pm] tpol: but the results after that are in the thousands and are actually maxing out the read/write capability of the disks at times
[12:58pm] Bino: right, so where does that come from?
[12:58pm] Bino: its from reading active data
[12:58pm] Bino: ?
[12:59pm] tpol: its reading so much data that the data is overwritten in the cache very shortly after being written
[12:59pm] tpol: so the caches need to be larger
[1:00pm] tpol: you recently shrunk several of your caches didn't you?
[1:00pm] Bino: umm, we reduced the innodb buffer
[1:00pm] tpol: that would do it
[1:01pm] Bino: and then today we reduced the ram
[1:01pm] Bino: so the question is
[1:01pm] Bino: we JUST moved to a fractional cluster
[1:01pm] Bino: from an env with much less RAM and less buffer
[1:01pm] Bino: but we ran better on that env than this one
[1:02pm] tpol: weren't you on a shared db before?
[1:02pm] Bino: when we moved over, we were given (and charged for) a lot more ram than is standard, so we're trying to get down to the standard level which should still be much more than our previous environment
[1:03pm] Bino: yeah, we were on a shared
[1:04pm] tpol: so a shared db has a minimum of 4096MB of ram
[1:04pm] tpol: its shared but its possible that your database was using 80-90% of those memory buffers
[1:07pm] tpol: hmm... well if you really want to stay at this memory level you could direct some reads to the replica and that will spread out the disk IO's between two subsystems but my recommendation would be to increase memory and increase your innodb_buffer... you might not need to go to 4096 but I think you'll benefit from more than you have
[1:08pm] tpol: the *standard* memory configs are just a starting point for the fractional configurations... we almost always have to tweak that for how the database is used and tuned
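
A sketch of how the settings being discussed could be checked; these are standard MySQL variables, and on the MySQL versions in use at the time innodb_buffer_pool_size is not dynamic, so raising it means editing my.cnf and restarting mysqld.

    SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
    SHOW VARIABLES LIKE 'key_buffer_size';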
[1:11pm] Bino: ok, I just am confused because it seems if you come from a shared to a dedicated, you should be running awesome right from the start, there wouldn't be a need to have so much memory
[1:12pm] Bino: so you don't think optimizing the queries will solve the problem
[1:12pm] Bino: its a big expense for us at this point so we need to look at every avenue before spending on memory
[1:13pm] tpol: there are a few items that can be tuned but they are relatively infrequent compared to other items
[1:13pm] tpol: how is performance?
[1:14pm] tpol: are the queries slow to the point where application response is suffering
[1:14pm] tpol: or are we just trying to adapt to a few instances of latency?
[1:15pm] tpol: also are you working more with staging today than you were last week
[1:15pm] Bino: well, we have times where we just come to a standstill and then break through it and things are fine
[1:15pm] Bino: I haven't even been able to get on staging which is another issue, its been timing out and I'm not sure why
[1:15pm] tpol: because your staging db will push your production data out of cache
[1:15pm] Bino: I have a ticket for that
[1:16pm] Bino: I didn't know they were even related
[1:16pm] tpol: hmm... odd, I've seen at least a couple staging queries
[1:16pm] Bino: I thought they were totally separate
[1:18pm] tpol: its hosted on your fractional db... usually we ask about that during the deploy because we can put your staging db on to a shared staging instance if you'd prefer
[1:18pm] tpol: but if you aren't really using it then it shouldn't be impacting you
[1:19pm] Bino: we use it just to test things before prod