@jgagne33
Created March 24, 2009 22:17
Bino: hi
[11:51am] tpol: hi... so I just noticed a long running query as I was logging in to check things over...
[11:52am] Bino: ok, great, just what we're looking for
[11:53am] tpol: the db was basically reading like crazy... whatever the query was that cleared the cache... it probably didn't impact you previously because your cache was so large
[11:59am] Bino: so the query cleared the cache?
[11:59am] Bino: can you find the query?
[12:02pm] tpol: hmm... I'm not seeing it yet
[12:08pm] You left the chat by being disconnected from the server.
[12:08pm] You rejoined the room.
[12:08pm] Bino left the chat room. (Read error: 104 (Connection reset by peer))
[12:16pm] You left the chat by being disconnected from the server.
[12:17pm] You rejoined the room.
[12:17pm] Bino_ left the chat room. (Read error: 104 (Connection reset by peer))
[12:22pm] Bino: sorry, connection probs here tpol
[12:22pm] Bino: any luck?
[12:23pm] tpol: well I see a couple of queries running that are using derived tables and look odd... but I'm not seeing one particularly ugly query
[12:25pm] tpol: there is a sphinx indexing query hitting the db that is #2 in your top queries
[12:26pm] Bino: yeah, looking over that with Erik Jones in a ticket
[12:26pm] tpol: what is the schedule for sphinx indexing... could they be hitting at the same time
[12:26pm] Bino: yeah, we want to stagger them, I've actually asked a bunch of times for that
[12:27pm] Bino: Thought it was already done but I can see if our other dev has time
[12:28pm] Bino: was supposed to be staggered when the fractional cluster was set-up
[12:28pm] Bino: there are also some frequent jobs that start at the same time that I thought could be staggered also
[12:28pm] tpol: you indicated a spike at 10:45... was that EST?
[12:28pm] Bino: no pacific
[12:29pm] Bino: sorry, shoulda mentioned
[12:29pm] tpol: and you were looking at newrelic?
[12:30pm] Bino: yeah
[12:31pm] tpol: were you looking at the Active Record activity chart or something else?
[12:41pm] Bino: sorry tpol I was looking at the overview response time and throughput and AR activity
[12:41pm] tpol: ok
[12:42pm] tpol: # Query_time: 43.609536 Lock_time: 0.000000 Rows_sent: 1 Rows_examined: 200194
[12:42pm] tpol: SELECT `login` FROM `users` WHERE (LOWER(`users`.login) = 'ines');
[12:42pm] Bino: basically it looks like performance is great outside of a few sticking points where things spike and then it comes right back in line
[12:42pm] Bino: if we could pick those things off, I think we'd have good performance
[12:42pm] Bino: woah!
[12:42pm] Bino: 43 seconds for a login
[12:43pm] tpol: that's doing a full table scan
[12:43pm] tpol: it can't use an index because of the use of LOWER
[12:43pm] Bino: jeeze
[12:43pm] Bino: so if we take out that lower, it will be able to?
[12:44pm] tpol: yes... users table is utf8 which is case-insensitive
[12:44pm] Bino: could that have been waiting on another query? is that why it took so long? Logging in seems to be super fast in general (although a full-table scan needs to be fixed anyway)
[12:44pm] tpol: and you have an index on login so that would use an index if you drop the lower
[12:45pm] Bino: ok, cool I'll do that
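For illustration, a sketch of the change tpol describes here, assuming the existing index on users.login and the case-insensitive utf8 collation noted above:

    -- dropping LOWER() lets MySQL use the index on login; the utf8
    -- case-insensitive collation already makes the comparison case-insensitive
    SELECT `login` FROM `users` WHERE `users`.`login` = 'ines';
    -- EXPLAIN on this form should show a ref lookup on the login index
    -- rather than a scan of all ~200k rows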
[12:46pm] tpol: your database is doing a lot of disk reads... it could be that there just isn't enough memory to adequately cache your database
[12:47pm] tpol: that seems to be the common theme in the slow logs... there isn't really one query that slows it down... they are all slower than they should be because just about everything is reading from disk
[12:48pm] tpol: and because they are all contending for the disk resource they are slowing each other down
[12:48pm] Bino: I don't really understand what it means, "reading from disk"
[12:48pm] Bino: as opposed to reading from cache?
[12:50pm] tpol: posts is heavily indexed but an index on repost_token would help
[12:50pm] tpol: # Query_time: 48.569710 Lock_time: 0.000000 Rows_sent: 1 Rows_examined: 225509
[12:50pm] tpol: SELECT * FROM `posts` WHERE (`posts`.`repost_token` = '5e7dbad189d62e8ee375925e9f47a0dd2c9c094d') LIMIT 1;
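A sketch of the index tpol suggests for that lookup (the index name is illustrative):

    ALTER TABLE `posts` ADD INDEX `index_posts_on_repost_token` (`repost_token`);
    -- lets the repost_token lookup above find its row via the index
    -- instead of examining ~225k rows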
[12:50pm] tpol: Bino: sure... so how mysql works is
[12:51pm] tpol: data is accessed from disk... once data has been accessed once it is stored in a memory cache...
[12:51pm] tpol: for innodb tables it is stored in the innodb_buffer_pool
[12:52pm] tpol: for MyISAM the indexes are stored in the key_buffer and the data is stored by the OS in OS cache
[12:52pm] tpol: as new data is read from disk it overwrites the oldest contents of the memory caches
[12:53pm] tpol: these caches are what enables mysql to return large amounts of data really quickly
[12:54pm] tpol: when the "active data set" (the data you actively request in your application) is larger than the cache you will see almost constant disk reads on the database
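A rough way to check whether the active data set still fits in the InnoDB buffer pool, using standard MySQL status counters (what counts as an acceptable miss ratio depends on the workload):

    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';
    -- Innodb_buffer_pool_read_requests: logical read requests
    -- Innodb_buffer_pool_reads:         requests that had to go to disk
    -- a large, steadily growing reads-to-requests ratio suggests the pool
    -- is too small for the data the application actually touches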
[12:55pm] tpol: Bino: here is the output of vmstat
[12:55pm] tpol: 0 4 8768 286276 25836 72148 0 0 53 326 4 2 10 0 87 2
[12:55pm] tpol: 1 2 8768 285912 25876 72164 0 0 4756 228 2087 2155 3 1 47 50
[12:55pm] tpol: 2 1 8768 286020 25892 72176 0 0 5618 36 1881 2347 16 0 34 47
[12:55pm] tpol: 0 2 8768 286028 25916 72172 0 0 3416 58 1584 1945 3 1 48 48
[12:55pm] tpol: 0 5 8768 280228 25932 78372 0 0 5476 26 1957 2160 6 1 44 49
[12:55pm] tpol: sorry hang on
[12:55pm] tpol: procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
[12:55pm] tpol: r b swpd free buff cache si so bi bo in cs us sy id wa
[12:55pm] tpol: 0 4 8768 286276 25836 72148 0 0 53 326 4 2 10 0 87 2
[12:55pm] tpol: 1 2 8768 285912 25876 72164 0 0 4756 228 2087 2155 3 1 47 50
[12:55pm] tpol: 2 1 8768 286020 25892 72176 0 0 5618 36 1881 2347 16 0 34 47
[12:55pm] tpol: 0 2 8768 286028 25916 72172 0 0 3416 58 1584 1945 3 1 48 48
[12:55pm] tpol: 0 5 8768 280228 25932 78372 0 0 5476 26 1957 2160 6 1 44 49
[12:55pm] tpol: there... so under IO there are two columns... bi and bo
[12:56pm] tpol: bi = blocks read from disk into memory
[12:56pm] tpol: bo = blocks written to disk
[12:56pm] tpol: the first line is the average
[12:56pm] tpol: the average is based on when the server started running
[12:56pm] tpol: the server has been up for 54 days
[12:57pm] tpol: each line after that is the average over the preceding 2 second interval
[12:57pm] Bino: ok, looking, sorry
[12:57pm] Bino: got a call
[12:57pm] tpol: so you can see that the overall average is very low... not a lot of bi at 53
[12:58pm] tpol: but the results after that are in the thousands and are actually maxing out the read/write capability of the disks at times
[12:58pm] Bino: right, so where does that come from?
[12:58pm] Bino: its from reading active data
[12:58pm] Bino: ?
[12:59pm] tpol: its reading so much data that the data is overwritten in the cache very shortly after being written
[12:59pm] tpol: so the caches need to be larger
[1:00pm] tpol: you recently shrunk several of your caches didn't you?
[1:00pm] Bino: umm, we reduced the innodb buffer
[1:00pm] tpol: that would do it
[1:01pm] Bino: and then today we reduced the ram
[1:01pm] Bino: so the question is
[1:01pm] Bino: we JUST moved to a fractional cluster
[1:01pm] Bino: from an env with much less RAM and less buffer
[1:01pm] Bino: but we ran better on that env than this one
[1:02pm] tpol: weren't you on a shared db before?
[1:02pm] Bino: when we moved over, we were given (and charged for) a lot more ram than is standard, so we're trying to get down to the standard level which should still be much more than our previous environment
[1:03pm] Bino: yeah, we were on a shared
[1:04pm] tpol: so a shared db has a minimum of 4096MB of ram
[1:04pm] tpol: it's shared but it's possible that your database was using 80-90% of those memory buffers
[1:07pm] tpol: hmm... well if you really want to stay at this memory level you could direct some reads to the replica and that will spread out the disk IO's between two subsystems but my recommendation would be to increase memory and increase your innodb_buffer... you might not need to go to 4096 but I think you'll benefit from more than you have
[1:08pm] tpol: the *standard* memory configs are just a starting point for the fractional configurations... we almost always have to tweak that for how the database is used and tuned
[1:11pm] Bino: ok, I just am confused because it seems if you come from a shared to a dedicated, you should be running awesome right from the start, there wouldn't be a need to have so much memory
[1:12pm] Bino: so you don't think optimizing the queries will solve the problem
[1:12pm] Bino: its a big expense for us at this point so we need to look at every avenue before spending on memory
[1:13pm] tpol: there are a few items that can be tuned but they are relatively infrequent compared to other items
[1:13pm] tpol: how is performance?
[1:14pm] tpol: are the queries slow to the point where application response is suffering
[1:14pm] tpol: or are we just trying to adapt to a few instances of latency?
[1:15pm] tpol: also are you working more with staging today than you were last week
[1:15pm] Bino: well, we have times where we just come to a standstill and then break through it and things are fine
[1:15pm] Bino: I haven't even been able to get on staging which is another issue, its been timing out and I'm not sure why
[1:15pm] tpol: because your staging db will press your production data out of cache
[1:15pm] Bino: I have a ticket for that
[1:16pm] Bino: I didn't know they were even related
[1:16pm] tpol: hmm... odd, I've seen at least a couple staging queries
[1:16pm] Bino: I thought they were totally separate
[1:18pm] tpol: its hosted on your fractional db... usually we ask about that during the deploy because we can put your staging db on to a shared staging instance if you'd prefer
[1:18pm] tpol: but if you aren't really using it then it shouldn't be impacting you
[1:19pm] Bino: we use it just to test things before prod
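A quick way to see staging traffic landing on the same instance (SHOW FULL PROCESSLIST is standard MySQL; the schema names are whatever production and staging actually use):

    SHOW FULL PROCESSLIST;
    -- the db column shows which schema each connection is bound to,
    -- so staging queries stand out next to the production database name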
[1:20pm] smergulhao joined the chat room.
[1:22pm] tpol: we can also increase the buffer sizes gradually until we reach a point where performance is stabilized, but each change will require a database restart
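For reference, a sketch of checking the setting tpol is talking about; on this generation of MySQL the buffer pool is not resizable at runtime, which is why each change needs a restart:

    SHOW VARIABLES LIKE 'innodb_buffer_pool_size';   -- current size, in bytes
    -- raising it means editing innodb_buffer_pool_size in my.cnf
    -- and restarting mysqld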
[1:26pm] Bino: right, so this is what we've been doing over the past week
[1:28pm] Bino: we went from 2.5 to 1.5 I believe
[1:28pm] Bino: and then we just brought RAM down to 2.5 from 4
[1:32pm] tpol: reviewing the config it looks like key_buffer is well sized at 128M... I wouldn't want to reduce it much from that given the ~20M of MyISAM tables you have and the use for temporary tables
[1:32pm] tpol: that and reducing it wouldn't give us much
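Similarly, the key_buffer sizing can be sanity-checked with the MyISAM key cache counters (standard status variables):

    SHOW GLOBAL STATUS LIKE 'Key_read%';
    -- Key_read_requests: index block read requests
    -- Key_reads:         requests that missed the key cache and hit disk
    -- with only ~20M of MyISAM data, 128M of key_buffer leaves plenty of headroom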
[1:36pm] tpol: Bino: you know.. I don't think this is too far "over the edge" given the iowait charting in nagios
[1:37pm] tpol: 512M in the innodb_buffer might make a world of difference
[1:37pm] Bino: yeah?
[1:37pm] Bino: ok, do that and see how it goes
[1:37pm] Bino: and we can leave RAM where it is on the db slices?
[1:37pm] Bino: please btw
[1:37pm] tpol: no... we'd have to bump them by 512M to accommodate
[1:37pm] tpol: sec
[1:38pm] tpol: Bino: what happened thursday?
[1:38pm] Bino: I think that's when we reduced the buffer
[1:38pm] Bino: why? what are you seeing
[1:39pm] tpol: sec... friday morning 10:10
[1:39pm] Bino: not really sure, related to what? a big spike or something?
[1:41pm] tpol: ok... posted a pic
[1:42pm] tpol: in the ticket
[1:42pm] Bino: would that be due to a db restart
[1:42pm] Bino: ?
[1:43pm] Bino: I believe that's what it is
[1:43pm] Bino: we reduced the buffer and restarted the db
[1:43pm] tpol: hmm... I show a restart on Wed, Thurs, Fri and Mon
[1:43pm] Bino: yeah, I seem to remember we were down much longer on friday
[1:44pm] Bino: but other than that I can't think what it could be
[1:45pm] tpol: the config was edited on the 20th and Erik said thats when the buffer size change went in
[1:45pm] tpol: looking at the io trend for wednesday, which appears to be ~normal for the prior days of tracking on this db, the IOs for Friday and the weekend were higher
[1:46pm] tpol: monday and tuesday they are considerably higher
[1:48pm] tpol: but the general usage trend (shaving off the spikes) shows a similar level to what was occurring on friday
[1:48pm] tpol: can you tell if your traffic has been slightly higher the last two days compared to friday... that would be my suspicion
[1:48pm] Bino: def
[1:49pm] Bino: mon is the highest of the week and gradually trends downward to the end of the week with sat and sun being very low
[1:49pm] tpol: oh
[1:49pm] tpol: yep... thats about right then
[1:54pm] tpol: so I think more memory will help you and because you are sensitive to the costs I would say start with 512M
[1:55pm] tpol: if you don't want to go with memory try tweaking those two queries and we'll see where that gets you
[2:08pm] Bino: ok, so we'll tweak those two queries and I'll discuss the costs with James
[2:08pm] Bino: thanks a lot for your help!
[2:08pm] Bino: I appreciate it
[2:21pm] tpol: Bino: sure thing...