
@makmanalp
Created September 21, 2023 15:16
Why "let's do force index on every query we have" might not be helpful

TLDR: well intentioned but ultimately unhelpful IMHO. Here's why:

  1. It's easy to make a judgment about bad query plans based on an extremely biased sample. To give you a sense of the variety of queries we have: as of today there are over 180k unique query fingerprints (normalized query shapes, with the literal values stripped out) at HubSpot. Even ignoring the trivial ones, about 18k unique fingerprints run at > 1000 queries/sec. To be sure, query planner bugs are real, and I'm currently fairly sure we've hit one here (details later), but out of that total, only a minuscule fraction is /truly/ (more on this later) query planner silliness.
  2. By contrast, humans can be quite bad at figuring out what index a query needs, and they will compare dismally to the planner's success rate above if they start adding FORCE INDEX to everything manually. I mess it up often. I see smart, competent, experienced engineers mess it up quite literally every day. People have attempted to codify the rules for this exhaustively - every time I scroll through that page I get a dose of humility. (Yes, there are rules of thumb and ways we can assist ...)
  3. Lots of things that materially affect the "best" query plan change often in a production environment, and humans cannot keep up. Read more about transient plans here. Data size is definitely not the only thing. A few examples off the top of my head:
    1. The specific values / arguments bound to placeholders in the query: a DAO method can have more than one query plan depending on the arguments passed in via user input, and each plan can be the right one for its own set of arguments (a sketch of this follows the list).
    2. Changing the batch size / limit size in a query.
    3. Change in data size: e.g. jobs doing lots of deletes or inserts. Or a change in the cardinality of a field or the distribution of its values: e.g. you purge every row with deleted=True.
    4. Changes to the schema that seem unrelated.
    5. The available memory, or a change in what happens to be in memory at the time.
    6. The version of MySQL and the features it supports.
  4. It's often not a "bad" query plan: a lot of the things we fix via USE INDEX are scenarios where there were only bad index choices to begin with. When MySQL has no good options, the query planner won't have a clear winner, so it's harder for it to make the right choice and/or easier for it to end up with an unstable plan / flip-flop scenario. In the case of a true transient plan, it might be that one plan was not quite bad enough to hit capacity limits while the other is just slightly worse, enough to tip us over at high query rates - so it seems like a flip is always terrible, when in reality we can't count how often flips happen harmlessly in the wild. Usually we make people add the proper, optimal index that was missing from the start, and then there's a clear winner. But often that's not fast enough during a critsit, so we resort to USE/FORCE INDEX even though it's not strictly necessary and is worse than having an optimal index (a sketch of both follows the list).
  5. It's often not a query plan change: a change in query plan is often a last-resort diagnosis we entertain when we run out of ideas, and "it might be this" doesn't always mean it is. The truth is we don't have a good way today of knowing what a plan /was/ for a given query execution in the past (this can be improved a bit, but it's complicated), and it's rare that we're lucky enough to have definitive, smoking-gun proof.
  6. There are nicer tools to deal with most of these scenarios - some of which are up for consideration in Q4. In this case, for example:
    1. We could have caught this before the PR was merged: e.g. running an EXPLAIN clearly shows the wrong index being used (we have an index with all the fields, but the query isn't using it), and we can do a PR invasion to check basic stuff like this in some cases - so it could be something along the lines of "let's make sure engineers have an index in mind for each new query" (a sketch of the EXPLAIN check follows the list).
    2. We could have caught this after the query shipped to PROD: there were signs that this query was trouble as early as a week ago - every time it executed, we got a throttle spike. These signs are hard for humans to spot, but it's easy for a computer to say "hey, this query is new, and by the way ..." and send that over to a human (a sketch of that kind of check follows the list).
    3. With QoS (unluckily, it wasn't rolled out here yet), we would have deprioritized these queries and let user queries through: fewer of them would have made it out of the queue, and the impact would have been much lower.
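
To illustrate point 3.1, here's a minimal sketch. The table, column, and index names are made up for illustration; the point is that two executions of the same query shape can legitimately deserve different plans.

```sql
-- Hypothetical contacts table with indexes on (portal_id, deleted)
-- and (updated_at). Same query shape, different bound values.

-- A small portal: reading via the (portal_id, deleted) index and sorting
-- the handful of matching rows is cheap.
EXPLAIN
SELECT * FROM contacts
WHERE portal_id = 42 AND deleted = 0
ORDER BY updated_at DESC
LIMIT 100;

-- A huge portal: walking the (updated_at) index backwards and filtering
-- can beat sorting millions of rows. Both plans are "good", each for its
-- own arguments - which is exactly what a blanket FORCE INDEX takes away.
EXPLAIN
SELECT * FROM contacts
WHERE portal_id = 99999 AND deleted = 0
ORDER BY updated_at DESC
LIMIT 100;
```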
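For point 4, a sketch of the difference between the critsit stopgap and the real fix (again, all names are hypothetical):

```sql
-- Stopgap during an incident: pin the least-bad existing index by hand.
SELECT id, email
FROM contacts USE INDEX (idx_portal_deleted)
WHERE portal_id = 42 AND deleted = 0 AND lifecycle_stage = 'customer';

-- The real fix: add the optimal index that was missing from the start.
-- Once it exists, the planner picks it on its own and the hint becomes
-- unnecessary (and a liability if the schema changes later).
ALTER TABLE contacts
  ADD INDEX idx_portal_deleted_stage (portal_id, deleted, lifecycle_stage);
```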
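For point 6.1, the pre-merge check is just an EXPLAIN against a realistic dataset, eyeballing (or scripting a check on) the chosen key. Table and index names are hypothetical:

```sql
-- The table already has idx_portal_status (portal_id, status), which
-- covers the new query's WHERE clause.
EXPLAIN
SELECT id FROM deals
WHERE portal_id = 42 AND status = 'OPEN';

-- If the output shows key: NULL (full scan) or some unrelated index with a
-- large "rows" estimate instead of key: idx_portal_status, that's the red
-- flag worth catching in review before the query ships.
```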
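And for point 6.2, one possible signal source (assuming performance_schema is enabled with statement digests on): the digest summary table records when each query fingerprint was first seen, so a job could flag fingerprints that are both new and expensive and route them to a human.

```sql
-- Fingerprints first seen in the last week, ordered by total time spent.
-- SUM_TIMER_WAIT is in picoseconds, hence the division to get seconds.
SELECT DIGEST_TEXT,
       COUNT_STAR,
       ROUND(SUM_TIMER_WAIT / 1e12, 1) AS total_latency_s,
       FIRST_SEEN
FROM performance_schema.events_statements_summary_by_digest
WHERE FIRST_SEEN > NOW() - INTERVAL 7 DAY
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 20;
```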