@brandur
Last active May 20, 2021 16:17
RE: RE: gh-ost

Hi Shlomi,

First of all, my apologies for the tone of my original post. It was written without much reflection, and reading back through it, I see that it does in fact read as highly critical. You're correct in invoking XKCD #438.

I've tried to address a few of the points that you brought up in your second post.

Misdirection

I want to start with the last part of your article first. Regarding misdirection. You quoted this line from my writing:

"there's a level of seemingly willful misdirection here that I just can't wrap my head around"

But I think it's important to read the entire paragraph for context here:

I understand that sometimes you’re stuck on a legacy system, and sometimes that may require the development and application of solutions that would be inadvisable under normal conditions. But what kills me here is that there isn’t one place throughout GitHub’s write-up on Gh-ost where they mention that the entire project was built to address deficiencies in MySQL, and that there might be a better way if you’re not already locked into a big MySQL cluster. In fact, they’ll even go out of their way to champion MySQL over better alternatives [1] despite being fully aware how problematic its use can be. There’s a level of seemingly willful misdirection here that I just can’t wrap my head around.

I wrote that because most of the people browsing Hacker News and reading posts from the likes of the GitHub Engineering blog are almost certainly very technical, but they probably don't have much experience operating a large production database like the kind you'd find at a company like yours or mine. The next time they're evaluating a data store for a new project or company, the more subjective part of their evaluation criteria will be what they remember reading good things about over the last couple of years.

To some degree, every post that goes out on the engineering blog of a high-profile company like GitHub, Stripe, Heroku, or Square is a sales pitch, in that if it's convincing enough, people will remember and emulate it.

For that reason, I'd posit that it's our responsibility as writers to articulate the nuances of technology. If a post champions some use of a new tool, it should address the trade-offs and other considerations that go along with it. To give you an example, I work at Stripe, and we could write a long post about the techniques we use to operate Mongo at large scale with a good degree of success, but if we did, I would campaign strongly for a reasoned section on how and why Mongo is a deeply flawed data store, with the conclusion that if you're starting fresh, there's a very good chance you shouldn't use it.

Reading your original post on gh-ost, it would be easy for a database novice to take away that their best bet for starting a new system might be a MySQL cluster augmented with gh-ost to allow online migrations. That may have been intentional on your part, but I would disagree with that premise because a system like Postgres will get you most of those upsides without the significant cost of installing, and developing the expertise necessary to operate, a new (rather complex) component in your stack.

You characterized my original write-up as a "Why not PostgreSQL?" post, and it's true that I mention Postgres quite a few times because that's the system I happen to know best, but my intention was only to use it as a counterexample. I'd consider myself something of a Postgres critic as well, having run into and written about some of its more degenerate behavior [2].

To clarify my position, my issue with the write-up on gh-ost is that there is no acknowledgement that most (not all) of the problems it was designed to solve are solved for free just by using a system that isn't MySQL. If I did focus on a particular technology in my post, that seems like a minor overstep compared to the original not even tangentially acknowledging that alternatives might exist (especially when they might be quite a bit more operationally efficient than running a separate system to help add or remove columns).

To go back to my Mongo example, this would be a little like me writing about how we designed a custom locking system on top of Mongo to solve the problem that multiple requests on the same resources could conflict with each other (which we did). However, I'd consider myself remiss if I didn't also mention that this mechanism was only required because Mongo doesn't give you isolation (in the ACID sense), and that any RDBMS developed in the last 20+ years would have given us this literally out of the box (and in most cases, in a way that's better and faster) with very few caveats.

It's hugely interesting to see posts by companies like yours that describe how they're tackling some of their hardest engineering problems with clever solutions like gh-ost. However, I think we need to adequately weigh the message those posts might be sending, and at least put some effort into talking about downsides and alternatives.

Technical Details

Mostly covering the sections "the particular claim and advice" and "moreover":

The post does not mention that PostgreSQL supports online schema changes for a limited set of operations. I went ahead to double check with the PostgreSQL documentation. I love the documentation! It is detailed and honest. I'm thoroughly satisfied that PostgreSQL only supports a limited set of online operations. I go against the fact the post does not mention that, and leads us to "understand" PostgreSQL has a magic wand.

Agreed about the Postgres docs! They're great.

Also agreed that I should have been more explicit about the downsides of Postgres in this respect. But I'd still posit that for the most common operations you'll need to perform as part of day-to-day schema maintenance, Postgres does have the equivalent of "a magic wand". As you already mentioned, that includes:

  • Adding a column (with some caveats)
  • Dropping a column
  • Adding an index
  • Dropping an index
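
For concreteness, each of those is a single short DDL statement in Postgres (the table and index names here are hypothetical):

```sql
-- Adding a column: ~instantaneous as long as it has no default value
-- (the caveat noted above).
ALTER TABLE users ADD COLUMN nickname text;

-- Dropping a column: also ~instantaneous; the data is cleaned up
-- lazily by vacuum.
ALTER TABLE users DROP COLUMN nickname;

-- Adding an index: CONCURRENTLY builds it without blocking writes.
CREATE INDEX CONCURRENTLY index_users_on_email ON users (email);

-- Dropping an index.
DROP INDEX index_users_on_email;
```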

However a large part of our migrations also consists of adding columns with default values, changing data types (e.g. From INT to BIGINT), changing character characteristics (e.g. length of text column), space reclamation, and others. These changes are blocking in PostgreSQL.

Right on. These can be a little more finicky in some cases. To address the ones you listed:

Changing Data Types (INT to BIGINT)

I've had to do this exact thing quite a few times now, and I wish SQL could be changed to remove this footgun: using INT isn't likely to save much space in the long run (or any at all on a 64-bit machine running Postgres, except where it's involved in a composite index), but it can and often will cause serious problems later in a system's lifecycle.

I'd be the first to admit that this is one of those operations, along with those like adding a new column with a default value, that Postgres doesn't handle any better than MySQL (and to be fair, these are all very tricky cases from the database's perspective).

However, I'd also say that it's one of the easier cases to handle with the help of your application-level code:

  1. Create new column (remember, in Postgres an ~instantaneous operation!).
  2. Start double-writing to the new column from code (or as a database trigger).
  3. Migrate current values to the new column. If you're concerned about utilization (say, for a large table), write this as a throttled application-level migration rather than as a single UPDATE statement.
  4. Start reading from the new column from code.
  5. Drop old column (also ~instantaneous; it'll get cleaned up on the next vacuum).
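
As a sketch of the SQL involved (the table and column names are hypothetical; steps 2 and 4 live in application code), widening a `user_id` column might look like:

```sql
-- Step 1: create the new column (~instantaneous; no table rewrite).
ALTER TABLE events ADD COLUMN user_id_bigint bigint;

-- Step 3: backfill in throttled batches rather than one big UPDATE.
UPDATE events
    SET user_id_bigint = user_id
    WHERE id IN (
        SELECT id FROM events
        WHERE user_id_bigint IS NULL
        LIMIT 1000
    );
-- (repeat from application code, sleeping between batches, until no
-- rows remain)

-- Step 5: drop the old column and rename the new one into place.
ALTER TABLE events DROP COLUMN user_id;
ALTER TABLE events RENAME COLUMN user_id_bigint TO user_id;
```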

To me this sounds quite a bit easier than writing and maintaining a separate system to abstract the problem away, but we may disagree there.

Changing Length of Text Column

I'd have to say that this one once again falls under the "magic wand" category. VARCHAR and the arbitrarily-long TEXT are implemented with the same type under the hood, and Postgres will invoke TOAST [3] as necessary to store oversized strings. Good practice here is to always use TEXT and constrain length with a CHECK constraint, which can be changed trivially [4].
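
A sketch of that practice (hypothetical names):

```sql
-- Use TEXT plus a CHECK constraint instead of VARCHAR(n).
ALTER TABLE users ADD CONSTRAINT users_name_length
    CHECK (char_length(name) <= 100);

-- "Changing the length" is then just swapping the constraint. Adding
-- the new one with NOT VALID skips the full-table scan while the lock
-- is held; VALIDATE CONSTRAINT afterwards takes only a weak lock.
ALTER TABLE users DROP CONSTRAINT users_name_length;
ALTER TABLE users ADD CONSTRAINT users_name_length
    CHECK (char_length(name) <= 200) NOT VALID;
ALTER TABLE users VALIDATE CONSTRAINT users_name_length;
```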

Space Reclamation

Right. Reclaiming space on a database still requires a VACUUM FULL, which is something you can never run in production on a table of non-trivial size. I'd acknowledge this as a feature that gh-ost gets you, but that said, I've never found a particular need for it.

I'd be curious to hear about why you've found the need to run this frequently (although I'm well aware that I sound like a Postgres apologist by that statement).

MySQL's Online DDL

As mentioned in the gh-ost announcement, MySQL as of 5.6 does indeed support online, non blocking alter table. In fact, it supports many more variants of online alter table than PostgreSQL does (however, noticeable difference is that PostgreSQL makes those changes transactional whereas MySQL does not).

Admittedly I haven't looked at this much, but I thought that you yourself seem to have provided pretty good justification not to use that mechanism in your original post:

MySQL’s Online DDL for InnoDB is only “online” on the server on which it is invoked. Replication stream serializes the alter which causes replication lag. An attempt to run it individually per-replica results in much of the management overhead mentioned above. The DDL is uninterruptible; killing it halfway results in long rollback or with data dictionary corruption. It does not play “nice”; it cannot throttle or pause on high load. It is a commitment into an operation that may exhaust your resources.

Correct me if I'm wrong on that point.

Throttling

gh-ost addresses that as well. Its throttling gives super powers over the migration process, kicking in throttling based on master load, replication lag, various human controlled criteria, effectively making it very lightweight on the master.

I'd be the first to acknowledge that throttling is certainly a good thing, and it's cool that it's built into gh-ost!

That said, I've had very good luck at the companies I've worked at just baking throttling into application-level migrations. This also has the advantage of giving you access to all your domain models and other functionality provided by your source code, so that you're not dealing only with database primitives. It's even quite easy to implement with just a periodic sleep, although a more sophisticated approach is recommended.
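
For illustration, the same idea expressed purely in SQL against a hypothetical `payments` table (note that a DO block runs in a single transaction, so a real application-level migration would commit between batches instead):

```sql
DO $$
DECLARE
    rows_updated integer;
BEGIN
    LOOP
        -- Backfill one batch of the hypothetical new column.
        UPDATE payments
            SET amount_cents = amount * 100
            WHERE id IN (
                SELECT id FROM payments
                WHERE amount_cents IS NULL
                LIMIT 1000
            );
        GET DIAGNOSTICS rows_updated = ROW_COUNT;
        EXIT WHEN rows_updated = 0;
        PERFORM pg_sleep(0.5);  -- the periodic sleep doing the throttling
    END LOOP;
END
$$;
```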

This of course isn't going to work in some cases like CREATE INDEX, but I've had very good luck with Postgres' CONCURRENTLY keyword there; I've used it to raise hundreds of indexes on large tables with no noticeable user impact based on our measurements.
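
The concurrent build is a one-liner (index and table names hypothetical):

```sql
-- Builds the index without taking a write-blocking lock. It takes
-- longer, can't run inside a transaction, and a failed build leaves
-- an INVALID index behind that should be dropped and retried.
CREATE INDEX CONCURRENTLY index_payments_on_account_id
    ON payments (account_id);
```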

Summary

Just a few points in recap:

  • I'm admittedly a bad person for writing reaction posts that are worded too aggressively. It's easy to forget there's a human on the other end.

  • As writers working for tech companies that are generally held in quite high regard, I think it's our responsibility to disclose the trade-offs, such as they are, of the solutions we're presenting, because readers may weigh them heavily.

  • That doesn't necessarily mean we shouldn't write about how we're solving interesting problems. There are a lot of people out there (maybe the majority of us, and certainly including me) who are working in stacks they would never choose again given the chance to start from a green field. Unfortunately, that's reality, and we have to do our best with it.

  • That said, I would stand by my original position that for many uses of gh-ost, especially the more common ones, Postgres does present a "magic wand" in comparison. It may have its own disadvantages, but to my eye they seem fairly minor compared to what you might face with MySQL [5].

My next obvious step here is to add an addendum to my original post addressing its deficiencies and oversteps, which I'll do posthaste [6].

Once again, apologies for the tone of the original article, and thanks for taking the time to write a more detailed follow-up. Let's indeed get beers sometime.

Brandur

[1] https://twitter.com/vmg/status/757987482478776320

[2] https://brandur.org/postgres-queues

[3] https://www.postgresql.org/docs/9.6/static/storage-toast.html

[4] https://dba.stackexchange.com/questions/20974/should-i-add-an-arbitrary-length-limit-to-varchar-columns

[5] Uber had a recent post on this subject, but I didn't find it particularly convincing because even though some of their misgivings were valid, they also failed to address any trade-offs.

[6] https://github.com/brandur/sorg/pull/35/files

@shlomi-noach

Hi Brandur,

Thank you for your consideration!
I've noted in my post that your original post has been modified, and provided a link to this gist.

A few comments and clarifications:

I hear your concern about popular posts talking about some technology, and your wish that they would mention the deficiencies in said technology. If you read my personal blog you'll find rantings on MySQL behavior going all the way back to 2009. I do not think a post such as the one I published on the GitHub engineering blog was the place for listing deficiencies. I would certainly think that the part where I review the 5.6 online alter behavior is a pointer to a deficiency: I describe an existing feature in 5.6, then proceed to explain its limitations and why we would not use it.
But regardless, I don't see that the GitHub post was a glorifying song for MySQL. Indeed, you read it as "GitHub solved a problem that is not solved within MySQL itself", and I think this is self-explanatory; novices should infer the same.

To me this sounds quite a bit easier than writing and maintaining a separate system to abstract the problem away, but we may disagree there.

We may indeed disagree. gh-ost (or other existing online-schema-migration tools, for that matter) is a one-time development effort which generically and transparently solves the problem without involving a developer. Those are quite a few steps you're describing, involving at least two code deployments and further steps in between.
An online operation just does the operation, without changes to code. Once you have it, your future migrations are solved. Isn't this the essence of automation?

...but I thought that you yourself seem to have provided pretty good justification not to use that mechanism in your original post

I did. It was the MySQL deficiency I was describing. And yet it's an online operation that your original post claimed did not exist. If this feature did the job it was meant to do, gh-ost would not need to exist. At lower scales, it may just be good enough (though operationally there's overhead as one would want to execute it per server).

I'd be curious to hear about why you've found the need to run this frequently

Not so frequently; for some tables we may load lots of data, then delete lots of data, and disk-space reclamation can be beneficial. This is also useful when you decide to extract some table data to another table as a one-time operation (say, move all handled events to an events_history table, then reclaim the space of those rows).

gh-ost is not this bloated code you imagine it to be. It's not trivial, but it was still built in a few months of a developer's work, and we think that was time well invested.

Kind regards
