@brandur
Last active May 20, 2021 16:17
RE: RE: gh-ost

Hi Shlomi,

First of all, my apologies for the tone of my original post. It was written without much reflection, and reading back through it, I see that it does in fact read as highly critical. You're correct in invoking XKCD #438.

I've tried to address a few of the points that you brought up in your second post.

Misdirection

I want to start with the last part of your article first. Regarding misdirection. You quoted this line from my writing:

"there's a level of seemingly willful misdirection here that I just can't wrap my head around"

But I think it's important to read the entire paragraph for context here:

I understand that sometimes you’re stuck on a legacy system, and sometimes that may require the development and application of solutions that would be inadvisable under normal conditions. But what kills me here is that there isn’t one place throughout GitHub’s write-up on Gh-ost where they mention that the entire project was built to address deficiencies in MySQL, and that there might be a better way if you’re not already locked into a big MySQL cluster. In fact, they’ll even go out of their way to champion MySQL over better alternatives [1] despite being fully aware how problematic its use can be. There’s a level of seemingly willful misdirection here that I just can’t wrap my head around.

I wrote that because most of the people browsing Hacker News and reading posts from the likes of the GitHub Engineering blog are almost certainly very technical, but they probably don't have much experience operating a large production database like the kind you'd find at a company like yours or mine. The next time they're evaluating a data store for a new project or company, the more subjective part of their evaluation criteria will be what they remember reading good things about over the last couple of years.

To some degree, every post that goes out on the engineering blog of a high-profile company like GitHub, Stripe, Heroku, or Square is a sales pitch, in that if it's convincing enough, people will remember and emulate it.

For that reason, I'd posit that it's our responsibility as writers to articulate the nuances of technology. If a post champions some use of a new tool, it should address the trade-offs and other considerations that go along with it. To give you an example, I work at Stripe, and we could write a long post about the techniques we use to operate Mongo at large scale with a good degree of success, but if we did, I would campaign strongly for a reasoned section on how and why Mongo is a deeply flawed data store, with the conclusion that if you're starting fresh, there's a very good chance you shouldn't use it.

Reading your original post on gh-ost, it would be easy for a database novice to take away that their best bet for starting a new system might be a MySQL cluster augmented with gh-ost to allow online migrations. That may have been intentional on your part, but I would disagree with that premise because a system like Postgres will get you most of those upsides without the significant cost of installing, and developing the expertise necessary to operate, a new (rather complex) component in your stack.

You characterized my original write-up as a "Why not PostgreSQL?" post, and it's true that I mention Postgres quite a few times because that's the system I happen to know best, but my intention was only to use it as a counterexample. I'd consider myself something of a Postgres critic as well, having run into and written about some of its more degenerate behavior [2].

To clarify my position, my issue with the write-up on gh-ost is that there is no acknowledgement that most (not all) of the problems it was designed to solve are solved for free just by using a system that isn't MySQL. If I did focus on a particular technology in my post, that seems like a minor overstep compared to the original not even tangentially acknowledging that alternatives might exist (especially when they might be quite a bit more operationally efficient than running a separate system to help add or remove columns).

To go back to my Mongo example, this would be a little like me writing about how we designed a custom locking system on top of Mongo to solve the problem that multiple requests on the same resources could conflict with each other (which we did). However, I'd consider myself remiss if I didn't also mention that this mechanism was only required because Mongo doesn't give you isolation (in the ACID sense), and that any RDBMS developed in the last 20+ years would have given us this literally out of the box (and in most cases, in a way that's better and faster) with very few caveats.

It's hugely interesting to see posts by companies like yours that describe how they're tackling some of their hardest engineering problems with clever solutions like gh-ost. However, I think we need to adequately weigh the message those posts might be sending, and at least put some effort into talking about downsides and alternatives.

Technical Details

Mostly covering the sections "the particular claim and advice" and "moreover":

The post does not mention that PostgreSQL supports online schema changes for a limited set of operations. I went ahead to double check with the PostgreSQL documentation. I love the documentation! It is detailed and honest. I'm thoroughly satisfied that PostgreSQL only supports a limited set of online operations. I go against the fact the post does not mention that, and leads us to "understand" PostgreSQL has a magic wand.

Agreed about the Postgres docs! They're great.

Also agreed that I should have been more explicit about the downsides of Postgres in this respect. But I'd still posit that for the most common operations you'll need to perform as part of day-to-day schema maintenance, Postgres does have the equivalent of "a magic wand". As you already mentioned, that includes:

  • Adding a column (with some caveats)
  • Dropping a column
  • Adding an index
  • Dropping an index
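
For concreteness, each of those is a single short DDL statement in Postgres (the table and index names here are hypothetical):

```sql
-- Adding a column: ~instantaneous as long as it has no default value
-- (the caveat noted above).
ALTER TABLE users ADD COLUMN nickname text;

-- Dropping a column: also ~instantaneous; the data is cleaned up
-- lazily by vacuum.
ALTER TABLE users DROP COLUMN nickname;

-- Adding an index: CONCURRENTLY builds it without blocking writes.
CREATE INDEX CONCURRENTLY index_users_on_email ON users (email);

-- Dropping an index.
DROP INDEX index_users_on_email;
```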

However a large part of our migrations also consists of adding columns with default values, changing data types (e.g. From INT to BIGINT), changing character characteristics (e.g. length of text column), space reclamation, and others. These changes are blocking in PostgreSQL.

Right on. These can be a little more finicky in some cases. To address the ones you listed:

Changing Data Types (INT to BIGINT)

I've had to do this exact thing quite a few times now, and I wish SQL could be changed to remove this footgun: using INT isn't likely to save much space in the long run (or any at all on a 64-bit machine running Postgres, except where it's involved in a composite index), but it can and often will cause serious problems later in a system's lifecycle.

I'd be the first to admit that this is one of those operations, along with those like adding a new column with a default value, that Postgres doesn't handle any better than MySQL (and to be fair, these are all very tricky cases from the database's perspective).

However, I'd also say that it's one of the easier cases to handle with the help of your application-level code:

  1. Create new column (remember, in Postgres an ~instantaneous operation!).
  2. Start double-writing to the new column from code (or as a database trigger).
  3. Migrate current values to the new column. If you're concerned about utilization (say, for a large table), write this as a throttled application-level migration rather than as a single UPDATE statement.
  4. Start reading from the new column from code.
  5. Drop old column (also ~instantaneous; it'll get cleaned up on the next vacuum).
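
As a sketch of the SQL involved (the table and column names are hypothetical; steps 2 and 4 live in application code), widening a `user_id` column might look like:

```sql
-- Step 1: create the new column (~instantaneous; no table rewrite).
ALTER TABLE events ADD COLUMN user_id_bigint bigint;

-- Step 3: backfill in throttled batches rather than one big UPDATE.
UPDATE events
    SET user_id_bigint = user_id
    WHERE id IN (
        SELECT id FROM events
        WHERE user_id_bigint IS NULL
        LIMIT 1000
    );
-- (repeat from application code, sleeping between batches, until no
-- rows remain)

-- Step 5: drop the old column and rename the new one into place.
ALTER TABLE events DROP COLUMN user_id;
ALTER TABLE events RENAME COLUMN user_id_bigint TO user_id;
```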

To me this sounds quite a bit easier than writing and maintaining a separate system to abstract the problem away, but we may disagree there.

Changing Length of Text Column

I'd have to say that this one once again falls under the "magic wand" category. VARCHAR and the arbitrarily-long TEXT are implemented with the same type under the hood, and Postgres will invoke TOAST [3] as necessary to store oversized strings. Good practice here is to always use TEXT and constrain length with a CHECK constraint, which can be changed trivially [4].
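
A sketch of that practice (hypothetical names):

```sql
-- Use TEXT plus a CHECK constraint instead of VARCHAR(n).
ALTER TABLE users ADD CONSTRAINT users_name_length
    CHECK (char_length(name) <= 100);

-- "Changing the length" is then just swapping the constraint. Adding
-- the new one with NOT VALID skips the full-table scan while the lock
-- is held; VALIDATE CONSTRAINT afterwards takes only a weak lock.
ALTER TABLE users DROP CONSTRAINT users_name_length;
ALTER TABLE users ADD CONSTRAINT users_name_length
    CHECK (char_length(name) <= 200) NOT VALID;
ALTER TABLE users VALIDATE CONSTRAINT users_name_length;
```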

Space Reclamation

Right. Reclaiming space on a database still requires a VACUUM FULL, which is something you can never run in production on a table of non-trivial size. I'd acknowledge this as a feature that gh-ost gets you, but that said, I've never found a particular need for it.

I'd be curious to hear about why you've found the need to run this frequently (although I'm well aware that I sound like a Postgres apologist by that statement).

MySQL's Online DDL

As mentioned in the gh-ost announcement, MySQL as of 5.6 does indeed support online, non blocking alter table. In fact, it supports many more variants of online alter table than PostgreSQL does (however, noticeable difference is that PostgreSQL makes those changes transactional whereas MySQL does not).

Admittedly I haven't looked at this much, but I thought that you yourself seem to have provided pretty good justification not to use that mechanism in your original post:

MySQL’s Online DDL for InnoDB is only “online” on the server on which it is invoked. Replication stream serializes the alter which causes replication lag. An attempt to run it individually per-replica results in much of the management overhead mentioned above. The DDL is uninterruptible; killing it halfway results in long rollback or with data dictionary corruption. It does not play “nice”; it cannot throttle or pause on high load. It is a commitment into an operation that may exhaust your resources.

Correct me if I'm wrong on that point.

Throttling

gh-ost addresses that as well. Its throttling gives super powers over the migration process, kicking in throttling based on master load, replication lag, various human controlled criteria, effectively making it very lightweight on the master.

I'd be the first to acknowledge that throttling is certainly a good thing, and it's cool that it's built into gh-ost!

That said, I've had very good luck at the companies I've worked at just baking throttling into application-level migrations. This also has the advantage of giving you access to all your domain models and other functionality provided by your source code, so that you're not dealing only with database primitives. It's even quite easy to implement with just a periodic sleep, although a more sophisticated approach is recommended.
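
For illustration, the same idea expressed purely in SQL against a hypothetical `payments` table (note that a DO block runs in a single transaction, so a real application-level migration would commit between batches instead):

```sql
DO $$
DECLARE
    rows_updated integer;
BEGIN
    LOOP
        -- Backfill one batch of the hypothetical new column.
        UPDATE payments
            SET amount_cents = amount * 100
            WHERE id IN (
                SELECT id FROM payments
                WHERE amount_cents IS NULL
                LIMIT 1000
            );
        GET DIAGNOSTICS rows_updated = ROW_COUNT;
        EXIT WHEN rows_updated = 0;
        PERFORM pg_sleep(0.5);  -- the periodic sleep doing the throttling
    END LOOP;
END
$$;
```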

This of course isn't going to work in some cases like CREATE INDEX, but I've had very good luck with Postgres' CONCURRENTLY keyword there; I've used it to raise hundreds of indexes on large tables with no noticeable user impact based on our measurements.
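
The concurrent build is a one-liner (index and table names hypothetical):

```sql
-- Builds the index without taking a write-blocking lock. It takes
-- longer, can't run inside a transaction, and a failed build leaves
-- an INVALID index behind that should be dropped and retried.
CREATE INDEX CONCURRENTLY index_payments_on_account_id
    ON payments (account_id);
```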

Summary

Just a few points in recap:

  • I'm admittedly a bad person for writing reaction posts that are worded too aggressively. It's easy to forget there's a human on the other end.

  • As writers working for tech companies that are generally held in quite high regard, I think it's our responsibility to disclose the trade-offs, such as they are, of the solutions we're presenting, because readers may weigh them heavily.

  • That doesn't necessarily mean we shouldn't write about how we're solving interesting problems. There are a lot of people out there (maybe the majority of us, and certainly including me) who are working in stacks they would never choose again given the chance to start from a green field. Unfortunately, that's reality, and we have to do our best with it.

  • That said, I would stand by my original position that for many uses of gh-ost, especially the more common ones, Postgres does present a "magic wand" in comparison. It may have its own disadvantages, but to my eye they seem fairly minor compared to what you might face with MySQL [5].

My next obvious step here is to add an addendum to my original post addressing its deficiencies and oversteps, which I'll do posthaste [6].

Once again, apologies for the tone of the original article, and thanks for taking the time to write a more detailed follow-up. Let's indeed get beers sometime.

Brandur

[1] https://twitter.com/vmg/status/757987482478776320

[2] https://brandur.org/postgres-queues

[3] https://www.postgresql.org/docs/9.6/static/storage-toast.html

[4] https://dba.stackexchange.com/questions/20974/should-i-add-an-arbitrary-length-limit-to-varchar-columns

[5] Uber had a recent post on this subject, but I didn't find it particularly convincing because even though some of their misgivings were valid, they also failed to address any trade-offs.

[6] https://github.com/brandur/sorg/pull/35/files

@shlomi-noach

Hi Brandur,

Thank you for your consideration!
I've noted in my post that your original post has been modified, and provided a link to this gist.

A few comments and clarifications:

I hear your concern about popular posts talking about some technology, and your wish that they would mention the deficiencies in said technology. If you read my personal blog you'll find rantings on MySQL behavior going all the way back to 2009. I do not think a post such as the one I published on the GitHub engineering blog was the place for listing deficiencies. I would certainly think that the part where I review the 5.6 online alter behavior is a pointer to a deficiency: I describe an existing feature in 5.6, then proceed to explain its limitations and why we would not use it.
But regardless, I don't see that the GitHub post was a glorifying song for MySQL. Indeed, you read it as "GitHub solved a problem that is not solved within MySQL itself", and I think this is self-explanatory; novices should infer the same.

To me this sounds quite a bit easier than writing and maintaining a separate system to abstract the problem away, but we may disagree there.

We may indeed disagree. gh-ost (or other existing online-schema-migration tools, for that matter) is a one-time development effort which generically and transparently solves the problem without involving a developer. Those are quite a few steps you're describing, involving at least two code deployments and further steps in between.
An online operation just does the operation, without changes to code. Once you have it, your future migrations are solved. Isn't this the essence of automation?

...but I thought that you yourself seem to have provided pretty good justification not to use that mechanism in your original post

I did. It was the MySQL deficiency I was describing. And yet it's an online operation that your original post claimed did not exist. If this feature did the job it was meant to do, gh-ost would not need to exist. At lower scales, it may just be good enough (though operationally there's overhead as one would want to execute it per server).

I'd be curious to hear about why you've found the need to run this frequently

Not so frequently; for some tables we may load lots of data, then delete lots of data, and disk-space reclamation can be beneficial. This is also useful when you decide to extract some table data to another table as a one-time operation (say, move all handled events to an events_history table, then reclaim the space of those rows).

gh-ost is not this bloated code you imagine it to be. It's not trivial, but it was still built in a few months of a developer's work, and we think that was time well invested.

Kind regards
