@sebastianwebber
Compilation of the Uber Facts on PostgreSQL to MySQL Migration

Uber facts

Original posts/information

Key points

  • ~50GB MySQL application
  • Main motivation: PostGIS
  • Migration done with a custom tool (xml2pgcopy) plus mysqldump, in about 45 minutes (see the sketch after this list)
  • Final architecture layout: Python app -> PgBouncer -> HAProxy -> PostgreSQL
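
The talk does not publish xml2pgcopy itself, so the following is only a rough, hypothetical sketch of the approach the name suggests: converting `mysqldump --xml` output into tab-separated rows that PostgreSQL's `COPY ... FROM STDIN` can ingest. The script name, column order and escaping rules are assumptions, not Uber's actual tool.

```python
# xml2copy.py -- NOT Uber's xml2pgcopy, only a minimal illustration of the idea:
# read `mysqldump --xml` output and emit tab-separated rows for COPY ... FROM STDIN.
import sys
import xml.etree.ElementTree as ET

XSI = "{http://www.w3.org/2001/XMLSchema-instance}"


def escape(value):
    # COPY text format: \N means NULL; backslash, tab, CR and LF must be escaped.
    if value is None:
        return r"\N"
    return (value.replace("\\", "\\\\")
                 .replace("\t", "\\t")
                 .replace("\r", "\\r")
                 .replace("\n", "\\n"))


def main(dump_path):
    # mysqldump --xml wraps each row in <row><field name="col">value</field></row>.
    for _, elem in ET.iterparse(dump_path):
        if elem.tag != "row":
            continue
        values = []
        for field in elem.findall("field"):
            if field.get(XSI + "nil") == "true":  # NULL column
                values.append(escape(None))
            else:
                values.append(escape(field.text or ""))
        sys.stdout.write("\t".join(values) + "\n")
        elem.clear()  # keep memory bounded on large dumps


if __name__ == "__main__":
    main(sys.argv[1])
```

The output could then be piped straight into psql, e.g. `python xml2copy.py trips.xml | psql uber -c "\copy trips from stdin"` (table and database names here are placeholders).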

Key points

  • Single [and large] database to store all data (excluding analytics)
  • Hard to scale, no sharding at all
  • With the change to MySQL, they implemented sharding on top of an append-only datastore (see the toy sketch below)
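
To make the "sharding with an append-only datastore" point concrete, here is a toy, in-memory sketch of the general idea. This is not Uber's Schemaless implementation; the shard count, hashing and cell layout are assumptions.

```python
# Toy sketch of a sharded, append-only datastore -- NOT Uber's Schemaless.
import time
import uuid
import zlib

NUM_SHARDS = 16
shards = [[] for _ in range(NUM_SHARDS)]  # each shard would be a MySQL instance


def shard_for(row_key: str) -> int:
    # Deterministically map a row key (e.g. a trip UUID) to a shard.
    return zlib.crc32(row_key.encode()) % NUM_SHARDS


def append_cell(row_key: str, column: str, body: dict) -> None:
    # Cells are only ever appended; a newer cell shadows older versions.
    shards[shard_for(row_key)].append({
        "added_id": uuid.uuid4().hex,
        "row_key": row_key,
        "column": column,
        "created_at": time.time(),
        "body": body,
    })


def latest_cell(row_key: str, column: str):
    cells = [c for c in shards[shard_for(row_key)]
             if c["row_key"] == row_key and c["column"] == column]
    return max(cells, key=lambda c: c["created_at"], default=None)


trip = str(uuid.uuid4())
append_cell(trip, "STATUS", {"state": "requested"})
append_cell(trip, "STATUS", {"state": "completed"})  # appended, never updated
print(latest_cell(trip, "STATUS")["body"])  # {'state': 'completed'}
```

In Uber's published description the shards are MySQL databases and cells are immutable rows, with readers taking the newest cell for a given row key and column; the snippet above only mimics that shape in memory.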

Key points

  • Too many engineers (about 1,500) and a chaotic environment
  • Database timeline:
    • 2009-2010 - PHP + MySQL
    • split into two pieces:
      • dispatch: NodeJS + MongoDB
      • API: Python + SQLAlchemy + MySQL
    • Feb 2012: dispatch swaps MongoDB for Redis
    • Feb 2013: API switched to Postgres
    • Mar 2014: new Python services use MySQL, "Schemaless begins"
    • Sep 2014: first Schemaless deployment - migration away from Postgres begins
  • Chaos engineering
  • On May 11, the API went down for 16 hours, and reads were hit by slow queries for another 24 hours:
    • The disk filled up because WAL archiving failed (caused by wrong permissions on S3). A random engineer deleted old WAL files and restarted Postgres; Postgres did not come back up.
    • All standbys failed to become the new master. A consultant wrote a C program to make it work.
  • "databases can be killed and no users will notice"

Key points

  • uDestroy and failure tests
  • More details about the database timeline; mostly the same presentation as given at All Your Base 2015

About PostgreSQL

Key points
  • Old and monolithic backend application written in Python
  • "PostgreSQL limitations":
    • Inefficient architecture for writes
    • Inefficient data replication
    • Issues with table corruption
    • Poor replica MVCC support
    • Difficulty upgrading to newer releases
Replication Issues
  • PostgreSQL Replication: For crash recovery purposes, the database already maintains a write-ahead log (WAL) and uses it to implement two-phase commit.
  • Write amplification
  • The Postgres replication data stream quickly becomes extremely verbose, potentially occupying a large amount of bandwidth (a rough way to measure this is sketched after this list)
  • Cascading replication limits the inter–data center bandwidth requirements to the amount of replication required between just the master and a single replica, even if there are many replicas in the second data center
  • Postgres does not have true replica MVCC support
  • If a streaming replica has an open transaction, updates to the database are blocked if they affect rows held open by the transaction. (hot_standby_feedback?)
  • pglogical is still not an option for people running on older Postgres releases
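
As a rough way to put numbers on the "verbose replication stream" and write-amplification claims, one can measure how much WAL a workload generates over an interval; the same session can also check hot_standby_feedback, the setting behind the replica-conflict note above. A sketch assuming psycopg2, PostgreSQL 10+ function names (pg_current_wal_lsn; 9.x used pg_current_xlog_location) and a placeholder connection string:

```python
# Sketch: measure WAL bytes generated in one minute and check hot_standby_feedback.
import time

import psycopg2

conn = psycopg2.connect("dbname=app")  # placeholder DSN
conn.autocommit = True
cur = conn.cursor()

cur.execute("SELECT pg_current_wal_lsn()")
start_lsn = cur.fetchone()[0]

time.sleep(60)  # let the normal workload run for a minute

cur.execute("SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), %s)", (start_lsn,))
wal_bytes = cur.fetchone()[0]
print(f"WAL generated in 60s: {wal_bytes / 1024 / 1024:.1f} MiB")

# On a standby, hot_standby_feedback trades replication conflicts (cancelled
# queries / paused WAL replay) for extra bloat on the primary.
cur.execute("SHOW hot_standby_feedback")
print("hot_standby_feedback =", cur.fetchone()[0])
```
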
Data Corruption Issues
  • During a routine master database promotion to increase database capacity, we ran into a Postgres 9.2 bug: Replicas followed timeline switches incorrectly, causing some of them to misapply some WAL records.
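
After a promotion like the one that triggered the 9.2 bug, one sanity check is to compare the timeline each replica is actually on and how far it has replayed. A sketch assuming psycopg2, PostgreSQL 10+ (pg_control_checkpoint, pg_last_wal_replay_lsn) and placeholder hostnames, so it does not reflect what was available on 9.2:

```python
# Sketch: report each replica's timeline and replay position after a failover.
import psycopg2

REPLICAS = ["replica1.example.com", "replica2.example.com"]  # placeholders

for host in REPLICAS:
    with psycopg2.connect(host=host, dbname="app") as conn:
        with conn.cursor() as cur:
            cur.execute("""
                SELECT timeline_id, pg_last_wal_replay_lsn()
                FROM pg_control_checkpoint()
            """)
            timeline, replay_lsn = cur.fetchone()
            print(f"{host}: timeline {timeline}, replayed up to {replay_lsn}")
```
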
Postgres Upgrades
  • Because replication records work at the physical level, it’s not possible to replicate data between different general availability releases of Postgres.
Buffer Pool Issues
  • To increase performance, Postgres allows the kernel to automatically cache recently accessed disk data via the page cache
  • Accessing data via the page cache is actually somewhat expensive compared to accessing RSS memory
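
To see how the two caches interact in practice, the shared_buffers hit ratio from pg_stat_database shows how often reads are served from Postgres' own buffer pool versus falling through to the OS page cache or disk. A sketch assuming psycopg2 and a placeholder database name:

```python
# Sketch: compare shared_buffers hits against reads that missed the buffer pool.
import psycopg2

with psycopg2.connect(dbname="app") as conn:
    with conn.cursor() as cur:
        cur.execute("SHOW shared_buffers")
        print("shared_buffers =", cur.fetchone()[0])

        cur.execute("""
            SELECT blks_hit,
                   blks_read,
                   round(blks_hit * 100.0 / nullif(blks_hit + blks_read, 0), 2)
            FROM pg_stat_database
            WHERE datname = current_database()
        """)
        hit, read, ratio = cur.fetchone()
        # blks_read counts blocks missed in shared_buffers; they may still have
        # been served by the kernel page cache rather than physical disk.
        print(f"shared_buffers hit ratio: {ratio}% ({hit} hit / {read} read)")
```
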
Connection Issues
  • MySQL connection threads vs. PostgreSQL connection processes
  • Postgres seems to simply have poor support for handling large connection counts
  • PostgreSQL needs an external connection pooler like PgBouncer
  • Occasional application bugs in backend services caused them to open more active connections, and these bugs led to extended downtime
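
Because every Postgres connection is a dedicated backend process, it is worth watching how close the server is to max_connections; in the layout above the application would talk to PgBouncer rather than to the server directly. A sketch assuming psycopg2 and placeholder connection details:

```python
# Sketch: each row in pg_stat_activity is a dedicated backend *process* in
# Postgres (MySQL uses a thread per connection), so watch the headroom under
# max_connections.
import psycopg2

with psycopg2.connect(dbname="app") as conn:
    with conn.cursor() as cur:
        cur.execute("SHOW max_connections")
        max_conn = int(cur.fetchone()[0])
        cur.execute("SELECT count(*) FROM pg_stat_activity")
        used = cur.fetchone()[0]
        print(f"{used}/{max_conn} server connections in use")
        # With app -> PgBouncer -> HAProxy -> PostgreSQL, the application would
        # connect to PgBouncer (typically port 6432) and the server would only
        # ever see a small, fixed pool of backends.
```
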

Community responses in blogs/mailing lists
