@lfittl
Last active May 10, 2023 12:13
Discussion in #citus on freenode about Citus' consistency guarantees (and the documentation about it)
[...]
<kellabyte> what happens if the master goes down?
<posi> So if citus master goes down, how does one recover?
<posi> kellabyte: https://www.citusdata.com/citusdb-architecture-whitepaper
<kellabyte> yeah, since all reads/writes must go through the master, what happens?
<kellabyte> ah thank you
<posi> "The master node keeps only metadata tables which are typically small (a few MBs in size). The metadata table can be replicated and quickly restored if the master node ever experiences a failure."
<ozgun> @posi Methods that use "logical replication" from PG to Citus will work. @andres noted that if your source table has "partitions" and the schema needs to change, you'll need to pay some attention
<lfittl_> @kellabyte https://www.citusdata.com/docs/citus/5.0/admin_guide/cluster_management.html#master-node-failures probably answers the question best
<lfittl_> (Using Postgres streaming replication is afaik the best approach here, but @ozgun knows best)
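A minimal sketch of the streaming-replication approach lfittl_ mentions, assuming a PostgreSQL 9.5-era layout; the hostname, replication role, and paths are hypothetical:

    # on a standby machine, take a base backup of the Citus master and have
    # pg_basebackup write a recovery.conf (-R) that streams from the master
    pg_basebackup -h citus-master.example.com -U replicator \
        -D /var/lib/postgresql/9.5/standby -R -X stream

    # if the master fails, promote the standby and point applications at it
    pg_ctl promote -D /var/lib/postgresql/9.5/standby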
<kellabyte> ah interesting
<kellabyte> what does citus use for replica replication?
<kellabyte> something custom or?
<kellabyte> I've been poking around and I haven't been able to find information on the consistency guarantees of replicas
<kellabyte> (I may be blind)
<ozgun> @kellabyte, *if* your use-case is analytical, the best way to think of the master node is Hadoop's NameNode and JobTracker
<ozgun> for operational workloads, reads and writes go through the master
<kellabyte> right, but if a node is down citus retries on a replica to cover the shard coverage set
<ozgun> (operational = CRUD)
<kellabyte> what are the guarantees of consistency to replicas?
<ozgun> yup, if a node is down and Citus fails to write to a shard, it will mark the shard as inactive
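In Citus 5.0 that health state lives in the master's metadata; a sketch of how you might inspect and repair it (the shard id and node names are invented):

    -- placements the master has marked inactive after a failed write
    -- (shardstate 1 = finalized/healthy, 3 = inactive)
    SELECT shardid, nodename, nodeport
    FROM pg_dist_shard_placement
    WHERE shardstate = 3;

    -- repair an inactive placement by copying from a healthy one
    SELECT master_copy_shard_placement(102008, 'good-node', 5432, 'bad-node', 5432);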
<kellabyte> I guess what I'm not clear on is what the isolation guarantees are across instances and replicas, which basically resolves to C in CAP
<ozgun> (digging urls on current architecture and replication model)
<kellabyte> sweet thanks :)
<posi> kellabyte: I assume it would have to support the postgresql isolation
<kellabyte> ok, so I'm curious how thats guaranteed across replicas
<ozgun> This slide deck lays out the current replication model: http://slideplayer.com/slide/4896815/
<posi> i'd assume it's all 2pc
<kellabyte> for example, is replication linearizable? if not then you can get stale reads depending on what shard replica is selected
<posi> kellabyte: I don't think they do any async rep
<ozgun> We currently don't do 2pc on writes
<lfittl_> @kellabyte afaik, since the reads would happen from the master, you don't have a stale reads problem
<posi> ok
<posi> wow
<posi> i have no idea then
<lfittl_> but again @ozgun is the expert here :-)
<posi> yea i guess the master can know
<posi> all the MVCC info, but serializable reads still seem tough
<kellabyte> all reads are from master and don't go to replicas?
<kellabyte> so you can't scale reads?
<kellabyte> that doesn't seem useful, I'm guessing that's not correct
<ozgun> Short reads or reads that we need to parallelize across shards?
<kellabyte> define short reads? I'm not sure what the distinction is
<ozgun> If your read touches a single shard (for example, fetch a value), we define that as a short read
<kellabyte> ok, but a shard still has replicas
<kellabyte> and various reads may talk to different replicas of the same shard
<ozgun> If your read requires more complex analytics and you need to aggregate results from multiple machines, then we define it as multi-shard
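To make the distinction concrete, assuming a hypothetical events table distributed by user_id:

    -- single-shard ("short") read: the filter on the distribution column
    -- pins exactly one shard, so a single worker answers it
    SELECT payload FROM events WHERE user_id = 42;

    -- multi-shard read: fans out to every shard and aggregates the results
    SELECT count(*) FROM events;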
<lfittl_> @ozgun can you do reads from streaming replication read slaves (of the master) as well?
<kellabyte> so replica consistency is a property that should affect single-shard or multi-shard reads
<ozgun> you can. I think kellabyte is asking about staleness of results?
<kellabyte> because 1 read may talk to 1 replica, while another may talk to another
<kellabyte> yes, isolation properties in ACID or C in CAP
<kellabyte> aka is this linearizable? what are the replication protocol guarantees vs the query isolation guarantees
<jasonmp85> read-your-own writes is probably the best description
<kellabyte> so its session level guarantees?
<kellabyte> so that means a session can't fall back to a replica shard if a shard is down?
<jasonmp85> sorry, have standup in 1m, but I can probably describe this shortly
<kellabyte> ok sweet thanks :)
<kellabyte> that slide deck doesn't talk about the replication model, it just shows the topology
<kellabyte> doesn't say anything about the replication protocol, transactional guarantees, etc
<jasonmp85> ok, this might get a bit verbose, but I want to clarify how we got to this architecture and where we're going next
<kellabyte> awesome :) I like verbose
<jasonmp85> up until now, Citus (née CitusDB) didn't support DML statements. the bread-and-butter it offered was an SQL query interface to plain-jane PostgreSQL instances running within a cluster. a master node acted as the coordinator of this cluster, keeping track of metadata and planning distributed jobs to answer user queries (distributed joins, filters, sorts, transforms, etc.)
<jasonmp85> data was loaded (and shards created) using a \stage command, which took input data, split it into shards, and atomically created those shards and replicas. don't read into the word "replica": they are just copies of each other
<jasonmp85> to meet the needs of customers who were hacking in their own DML query workarounds, we wrote and open-sourced pg_shard, which was essentially an overlay extension that understood Citus metadata and shard behavior, but could act alone.
<jasonmp85> as of Citus 5.0, that work is integrated into Citus itself, so you get (a) the distributed query/job planning, (b) \stage "bulk load" ingest, and (c) the DML support offered by pg_shard
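In Citus 5.0 terms, that combination hangs off a distributed table created roughly like this (the table name, shard count, and replication factor are made-up examples):

    -- distribute the hypothetical events table by hash of user_id
    SELECT master_create_distributed_table('events', 'user_id', 'hash');

    -- cut it into 16 shards with 2 placements ("replicas") each
    SELECT master_create_worker_shards('events', 16, 2);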
<jasonmp85> to get to your question (finally :-\), the pg_shard DML propagation essentially uses statement-level replication to send incoming statements to the affected nodes
<kellabyte> I'll let you finish and then I'll ask questions :)
<jasonmp85> this has several drawbacks: no multi-shard DML support, no multi-statement transaction support, and it uses shard locks on the master to ensure safe commutability of operations (i.e. no racing UPDATEs applying to replicas in different orders)
<jasonmp85> when pg_shard was written, the replication mechanisms present in PostgreSQL offered little beyond being able to completely (byte-for-byte) replicate a machine
<jasonmp85> but 9.5 brought LLSR (logical log streaming replication) (thanks andres :-D), which offers the potential for streaming subsets of data in a more efficient (and less manual) fashion
<jasonmp85> so: you can see your own writes; if a replica fails during your write, it is marked as unhealthy until a repair operation is run, and if a replica fails during a read, you will fail over to the next replica, guaranteed to see your own writes, but perhaps will see stale data wrt other sessions' racing writes
<andres> jasonmp85: 9.4 actually ;)
<jasonmp85> oh shoot, right
<jasonmp85> (9.4 was released shortly after pg_shard. 9.5 just came out)
* andres vanishes back into his cave, hiding after the sudden highlight
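A rough sketch of the statement-level propagation described above; the shard id and table names are invented, and the rewriting shown is conceptual rather than the exact internal form:

    -- client sends one DML statement to the master:
    UPDATE events SET clicks = clicks + 1 WHERE user_id = 42;

    -- the master takes a shard lock, then forwards the statement, rewritten
    -- against the shard's table name, to every healthy placement:
    --   on worker-1: UPDATE events_102008 SET clicks = clicks + 1 WHERE user_id = 42;
    --   on worker-2: UPDATE events_102008 SET clicks = clicks + 1 WHERE user_id = 42;
    -- a placement that fails to apply it is marked inactive rather than
    -- failing the whole write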
<jasonmp85> kellabyte: so there's our history. the inception was really "load a bunch of data, use distributed processing to get to it, and it's 'just SQL'", but the DML overlay came along with simple single-statement-single-shard distribution
<kellabyte> right, makes sense
<kellabyte> are replicated writes synchronous and completed before the ack of the write returns to the client?
<jasonmp85> but since we're moving to masterless, consistency is becoming more present in our minds. we are likely not going to target full ACID, or global linearizability, but we want to improve our isolation and replication picture
<jasonmp85> yes
<jasonmp85> obviously there's the "client–server need 2PC" inconsistency question that aphyr brought up in his PostgreSQL post a while back (if the connection drops, the client doesn't know whether the write occurred)
<kellabyte> right
<jasonmp85> but if you get an ack, your write went through (subject to the PostgreSQL settings on your remote nodes, i.e. if you turn off sync or something we can't do much)
<kellabyte> right, so you say you can do read-your-own-writes because it's guaranteed your read won't be before the write on all replicas, because replicas have to ack synchronously before the read can happen
<jasonmp85> yes
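The parenthetical caveat earlier ("subject to the PostgreSQL settings on your remote nodes") refers to stock PostgreSQL durability knobs on the workers, e.g.:

    # postgresql.conf on a worker node
    # with this turned off, a worker's ack no longer implies the write is
    # durably flushed there, weakening the guarantee discussed above
    synchronous_commit = on   # the default; off trades safety for speed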
<kellabyte> I think some documentation about this would be wonderful, going through the information it's REALLY unclear what the system properties are
<jasonmp85> if you're thinking "OLTP/ORM use case", i.e. "PostgreSQL semantics, distributed", I obviously can't rule out anything, but that's unlikely to be what we're targeting (and you'd probably be better off with 2PC, a global transaction/snapshot manager/something like PostgreSQL-XC if that works for you)
<kellabyte> jasonmp85: thank you very much for that description and answers, like I said I would form that into a page somewhere, it's really important information
<jasonmp85> yup
<jasonmp85> open a GitHub issue ?
<kellabyte> sure :)
<kellabyte> I have some thoughts about the masterless stuff too that I might post after some thought
<kellabyte> I'm a distributed systems junkie so this stuff is fun to me lol
<jasonmp85> I think this kind of thing might be obliquely inferred from what we highlight in our use cases (i.e. reading into what's not there), but that's obviously not terribly explicit. do you have an example of a project that "does this right", in your opinion?
<jasonmp85> like, is it a section in their documentation ("System Properties"?) or just a one-off wiki page for those who grok all these property terms, or?
<kellabyte> hmm, some database technologies actually write papers or paper-like documentation of how their replication protocols work and the guarantees
<kellabyte> for some it's just linked from a wiki page
<jasonmp85> Hm, I suppose PostgreSQL itself has e.g. http://www.postgresql.org/docs/9.5/static/transaction-iso.html
<kellabyte> yeah, there are 2 levels of things you can write (which you can combine if you wish)
<kellabyte> you can talk about guarantees in terms of isolation levels or terms like linearizable and all those etc
<kellabyte> and you can further explain how the guarantees are implemented in explaining the replication protocol, query planner, etc
<kellabyte> like for example HyperDex has documentation about how Hyperspace Hashing works, and how it provides ACID guarantees by explaining how hyperspace hashing + its replication protocol work
<jasonmp85> heh. i went to look up HyperDex, but it had a visited hyperlink color? losing track of things I read.
<jasonmp85> ok, so (a) distribution model, (b) replication model, (c) fallouts thereof
<jasonmp85> (how they interact with reads and writes)
<kellabyte> yeah
<jasonmp85> and yeah, we track all issues using GitHub and waffle.io, so if you want to write up a quick GitHub issue it'll show up in our board (and clearly have an external requestor)
<kellabyte> awesome :)
<kellabyte> I'll do that now!
<kellabyte> thanks for the great answers, I think that information has great value for many others
<jasonmp85> yup; thanks for the conversation starter… ok who's next :-P
<kellabyte> lol
<kellabyte> also again congrats on the release and OSS'ing!
<jasonmp85> thanks!
<kellabyte> you all must be excited :)
<kellabyte> and kudos on having an IRC channel
<jasonmp85> yeah, we've been looking forward to this for a while. lots of work over the past year to remodularize into an extension, integrate pg_shard's functionality, etc.
<kellabyte> yeah I bet
<tef> jasonmp85: re serializability, what sort of weaker consistency guarantees are you aiming for? read your writes? or something like bailis' atomic visibility
<kellabyte> sounds like it has read your writes already
<jasonmp85> yeah, DMLs are already read-your-own. I don't think we want to _relax_ that at any rate
<kellabyte> configurable relaxation might be nice at some point, but I would focus on making a correct system first before making a looser one
<jasonmp85> the RAMP stuff's been on my mind as we go toward masterless, but a difference between us and some other systems (as an extreme example, voltdb) is that we aren't going to impose query-language constraints or particular paradigms (above and beyond PostgreSQL SQL itself) on users
<kellabyte> especially since people who pick up "distributed" postgres will assume it's getting strongly consistent guarantees since postgres is ACID
<tef> jasonmp85: and as a total aside, have you seen tapir ? irenezhang.net/papers/tapir-tr-v2.pdf it is an interesting approach to serializability
<kellabyte> voltdb does a lot of things to ensure transaction guarantees, like a lot of specific things you need to avoid
<kellabyte> because the server has to generate them
<jasonmp85> but basically I can't make any concrete promises re: future isolation levels (though I am comfortable discussing our current one)
<jasonmp85> yeah
<jasonmp85> and no, i haven't seen the tapir paper, though it's going in my folder now
<andres> I think the current use-cases for citus actually make looser consistency models a lot more interesting, if that allows for higher ingest rates.
<kellabyte> rand() and date() calls for example can't be in SQL statements
<kellabyte> (VoltDB)
<kellabyte> it replicates a command log, so if it replays those commands it's going to get different values
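The same hazard applies to any statement-shipping scheme; a sketch with a hypothetical sessions table:

    -- non-deterministic: each node that replays this evaluates random()
    -- and now() itself, so copies can silently diverge
    UPDATE sessions SET token = random(), seen_at = now() WHERE id = 7;

    -- deterministic alternative: bind concrete values before shipping
    UPDATE sessions SET token = 0.734, seen_at = '2016-03-30 17:00:00' WHERE id = 7;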
<jasonmp85> yeah, if you have a point where everything is turned into a single log it's a lot easier to guarantee it'll be linearizable :-D
<kellabyte> yeah exactly and you can do batched transfers nicely
<kellabyte> because you're acking batches not statements
<andres> kellabyte: That's where I'm personally arguing to go for ;)
<kellabyte> I like that approach as well
<kellabyte> and it's simpler to do things like lower replication guarantees by allowing that to be async and ack prematurely
<kellabyte> a more complicated replication system gets really tough to add different guarantees
<kellabyte> and since you can do batch transfers, much more WAN friendly
<kellabyte> so you could do something like synchronous log replication inside a LAN and asynchronous log replication over WAN
<jasonmp85> yeah, the "ack synchronously once it's on a disk in this DC, asynchronously replicate to another DC" pattern is a common one
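In stock PostgreSQL that pattern is a couple of configuration lines on the primary (the standby name is a hypothetical application_name):

    # postgresql.conf on the primary: only the same-DC standby acks
    # synchronously; the cross-WAN standby (any name not listed) streams async
    synchronous_standby_names = 'standby_same_dc'
    synchronous_commit = on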
<jasonmp85> right
<kellabyte> yeah
<kellabyte> with too many DBs you need to loosen the local guarantees to make them work over a WAN
<kellabyte> or you have to manage 2 entirely different replication systems
<kellabyte> pain in the ass
<kellabyte> https://github.com/citusdata/citus/issues/403