Skip to content

Instantly share code, notes, and snippets.

@jakcharlton
Last active December 31, 2015 14:29
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save jakcharlton/8000864 to your computer and use it in GitHub Desktop.
Save jakcharlton/8000864 to your computer and use it in GitHub Desktop.
Rethink concepts

A Datacentre is a group of Servers

Servers can be grouped in a Datacentre

A Server (Instance) is a single Rethink process

A Database is a logical grouping for Tables - Tables may sit on different Servers

A Shard is a partition of a Table

A Shard can be allocated to a specific Server

Each Shard has a Master and one or more Replicas

The Master is the one to send the ACK of a write

You can set the number of replicas (=copy) of a Table per datacenter

The few constraints that I am aware of are:

  • The masters of one table has to be in the same datacenter
  • You cannot have ask for more replicas in a datacenter than you have servers in this datacenter
  • Oh, and there is also the "ack" parameter
  • the ack number is set per table and per datacenter
  • It's the number of writes that have to be flushed to disk before a write is acknowledge
  • So number of acks <= number of replicas <= number of servers in a datacenter

You create one database per project, then you create your table inside this database. If you have too much data, and want to spread the load, you shard your table. If you want to have a back up, you set more than 1 replica

If at some point, you want to scale (because you have too much data - or because the load is too much), you shard your table

By default the master can be anywhere - you can force it to be in a specific datacenter though. You can end up with masters on multiple datacenters (when you don't assign a primary datacenter)

You can Create a Database and a Table from ReQL

You cannot Shard a Table from ReQL (you must use the Web interface or CLI to do this at present). You cannot automatically create balanced shards with the CLI for the moment. With the CLI you have to define the split points, the web interface infer where the good split points are.

An API is planned to allow sharding from ReQL, but hasn't yet been designed.

Rethink queries are automatically routed to the appropriate server. If you write it goes to the master, if you do a map reduce it will reduce on each shard then join the result. This is an automatic process.

@jakcharlton
Copy link
Author

So, you can shard Tables, but not Databases
[5:26pm] jakcharlton_: If we are creating Databases with ReQL as a customer creates a project, can we shard from ReQL too?
[5:26pm] neumino: jakcharlton_, you cannot shard from ReQL for the moment
[5:26pm] jakcharlton_: (as we would have something like 15-20 tables per Database)
[5:26pm] neumino: We are thinking about it, but we haven't settle an API yet
[5:26pm] jakcharlton_: Can sharing be set to automatic?
[5:26pm] jakcharlton_: sharding*
[5:26pm] neumino: You can write a script to shard with the CLI
[5:26pm] jakcharlton_: I swear OSX autocorrect hates me
[5:27pm] neumino: You can shard with the web interface
[5:27pm] neumino: But you cannot automatically create balanced shards with the CLI for the moment
[5:27pm] jakcharlton_: CLI / Web are both manual processes - think Basecamp - when a customer sets a new project up, we were intending to create a new database
[5:27pm] neumino: It's range shards, so with the CLI you have to define the split points
[5:27pm] neumino: The web interface infer where the good split points are
[5:28pm] neumino: jakcharlton_, you can write a bash script to use the CLI
[5:28pm] jakcharlton_: Ugggg ....
[5:28pm] neumino: I have personnally not done it, but some people have
[5:28pm] jakcharlton_: Thats pretty much my nightmare scenario with a database
[5:28pm] neumino: We are going to have hash shards
[5:28pm] neumino: (soon I think)
[5:28pm] jakcharlton_: hash shards?
[5:28pm] neumino: And in this case, you will just have to set the number of shards, not the split points
[5:29pm] neumino: range shard = a shard where the primary key is between two keys
[5:29pm] neumino: for hash shard, you hash the primary key, and define in which shard it falls

@jakcharlton
Copy link
Author

You create one database per project
[5:31pm] neumino: Then you create your table inside this database
[5:31pm] neumino: If you have too much data, and want to spread the load, you shard your table
[5:31pm] neumino: if you want to have a back up, you set more than 1 replica

If at some point, you want to scale it (because you have too much data - or because the load is too much), you shard your table
[5:34pm] jakcharlton_: OK - so if it had 2 datacentrs (east./west coast for example) - it automatically knows where to put the master for a new database-table-shard?
[5:35pm] neumino: By default the master can be anywhere
[5:35pm] neumino: You can force it to be in a datacenter though

neumino: Ok, I just though about something, you can have master on multiple datacenter
[5:35pm] neumino: (when you don't assign a primary datacenter)
[5:36pm] jakcharlton_: OK - thats what I was wondering
[5:36pm] neumino: jakcharlton_, when you create a table, you can set in which datacenter the table is
[5:36pm] neumino: with the datacenter option

@jakcharlton
Copy link
Author

Am I right in remembering that Rethink splits queries in parallel? Is that across servers or shards?
[5:42pm] neumino: The query is automatically routed to the appropriate server
[5:42pm] neumino: If you write it goes to the master
[5:43pm] neumino: If you do a map reduce it will reduce on each shard the join the result
[5:43pm] neumino: It basically do it for you, you don't have to worry about it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment