Skip to content

Instantly share code, notes, and snippets.

@fabiog1901
Last active October 29, 2020 12:52
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save fabiog1901/3514f2144e3c7fb485a4818b451f4eac to your computer and use it in GitHub Desktop.
Save fabiog1901/3514f2144e3c7fb485a4818b451f4eac to your computer and use it in GitHub Desktop.

CockroachDB Platform Provisioning and Sizing Best Practices

Terminology

Sizing:

  • a process of estimating platform resource requirements (CPU, RAM, disk, network) of a planned application based on technical descriptions of the application’s requirements
  • is an ”educated guess” rather than a precise calculation
  • CRDB is intrinsically elastic - relieves the pressure to get the sizing “exactly right”
  • done for pricing / budgetary estimates and to right-size a POC environment (procurement dependent, non-cloud)

Node:

  • is an instance of a cockroach server process
  • e.g. not VM, not server, not OS instance, not container

vCPU:

  • the same way AWS/GCP defines a vCPU...

Provisioning:

  • a set of best practices for allocating underlying resources
  • an effective way to promote good practices is with per-vCPU resource provisioning ratios, a.k.a. “rules of thumb”

True Concurrency:

  • a characteristic of a workload where database clients open connections and execute SQL statements back-to-back, as in a stress test, as opposed to connections with a significant share of idle time, such as interactive user connections
  • alternatively, “true” concurrency may be referred to as the number of always-active connections or as the number of actively executing statements

Cluster Sizing Tasks

Determining the cluster composition (what kind of VMs and how many) can be divided into 3 Tasks:

  1. Establish per-vCPU resource provisioning ratios
  2. Estimate the required CockroachDB cluster size in total vCPUs
  3. Select VM size (in vCPUs) as a “building block” of the cluster, thereby determining the number of Nodes in the CockroachDB cluster

1. Establish per-vCPU resource provisioning ratios

Provisioning

Why RAM, Disk, and Net per-VCPU?

  • simple, ease to follow
  • yields predictably positive customer operating experience
  • common industry practice and cost model adopted by public cloud providers and Industry Standard Servers (ISS) hardware vendors

Balancing Act:

  • Under-provisioning results in SLA violations
  • Over-provisioning increases costs

Err on the side of over-provisioning, at reasonable costs

Guidance, not Requirement

  • For Production environments
  • For great majority of customers
  • Experts can use guidance as baseline and adjust from there; e.g. development environments

Keep all main resources, below, balanced so no resource bottlenecks leaving other resources underutilized

  • CPU
  • RAM
  • Storage
  • Networking

Compute Provisioning

RAM-to-vCPU ratio is 4GB per vCPU

  • for production environments
  • the lowest acceptable RAM-to-vCPU ratio is 2GB per vCPU, which is only suitable for testing.

The True Concurrency-to-vCPU ratio should not exceed 4

  • Users are encouraged to implement connection pooling or similar techniques to ensure the number of always-active connections per CockroachDB node does not exceed 4 times vCPU count

Storage Provisioning

Capacity: 150 GB per vCPU

  • subject to the per- Node storage capacity limit of 2.5 TB (1 TB prior to v20.1)

IOPS: 500 IOPS per vCPU

MBPS: 30 MBPS per vCPU

Dedicated CRDB data volume

  • not shared with any other IO on a VM e.g. Linux OS

2. Estimate the required CockroachDB cluster size in total vCPUs

Sizing Calculations

Size in total number of vCPUs

  • RAM, storage capacity, and storage IO is derived from the vCPU count using the Resource Provisioning Ratios

Do not size “Nodes” or VMs…

  • in scale-out MPP, it is less useful to calculate cluster sizing in “Nodes”
  • ability to handle workload depends on total # of processors (vCPUs), not the # of processes (nodes)
  • e.g. 12 nodes 4 vCPUs each will handle the same MAX workload as 4 nodes 12 vCPUs each

Assumptions is the Best Practices are followed:

  • One CRDB Node per OS instance (i.e. per server/VM)
  • All cluster nodes should be identically provisioned and configured
  • Client application connections are evenly distributed across all cluster nodes

Determine the required number of nodes to scale out the cluster to meet

  • Availability (multiplier for Capacity)
  • Storage Capacity (using 150GB / vCPU)
  • Workload Concurrency (using 4 concurrent stmts / vCPU)
  • Query Response Time

Evaluations are done independently for each type of requirement and the largest node count across all calculations becomes the suggested cluster size to meet requirements across the board

Sizing Example 1

Availability

  • Based on failure tolerance scenarios the required replication factor is 5

Capacity

  • Data Size is 5TB
  • 5TB x 5 replicas = 25TB
  • 60% utilization and 40% compression cancel each other
  • 25 TB / 150 GB/vCPU = 166 vCPUs

Workload Concurrency

  • 300 true concurrency (i.e. 300 statements concurrently executing in the cluster)
  • 300 active statements / 4 active statements per vCPU = 75 vCPUs

The required cluster size

  • MAX (166, 75) = 166 vCPUs

Sizing Example 2

Availability

  • Based on failure tolerance scenarios the required replication factor is 3

Capacity

  • Data Size is 3TB
  • 3TB x 3 replicas = 9TB
  • 60% utilization and 40% compression cancel each other
  • 9 TB / 150 GB/vCPU = 60 vCPUs

Workload Concurrency

  • 300 true concurrency (i.e. 300 statements concurrently executing in the cluster)
  • 300 active statements / 4 active statements per vCPU = 75 vCPUs

The required cluster size

  • MAX (60, 75) = 75 vCPUs

3. Select VM size (in vCPUs) as a “building block” of the cluster, thereby determining the number of Nodes in the CockroachDB cluster

VM Size / Cluster Composition

For the same total vCPUs in a cluster, is it better to have ?…

  • few large VMs or
  • many small VMs?

General VM Size Constraints

Node count constraints

  • The number of nodes in a CRDB cluster can be any number other than 2
  • The minimum number of nodes in a fault tolerant CRDB cluster is 3
  • There is a single-node mode

VM Size boundaries

  • The smallest VM Size is 2 vCPUs
  • For production deployments, the minimum VM size is 4 vCPUs
  • CRL does not test VM sizes beyond 32 vCPUs and there is rarely a strong justification for VMs larger than 32 vCPUs

Pricing Peculiarities

When More Smaller VMs is Better

Ability to handle specific failure scenarios:

  • The number of nodes must be equal to or greater than the replication factor
  • the replication factor is determined based on requirements to tolerate specific failure scenarios for a given cluster topology and VM placement per AZs/Regions
  • a larger number of smaller nodes means greater resiliency and tolerance in more failure scenarios

Maintaining response time and concurrency SLA across failure scenarios

  • In single AZ deployments the loss of a single small node has lesser impact on a cluster's ability to handle concurrent workloads while in a degraded configuration
  • In deployments across multiple AZs, with presumed requirements to tolerate an AZ failure, the size of VMs is less of a factor since planning for degraded capacity would be dominated by an entire AZ loss scenario

Maintaining response time and concurrency SLA across routine maintenance procedures

  • routine maintenance procedures, such as rolling upgrades, temporarily put the CockroachDB cluster in a degraded state
  • lesser impact of individual VM maintenance on cluster's ability to handle concurrent workload while in a degraded configuration

Elapsed time to backup/restore a database/table

Elapsed time to rebuild a node in recovery scenarios

When Fewer Larger VMs is Better

Processing hotspots

  • Larger CockroachDB nodes are able to handle hotspots better than smaller ones
  • clusters on larger nodes can better tolerate hotspots
  • Users operating clusters on smaller size VMs have to be diligent about avoiding hotspots in applications

Operational effectiveness

  • In IT operations, cost reductions are often associated with consolidations
  • same capacity cluster with fewer nodes may be easier to operate and maintain
  • a consideration with large clusters, at hundreds of nodes scale

Predictable performance across workload variations

  • Queries operating on large data sets may place a heavier strain on network transfers if the data is spread widely over many smaller nodes

Bottom line

Many small VMs is better for resiliency

Few large VMs is better for performance consistency

Punch Line: pick the largest possible VM (smallest possible cluster) that meets availability requirements (failure scenarios)

Sizing Example 1

Requirements and Calculated Size

  • Replication factor is 5
  • Data Size is 5TB
  • Concurrency is 300
  • Cluster size: 166 vCPUs

Composition

  • Suppose for AZ/Region failure tolerance scenarios, the MIN nodes is 9
  • Round up or add extra node to handle 1 node maintenance or SLA while in degraded configuration – 176 vCPUs�
  • Use 11 x 16 vCPU/64GB RAM with 2.5TB disk capable of delivering 8,000 IOPS and 480 MB/s read-write IO

Sizing Example 2

Requirements and Calculated Size

  • Replication factor is 3
  • Data Size is 3TB
  • Concurrency is 300
  • Cluster size: 75 vCPUs

Composition

  • Suppose for AZ/Region failure tolerance scenarios, the MIN nodes is 9
  • Round up or add extra node to handle 1 node maintenance or SLA while in degraded configuration – 80 vCPUs
  • Can’t Use 5 x 16 vCPU/64GB RAM with 2.5TB disk capable of delivering 8,000 IOPS and 480 MB/s read-write IO
  • Use 10 x 8 vCPU/32GB RAM with 1TB disk capable of delivering 4,000 IOPS and 240 MB/s read-write IO

CockroachDB Survival Simulation Tool

<bramgruneir.github.io>

Cluster Sizing Tasks (Recap)

Determining the cluster composition (what kind of VMs and how many) can be divided into 3 Tasks:

  • Establish per-vCPU resource provisioning ratios
  • Estimate the required CockroachDB cluster size in total vCPUs
  • Select VM size (in vCPUs) as a “building block” of the cluster, thereby determining the number of Nodes in the CockroachDB cluster
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment