Tuning can be done at four different levels:
Hardware level: looking into the RAID levels of the hardware
OS kernel level
JVM level
C* config level
Methods:
1. Use *nix tools like vmstat, iostat, and stat.
2. Use monitors like htop, atop, and top.
Cacti, Munin, and Ganglia are standard tools for graphing many system-level metrics.
Timeouts
=========
Make sure that all nodes have the same system clock time by using an NTP (Network Time Protocol) server.
cross_node_timeout is a setting that allows nodes to communicate timeout information to each other.
streaming_socket_timeout_in_ms is an important setting, as it controls how much time is spent restreaming data between nodes in the event of a timeout. By default, there is no timeout in Cassandra for streaming operations. It is a good idea to set a timeout, but not too low a timeout: if a streaming operation times out, the file being streamed is started over from the beginning. As some SSTables can contain a significant amount of data, ensure that the value is set high enough to avoid unnecessary streaming restarts.
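As a rough sketch (the one-hour value here is purely illustrative, not a recommendation), the corresponding cassandra.yaml entries might look like this:

    # Let nodes share timeout information with each other.
    cross_node_timeout: true

    # 0 (no timeout) is the default; pick a value high enough that
    # streams of large SSTables are not restarted unnecessarily.
    streaming_socket_timeout_in_ms: 3600000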
CommitLog
=========
In the cassandra.yaml file, there is a setting called commitlog_directory. That setting determines where the CommitLog segments get written. Check which disk or partition your data_file_directories setting points to in the cassandra.yaml file and make sure the commitlog_directory is on a different disk or partition. By ensuring that data_file_directories and commitlog_directory are on different partitions, the CommitLog reads/writes don’t affect the overall performance of the rest of the reads on the node.
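For example, assuming hypothetical mount points /data and /commitlog that live on separate physical disks, the relevant cassandra.yaml entries might look like this:

    # SSTable data on one disk or partition...
    data_file_directories:
        - /data/cassandra/data

    # ...and CommitLog segments on another, so CommitLog I/O does not
    # compete with reads against the data files.
    commitlog_directory: /commitlog/cassandra/commitlog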
There are also a few other settings in the cassandra.yaml file that affect the performance of the CommitLog. The commitlog_sync setting can be set to either batch or periodic. If commitlog_sync is set to batch, Cassandra will block until the write has been synced to disk. Having commitlog_sync set to batch is usually not needed, as under most circumstances writes aren’t acknowledged until another node has the data. On the other hand, periodic is the default setting and typically is the best for performance as well. You will also need to ensure that commitlog_sync_period_in_ms is sensible for your write workload. For durability, if you have a high-volume write system, set this to something smaller than 10,000ms (or 10s) to ensure minimal data loss between flushes.
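A sketch of the periodic configuration described above, with the sync period tightened for a high-volume write system (the 5,000ms value is only an example):

    # periodic (the default) acknowledges writes before the fsync;
    # batch would block each write until the CommitLog is synced.
    commitlog_sync: periodic

    # How often the CommitLog is synced to disk; something under
    # 10,000ms keeps the potential data-loss window small.
    commitlog_sync_period_in_ms: 5000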
Although the default setting of 32 for commitlog_segment_size_in_mb is a sane default, depending on your backup strategy this may be something you want to change. If you are doing CommitLog archiving as a form of backup, choosing a more granular setting of 8 or 16 might be better. This allows a finer point-in-time restore, depending on your write volume.
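For example, if CommitLog archiving is part of your backup strategy, a smaller segment size gives more granular restore points:

    # Default is 32; 8 or 16 allows a finer point-in-time restore.
    commitlog_segment_size_in_mb: 16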
MemTables
========
There are a few basic tenets to keep in mind when adjusting MemTable thresholds:
1. Larger MemTables take memory away from caches. Since MemTables store the actual column data, they will take up at least that amount of space plus a little extra for index structure overhead. Therefore, your settings should take into account schema (ColumnFamily and column layout) in addition to overall memory.
2. Larger MemTables do not improve write performance. This is because writes are happening to memory anyway. There is no way to speed up this process unless your CommitLog and SSTables are on separate volumes. If the CommitLog and SSTables were to share a volume, they would be in contention for I/O.
3. Larger MemTables are better for unbatched writes. If you do batch writing, you will likely not see a large benefit. But if you do unbatched writes, the compaction will have a better effect on the read performance as it will do a better job of grouping like data together.
4. Larger MemTables lead to more effective compaction. Having a lot of little MemTables is bad, as flushing them produces a lot of little SSTables and a lot of compaction turnover. It also leads to a lot of additional seeks when read requests miss memory and have to go to disk.
The performance tuning of MemTables can double as a pressure release valve if your Cassandra nodes start to get overloaded. They shouldn’t be your only method of emergency release, but they can act as a good complement. In the cassandra.yaml file, there is a setting called flush_largest_memtables_at. The default setting is 0.75. This setting is a fraction of the total heap. What is going on under the hood is that every time a full garbage collection (GC) is completed, the heap usage is checked. If the amount of memory used is still greater than that fraction of the heap (0.75 by default), the largest MemTables will be flushed. This setting is more effective when used under read-heavy workloads. In write-heavy workloads, there will probably be too little memory freed too late in the cycle to be of significant value. If you notice the heap filling up from MemTables frequently, you may need to either add more capacity or adjust the heap setting in the JVM.
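The corresponding cassandra.yaml entry, shown here at its default:

    # After each full GC, flush the largest MemTables if heap usage
    # is still above this fraction of the maximum heap.
    flush_largest_memtables_at: 0.75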
The memtable_total_space_in_mb setting in the cassandra.yaml is usually commented out by default. When it is commented out, Cassandra will automatically set the value to one-third the size of the heap. You typically don’t need to adjust this setting as one-third of the heap is sufficient. If you are in a write-heavy environment, you may want to increase this value. Since you already know the size of the JVM heap, you can just calculate what the new size of the total space allotted for MemTables should be. Try not to be too aggressive here as stealing memory from other parts of Cassandra can have negative consequences.
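As an illustration only, assume an 8GB heap: the automatic value would be roughly one-third of that (about 2,730MB), and a write-heavy node might raise it modestly, for example:

    # Commented out by default, which means 1/3 of the heap.
    # Hypothetical bump for a write-heavy workload on an 8GB heap:
    memtable_total_space_in_mb: 3072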
The memtable_flush_writers setting is another one that comes unset out of the box. By default, it’s set to the number of data directories defined in the cassandra.yaml. If you have a large heap size and your use case puts Cassandra under a write-heavy workload, this value can be safely increased.
On a similar note, memtable_flush_queue_size has an effect on the speed and efficiency of a MemTable flush. This value determines the number of MemTables allowed to wait for a writer thread to flush to disk. It should be set, at a minimum, to the maximum number of secondary indexes created on a single ColumnFamily. In other words, if you have a ColumnFamily with ten secondary indexes, this value should be 10. The reason is that if memtable_flush_queue_size is set to 2 and you have three secondary indexes on a ColumnFamily, there are two options available: either flush the MemTable, leaving any index not updated during the initial flush out of sync with the SSTable data, or hold off loading the new SSTable into memory until all the indexes have been updated. To avoid either of these scenarios, it is recommended that memtable_flush_queue_size be set to ensure that all secondary indexes for a ColumnFamily can be updated at flush time.
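A sketch combining the two settings, assuming a hypothetical node with two data directories and a ColumnFamily that carries three secondary indexes:

    # Defaults to the number of data_file_directories (two here);
    # can be raised on large-heap, write-heavy nodes.
    memtable_flush_writers: 2

    # At least the maximum number of secondary indexes on any single
    # ColumnFamily; 4 covers the three indexes assumed here.
    memtable_flush_queue_size: 4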
Concurrency
========
Cassandra was designed to have faster write performance than read performance. One of the ways this is achieved is by using threads for concurrency. There are two settings in the cassandra.yaml that allow control over the amount of concurrency: concurrent_reads and concurrent_writes. By default, both of these values are set to 32. Since the writes that come to Cassandra go to memory, the bottleneck for most performance is going to the disk for reads. To calculate the best setting for concurrent_reads, the value should be 16 * number_of_drives. If you have two drives, you can leave the default value of 32. The reason for this is to ensure that the operating system has enough room to decide which reads should be serviced at what time.
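For example, on a hypothetical node with four data drives, 16 * 4 gives:

    # 16 * number_of_drives (four drives assumed here); the default
    # of 32 already matches a two-drive node.
    concurrent_reads: 64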
Coming back to writes, since they go directly to memory, the work is CPU bound as opposed to being I/O bound like reads. To derive the best setting for concurrent_writes, you should use 8 * number_of_cores in your system. If you have a quad-core dual-processor system, the total number of cores is eight and the concurrent_writes should be set at 64.
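Using the quad-core dual-processor example above (eight cores total), 8 * 8 gives:

    # 8 * number_of_cores (eight cores assumed here).
    concurrent_writes: 64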
Durability and Consistency
==================
Something to always keep in mind when looking at the performance of your application is which trade-offs you are willing to make with regard to the durability of your data on write and the consistency of your data on read. Much of this can be achieved by setting consistency levels on reads and writes. For reference, a quorum is calculated (rounded down to a whole number) as (replication_factor / 2) + 1; with a replication factor of 3, for example, a quorum is (3 / 2) + 1 = 2 replicas.
We have already covered consistency levels in detail, but the theory behind when to use which consistency level at what time, known as durability, is also important. When you are working under a write-heavy workload, it is unlikely that all the data being written is so important that it needs to be verified as received by most or all of the replicas (QUORUM, LOCAL_QUORUM, EACH_QUORUM, or ALL). Unless your node or cluster is under a heavy load, you will probably be safe with using CL.ANY or CL.ONE for most writes. This reduces the amount of network traffic and reduces the wait time on the application performing the write (which is typically a blocking operation to begin with). If you can decide at write time or connection time which data is important enough to require higher consistency levels in your writes, you can save quite a bit of round-trip and wait time on your write calls.
On the read side of things, you can ask yourself a similar question: How important is the accuracy of the call I am making? Since you are working under eventual consistency, it is important to remember that the latest and greatest version of the data may not always be immediately available on every node. If you are running queries that require the latest version of the data, you may want to run the query with QUORUM, LOCAL_QUORUM, EACH_QUORUM, or ALL. It is important to note when using ALL that the read will fail if one of the replicas does not respond to the coordinator. If it is acceptable for the data not to have the latest timestamp, using CL.ONE may be a good option. By default, a read repair will run in the background to ensure that for whatever query you just ran, all data is consistent.
If latency is an issue, you should also consider using CL.ONE. If consistency is more important to you, you can ensure that a read will always reflect the most recent write by choosing consistency levels such that (nodes_written + nodes_read) > replication_factor. For example, with a replication factor of 3, writing and reading at QUORUM (2 nodes each) gives 2 + 2 > 3.
When thinking about consistency levels in the context of multiple data centers, it is important to remember the additional latency incurred by needing to wait for a response from the remote data center or data centers. Ideally, you want all of an application’s requests to be served from within the same data center, which will help avoid quite a bit of latency. It is important to keep in mind that even at a consistency level of ONE or LOCAL_QUORUM, a write is still sent to all replicas, including those in other data centers. In this case, the consistency level determines how many replicas are required to respond that they received the write.