@RussellSpitzer
Last active September 22, 2015 17:56

Configuration Reference

Cassandra Authentication Parameters

Property Name Default Description
spark.cassandra.auth.conf.factory com.datastax.spark.connector.cql.DefaultAuthConfFactory$ Name of a Scala module or class implementing AuthConfFactory providing custom authentication configuration
spark.cassandra.auth.username None Login name for password authentication
spark.cassandra.auth.password None Password for password authentication

Cassandra Connection Parameters

Property Name Default Description
spark.cassandra.connection.port 9042 Cassandra native connection port
spark.cassandra.connection.reconnection_delay_ms.max 60000 Maximum period of time to wait before reconnecting to a dead node
spark.cassandra.read.timeout_ms 120000 Maximum period of time to wait for a read to return
spark.cassandra.connection.keep_alive_ms 250 Period of time to keep unused connections open
spark.cassandra.connection.reconnection_delay_ms.min 1000 Minimum period of time to wait before reconnecting to a dead node
spark.cassandra.connection.host localhost Contact point to connect to the Cassandra cluster
spark.cassandra.connection.local_dc None The local DC to connect to (other nodes will be ignored)
spark.cassandra.connection.compression Compression to use (LZ4, SNAPPY or NONE)
spark.cassandra.query.retry.delay ExponentialDelay(4 seconds,1.5) The delay between subsequent retries (can be constant, like 1000; linearly increasing, like 1000+100; or exponential, like 1000*2)
spark.cassandra.connection.timeout_ms 5000 Maximum period of time to attempt connecting to a node
spark.cassandra.connection.factory com.datastax.spark.connector.cql.DefaultConnectionFactory$ Name of a Scala module or class implementing CassandraConnectionFactory providing connections to the Cassandra cluster
spark.cassandra.query.retry.count 10 Number of times to retry a timed-out query
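
The three delay forms accepted by spark.cassandra.query.retry.delay (constant, linear, exponential) can be illustrated with a small sketch. This is not the connector's parser: `parse_retry_delay` is a hypothetical helper, and it assumes the first retry waits the base delay and each later retry adds (or multiplies by) the step once more.

```python
import re


def parse_retry_delay(spec):
    """Return a function mapping a 1-based retry number to a delay in ms.

    Hypothetical helper: assumes retry 1 uses the base delay unchanged.
    """
    spec = spec.strip()
    match = re.fullmatch(r"(\d+)\s*([+*])\s*(\d+(?:\.\d+)?)", spec)
    if match:
        base = int(match.group(1))
        op = match.group(2)
        step = float(match.group(3))
        if op == "+":
            # Linear: base, base + step, base + 2 * step, ...
            return lambda n: base + step * max(n - 1, 0)
        # Exponential: base, base * step, base * step^2, ...
        return lambda n: base * step ** max(n - 1, 0)
    if re.fullmatch(r"\d+", spec):
        base = int(spec)
        # Constant: the same delay for every retry
        return lambda n: float(base)
    raise ValueError("unrecognised retry delay spec: %r" % spec)


for spec in ("1000", "1000+100", "1000*2"):
    delay = parse_retry_delay(spec)
    print(spec, [delay(n) for n in range(1, 5)])
```

Under these assumptions, the total wait before giving up is roughly the sum of the delays over spark.cassandra.query.retry.count retries, which grows quickly for the exponential form.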

Cassandra DataFrame Source Parameters

Property Name Default Description
spark.cassandra.table.size.in.bytes None Used by DataFrames internally. A future release will retrieve the size from C*; until then it can be set manually

Cassandra SQL Context Options

Property Name Default Description
spark.cassandra.sql.keyspace None Sets the default keyspace
spark.cassandra.sql.cluster None Sets the default Cluster to inherit configuration from

Cassandra SSL Connection Options

Property Name Default Description
spark.cassandra.connection.ssl.enabled false Enable secure connection to Cassandra cluster
spark.cassandra.connection.ssl.enabledAlgorithms Set(TLS_RSA_WITH_AES_128_CBC_SHA, TLS_RSA_WITH_AES_256_CBC_SHA) SSL cipher suites
spark.cassandra.connection.ssl.trustStore.password None Trust store password
spark.cassandra.connection.ssl.trustStore.path None Path for the trust store being used
spark.cassandra.connection.ssl.protocol TLS SSL protocol
spark.cassandra.connection.ssl.trustStore.type JKS Trust store type
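
Properties such as the SSL options above are ordinary Spark configuration entries, so one way to pass them is as `--conf key=value` flags to spark-submit. The sketch below only renders those flags from a dictionary; `to_submit_args` is a hypothetical helper, and the trust store path and password are placeholders, not values from this reference.

```python
import shlex


def to_submit_args(conf):
    """Render a dict of Spark properties as spark-submit --conf arguments."""
    args = []
    for key in sorted(conf):
        value = conf[key]
        if isinstance(value, bool):
            # Spark expects lowercase true/false for boolean properties
            value = "true" if value else "false"
        args += ["--conf", "%s=%s" % (key, value)]
    return args


ssl_conf = {
    "spark.cassandra.connection.ssl.enabled": True,
    "spark.cassandra.connection.ssl.trustStore.path": "/path/to/truststore.jks",  # placeholder
    "spark.cassandra.connection.ssl.trustStore.password": "changeit",  # placeholder
    "spark.cassandra.connection.ssl.trustStore.type": "JKS",
    "spark.cassandra.connection.ssl.protocol": "TLS",
}

# Quote each argument so the result can be pasted into a shell
print(" ".join(shlex.quote(arg) for arg in to_submit_args(ssl_conf)))
```

The same dictionary could equally be written to spark-defaults.conf or set on a SparkConf in application code; the flag form is used here only because it needs no Spark installation to demonstrate.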

Read Tuning Parameters

Property Name Default Description
spark.cassandra.input.consistency.level LOCAL_ONE Consistency level to use when reading
spark.cassandra.input.split.size_in_mb 64 Approx amount of data to be fetched into a Spark partition
spark.cassandra.input.metrics true Sets whether to record connector specific metrics on read
spark.cassandra.input.fetch.size_in_rows 1000 Number of CQL rows fetched per driver request
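
Because spark.cassandra.input.split.size_in_mb is the approximate amount of data per Spark partition, a rough partition count for a table follows from simple division. This is a back-of-the-envelope sketch, not the connector's split logic (which works from its own size estimates and token ranges); `estimate_spark_partitions` and the table sizes are hypothetical.

```python
import math


def estimate_spark_partitions(table_size_mb, split_size_in_mb=64):
    """Rough estimate: table size divided by the split size, rounded up."""
    if split_size_in_mb <= 0:
        raise ValueError("split_size_in_mb must be positive")
    # Even a tiny table needs at least one partition
    return max(1, math.ceil(table_size_mb / split_size_in_mb))


# A hypothetical 1000 MB table with the default and a doubled split size
print(estimate_spark_partitions(1000))       # 16
print(estimate_spark_partitions(1000, 128))  # 8
```

Lowering the split size gives more, smaller partitions (more parallelism, more task overhead); raising it does the opposite.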

Write Tuning Parameters

Property Name Default Description
spark.cassandra.output.batch.size.bytes 1024 Maximum total size of the batch in bytes. Overridden by spark.cassandra.output.batch.size.rows
spark.cassandra.output.concurrent.writes 5 Maximum number of batches executed in parallel by a single Spark task
spark.cassandra.output.metrics true Sets whether to record connector specific metrics on write
spark.cassandra.output.batch.grouping.buffer.size 1000 How many batches per single Spark task can be stored in memory before sending to Cassandra
spark.cassandra.output.batch.size.rows None Number of rows per single batch. The default is 'auto' which means the connector will adjust the number of rows based on the amount of data in each row
spark.cassandra.output.batch.grouping.key Partition Determines how insert statements are grouped into batches. Available values are: * none: a batch may contain any statements * replica_set: a batch may contain only statements to be written to the same replica set * partition (default): a batch may contain only statements for rows sharing the same partition key value
spark.cassandra.output.throughput_mb_per_sec 2147483647 Maximum write throughput allowed per single core, in MB/s (floating-point values allowed). On long (8+ hour) runs, limit this to 70% of the maximum throughput observed on a smaller job, for stability
spark.cassandra.output.consistency.level LOCAL_ONE Consistency level for writing
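
The 70% guideline for spark.cassandra.output.throughput_mb_per_sec is plain arithmetic. The sketch below assumes you have already measured a per-core peak on a smaller job; `throttled_throughput` and the 10 MB/s figure are hypothetical.

```python
def throttled_throughput(measured_mb_per_sec, fraction=0.7):
    """Return a per-core write limit as a fraction of the measured peak."""
    if measured_mb_per_sec <= 0:
        raise ValueError("measured throughput must be positive")
    # Round to two decimals; the property accepts floating-point values
    return round(measured_mb_per_sec * fraction, 2)


# Hypothetical measurement: 10 MB/s per core on a short test job
limit = throttled_throughput(10.0)
print("spark.cassandra.output.throughput_mb_per_sec=%s" % limit)
```

For short jobs the default (effectively unlimited) can be left in place; the cap is meant for long runs where sustained peak writes may destabilise the cluster.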