How we hit one million Cassandra writes on Google Compute Engine - Reproducing Results
A. Disclaimers:
1. The scripts are offered for instructional purposes only.
2. cassandra.yaml and cassandra-env.sh are edited by one of the test scripts. We DO change the default settings to gain performance, and the new settings are specifically designed to work on n1-standard-8 data nodes. For instance, we set the Java heap size to a large value that might not fit on other VM types; an illustration follows below.
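For illustration only, the kind of override the script applies in cassandra-env.sh looks like the following. The exact values here are assumptions, not necessarily the ones our script sets; check setup_cluster.sh for the real numbers.
# cassandra-env.sh -- illustrative overrides for an n1-standard-8 (8 vCPUs, 30 GB RAM)
MAX_HEAP_SIZE="8G"     # fixed heap instead of Cassandra's auto-calculated default
HEAP_NEWSIZE="800M"    # young generation; common guidance is ~100 MB per core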
B. Assumptions:
The scripts assume that your username on the target VMs is the same as on your local development server; that is, the output of `whoami` on your development server must match its output on the VMs.
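Once l-1 exists (section D, step 3), a quick way to sanity-check this assumption is to compare the two outputs:
$ whoami
$ gcutil --project=[project_name] ssh --zone=us-central1-b l-1
$ whoami
$ exit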
C. Prerequisites:
1. Download all test scripts to a local folder and untar it.
To download: wget http://storage.googleapis.com/p3rf-downloads/cassandra_1m_writes_per_sec_gist.tgz
To untar: tar xzf cassandra_1m_writes_per_sec_gist.tgz
2. Download Cassandra binary distribution tarball into the tarballs folder. You can find detailed instructions at http://www.datastax.com/documentation/cassandra/2.0/cassandra/install/installTarball_t.html
$ curl -L -o tarballs/dsc.tar.gz http://downloads.datastax.com/community/dsc.tar.gz
3. Download the Oracle Java 7 tarball into the tarballs folder (we used server-jre-7u40-linux-x64.tar.gz). You can replace this step by installing the JDK instead, but we did not measure the performance impact of other releases. You can find the binary download here: http://docs.oracle.com/javase/7/docs/webnotes/install/linux/linux-server-jre.html (sign-up required).
4. You'll need to replace [project_name] with an actual project name or ID in the commands below.
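Optionally (a convenience, not part of the original scripts), keep the project ID in a shell variable so you don't have to edit every command by hand:
$ PROJECT=my-project-id    # hypothetical ID; substitute your own
$ gcutil --project=$PROJECT listinstances --zone=us-central1-b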
D. Creating the Cassandra cluster and data loaders:
1. Create disks.
$ gcutil --project=[project_name] adddisk --zone=us-central1-b --wait_until_complete --size_gb=1000 `for i in {1..300}; do echo -n pd1t-$i " "; done`
2. Create data nodes.
$ gcutil --project=[project_name] addinstance --zone=us-central1-b --add_compute_key_to_project --auto_delete_boot_disk --automatic_restart --use_compute_key --wait_until_running --image=debian-7-wheezy-v20131120 --machine_type=n1-standard-8 `for i in {1..300}; do echo -n cas-$i " "; done`
3. Create loaders.
$ gcutil --project=[project_name] addinstance --zone=us-central1-b --add_compute_key_to_project --auto_delete_boot_disk --automatic_restart --use_compute_key --wait_until_running --image=debian-7-wheezy-v20131120 --machine_type=n1-highcpu-8 `for i in {1..30}; do echo -n l-$i " "; done`
4. Attach the disks to data nodes.
$ for i in {1..300}; do gcutil --project=[project_name] attachdisk --zone=us-central1-b --disk=pd1t-$i cas-$i; done
5. Authorize one of the loaders to ssh and rsync everywhere. Time to complete: about 5 minutes.
$ gcutil --project=[project_name] ssh --zone=us-central1-b l-1
$ ssh-keygen -t rsa
(accept the defaults; an empty passphrase keeps later, unattended ssh/rsync steps from prompting)
$ exit
6. Download the public key
$ gcutil --project=[project_name] pull --zone=us-central1-b l-1 /home/`whoami`/.ssh/id_rsa.pub l-1.id_rsa.pub
7. Upload the key to all other VMs
$ for i in {1..30}; do gcutil --project=[project_name] push --zone=us-central1-b l-$i l-1.id_rsa.pub /home/`whoami`/.ssh/; done
$ for i in {1..300}; do gcutil --project=[project_name] push --zone=us-central1-b cas-$i l-1.id_rsa.pub /home/`whoami`/.ssh/; done
8. Authorize l-1 to ssh into every VM in the project
$ for vm in `gcutil --project=[project_name] listinstances | awk '{print $10;}' | sed ':a;N;$!ba;s/\n/ /g'`; do ssh -o UserKnownHostsFile=/dev/null -o CheckHostIP=no -o StrictHostKeyChecking=no -i /home/`whoami`/.ssh/google_compute_engine -A -p 22 `whoami`@$vm "cat /home/`whoami`/.ssh/l-1.id_rsa.pub >> /home/`whoami`/.ssh/authorized_keys" ; done
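To verify the authorization took, you can ssh from l-1 into one of the data nodes with the new key (a sketch; this assumes instance names resolve on the internal network, which they do by default on Compute Engine):
$ gcutil --project=[project_name] ssh --zone=us-central1-b l-1
$ ssh -o StrictHostKeyChecking=no `whoami`@cas-1 hostname
$ exit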
9. Generate the cluster configuration file
$ echo SUDOUSER=\"`whoami`\" >benchmark.conf; echo DATA_FOLDER=\"cassandra_data\">>benchmark.conf ; for r in `gcutil 2>/dev/null --project=[project_name] listinstances --zone=us-central1-b | awk 'BEGIN {c=0; l=0;} /cas/ { print "CASSANDRA"++c"=\""$10":"$8":/dev/sdb\"";} /l\-[0-9]/ { print "LOAD_GENERATOR"++l"=\""$10"\""; }'`; do echo $r; done >> benchmark.conf
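The generated benchmark.conf should look roughly like this. The addresses are illustrative, and the field meanings are inferred from the awk program above (what appear to be the external IP, the internal IP, and the data disk):
SUDOUSER="alice"
DATA_FOLDER="cassandra_data"
CASSANDRA1="203.0.113.11:10.240.0.11:/dev/sdb"
...
CASSANDRA300="203.0.113.99:10.240.1.54:/dev/sdb"
LOAD_GENERATOR1="203.0.113.101"
...
LOAD_GENERATOR30="203.0.113.130"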
10. Upload all test scripts to l-1
$ tar czf scripts.tgz *
$ gcutil --project=[project_name] push --zone=us-central1-b l-1 scripts.tgz /home/`whoami`
11. SSH into l-1 to set up the cluster
$ gcutil --project=[project_name] ssh --zone=us-central1-b l-1
12. Unpack the scripts
$ tar xzf scripts.tgz
13. Run setup_cluster.sh. Please make sure that all nodes are up and running.
$ ./setup_cluster.sh
14. Run tests
$ ./inserts_test.sh
15. Gather results from each loader
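Where the results land depends on inserts_test.sh; assuming each loader writes its output to a file such as /home/`whoami`/results.txt (a hypothetical path -- check the script for the actual one), you can pull everything back to your development server with:
$ for i in {1..30}; do gcutil --project=[project_name] pull --zone=us-central1-b l-$i /home/`whoami`/results.txt results-l-$i.txt; done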
E. Deleting the cluster:
1. Delete data nodes
$ gcutil --project=[project_name] deleteinstance --zone=us-central1-b `for i in {1..300}; do echo -n cas-$i " "; done` --force --delete_boot_pd
2. Delete data loaders
$ gcutil --project=[project_name] deleteinstance --zone=us-central1-b `for i in {1..30}; do echo -n l-$i " "; done` --force --delete_boot_pd
3. Delete disks
$ gcutil --project=[project_name] deletedisk --zone=us-central1-b `for i in {1..300}; do echo -n pd1t-$i " "; done` --force
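To confirm that everything was released, list what remains; neither command should show any cas-*, l-*, or pd1t-* resources:
$ gcutil --project=[project_name] listinstances --zone=us-central1-b
$ gcutil --project=[project_name] listdisks --zone=us-central1-b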
@tzach commented Jul 1, 2014

Thanks for sharing this.
I created a bash script based on the above:
https://github.com/tzach/cassandra-benchmark-gce

@pygupta commented Dec 9, 2014

Step C.2: the curl -L ... command is missing | tar xz at the end.

@khaziwallis commented Dec 11, 2014

Great effort... But can anyone explain how to choose the JVM heap size based on processor size and speed?
