ivansmf/README Secret

Last active October 7, 2020 01:57
How we hit one million Cassandra writes on Google Compute Engine - Reproducing Results
A. Disclaimers:
1. The scripts are offered for instructional purposes only.
2. cassandra.yaml and related configuration files are edited by one of the test scripts. We DO change the default settings to gain performance, and the new settings are specifically tuned for n1-standard-8 data nodes. For instance, we set the Java heap size to a large value that may not fit on other VM types.
B. Assumptions:
The scripts assume that your username on the target VMs is the same as on the local development server; more specifically, that the output of `whoami` on your development server matches its output on the VMs.
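A minimal sanity check for this assumption, runnable from the development server. The VM-side comparison is shown commented out because it needs a live instance; `l-1` is the loader created later in section D:

```shell
# Verify the username assumption the scripts depend on.
LOCAL_USER="$(whoami)"
echo "Local user: ${LOCAL_USER}"

# With a live cluster, compare against a VM, e.g.:
#   REMOTE_USER="$(gcutil --project=[project_name] ssh --zone=us-central1-b l-1 whoami)"
#   [ "$LOCAL_USER" = "$REMOTE_USER" ] || echo "username mismatch -- the scripts will fail"
```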
C. Prerequisites:
1. Download the test scripts archive to a local folder and untar it.
To download: wget
To untar: tar xzf cassandra_1m_writes_per_sec_gist.tgz
2. Download Cassandra binary distribution tarball into the tarballs folder. You can find detailed instructions at
$ curl -L | tar xz
3. Download the Oracle Java 7 tarball into the tarballs folder (we used server-jre-7u40-linux-x64.tar.gz). You can replace this step by installing the JDK, but we did not measure the performance impact of other releases. You can find the binary download here: (sign-up required).
4. You'll need to replace [project_name] with an actual project name or ID.
D. Creating the Cassandra cluster and data loaders:
1. Create disks.
$ gcutil --project=[project_name] adddisk --zone=us-central1-b --wait_until_complete --size_gb=1000 `for i in {1..300}; do echo -n pd1t-$i " "; done`
2. Create data nodes.
$ gcutil --project=[project_name] addinstance --zone=us-central1-b --add_compute_key_to_project --auto_delete_boot_disk --automatic_restart --use_compute_key --wait_until_running --image=debian-7-wheezy-v20131120 --machine_type=n1-standard-8 `for i in {1..300}; do echo -n cas-$i " "; done`
3. Create loaders.
$ gcutil --project=[project_name] addinstance --zone=us-central1-b --add_compute_key_to_project --auto_delete_boot_disk --automatic_restart --use_compute_key --wait_until_running --image=debian-7-wheezy-v20131120 --machine_type=n1-highcpu-8 `for i in {1..30}; do echo -n l-$i " "; done`
4. Attach the disks to data nodes.
$ for i in {1..300}; do gcutil --project=[project_name] attachdisk --zone=us-central1-b --disk=pd1t-$i cas-$i; done
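The backtick loops in steps 1-4 simply expand to a space-separated list of resource names before gcutil runs. A quick way to sanity-check that expansion (pure bash, no gcutil needed):

```shell
# Reproduce the name list that the backticked loop hands to gcutil.
names="$(for i in {1..300}; do echo -n "pd1t-$i "; done)"

# All 300 disk names should be present, from pd1t-1 to pd1t-300.
echo "$names" | wc -w                  # 300
echo "$names" | awk '{print $1, $NF}'  # pd1t-1 pd1t-300
```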
5. Authorize one of the loaders to ssh and rsync everywhere. Time to complete: about 5 minutes.
$ gcutil --project=[project_name] ssh --zone=us-central1-b l-1
$ ssh-keygen -t rsa
$ exit
6. Download the public key from l-1.
$ gcutil --project=[project_name] pull --zone=us-central1-b l-1 /home/`whoami`/.ssh/
7. Upload the key to all other VMs.
$ for i in {1..30}; do gcutil --project=[project_name] push --zone=us-central1-b l-$i /home/`whoami`/.ssh/; done
$ for i in {1..300}; do gcutil --project=[project_name] push --zone=us-central1-b cas-$i /home/`whoami`/.ssh/; done
8. Authorize l-1 to ssh into every VM in the project.
$ for vm in `gcutil --project=[project_name] listinstances | awk '{print $10;}' | sed ':a;N;$!ba;s/\n/ /g'`; do ssh -o UserKnownHostsFile=/dev/null -o CheckHostIP=no -o StrictHostKeyChecking=no -i /home/`whoami`/.ssh/google_compute_engine -A -p 22 `whoami`@$vm "cat /home/`whoami`/.ssh/ >> /home/`whoami`/.ssh/authorized_keys" ; done
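The address-extraction pipeline inside that one-liner is the tricky part: awk picks column 10 of the listinstances output, and the sed program `:a;N;$!ba;s/\n/ /g` slurps all lines into one pattern space and replaces the newlines with spaces. It can be exercised on fake output before touching real VMs (the hostnames below are made up):

```shell
# Stand-in for `gcutil listinstances` output: 10 whitespace-separated
# columns per row, with the address in column 10 (all values fabricated).
fake_listinstances() {
  echo "p1 p2 p3 p4 p5 p6 p7 p8 RUNNING cas-1.example"
  echo "p1 p2 p3 p4 p5 p6 p7 p8 RUNNING l-1.example"
}

# Same extraction as step 8: column 10, newlines flattened to spaces.
vms="$(fake_listinstances | awk '{print $10;}' | sed ':a;N;$!ba;s/\n/ /g')"
echo "$vms"   # cas-1.example l-1.example
```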
9. Generate the cluster configuration file.
$ echo SUDOUSER=\"`whoami`\" >benchmark.conf; echo DATA_FOLDER=\"cassandra_data\">>benchmark.conf ; for r in `gcutil 2>/dev/null --project=[project_name] listinstances --zone=us-central1-b | awk 'BEGIN {c=0; l=0;} /cas/ { print "CASSANDRA"++c"=\""$10":"$8":/dev/sdb\"";} /l\-[0-9]/ { print "LOAD_GENERATOR"++l"=\""$10"\""; }'`; do echo $r; done >> benchmark.conf
10. Upload all test scripts to l-1.
$ tar czf scripts.tgz *
$ gcutil --project=[project_name] push --zone=us-central1-b l-1 scripts.tgz /home/`whoami`
11. ssh to l-1 to set up the cluster.
$ gcutil --project=[project_name] ssh --zone=us-central1-b l-1
12. Unpack the scripts.
$ tar xzf scripts.tgz
13. Run the setup script. Please make sure that all nodes are up and running first.
$ ./
14. Run the tests.
$ ./
15. Gather results from each loader.
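One way to script that collection step, shown here as a dry run: `echo` prints each `gcutil pull` command instead of executing it, and the remote path `results-l-$i.log` is a hypothetical placeholder for wherever your test script actually writes its per-loader output.

```shell
# Dry run: print one `gcutil pull` per loader; drop the leading `echo`
# to actually fetch. results-l-$i.log is a hypothetical filename --
# substitute the real per-loader output file.
for i in {1..30}; do
  echo gcutil --project=[project_name] pull --zone=us-central1-b \
    l-$i /home/$(whoami)/results-l-$i.log .
done
```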
E. Deleting the cluster:
1. Delete data nodes.
$ gcutil --project=[project_name] deleteinstance --zone=us-central1-b `for i in {1..300}; do echo -n cas-$i " "; done` --force --delete_boot_pd
2. Delete data loaders.
$ gcutil --project=[project_name] deleteinstance --zone=us-central1-b `for i in {1..30}; do echo -n l-$i " "; done` --force --delete_boot_pd
3. Delete disks.
$ gcutil --project=[project_name] deletedisk --zone=us-central1-b `for i in {1..300}; do echo -n pd1t-$i " "; done` --force

tzach commented Jul 1, 2014

Thanks for sharing this.
I created a bash script based on the above.


pygupta commented Dec 9, 2014

Line 2: curl -L ... is missing at the end: | tar xz


Great effort! But can anyone explain how to choose the JVM heap size based on processor count and speed?
