Skip to content

Instantly share code, notes, and snippets.

@dtrapezoid
Last active November 21, 2016 19:21
Show Gist options
  • Star 2 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save dtrapezoid/42c13b7daba65cee03ef to your computer and use it in GitHub Desktop.
Save dtrapezoid/42c13b7daba65cee03ef to your computer and use it in GitHub Desktop.
Becoming Friends with Cassandra Workshop Instructions

Instructions for "Becoming Friends with Cassandra"

Go here:

https://gitter.im/DataStaxCommunity/Cassandra_Spark

Where do I start?

To start communicating with Cassandra like a champ, Cassandra's Query Language (CQL) is your best bet. CQL is an SQL-like language, which makes it warm and fuzzy for those of us used to SQL, but don't let that fool you...

We still have to think about how we "converse" with Cassandra differently. When we model our data, we need to think about how we will access it before we model it. It's important to emphasize this point, because you can get into some very ugly territory if you model Cassandra like a relational database. Data Modeling with Cassandra is purely query-driven.

It's the inverse to a relational system where we model the data and think about access patterns later. CQL can make us think we're talking to a relational system, but we are not. We're talking to Cassandra, and that's a whole different bag of bananas.

Let's start!

What are we doing?

What we are going to be doing in this introductory workshop is focusing on is writing and executing create table statements, importing data into our freshly minted tables and exploring the ufos.txt dataset with some CQL select queries. We will take a look at DevCenter to help with some of this and you will also reverse engineer a table as well! WOO!

  1. In your VM, start up your terminal.
  2. ls to see what we've give you.
  3. cd into ufos_data and ls again.

You should see some .cql and .txt files. Let's begin.

Create the ufos KEYSPACE

To create the ufos KEYSPACE, you will copy and paste the below CREATE KEYSPACE statement into cqlsh.

  1. To start up CQLSH, simply type cqlsh into your shell. Sweet, you should see cqlsh> in your console now.

  2. Create the KEYSPACE:

         create KEYSPACE ufos WITH replication = {'class': 'SimpleStrategy', 'replication_factor':1}; use ufos ;
    

Yay you're in your new keyspace!

You can do DESCRIBE KEYSPACES to see all the Keyspaces and you can do DESCRIBE KEYSPACE ufos to describe the UFO Keyspace you just created.

Create a Skinny Table:

  1. Open a new tab in your shell.
  2. ls
  3. Do you see the ufo_sightings.cql file?
  4. cat that sucker. This is how we create a table!
  5. Copy and paste or type the contents of create_ufo_table.cql into the cqlsh tab/separate window! Or you can use the SOURCE command like so: SOURCE 'create_ufo_table.cql'
  6. Now let's give that table some data using the COPY command: SOURCE 'copy_ufos.cql'
  7. Done! ufos.txt is all up in our new table!

Similar to the DESCRIBE KEYSPACES command, you can do the DESCRIBE TABLES command and DESCRIBE TABLE ufo_sightings as well.

Select Statement from ufo_sightings:

  1. select * from ufo_sightings limit 5;

  2. Results:

       sighting_id | city      | duration   | posted                 | shape    | sighting_time      | state | summary
       -------------+-----------+------------+--------------------------+----------+--------------------------+-------+-----------------------------------------------------------------------------
       4317 |     Davis | 5 minutes? | 2009-01-10 00:00:00+0000 | Changing | 2008-10-25 19:00:30+0000 |    CA |                light with tail doing serpentine manuever over I-80 in Davis
      62693 |     Bixby |  5 minutes | 2015-06-05 00:00:00+0000 |    Other | 2015-05-23 02:00:30+0000 |    OK |                           Huge craft with 2 green glowing lights, no sound.
      51678 |     Kelso |     20 sec | 2009-03-19 00:00:00+0000 |    Light | 2009-02-11 22:00:30+0000 |    WA |     A bright white flash of light and a red then white light that vanished.
      77328 | San Diego | 60 seconds | 2012-09-24 00:00:00+0000 |   Sphere | 2012-09-22 16:00:45+0000 |    CA | Black sphere over San Diego airport flying against the wind north to south.
      84757 | St. Louis |   1 minute | 2012-11-19 00:00:00+0000 | Triangle | 2012-11-17 18:00:00+0000 |    MO |                                                      Silent Black Triangle.
    
  3. Our data is there! WIN!

One more skinny partition:

  1. Create a new table which will allow you to query by ufo shape. You can run this in CQLSH or you can create a ufo_sightings_by_shape.cql file then run the SOURCE command, or by typing/copying and pasting the contents of create_ufo_table_by_shape.cql in cqlsh. Here's what ufo_sightings_by_shape should contain:

     CREATE TABLE ufo_sightings_by_shape (sighting_id int, sighting_time timestamp, city text, state text, shape text, duration text, summary text, posted timestamp, primary key (shape, sighting_id));
    
  2. COPY the ufos.txt data using the copy_ufos_to_ufos_by_shape.cql:

     CREATE KEYSPACE IF NOT EXISTS ufos
     WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 1};
    
     USE ufos;
    
     COPY ufo_sightings_by_shape (sighting_id, sighting_time, city, state, shape, duration, summary, posted) FROM 'ufos.txt' WITH delimiter = '\t' and header = true;
    
  3. Results:

     shape | sighting_id | city             | duration    | posted                   | sighting_time            | state | summary
     -------+-------------+------------------+-------------+--------------------------+--------------------------+-------+-----------------------------------------------------------------------------------------------------------------------------------------
     Oval |       69579 |            Tampa |   5 minutes | 2015-06-05 00:00:00+0000 | 2015-05-31 20:00:35+0000 |    FL | A thunderstorm was coming in from the east off MacDill air force base looked up and saw a red and yellow - green round object move down
     Oval |       69580 | Laukaa (Finland) | ~12 seconds | 2015-05-29 00:00:00+0000 | 2015-05-28 15:00:00+0000 |  null |                                                                   It rose up from a far away beach, hovered a bit and then disappeared.
     Oval |       69581 |   East Wenatchee |   3 minutes | 2015-05-29 00:00:00+0000 | 2015-05-27 23:00:00+0000 |    WA |                                                                                  Huge bright orange glowing ship in the sky last night.
     Oval |       69582 |          Sanborn |   5 minutes | 2015-05-29 00:00:00+0000 | 2015-05-26 23:00:30+0000 |    IA |                                                 3 reddish oval shaped with white lights inside them being followed by a black suburban.
     Oval |       69583 |        Brentwood |      1 hour | 2015-05-29 00:00:00+0000 | 2015-05-26 22:00:00+0000 |    TN |            Our neighbors contacted us and told us they saw a very bright light outside. ((NUFORC Note: Possible sighting of Venus? PD))
    

Our first wide partition:

  1. Create the table for ufo_sightings_by_shape_date

     CREATE TABLE ufo_sightings_by_shape_date (sighting_id int, sighting_time timestamp, city text, state text, shape text, duration text, summary text, posted timestamp, primary key (shape, sighting_time, sighting_id));
    

and use the associated COPY command in copy_ufo_table_by_shape_date.cql.

    CREATE KEYSPACE IF NOT EXISTS ufos
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 1};

    USE ufos;

    COPY ufo_sightings_by_shape_date (sighting_id, sighting_time, city, state, shape, duration, summary, posted) FROM 'ufos.txt' WITH delimiter = '\t' and header = true;
  1. Run this select statement:

         select sighting_time, city from  ufo_sightings_by_shape_date where shape = 'Rectangle' limit 10;
    
  2. Your results should be:

       sighting_time            | city
       -------------------------+--------------------------------------
       1936-09-15 19:00:00+0000 |                      Albany (Canada)
       1942-07-15 01:00:00+0000 |                               Oneida
       1943-10-15 11:00:00+0000 |                            Owensboro
       1948-06-01 00:00:00+0000 |                       Napan (Canada)
       1954-06-15 00:00:00+0000 |                     Korean Peninsula
       1956-10-01 19:00:00+0000 |                              Newport
       1956-11-27 22:00:00+0000 | Painsville/Willoughby/over Lake Erie
       1957-07-07 23:00:00+0000 |                                Korea
       1959-06-28 13:00:00+0000 |                                Globe
       1962-03-15 12:00:00+0000 |                 Baumholder (Germany)
    
  3. Let's try that same select statement again but with a lower case r for our shape selection: rectangle.

     sighting_time            | city
     -------------------------+--------------
     1979-01-31 11:00:00+0000 |      Houston
     1980-08-12 19:00:00+0000 |     Honolulu
     1997-08-12 23:00:30+0000 |        Fultz
     1998-06-12 21:00:00+0000 | West Chicago
    

Alright! We're getting to know our data and Cassandra! Let's kick this up a notch...

  1. Let's try a couple select statements with the circle shape but let's try a new tool called DevCenter! It's already pre-installed on your VM. Here is how you set it up:
  • Click on the "DataStax" logo, (that's the x with the nodes around it).

  • Hit "shift+control+alt+N" to start a new connection, or find the plus sign plug logo in the upper left window.

  • When the dialog box pops up, name your connection if you wish and then under contact hosts simply input localhost and hit add and then hit finish! Boom, your connection is established.

  • Specify your connection and keyspace in the upper central window.

  • Now in the upper central window you can start writing statements and seeing their output directly below. To execute a script hit (alt+F11) or the green play button at the top right of the upper central window. DevCenter will also alert you to errors in syntax and other cool, helpful matters as well!

      select sighting_time, city from  ufo_sightings_by_shape_date where shape = 'Circle' limit 10;
    
      sighting_time            | city
      --------------------------+----------------------------------
      1935-06-15 00:00:00+0000 |                      Crater Lake
      1941-07-02 11:00:30+0000 |                      Forest Home
      1944-06-01 19:00:00+0000 |            Penuela (Puerto Rico)
      1945-06-01 12:00:00+0000 | Germany or Holland (from bomber)
      1945-06-06 09:00:00+0000 |                Normandy (France)
      1947-01-01 17:00:00+0000 |                 Manama (Bahrain)
      1947-06-01 17:00:00+0000 |            Espanola (outside of)
      1947-06-20 12:00:00+0000 |                          Hayward
      1947-07-14 11:00:00+0000 |                         Richmond
      1947-10-15 21:00:00+0000 |                          Palmira
    

So Circle UFOs have been reported since the 30s. Cool!

    select sighting_time, city from  ufo_sightings_by_shape_date where shape = 'Circle' order by sighting_time DESC limit 10;

    sighting_time            | city
    --------------------------+-------------------------
    2015-06-04 22:00:30+0000 |                 Medford
    2015-06-04 00:00:30+0000 |                Bethpage
    2015-06-03 23:00:50+0000 |             Garden City
    2015-06-03 20:00:15+0000 |                  Keller
    2015-06-02 21:00:00+0000 |           Mountain City
    2015-06-02 20:00:00+0000 |            Blue Springs
    2015-05-30 22:00:00+0000 |              Providence
    2015-05-30 20:00:30+0000 |            Bloomingdale
    2015-05-30 03:00:30+0000 |                 Dauphin
    2015-05-29 23:00:00+0000 | Texas (western part of)

Oh snaps! We can order by descending to get our most recent data about circles!

    cqlsh:ufos> select sighting_time, city from  ufo_sightings_by_shape_date where shape = 'Circle' order by sighting_time ASC limit 10;

    sighting_time            | city
    --------------------------+----------------------------------
    1935-06-15 00:00:00+0000 |                      Crater Lake
    1941-07-02 11:00:30+0000 |                      Forest Home
    1944-06-01 19:00:00+0000 |            Penuela (Puerto Rico)
    1945-06-01 12:00:00+0000 | Germany or Holland (from bomber)
    1945-06-06 09:00:00+0000 |                Normandy (France)
    1947-01-01 17:00:00+0000 |                 Manama (Bahrain)
    1947-06-01 17:00:00+0000 |            Espanola (outside of)
    1947-06-20 12:00:00+0000 |                          Hayward
    1947-07-14 11:00:00+0000 |                         Richmond
    1947-10-15 21:00:00+0000 |                          Palmira

Now let's try ASC, oh dang, we get the exact same data back as the select without the order by statement. (Hint: That's because Cassandra's default order is ASC.)

Your Time to Shine:

Now it's time for you to get creative! You will be given a result and are challenged to reverse engineer the table and query written to get there. This is what you do to data model with Cassandra, you effectively think of the outcome before you model your data.

The most important thing is to have fun, it makes all the learning stick better.

THE "How do we get here?" EXERCISE:

Write the create table statements and select queries to get the following results:

      city     | sighting_time            | shape
      ----------+--------------------------+-----------
      Portland | 2015-05-15 01:00:18+0000 |     Other
      Portland | 2015-05-15 01:00:15+0000 |   Unknown
      Portland | 2015-05-01 13:00:01+0000 |    Circle
      Portland | 2015-04-30 03:00:00+0000 |     Light
      Portland | 2015-04-15 21:00:21+0000 |   Unknown
      Portland | 2015-04-11 21:00:08+0000 |  Fireball
      Portland | 2015-04-04 10:00:58+0000 |  Fireball
      Portland | 2015-03-29 21:00:30+0000 | Formation
      Portland | 2015-03-26 21:00:00+0000 |  Fireball
      Portland | 2015-02-15 18:00:45+0000 | Formation

EXERCISE SOLUTIONS:

Create:

create table ufo_sightings_by_city_date (sighting_id int, sighting_time timestamp, city text, state text, shape text, duration text, summary text, posted timestamp, primary key (city, sighting_time, sighting_id));

Copy:

COPY ufo_sightings_by_city_date (sighting_id, sighting_time, city, state, shape, duration, summary, posted) FROM 'ufos.txt' WITH delimiter = '\t' and header = true ;

Query:

SELECT sighting_time, shape from ufo_sightings_by_city_date where city = 'Portland' order by sighting_time DESC limit 10 ;

Some Reflections:

We denormalize our data with Cassandra, but as you have seen in this segment of our workshop, we can get more out of it by how we query. It's all the same data but the way we query it can lead to not only better performance but more insight. In summary, that's what's really cool about our new pal Cassandra!

To dig a little deeper:

Some reference notes about Data Modeling with Cassandra to get you started!

The method for Cassandra Data Modeling is called the Chebotko Method and here is how it goes:

  • Conceptual Data Model: The first step! This is Chen/ER Notation that describes entities, relationships, roles, keys and cardinalities. This is the abstract foundation for which you build an implementation-dependent Logical model.

    Conceptual Data Model Example

  • Logical Data Model: The second step! This is where the real world comes in, this is a diagram which shows the queries we intend to perform and the tables that get us there. A good rule of thumb is 1 table/query.

    Logical Data Model Example

  • Physical Data Model: The third and final step! Simply our CQL implementation of the Logical Data Model.

    Physical Data Model Example

Want to learn more?

Check out our online free training from DataStax here! A lot of the fundamental courses are focused on Apache Cassandra and not our closed source product. Woot!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment