https://gitter.im/DataStaxCommunity/Cassandra_Spark
To start communicating with Cassandra like a champ, Cassandra's Query Language (CQL) is your best bet. CQL is an SQL-like language, which makes it warm and fuzzy for those of us used to SQL, but don't let that fool you...
We still have to think about how we "converse" with Cassandra differently. When we model our data, we need to think about how we will access it before we model it. It's important to emphasize this point, because you can get into some very ugly territory if you model Cassandra like a relational database. Data Modeling with Cassandra is purely query-driven.
It's the inverse to a relational system where we model the data and think about access patterns later. CQL can make us think we're talking to a relational system, but we are not. We're talking to Cassandra, and that's a whole different bag of bananas.
What are we doing?
What we are going to be doing in this introductory workshop is focusing on is writing and executing create table statements, importing data into our freshly minted tables and exploring the ufos.txt
dataset with some CQL select queries. We will take a look at DevCenter to help with some of this and you will also reverse engineer a table as well! WOO!
- In your VM, start up your terminal.
ls
to see what we've give you.cd
intoufos_data
andls
again.
You should see some .cql and .txt files. Let's begin.
To create the ufos KEYSPACE, you will copy and paste the below CREATE KEYSPACE statement into cqlsh.
-
To start up CQLSH, simply type
cqlsh
into your shell. Sweet, you should seecqlsh>
in your console now. -
Create the KEYSPACE:
create KEYSPACE ufos WITH replication = {'class': 'SimpleStrategy', 'replication_factor':1}; use ufos ;
Yay you're in your new keyspace!
You can do DESCRIBE KEYSPACES
to see all the Keyspaces and you can do DESCRIBE KEYSPACE ufos
to describe the UFO Keyspace you just created.
- Open a new tab in your shell.
ls
- Do you see the ufo_sightings.cql file?
cat
that sucker. This is how we create a table!- Copy and paste or type the contents of
create_ufo_table.cql
into thecqlsh
tab/separate window! Or you can use the SOURCE command like so:SOURCE 'create_ufo_table.cql'
- Now let's give that table some data using the COPY command:
SOURCE 'copy_ufos.cql'
- Done!
ufos.txt
is all up in our new table!
Similar to the DESCRIBE KEYSPACES command, you can do the DESCRIBE TABLES command and DESCRIBE TABLE ufo_sightings as well.
-
select * from ufo_sightings limit 5;
-
Results:
sighting_id | city | duration | posted | shape | sighting_time | state | summary -------------+-----------+------------+--------------------------+----------+--------------------------+-------+----------------------------------------------------------------------------- 4317 | Davis | 5 minutes? | 2009-01-10 00:00:00+0000 | Changing | 2008-10-25 19:00:30+0000 | CA | light with tail doing serpentine manuever over I-80 in Davis 62693 | Bixby | 5 minutes | 2015-06-05 00:00:00+0000 | Other | 2015-05-23 02:00:30+0000 | OK | Huge craft with 2 green glowing lights, no sound. 51678 | Kelso | 20 sec | 2009-03-19 00:00:00+0000 | Light | 2009-02-11 22:00:30+0000 | WA | A bright white flash of light and a red then white light that vanished. 77328 | San Diego | 60 seconds | 2012-09-24 00:00:00+0000 | Sphere | 2012-09-22 16:00:45+0000 | CA | Black sphere over San Diego airport flying against the wind north to south. 84757 | St. Louis | 1 minute | 2012-11-19 00:00:00+0000 | Triangle | 2012-11-17 18:00:00+0000 | MO | Silent Black Triangle.
-
Our data is there! WIN!
-
Create a new table which will allow you to query by ufo shape. You can run this in CQLSH or you can create a
ufo_sightings_by_shape.cql
file then run the SOURCE command, or by typing/copying and pasting the contents ofcreate_ufo_table_by_shape.cql
in cqlsh. Here's whatufo_sightings_by_shape
should contain:CREATE TABLE ufo_sightings_by_shape (sighting_id int, sighting_time timestamp, city text, state text, shape text, duration text, summary text, posted timestamp, primary key (shape, sighting_id));
-
COPY the ufos.txt data using the
copy_ufos_to_ufos_by_shape.cql
:CREATE KEYSPACE IF NOT EXISTS ufos WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 1}; USE ufos; COPY ufo_sightings_by_shape (sighting_id, sighting_time, city, state, shape, duration, summary, posted) FROM 'ufos.txt' WITH delimiter = '\t' and header = true;
-
Results:
shape | sighting_id | city | duration | posted | sighting_time | state | summary -------+-------------+------------------+-------------+--------------------------+--------------------------+-------+----------------------------------------------------------------------------------------------------------------------------------------- Oval | 69579 | Tampa | 5 minutes | 2015-06-05 00:00:00+0000 | 2015-05-31 20:00:35+0000 | FL | A thunderstorm was coming in from the east off MacDill air force base looked up and saw a red and yellow - green round object move down Oval | 69580 | Laukaa (Finland) | ~12 seconds | 2015-05-29 00:00:00+0000 | 2015-05-28 15:00:00+0000 | null | It rose up from a far away beach, hovered a bit and then disappeared. Oval | 69581 | East Wenatchee | 3 minutes | 2015-05-29 00:00:00+0000 | 2015-05-27 23:00:00+0000 | WA | Huge bright orange glowing ship in the sky last night. Oval | 69582 | Sanborn | 5 minutes | 2015-05-29 00:00:00+0000 | 2015-05-26 23:00:30+0000 | IA | 3 reddish oval shaped with white lights inside them being followed by a black suburban. Oval | 69583 | Brentwood | 1 hour | 2015-05-29 00:00:00+0000 | 2015-05-26 22:00:00+0000 | TN | Our neighbors contacted us and told us they saw a very bright light outside. ((NUFORC Note: Possible sighting of Venus? PD))
-
Create the table for
ufo_sightings_by_shape_date
CREATE TABLE ufo_sightings_by_shape_date (sighting_id int, sighting_time timestamp, city text, state text, shape text, duration text, summary text, posted timestamp, primary key (shape, sighting_time, sighting_id));
and use the associated COPY command in copy_ufo_table_by_shape_date.cql
.
CREATE KEYSPACE IF NOT EXISTS ufos
WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 1};
USE ufos;
COPY ufo_sightings_by_shape_date (sighting_id, sighting_time, city, state, shape, duration, summary, posted) FROM 'ufos.txt' WITH delimiter = '\t' and header = true;
-
Run this select statement:
select sighting_time, city from ufo_sightings_by_shape_date where shape = 'Rectangle' limit 10;
-
Your results should be:
sighting_time | city -------------------------+-------------------------------------- 1936-09-15 19:00:00+0000 | Albany (Canada) 1942-07-15 01:00:00+0000 | Oneida 1943-10-15 11:00:00+0000 | Owensboro 1948-06-01 00:00:00+0000 | Napan (Canada) 1954-06-15 00:00:00+0000 | Korean Peninsula 1956-10-01 19:00:00+0000 | Newport 1956-11-27 22:00:00+0000 | Painsville/Willoughby/over Lake Erie 1957-07-07 23:00:00+0000 | Korea 1959-06-28 13:00:00+0000 | Globe 1962-03-15 12:00:00+0000 | Baumholder (Germany)
-
Let's try that same select statement again but with a lower case r for our shape selection: rectangle.
sighting_time | city -------------------------+-------------- 1979-01-31 11:00:00+0000 | Houston 1980-08-12 19:00:00+0000 | Honolulu 1997-08-12 23:00:30+0000 | Fultz 1998-06-12 21:00:00+0000 | West Chicago
Alright! We're getting to know our data and Cassandra! Let's kick this up a notch...
- Let's try a couple select statements with the circle shape but let's try a new tool called DevCenter! It's already pre-installed on your VM. Here is how you set it up:
-
Click on the "DataStax" logo, (that's the x with the nodes around it).
-
Hit "shift+control+alt+N" to start a new connection, or find the plus sign plug logo in the upper left window.
-
When the dialog box pops up, name your connection if you wish and then under contact hosts simply input localhost and hit add and then hit finish! Boom, your connection is established.
-
Specify your connection and keyspace in the upper central window.
-
Now in the upper central window you can start writing statements and seeing their output directly below. To execute a script hit (alt+F11) or the green play button at the top right of the upper central window. DevCenter will also alert you to errors in syntax and other cool, helpful matters as well!
select sighting_time, city from ufo_sightings_by_shape_date where shape = 'Circle' limit 10; sighting_time | city --------------------------+---------------------------------- 1935-06-15 00:00:00+0000 | Crater Lake 1941-07-02 11:00:30+0000 | Forest Home 1944-06-01 19:00:00+0000 | Penuela (Puerto Rico) 1945-06-01 12:00:00+0000 | Germany or Holland (from bomber) 1945-06-06 09:00:00+0000 | Normandy (France) 1947-01-01 17:00:00+0000 | Manama (Bahrain) 1947-06-01 17:00:00+0000 | Espanola (outside of) 1947-06-20 12:00:00+0000 | Hayward 1947-07-14 11:00:00+0000 | Richmond 1947-10-15 21:00:00+0000 | Palmira
So Circle UFOs have been reported since the 30s. Cool!
select sighting_time, city from ufo_sightings_by_shape_date where shape = 'Circle' order by sighting_time DESC limit 10;
sighting_time | city
--------------------------+-------------------------
2015-06-04 22:00:30+0000 | Medford
2015-06-04 00:00:30+0000 | Bethpage
2015-06-03 23:00:50+0000 | Garden City
2015-06-03 20:00:15+0000 | Keller
2015-06-02 21:00:00+0000 | Mountain City
2015-06-02 20:00:00+0000 | Blue Springs
2015-05-30 22:00:00+0000 | Providence
2015-05-30 20:00:30+0000 | Bloomingdale
2015-05-30 03:00:30+0000 | Dauphin
2015-05-29 23:00:00+0000 | Texas (western part of)
Oh snaps! We can order by descending to get our most recent data about circles!
cqlsh:ufos> select sighting_time, city from ufo_sightings_by_shape_date where shape = 'Circle' order by sighting_time ASC limit 10;
sighting_time | city
--------------------------+----------------------------------
1935-06-15 00:00:00+0000 | Crater Lake
1941-07-02 11:00:30+0000 | Forest Home
1944-06-01 19:00:00+0000 | Penuela (Puerto Rico)
1945-06-01 12:00:00+0000 | Germany or Holland (from bomber)
1945-06-06 09:00:00+0000 | Normandy (France)
1947-01-01 17:00:00+0000 | Manama (Bahrain)
1947-06-01 17:00:00+0000 | Espanola (outside of)
1947-06-20 12:00:00+0000 | Hayward
1947-07-14 11:00:00+0000 | Richmond
1947-10-15 21:00:00+0000 | Palmira
Now let's try ASC, oh dang, we get the exact same data back as the select without the order by statement. (Hint: That's because Cassandra's default order is ASC.)
Now it's time for you to get creative! You will be given a result and are challenged to reverse engineer the table and query written to get there. This is what you do to data model with Cassandra, you effectively think of the outcome before you model your data.
The most important thing is to have fun, it makes all the learning stick better.
Write the create table statements and select queries to get the following results:
city | sighting_time | shape
----------+--------------------------+-----------
Portland | 2015-05-15 01:00:18+0000 | Other
Portland | 2015-05-15 01:00:15+0000 | Unknown
Portland | 2015-05-01 13:00:01+0000 | Circle
Portland | 2015-04-30 03:00:00+0000 | Light
Portland | 2015-04-15 21:00:21+0000 | Unknown
Portland | 2015-04-11 21:00:08+0000 | Fireball
Portland | 2015-04-04 10:00:58+0000 | Fireball
Portland | 2015-03-29 21:00:30+0000 | Formation
Portland | 2015-03-26 21:00:00+0000 | Fireball
Portland | 2015-02-15 18:00:45+0000 | Formation
Create:
create table ufo_sightings_by_city_date (sighting_id int, sighting_time timestamp, city text, state text, shape text, duration text, summary text, posted timestamp, primary key (city, sighting_time, sighting_id));
Copy:
COPY ufo_sightings_by_city_date (sighting_id, sighting_time, city, state, shape, duration, summary, posted) FROM 'ufos.txt' WITH delimiter = '\t' and header = true ;
Query:
SELECT sighting_time, shape from ufo_sightings_by_city_date where city = 'Portland' order by sighting_time DESC limit 10 ;
We denormalize our data with Cassandra, but as you have seen in this segment of our workshop, we can get more out of it by how we query. It's all the same data but the way we query it can lead to not only better performance but more insight. In summary, that's what's really cool about our new pal Cassandra!
Some reference notes about Data Modeling with Cassandra to get you started!
The method for Cassandra Data Modeling is called the Chebotko Method and here is how it goes:
-
Conceptual Data Model: The first step! This is Chen/ER Notation that describes entities, relationships, roles, keys and cardinalities. This is the abstract foundation for which you build an implementation-dependent Logical model.
-
Logical Data Model: The second step! This is where the real world comes in, this is a diagram which shows the queries we intend to perform and the tables that get us there. A good rule of thumb is 1 table/query.
-
Physical Data Model: The third and final step! Simply our CQL implementation of the Logical Data Model.
Want to learn more?
Check out our online free training from DataStax here! A lot of the fundamental courses are focused on Apache Cassandra and not our closed source product. Woot!