CQLSH first steps: creating a keyspace and a table, and importing data with the COPY command.
Download tallest_buildings.csv to a Scylla node: https://gist.github.com/gnumoreno/c24a3cc0d6633f1d38103be5f3511767
Create a keyspace "architecture"
CREATE KEYSPACE architecture
WITH replication =
{'class': 'SimpleStrategy', 'replication_factor': '1'}
AND durable_writes = true;
Create a table "buildings" with columns: building, city, country, height_meters, height_feet, floors, year. Make the column "building" your primary key.
CREATE TABLE architecture.buildings (
building text,
city text,
country text,
height_meters smallint,
height_feet smallint,
floors tinyint,
year smallint,
PRIMARY KEY (building));
Using the COPY command, import the csv data into the table you created.
COPY architecture.buildings
FROM 'tallest_buildings.csv'
WITH HEADER = TRUE;
From the output of the COPY command, what went wrong?
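Judging by the schema corrections applied in the next section, the import most likely fails for two reasons: fractional height values cannot be parsed as smallint, and some floor counts overflow tinyint (a signed 8-bit integer, range -128..127). A minimal Python sketch of both failure modes, using hypothetical values (check the actual COPY error output against your CSV):

```python
# 1. A fractional height such as "452.1" cannot be parsed as a
#    smallint (a signed 16-bit integer):
def parse_smallint(value: str) -> int:
    n = int(value)  # raises ValueError for "452.1"
    assert -2**15 <= n < 2**15, "out of smallint range"
    return n

# 2. A floor count such as 163 overflows a signed tinyint,
#    whose range is -128..127:
def fits_tinyint(n: int) -> bool:
    return -2**7 <= n < 2**7

try:
    parse_smallint("452.1")
    ok = True
except ValueError:
    ok = False

print(ok)                 # False: fractional text is not an integer
print(fits_tinyint(163))  # False: 163 > 127
print(fits_tinyint(110))  # True
```

This is why the ALTER TABLE statements below change height_meters to decimal and floors to smallint.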
CORRECTING the TABLE SCHEMA 1: TRUNCATING, ALTER, COPY with columns, COUNT
Delete all the data in the "buildings" table
TRUNCATE architecture.buildings;
Using ALTER TABLE, correct all the problems you saw in the previous exercise in your data model, then re-import the data.
ALTER TABLE architecture.buildings
DROP height_meters;
ALTER TABLE architecture.buildings
DROP floors;
ALTER TABLE architecture.buildings
ADD height_meters decimal;
ALTER TABLE architecture.buildings
ADD floors smallint;
COPY architecture.buildings (
building,
city,
country,
height_meters,
height_feet,
floors,
year
)
FROM 'tallest_buildings.csv'
WITH HEADER = TRUE;
SELECT count(*) from architecture.buildings;
Why is there a difference between the number of entries in the CSV file and the number of rows in the table?
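In CQL, inserts are upserts: two rows with the same primary key silently overwrite each other. With PRIMARY KEY (building) alone, any two buildings that share a name collapse into one row. A Python dict models this, using hypothetical rows:

```python
# Hypothetical rows: two pairs of distinct buildings sharing a name.
rows = [
    ("Chase Tower", "Chicago", "USA"),
    ("Chase Tower", "Dallas", "USA"),
    ("Aon Center", "Chicago", "USA"),
    ("Aon Center", "Los Angeles", "USA"),
]

# PRIMARY KEY (building): inserts are upserts, so the second row
# with the same building name silently overwrites the first.
by_building = {}
for building, city, country in rows:
    by_building[building] = (city, country)
print(len(by_building))  # 2 -- two rows were lost

# PRIMARY KEY (building, city): the key is now unique per row.
by_building_city = {}
for building, city, country in rows:
    by_building_city[(building, city)] = country
print(len(by_building_city))  # 4 -- all rows kept
```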
CORRECTING the TABLE SCHEMA 2: ADD CLUSTERING, COPY, COUNT, SELECT IN
To resolve the problem from the previous exercise, drop the table, create it again with the proper data model, and re-import the data. Make sure that this time you get the desired result!
DROP TABLE architecture.buildings;
CREATE TABLE architecture.buildings (
building text,
city text,
country text,
height_meters decimal,
height_feet smallint,
floors smallint,
year smallint,
PRIMARY KEY (building,city));
COPY architecture.buildings (
building,
city,
country,
height_meters,
height_feet,
floors,
year
)
FROM 'tallest_buildings.csv'
WITH HEADER = TRUE;
SELECT count(*) from architecture.buildings;
Using the IN clause, select multiple buildings from the table.
SELECT * from architecture.buildings
WHERE building IN
(
'Chase Tower',
'Aon Center',
'Bank of America Plaza',
'Trump International Hotel and Tower'
);
Create a MV with the column "city" as the partition key, "building" as the clustering key, and all other columns as regular columns.
USE architecture;
CREATE MATERIALIZED VIEW building_by_city AS
SELECT * FROM buildings
WHERE building IS NOT NULL AND city IS NOT NULL
PRIMARY KEY(city, building);
SELECT * FROM building_by_city LIMIT 10;
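A materialized view is a server-maintained table holding the same rows re-keyed for a different query: every write to the base table is propagated to the view automatically. A rough Python sketch of that maintenance, with hypothetical rows:

```python
# Base table keyed by (building, city); view re-keys by (city, building).
base = {}  # (building, city) -> row
view = {}  # (city, building) -> row

def insert(building, city, country):
    row = {"building": building, "city": city, "country": country}
    base[(building, city)] = row
    view[(city, building)] = row  # done by the server, not the client

insert("Willis Tower", "Chicago", "USA")
insert("Aon Center", "Chicago", "USA")
insert("One World Trade Center", "New York City", "USA")

# The view answers "buildings in a city" with a partition-key lookup:
chicago = [b for (c, b) in view if c == "Chicago"]
print(sorted(chicago))  # ['Aon Center', 'Willis Tower']
```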
Try to create a MV using the column "year" as the partition key and "building" as the clustering key. Include all other columns as regular columns.
CREATE MATERIALIZED VIEW building_by_year AS
SELECT * FROM buildings
WHERE building IS NOT NULL
AND year IS NOT NULL
PRIMARY KEY(year, building);
InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot create Materialized View building_by_year without primary key columns from base buildings (city)"
Why do we get the above error?
The correct way to create the MV:
CREATE MATERIALIZED VIEW building_by_year AS
SELECT * FROM buildings
WHERE building IS NOT NULL AND city IS NOT NULL
AND year IS NOT NULL
PRIMARY KEY(year, city, building);
SELECT * FROM building_by_year
WHERE year = 2012;
Why does the above command work when the previous one did not?
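The view's primary key must include every primary key column of the base table; otherwise two distinct base rows could map to the same view row, and a base update or delete could not be translated unambiguously. A small Python sketch with hypothetical rows:

```python
# Base rows: same building name and year, different cities (hypothetical).
base_rows = [
    {"building": "Chase Tower", "city": "Chicago", "year": 1969},
    {"building": "Chase Tower", "city": "Dallas",  "year": 1969},
]

# View keyed by (year, building) only: the two base rows collide,
# so deleting one base row would leave the view ambiguous.
bad_view = {(r["year"], r["building"]): r for r in base_rows}
print(len(bad_view))   # 1 -- lossy mapping, hence the server refuses it

# View keyed by (year, city, building): every base row maps to a
# distinct view row, so base writes translate unambiguously.
good_view = {(r["year"], r["city"], r["building"]): r for r in base_rows}
print(len(good_view))  # 2
```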
Try to create a MV with the primary key columns "year", "height_meters", "building", and "city", and all other columns as regular columns.
CREATE MATERIALIZED VIEW building_by_year_height_meters AS
SELECT * FROM buildings
WHERE building IS NOT NULL AND city is NOT NULL
AND year IS NOT NULL AND height_meters IS NOT NULL
PRIMARY KEY(year, height_meters, building, city);
InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot include more than one non-primary key column 'height_meters' in materialized view primary key"
Let's create MVs with different sorting orders using the column "country" as the partitioning key.
CREATE MATERIALIZED VIEW building_by_country AS
SELECT * FROM buildings
WHERE building IS NOT NULL AND city IS NOT NULL
AND country IS NOT NULL
PRIMARY KEY(country, city, building)
WITH CLUSTERING ORDER BY (city DESC, building ASC);
CREATE MATERIALIZED VIEW building_by_country_reverse AS
SELECT * FROM buildings
WHERE building IS NOT NULL AND city IS NOT NULL
AND country IS NOT NULL
PRIMARY KEY(country, city, building)
WITH CLUSTERING ORDER BY (city ASC, building DESC);
Compare the MVs you created.
SELECT * FROM building_by_country
WHERE country = 'USA'
AND city IN ('Chicago', 'New York City');
SELECT * FROM building_by_country_reverse
WHERE country = 'USA'
AND city IN ('Chicago', 'New York City');
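The two views hold the same rows; only the on-disk sort order within each partition differs. A Python sketch of the two clustering orders, using hypothetical rows from the 'USA' partition (CQL text ordering is byte order, which matches Python string comparison for ASCII):

```python
rows = [
    ("Chicago", "Aon Center"),
    ("Chicago", "Willis Tower"),
    ("New York City", "One World Trade Center"),
    ("New York City", "Empire State Building"),
]

# building_by_country: CLUSTERING ORDER BY (city DESC, building ASC).
# Stable sorts applied minor-key-first emulate the composite order.
order_1 = sorted(rows, key=lambda r: r[1])                   # building ASC
order_1 = sorted(order_1, key=lambda r: r[0], reverse=True)  # city DESC

# building_by_country_reverse: CLUSTERING ORDER BY (city ASC, building DESC).
order_2 = sorted(rows, key=lambda r: r[1], reverse=True)     # building DESC
order_2 = sorted(order_2, key=lambda r: r[0])                # city ASC

print(order_1[0])  # ('New York City', 'Empire State Building')
print(order_2[0])  # ('Chicago', 'Willis Tower')
```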
But wait! This is a full table scan. I'm glad we have very little data. We should always restrict our queries using partition keys.
SELECT * FROM menus where location = 'Warsaw';
If we had a lot of data this would perform very well.
But what if I wanted to query by other fields? Let's try.
SELECT * from menus where dish_type = 'Polish soup';
Oops!
InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"
SELECT * from menus where location = 'Warsaw' and dish_type = 'Polish soup';
Oops again!
InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"
If we add "ALLOW FILTERING" to the above queries, they will work. But just like our first query, because we are filtering on regular columns they will be full table scans: VERY INEFFICIENT!
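The difference in cost is easy to see in miniature. Filtering on a regular column must examine every row, while an index is a precomputed reverse mapping from column value to matching rows. A Python sketch with hypothetical menu rows:

```python
menus = [
    {"location": "Warsaw", "name": "zurek",   "dish_type": "Polish soup"},
    {"location": "Warsaw", "name": "pierogi", "dish_type": "dumplings"},
    {"location": "Krakow", "name": "barszcz", "dish_type": "Polish soup"},
]

scanned = 0

def full_scan(dish_type):
    # ALLOW FILTERING in miniature: every row is examined.
    global scanned
    hits = []
    for row in menus:
        scanned += 1
        if row["dish_type"] == dish_type:
            hits.append(row["name"])
    return hits

full_scan("Polish soup")
print(scanned)  # 3 -- all rows touched to find 2 hits

# An index is a reverse mapping built ahead of time, so the same
# query reads only the matching entries:
index = {}
for row in menus:
    index.setdefault(row["dish_type"], []).append(row["name"])
print(index["Polish soup"])  # ['zurek', 'barszcz']
```

With three rows this hardly matters; with millions of rows spread across a cluster, the scan touches every replica while the index lookup touches only the relevant ones.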
Indexes to the rescue!
Let's take a look at our current schema and sstables. It will help us compare and contrast and further our understanding.
DESC KEYSPACE restaurant_chain;
exit
nodetool flush
cd /var/lib/scylla/data/restaurant_chain/
ls -l */*
GLOBAL INDEX - On terminal #2
docker exec -ti scylla-si /bin/bash
cqlsh
USE restaurant_chain;
CREATE INDEX ON menus(dish_type);
SELECT * from menus where dish_type = 'Polish soup';
DESC KEYSPACE restaurant_chain;
exit
nodetool flush
cd /var/lib/scylla/data/restaurant_chain/
ls -l */*
LOCAL INDEX - On terminal #3
docker exec -ti scylla-si /bin/bash
cqlsh
USE restaurant_chain;
CREATE INDEX ON menus((location),dish_type);
SELECT * from menus where location = 'Warsaw' and dish_type = 'Polish soup';
DESC KEYSPACE restaurant_chain;
exit
nodetool flush
cd /var/lib/scylla/data/restaurant_chain/
ls -l */*
Inspecting sstables for a better understanding
NOTE: The directory and file names below are hypothetical. Get the correct names from the output of ls -l */* (the last command run in each terminal).
Let's inspect our sstables and see how data is stored on disk.
cqlsh
USE restaurant_chain;
SELECT token(location), location, name, dish_type, price
FROM restaurant_chain.menus;
SELECT token(dish_type), dish_type, blobAsBigint(idx_token), location, name
FROM restaurant_chain.menus_dish_type_idx_index;
SELECT token(location), location, dish_type, name
FROM restaurant_chain.menus_dish_type_idx_1_index;
Compare the token values from the base table and the underlying MV of each index, and ask yourself: why are local indexes local, and why are global indexes global?
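The answer comes down to which column the index's underlying view is partitioned by. A rough Python sketch, using a stand-in hash for the partitioner's token function (Scylla actually uses Murmur3; any deterministic hash illustrates the point):

```python
# Stand-in for the partitioner: partition key values -> token.
def token(*partition_key):
    return hash(partition_key)

base_row = {"location": "Warsaw", "name": "zurek", "dish_type": "Polish soup"}

# The base row's placement is determined by its partition key:
base_token = token(base_row["location"])

# Global index: the view is partitioned by the indexed column, so its
# token is computed from dish_type -- generally a different token, and
# therefore possibly a different node, than the base row.
global_idx_token = token(base_row["dish_type"])

# Local index: the view keeps the base partition key (location), so
# its token is identical to the base row's -- always the same node.
local_idx_token = token(base_row["location"])

print(local_idx_token == base_token)   # True: co-located with the base data
print(global_idx_token == base_token)  # False: different partition key value
```

This is why a local index query must always name the partition key (the lookup stays on the base row's node), while a global index can answer dish_type queries cluster-wide at the cost of a possible extra hop.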