Software carpentry instructor notes for this SQL lesson. Have separated the challenges into a separate gist. Note I have found the time times do not work as stated.
- 1. Prerequisites
- 2. Selecting data
- 3. Sorting and Removing Duplicates
- 4. Filtering
- 5. Calculating New Values
- 6. Missing Data
- 7. Aggregation
- 8. Combining Data
- 9. Data Hygiene
- 10. Creating and Modifying Data
- 11. Programming with Databases - Python
- 12. Programming with Databases - R
- SQLite command line shell from the Unix shell.
- SQLite.
- SQLite plugin (Firefox SQLite Manager).
- survey.db (https://github.com/swcarpentry/sql-novice-survey/raw/gh-pages/files/survey.db).
- Teaching: 10 minutes,
- Exercise: 5 minutes.
- Explain the difference between:
- a table,
- a record,
- a field.
- Explain the difference between:
- a database and
- a database manager.
- Write a query to select all values for specific fields from a single table.
- A relational database is a way to store and manipulate information.
- Databases are arranged as tables.
- Each table has columns (also known as fields) that describe the data, and
- rows (also known as records) which contain the data.
- We query the data through a database manager (a Data Base Managemen System or DBMS).
- Have many different DBMS,
- The commands to interact with the DBMS change from DBMS to DBMS the querying is mostly the same.
- databases are stored in proprietary formats but can export/import data.
- Use SQL (Structured Query Language) to query the database.
Open a terminal, identify where the survey.db
file (if no file is used a temporary file will be created) is and:
sqlite3 survey.db
.help -- no space before the command
.quit -- can also use .exit or ^D on some systems.
Go back in:
.tables -- show what tables are available
- Person: people who took readings.
- Site: locations where readings were taken.
- Survey: the actual readings.
- Visited: when readings were taken at specific sites. Make things a bit easier to read:
.mode columns -- will produce left aligned columns
.header on -- will print column headers
-- Can look inside the tables to see what the fields are
.schema Person
-- Can query the database
SELECT family, personal FROM Person; -- Semicolon terminates the statement.
-- Queries are case insensitive.
-- Convention to put SQL in upper case, names as they are.
-- Get all data
SELECT * from Person -- Forget the semicolon on purpose
-- We are declaring how we want the data
SELECT personal, family FROM Person;
SELECT id, id, id FROM Person;
- Write a query that selects only site names from the Site table.
.schema Site
SELECT name FROM Site;
- A relational database stores information in tables, each of which has a fixed set of columns and a variable number of records.
- A database manager or DBMS is a program that manipulates information stored in a database.
- We write queries in a specialized language called SQL to extract information from databases.
- Use SELECT… FROM… to get values from a database table.
- SQL is case-insensitive (but data is case-sensitive).
- Teaching: 10 min
- Exercises: 10 min
- Write queries that display results in a particular order.
- Write queries that eliminate duplicate values from data.
Want to know:
- what kind of quantity measurements were taken at each site;
- which scientists took measurements on the expedition;
- the sites where each scientist took measurements.
-- Look at the surveys table
.schema Survey
-- Look at the quantities field
SELECT quant FROM Survey; -- lots of redundant data
-- Makes it hard to see the different measurements, use DISTINCT
SELECT DISTINCT quant FROM Survey;
-- Can use on multiple columns
SELECT DISTINCT taken, quant FROM Survey; -- non-adjacent duplicates removed as well
-- Now look at the scientists
.schema Person
SELECT * FROM Person;
-- Impose an order
SELECT * FROM Person ORDER BY id; -- ascending is the default
-- If we want to reverse the order
SELECT * FROM person ORDER BY id DESC;
-- Can order by serveral fields at once
SELECT taken, person, quant FROM Survey ORDER BY taken ASC, person DESC;
-- Can see what scientist was at what site and what measurements were taken
- Write a query that selects distinct dates from the
Visited
table.
.schema Visited
SELECT DISTINCT dated FROM Visited;
- Write a query that displays the full names of the scientists in the
Person
table, ordered by family name.
.schema Person
SELECT personal, family FROM Person ORDER BY family;
- The records in a database table are not intrinsically ordered:
- if we want to display them in some order, we must specify that explicitly with
ORDER BY
.
- if we want to display them in some order, we must specify that explicitly with
- The values in a database are not guaranteed to be unique:
- if we want to eliminate duplicates, we must specify that explicitly as well using
DISTINCT
.
- if we want to eliminate duplicates, we must specify that explicitly as well using
- Teaching: 10 minutes,
- Exercise: 5 minutes.
- Write queries that select records that satisfy user-specified conditions.
- Explain the order in which the clauses in a query are executed.
Want to be able to filter data, e.g. see data that only corresponds to a given site. We use the WHERE
clause.
SELECT * FROM Visited WHERE site='DR-1';
The DBMS:
- checks each row in the Visited table that satisfies the
WHERE
. - It then uses the column names following the
SELECT
keyword to determine which columns to display.
Can use BOOLEANS to do further filtering:
.schema Visited
SELECT * FROM Visited WHERE site='DR-1' AND dated<'1930-01-01';
Note: Most database managers have a special data type for dates. In fact, many have two:
- one for dates, such as
May 31, 1971
, - one for durations, such as
31 days
.
SQLite doesn’t: instead, it stores dates as either:
- text (in the ISO-8601 standard format
YYYY-MM-DD [HH:MM:SS.SSSS]
), - real numbers (the number of days since November 24, 4714 BCE), or
- integers (the number of seconds since midnight, January 1, 1970).
To find measurements taken by either Lake or Roerich, combine the tests on their names using OR
-- Query
SELECT * FROM Survey WHERE person='lake' OR person='roe';
-- Note that although SQL is case insensitive the data is not
SELECT * FROM Survey WHERE person='Lake' OR person='Roe';
-- Alternatively can use the IN clause
SELECT * FROM Survey WHERE person IN ('lake', 'roe');
-- What does this give?
-- Salinity measurements by lake and measurement by roe
SELECT * FROM Survey WHERE quant='sal' AND person='lake' OR person='roe';
-- Be explicit
SELECT * FROM Survey WHERE quant='sal' AND (person='lake' OR person='roe');
-- Can do wildcard matches
SELECT * FROM Visited WHERE site LIKE 'DR%';
-- Can use DISTINCT with WHERE to give a second level of filtering
SELECT DISTINCT person, quant FROM Survey WHERE person='lake' OR person='roe';
DISTINCT
is applied to the values displayed in the chosen columns, not to the entire rows as they are being processed.
Queries are usually grown. Start with something simple and gradually make it more complex.
- Fix the following query - suppose we want to select all sites that lie more than 48 degrees from the poles. Our first query is:
SELECT * FROM Site WHERE (lat > -48) OR (lat < 48);
Explain why this is wrong, and rewrite the query so that it is correct.
SELECT * FROM Site WHERE (lat > -48) AND (lat < 48);
- Normalized salinity readings are supposed to be between 0.0 and 1.0. Write a query that selects all records from
Survey
with salinity values outside this range.
SELECT * from Survey WHERE quant="sal" AND ( reading < 0 OR reading > 1);
- Which of the following are true:
- 'a' LIKE 'a' (T)
- 'a' LIKE '%a' (T)
- 'beta' LIKE '%a' (T)
- 'alpha' LIKE 'a%%' (T)
- 'alpha' LIKE 'a%p%' (T)
- Use
WHERE
to specify conditions that records must meet in order to be included in a query’s results. - Use
AND
,OR
, andNOT
to combine tests. - Filtering is done on whole records, so conditions can use fields that are not actually displayed.
- Write queries incrementally.
- Teaching: 5 minutes,
- Exercise: 5 minutes.
- Write queries that calculate new values for each selected record.
--- Modifying readings on the fly
SELECT 1.05 * reading FROM Survey WHERE quant='rad';
-- Converting Temperatures from F to C
SELECT taken, round(5*(reading-32)/9, 2) FROM Survey WHERE quant='temp';
-- Renaming the column
SELECT taken, round(5*(reading-32)/9, 2) as Celsius FROM Survey WHERE quant='temp';
-- Concatenating the entire name
SELECT personal || ' ' || family FROM Person;
- Write a query that returns all of salinity measurements from the
Survey
table with the values divided by 100.
SELECT quant, reading/100 AS ModifiedReading FROM Survey WHERE quant="sal";
- The
UNION
operator combines the results of two queries:
SELECT * FROM Person WHERE id='dyer' UNION SELECT * FROM Person WHERE id='roe';
Use UNION
to create a consolidated list of salinity measurements in which Roerich’s, and only Roerich’s, have been corrected as described in the previous challenge. The output should be something like:
taken reading
619 0.13
622 0.09
734 0.05
751 0.1
752 0.09
752 0.416
837 0.21
837 0.225
SELECT taken, reading / 100 FROM Survey WHERE quant = 'sal' AND person = 'roe' UNION
SELECT taken, reading FROM Survey WHERE quant = 'sal' AND person != 'roe';
- The site identifiers in the Visited table have two parts separated by a
-
, i.e.:
SELECT DISTINCT site FROM Visited;
Some site identifiers are two letters long and some are three.
- The “in string” function
instr(X, Y)
returns:- the 1-based index of the first occurrence of string Y in string X, or
- 0 if Y does not exist in X.
- The substring function
substr(X, I, [L])
returns:- the substring of X starting at index I, with an optional length L.
Use these two functions to produce a list of unique major site identifiers. (For this data, the list should contain only “DR” and “MSK”).
-- Build up the query
SELECT site, instr(site,"-") FROM Visited;
SELECT site, substr(site,1,instr(site,"-")-1) FROM Visited;
SELECT site, substr(site,1,instr(site,"-")-1) AS NewSiteName FROM Visited;
- Queries can do the usual arithmetic operations on values.
- Use
UNION
to combine the results of two or more queries.
- Teaching: 15 minutes,
- Exercise: 15 minutes.
- Explain how databases represent missing information.
- Explain the three-valued logic databases use when manipulating missing information.
- Write queries that handle missing information correctly.
Data is more often incomplete - there are missing values. These are represented by null
.
-- Look at the data again.
SELECT * FROM Visited; -- 752 appears to have a missing value
-- null does not behave like other values, selecting values before 1930
SELECT * FROM Visited WHERE dated<'1930-01-01';
-- selecting after 1930
SELECT * FROM Visited WHERE dated>='1930-01-01'; -- 752 is nowhere
-- Operation on null is null
SELECT null<'1930-01-01';
SELECT 1+null; -- etc
-- So, the following produces no output
SELECT * FROM Visited WHERE dated=NULL;
-- or
SELECT * FROM Visited WHERE dated!=NULL;
-- Need to use IS NULL
SELECT * FROM Visited WHERE dated IS NULL;
-- The inverse
SELECT * FROM Visited WHERE dated IS NOT NULL;
-- If we wanted to take all salinity measurements not taken by lake
SELECT * FROM Survey WHERE quant='sal' AND person!='lake'; -- But this omits null values
-- Need to add an explicit check
SELECT * FROM Survey WHERE quant='sal' AND (person!='lake' OR person IS NULL);
Need to decide what to do with NULL
values.
Aggregation functions that combine multiple values, such as min
, max
or avg
, ignore null
values.
In the majority of cases, this is a desirable output
- Write a query that sorts the records in
Visited
by date, omitting entries for which the date is not known (i.e., is null).
-- Build the query: look at the data
SELECT * FROM Visited;
-- Ignore the NULL values
SELECT * FROM Visited WHERE dated is NOT NULL;
-- Order the data
SELECT * FROM Visited WHERE dated is NOT NULL ORDER BY dated;
- What do you expect the query:
SELECT * FROM Visited WHERE dated IN ('1927-02-08', NULL);
to produce? What does it actually produce?
- Some database designers prefer to use a [sentinel value] to mark missing data rather than null.
For example, they will use the date “0000-00-00” to mark a missing date, or -1.0 to mark a missing salinity or radiation reading (since actual readings cannot be negative). What does this simplify? What burdens or risks does it introduce?
- Databases use a special value called
NULL
to represent missing information. - Almost all operations on
NULL
produceNULL
. - Queries can test for
NULL
s usingIS NULL
andIS NOT NULL
.
- Teaching: 5 minutes,
- Exercise: 5 minutes.
- Define aggregation and give examples of its use.
- Write queries that compute aggregated values.
- Trace the execution of a query that performs aggregation.
- Explain how missing data is handled during aggregation.
-- Want to calculate ranges and averages for our data
SELECT dated FROM Visited;
-- Take a range of records and produce a single output
SELECT min(dated) FROM Visited;SELECT max(dated) FROM Visited;
-- Three others are avg, count, and sum:
SELECT avg(reading) FROM Survey WHERE quant='sal';
SELECT count(reading) FROM Survey WHERE quant='sal';
SELECT sum(reading) FROM Survey WHERE quant='sal';
-- Can perform multiple aggregations at the same time
SELECT min(reading), max(reading) FROM Survey WHERE quant='sal' AND reading<=1.0;
-- Combine raw and aggregated results
-- aggregated the field but you have not it how, DBMS picks
SELECT person, count(*) FROM Survey WHERE quant='sal' AND reading<=1.0;
-- When there are no values to aggregate you get null values back
SELECT person, max(reading), sum(reading) FROM Survey WHERE quant='missing';
-- Aggregation functions ignore NULL values so you can do:
SELECT min(dated) FROM Visited;
-- and do not have to do:
SELECT min(dated) FROM Visited WHERE dated IS NOT NULL;
-- We know the following will not work
SELECT person, count(reading), round(avg(reading), 2)
FROM Survey
WHERE quant='rad'; -- DBMS picks what (person) will be used
-- Could do it one person at a time
SELECT person, count(reading), round(avg(reading), 2)
FROM Survey
WHERE quant='rad'
AND person='dyer'; -- Change the person as we go along, tedious
-- To aggregate we use the GROUP BY clause
SELECT person, count(reading), round(avg(reading), 2)
FROM Survey
WHERE quant='rad'
GROUP BY person;
-- Can GROUP BY more than one quantity, looking at the number of
-- types of measurement made by each person and their average.
SELECT person, quant, count(reading), round(avg(reading), 2)
FROM Survey
GROUP BY person, quant;
-- Remove NULLs and use ORDER BY
SELECT person, quant, count(reading), round(avg(reading), 2)
FROM Survey
WHERE person IS NOT NULL
GROUP BY person, quant
ORDER BY person, quant;
- How many temperature readings did Frank Pabodie record, and what was their average value?
SELECT * FROM Survey;
-- Get the acronym from the Person table
SELECT * FROM Survey WHERE quant="temp" AND person="pb";
SELECT COUNT(*), AVG(reading) FROM Survey WHERE quant="temp" AND person="pb" GROUP BY person;
SELECT COUNT(*) AS Measurements, AVG(reading) AS Average FROM Survey
WHERE quant="temp" AND person="pb" GROUP BY person;
- The average of a set of values is the sum of the values divided by the number of values.
Does this mean that the avg function returns 2.0 or 3.0 when given the values 1.0, null, and 5.0?
3
-- If you don't believe this:
CREATE TABLE avg_tests(val);
INSERT INTO avg_tests (val)
VALUES
(1),
(5),
(NULL);
SELECT AVG(VAL) FROM avg_tests;
DROP TABLE avg_tests;
- Want to calculate the difference between each individual radiation reading and the average of all the radiation readings. We write the query:
SELECT reading - avg(reading) FROM Survey WHERE quant='rad';
What does this actually produce, and why?
Mixing aggregate and non-aggregate values.
-- Could do in 2 steps
SELECT avg(reading) FROM Survey WHERE quant='rad'; -- 6.5625
-- Insert the resulting value directly
SELECT reading - 6.5625 AS Deviant FROM Survey WHERE quant='rad';
Deviant
----------
3.2575
1.2375
1.8475
0.6575
-2.2125
-4.3725
-5.1025
4.6875
-- Or in one step (using a nested query)
SELECT reading - (SELECT avg(reading) FROM Survey WHERE quant='rad') AS Deviant
FROM Survey WHERE quant='rad';
Deviant
3.2575
1.2375
1.8475
0.6575
-2.2125
-4.3725
-5.1025
4.6875
- The function
group_concat(field, separator)
concatenates all the values in a field using the specified separator character (or ‘,’ if the separator isn’t specified). Use this to produce a one-line list of scientists’ names, such as:
William Dyer, Frank Pabodie, Anderson Lake, Valentina Roerich, Frank Danforth
Can you find a way to order the list by surname?
SELECT GROUP_CONCAT(personal||" "||family,", ") AS Names FROM Person;
SELECT GROUP_CONCAT(personal||" "||family,", ") AS Names FROM (SELECT personal, family
FROM Person ORDER BY family);
- Use aggregation functions to combine multiple values.
- Aggregation functions ignore NULL values.
- Aggregation happens after filtering.
- Use
GROUP BY
to combine subsets separately. - If no aggregation function is specified for a field, the query may return an arbitrary value for that field.
- Teaching: 20 minutes,
- Exercise: 20 minutes.
- Explain the operation of a query that joins two tables.
- Explain how to restrict the output of a query containing a join to only include meaningful combinations of values.
- Write queries that join tables on equal keys.
- Explain what primary and foreign keys are, and why they are useful.
Sometimes we want to JOIN
data from one or more tables.
-- Look at the schema
.schema Site
.schema Visited
-- Look at some of the data
SELECT * FROM Site LIMIT 5;
SELECT * FROM Visited LIMIT 5; -- Can join on Site.name to Visited.site
-- Do a JOIN
SELECT * FROM Site JOIN Visited;
JOIN creates the cross product of two tables - gives all possible combinations (3*8) with 3
columns contributed from each table. Have not told it what to JOIN
on though.
-- Only want combinations from the same site
SELECT * FROM Site JOIN Visited ON Site.name=Visited.site;
-- We can be specific about what is JOINED
SELECT Site.lat, Site.long, Visited.dated
FROM Site JOIN Visited
ON Site.name=Visited.site;
-- Can JOIN on more than one table
.schema Visited
.schema Survey
SELECT * FROM Visited LIMIT 5;
SELECT * FROM Survey LIMIT 5; -- Can join on Visited.id and Survey.taken
-- Joining across 3 tables
SELECT Site.lat, Site.long, Visited.dated, Survey.quant, Survey.reading
FROM Site JOIN Visited JOIN Survey
ON Site.name=Visited.site
AND Visited.id=Survey.taken
AND Visited.dated IS NOT NULL;
Have:
- primary keys - a value, or combination of values, that uniquely identifies each record in a table
- foreign keys - a value (or combination of values) from one table that identifies a unique record in another table Another way of saying this is that a foreign key is the primary key of one table that appears in some other table.
SQLite automatically numbers records:
.schema Person
-- but
SELECT rowid, * FROM Person;
- Write a query that lists all radiation readings from the DR-1 site.
SELECT Survey.reading FROM Survey JOIN Visited ON Survey.taken=Visited.id WHERE Visited.site="DR-1";
- Write a query that lists all sites visited by people named “Frank”.
SELECT Survey.taken FROM Survey JOIN Person ON Survey.person=Person.id WHERE Person.personal="Frank";
SELECT Visited.Site, Person.personal, Person.family FROM Visited JOIN Survey JOIN Person ON
Survey.person=Person.id WHERE Person.personal="Frank" AND Visited.id=Survey.taken;
- Describe in your own words what the following query produces:
SELECT Site.name FROM Site JOIN Visited
ON Site.lat<-49.0 AND Site.name=Visited.site AND Visited.dated>='1932-01-01';
The Site visited with a latitude less than -49 degrees later than 1932.
- Write a query that shows:
- each site with exact location (lat, long) ordered by visited date,
- followed by personal name and family name of the person who visited the site and
- The type of measurement taken and its reading.
Please avoid all null values.
Tip: you should get 15 records with 8 fields.
SELECT si.name, si.lat, si.long, vi.dated, pe.personal, pe.family,
su.quant, su.reading
FROM
Site AS si JOIN Visited AS vi JOIN Survey AS su JOIN Person as pe ON
si.name=vi.site AND vi.dated IS NOT NULL AND su.person IS NOT NULL AND
vi.id=su.taken AND su.person IS NOT NULL AND su.person=pe.id
ORDER BY vi.dated;
- Use
JOIN
to combine data from two tables. - Use
table.field
notation to refer to fields when doing joins. - Every fact should be represented in a database exactly once.
- A join produces all combinations of records from one table with records from another.
- A primary key is a field (or set of fields) whose values uniquely identify the records in a table.
- A foreign key is a field (or set of fields) in one table whose values are a primary key in another table.
- We can eliminate meaningless combinations of records by matching primary keys and foreign keys between tables.
- The most common join condition is matching keys.
- Teaching: 15 minutes,
- Exercise: 15 minutes.
- Explain what an atomic value is.
- Distinguish between atomic and non-atomic values.
- Explain why every value in a database should be atomic.
- Explain what a primary key is and why every record should have one.
- Identify primary keys in database tables.
- Explain why database entries should not contain redundant information.
- Identify redundant information in databases.
Some rules:
- Every value should be atomic, i.e., not contain parts that we might want to work with separately - e.g. personal and family names are in separate columns instead of putting all in one column so that we don’t have to use substring operations to get the name’s components.
- Every record should have a unique primary key. This can be a serial number that has no intrinsic meaning, one of the values in the record (like the id field in the Person table), or even a combination of values: the triple (taken, person, quant) from the Survey table uniquely identifies every measurement.
- There should be no redundant information. Want to minimise the number of places where we have to make changes, i.e. do not replicate the same data.
- The units for every value should be stored explicitly. The database we are using this does not do this - we do not know if everyone is using the same units.
-
Which of the following are atomic values? Which are not? Why?
- New Zealand
- 87 Turing Avenue
- January 25, 1971
- the XY coordinate (0.5, 3.3)
-
What is the primary key in this table? I.e., what value or combination of values uniquely identifies a record?
latitude | longitude | date | temperature |
---|---|---|---|
57.3 | -22.5 | 2015-01-09 | -14.2 |
- Every value in a database should be atomic.
- Every record should have a unique primary key.
- A database should not contain redundant information.
- Units and similar metadata should be stored with the data.
- Teaching: 20 minutes,
- Exercise: 20 minutes.
- Write statements that creates tables.
- Write statements to insert, modify, and delete records.
Look at modifying and creating data. When using .schema
command we get the info on how to create tables:
-- Create a new table
CREATE TABLE Survey2(taken integer, person text, quant real, reading real);
-- Check
.tables
We can remove a table by using (CAREFUL):
-- Delete the table
DROP TABLE Survey2;
-- Check
.tables
When creating tables you need to specify data types (for the type of data the columns are going to hold), most databases will support:
INTEGER
a signed integerREAL
a floating point numberTEXT
a character stringBLOB
a “binary large object”, such as an image
Most databases also support:
Booleans
- SQLite uses 0/1.date/time
- SQLite uses a string in a specific format.
Some support latitude and longitude as intrinsic data types, etc.
Can specify constraints:
CREATE TABLE Survey2(
taken integer not null, -- where reading taken
person text, -- may not know who took it
quant real not null, -- the quantity measured
reading real not null, -- the actual reading
primary key(taken, quant),
foreign key(taken) references Visited(id),
foreign key(person) references Person(id)
);
Some of these constraints and their formulation are DBMS specific.
Once tables have been created, we can:
- add records using
INSERT
- change records using
UPDATE
- remove records using
DELETE
-- Create a temporary table
CREATE TABLE Site2(name text, lat real, long real);
-- Check
.tables
-- Insert some values
-- General: Syntax INSERT INTO TableName(Cols,...) VALUES(x,...)
INSERT INTO Site2 values('DR-1', -49.85, -128.57);
INSERT INTO Site2 values('DR-3', -47.15, -126.72);
INSERT INTO Site2 values('MSK-4', -48.87, -123.40);
-- Check
SELECT * FROM Site2;
-- or you could populate from an existing source
INSERT INTO Site2 SELECT name, lat, long FROM Site;
Modifying existing records uses the UPDATE
command.
-- Check state
SELECT * FROM Site2;
-- Make a correction
UPDATE Site2 SET lat=-47.87, long=-122.40 WHERE name='MSK-4';
-- Check change
SELECT * FROM Site2;
To remove records you use DELETE
.
-- Create a temporary table
CREATE TABLE Person2 (id text, personal text, family text);
-- Populate it
INSERT INTO Person2 SELECT id, personal, family FROM Person;
-- Check
SELECT * FROM Person2;
-- Now
SELECT * FROM Survey; -- no danforth
-- Delete entry
DELETE FROM Person2 WHERE id = 'danforth';
-- Check
SELECT * FROM Person2;
However, if you had removed lake
instead this would have left a whole bunch of dangling pointers. We have
broken the referential integrity (all references between tables can always be resolved correctly) of our database.
There is such a thing as a cascading deletes that removes the associated references (and beyond the scope of this lesson). Can also have hybrid models where some of the data is the database but other data, e.g. files, are located outwith the databases.
- Write an SQL statement to replace all uses of null in Survey.person with the string 'unknown'.
UPDATE Survey SET person="unknown" WHERE person IS NULL;
- One of our colleagues has sent us a CSV file containing temperature readings by Robert Olmstead, which is formatted like this:
Taken,Temp
619,-21.5
622,-15.5
- Write the SQL
INSERT
statements that would add this data to the surveys database. - Write a small R/bash/Python program that reads this file in and prints out the SQL
INSERT
statements needed to add these records to the survey database.
Note: you will need to add an entry for Olmstead to the Person
table. If you are testing your program repeatedly, you may want to investigate SQL’s INSERT
or REPLACE
command.
So you want output:
-- Need a new person
INSERT INTO Person VALUES("olmstead","Robert","Olmstead");
-- Insert The values into the Survey table
INSERT INTO Survey(taken,person,quant,reading) VALUES(619,"olmstad","temp",-21.5);
-- or
INSERT INTO Survey VALUES(622,"olmstad","temp",-15.5);
bash script
Call this samp.sh
:
# Thanks to question 4286469 on stackoverflow
read # This skips the frist line 31911179
while IFS=, read col1 col2
do
echo "INSERT INTO Survey VALUES($col1,\"olmstad\",\"temp\",$col2);"
done
Assuming the csv file is called samp.csv
run it using:
bash samp.sh <samp.csv
R script
Can use the follwing script. It must be in the same directory as the samp.csv
file.
# Read the csv file
reading <- read.csv("./samp.csv",stringsAsFactors = FALSE)
printInsert <- function(x){
cat(paste0("INSERT INTO Survey VALUES(",x[1],',"olmstad","temp",',x[2],");","\n"))
}
# Now apply the function to each row
#apply(reading,1,printInsert) # Get an extra NULL
# Loop over the rows
for (i in 1:nrow(reading)){
cat(paste0("INSERT INTO Survey VALUES(",reading[1],',"olmstad","temp",',reading[2],");","\n"))
}
Run it using:
Rscript samp.R
assuming that Rscript
is in your path.
-
SQLite has several administrative commands that aren’t part of the SQL standard. One of these are:
.dump
, which prints the SQL commands needed to re-create the database..load
, which reads a file created by .dump and restores the database.
A colleague of yours thinks that storing dump files (which are text) in version control is a good way to track and manage changes to the database. What are the pros and cons of this approach?
Hint: records aren’t stored in any particular order.
- Teaching: 20 minutes,
- Exercise: 15 minutes.
- Write short programs that execute SQL queries.
- Trace the execution of a program that contains an SQL query.
- Explain why most database applications are written in a general-purpose language rather than in SQL.
Look at how you can use Python to access a database:
import sqlite3
connection = sqlite3.connect("survey.db") # Will usually req host, username,pasword
cursor = connection.cursor() # Keep track of where we are in the database
cursor.execute("SELECT Site.lat, Site.long FROM Site;")
results = cursor.fetchall() # Returns a list of tuples.
for r in results:
print r
cursor.close()
connection.close()
Queries in real applications will often depend on values provided by users. For example, this function takes a user’s ID as a parameter and returns their name:
import sqlite3
def get_name(database_file, person_id):
query = "SELECT personal || ' ' || family FROM Person WHERE id='" + person_id + "';"
connection = sqlite3.connect(database_file)
cursor = connection.cursor()
cursor.execute(query)
results = cursor.fetchall()
cursor.close()
connection.close()
return results[0][0]
print "full name for dyer:", get_name('survey.db', 'dyer')
Need to be careful if you expect input from a user. They could specify their name as:
dyer'; DROP TABLE Survey; SELECT '
So you will execute:
SELECT personal || ' ' || family FROM Person WHERE id='dyer'; DROP TABLE Survey; SELECT '';
and drop one of the tables from your database. This is called an SQL injection attack.
Overcome this by using a prepared statement:
def get_name(database_file, person_id):
query = "SELECT personal || ' ' || family FROM Person WHERE id=?;"
connection = sqlite3.connect(database_file)
cursor = connection.cursor()
cursor.execute(query, [person_id])
results = cursor.fetchall()
cursor.close()
connection.close()
return results[0][0]
print "full name for dyer:", get_name('survey.db', 'dyer')
The key changes are in the query string and the execute call. Instead of formatting the query ourselves, we put question marks in the query template where we want to insert values. When we call execute, we provide a list that contains as many values as there are question marks in the query. The library matches values to question marks in order, and translates any special characters in the values into their escaped equivalents so that they are safe to use.
-
Write a Python program that creates a new database in a file called
original.db
containing a single table calledPressure
, with a single field calledreading
, and inserts 100,000 random numbers between 10.0 and 25.0. How long does it take this program to run? How long does it take to run a program that simply writes those random numbers to a file? -
Write a Python program that creates a new database called
backup.db
with the same structure asoriginal.db
and copies all the values greater than 20.0 fromoriginal.db
tobackup.db
. Which is faster: filtering values in the query, or reading everything into memory and filtering in Python?
- General-purpose languages have libraries for accessing databases.
- To connect to a database, a program must use a library specific to that database manager.
- These libraries use a connection-and-cursor model.
- Programs can read query results in batches or all at once.
- Queries should be written using parameter substitution, not string formatting.
- Teaching: 30 minutes,
- Exercise: 15 minutes.
- Write short programs that execute SQL queries.
- Trace the execution of a program that contains an SQL query.
- Explain why most database applications are written in a general-purpose language rather than in SQL.
A short R program to read from our SQLite database:
library(RSQLite)
connection <- dbConnect(SQLite(), "survey.db") # Establish connecion with data base
results <- dbGetQuery(connection, "SELECT Site.lat, Site.long FROM Site;")
print(results)
dbDisconnect(connection) # Once we are done close the connection.
Queries in real applications will often depend on values provided by users. For example, this function takes a user’s ID as a parameter and returns their name:
library(RSQLite)
connection <- dbConnect(SQLite(), "survey.db")
getName <- function(personID) {
query <- paste0("SELECT personal family FROM Person WHERE id =='",personID, "';")
return(dbGetQuery(connection, query))
}
print(paste("full name for dyer:", getName('dyer')))
dbDisconnect(connection)
When you run this you get:
full name for dyer: William Dyer
However, if this were asking a user and a user were to supply:
dyer'; DROP TABLE Survey; SELECT '
This would result in the query:
SELECT personal family FROM Person WHERE id='dyer'; DROP TABLE Survey; SELECT '';
You have lost a table!! This is called an SQL injection attack, and it has been used to attack thousands of programs over the years.
See XKCD 327. The safest way to deal with this threat is to replace characters like quotes with their escaped equivalents, so that we can safely put whatever the user gives us inside a string. We can do this by using a prepared statement instead of formatting our statements as strings.
library(RSQLite)
connection <- dbConnect(SQLite(), "survey.db")
getName <- function(personID) {
query <- "SELECT personal family FROM Person WHERE id == ?"
return(dbGetPreparedQuery(connection, query, data.frame(personID,,stringsAsFactors = FALSE)))
}
print(paste("full name for dyer:", getName('dyer')))
dbDisconnect(connection)
The key changes are in the query string and the dbGetQuery
call (we use dbGetPreparedQuery
instead). Instead of formatting the query ourselves, we put question marks in the query template where we want to insert values. When we call dbGetPreparedQuery
, we provide a dataframe that contains as many values as there are question marks in the query. The library matches values to question marks in order, and translates any special characters in the values into their escaped equivalents so that they are safe to use.
R’s database interface packages (like RSQLite
) all share a common set of helper functions useful for exploring databases and reading/writing entire tables at once.
To view all tables in a database, we can use dbListTables()
:
connection <- dbConnect(SQLite(), "survey.db")
dbListTables(connection)
To view all column names of a table, use dbListFields()
:
dbListFields(connection, "Survey")
To read an entire table as a dataframe, use dbReadTable()
:
dbReadTable(connection, "Person")
Finally to write an entire table to a database, you can use dbWriteTable()
. Note that we will always want to use the row.names = FALSE
argument or R will write the row names as a separate column. In this example we will write R’s built-in iris dataset as a table in survey.db
.
dbWriteTable(connection, "iris", iris, row.names = FALSE)
head(dbReadTable(connection, "iris"))
And as always, remember to close the database connection when done!
dbDisconnect(connection)
-
Write an R program that creates a new database in a file called
original.db
containing a single table calledPressure
, with a single field called reading, and inserts 100,000 random numbers between 10.0 and 25.0. How long does it take this program to run? How long does it take to run a program that simply writes those random numbers to a file? -
Write an R program that creates a new database called
backup.db
with the same structure asoriginal.db
and copies all the values greater than 20.0 fromoriginal.db
tobackup.db
. Which is faster:- filtering values in the query, or
- reading everything into memory and filtering in R?
- Data analysis languages have libraries for accessing databases.
- To connect to a database, a program must use a library specific to that database manager.
- R’s libraries can be used to directly query or read from a database.
- Programs can read query results in batches or all at once.
- Queries should be written using parameter substitution, not string formatting.
- R has multiple helper functions to make working with databases easier.
This work is licensed under a Creative Commons Attribution 4.0 International License.