marioa/sql.md

## sql.md

      
    Raw
  

              sql.md
            
          
    SQL lesson

Software carpentry instructor notes for this SQL lesson. Have separated the challenges into a separate gist. Note I have found the time times do not work as stated.
Contents


1. Prerequisites
2. Selecting data
3. Sorting and Removing Duplicates
4. Filtering
5. Calculating New Values
6. Missing Data
7. Aggregation
8. Combining Data
9. Data Hygiene
10. Creating and Modifying Data
11. Programming with Databases - Python
12. Programming with Databases - R

1. Prerequisites


SQLite command line shell from the Unix shell.
SQLite.
SQLite plugin (Firefox SQLite Manager).
survey.db (https://github.com/swcarpentry/sql-novice-survey/raw/gh-pages/files/survey.db).

2. Selecting data


Teaching: 10 minutes,
Exercise: 5 minutes.

Objectives


Explain the difference between:

a table,
a record,
a field.


Explain the difference between:

a database and
a database manager.


Write a query to select all values for specific fields from a single table.

Intro


A relational database is a way to store and manipulate information.
Databases are arranged as tables.
Each table has columns (also known as fields) that describe the data, and
rows (also known as records) which contain the data.
We query the data through a database manager (a Data Base Managemen System or DBMS).
Have many different DBMS,

The commands to interact with the DBMS change from DBMS to DBMS the querying is mostly the same.
databases are stored in proprietary formats but can export/import data.


Use SQL (Structured Query Language) to query the database.

Open a terminal, identify where the survey.db file (if no file is used a temporary file will be created) is and:
sqlite3 survey.db
.help -- no space before the command
.quit -- can also use .exit or ^D on some systems.

Go back in:
.tables -- show what tables are available


Person: people who took readings.
Site: locations where readings were taken.
Survey: the actual readings.
Visited: when readings were taken at specific sites.
Make things a bit easier to read:

.mode columns -- will produce left aligned columns
.header on  -- will print column headers

-- Can look inside the tables to see what the fields are
.schema Person

-- Can query the database
SELECT family, personal FROM Person; -- Semicolon terminates the statement.
                                     -- Queries are case insensitive.
                                     -- Convention to put SQL in upper case, names as they are.

-- Get all data
SELECT * from Person  -- Forget the semicolon on purpose

-- We are declaring how we want the data
SELECT personal, family FROM Person;
SELECT id, id, id FROM Person;

Challenge 2.1


Write a query that selects only site names from the Site table.

.schema Site
SELECT name FROM Site;

Key points


A relational database stores information in tables, each of which has a fixed set of columns and a variable number of records.
A database manager or DBMS is a program that manipulates information stored in a database.
We write queries in a specialized language called SQL to extract information from databases.
Use SELECT… FROM… to get values from a database table.
SQL is case-insensitive (but data is case-sensitive).

3. Sorting and Removing Duplicates


Teaching: 10 min
Exercises: 10 min

Objectives


Write queries that display results in a particular order.
Write queries that eliminate duplicate values from data.

Introduction

Want to know:

what kind of quantity measurements were taken at each site;
which scientists took measurements on the expedition;
the sites where each scientist took measurements.

-- Look at the surveys table
.schema Survey

-- Look at the quantities field
SELECT quant FROM Survey; -- lots of redundant data

-- Makes it hard to see the different measurements, use DISTINCT
SELECT DISTINCT quant FROM Survey;

-- Can use on multiple columns
SELECT DISTINCT taken, quant FROM Survey;  -- non-adjacent duplicates removed as well

-- Now look at the scientists
.schema Person
SELECT * FROM Person;

-- Impose an order
SELECT * FROM Person ORDER BY id; -- ascending is the default

-- If we want to reverse the order
SELECT * FROM person ORDER BY id DESC;

-- Can order by serveral fields at once
SELECT taken, person, quant FROM Survey ORDER BY taken ASC, person DESC;

-- Can see what scientist was at what site and what measurements were taken

Challenge 3.1


Write a query that selects distinct dates from the Visited table.

.schema Visited
SELECT DISTINCT dated FROM Visited;


Write a query that displays the full names of the scientists in the Person table, ordered by family name.

.schema Person
SELECT personal, family FROM Person ORDER BY family;

Key Points


The records in a database table are not intrinsically ordered:

if we want to display them in some order, we must specify that explicitly with ORDER BY.


The values in a database are not guaranteed to be unique:

if we want to eliminate duplicates, we must specify that explicitly as well using DISTINCT.


4. Filtering


Teaching: 10 minutes,
Exercise: 5 minutes.

Objectives


Write queries that select records that satisfy user-specified conditions.
Explain the order in which the clauses in a query are executed.

Introduction

Want to be able to filter data, e.g. see data that only corresponds to a given site. We use the WHERE clause.
SELECT * FROM Visited WHERE site='DR-1';

The DBMS:

checks each row in the Visited table that satisfies the WHERE.
It then uses the column names following the SELECT keyword to determine which columns to display.

Can use BOOLEANS to do further filtering:
.schema Visited
SELECT * FROM Visited WHERE site='DR-1' AND dated<'1930-01-01';

Note: Most database managers have a special data type for dates. In fact, many have two:

one for dates, such as May 31, 1971,
one for durations, such as 31 days.

SQLite doesn’t: instead, it stores dates as either:

text (in the ISO-8601 standard format YYYY-MM-DD [HH:MM:SS.SSSS]),
real numbers (the number of days since November 24, 4714 BCE), or
integers (the number of seconds since midnight, January 1, 1970).

To find measurements taken by either Lake or Roerich, combine the tests on their names using OR
-- Query
SELECT * FROM Survey WHERE person='lake' OR person='roe';

-- Note that although SQL is case insensitive the data is not
SELECT * FROM Survey WHERE person='Lake' OR person='Roe';

-- Alternatively can use the IN clause
SELECT * FROM Survey WHERE person IN ('lake', 'roe');

-- What does this give?
-- Salinity measurements by lake and measurement by roe
SELECT * FROM Survey WHERE quant='sal' AND person='lake' OR person='roe';

-- Be explicit
SELECT * FROM Survey WHERE quant='sal' AND (person='lake' OR person='roe');

-- Can do wildcard matches
SELECT * FROM Visited WHERE site LIKE 'DR%';

-- Can use DISTINCT with WHERE to give a second level of filtering
SELECT DISTINCT person, quant FROM Survey WHERE person='lake' OR person='roe';

DISTINCT is applied to the values displayed in the chosen columns, not to the entire rows as they are being processed.
Queries are usually grown. Start with something simple and gradually make it more complex.
Challenge 4.1


Fix the following query - suppose we want to select all sites that lie more than 48 degrees from the poles. Our first query is:

SELECT * FROM Site WHERE (lat > -48) OR (lat < 48);

Explain why this is wrong, and rewrite the query so that it is correct.
SELECT * FROM Site WHERE (lat > -48) AND (lat < 48);


Normalized salinity readings are supposed to be between 0.0 and 1.0. Write a query that selects all records from Survey with salinity values outside this range.

SELECT * from Survey WHERE quant="sal" AND ( reading < 0 OR reading > 1);


Which of the following are true:

'a' LIKE 'a' (T)
'a' LIKE '%a' (T)
'beta' LIKE '%a' (T)
'alpha' LIKE 'a%%' (T)
'alpha' LIKE 'a%p%' (T)


Key points


Use WHERE to specify conditions that records must meet in order to be included in a query’s results.
Use AND, OR, and NOT to combine tests.
Filtering is done on whole records, so conditions can use fields that are not actually displayed.
Write queries incrementally.

5. Calculating New Values


Teaching: 5 minutes,
Exercise: 5 minutes.

Objectives


Write queries that calculate new values for each selected record.

Introduction

--- Modifying readings on the fly
SELECT 1.05 * reading FROM Survey WHERE quant='rad';

-- Converting Temperatures from F to C
SELECT taken, round(5*(reading-32)/9, 2) FROM Survey WHERE quant='temp';

-- Renaming the column
SELECT taken, round(5*(reading-32)/9, 2) as Celsius FROM Survey WHERE quant='temp';

-- Concatenating the entire name
SELECT personal || ' ' || family FROM Person;

5.1 Challenge


Write a query that returns all of salinity measurements from the Survey table with the values divided by 100.

SELECT quant, reading/100 AS ModifiedReading FROM Survey WHERE quant="sal";


The UNION operator combines the results of two queries:

SELECT * FROM Person WHERE id='dyer' UNION SELECT * FROM Person WHERE id='roe';

Use UNION to create a consolidated list of salinity measurements in which Roerich’s, and only Roerich’s, have been corrected as described in the previous challenge. The output should be something like:
taken	reading
619	0.13
622	0.09
734	0.05
751	0.1
752	0.09
752	0.416
837	0.21
837	0.225

SELECT taken, reading / 100 FROM Survey WHERE quant = 'sal' AND person = 'roe' UNION 
SELECT taken, reading FROM Survey WHERE quant = 'sal' AND person != 'roe';


The site identifiers in the Visited table have two parts separated by a -, i.e.:

SELECT DISTINCT site FROM Visited;

Some site identifiers are two letters long and some are three.

The “in string” function instr(X, Y) returns:

the 1-based index of the first occurrence of string Y in string X, or
0 if Y does not exist in X.


The substring function substr(X, I, [L]) returns:

the substring of X starting at index I, with an optional length L.


Use these two functions to produce a list of unique major site identifiers. (For this data, the list should contain only “DR” and “MSK”).
-- Build up the query
SELECT site, instr(site,"-") FROM Visited;

SELECT site, substr(site,1,instr(site,"-")-1) FROM Visited;

SELECT site, substr(site,1,instr(site,"-")-1) AS NewSiteName FROM Visited;

Key points


Queries can do the usual arithmetic operations on values.
Use UNION to combine the results of two or more queries.

6. Missing Data


Teaching: 15 minutes,
Exercise: 15 minutes.

Objectives


Explain how databases represent missing information.
Explain the three-valued logic databases use when manipulating missing information.
Write queries that handle missing information correctly.

Introduction

Data is more often incomplete - there are missing values. These are represented by null.
-- Look at the data again.
SELECT * FROM Visited; -- 752 appears to have a missing value

-- null does not behave like other values, selecting values before 1930
SELECT * FROM Visited WHERE dated<'1930-01-01';

-- selecting after 1930
SELECT * FROM Visited WHERE dated>='1930-01-01'; -- 752 is nowhere

 -- Operation on null is null
 SELECT null<'1930-01-01';
 SELECT 1+null;            -- etc

 -- So, the following produces no output
 SELECT * FROM Visited WHERE dated=NULL;
 -- or
 SELECT * FROM Visited WHERE dated!=NULL;

-- Need to use IS NULL
SELECT * FROM Visited WHERE dated IS NULL;

-- The inverse
SELECT * FROM Visited WHERE dated IS NOT NULL;

-- If we wanted to take all salinity measurements not taken by lake
SELECT * FROM Survey WHERE quant='sal' AND person!='lake';   -- But this omits null values

-- Need to add an explicit check
SELECT * FROM Survey WHERE quant='sal' AND (person!='lake' OR person IS NULL);

Need to decide what to do with NULL values.
Aggregation functions that combine multiple values, such as min, max or avg, ignore null values.
In the majority of cases, this is a desirable output
Challenge 6.1


Write a query that sorts the records in Visited by date, omitting entries for which the date is not known (i.e., is null).

-- Build the query: look at the data
SELECT * FROM Visited;

-- Ignore the NULL values
SELECT * FROM Visited WHERE dated is NOT NULL;

-- Order the data
SELECT * FROM Visited WHERE dated is NOT NULL ORDER BY dated;


What do you expect the query:

SELECT * FROM Visited WHERE dated IN ('1927-02-08', NULL);

to produce? What does it actually produce?

Some database designers prefer to use a [sentinel value] to mark missing data rather than null.

For example, they will use the date “0000-00-00” to mark a missing date, or -1.0 to mark a missing salinity or radiation reading (since actual readings cannot be negative). What does this simplify? What burdens or risks does it introduce?
Key points


Databases use a special value called NULL to represent missing information.
Almost all operations on NULL produce NULL.
Queries can test for NULLs using IS NULL and IS NOT NULL.

7. Aggregation


Teaching: 5 minutes,
Exercise: 5 minutes.

Objectives


Define aggregation and give examples of its use.
Write queries that compute aggregated values.
Trace the execution of a query that performs aggregation.
Explain how missing data is handled during aggregation.

Introduction

-- Want to calculate ranges and averages for our data
SELECT dated FROM Visited;

-- Take a range of records and produce a single output
SELECT min(dated) FROM Visited;SELECT max(dated) FROM Visited;

-- Three others are avg, count, and sum:
SELECT avg(reading) FROM Survey WHERE quant='sal';
SELECT count(reading) FROM Survey WHERE quant='sal';
SELECT sum(reading) FROM Survey WHERE quant='sal';

-- Can perform multiple aggregations at the same time
SELECT min(reading), max(reading) FROM Survey WHERE quant='sal' AND reading<=1.0;

-- Combine raw and aggregated results
-- aggregated the field but you have not it how, DBMS picks
SELECT person, count(*) FROM Survey WHERE quant='sal' AND reading<=1.0; 

-- When there are no values to aggregate you get null values back
SELECT person, max(reading), sum(reading) FROM Survey WHERE quant='missing';

-- Aggregation functions ignore NULL values so you can do:
SELECT min(dated) FROM Visited;

-- and do not have to do:
SELECT min(dated) FROM Visited WHERE dated IS NOT NULL;

-- We know the following will not work
SELECT person, count(reading), round(avg(reading), 2)
FROM  Survey
WHERE quant='rad';  -- DBMS picks what (person) will be used

-- Could do it one person at a time
SELECT person, count(reading), round(avg(reading), 2)
FROM  Survey
WHERE quant='rad'
AND   person='dyer'; -- Change the person as we go along, tedious

-- To aggregate we use the GROUP BY clause
SELECT   person, count(reading), round(avg(reading), 2)
FROM     Survey
WHERE    quant='rad'
GROUP BY person;

-- Can GROUP BY more than one quantity, looking at the number of
-- types of measurement made by each person and their average.
SELECT   person, quant, count(reading), round(avg(reading), 2)
FROM     Survey
GROUP BY person, quant;

-- Remove NULLs and use ORDER BY
SELECT   person, quant, count(reading), round(avg(reading), 2)
FROM     Survey
WHERE    person IS NOT NULL
GROUP BY person, quant
ORDER BY person, quant;

Challenge 7.1


How many temperature readings did Frank Pabodie record, and what was their average value?

SELECT * FROM Survey;

-- Get the acronym from the Person table
SELECT * FROM Survey WHERE quant="temp" AND person="pb";

SELECT COUNT(*), AVG(reading) FROM Survey WHERE quant="temp" AND person="pb" GROUP BY person;

SELECT COUNT(*) AS Measurements, AVG(reading) AS Average FROM Survey 
WHERE quant="temp" AND person="pb" GROUP BY person;


The average of a set of values is the sum of the values divided by the number of values.

Does this mean that the avg function returns 2.0 or 3.0 when given the values 1.0, null, and 5.0?

  3
  
 -- If you don't believe this:
CREATE TABLE avg_tests(val);
INSERT INTO avg_tests (val)
VALUES
 (1),
 (5),
 (NULL);
 SELECT AVG(VAL) FROM avg_tests;
DROP TABLE avg_tests;


Want to calculate the difference between each individual radiation reading and the average of all the radiation readings. We write the query:

SELECT reading - avg(reading) FROM Survey WHERE quant='rad';

What does this actually produce, and why?
Mixing aggregate and non-aggregate values.
-- Could do in 2 steps
SELECT avg(reading) FROM Survey WHERE quant='rad'; -- 6.5625

-- Insert the resulting value directly
SELECT reading - 6.5625 AS Deviant FROM Survey WHERE quant='rad';
Deviant
----------
3.2575
1.2375
1.8475
0.6575
-2.2125
-4.3725
-5.1025
4.6875

-- Or in one step (using a nested query)
SELECT reading - (SELECT avg(reading) FROM Survey WHERE quant='rad') AS Deviant 
FROM Survey WHERE quant='rad';
Deviant
3.2575
1.2375
1.8475
0.6575
-2.2125
-4.3725
-5.1025
4.6875


The function group_concat(field, separator) concatenates all the values in a field using the specified separator character (or ‘,’ if the separator isn’t specified). Use this to produce a one-line list of scientists’ names, such as:

William Dyer, Frank Pabodie, Anderson Lake, Valentina Roerich, Frank Danforth

Can you find a way to order the list by surname?
SELECT GROUP_CONCAT(personal||" "||family,", ") AS Names FROM Person;

SELECT GROUP_CONCAT(personal||" "||family,", ") AS Names FROM (SELECT personal, family 
FROM Person ORDER BY family);

Key Values


Use aggregation functions to combine multiple values.
Aggregation functions ignore NULL values.
Aggregation happens after filtering.
Use GROUP BY to combine subsets separately.
If no aggregation function is specified for a field, the query may return an arbitrary value for that field.

8. Combining Data


Teaching: 20 minutes,
Exercise: 20 minutes.

Objectives


Explain the operation of a query that joins two tables.
Explain how to restrict the output of a query containing a join to only include meaningful combinations of values.
Write queries that join tables on equal keys.
Explain what primary and foreign keys are, and why they are useful.

Introduction

Sometimes we want to JOIN data from one or more tables.
-- Look at the schema
.schema Site
.schema Visited

-- Look at some of the data
SELECT * FROM Site LIMIT 5;
SELECT * FROM Visited LIMIT 5; -- Can join on Site.name to Visited.site

-- Do a JOIN
SELECT * FROM Site JOIN Visited;

JOIN creates the cross product of two tables - gives all possible combinations (3*8) with 3
columns contributed from each table. Have not told it what to JOIN on though.
-- Only want combinations from the same site
SELECT * FROM Site JOIN Visited ON Site.name=Visited.site;

-- We can be specific about what is JOINED
SELECT Site.lat, Site.long, Visited.dated
FROM   Site JOIN Visited
ON     Site.name=Visited.site;

-- Can JOIN on more than one table
.schema Visited
.schema Survey
SELECT * FROM Visited LIMIT 5;
SELECT * FROM Survey LIMIT 5; -- Can join on Visited.id and Survey.taken

-- Joining across 3 tables
SELECT Site.lat, Site.long, Visited.dated, Survey.quant, Survey.reading
FROM   Site JOIN Visited JOIN Survey
ON     Site.name=Visited.site
AND    Visited.id=Survey.taken
AND    Visited.dated IS NOT NULL;

Have:

primary keys - a value, or combination of values, that uniquely identifies each record in a table
foreign keys - a value (or combination of values) from one table that identifies a unique record in another table
Another way of saying this is that a foreign key is the primary key of one table that appears in some other table.

SQLite automatically numbers records:
.schema Person
 -- but
 SELECT rowid, * FROM Person;

Challenge 8.1


Write a query that lists all radiation readings from the DR-1 site.

SELECT Survey.reading FROM Survey JOIN Visited ON Survey.taken=Visited.id WHERE Visited.site="DR-1";


Write a query that lists all sites visited by people named “Frank”.

SELECT Survey.taken FROM Survey JOIN Person ON Survey.person=Person.id WHERE Person.personal="Frank";

SELECT Visited.Site, Person.personal, Person.family FROM Visited JOIN Survey JOIN Person ON
Survey.person=Person.id WHERE Person.personal="Frank" AND Visited.id=Survey.taken;


Describe in your own words what the following query produces:

SELECT Site.name FROM Site JOIN Visited
ON Site.lat<-49.0 AND Site.name=Visited.site AND Visited.dated>='1932-01-01';

The Site visited with a latitude less than -49 degrees later than 1932.

Write a query that shows:

each site with exact location (lat, long) ordered by visited date,
followed by personal name and family name of the person who visited the site and
The type of measurement taken and its reading.
Please avoid all null values. 

Tip: you should get 15 records with 8 fields.


SELECT si.name, si.lat, si.long, vi.dated, pe.personal, pe.family,
su.quant, su.reading
FROM
Site AS si JOIN Visited AS vi JOIN Survey AS su JOIN Person as pe ON
si.name=vi.site AND vi.dated IS NOT NULL AND su.person IS NOT NULL AND
vi.id=su.taken AND su.person IS NOT NULL AND su.person=pe.id
ORDER BY vi.dated;

Key points


Use JOIN to combine data from two tables.
Use table.field notation to refer to fields when doing joins.
Every fact should be represented in a database exactly once.
A join produces all combinations of records from one table with records from another.
A primary key is a field (or set of fields) whose values uniquely identify the records in a table.
A foreign key is a field (or set of fields) in one table whose values are a primary key in another table.
We can eliminate meaningless combinations of records by matching primary keys and foreign keys between tables.
The most common join condition is matching keys.

9. Data Hygiene


Teaching: 15 minutes,
Exercise: 15 minutes.

Objectives


Explain what an atomic value is.
Distinguish between atomic and non-atomic values.
Explain why every value in a database should be atomic.
Explain what a primary key is and why every record should have one.
Identify primary keys in database tables.
Explain why database entries should not contain redundant information.
Identify redundant information in databases.

Introduction

Some rules:

Every value should be atomic, i.e., not contain parts that we might want to work with separately - e.g. personal and family names are in separate columns instead of putting all in one column so that we don’t have to use substring operations to get the name’s components.
Every record should have a unique primary key. This can be a serial number that has no intrinsic meaning, one of the values in the record (like the id field in the Person table), or even a combination of values: the triple (taken, person, quant) from the Survey table uniquely identifies every measurement.
There should be no redundant information. Want to minimise the number of places where we have to make changes, i.e. do not replicate the same data.
The units for every value should be stored explicitly. The database we are using this does not do this - we do not know if everyone is using the same units.

Challenge 9.1


Which of the following are atomic values? Which are not? Why?

New Zealand
87 Turing Avenue
January 25, 1971
the XY coordinate (0.5, 3.3)


What is the primary key in this table? I.e., what value or combination of values uniquely identifies a record?


latitude
longitude
date
temperature


57.3
-22.5
2015-01-09
-14.2


Key points


Every value in a database should be atomic.
Every record should have a unique primary key.
A database should not contain redundant information.
Units and similar metadata should be stored with the data.

10. Creating and Modifying Data


Teaching: 20 minutes,
Exercise: 20 minutes.

Objectives


Write statements that creates tables.
Write statements to insert, modify, and delete records.

Introduction

Look at modifying and creating data. When using .schema command we get the info on how to create tables:
-- Create a new table
CREATE TABLE Survey2(taken integer, person text, quant real, reading real);

-- Check
.tables

We can remove a table by using (CAREFUL):
-- Delete the table
DROP TABLE Survey2;

-- Check
.tables

When creating tables you need to specify data types (for the type of data the columns are going to hold), most databases will support:

INTEGER a signed integer
REAL a floating point number
TEXT a character string
BLOB a “binary large object”, such as an image

Most databases also support:

Booleans- SQLite uses 0/1.
date/time - SQLite uses a string in a specific format.

Some support latitude and longitude as intrinsic data types, etc.
Can specify constraints:
CREATE TABLE Survey2(
    taken   integer not null, -- where reading taken
    person  text,             -- may not know who took it
    quant   real not null,    -- the quantity measured
    reading real not null,    -- the actual reading
    primary key(taken, quant),
    foreign key(taken) references Visited(id),
    foreign key(person) references Person(id)
);

Some of these constraints and their formulation are DBMS specific.
Once tables have been created, we can:

add records using INSERT
change records using UPDATE
remove records using DELETE

-- Create a temporary table
CREATE TABLE Site2(name text, lat real, long real);

-- Check
.tables

-- Insert some values
-- General: Syntax INSERT INTO TableName(Cols,...) VALUES(x,...)
INSERT INTO Site2 values('DR-1', -49.85, -128.57);
INSERT INTO Site2 values('DR-3', -47.15, -126.72);
INSERT INTO Site2 values('MSK-4', -48.87, -123.40);

-- Check
SELECT * FROM Site2;

-- or you could populate from an existing source
INSERT INTO Site2 SELECT name, lat, long FROM Site;

Modifying existing records uses the UPDATE command.
-- Check state
SELECT * FROM Site2;

-- Make a correction
UPDATE Site2 SET lat=-47.87, long=-122.40 WHERE name='MSK-4';

-- Check change
SELECT * FROM Site2;

To remove records you use DELETE.
-- Create a temporary table
CREATE TABLE Person2 (id text, personal text, family text);

-- Populate it
INSERT INTO Person2 SELECT id, personal, family FROM Person;

-- Check
SELECT * FROM Person2;

-- Now
SELECT * FROM Survey; -- no danforth

-- Delete entry
DELETE FROM Person2 WHERE id = 'danforth';

-- Check
SELECT * FROM Person2;

However, if you had removed lake instead this would have left a whole bunch of dangling pointers. We have
broken the referential integrity (all references between tables can always be resolved correctly) of our database.
There is such a thing as a cascading deletes that removes the associated references (and beyond the scope of this lesson). Can also have hybrid models where some of the data is the database but other data, e.g. files, are located outwith the databases.
Challenge 10.1


Write an SQL statement to replace all uses of null in Survey.person with the string 'unknown'.

UPDATE Survey SET person="unknown" WHERE person IS NULL;


One of our colleagues has sent us a CSV file containing temperature readings by Robert Olmstead, which is formatted like this:

Taken,Temp
619,-21.5
622,-15.5


Write the SQL INSERT statements that would add this data to the surveys database.
Write a small R/bash/Python program that reads this file in and prints out the SQL INSERT statements needed to add these records to the survey database.

Note: you will need to add an entry for Olmstead to the Person table. If you are testing your program repeatedly, you may want to investigate SQL’s INSERT or REPLACE command.
So you want output:
-- Need a new person
INSERT INTO Person VALUES("olmstead","Robert","Olmstead");

-- Insert The values into the Survey table
INSERT INTO Survey(taken,person,quant,reading) VALUES(619,"olmstad","temp",-21.5);
-- or
INSERT INTO Survey VALUES(622,"olmstad","temp",-15.5);

bash script
Call this samp.sh:
# Thanks to question 4286469 on stackoverflow
read # This skips the frist line 31911179
while IFS=, read col1 col2
do
    echo "INSERT INTO Survey VALUES($col1,\"olmstad\",\"temp\",$col2);"
done

Assuming the csv file is called samp.csv run it using:
bash samp.sh <samp.csv

R script
Can use the follwing script. It must be in the same directory as the samp.csv file.
# Read the csv file
reading <- read.csv("./samp.csv",stringsAsFactors = FALSE)

printInsert <- function(x){
  cat(paste0("INSERT INTO Survey VALUES(",x[1],',"olmstad","temp",',x[2],");","\n"))
}

# Now apply the function to each row
#apply(reading,1,printInsert) # Get an extra NULL

# Loop over the rows
for (i in 1:nrow(reading)){
  cat(paste0("INSERT INTO Survey VALUES(",reading[1],',"olmstad","temp",',reading[2],");","\n"))
}

Run it using:
Rscript samp.R

assuming that Rscript is in your path.


SQLite has several administrative commands that aren’t part of the SQL standard. One of these are:

.dump, which prints the SQL commands needed to re-create the database.
.load, which reads a file created by .dump and restores the database.


A colleague of yours thinks that storing dump files (which are text) in version control is a good way to track and manage changes to the database. What are the pros and cons of this approach?

Hint: records aren’t stored in any particular order.
11. Programming with Databases - Python


Teaching: 20 minutes,
Exercise: 15 minutes.

Objectives


Write short programs that execute SQL queries.
Trace the execution of a program that contains an SQL query.
Explain why most database applications are written in a general-purpose language rather than in SQL.

Introduction

Look at how you can use Python to access a database:
import sqlite3

connection = sqlite3.connect("survey.db")   # Will usually req host, username,pasword
cursor = connection.cursor()                # Keep track of where we are in the database
cursor.execute("SELECT Site.lat, Site.long FROM Site;")
results = cursor.fetchall()                 # Returns a list of tuples.
for r in results:
    print r
cursor.close()
connection.close()

Queries in real applications will often depend on values provided by users. For example, this function takes a user’s ID as a parameter and returns their name:
import sqlite3

def get_name(database_file, person_id):
    query = "SELECT personal || ' ' || family FROM Person WHERE id='" + person_id + "';"

    connection = sqlite3.connect(database_file)
    cursor = connection.cursor()
    cursor.execute(query)
    results = cursor.fetchall()
    cursor.close()
    connection.close()

    return results[0][0]
    
print "full name for dyer:", get_name('survey.db', 'dyer')

Need to be careful if you expect input from a user. They could specify their name as:
dyer'; DROP TABLE Survey; SELECT '

So you will execute:
SELECT personal || ' ' || family FROM Person WHERE id='dyer'; DROP TABLE Survey; SELECT '';

and drop one of the tables from your database. This is called an SQL injection attack.
Overcome this by using a prepared statement:
def get_name(database_file, person_id):
    query = "SELECT personal || ' ' || family FROM Person WHERE id=?;"

    connection = sqlite3.connect(database_file)
    cursor = connection.cursor()
    cursor.execute(query, [person_id])
    results = cursor.fetchall()
    cursor.close()
    connection.close()

    return results[0][0]

print "full name for dyer:", get_name('survey.db', 'dyer')

The key changes are in the query string and the execute call. Instead of formatting the query ourselves, we put question marks in the query template where we want to insert values. When we call execute, we provide a list that contains as many values as there are question marks in the query. The library matches values to question marks in order, and translates any special characters in the values into their escaped equivalents so that they are safe to use.
Challenge 11.1


Write a Python program that creates a new database in a file called original.db containing a single table called Pressure, with a single field called reading, and inserts 100,000 random numbers between 10.0 and 25.0. How long does it take this program to run? How long does it take to run a program that simply writes those random numbers to a file?


Write a Python program that creates a new database called backup.db with the same structure as original.db and copies all the values greater than 20.0 from original.db to backup.db. Which is faster: filtering values in the query, or reading everything into memory and filtering in Python?


Key points


General-purpose languages have libraries for accessing databases.
To connect to a database, a program must use a library specific to that database manager.
These libraries use a connection-and-cursor model.
Programs can read query results in batches or all at once.
Queries should be written using parameter substitution, not string formatting.

12. Programming with Databases - R


Teaching: 30 minutes,
Exercise: 15 minutes.

Objectives


Write short programs that execute SQL queries.
Trace the execution of a program that contains an SQL query.
Explain why most database applications are written in a general-purpose language rather than in SQL.

Introduction

A short R program to read from our SQLite database:
library(RSQLite)

connection <- dbConnect(SQLite(), "survey.db") # Establish connecion with data base
results    <- dbGetQuery(connection, "SELECT Site.lat, Site.long FROM Site;")
print(results)
dbDisconnect(connection)   # Once we are done close the connection.

Queries in real applications will often depend on values provided by users. For example, this function takes a user’s ID as a parameter and returns their name:
library(RSQLite)

connection <- dbConnect(SQLite(), "survey.db")

getName <- function(personID) {
  query <- paste0("SELECT personal family FROM Person WHERE id =='",personID, "';")
  return(dbGetQuery(connection, query))
}

print(paste("full name for dyer:", getName('dyer')))

dbDisconnect(connection)

When you run this you get:
full name for dyer: William Dyer

However, if this were asking a user and a user were to supply:
dyer'; DROP TABLE Survey; SELECT '

This would result in the query:
SELECT personal family FROM Person WHERE id='dyer'; DROP TABLE Survey; SELECT '';

You have lost a table!! This is called an SQL injection attack, and it has been used to attack thousands of programs over the years.
See XKCD 327. The safest way to deal with this threat is to replace characters like quotes with their escaped equivalents, so that we can safely put whatever the user gives us inside a string. We can do this by using a prepared statement instead of formatting our statements as strings.
library(RSQLite)
connection <- dbConnect(SQLite(), "survey.db")

getName <- function(personID) {
  query <- "SELECT personal family FROM Person WHERE id == ?"
  return(dbGetPreparedQuery(connection, query, data.frame(personID,,stringsAsFactors = FALSE)))
}

print(paste("full name for dyer:", getName('dyer')))

dbDisconnect(connection)

The key changes are in the query string and the dbGetQuery call (we use dbGetPreparedQuery instead). Instead of formatting the query ourselves, we put question marks in the query template where we want to insert values. When we call dbGetPreparedQuery, we provide a dataframe that contains as many values as there are question marks in the query. The library matches values to question marks in order, and translates any special characters in the values into their escaped equivalents so that they are safe to use.
Database helper functions in R

R’s database interface packages (like RSQLite) all share a common set of helper functions useful for exploring databases and reading/writing entire tables at once.
To view all tables in a database, we can use dbListTables():
connection <- dbConnect(SQLite(), "survey.db")
dbListTables(connection)

To view all column names of a table, use dbListFields():
dbListFields(connection, "Survey")

To read an entire table as a dataframe, use dbReadTable():
dbReadTable(connection, "Person")

Finally to write an entire table to a database, you can use dbWriteTable(). Note that we will always want to use the row.names = FALSE argument or R will write the row names as a separate column. In this example we will write R’s built-in iris dataset as a table in survey.db.
dbWriteTable(connection, "iris", iris, row.names = FALSE)
head(dbReadTable(connection, "iris"))

And as always, remember to close the database connection when done!
dbDisconnect(connection)

Challenge 12.1


Write an R program that creates a new database in a file called original.db containing a single table called Pressure, with a single field called reading, and inserts 100,000 random numbers between 10.0 and 25.0. How long does it take this program to run? How long does it take to run a program that simply writes those random numbers to a file?


Write an R program that creates a new database called backup.db with the same structure as original.db and copies all the values greater than 20.0 from original.db to backup.db. Which is faster:

filtering values in the query, or
reading everything into memory and filtering in R?


Key points


Data analysis languages have libraries for accessing databases.
To connect to a database, a program must use a library specific to that database manager.
R’s libraries can be used to directly query or read from a database.
Programs can read query results in batches or all at once.
Queries should be written using parameter substitution, not string formatting.
R has multiple helper functions to make working with databases easier.


This work is licensed under a 
Creative Commons Attribution 4.0 International License.