marioa/sql-questions.md

## sql-questions.md

      
    Raw
  

              sql-questions.md
            
          
    SQL Questions

Contents


Challenge 1.1
Challenge 2.1
Challenge 3.1
Challenge 4.1
Challenge 5.1
Challenge 6.1
Challenge 7.1
Challenge 8.1
Challenge 9.1
Challenge 10.1
Challenge 11.1
Recap of 1.1

Challenge 1.1

In an increasingly open research environment there are a growing number of databases that can contribute to researcher’s daily tasks.  SQL enables communication with some of these databases.  However, researchers using these databases have a responsibility to ensure that they comply with ethical standards before using them.


What are key issues to look out for when starting to use a new database or contributing to an existing database?

Format – is it accessible to users from other countries and language groups
Credit - are contributors credited when necessary
Commercial - is the database for profit?  If so, why?
Ethical - does the metadata detail ethical commitments (when necessary)?
Code of Standards - does the database have a code of standards and operating procedures?
Accessibility - are the designers and curators of the database accessible for questioning?
Comprehensibe - is the metadata accessible and intelligible?
Any other concerns?


We shall come back to these questions at the end of the SQL Lessons.
Challenge 2.1


Write a query that selects only site names from the Site table.

Challenge 3.1


Write a query that selects distinct dates from the Visited table.


Write a query that displays the full names of the scientists in the Person table, ordered by family name.


Challenge 4.1


Fix the following query - suppose we want to select all sites that lie more than 48 degrees from the poles. Our first query is:

SELECT * FROM Site WHERE (lat > -48) OR (lat < 48);

Explain why this is wrong, and rewrite the query so that it is correct.


Normalized salinity readings are supposed to be between 0.0 and 1.0. Write a query that selects all records from Survey with salinity values outside this range.


Which of the following are true:

'a' LIKE 'a'
'a' LIKE '%a'
'beta' LIKE '%a'
'alpha' LIKE 'a%%'
'alpha' LIKE 'a%p%'


5.1 Challenge


Write a query that returns all of salinity measurements from the Survey table with the values divided by 100.


The UNION operator combines the results of two queries:


SELECT * FROM Person WHERE id='dyer' UNION SELECT * FROM Person WHERE id='roe';

Use UNION to create a consolidated list of salinity measurements in which Roerich’s, and only Roerich’s, have been corrected as described in the previous challenge. The output should be something like:
taken	reading
619	0.13
622	0.09
734	0.05
751	0.1
752	0.09
752	0.416
837	0.21
837	0.225


The site identifiers in the Visited table have two parts separated by a -, i.e.:

SELECT DISTINCT site FROM Visited;

site
----------
DR-1
DR-3
MSK-4

Some site identifiers are two letters long and some are three.

The “in string” function instr(X, Y) returns:

the index of the first occurrence of string Y in string X, or
0 if Y does not exist in X.


The substring function substr(X, I, [L]) returns:

the substring of X starting at index I, with an optional length L.


Use these two functions to produce a list of unique major site identifiers. (For this data, the list should contain only “DR” and “MSK”).
Challenge 6.1


Write a query that sorts the records in Visited by date, omitting entries for which the date is not known (i.e., is null).


What do you expect the query:


SELECT * FROM Visited WHERE dated IN ('1927-02-08', NULL);

to produce? What does it actually produce?

Some database designers prefer to use a [sentinel value] to mark missing data rather than null.

For example, they will use the date “0000-00-00” to mark a missing date, or -1.0 to mark a missing salinity or radiation reading (since actual readings cannot be negative). What does this simplify? What burdens or risks does it introduce?
Challenge 7.1


How many temperature readings did Frank Pabodie record, and what was their average value?


The average of a set of values is the sum of the values divided by the number of values.

Does this mean that the avg function returns 2.0 or 3.0 when given the values 1.0, null, and 5.0?


Want to calculate the difference between each individual radiation reading and the average of all the radiation readings. We write the query:


SELECT reading - avg(reading) FROM Survey WHERE quant='rad';

What does this actually produce, and why?

The function group_concat(field, separator) concatenates all the values in a field using the specified separator character (or ‘,’ if the separator isn’t specified). Use this to produce a one-line list of scientists’ names, such as:

William Dyer, Frank Pabodie, Anderson Lake, Valentina Roerich, Frank Danforth

Can you find a way to order the list by surname?
Challenge 8.1


Write a query that lists all radiation readings from the DR-1 site.


Write a query that lists all sites visited by people named “Frank”.


Describe in your own words what the following query produces:


SELECT Site.name FROM Site JOIN Visited
ON Site.lat<-49.0 AND Site.name=Visited.site AND Visited.dated>='1932-01-01';


Write a query that shows:

each site with exact location (lat, long) ordered by visited date,
followed by personal name and family name of the person who visited the site and
the type of measurement taken and its reading.


Please avoid all null values.

Tip: you should get 15 records with 8 fields.
You may find the following image useful:

Challenge 9.1


Which of the following are atomic values? Which are not? Why?

New Zealand
87 Turing Avenue
January 25, 1971
the XY coordinate (0.5, 3.3)


What is the primary key in this table? I.e., what value or combination of values uniquely identifies a record?


latitude
longitude
date
temperature


57.3
-22.5
2015-01-09
-14.2


Challenge 10.1


Write an SQL statement to replace all uses of null in Survey.person with the string 'unknown'.
One of our colleagues has sent us a CSV file containing temperature readings by Robert Olmstead, which is formatted like this:

Taken,Temp
619,-21.5
622,-15.5

Write a small R/bash/Python program that reads this file in and prints out the SQL INSERT statements needed to add these records to the survey database. 

Note: you will need to add an entry for Olmstead to the Person table. If you are testing your program repeatedly, you may want to investigate SQL’s INSERT or REPLACE command.
Challenge 11.1


Write a Python program that creates a new database in a file called original.db containing a single table called Pressure, with a single field called reading, and inserts 100,000 random numbers between 10.0 and 25.0. How long does it take this program to run? How long does it take to run a program that simply writes those random numbers to a file?


Write a Python program that creates a new database called backup.db with the same structure as original.db and copies all the values greater than 20.0 from original.db to backup.db. Which is faster: filtering values in the query, or reading everything into memory and filtering in Python?


Challenge 12.1


Write an R program that creates a new database in a file called original.db containing a single table called Pressure, with a single field called reading, and inserts 100,000 random numbers between 10.0 and 25.0. How long does it take this program to run? How long does it take to run a program that simply writes those random numbers to a file?


Write an R program that creates a new database called backup.db with the same structure as original.db and copies all the values greater than 20.0 from original.db to backup.db. Which is faster:

filtering values in the query, or
reading everything into memory and filtering in R?


Recap of 1.1

Review the questions in Challenge 1.1.


This work is licensed under a 
Creative Commons Attribution 4.0 International License.