- Challenge 1.1
- Challenge 2.1
- Challenge 3.1
- Challenge 4.1
- Challenge 5.1
- Challenge 6.1
- Challenge 7.1
- Challenge 8.1
- Challenge 9.1
- Challenge 10.1
- Challenge 11.1
- Recap of 1.1
In an increasingly open research environment there are a growing number of databases that can contribute to researcher’s daily tasks. SQL enables communication with some of these databases. However, researchers using these databases have a responsibility to ensure that they comply with ethical standards before using them.
-
What are key issues to look out for when starting to use a new database or contributing to an existing database?
- Format – is it accessible to users from other countries and language groups
- Credit - are contributors credited when necessary
- Commercial - is the database for profit? If so, why?
- Ethical - does the metadata detail ethical commitments (when necessary)?
- Code of Standards - does the database have a code of standards and operating procedures?
- Accessibility - are the designers and curators of the database accessible for questioning?
- Comprehensibe - is the metadata accessible and intelligible?
- Any other concerns?
We shall come back to these questions at the end of the SQL Lessons.
- Write a query that selects only site names from the Site table.
-
Write a query that selects distinct dates from the
Visited
table. -
Write a query that displays the full names of the scientists in the
Person
table, ordered by family name.
- Fix the following query - suppose we want to select all sites that lie more than 48 degrees from the poles. Our first query is:
SELECT * FROM Site WHERE (lat > -48) OR (lat < 48);
Explain why this is wrong, and rewrite the query so that it is correct.
-
Normalized salinity readings are supposed to be between 0.0 and 1.0. Write a query that selects all records from
Survey
with salinity values outside this range. -
Which of the following are true:
- 'a' LIKE 'a'
- 'a' LIKE '%a'
- 'beta' LIKE '%a'
- 'alpha' LIKE 'a%%'
- 'alpha' LIKE 'a%p%'
-
Write a query that returns all of salinity measurements from the
Survey
table with the values divided by 100. -
The
UNION
operator combines the results of two queries:
SELECT * FROM Person WHERE id='dyer' UNION SELECT * FROM Person WHERE id='roe';
Use UNION
to create a consolidated list of salinity measurements in which Roerich’s, and only Roerich’s, have been corrected as described in the previous challenge. The output should be something like:
taken reading
619 0.13
622 0.09
734 0.05
751 0.1
752 0.09
752 0.416
837 0.21
837 0.225
- The site identifiers in the Visited table have two parts separated by a
-
, i.e.:
SELECT DISTINCT site FROM Visited;
site
----------
DR-1
DR-3
MSK-4
Some site identifiers are two letters long and some are three.
- The “in string” function
instr(X, Y)
returns:- the index of the first occurrence of string Y in string X, or
- 0 if Y does not exist in X.
- The substring function
substr(X, I, [L])
returns:- the substring of X starting at index I, with an optional length L.
Use these two functions to produce a list of unique major site identifiers. (For this data, the list should contain only “DR” and “MSK”).
-
Write a query that sorts the records in
Visited
by date, omitting entries for which the date is not known (i.e., is null). -
What do you expect the query:
SELECT * FROM Visited WHERE dated IN ('1927-02-08', NULL);
to produce? What does it actually produce?
- Some database designers prefer to use a [sentinel value] to mark missing data rather than null.
For example, they will use the date “0000-00-00” to mark a missing date, or -1.0 to mark a missing salinity or radiation reading (since actual readings cannot be negative). What does this simplify? What burdens or risks does it introduce?
-
How many temperature readings did Frank Pabodie record, and what was their average value?
-
The average of a set of values is the sum of the values divided by the number of values.
Does this mean that the avg function returns 2.0 or 3.0 when given the values 1.0, null, and 5.0? -
Want to calculate the difference between each individual radiation reading and the average of all the radiation readings. We write the query:
SELECT reading - avg(reading) FROM Survey WHERE quant='rad';
What does this actually produce, and why?
- The function
group_concat(field, separator)
concatenates all the values in a field using the specified separator character (or ‘,’ if the separator isn’t specified). Use this to produce a one-line list of scientists’ names, such as:
William Dyer, Frank Pabodie, Anderson Lake, Valentina Roerich, Frank Danforth
Can you find a way to order the list by surname?
-
Write a query that lists all radiation readings from the DR-1 site.
-
Write a query that lists all sites visited by people named “Frank”.
-
Describe in your own words what the following query produces:
SELECT Site.name FROM Site JOIN Visited
ON Site.lat<-49.0 AND Site.name=Visited.site AND Visited.dated>='1932-01-01';
- Write a query that shows:
- each site with exact location (lat, long) ordered by visited date,
- followed by personal name and family name of the person who visited the site and
- the type of measurement taken and its reading.
Please avoid all null values.
Tip: you should get 15 records with 8 fields.
You may find the following image useful:
-
Which of the following are atomic values? Which are not? Why?
- New Zealand
- 87 Turing Avenue
- January 25, 1971
- the XY coordinate (0.5, 3.3)
-
What is the primary key in this table? I.e., what value or combination of values uniquely identifies a record?
latitude | longitude | date | temperature |
---|---|---|---|
57.3 | -22.5 | 2015-01-09 | -14.2 |
- Write an SQL statement to replace all uses of null in Survey.person with the string 'unknown'.
- One of our colleagues has sent us a CSV file containing temperature readings by Robert Olmstead, which is formatted like this:
Taken,Temp
619,-21.5
622,-15.5
Write a small R/bash/Python program that reads this file in and prints out the SQL INSERT
statements needed to add these records to the survey database.
Note: you will need to add an entry for Olmstead to the Person
table. If you are testing your program repeatedly, you may want to investigate SQL’s INSERT
or REPLACE
command.
-
Write a Python program that creates a new database in a file called
original.db
containing a single table calledPressure
, with a single field calledreading
, and inserts 100,000 random numbers between 10.0 and 25.0. How long does it take this program to run? How long does it take to run a program that simply writes those random numbers to a file? -
Write a Python program that creates a new database called
backup.db
with the same structure asoriginal.db
and copies all the values greater than 20.0 fromoriginal.db
tobackup.db
. Which is faster: filtering values in the query, or reading everything into memory and filtering in Python?
-
Write an R program that creates a new database in a file called
original.db
containing a single table calledPressure
, with a single field called reading, and inserts 100,000 random numbers between 10.0 and 25.0. How long does it take this program to run? How long does it take to run a program that simply writes those random numbers to a file? -
Write an R program that creates a new database called
backup.db
with the same structure asoriginal.db
and copies all the values greater than 20.0 fromoriginal.db
tobackup.db
. Which is faster:- filtering values in the query, or
- reading everything into memory and filtering in R?
Review the questions in Challenge 1.1.
This work is licensed under a Creative Commons Attribution 4.0 International License.