Intro to SQL
This guide was made for the hands-on SQL workshop at IRE 2016 in New Orleans. It uses bridge inspection data from the Federal Highway Administration's Bridge Inventory Database. You can download the Louisiana data here, courtesy of IRE.
The rest of the hands-on SQL class material is located here: http://bit.ly/hands_on_sql_ire2016
Selecting your data
The two most basic SQL commands are
FROM. For every SQL query, you will always need to state which columns you want and the table where those columns are located.
Select all columns from the
bridge15_la table. The
* is a wildcard character, which means everything in this context.
SELECT * FROM bridge15_la;
Now let's select specific columns from the
bridge15_la table. These columns are the bridge status, sufficiency rating number, feature, structure, average daily traffic, year the bridge was built and the most recent inspection date. The table includes many more columns, but these are the columns that we will focus our attention on.
SELECT stat, suffrtno, feature, strcture, avdayno, year, inspdate FROM bridge15_la;
If the column names aren't clear or are difficult to remember, you can rename the column headers to make your output easier to understand. Use the
AS command to rename a column.
SELECT feature, stat AS status, year AS year_built, avdayno AS average_daily_traffic, suffrtno AS sufficiency_rating_number, strcture AS structure, inspdate AS most_recent_inspection_date FROM bridge15_la;
We'll stick with the spreadsheet's default column names, but remember this trick if you ever get bogged down in too many abbreviations.
Try searching for various columns to explore the data. You can rearrange the order of column names in the
SELECT statement to your liking.
Filtering the data
Now we're going to look at one of the most useful parts of SQL. The
WHERE command lets you filter your data based on any number of criteria. It a row matches the given criteria, that row is returned.
For example, you could limit the rows to only those that are in Orleans Parish using the following query. The output will include all rows where the statement
cnty = '071' is true.
SELECT * FROM bridge15_la WHERE cnty = '071';
In addition to equals (
=), other common comparison operators include does not equal (
!=), greater than (
>), less than (
<), greater than or equal (
>=) and less than or equal (
<=). Of course, your data must be numeric in order to use mathematical operators such as
<. That wouldn't make much sense when comparing two words or phrases.
SELECT * FROM bridge15_la WHERE suffrtno < 50;
Perhaps you only know a part of the text that you are seeking. SQL offers a useful command that lets you search by pieces of text. The
% here acts as a wildcard character, meaning it can represent zero or more characters. In other words, any
feature cells starting with "I-10" will match this filter.
SELECT * FROM bridge15_la WHERE feature LIKE 'I-10%';
You could also use it to only show features that end in "I-10."
SELECT * FROM bridge15_la WHERE feature LIKE '%I-10';
And if you wanted to find "I-10" anywhere, you could use the wildcard
% at both the start and the end.
SELECT * FROM bridge15_la WHERE feature LIKE '%I-10%';
You might have noticed that other features refer to I-10 as "I10" or "I 10." To capture those rows as well, we can add additional filters using the
SELECT * FROM bridge15_la WHERE feature LIKE '%I-10%' OR feature LIKE '%I 10%' OR feature LIKE '%I10%';
As long as the
feature value matches at least one of those patterns, that row will be returned. With
OR, rows are returned as long as they match at least one of the filters.
According to the data's documentation, a bridge with
stat equal to 1 is structurally deficient and a bridge with
stat equal to 2 is functionally obsolete. Let's find those bridges using two filters combined with the
SELECT feature, stat, suffrtno FROM bridge15_la WHERE stat = '1' OR stat = '2';
You can accomplish the same query using the
IN operator, as shown below. This query says that as long as the
stat value is somewhere in that list of values, the row is a match and will be returned. I find this syntax a little bit easier to manage once you start searching for more than a few values. The downside to using
IN is that you can no longer make use of
LIKE, so only use
IN when you have full-text matches.
SELECT feature, stat, suffrtno FROM bridge15_la WHERE stat IN ('1', '2');
The data documentation also states that, of those bridges with
stat equal to 1 or 2, any bridge that also has a sufficiency rating (
suffrtno) less than 50 is eligible for replacement or rehabilitation. These are the worst of the worst bridges.
To check if a bridge is structurally deficient or functionally obsolete and has a low sufficiency rating, we can use the
SELECT feature, stat, suffrtno FROM bridge15_la WHERE (stat = '1' OR stat = '2') AND suffrtno < 50;
This means that a row will only be returned if the
stat field is either "1" or "2", while also having a sufficiency rating below 50. The
OR operator only requires one true value, but the
AND operator requires true values from all comparisons.
Notice the use of parentheses around the
OR operator in the above query. This groups the result of that comparison, which is then used in the
AND comparison. The parentheses help to stay organized.
Sorting your data
Since we're exploring the worst bridges, it might help to rank those bridges from worst to best. This is where the
ORDER BY operator helps. You can select the column which will determine the order of the rows.
This query orders the results in descending order, based on the
SELECT feature, stat, suffrtno FROM bridge15_la WHERE (stat = '1' OR stat = '2') AND suffrtno < 50 ORDER BY suffrtno DESC;
The default setting for
ORDER BY is ascending order. In the above query, ascending order could be achieved by writing either
...ORDER BY suffrtno ASC or
...ORDER BY suffrtno.
LIMIT command forces your query to only return the specified number of rows. This is commonly used in conjunction with
ORDER BY to show a small set of ranked rows ("The 10 worst bridges in Louisiana").
This query orders the results in ascending order (the default order), based on the
suffrtno values, and only returns the first 10 values. Because we are sorting from low to high (bad to good) sufficiency rating numbers, and limiting the results to the first 10, this query returns the 10 worst bridges (at least according to these columns).
SELECT feature, stat, suffrtno FROM bridge15_la WHERE (stat = '1' OR stat = '2') AND suffrtno < 50 ORDER BY suffrtno LIMIT 10;
SQL offers built-in functions to perform basic calculations on your data.
AVG are some common ones. You can read more about them here: https://www.sqlite.org/lang_aggfunc.html
Return the number of rows matching your query. This is especially useful when combined with
WHERE statements to understand how many rows match your filters.
SELECT COUNT(*) FROM bridge15_la; WHERE cnty = '071';
Return the average value for the column specified.
SELECT AVG(suffrtno) FROM bridge15_la;
Return the greatest value for the column specified.
SELECT MAX(suffrtno) FROM bridge15_la;
Return the smallest value for the column specified.
SELECT MIN(suffrtno) FROM bridge15_la;
A few more notes
The capitalization of SQL syntax words is not necessary, but helps to differentiate between SQL commands and other information. I find it easier to scan this way too. This is also why I include the new lines for each successive SQL command. They make reading easier but are not necessary.
The semicolon at the end of each command is not required by all SQL software, but it is by many so it's a good habit to get into.
As your queries grow more and more complex, it might help to write comments within your SQL code to note what a particular line does or explain why you are writing a query in the first place. Your future self with be grateful when you revisit your code.
If you are familiar with other programming languages, then you are probably familiar with the idea of comments in your code. These are lines that are not executed and only exist for people reading the code.
In SQL, you can write comments in two ways. For a single line, you can use two hyphens (
--) to begin your comment. For example:
SELECT stat, suffrtno -- suffrtno stands for "sufficiency rating number" FROM bridge15_la;
Another more flexible way to write comments is using the
/* Comments here. */ syntax. These can be used for a single line or multiple lines. For example:
/* Everything inside here is a comment and won't be executed in the SQL query. This query accomplishes two things: - Filters the data to only include bridges labeled "structurally deficient" or "functionally obsolete." - Filters the data even further to only include the remaining bridges that also have sufficiency ratings below 50. */ SELECT feature, stat, suffrtno -- suffrtno is the sufficiency rating number FROM bridge15_la WHERE (stat = '1' OR stat = '2') AND suffrtno < 50; /* These are some bad bridges. */
Stay aware of the different types of data in your tables. Common types include integers (whole numbers), floats (numbers with decimals), booleans (True or False), text and dates.
This is very important when you have data containing a leading zero (e.g. zip code 07712). If you were to convert that to an integer (7,712), it would lose its meaning. Conversely, you should make sure numeric data is stored as numbers and not text so that you can make use of mathematical operators such as
Differences in SQL syntaxes
The various flavors of SQL (SQLite, MySQL, PostgreSQL, SQL Server, etc.) all have slightly different syntaxes, but they are mostly the same when it comes to basic usage. This can be annoying when switching between the SQL languages, but the good news is that they all have been around for decades. That means most syntax fixes are well-documented and only a quick Google search away.
One confusing point with SQL and programming languages in general is the idea of
NULL. In databases, you can declare whether or not a column allows
NULL entries, meaning whether or not they can lack any values. This is a subtle but significant difference between an empty value. An empty value means the emptiness is reported, whereas a
NULL value means nothing is reported at all. It is the lack of anything.
You can filter based on whether a cell is
NULL or not using
IS NOT as the comparison operators, instead of
!=. Again, this is because
NULL is not really equal to anything; it's the absence of any value.
SELECT * FROM bridge15_la WHERE stat IS NULL;
Exporting your data
If you are using SQLite Manager, you can export your query results to a CSV file. Unfortunately, it doesn't seem to be possible to carry the column names with the table, which is extremely inconvenient makes the entire action not as useful. However, you could always export the data, open it in Excel and then reapply the column headers. This might be worthwhile when using SQLite Manager to pare down the amount of data you want to then use in Excel. Other SQL management software is much better about this.
Once you are comfortable with these commands, try using the
GROUP BY and
GROUP BY is a powerful command that lets you aggregate similar data and answer questions such as, "For each county, what is the worst bridge?" This is similar to Excel's Pivot Tables.
JOIN allows you to link different tables in your database, which is helpful when you are trying to link two separate data sets. For example, imagine you had one table
bridges that listed every bridge's inspection rating and you also have a table
school_bus_routes that has data on all school buses and the bridges on which they travel. You could perform all of the queries listed above on the
bridges table and then join those results on the
school_bus_routes table to see whether school buses drive over the worst-rated bridges. (They do.)