lastobelus/mj_sql_notes.md

## mj_sql_notes.md

      
    Raw
  

              mj_sql_notes.md
            
          
    Tools

https://github.com/kripken/sql.js
can load from sqllite dbs

data is everywhere -- how is is it stored?
most popular: relational database

Relational Database


stores each kind of data in a table


kind of like storing data in a spreadsheet, where a row represents an item of data and columns store properties of that item.


data items in a particular table all have the same set of properties


relational databases make it easy to form relationships between tables


they do this by identifying which set of properties can uniquely identify each row, and using this as a key for the item, like the keys in a hash table.


in web application frameworks, the key is usually abstracted to an integer id field, and relational databases have tools to create those for us and ensure that it is always unique even under heavy concurrent use


using an integer for the primary key solves a lot of issues where you think you have isolated fields that uniquely identify rows or data items, but then end up having collisions, for example using first_name+last_name in a people table would eventually run into the problem of two people with the same name.


for example, if we had a students table and a courses table, we could form a relationship between them with a   student_courses table that has a row for each course each student is taking, with two properties, the user primary key (user_id)  and the student primary key (student_id)


what would happen if we just put all a students courses in their row in the users table? what properties would we use? how many properties would we need? What would happen if a course name changed?


duplication in property names (course_1_name, course_2_name) is a sign of what is called denormalized data. Another sign is property names that contain the name of another table, or just properties that contain data that can be found in another table.


sometimes later in the lifecycle of an application we end up needing to deliberately denormalize some data for performance, but it is generally always best to start off with a normalized data structure. Abstract serialized ids are actually an example of denormalization that improves both performance and makes development easier, so denormalization isn't necessarily always evil


SQL


structured query language
Basics:
CREATE TABLE groceries (id INTEGER, name TEXT, quantity INTEGER, aisle INTEGER);
INSERT INTO groceries VALUES (1, "Bananas", 4);
INSERT INTO groceries VALUES (1, "Peanut Butter", 1);
INSERT INTO groceries VALUES (1, "Chocolate Bars", 2);

Select

SELECT [NAME] FROM groceries;
SELECT * FROM groceries;
SELECT * FROM groceries ORDER by aisle;
SELECT * FROM groceries where aisle > 5 ORDER by aisle;
Aggregate Functions

SELECT SUM(quanity) FROM groceries;
SELECT SUM(quanity) FROM groceries GROUP BY aisle;
SELECT aisle, SUM(quanity) FROM groceries GROUP BY aisle;
behind the scenes: sql engine first does group by aisle, then summed quantity in each group, and made a row for each group with its SUM(quantity)
AND/OR operator

CREATE TABLE exercise_logs
(id INTEGER PRIMARY KEY AUTOINCREMENT,
type TEXT,
minutes INTEGER,
calories INTEGER,
heart_rate INTEGER);
INSERT INTO exercise_logs(type, minutes, calories, heart_rate) VALUES ("biking", 30, 100, 110)
SELECT * FROM exercise_logs WHERE calories > 50 AND minutes < 30;
SELECT * FROM exercise_logs WHERE calories > 50 OR heart_rate > 100;
IN

SELECT * FROM exercise_logs WHERE type IN ("biking", "hiking", "tree climbing", "rowing");
SELECT * FROM exercise_logs WHERE type NOT IN ("biking", "hiking", "tree climbing", "rowing");

so far, like chaining OR statements

CREATE TABLE drs_favourites
(id INTEGER PRIMARY KEY AUTOINCREMENT,
type TEXT,
reason TEXT);
SELECT type FROM drs_favourites;
SELECT * FROM exercise_logs WHERE type IN (...);

doesn't stay up to date if drs_favourites changes
use the IN operator directly with the results of a SELECT query:
SELECT * FROM exercise_logs WHERE type IN (SELECT type FROM drs_favourites);
called subqueries
can be more complex
SELECT * FROM exercise_logs WHERE type IN (
SELECT type FROM drs_favourites WHERE reason = "Increases caridiovascular health.");
what if we forgot period?
SQL gives us a way to do inexact matches with LIKE
SELECT * FROM exercise_logs WHERE type IN (
SELECT type FROM drs_favourites WHERE reason LIKE "%cardiovascular%");

SELECT type, SUM(calories) FROM exercise_logs GROUP BY type;


we can give the column a name
SELECT type, SUM(calories) AS total_calories FROM exercise_logs GROUP BY type;


makes queries easier to read, but also has another use that we will use later


how would we filter the result to only show activities where I've burned more than 150 calories total across all times doing that exercise?


First Cut
SELECT type, SUM(calories) AS total_calories FROM exercise_logs
WHERE calories > 150
GROUP BY type;
-- doesn't work, because only finds single rows where calories > 150, not the sum
HAVING

SELECT type, SUM(calories) AS total_calories FROM exercise_logs
GROUP BY type
HAVING total_calories > 150;


HAVING applies condition to the aggregate table
SELECT type, AVG(calories) AS avg_calories FROM exercise_logs
GROUP BY type
HAVING avg_calories > 100


how would we see all the exercises we did at least twice?


we can also use aggregate functions in having
SELECT type FROM exercise_logs
GROUP BY type
HAVING COUNT(*) >= 2;


more queries


can use math in conditions
SELECT COUNT(*) FROM exercise_logs WHERE heart_rate > 220 - 30;

SELECT COUNT(*) FROM exercise_logs WHERE
heart_rate > ROUND(0.5 * (220 - 30))
AND heart_rate < ROUND(0.9 * (220 - 30));
CASE statement -- create virtual columns

start with
SELECT type, heart_rate FROM exercise_logs;
add case
SELECT type, heart_rate,
CASE
WHEN heart_rate > 220-30 THEN "above max"
WHEN heart_rate > ROUND(0.9 * (220 - 30)) THEN "above target"
WHEN heart_rate > ROUND(0.5 * (220 - 30)) THEN "within target"
ELSE "below target"
END as "hr_zone"
FROM exercise_logs;

use virtual columns in GROUP BY
SELECT COUNT(*),
CASE
WHEN heart_rate > 220-30 THEN "above max"
WHEN heart_rate > ROUND(0.9 * (220 - 30)) THEN "above target"
WHEN heart_rate > ROUND(0.5 * (220 - 30)) THEN "within target"
ELSE "below target"
END as "hr_zone"
FROM exercise_logs
GROUP BY hr_zone;

JOINS

CREATE TABLE students
(id INTEGER PRIMARY KEY AUTOINCREMENT,
first_name TEXT,
last_name TEXT,
email TEXT,
phone TEXT,
birthdate TEXT);
CREATE TABLE student_grades
(id INTEGER PRIMARY KEY AUTOINCREMENT,
student_id INTEGER,
test TEXT,
grade INTEGER);
SELECT * from student_grades;

would like to see the name etc. in there
student_id relates student_grades to students

many joins

Cross Join


if you have a math background, cross join is the cartesian product. If you don't have a math background, forget I said that
like multiplying the tables, so there will be a row for every combination of rows (all possible row tuples)
usually not very useful, but forms the foundation for understanding the others


SELECT * from student_grades, students;
(there is an explicit syntax too:
SELECT * from student_grades CROSS JOIN students)
Inner Join


join tables based on a predicate (condition)
it's called a predicate because of SQL's mathematics heritage: Set Theory & Predicate Logic. You don't have to have a math background to get good at SQL, but it definitely helps if you enjoy solving logic puzzles.
each row tuple  must satisfy the condition
take the cross join, and eliminate all the rows that don't satisfy the predicate

SELECT * FROM students
INNER JOIN student_grades
ON students.id = student_grades.student_id;

ON is the predicate, looks the same as a WHERE condition
the most common type of predicate is using = to relate two id columns, but you can use any conditional logic, and do some pretty magical stuff.
inner joins are how the most common relationships are implemented: belongs_to, has_one, has_many, and (using a join table) has_and_belongs_to_many, and when we work with these relationships in Rails we won't usually have to think about or use the SQL joins at all
INNER keyword is optional

PLEASE DONT DO THIS:
there is an implicit notation, deprecated a LONG time ago. The only reason we'll show it is that people still do it because the DBs have never stopped supporting it.
SELECT *
FROM students, student_grades
WHERE student.id = student_grades.student_id;
this actually confused me a lot back in the day
We can use our other tools in conjunction with joins


example: find all the student grades > 90
SELECT first_name, last_name, email, test, grade
FROM students
INNER JOIN student_grades
ON students.id = student_grades.student_id
WHERE grade > 90


what to do with duplicate column names? for example if student had a grade column for overall grade.


prefix columns with table name:
SELECT students.first_name, students.last_name, students.email, students.test, student_grades.grade
FROM students
INNER JOIN student_grades
ON students.id = student_grades.student_id
WHERE grade > 90


Outer Joins

CREATE TABLE student_projects
(id INTEGER PRIMARY KEY AUTOINCREMENT,
student_id INTEGER,
title TEXT);
SELECT students.first_name, students.last_name, student_projects.title
FROM students
INNER JOIN student_projects
ON students.id = student_projects.student_id


only returns rows for students with projects


what if we want a list for all students, even if they don't have a project yet?


Outer Join


SELECT students.first_name, students.last_name, student_projects.title
FROM students
LEFT JOIN student_projects
ON students.id = student_projects.student_id

returns a row for every student (the left table) with null in the columns frrom student_projects if there was not a student_projects row matching the predicate
RIGHT OUTER JOIN is the same but returns a row for every row in the right table. Can use LEFT OUTER JOIN and switch the table order
FULL OUTER JOIN returns a row for every row in both tables. Not available in all systems, but systems that don't have it have the UNION operator that you can use to combine a left & right outer join.

Self Join

ALTER TABLE students ADD buddy_id INTEGER;
add some buddy_ids, pointing at other students
SELECT id, first_name, last_name, buddies.email as buddy_email
FROM student s
JOIN students buddies
ON students.buddy_id = buddies.id
Using joins together

CREATE TABLE projec_pairs
(id INTEGER PRIMARY KEY AUTOINCREMENT,
project1_id INTEGER,
project2_id INTEGER)
people who did project 1 should review the projects of people who did project 2
SELECT * from project_pairs;

not readable, just a list of numbers

SELECT * FROM project_pairs
JOIN student_projects;

too many rows, only see name once

SELECT * FROM project_pairs
JOIN student_projects a
JOIN student_projects b;
too many rows, but we see both names of the pairs. We need ON clauses
SELECT a.title AS project1, b.title AS project2 FROM project_pairs
JOIN student_projects a
ON project_pairs.project1_id = a.id  // join project_pairs with student_projects on the project1_id, so we get only one row per project1
JOIN student_projects b
ON project_pairs.project_2_id = b.id; // join project_pairs with student_projects again on the project2_id, so we get only one row per [project1_id, project2_id] tuple
UPDATE

UPDATE student_grades SET grade = 95
WHERE student_id = 3
AND test = "Quiz 1"

what happens if we forget the AND?
the great pitfall of live SQL. There's no way back (unless you wrapped in a transaction and can rollback)
so do a count first: SELECT COUNT(*) FROM student_grades WHERE...

DELETE

DROP TABLE