Skip to content

Instantly share code, notes, and snippets.

@lastobelus
Created January 26, 2016 03:28
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save lastobelus/6f8e490e6c9a07839e71 to your computer and use it in GitHub Desktop.
Save lastobelus/6f8e490e6c9a07839e71 to your computer and use it in GitHub Desktop.

Tools

https://github.com/kripken/sql.js can load from sqllite dbs

  • data is everywhere -- how is is it stored?
  • most popular: relational database

Relational Database

  • stores each kind of data in a table

  • kind of like storing data in a spreadsheet, where a row represents an item of data and columns store properties of that item.

  • data items in a particular table all have the same set of properties

  • relational databases make it easy to form relationships between tables

  • they do this by identifying which set of properties can uniquely identify each row, and using this as a key for the item, like the keys in a hash table.

  • in web application frameworks, the key is usually abstracted to an integer id field, and relational databases have tools to create those for us and ensure that it is always unique even under heavy concurrent use

  • using an integer for the primary key solves a lot of issues where you think you have isolated fields that uniquely identify rows or data items, but then end up having collisions, for example using first_name+last_name in a people table would eventually run into the problem of two people with the same name.

  • for example, if we had a students table and a courses table, we could form a relationship between them with a student_courses table that has a row for each course each student is taking, with two properties, the user primary key (user_id) and the student primary key (student_id)

  • what would happen if we just put all a students courses in their row in the users table? what properties would we use? how many properties would we need? What would happen if a course name changed?

  • duplication in property names (course_1_name, course_2_name) is a sign of what is called denormalized data. Another sign is property names that contain the name of another table, or just properties that contain data that can be found in another table.

  • sometimes later in the lifecycle of an application we end up needing to deliberately denormalize some data for performance, but it is generally always best to start off with a normalized data structure. Abstract serialized ids are actually an example of denormalization that improves both performance and makes development easier, so denormalization isn't necessarily always evil

SQL

  • structured query language Basics: CREATE TABLE groceries (id INTEGER, name TEXT, quantity INTEGER, aisle INTEGER); INSERT INTO groceries VALUES (1, "Bananas", 4); INSERT INTO groceries VALUES (1, "Peanut Butter", 1); INSERT INTO groceries VALUES (1, "Chocolate Bars", 2);

Select

SELECT [NAME] FROM groceries; SELECT * FROM groceries; SELECT * FROM groceries ORDER by aisle; SELECT * FROM groceries where aisle > 5 ORDER by aisle;

Aggregate Functions

SELECT SUM(quanity) FROM groceries; SELECT SUM(quanity) FROM groceries GROUP BY aisle; SELECT aisle, SUM(quanity) FROM groceries GROUP BY aisle; behind the scenes: sql engine first does group by aisle, then summed quantity in each group, and made a row for each group with its SUM(quantity)

AND/OR operator

CREATE TABLE exercise_logs (id INTEGER PRIMARY KEY AUTOINCREMENT, type TEXT, minutes INTEGER, calories INTEGER, heart_rate INTEGER);

INSERT INTO exercise_logs(type, minutes, calories, heart_rate) VALUES ("biking", 30, 100, 110) SELECT * FROM exercise_logs WHERE calories > 50 AND minutes < 30; SELECT * FROM exercise_logs WHERE calories > 50 OR heart_rate > 100;

IN

SELECT * FROM exercise_logs WHERE type IN ("biking", "hiking", "tree climbing", "rowing"); SELECT * FROM exercise_logs WHERE type NOT IN ("biking", "hiking", "tree climbing", "rowing");

  • so far, like chaining OR statements

CREATE TABLE drs_favourites (id INTEGER PRIMARY KEY AUTOINCREMENT, type TEXT, reason TEXT);

SELECT type FROM drs_favourites; SELECT * FROM exercise_logs WHERE type IN (...);

  • doesn't stay up to date if drs_favourites changes
  • use the IN operator directly with the results of a SELECT query: SELECT * FROM exercise_logs WHERE type IN (SELECT type FROM drs_favourites);
  • called subqueries
  • can be more complex SELECT * FROM exercise_logs WHERE type IN ( SELECT type FROM drs_favourites WHERE reason = "Increases caridiovascular health.");
  • what if we forgot period?
  • SQL gives us a way to do inexact matches with LIKE SELECT * FROM exercise_logs WHERE type IN ( SELECT type FROM drs_favourites WHERE reason LIKE "%cardiovascular%");

SELECT type, SUM(calories) FROM exercise_logs GROUP BY type;

  • we can give the column a name SELECT type, SUM(calories) AS total_calories FROM exercise_logs GROUP BY type;

  • makes queries easier to read, but also has another use that we will use later

  • how would we filter the result to only show activities where I've burned more than 150 calories total across all times doing that exercise?

First Cut SELECT type, SUM(calories) AS total_calories FROM exercise_logs WHERE calories > 150 GROUP BY type; -- doesn't work, because only finds single rows where calories > 150, not the sum

HAVING

SELECT type, SUM(calories) AS total_calories FROM exercise_logs GROUP BY type HAVING total_calories > 150;

  • HAVING applies condition to the aggregate table SELECT type, AVG(calories) AS avg_calories FROM exercise_logs GROUP BY type HAVING avg_calories > 100

  • how would we see all the exercises we did at least twice?

  • we can also use aggregate functions in having SELECT type FROM exercise_logs GROUP BY type HAVING COUNT(*) >= 2;

more queries

  • can use math in conditions SELECT COUNT(*) FROM exercise_logs WHERE heart_rate > 220 - 30;

SELECT COUNT(*) FROM exercise_logs WHERE heart_rate > ROUND(0.5 * (220 - 30)) AND heart_rate < ROUND(0.9 * (220 - 30));

CASE statement -- create virtual columns

start with SELECT type, heart_rate FROM exercise_logs;

add case SELECT type, heart_rate, CASE WHEN heart_rate > 220-30 THEN "above max" WHEN heart_rate > ROUND(0.9 * (220 - 30)) THEN "above target" WHEN heart_rate > ROUND(0.5 * (220 - 30)) THEN "within target" ELSE "below target" END as "hr_zone" FROM exercise_logs;

  • use virtual columns in GROUP BY SELECT COUNT(*), CASE WHEN heart_rate > 220-30 THEN "above max" WHEN heart_rate > ROUND(0.9 * (220 - 30)) THEN "above target" WHEN heart_rate > ROUND(0.5 * (220 - 30)) THEN "within target" ELSE "below target" END as "hr_zone" FROM exercise_logs GROUP BY hr_zone;

JOINS

CREATE TABLE students (id INTEGER PRIMARY KEY AUTOINCREMENT, first_name TEXT, last_name TEXT, email TEXT, phone TEXT, birthdate TEXT);

CREATE TABLE student_grades (id INTEGER PRIMARY KEY AUTOINCREMENT, student_id INTEGER, test TEXT, grade INTEGER);

SELECT * from student_grades;

  • would like to see the name etc. in there
  • student_id relates student_grades to students

many joins

Cross Join

  • if you have a math background, cross join is the cartesian product. If you don't have a math background, forget I said that
  • like multiplying the tables, so there will be a row for every combination of rows (all possible row tuples)
  • usually not very useful, but forms the foundation for understanding the others

SELECT * from student_grades, students; (there is an explicit syntax too: SELECT * from student_grades CROSS JOIN students)

Inner Join

  • join tables based on a predicate (condition)
  • it's called a predicate because of SQL's mathematics heritage: Set Theory & Predicate Logic. You don't have to have a math background to get good at SQL, but it definitely helps if you enjoy solving logic puzzles.
  • each row tuple must satisfy the condition
  • take the cross join, and eliminate all the rows that don't satisfy the predicate

SELECT * FROM students INNER JOIN student_grades ON students.id = student_grades.student_id;

  • ON is the predicate, looks the same as a WHERE condition
  • the most common type of predicate is using = to relate two id columns, but you can use any conditional logic, and do some pretty magical stuff.
  • inner joins are how the most common relationships are implemented: belongs_to, has_one, has_many, and (using a join table) has_and_belongs_to_many, and when we work with these relationships in Rails we won't usually have to think about or use the SQL joins at all
  • INNER keyword is optional

PLEASE DONT DO THIS: there is an implicit notation, deprecated a LONG time ago. The only reason we'll show it is that people still do it because the DBs have never stopped supporting it.

SELECT * FROM students, student_grades WHERE student.id = student_grades.student_id;

this actually confused me a lot back in the day

We can use our other tools in conjunction with joins

  • example: find all the student grades > 90 SELECT first_name, last_name, email, test, grade FROM students INNER JOIN student_grades ON students.id = student_grades.student_id WHERE grade > 90

  • what to do with duplicate column names? for example if student had a grade column for overall grade.

  • prefix columns with table name: SELECT students.first_name, students.last_name, students.email, students.test, student_grades.grade FROM students INNER JOIN student_grades ON students.id = student_grades.student_id WHERE grade > 90

Outer Joins

CREATE TABLE student_projects (id INTEGER PRIMARY KEY AUTOINCREMENT, student_id INTEGER, title TEXT);

SELECT students.first_name, students.last_name, student_projects.title FROM students INNER JOIN student_projects ON students.id = student_projects.student_id

  • only returns rows for students with projects

  • what if we want a list for all students, even if they don't have a project yet?

  • Outer Join

SELECT students.first_name, students.last_name, student_projects.title FROM students LEFT JOIN student_projects ON students.id = student_projects.student_id

  • returns a row for every student (the left table) with null in the columns frrom student_projects if there was not a student_projects row matching the predicate
  • RIGHT OUTER JOIN is the same but returns a row for every row in the right table. Can use LEFT OUTER JOIN and switch the table order
  • FULL OUTER JOIN returns a row for every row in both tables. Not available in all systems, but systems that don't have it have the UNION operator that you can use to combine a left & right outer join.

Self Join

ALTER TABLE students ADD buddy_id INTEGER; add some buddy_ids, pointing at other students

SELECT id, first_name, last_name, buddies.email as buddy_email FROM student s JOIN students buddies ON students.buddy_id = buddies.id

Using joins together

CREATE TABLE projec_pairs (id INTEGER PRIMARY KEY AUTOINCREMENT, project1_id INTEGER, project2_id INTEGER)

people who did project 1 should review the projects of people who did project 2

SELECT * from project_pairs;

  • not readable, just a list of numbers

SELECT * FROM project_pairs JOIN student_projects;

  • too many rows, only see name once

SELECT * FROM project_pairs JOIN student_projects a JOIN student_projects b;

too many rows, but we see both names of the pairs. We need ON clauses

SELECT a.title AS project1, b.title AS project2 FROM project_pairs JOIN student_projects a ON project_pairs.project1_id = a.id // join project_pairs with student_projects on the project1_id, so we get only one row per project1 JOIN student_projects b ON project_pairs.project_2_id = b.id; // join project_pairs with student_projects again on the project2_id, so we get only one row per [project1_id, project2_id] tuple

UPDATE

UPDATE student_grades SET grade = 95 WHERE student_id = 3 AND test = "Quiz 1"

  • what happens if we forget the AND?
  • the great pitfall of live SQL. There's no way back (unless you wrapped in a transaction and can rollback)
  • so do a count first: SELECT COUNT(*) FROM student_grades WHERE...

DELETE

DROP TABLE

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment