Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Save beckyconning/9a9bd9cb09c6cc6d193ae778e8122fb1 to your computer and use it in GitHub Desktop.
Save beckyconning/9a9bd9cb09c6cc6d193ae778e8122fb1 to your computer and use it in GitHub Desktop.

An intuitive, non-lossy and generalised way of combining information from disparate rows even on highly heterogenous data?

tl;dr:

what if when you had one or more tables a computer was like "hey some of these columns contain values that match up. want me to show you what its like when i put the rows that have one or more of these matching values together?"

an introductory example

the table below is made up of rows containing only sales data, rows containing only customer data and rows containing both.

 user_id | price  | id |          email
---------+--------+----+--------------------------
       1 | $30.00 |    |
       5 | $50.00 |    |
       7 | $20.00 |    |
         |        |  1 | hadil@example.com
         |        |  5 | saiid@example.com
         |        |  2 | fahir@example.com
       6 | $60.00 |  6 | oma@example.com
       8 | $40.00 |  8 | nasim@example.com
         |        |  8 | nasim.hassan@example.com
       9 | $40.00 |  9 | farah@example.com
       9 | $70.00 |    |
      10 | $80.00 |    | majid@example.com
         |        | 10 | majid.seif@example.com

notice how some of the values of the user id and id columns are equal despite being in different rows. what if we could bring the information where this is the case together simply by selecting these columns?

users could pick these columns and be shown the combined rows as well as the rows for which there are no other rows with matching values in any of the selected columns. they could then explicitly filter out any rows or columns they don't want.

 user_id | price  |       email       | price  | id |          email           
---------+--------+-------------------+--------+----+--------------------------
       1 | $30.00 |                   |        |  1 | hadil@example.com
       5 | $50.00 |                   |        |  5 | saiid@example.com
       6 | $60.00 | oma@example.com   | $60.00 |  6 | oma@example.com
       7 | $20.00 |                   |        |    | 
       8 | $40.00 | nasim@example.com | $40.00 |  8 | nasim@example.com
       8 | $40.00 | nasim@example.com |        |  8 | nasim.hassan@example.com
       9 | $40.00 | farah@example.com | $40.00 |  9 | farah@example.com
       9 | $70.00 |                   | $40.00 |  9 | farah@example.com
      10 | $80.00 | majid@example.com |        | 10 | majid.seif@example.com
         |        |                   |        |  2 | fahir@example.com 

they could do this without having to think about different types of joins. do so without having to deal with duplicated information from rows for which there are no other rows with matching values in any of the selected columns. do so without accidentally loosing information. they could do all of this simply by selecting columns.

users wouldn't need to guess which columns to select as in most cases the sets of columns for which there is at least one such combination possible could be shown faster than the complete results.

consider the tables above. do users want to loose the email they had for a customer at the time of sale just because the user changed their email? even if they do, do they always want to loose the information that they had that old email? it might be relevant for compliance. they might be required to remove it from their records.

why not inner joins

one solution we might consider is inner joins. however as an experience inner joins implicitly loose the information from the rows for which there are no other rows with matching values in the chosen columns.

why not outer joins

another solution we might consider is full outer joins. full outer joins contain the inner join as well as rows for the cross of both where each side is null.

this is ok for joins which reference different tables but for self joins they create a lot of rows that contain the same information as the rows that matched the search conditions. the experience proposed here doesn't require users to worry about whether its a self join or not. it also doesn't require users to be concerned with the tables that the columns they select are from.

specification

an engineering specification for this feature might be:

  1. the rows in the cartesian product of the specified tables for which the specified columns are equal
  2. the rows in the specified tables where the specified columns contain values not present in these columns in 1

prototype

here is a rough prototype of this in sql. i wonder what performance and scalability characteristics this same experience could take on if it were implemented differently.

postgres=# SELECT * FROM joins_example;
 user_id | price  | id |          email           
---------+--------+----+--------------------------
       1 | $30.00 |    | 
       5 | $50.00 |    | 
       7 | $20.00 |    | 
         |        |  1 | hadil@example.com
         |        |  5 | saiid@example.com
         |        |  2 | fahir@example.com
       6 | $60.00 |  6 | oma@example.com
       8 | $40.00 |  8 | nasim@example.com
         |        |  8 | nasim.hassan@example.com
       9 | $40.00 |  9 | farah@example.com
       9 | $70.00 |    | 
      10 | $80.00 |    | majid@example.com
         |        | 10 | majid.seif@example.com
(13 rows)

postgres=# SELECT * FROM joins_example AS a INNER JOIN joins_example AS b ON a.user_id = b.id;
 user_id | price  | id |       email       | user_id | price  | id |          email           
---------+--------+----+-------------------+---------+--------+----+--------------------------
       1 | $30.00 |    |                   |         |        |  1 | hadil@example.com
       5 | $50.00 |    |                   |         |        |  5 | saiid@example.com
       6 | $60.00 |  6 | oma@example.com   |       6 | $60.00 |  6 | oma@example.com
       8 | $40.00 |  8 | nasim@example.com |         |        |  8 | nasim.hassan@example.com
       8 | $40.00 |  8 | nasim@example.com |       8 | $40.00 |  8 | nasim@example.com
       9 | $40.00 |  9 | farah@example.com |       9 | $40.00 |  9 | farah@example.com
       9 | $70.00 |    |                   |       9 | $40.00 |  9 | farah@example.com
      10 | $80.00 |    | majid@example.com |         |        | 10 | majid.seif@example.com
(8 rows)

postgres=# SELECT                                                                                              
postgres-#   w.user_id,
postgres-#   w.price,
postgres-#   w.id,              
postgres-#   w.email,                  
postgres-#   NULL AS user_id,                                   
postgres-#   NULL AS price,                                    
postgres-#   NULL AS id,                                                                                              
postgres-#   NULL AS email
postgres-# FROM joins_example AS w
postgres-# LEFT JOIN (SELECT
postgres(#   x.user_id
postgres(# FROM joins_example AS x
postgres(# INNER JOIN joins_example AS y
postgres(#   ON x.user_id = y.id) AS z
postgres-#   ON z.user_id = w.user_id
postgres-#   OR z.user_id = w.id
postgres-# WHERE z.user_id IS NULL;
 user_id | price  | id |       email       | user_id | price | id | email 
---------+--------+----+-------------------+---------+-------+----+-------
       7 | $20.00 |    |                   |         |       |    | 
         |        |  2 | fahir@example.com |         |       |    | 
(2 rows)

postgres=# SELECT                      
  w.user_id,
  w.price,
  w.id,
  w.email,
  NULL AS user_id,
  NULL AS price,
  NULL AS id,
  NULL AS email
FROM joins_example AS w
LEFT JOIN (SELECT
  x.user_id
FROM joins_example AS x
INNER JOIN joins_example AS y
  ON x.user_id = y.id) AS z
  ON z.user_id = w.user_id
  OR z.user_id = w.id
WHERE z.user_id IS NULL
UNION
SELECT
  *
FROM joins_example a
INNER JOIN joins_example b
  ON a.user_id = b.id;
 user_id | price  | id |       email       | user_id | price  | id |          email           
---------+--------+----+-------------------+---------+--------+----+--------------------------
       1 | $30.00 |    |                   |         |        |  1 | hadil@example.com
       5 | $50.00 |    |                   |         |        |  5 | saiid@example.com
       6 | $60.00 |  6 | oma@example.com   |       6 | $60.00 |  6 | oma@example.com
       7 | $20.00 |    |                   |         |        |    | 
       8 | $40.00 |  8 | nasim@example.com |       8 | $40.00 |  8 | nasim@example.com
       8 | $40.00 |  8 | nasim@example.com |         |        |  8 | nasim.hassan@example.com
       9 | $40.00 |  9 | farah@example.com |       9 | $40.00 |  9 | farah@example.com
       9 | $70.00 |    |                   |       9 | $40.00 |  9 | farah@example.com
      10 | $80.00 |    | majid@example.com |         |        | 10 | majid.seif@example.com
         |        |  2 | fahir@example.com |         |        |    | 
(10 rows)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment