Skip to content

Instantly share code, notes, and snippets.

@aepyornis
Last active August 29, 2015 14:17
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save aepyornis/83b62585e1a884ac8ad8 to your computer and use it in GitHub Desktop.
Save aepyornis/83b62585e1a884ac8ad8 to your computer and use it in GitHub Desktop.

A day in SQL HELL: or how I learned to like (or at least appreciate) SQL by ziggy

In short: the secret to who owns nyc will probably be found in a really complicated sql statement

Here's how this all started. I was working on my project to analyze heat/hot water complaints. I downloaded all the complaints for this heat season (there's over 200,00), geocoded them to obtain the bbls, and then imported them into postgres. I wanted to match complaints with their owners in order to find out whose responsible for not providing tenants' with essential services.

So I fired up PSQL and started typing out some queries.

SELECT bbl, 
       address, 
       Count(bbl) AS total_complaints 
    FROM complaints 
    GROUP BY bbl, 
          address 
    ORDER  BY total_complaints DESC; 

This query returns a sorted listed of buildings with most number of complaints:

bbl address total_complaints
1021740175 34 ARDEN STREET 1730
3017780038 554 DE KALB AVENUE 1049
3061207502 9511 SHORE ROAD 920
2032630007 3810 BAILEY AVENUE 748
2037140054 1025 BOYNTON AVENUE 538
2032920019 2968 PERRY AVENUE 400
2031960010 30 BUCHANAN PLACE 398
...

Although there are over 200,000 complaints this heat season, there are only 31,531 unique buildings.

Naturally, I wanted to know who were the people and companies behind these buildings. I have two ownership sources easily accessible: pluto and DOB permits. I brought those tables into my database. Now I have three tables:

  • complaints (~200,000)
  • pluto (~800,000)
  • dobjobs form years 2010-2015 (~400,000)

They all have one field in common - BBL - a unique identifier for NYC tax lots. I thought, "great! We just need to do a join. T his should be easy" ...but as it turns out....

    SELECT complaints.bbl AS bbl,
        complaints.address AS address,
        pluto.ownername AS OWNER,
        count(complaints.bbl) AS total_complaints
    FROM complaints,
        pluto
    WHERE complaints.bbl = pluto.bbl
    GROUP BY complaints.bbl,
        complaints.address,
        pluto.ownername
    ORDER BY total_complaints DESC;

This joins pluto to the complaints data so we can figure out the owner for each building. Note the GROUP BY part of the sql statement. If you want to sort by an Aggregate Expression (such as count) you must use GROUP BY and all the columns (except for count) must appear in the GROUP BY part. This results in:

bbl address owner total_complaints
1021740175 34 ARDEN STREET 24-32 ARDEN ASSOCIATE 1730
3017780038 554 DE KALB AVENUE ALFREDO MORALES 1049
3061207502 9511 SHORE ROAD THE 9511 SHORE RD CON 920
2032630007 3810 BAILEY AVENUE 3810 LLC 748
2037140054 1025 BOYNTON AVENUE 1025 BOYNTON AVENUERE 538
2032920019 2968 PERRY AVENUE 2968 PERRY LLC 400
2031960010 30 BUCHANAN PLACE UNIVERSITY AVENUE, L. 398
2024590034 888 GRAND CONCOURSE TINY FIESTA REALTY AS 394
...

Works perfectly! There are over 200 buildings with more than 50 complaints and 2 with over 1000! So let's try to join dobjobs now. Dobjobs contains both the owner's business and actual name, which is why I find it an appealing data source to use.

    SELECT complaints.bbl AS bbl,
       complaints.address AS address,
       dobjobs.ownername AS jobs_owner,
       dobjobs.ownerbusinessname AS jobs_business,
       count(complaints.bbl) AS total_complaints
    FROM complaints
    LEFT JOIN dobjobs ON complaints.bbl = dobjobs.bbl
    GROUP BY complaints.bbl,
         complaints.address,
         dobjobs.ownerbusinessname,
         dobjobs.ownername
    ORDER BY total_complaints DESC
bbl address jobs_owner jobs_business total_complaints
2039447501 2090 EAST TREMONT AVENUE CHARLES TUCCI PPC PROPERTY SERVICE LLC 27520
1009720001 18 STUYVESANT OVAL MARCO M MATTIA ROSE ASSOCIATES INC. MANAGER 24722
1009720001 400 EAST 20 STREET MARCO M MATTIA ROSE ASSOCIATES INC. MANAGER 8942
1009720001 272 1 AVENUE MARCO M MATTIA ROSE ASSOCIATES INC. MANAGER 8942
1014207501 1111 3 AVENUE BRIAN FALLON MH RESIDENTIAL 1 LLC 8575
1009720001 530 EAST 20 STREET MARCO M MATTIA ROSE ASSOCIATES INC. MANAGER 8416
1009720001 18 STUYVESANT OVAL PAUL CENZOPRANO TISHMAN/SPEYER PCVST MANAG LLC 6862
2039447501 2100 EAST TREMONT AVENUE CHARLES TUCCI PPC PROPERTY SERVICE LLC 6800
1009720001 281 AVENUE C MARCO M MATTIA ROSE ASSOCIATES INC. MANAGER 6312
1009720001 18 STUYVESANT OVAL PAUL CENZOPRANO TISHMAN SPEYER PCVST MANAG LLC 6204
1009720001 5 STUYVESANT OVAL MARCO M MATTIA ROSE ASSOCIATES INC. MANAGER 5786

An unexpected result occurred. Now the total_complaints is different from before. There are multiple BBLs for the same building in both the dobjobs and complaints tables a MANY to MANY join. I believe Postgres executes the SELECT count(bbl) after the join, resulting in a count of all the times the BBL appears in both tables. I need to ensure that the correct count appears. The solution lies in doing a subquery. In the FROM part of a SQL query you are allowed to SELECT from a SELECT statement. (and presumably you can SELECT from a SELECT from a SELECT too)

SELECT c.bbl AS bbl,
       c.address AS address,
       dobjobs.ownername AS jobs_owner,
       dobjobs.ownerbusinessname AS jobs_business,
       c.total_complaints AS total_complaints
 FROM
   (SELECT complaints.address,
          count(complaints.bbl) AS total_complaints,
          complaints.bbl
     FROM complaints
     GROUP BY complaints.address,
            complaints.bbl) AS c
LEFT JOIN dobjobs ON c.bbl = dobjobs.bbl
ORDER BY total_complaints DESC;
bbl address jobs_owner jobs_business total_complaints
1021740175 34 ARDEN STREET 1730
3017780038 554 DE KALB AVENUE RAMON MENDEZ MENDEZ DELI 1049
3061207502 9511 SHORE ROAD USAMA EDRIS N/A 920
3061207502 9511 SHORE ROAD NGOON L GOON N/A 920
3061207502 9511 SHORE ROAD GINA CIAIO G & M EQUIPMENT LEASING 920
3061207502 9511 SHORE ROAD PETER KRAUS C0-BUILDERS ASSOCIATES LLC 920
2032630007 3810 BAILEY AVENUE MICHAEL GOLDBERG 3810 LLC 748
2032630007 3810 BAILEY AVENUE MICHAEL GOLDBERG 3810 LLC 748
2037140054 1025 BOYNTON AVENUE ERAN GOHARI 1025 BOYNTON AVENUE LLC 538
2032920019 2968 PERRY AVENUE MEYER BRECHER 1750 GRAND LLC 400

Now the result lines up with our initial count. The first row is blank, which is okay. This is because not every BBL can be found in the dobjobs database. A LEFT JOIN keeps all rows from the table on the left (in our case complaints) and puts NULLs when a join cannot be made. HOWEVER, another problem as appeared. We now have duplicate BBLs. Time for another subquery. We will join with a subquery of dobjobs using DISTINCT to ensure we only have one BBL.

  SELECT c.bbl AS bbl,
      c.address AS address,
      jobs.ownername AS jobs_owner,
      jobs.ownerbusinessname AS jobs_business,
      c.total_complaints AS total_complaints
   FROM
    (SELECT complaints.address,
         count(complaints.bbl) AS total_complaints,
         complaints.bbl
       FROM complaints
       GROUP BY complaints.address,
           complaints.bbl) AS c
   LEFT JOIN
       (SELECT DISTINCT ON (bbl) bbl,
                     ownername,
                     ownerbusinessname
       FROM dobjobs
       ORDER BY bbl) AS jobs ON c.bbl = jobs.bbl
   ORDER BY total_complaints DESC

RESULT:

bbl address jobs_owner jobs_business total_complaints
1021740175 34 ARDEN STREET 1730
3017780038 554 DE KALB AVENUE RAMON MENDEZ MENDEZ DELI 1049
3061207502 9511 SHORE ROAD NGOON L GOON N/A 920
2032630007 3810 BAILEY AVENUE MICHAEL GOLDBERG 3810 LLC 748
2037140054 1025 BOYNTON AVENUE ERAN GOHARI 1025 BOYNTON AVENUE LLC 538
2032920019 2968 PERRY AVENUE MEYER BRECHER 1750 GRAND LLC 400
2031960010 30 BUCHANAN PLACE KAMALJIT SINGH B H & B CONSTRUCTION INC. 398
2024590034 888 GRAND CONCOURSE ADNAN ALSHABBI DELI-GROCERY STORE 394
4013120010 72-10 41 AVENUE BRADLEY LEWART MASSACHUSETTES LEASING CO 389
2033100081 2955 GRAND CONCOURSE JOSHUA SIEW BARONS GATE TOWERS LLC 382
2042920001 2074 WALLACE AVENUE VITO SACCHETTI BRADY LLC 356
2039447501 2090 EAST TREMONT AVENUE CHARLES TUCCI PPC PROPERTY SERVICE LLC 344
2026100012 530 EAST 169 STREET KARAM SINGH FORDHAM FULTON REALTY CORP 300

Wonderful. Complaints joined with dobjobs, with the correct count. Here I paused for a while. I thought I was done. But after staring at the results for long enough I realized a problem with this. The "DISTINCT ON (bbL)" subquery was picking distinct rows essentially randomly. Because I want to make sure the owner listed is the most up-to-date, I need to postgres to not pick a random BBL to join but instead choose the most current one. An easy fix? I thought I'd tack on a ORDER BY to my subquery and be done with it. So I tried to modify the subquery -- the part between the "LEFT JOIN" and "AS job ON" above to this:

   SELECT DISTINCT ON (bbl) bbl,
                  ownername,
                  ownerbusinessname,
                  latestactiondate
   FROM dobjobs
   ORDER BY latestactiondate DESC;

Yet, this produced an error:

ERROR: SELECT DISTINCT ON expressions must match initial ORDER BY expressions

Hmm. According to the docs "the DISTINCT ON expression(s) must match the leftmost ORDER BY expression(s). The ORDER BY clause will normally contain additional expression(s) that determine the desired precedence of rows within each DISTINCT ON group." Fine, but how to get the most recent distinct BBL? Thanks to this stackoverflow question, the solution lies in yet another subquery:

  SELECT *
   FROM
       (SELECT DISTINCT ON (bbl) bbl,
                     ownername,
                     ownerbusinessname,
                     latestactiondate
       FROM dobjobs
       ORDER BY bbl) AS dob
   ORDER BY dob.latestactiondate DESC;

Place that subquery back into our previous query and we get:

SELECT c.bbl AS bbl,
       c.address AS address,
       jobs.ownername AS jobs_owner,
       jobs.ownerbusinessname AS jobs_business,
       jobs.latestactiondate AS jobs_date,
       c.total_complaints AS total_complaints
    FROM
        (SELECT complaints.address,
          count(complaints.bbl) AS total_complaints,
          complaints.bbl
        FROM complaints
        GROUP BY complaints.address,
                    complaints.bbl) AS c
    LEFT JOIN
        (SELECT *
         FROM
            (SELECT DISTINCT ON (bbl) bbl,
                         ownername,
                         ownerbusinessname,
                         latestactiondate
            FROM dobjobs
            ORDER BY bbl) AS dob
        ORDER BY dob.latestactiondate DESC ) 
	        AS jobs ON c.bb = jobs.bbl
ORDER BY total_complaints DESC;

Which produces:

bbl address jobs_owner jobs_business jobs_date total_complaints
1021740175 34 ARDEN STREET 1730
3017780038 554 DE KALB AVENUE RAMON MENDEZ MENDEZ DELI 2012-01-24 1049
3061207502 9511 SHORE ROAD NGOON L GOON N/A 2010-12-15 920
2032630007 3810 BAILEY AVENUE MICHAEL GOLDBERG 3810 LLC 2013-11-25 748
2037140054 1025 BOYNTON AVENUE ERAN GOHARI 1025 BOYNTON AVENUE LLC 2014-07-09 538
2032920019 2968 PERRY AVENUE MEYER BRECHER 1750 GRAND LLC 2013-01-22 400
2031960010 30 BUCHANAN PLACE PETER MAGISTRO UNIVERSITY AVENUE LP 2013-10-23 398
2024590034 888 GRAND CONCOURSE ADNAN ALSHABBI DELI-GROCERY STORE 2010-08-20 394
4013120010 72-10 41 AVENUE BRADLEY LEWART MASSACHUSETTES LEASING CO 2012-03-15 389
2033100081 2955 GRAND CONCOURSE JOSHUA SIEW BARONS GATE TOWERS LLC 2013-05-30 382
2042920001 2074 WALLACE AVENUE NICHOLAS CHIMENTI BROTHER-SISTER LLC 2013-05-03 356
2039447501 2090 EAST TREMONT AVENUE LORRAINE HOSTY COLGATE ENTERPRISES CORP 2012-09-26 344
2026100012 530 EAST 169 STREET RAJ SINGH C/O - ERAN GOHARI 2014-07-07 300

So finally we can also join the pluto data to get the table I was after. Complaints with ownership information from two sources, correctly sorted by complaints and ensuring we get the most up-to-date owner from the dob table:

SELECT c.bbl AS bbl,
       c.address AS address,
       jobs.ownername AS jobs_owner,
       jobs.ownerbusinessname AS jobs_business,
       jobs.latestactiondate AS jobs_date,
       pluto.ownername AS pluto_owner,
       pluto.unitsres AS units_res,
       c.total_complaints AS total_complaints
FROM
   (SELECT complaints.address,
          count(complaints.bbl) AS total_complaints,
          complaints.bbl
    FROM complaints
    GROUP BY complaints.address,
             complaints.bbl) AS c
    LEFT JOIN
        (SELECT *
         FROM
            (SELECT DISTINCT ON (bbl) bbl,
                         ownername,
                         ownerbusinessname,
                         latestactiondate
            FROM dobjobs
            ORDER BY bbl) AS dob
        ORDER BY dob.latestactiondate DESC ) AS jobs ON c.bbl = jobs.bbl
LEFT JOIN pluto ON c.bbl = pluto.bbl
ORDER BY total_complaints DESC;

Result:

bbl address jobs_owner jobs_business jobs_date pluto_owner units_res total_complaints
1021740175 34 ARDEN STREET 24-32 ARDEN ASSOCIATE 67 1730
3017780038 554 DE KALB AVENUE RAMON MENDEZ MENDEZ DELI 2012-01-24 ALFREDO MORALES 6 1049
3061207502 9511 SHORE ROAD NGOON L GOON N/A 2010-12-15 THE 9511 SHORE RD CON 104 920
2032630007 3810 BAILEY AVENUE MICHAEL GOLDBERG 3810 LLC 2013-11-25 3810 LLC 57 748
2037140054 1025 BOYNTON AVENUE ERAN GOHARI 1025 BOYNTON AVENUE LLC 2014-07-09 1025 BOYNTON AVENUERE 61 538
2032920019 2968 PERRY AVENUE MEYER BRECHER 1750 GRAND LLC 2013-01-22 2968 PERRY LLC 31 400
2031960010 30 BUCHANAN PLACE PETER MAGISTRO UNIVERSITY AVENUE LP 2013-10-23 UNIVERSITY AVENUE, L. 47 398
2024590034 888 GRAND CONCOURSE ADNAN ALSHABBI DELI-GROCERY STORE 2010-08-20 TINY FIESTA REALTY AS 81 394
4013120010 72-10 41 AVENUE BRADLEY LEWART MASSACHUSETTES LEASING CO 2012-03-15 MASSACHUSETTS LEASING 143 389
2033100081 2955 GRAND CONCOURSE JOSHUA SIEW BARONS GATE TOWERS LLC 2013-05-30 BARONSGATE TOWERS LLC 76 382
2042920001 2074 WALLACE AVENUE NICHOLAS CHIMENTI BROTHER-SISTER LLC 2013-05-03 BRADY LLC 191 356
2039447501 2090 EAST TREMONT AVENUE LORRAINE HOSTY COLGATE ENTERPRISES CORP 2012-09-26 PARKCHESTER NORTH CON 3857 344
2026100012 530 EAST 169 STREET RAJ SINGH C/O - ERAN GOHARI 2014-07-07 FORDHAM FULTON REALTY 321 300
1017740051 142 EAST 126 STREET DIMITRIOUS TSOKONAS TSOKONAS 2010-05-18 2085 LEX OPERATING CO 30 275
2031910045 2645 MORRIS AVENUE 2645 REALTY ASSOCIATE 32 275
1022190001 101 POST AVENUE MORRIS PILLAR N/A 2011-03-17 POST AVENUE ASSOCIATE 121 261
2032500071 2856 WEBB AVENUE NISAR AHMED N/A 2013-10-10 DRAN VATAJ 25 256
2032200001 2500 WEBB AVENUE 2500 WEBB LLC 57 255
..
The final queries, visually, looks like this:

image of query

@aguestuser
Copy link

beautiful! i love posts like this that map out a person's path to learning! one of the best ways (IMHO) to find one's own way into tackling a related question. thanks ziggy! :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment