Here's how this all started. I was working on my project to analyze heat/hot water complaints. I downloaded all the complaints for this heat season (there's over 200,00), geocoded them to obtain the bbls, and then imported them into postgres. I wanted to match complaints with their owners in order to find out whose responsible for not providing tenants' with essential services.
So I fired up PSQL and started typing out some queries.
SELECT bbl,
address,
Count(bbl) AS total_complaints
FROM complaints
GROUP BY bbl,
address
ORDER BY total_complaints DESC;
This query returns a sorted listed of buildings with most number of complaints:
bbl | address | total_complaints |
---|---|---|
1021740175 | 34 ARDEN STREET | 1730 |
3017780038 | 554 DE KALB AVENUE | 1049 |
3061207502 | 9511 SHORE ROAD | 920 |
2032630007 | 3810 BAILEY AVENUE | 748 |
2037140054 | 1025 BOYNTON AVENUE | 538 |
2032920019 | 2968 PERRY AVENUE | 400 |
2031960010 | 30 BUCHANAN PLACE | 398 |
... |
Although there are over 200,000 complaints this heat season, there are only 31,531 unique buildings.
Naturally, I wanted to know who were the people and companies behind these buildings. I have two ownership sources easily accessible: pluto and DOB permits. I brought those tables into my database. Now I have three tables:
- complaints (~200,000)
- pluto (~800,000)
- dobjobs form years 2010-2015 (~400,000)
They all have one field in common - BBL - a unique identifier for NYC tax lots. I thought, "great! We just need to do a join. T his should be easy" ...but as it turns out....
SELECT complaints.bbl AS bbl,
complaints.address AS address,
pluto.ownername AS OWNER,
count(complaints.bbl) AS total_complaints
FROM complaints,
pluto
WHERE complaints.bbl = pluto.bbl
GROUP BY complaints.bbl,
complaints.address,
pluto.ownername
ORDER BY total_complaints DESC;
This joins pluto to the complaints data so we can figure out the owner for each building. Note the GROUP BY part of the sql statement. If you want to sort by an Aggregate Expression (such as count) you must use GROUP BY and all the columns (except for count) must appear in the GROUP BY part. This results in:
bbl | address | owner | total_complaints |
---|---|---|---|
1021740175 | 34 ARDEN STREET | 24-32 ARDEN ASSOCIATE | 1730 |
3017780038 | 554 DE KALB AVENUE | ALFREDO MORALES | 1049 |
3061207502 | 9511 SHORE ROAD | THE 9511 SHORE RD CON | 920 |
2032630007 | 3810 BAILEY AVENUE | 3810 LLC | 748 |
2037140054 | 1025 BOYNTON AVENUE | 1025 BOYNTON AVENUERE | 538 |
2032920019 | 2968 PERRY AVENUE | 2968 PERRY LLC | 400 |
2031960010 | 30 BUCHANAN PLACE | UNIVERSITY AVENUE, L. | 398 |
2024590034 | 888 GRAND CONCOURSE | TINY FIESTA REALTY AS | 394 |
... |
Works perfectly! There are over 200 buildings with more than 50 complaints and 2 with over 1000! So let's try to join dobjobs now. Dobjobs contains both the owner's business and actual name, which is why I find it an appealing data source to use.
SELECT complaints.bbl AS bbl,
complaints.address AS address,
dobjobs.ownername AS jobs_owner,
dobjobs.ownerbusinessname AS jobs_business,
count(complaints.bbl) AS total_complaints
FROM complaints
LEFT JOIN dobjobs ON complaints.bbl = dobjobs.bbl
GROUP BY complaints.bbl,
complaints.address,
dobjobs.ownerbusinessname,
dobjobs.ownername
ORDER BY total_complaints DESC
bbl | address | jobs_owner | jobs_business | total_complaints |
---|---|---|---|---|
2039447501 | 2090 EAST TREMONT AVENUE | CHARLES TUCCI | PPC PROPERTY SERVICE LLC | 27520 |
1009720001 | 18 STUYVESANT OVAL | MARCO M MATTIA | ROSE ASSOCIATES INC. MANAGER | 24722 |
1009720001 | 400 EAST 20 STREET | MARCO M MATTIA | ROSE ASSOCIATES INC. MANAGER | 8942 |
1009720001 | 272 1 AVENUE | MARCO M MATTIA | ROSE ASSOCIATES INC. MANAGER | 8942 |
1014207501 | 1111 3 AVENUE | BRIAN FALLON | MH RESIDENTIAL 1 LLC | 8575 |
1009720001 | 530 EAST 20 STREET | MARCO M MATTIA | ROSE ASSOCIATES INC. MANAGER | 8416 |
1009720001 | 18 STUYVESANT OVAL | PAUL CENZOPRANO | TISHMAN/SPEYER PCVST MANAG LLC | 6862 |
2039447501 | 2100 EAST TREMONT AVENUE | CHARLES TUCCI | PPC PROPERTY SERVICE LLC | 6800 |
1009720001 | 281 AVENUE C | MARCO M MATTIA | ROSE ASSOCIATES INC. MANAGER | 6312 |
1009720001 | 18 STUYVESANT OVAL | PAUL CENZOPRANO | TISHMAN SPEYER PCVST MANAG LLC | 6204 |
1009720001 | 5 STUYVESANT OVAL | MARCO M MATTIA | ROSE ASSOCIATES INC. MANAGER | 5786 |
An unexpected result occurred. Now the total_complaints is different from before. There are multiple BBLs for the same building in both the dobjobs and complaints tables a MANY to MANY join. I believe Postgres executes the SELECT count(bbl) after the join, resulting in a count of all the times the BBL appears in both tables. I need to ensure that the correct count appears. The solution lies in doing a subquery. In the FROM part of a SQL query you are allowed to SELECT from a SELECT statement. (and presumably you can SELECT from a SELECT from a SELECT too)
SELECT c.bbl AS bbl,
c.address AS address,
dobjobs.ownername AS jobs_owner,
dobjobs.ownerbusinessname AS jobs_business,
c.total_complaints AS total_complaints
FROM
(SELECT complaints.address,
count(complaints.bbl) AS total_complaints,
complaints.bbl
FROM complaints
GROUP BY complaints.address,
complaints.bbl) AS c
LEFT JOIN dobjobs ON c.bbl = dobjobs.bbl
ORDER BY total_complaints DESC;
bbl | address | jobs_owner | jobs_business | total_complaints |
---|---|---|---|---|
1021740175 | 34 ARDEN STREET | 1730 | ||
3017780038 | 554 DE KALB AVENUE | RAMON MENDEZ | MENDEZ DELI | 1049 |
3061207502 | 9511 SHORE ROAD | USAMA EDRIS | N/A | 920 |
3061207502 | 9511 SHORE ROAD | NGOON L GOON | N/A | 920 |
3061207502 | 9511 SHORE ROAD | GINA CIAIO | G & M EQUIPMENT LEASING | 920 |
3061207502 | 9511 SHORE ROAD | PETER KRAUS | C0-BUILDERS ASSOCIATES LLC | 920 |
2032630007 | 3810 BAILEY AVENUE | MICHAEL GOLDBERG | 3810 LLC | 748 |
2032630007 | 3810 BAILEY AVENUE | MICHAEL GOLDBERG | 3810 LLC | 748 |
2037140054 | 1025 BOYNTON AVENUE | ERAN GOHARI | 1025 BOYNTON AVENUE LLC | 538 |
2032920019 | 2968 PERRY AVENUE | MEYER BRECHER | 1750 GRAND LLC | 400 |
Now the result lines up with our initial count. The first row is blank, which is okay. This is because not every BBL can be found in the dobjobs database. A LEFT JOIN keeps all rows from the table on the left (in our case complaints) and puts NULLs when a join cannot be made. HOWEVER, another problem as appeared. We now have duplicate BBLs. Time for another subquery. We will join with a subquery of dobjobs using DISTINCT to ensure we only have one BBL.
SELECT c.bbl AS bbl,
c.address AS address,
jobs.ownername AS jobs_owner,
jobs.ownerbusinessname AS jobs_business,
c.total_complaints AS total_complaints
FROM
(SELECT complaints.address,
count(complaints.bbl) AS total_complaints,
complaints.bbl
FROM complaints
GROUP BY complaints.address,
complaints.bbl) AS c
LEFT JOIN
(SELECT DISTINCT ON (bbl) bbl,
ownername,
ownerbusinessname
FROM dobjobs
ORDER BY bbl) AS jobs ON c.bbl = jobs.bbl
ORDER BY total_complaints DESC
RESULT:
bbl | address | jobs_owner | jobs_business | total_complaints |
---|---|---|---|---|
1021740175 | 34 ARDEN STREET | 1730 | ||
3017780038 | 554 DE KALB AVENUE | RAMON MENDEZ | MENDEZ DELI | 1049 |
3061207502 | 9511 SHORE ROAD | NGOON L GOON | N/A | 920 |
2032630007 | 3810 BAILEY AVENUE | MICHAEL GOLDBERG | 3810 LLC | 748 |
2037140054 | 1025 BOYNTON AVENUE | ERAN GOHARI | 1025 BOYNTON AVENUE LLC | 538 |
2032920019 | 2968 PERRY AVENUE | MEYER BRECHER | 1750 GRAND LLC | 400 |
2031960010 | 30 BUCHANAN PLACE | KAMALJIT SINGH | B H & B CONSTRUCTION INC. | 398 |
2024590034 | 888 GRAND CONCOURSE | ADNAN ALSHABBI | DELI-GROCERY STORE | 394 |
4013120010 | 72-10 41 AVENUE | BRADLEY LEWART | MASSACHUSETTES LEASING CO | 389 |
2033100081 | 2955 GRAND CONCOURSE | JOSHUA SIEW | BARONS GATE TOWERS LLC | 382 |
2042920001 | 2074 WALLACE AVENUE | VITO SACCHETTI | BRADY LLC | 356 |
2039447501 | 2090 EAST TREMONT AVENUE | CHARLES TUCCI | PPC PROPERTY SERVICE LLC | 344 |
2026100012 | 530 EAST 169 STREET | KARAM SINGH | FORDHAM FULTON REALTY CORP | 300 |
Wonderful. Complaints joined with dobjobs, with the correct count. Here I paused for a while. I thought I was done. But after staring at the results for long enough I realized a problem with this. The "DISTINCT ON (bbL)" subquery was picking distinct rows essentially randomly. Because I want to make sure the owner listed is the most up-to-date, I need to postgres to not pick a random BBL to join but instead choose the most current one. An easy fix? I thought I'd tack on a ORDER BY to my subquery and be done with it. So I tried to modify the subquery -- the part between the "LEFT JOIN" and "AS job ON" above to this:
SELECT DISTINCT ON (bbl) bbl,
ownername,
ownerbusinessname,
latestactiondate
FROM dobjobs
ORDER BY latestactiondate DESC;
Yet, this produced an error:
ERROR: SELECT DISTINCT ON expressions must match initial ORDER BY expressions
Hmm. According to the docs "the DISTINCT ON expression(s) must match the leftmost ORDER BY expression(s). The ORDER BY clause will normally contain additional expression(s) that determine the desired precedence of rows within each DISTINCT ON group." Fine, but how to get the most recent distinct BBL? Thanks to this stackoverflow question, the solution lies in yet another subquery:
SELECT *
FROM
(SELECT DISTINCT ON (bbl) bbl,
ownername,
ownerbusinessname,
latestactiondate
FROM dobjobs
ORDER BY bbl) AS dob
ORDER BY dob.latestactiondate DESC;
Place that subquery back into our previous query and we get:
SELECT c.bbl AS bbl,
c.address AS address,
jobs.ownername AS jobs_owner,
jobs.ownerbusinessname AS jobs_business,
jobs.latestactiondate AS jobs_date,
c.total_complaints AS total_complaints
FROM
(SELECT complaints.address,
count(complaints.bbl) AS total_complaints,
complaints.bbl
FROM complaints
GROUP BY complaints.address,
complaints.bbl) AS c
LEFT JOIN
(SELECT *
FROM
(SELECT DISTINCT ON (bbl) bbl,
ownername,
ownerbusinessname,
latestactiondate
FROM dobjobs
ORDER BY bbl) AS dob
ORDER BY dob.latestactiondate DESC )
AS jobs ON c.bb = jobs.bbl
ORDER BY total_complaints DESC;
Which produces:
bbl | address | jobs_owner | jobs_business | jobs_date | total_complaints |
---|---|---|---|---|---|
1021740175 | 34 ARDEN STREET | 1730 | |||
3017780038 | 554 DE KALB AVENUE | RAMON MENDEZ | MENDEZ DELI | 2012-01-24 | 1049 |
3061207502 | 9511 SHORE ROAD | NGOON L GOON | N/A | 2010-12-15 | 920 |
2032630007 | 3810 BAILEY AVENUE | MICHAEL GOLDBERG | 3810 LLC | 2013-11-25 | 748 |
2037140054 | 1025 BOYNTON AVENUE | ERAN GOHARI | 1025 BOYNTON AVENUE LLC | 2014-07-09 | 538 |
2032920019 | 2968 PERRY AVENUE | MEYER BRECHER | 1750 GRAND LLC | 2013-01-22 | 400 |
2031960010 | 30 BUCHANAN PLACE | PETER MAGISTRO | UNIVERSITY AVENUE LP | 2013-10-23 | 398 |
2024590034 | 888 GRAND CONCOURSE | ADNAN ALSHABBI | DELI-GROCERY STORE | 2010-08-20 | 394 |
4013120010 | 72-10 41 AVENUE | BRADLEY LEWART | MASSACHUSETTES LEASING CO | 2012-03-15 | 389 |
2033100081 | 2955 GRAND CONCOURSE | JOSHUA SIEW | BARONS GATE TOWERS LLC | 2013-05-30 | 382 |
2042920001 | 2074 WALLACE AVENUE | NICHOLAS CHIMENTI | BROTHER-SISTER LLC | 2013-05-03 | 356 |
2039447501 | 2090 EAST TREMONT AVENUE | LORRAINE HOSTY | COLGATE ENTERPRISES CORP | 2012-09-26 | 344 |
2026100012 | 530 EAST 169 STREET | RAJ SINGH | C/O - ERAN GOHARI | 2014-07-07 | 300 |
So finally we can also join the pluto data to get the table I was after. Complaints with ownership information from two sources, correctly sorted by complaints and ensuring we get the most up-to-date owner from the dob table:
SELECT c.bbl AS bbl,
c.address AS address,
jobs.ownername AS jobs_owner,
jobs.ownerbusinessname AS jobs_business,
jobs.latestactiondate AS jobs_date,
pluto.ownername AS pluto_owner,
pluto.unitsres AS units_res,
c.total_complaints AS total_complaints
FROM
(SELECT complaints.address,
count(complaints.bbl) AS total_complaints,
complaints.bbl
FROM complaints
GROUP BY complaints.address,
complaints.bbl) AS c
LEFT JOIN
(SELECT *
FROM
(SELECT DISTINCT ON (bbl) bbl,
ownername,
ownerbusinessname,
latestactiondate
FROM dobjobs
ORDER BY bbl) AS dob
ORDER BY dob.latestactiondate DESC ) AS jobs ON c.bbl = jobs.bbl
LEFT JOIN pluto ON c.bbl = pluto.bbl
ORDER BY total_complaints DESC;
Result:
bbl | address | jobs_owner | jobs_business | jobs_date | pluto_owner | units_res | total_complaints |
---|---|---|---|---|---|---|---|
1021740175 | 34 ARDEN STREET | 24-32 ARDEN ASSOCIATE | 67 | 1730 | |||
3017780038 | 554 DE KALB AVENUE | RAMON MENDEZ | MENDEZ DELI | 2012-01-24 | ALFREDO MORALES | 6 | 1049 |
3061207502 | 9511 SHORE ROAD | NGOON L GOON | N/A | 2010-12-15 | THE 9511 SHORE RD CON | 104 | 920 |
2032630007 | 3810 BAILEY AVENUE | MICHAEL GOLDBERG | 3810 LLC | 2013-11-25 | 3810 LLC | 57 | 748 |
2037140054 | 1025 BOYNTON AVENUE | ERAN GOHARI | 1025 BOYNTON AVENUE LLC | 2014-07-09 | 1025 BOYNTON AVENUERE | 61 | 538 |
2032920019 | 2968 PERRY AVENUE | MEYER BRECHER | 1750 GRAND LLC | 2013-01-22 | 2968 PERRY LLC | 31 | 400 |
2031960010 | 30 BUCHANAN PLACE | PETER MAGISTRO | UNIVERSITY AVENUE LP | 2013-10-23 | UNIVERSITY AVENUE, L. | 47 | 398 |
2024590034 | 888 GRAND CONCOURSE | ADNAN ALSHABBI | DELI-GROCERY STORE | 2010-08-20 | TINY FIESTA REALTY AS | 81 | 394 |
4013120010 | 72-10 41 AVENUE | BRADLEY LEWART | MASSACHUSETTES LEASING CO | 2012-03-15 | MASSACHUSETTS LEASING | 143 | 389 |
2033100081 | 2955 GRAND CONCOURSE | JOSHUA SIEW | BARONS GATE TOWERS LLC | 2013-05-30 | BARONSGATE TOWERS LLC | 76 | 382 |
2042920001 | 2074 WALLACE AVENUE | NICHOLAS CHIMENTI | BROTHER-SISTER LLC | 2013-05-03 | BRADY LLC | 191 | 356 |
2039447501 | 2090 EAST TREMONT AVENUE | LORRAINE HOSTY | COLGATE ENTERPRISES CORP | 2012-09-26 | PARKCHESTER NORTH CON | 3857 | 344 |
2026100012 | 530 EAST 169 STREET | RAJ SINGH | C/O - ERAN GOHARI | 2014-07-07 | FORDHAM FULTON REALTY | 321 | 300 |
1017740051 | 142 EAST 126 STREET | DIMITRIOUS TSOKONAS | TSOKONAS | 2010-05-18 | 2085 LEX OPERATING CO | 30 | 275 |
2031910045 | 2645 MORRIS AVENUE | 2645 REALTY ASSOCIATE | 32 | 275 | |||
1022190001 | 101 POST AVENUE | MORRIS PILLAR | N/A | 2011-03-17 | POST AVENUE ASSOCIATE | 121 | 261 |
2032500071 | 2856 WEBB AVENUE | NISAR AHMED | N/A | 2013-10-10 | DRAN VATAJ | 25 | 256 |
2032200001 | 2500 WEBB AVENUE | 2500 WEBB LLC | 57 | 255 | |||
.. | |||||||
The final queries, visually, looks like this: |