-
-
Save zsiv/bffea1a7709076ce076e6017bd603331 to your computer and use it in GitHub Desktop.
/* | |
Top 10 "helpful" locations | |
Assumptions: | |
- > 100 users in location | |
- no null/blank location | |
- most helpful calculated by highest overall average answer score (subjective) | |
*/ | |
WITH AnswerData AS ( | |
SELECT u.Id, u.[Location], AVG(p.Score) AS 'AvgAnswerScore' | |
FROM dbo.Users u INNER JOIN dbo.Posts p ON u.Id = p.OwnerUserId | |
WHERE p.PostTypeID = 2 | |
AND u.[Location] IS NOT NULL | |
AND u.[Location] <> '' | |
GROUP BY u.Id, u.[Location] | |
) | |
SELECT TOP 10 [Location], COUNT(*) AS [Population], AVG(AvgAnswerScore) AS AvgAnswerRep | |
FROM AnswerData | |
GROUP BY [Location] | |
HAVING COUNT(*) >= 100 | |
ORDER BY AvgAnswerRep DESC | |
/* | |
Top 10 "help needed" locations | |
Assumptions: | |
- > 100 users in location | |
- no null/blank location | |
- most help needed calculated by subtracting the total answers given in a location from the number of questions asked | |
(subjective, case could be made to average over individual user deltas?) | |
*/ | |
WITH HelpData AS ( | |
SELECT u.Id, u.[Location], COUNT(CASE WHEN p.PostTypeId = 1 THEN 1 END) AS QuestionCount, | |
COUNT(CASE WHEN p.PostTypeId = 2 THEN 1 END) AS AnswerCount | |
FROM dbo.Users u INNER JOIN dbo.Posts p ON u.Id = p.OwnerUserId | |
WHERE u.[Location] IS NOT NULL | |
AND u.[Location] <> '' | |
GROUP BY u.Id, u.[Location] | |
) | |
SELECT TOP 10 [Location], COUNT(*) AS [Population], (SUM(QuestionCount) - SUM(AnswerCount)) AS AnswerDelta | |
FROM HelpData | |
GROUP BY [Location] | |
HAVING COUNT(*) >= 100 | |
ORDER BY AnswerDelta DESC |
I just had a nice conversation with Copilot, wondering why my aggregation (average) of per-user average answer scores - coded in a similar fashion as the first SELECTs above - produced wrong results. It told me to look up Simpson's paradox, or TL;DR: User A: 100 answers each scoring 1. User B: 1 answer scoring 100. Sum(scores)/count(answers) = 200/101 = 1.98. AVG(AVG(Score[effectively per answer]) per user) = AVG(1, 100) = 50.5. And I STILL fell for it even though I had bumped into this at least twice before, once when computers used plain MS-DOS (and I fixed that bug) and once with a 2013 Toyota Auris whose fuel economy display was optimistic (and I think I know why).
@ahlstart I actually loved the idea of aggregating the per-user average scores for this because it's different than the approach I considered. I wrote this challenge with purposely vague specs because I wanted to show a variety of solutions, and yours was a good example of that!
@ahlstart Just finished reading up on Simpson's Paradox - interesting stuff. I guess it comes down to something as simple as an average (or average of averages in this case) potentially losing nuance of meaning in the results when all the other relevant statistical forces at play aren't considered. Maybe this is why I'm a DBA and not a mathematician :)
@zsiv Here's hoping you can some day point out this issue as the cause why one of your users' queries behaves in ever so slightly unexpected ways.
Both our intuitions were to approach this from the point of individual user. And AVG(AVG()) pleased Brent.
I made a couple of tweaks just to better understand the data, and in the first query, I multiplied Score * 1.0 so that we'd get decimal places in the average answer reputation. (Otherwise there were ties.)