Skip to content

Instantly share code, notes, and snippets.

@zsiv
Created February 29, 2024 19:33
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save zsiv/bffea1a7709076ce076e6017bd603331 to your computer and use it in GitHub Desktop.
Save zsiv/bffea1a7709076ce076e6017bd603331 to your computer and use it in GitHub Desktop.
BrentOzar.com Code Example for 2/29/24
/*
Top 10 "helpful" locations
Assumptions:
- > 100 users in location
- no null/blank location
- most helpful calculated by highest overall average answer score (subjective)
*/
WITH AnswerData AS (
SELECT u.Id, u.[Location], AVG(p.Score) AS 'AvgAnswerScore'
FROM dbo.Users u INNER JOIN dbo.Posts p ON u.Id = p.OwnerUserId
WHERE p.PostTypeID = 2
AND u.[Location] IS NOT NULL
AND u.[Location] <> ''
GROUP BY u.Id, u.[Location]
)
SELECT TOP 10 [Location], COUNT(*) AS [Population], AVG(AvgAnswerScore) AS AvgAnswerRep
FROM AnswerData
GROUP BY [Location]
HAVING COUNT(*) >= 100
ORDER BY AvgAnswerRep DESC
/*
Top 10 "help needed" locations
Assumptions:
- > 100 users in location
- no null/blank location
- most help needed calculated by subtracting the total answers given in a location from the number of questions asked
(subjective, case could be made to average over individual user deltas?)
*/
WITH HelpData AS (
SELECT u.Id, u.[Location], COUNT(CASE WHEN p.PostTypeId = 1 THEN 1 END) AS QuestionCount,
COUNT(CASE WHEN p.PostTypeId = 2 THEN 1 END) AS AnswerCount
FROM dbo.Users u INNER JOIN dbo.Posts p ON u.Id = p.OwnerUserId
WHERE u.[Location] IS NOT NULL
AND u.[Location] <> ''
GROUP BY u.Id, u.[Location]
)
SELECT TOP 10 [Location], COUNT(*) AS [Population], (SUM(QuestionCount) - SUM(AnswerCount)) AS AnswerDelta
FROM HelpData
GROUP BY [Location]
HAVING COUNT(*) >= 100
ORDER BY AnswerDelta DESC
@BrentOzar
Copy link

I made a couple of tweaks just to better understand the data, and in the first query, I multiplied Score * 1.0 so that we'd get decimal places in the average answer reputation. (Otherwise there were ties.)

/*
Top 10 "helpful" locations

Assumptions:
- > 100 users in location
- no null/blank location
- most helpful calculated by highest overall average answer score (subjective)
*/

WITH AnswerData AS (
	SELECT u.Id, u.[Location], AVG(p.Score * 1.0) AS 'AvgAnswerScore',
		COUNT(DISTINCT p.Id) AS AnswerCount
	FROM dbo.Users u INNER JOIN dbo.Posts p ON u.Id = p.OwnerUserId
	WHERE p.PostTypeID = 2
		AND u.[Location] IS NOT NULL
		AND u.[Location] <> ''
	GROUP BY u.Id, u.[Location]
)

SELECT TOP 10 [Location], COUNT(*) AS [Population], AVG(AvgAnswerScore) AS AvgAnswerRep,
	SUM(AnswerCount) AS AnswerCount, COUNT(DISTINCT Id) AS Population
FROM AnswerData
GROUP BY [Location]
HAVING COUNT(*) >= 100
ORDER BY AvgAnswerRep DESC


/*
Top 10 "help needed" locations

Assumptions:
- > 100 users in location
- no null/blank location
- most help needed calculated by subtracting the total answers given in a location from the number of questions asked 
 (subjective, case could be made to average over individual user deltas?)
*/
GO
WITH HelpData AS (
	SELECT u.Id, u.[Location], COUNT(CASE WHEN p.PostTypeId = 1 THEN 1 END) AS QuestionCount,
		COUNT(CASE WHEN p.PostTypeId = 2 THEN 1 END) AS AnswerCount
	FROM dbo.Users u INNER JOIN dbo.Posts p ON u.Id = p.OwnerUserId
	WHERE u.[Location] IS NOT NULL
		AND u.[Location] <> ''
	GROUP BY u.Id, u.[Location]
)

SELECT TOP 10 [Location], COUNT(*) AS [Population], (SUM(QuestionCount) - SUM(AnswerCount)) AS AnswerDelta,
	SUM(QuestionCount) AS QuestionCount, SUM(AnswerCount) AS AnswerCount
FROM HelpData
GROUP BY [Location]
HAVING COUNT(*) >= 100
ORDER BY AnswerDelta DESC

@ahlstart
Copy link

ahlstart commented Mar 1, 2024

I just had a nice conversation with Copilot, wondering why my aggregation (average) of per-user average answer scores - coded in a similar fashion as the first SELECTs above - produced wrong results. It told me to look up Simpson's paradox, or TL;DR: User A: 100 answers each scoring 1. User B: 1 answer scoring 100. Sum(scores)/count(answers) = 200/101 = 1.98. AVG(AVG(Score[effectively per answer]) per user) = AVG(1, 100) = 50.5. And I STILL fell for it even though I had bumped into this at least twice before, once when computers used plain MS-DOS (and I fixed that bug) and once with a 2013 Toyota Auris whose fuel economy display was optimistic (and I think I know why).

@BrentOzar
Copy link

@ahlstart I actually loved the idea of aggregating the per-user average scores for this because it's different than the approach I considered. I wrote this challenge with purposely vague specs because I wanted to show a variety of solutions, and yours was a good example of that!

@zsiv
Copy link
Author

zsiv commented Mar 2, 2024

@ahlstart Just finished reading up on Simpson's Paradox - interesting stuff. I guess it comes down to something as simple as an average (or average of averages in this case) potentially losing nuance of meaning in the results when all the other relevant statistical forces at play aren't considered. Maybe this is why I'm a DBA and not a mathematician :)

@ahlstart
Copy link

ahlstart commented Mar 2, 2024

@zsiv Here's hoping you can some day point out this issue as the cause why one of your users' queries behaves in ever so slightly unexpected ways.
Both our intuitions were to approach this from the point of individual user. And AVG(AVG()) pleased Brent.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment