Skip to content

Instantly share code, notes, and snippets.

@Montro1981
Last active March 1, 2024 16:40
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save Montro1981/794b3a2625e0b70b546818e57c42b22d to your computer and use it in GitHub Desktop.
Save Montro1981/794b3a2625e0b70b546818e57c42b22d to your computer and use it in GitHub Desktop.
Query Exercise: Finding Sister Locations to Help Each Other
/*
First, what are the top 10 Locations populated with users who seem to be really helpful, meaning, they write really good answers?
Things to consider:
1. There are null and empty locations. Those should probably be excluded.
Filter(s) to add: Location IS NOT NULL AND Location != N''
2. In the Users table, the Location column’s datatype is NVARCHAR(100). People can type in whatever they want, which leads to a lot of oddball one-off locations. We should filter for a minimum number of people involved, especially since we’re considering paid advertising. Let’s filter for only locations with at least 100 people in them.
Filter(s) to add: GROUP BY Location HAVING COUNT(Id) >= 100
3. For the sake of this exercise, let’s pretend that each Location value is geographically unique. You’ll notice in the screenshot that there are entries for “India” and for “Bangalore”. You and I know that everyone who lives in Bangalore also lives in India, but for the sake of this question, let’s pretend that’s not the case. In reality, obviously we need to clean up our Location data, but I’ll save that for another challenge.
Okay, so no data cleaning. Good makes things easier but will screw results
4. High values for Reputation do not necessarily mean that people write really good answers. You can earn Reputation points a lot of ways, including writing good questions.
We'll look at the Score of the answers, answers are Posts with PostTypeId = 2
Filter(s) to add: PostTypeId = 2
*/
/* Let's create some indexes to support our queries, so that the results are faster while we are tinkering with the queries:
CREATE NONCLUSTERED INDEX [PostTypeId_incl_OwnerUserIdScoreParentId] ON dbo.[Posts] ([PostTypeId]) INCLUDE ([OwnerUserId], [Score], [ParentId]);
CREATE NONCLUSTERED INDEX [Location] ON dbo.[Users] ([Location]);
GO
*/
DECLARE
@MinimumAmountOfUsers int = 100
,@NumberOfHelpfulLocationd int = 10
,@NumberOfNeedyLocationd int = 10;
WITH UsersPerLocation AS
(
SELECT
[Id] = users.[Id]
,[Location] = users.[Location]
,[NumberOfUsers] = COUNT(*) OVER (PARTITION BY users.[Location])
FROM dbo.[Users] AS users
WHERE EXISTS
(
SELECT NULL
FROM dbo.[Users] AS usersPerLocation
WHERE usersPerLocation.[Location] IS NOT NULL
AND usersPerLocation.[Location] != N''
AND users.[Location] = usersPerLocation.[Location]
GROUP BY usersPerLocation.[Location]
HAVING COUNT(usersPerLocation.[Id]) >= @MinimumAmountOfUsers
)
)
SELECT TOP (@NumberOfHelpfulLocationd)
[Location] = upl.[Location]
,[NumberOfUsers] = MAX(upl.[NumberOfUsers])
,[UsersAnswering] = COUNT(Answers.[OwnerUserId])
,[PercentageUsersAnswering] = CAST(ROUND(100.0 * COUNT(Answers.[OwnerUserId]) / MAX(upl.[NumberOfUsers]), 2) AS decimal(5, 2))
,[NumberOfAnwsers] = SUM(Answers.[NumberOfAnwsers])
,[TotalScore] = SUM(Answers.[SummedScore])
,[AverageScore] = CAST(ROUND(1.0 * AVG(Answers.[AverageScore]), 2) AS decimal(8, 2))
FROM UsersPerLocation AS upl
CROSS APPLY
(
SELECT
[OwnerUserId] = Answer.[OwnerUserId]
,[SummedScore] = SUM(Answer.[Score])
,[AverageScore] = 1.0 * AVG(Answer.[Score])
,[NumberOfAnwsers] = COUNT(Answer.[OwnerUserId])
FROM dbo.[Posts] AS Answer
WHERE Answer.[PostTypeId] = 2
AND Answer.[OwnerUserId] = upl.[Id]
GROUP BY Answer.[OwnerUserId]
) AS Answers
GROUP BY upl.[Location]
ORDER BY [AverageScore] DESC, [UsersAnswering] DESC;
/*
Second, what are the top 10 Locations where people seem to need the most help? Meaning, they ask a lot of questions, but they do not seem to be answering those of their neighbors?
How should I interpret this "but they do not seem to be answering those of their neighbors"?
1: Location of the person answering the question is not the same as the location as the person asking the question
2: The person has not answered a single question
*/
/* 1: Location of the person answering the question is not the same as the location as the person asking the question */
WITH UsersPerLocation AS
(
SELECT
[Id] = users.[Id]
,[Location] = users.[Location]
,[NumberOfUsers] = COUNT(*) OVER (PARTITION BY users.[Location])
FROM dbo.[Users] AS users
WHERE EXISTS
(
SELECT NULL
FROM dbo.[Users] AS usersPerLocation
WHERE usersPerLocation.[Location] IS NOT NULL
AND usersPerLocation.[Location] != N''
AND users.[Location] = usersPerLocation.[Location]
GROUP BY usersPerLocation.[Location]
HAVING COUNT(usersPerLocation.[Id]) >= @MinimumAmountOfUsers
)
)
SELECT TOP (@NumberOfNeedyLocationd)
[Location] = upl.[Location]
,[NumberOfUsers] = MAX(upl.[NumberOfUsers])
,[UserQuestions] = SUM(Questions.[NumberOfQuestions])
,[AverageQuestionsPerUser] = CAST(ROUND(1.0 * SUM(Questions.[NumberOfQuestions]) / MAX(upl.[NumberOfUsers]), 2) AS decimal(14, 2))
,[UserAnswers] = SUM(Questions.[NumberOfAnswers])
,[AverageAnswersPerUser] = CAST(ROUND(1.0 * SUM(Questions.[NumberOfAnswers]) / MAX(upl.[NumberOfUsers]), 2) AS decimal(14, 2))
,[UsersQuestioning] = COUNT(DISTINCT Questions.[OwnerUserId])
,[UsersAnswering] = SUM(Questions.[HasAnswers])
,[PercentageUsersQuestions] = CAST(ROUND(100.0 * COUNT(DISTINCT Questions.[OwnerUserId]) / MAX(upl.[NumberOfUsers]), 2) AS decimal(5, 2))
,[PercentageUsersNotAnswering] = CAST(ROUND(100.0 * (COUNT(DISTINCT Questions.[OwnerUserId]) - SUM(Questions.[HasAnswers])) / MAX(upl.[NumberOfUsers]), 2) AS decimal(5, 2))
FROM UsersPerLocation AS upl
CROSS APPLY
(
SELECT
[OwnerUserId] = Question.[OwnerUserId]
,[NumberOfQuestions] = COUNT(Question.[OwnerUserId])
,[QuestionLocation] = Questioner.[Location]
,[HasAnswers] = IIF(MIN(Answers.[OwnerUserId]) IS NULL, 0, 1)
,[NumberOfAnswers] = SUM(ISNULL(Answers.[NumberOfAnswers], 0))
FROM dbo.[Posts] AS Question
INNER JOIN dbo.[Users] AS Questioner
ON Question.[OwnerUserId] = Questioner.[Id]
OUTER APPLY
(
SELECT
[OwnerUserId] = Answers.[OwnerUserId]
,[NumberOfAnswers] = COUNT(Answers.[OwnerUserId])
FROM dbo.[Posts] AS Answers
INNER JOIN dbo.[Users] AS Answerer
ON Answers.[OwnerUserId] = Answerer.[Id]
WHERE Answers.[PostTypeId] = 2
AND EXISTS
(
SELECT NULL
FROM dbo.[Posts] AS AnsweredQuestion
INNER JOIN dbo.[Users] AS AskedBy
ON AnsweredQuestion.[OwnerUserId] = AskedBy.[Id]
WHERE Answers.[ParentId] = AnsweredQuestion.[Id]
AND AnsweredQuestion.[PostTypeId] = 1
AND Answerer.[Location] != AskedBy.[Location]
)
AND Question.[OwnerUserId] = Answers.[OwnerUserId]
GROUP BY Answers.[OwnerUserId]
) AS Answers
WHERE Question.[PostTypeId] = 1
AND Question.[OwnerUserId] = upl.[Id]
GROUP BY
Questioner.[Location]
,Question.[OwnerUserId]
) AS Questions
GROUP BY upl.[Location]
ORDER BY [PercentageUsersNotAnswering] DESC;
/* 2: The person has not answered a single question */
WITH UsersPerLocation AS
(
SELECT
[Id] = users.[Id]
,[Location] = users.[Location]
,[NumberOfUsers] = COUNT(*) OVER (PARTITION BY users.[Location])
FROM dbo.[Users] AS users
WHERE EXISTS
(
SELECT NULL
FROM dbo.[Users] AS usersPerLocation
WHERE usersPerLocation.[Location] IS NOT NULL
AND usersPerLocation.[Location] != N''
AND users.[Location] = usersPerLocation.[Location]
GROUP BY usersPerLocation.[Location]
HAVING COUNT(usersPerLocation.[Id]) >= @MinimumAmountOfUsers
)
)
SELECT TOP (@NumberOfNeedyLocationd)
[Location] = upl.[Location]
,[NumberOfUsers] = MAX(upl.[NumberOfUsers])
,[UserQuestions] = SUM(Questions.[NumberOfQuestions])
,[AverageQuestionsPerUser] = CAST(ROUND(1.0 * SUM(Questions.[NumberOfQuestions]) / MAX(upl.[NumberOfUsers]), 2) AS decimal(14, 2))
,[UserAnswers] = SUM(Questions.[NumberOfAnswers])
,[AverageAnswersPerUser] = CAST(ROUND(1.0 * SUM(Questions.[NumberOfAnswers]) / MAX(upl.[NumberOfUsers]), 2) AS decimal(14, 2))
,[UsersQuestioning] = COUNT(DISTINCT Questions.[OwnerUserId])
,[UsersAnswering] = SUM(Questions.[HasAnswers])
,[PercentageUsersQuestions] = CAST(ROUND(100.0 * COUNT(DISTINCT Questions.[OwnerUserId]) / MAX(upl.[NumberOfUsers]), 2) AS decimal(5, 2))
,[PercentageUsersNotAnswering] = CAST(ROUND(100.0 * (COUNT(DISTINCT Questions.[OwnerUserId]) - SUM(Questions.[HasAnswers])) / MAX(upl.[NumberOfUsers]), 2) AS decimal(5, 2))
FROM UsersPerLocation AS upl
CROSS APPLY
(
SELECT
[OwnerUserId] = Question.[OwnerUserId]
,[NumberOfQuestions] = COUNT(Question.[OwnerUserId])
,[QuestionLocation] = Questioner.[Location]
,[HasAnswers] = IIF(MIN(Answers.[OwnerUserId]) IS NULL, 0, 1)
,[NumberOfAnswers] = SUM(ISNULL(Answers.[NumberOfAnswers], 0))
FROM dbo.[Posts] AS Question
INNER JOIN dbo.[Users] AS Questioner
ON Question.[OwnerUserId] = Questioner.[Id]
OUTER APPLY
(
SELECT
[OwnerUserId] = Answers.[OwnerUserId]
,[NumberOfAnswers] = COUNT(Answers.[OwnerUserId])
FROM dbo.[Posts] AS Answers
WHERE Answers.[PostTypeId] = 2
AND Question.[OwnerUserId] = Answers.[OwnerUserId]
GROUP BY Answers.[OwnerUserId]
) AS Answers
WHERE Question.[PostTypeId] = 1
AND Question.[OwnerUserId] = upl.[Id]
GROUP BY
Questioner.[Location]
,Question.[OwnerUserId]
) AS Questions
GROUP BY upl.[Location]
ORDER BY [PercentageUsersNotAnswering] DESC;
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment