Created
January 31, 2024 10:00
-
-
Save Montro1981/c26bb6c7ca0992416f6c351e9db44da9 to your computer and use it in GitHub Desktop.
Brent Ozar Query Exercise: Find Tagged Questions Faster
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
/* | |
1. What kinds of tags will perform worse than others for this query? | |
Since the LIKE it introduced with a double percent-sign '%%', it becomes non-sargable (as opposed to having a the wildcard at the end only 'x%') and the Engine will have to scan every row in the index to see the contents of the Tags column to find the correct rows up to the number in the TOP. | |
When seaching for tags that are not used often, like '<progress-reports>' that has been used once in the SO2013 database, with perform worse as the whole index will be scanned. | |
To illustrate, seaching for <sql-server> scans 17488 rows (0.1%) with 546 logical reads, and seaching for <progress-reports> scans 17142169 rows (100.0%) with 75555 logical reads. | |
*/ | |
/* Fact finding query to discover all distinct Tags */ | |
SELECT v.[value], COUNT(*) | |
FROM dbo.Posts | |
CROSS APPLY string_split(REPLACE(Tags, N'><', N'>;<'), N';') AS v | |
WHERE Tags IS NOT NULL | |
GROUP BY v.[value] | |
ORDER BY COUNT(*) ASC; | |
GO | |
DROP INDEX IF EXISTS Score_Tags ON dbo.Posts; | |
CREATE INDEX Score_Tags ON dbo.Posts(Score, Tags); | |
GO | |
SET STATISTICS IO ON; | |
SELECT TOP 100 * | |
FROM dbo.Posts | |
WHERE Tags LIKE '%<sql-server>%' | |
ORDER BY Score DESC; | |
--546 logical reads | |
SELECT TOP 100 * | |
FROM dbo.Posts | |
WHERE Tags LIKE '%<progress-reports>%' | |
ORDER BY Score DESC; | |
--75555 logical reads | |
SET STATISTICS IO OFF; | |
/* | |
The more a certain tag is used, the quicker the query *might* become. | |
It depends(TM) on the distribution of the tags in the dataset, if a tag is often used on downvoted posts we need to scan the index longer to find the relevant posts | |
2. Could you change the query to perform better? | |
Let's try: There a quite a few rows where there are no tags at all, so we can start by filtering those out. | |
*/ | |
SET STATISTICS IO ON; | |
SELECT TOP 100 * | |
FROM dbo.Posts | |
WHERE Tags LIKE '%<sql-server>%' | |
AND Tags IS NOT NULL | |
ORDER BY Score DESC; | |
--546 logical reads | |
SELECT TOP 100 * | |
FROM dbo.Posts | |
WHERE Tags LIKE '%<progress-reports>%' | |
AND Tags IS NOT NULL | |
ORDER BY Score DESC; | |
--75555 logical reads | |
SET STATISTICS IO OFF; | |
/* | |
Okay, the query didn't benefit from the WHERE clause, because the leading column in the index is Score and we need to scan the whole index anyways. | |
So the index created by our lovely DBA isn't helping us in this case, and flipping the fields around will not help as well, as it will force SORTs into our plans because of the ORDER BY. | |
Let's make an extra table for storing the tags, however if we want to go this route we need to modify the app code as well to use this table, unless we do some shenanigans with the posts table and views... | |
*/ | |
DROP TABLE IF EXISTS dbo.PostTags; | |
CREATE TABLE dbo.PostTags | |
( | |
PostId int NOT NULL | |
,Tag nvarchar(50) NOT NULL | |
,PRIMARY KEY CLUSTERED (Tag ASC, PostId ASC) | |
); | |
INSERT INTO dbo.PostTags | |
SELECT PostId = p.Id, Tag = v.[value] | |
FROM dbo.Posts AS p | |
CROSS APPLY string_split(REPLACE(p.Tags, N'><', N'>;<'), N';') AS v | |
WHERE Tags IS NOT NULL; | |
SET STATISTICS IO ON; | |
SELECT TOP 100 * | |
FROM dbo.Posts AS p | |
WHERE EXISTS | |
( | |
SELECT 1/0 | |
FROM dbo.PostTags AS pt | |
WHERE pt.Tag = N'<sql-server>' | |
AND pt.PostId = p.Id | |
) | |
ORDER BY p.Score DESC; | |
--73111 + 104852 logical reads | |
SELECT TOP 100 * | |
FROM dbo.Posts AS p | |
WHERE EXISTS | |
( | |
SELECT 1/0 | |
FROM dbo.PostTags AS pt | |
WHERE pt.Tag = N'<progress-reports>' | |
AND pt.PostId = p.Id | |
) | |
ORDER BY p.Score DESC; | |
--6 + 8 logical reads | |
SET STATISTICS IO OFF; | |
DROP TABLE IF EXISTS dbo.PostTags; | |
GO | |
/* | |
Okay, that fixed the outlier query, but made the non-outlier query so much worse. | |
So let's modify the with the score field this time ... More denormalization \o/ | |
*/ | |
DROP TABLE IF EXISTS dbo.PostTags; | |
CREATE TABLE dbo.PostTags | |
( | |
PostId int NOT NULL | |
,Tag nvarchar(50) NOT NULL | |
,Score int NOT NULL | |
,PRIMARY KEY CLUSTERED (Score ASC, Tag ASC, PostId ASC) | |
); | |
INSERT INTO dbo.PostTags | |
SELECT PostId = p.Id, Tag = v.[value], Score = p.Score | |
FROM dbo.Posts AS p | |
CROSS APPLY string_split(REPLACE(p.Tags, N'><', N'>;<'), N';') AS v | |
WHERE Tags IS NOT NULL; | |
CREATE INDEX idx_PostTags ON dbo.PostTags (Tag ASC); | |
SET STATISTICS IO ON; | |
SELECT TOP 100 * | |
FROM dbo.Posts AS p | |
CROSS APPLY | |
( | |
SELECT pt.Score | |
FROM dbo.PostTags AS pt | |
WHERE pt.Tag = N'<sql-server>' | |
AND pt.PostId = p.Id | |
) AS pt | |
ORDER BY pt.Score DESC; | |
--415 + 4 logical reads | |
SELECT TOP 100 * | |
FROM dbo.Posts AS p | |
CROSS APPLY | |
( | |
SELECT pt.Score | |
FROM dbo.PostTags AS pt | |
WHERE pt.Tag = N'<progress-reports>' | |
AND pt.PostId = p.Id | |
) AS pt | |
ORDER BY p.Score DESC; | |
--5 + 5 logical reads | |
SET STATISTICS IO OFF; | |
DROP TABLE IF EXISTS dbo.PostTags; | |
GO | |
/* | |
That fixed up both the outlier and the non-outlier query, with ... quite a "few" database changes that will also require app changes to handle. | |
The bussiness users will be pleased because great response times, but the application team will not like this solution as it is not very fault tolerate and requires significate changes to the app code. | |
The database team might not like the extra denormalization, but since it is mostly the application code to handle it, they will go "it's not our problem". | |
3. Could you change the indexes to perform better, without changing the table structure? | |
A simple solution would be to filter the index made by our DBA | |
*/ | |
DROP INDEX IF EXISTS Score_Tags ON dbo.Posts; | |
CREATE INDEX Score_Tags ON dbo.Posts(Score, Tags) WHERE Tags IS NOT NULL; | |
/* Note: Could also be PostTypeId = 1 */ | |
SET STATISTICS IO ON; | |
SELECT TOP 100 * | |
FROM dbo.Posts | |
WHERE Tags LIKE '%<sql-server>%' | |
AND Tags IS NOT NULL | |
ORDER BY Score DESC; | |
--1817 logical reads | |
SELECT TOP 100 * | |
FROM dbo.Posts | |
WHERE Tags LIKE '%<progress-reports>%' | |
AND Tags IS NOT NULL | |
ORDER BY Score DESC; | |
--56228 logical reads | |
SET STATISTICS IO OFF; | |
/* | |
That knocked some logical read off from the query, but nothing substantial | |
Time to get creative, let's try an indexed view where we split the tags column into individual parts. | |
Thanks Thomas Franz for the general idea. | |
*/ | |
DROP VIEW IF EXISTS dbo.view_PostTags | |
GO | |
CREATE OR ALTER VIEW dbo.view_PostTags | |
WITH SCHEMABINDING | |
AS | |
SELECT p.Id, p.Tags, p.Score | |
,Tag1 = SUBSTRING(p.Tags, 1, CHARINDEX(N'>', p.Tags, 1)) | |
,Tag2 = CASE | |
WHEN LEN(p.Tags) - LEN(REPLACE(p.Tags, N'<', N'')) = 2 | |
THEN REVERSE(SUBSTRING(REVERSE(p.Tags), 1, CHARINDEX(N'<', REVERSE(p.Tags), 1))) | |
WHEN LEN(p.Tags) - LEN(REPLACE(p.Tags, N'<', N'')) > 2 | |
THEN SUBSTRING(p.Tags, CHARINDEX(N'>', p.Tags, 1) + 1, CHARINDEX(N'>', SUBSTRING(p.Tags, CHARINDEX(N'>', p.Tags, 1) + 1, 150))) | |
ELSE NULL | |
END | |
,Tag3 = CASE | |
WHEN LEN(p.Tags) - LEN(REPLACE(p.Tags, N'<', N'')) = 3 | |
THEN REVERSE(SUBSTRING(REVERSE(p.Tags), 1, CHARINDEX(N'<', REVERSE(p.Tags), 1))) | |
WHEN LEN(p.Tags) - LEN(REPLACE(p.Tags, N'<', N'')) > 3 | |
THEN SUBSTRING( | |
SUBSTRING(p.Tags, CHARINDEX(N'>', SUBSTRING(p.Tags, CHARINDEX(N'>', p.Tags, 1) + 1, 150), 1) + CHARINDEX(N'>', p.Tags, 1) + 1, 150) | |
,1 | |
,CHARINDEX(N'>', SUBSTRING(p.Tags, CHARINDEX(N'>', SUBSTRING(p.Tags, CHARINDEX(N'>', p.Tags, 1) + 1, 150), 1) + CHARINDEX(N'>', p.Tags, 1) + 1, 150), 1) | |
) | |
ELSE NULL | |
END | |
,Tag4 = CASE | |
WHEN LEN(p.Tags) - LEN(REPLACE(p.Tags, N'<', N'')) = 4 | |
THEN REVERSE(SUBSTRING(REVERSE(p.Tags), 1, CHARINDEX(N'<', REVERSE(p.Tags), 1))) | |
WHEN LEN(p.Tags) - LEN(REPLACE(p.Tags, N'<', N'')) > 4 | |
THEN REVERSE(SUBSTRING(REVERSE(p.Tags), CHARINDEX(N'<', REVERSE(p.Tags), 1) + 1, CHARINDEX(N'<', SUBSTRING(REVERSE(p.Tags), CHARINDEX(N'<', REVERSE(p.Tags), 1) + 1, 150)))) | |
ELSE NULL | |
END | |
,Tag5 = CASE | |
WHEN LEN(p.Tags) - LEN(REPLACE(p.Tags, N'<', N'')) = 5 | |
THEN REVERSE(SUBSTRING(REVERSE(p.Tags), 1, CHARINDEX(N'<', REVERSE(p.Tags), 1))) | |
ELSE NULL | |
END | |
FROM dbo.Posts AS p | |
WHERE p.Tags IS NOT NULL; | |
GO | |
CREATE UNIQUE CLUSTERED INDEX idx_view_PostTags_Tag1 ON view_PostTags (Tag1, Score, Id); | |
CREATE NONCLUSTERED INDEX idx_view_PostTags_Tag2 ON view_PostTags (Tag2, Score, Id); | |
CREATE NONCLUSTERED INDEX idx_view_PostTags_Tag3 ON view_PostTags (Tag3, Score, Id); | |
CREATE NONCLUSTERED INDEX idx_view_PostTags_Tag4 ON view_PostTags (Tag4, Score, Id); | |
CREATE NONCLUSTERED INDEX idx_view_PostTags_Tag5 ON view_PostTags (Tag5, Score, Id); | |
GO | |
DROP VIEW IF EXISTS dbo.view_TagsPerPost | |
GO | |
CREATE OR ALTER VIEW dbo.view_TagsPerPost | |
AS | |
SELECT Id, Score, Tag = Tag1 | |
FROM dbo.view_PostTags WITH (NOEXPAND) | |
UNION ALL | |
SELECT Id, Score, Tag = Tag2 | |
FROM dbo.view_PostTags WITH (NOEXPAND) | |
WHERE Tag2 IS NOT NULL | |
UNION ALL | |
SELECT Id, Score, Tag = Tag3 | |
FROM dbo.view_PostTags WITH (NOEXPAND) | |
WHERE Tag3 IS NOT NULL | |
UNION ALL | |
SELECT Id, Score, Tag = Tag4 | |
FROM dbo.view_PostTags WITH (NOEXPAND) | |
WHERE Tag4 IS NOT NULL | |
UNION ALL | |
SELECT Id, Score, Tag = Tag5 | |
FROM dbo.view_PostTags WITH (NOEXPAND) | |
WHERE Tag5 IS NOT NULL | |
GO | |
SET STATISTICS IO ON; | |
SELECT TOP 100 p.* | |
FROM dbo.view_TagsPerPost AS tpp | |
INNER JOIN dbo.Posts AS p | |
ON p.Id = tpp.Id | |
WHERE tpp.Tag = '<sql-server>' | |
ORDER BY tpp.Score DESC; | |
--1139 + 20807 logical reads | |
SELECT TOP 100 * | |
FROM dbo.view_TagsPerPost AS tpp | |
INNER JOIN dbo.Posts AS p | |
ON p.Id = tpp.Id | |
WHERE tpp.Tag = '<progress-reports>' | |
ORDER BY tpp.Score DESC; | |
--5 + 24 logical reads | |
SET STATISTICS IO OFF; | |
DROP VIEW IF EXISTS dbo.view_PostTags | |
GO | |
DROP VIEW IF EXISTS dbo.view_TagsPerPost | |
GO | |
/* | |
That fixed up both the outlier and the non-outlier query, with some database changes that will the app will need to do nothing with. However the query from the app will need to changed to use the view. | |
The bussiness users will be pleased because great response times. | |
For the application team it is a small and managable change, but the ways tags are recorded can't be easily changed in the future. | |
However the database team might oppose the solution, as the view is mostly a database solution and they might be the team to maintain it. | |
*/ |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment