Skip to content

Instantly share code, notes, and snippets.

@Montro1981
Created January 31, 2024 10:00
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save Montro1981/c26bb6c7ca0992416f6c351e9db44da9 to your computer and use it in GitHub Desktop.
Save Montro1981/c26bb6c7ca0992416f6c351e9db44da9 to your computer and use it in GitHub Desktop.
Brent Ozar Query Exercise: Find Tagged Questions Faster
/*
1. What kinds of tags will perform worse than others for this query?
Since the LIKE it introduced with a double percent-sign '%%', it becomes non-sargable (as opposed to having a the wildcard at the end only 'x%') and the Engine will have to scan every row in the index to see the contents of the Tags column to find the correct rows up to the number in the TOP.
When seaching for tags that are not used often, like '<progress-reports>' that has been used once in the SO2013 database, with perform worse as the whole index will be scanned.
To illustrate, seaching for <sql-server> scans 17488 rows (0.1%) with 546 logical reads, and seaching for <progress-reports> scans 17142169 rows (100.0%) with 75555 logical reads.
*/
/* Fact finding query to discover all distinct Tags */
SELECT v.[value], COUNT(*)
FROM dbo.Posts
CROSS APPLY string_split(REPLACE(Tags, N'><', N'>;<'), N';') AS v
WHERE Tags IS NOT NULL
GROUP BY v.[value]
ORDER BY COUNT(*) ASC;
GO
DROP INDEX IF EXISTS Score_Tags ON dbo.Posts;
CREATE INDEX Score_Tags ON dbo.Posts(Score, Tags);
GO
SET STATISTICS IO ON;
SELECT TOP 100 *
FROM dbo.Posts
WHERE Tags LIKE '%<sql-server>%'
ORDER BY Score DESC;
--546 logical reads
SELECT TOP 100 *
FROM dbo.Posts
WHERE Tags LIKE '%<progress-reports>%'
ORDER BY Score DESC;
--75555 logical reads
SET STATISTICS IO OFF;
/*
The more a certain tag is used, the quicker the query *might* become.
It depends(TM) on the distribution of the tags in the dataset, if a tag is often used on downvoted posts we need to scan the index longer to find the relevant posts
2. Could you change the query to perform better?
Let's try: There a quite a few rows where there are no tags at all, so we can start by filtering those out.
*/
SET STATISTICS IO ON;
SELECT TOP 100 *
FROM dbo.Posts
WHERE Tags LIKE '%<sql-server>%'
AND Tags IS NOT NULL
ORDER BY Score DESC;
--546 logical reads
SELECT TOP 100 *
FROM dbo.Posts
WHERE Tags LIKE '%<progress-reports>%'
AND Tags IS NOT NULL
ORDER BY Score DESC;
--75555 logical reads
SET STATISTICS IO OFF;
/*
Okay, the query didn't benefit from the WHERE clause, because the leading column in the index is Score and we need to scan the whole index anyways.
So the index created by our lovely DBA isn't helping us in this case, and flipping the fields around will not help as well, as it will force SORTs into our plans because of the ORDER BY.
Let's make an extra table for storing the tags, however if we want to go this route we need to modify the app code as well to use this table, unless we do some shenanigans with the posts table and views...
*/
DROP TABLE IF EXISTS dbo.PostTags;
CREATE TABLE dbo.PostTags
(
PostId int NOT NULL
,Tag nvarchar(50) NOT NULL
,PRIMARY KEY CLUSTERED (Tag ASC, PostId ASC)
);
INSERT INTO dbo.PostTags
SELECT PostId = p.Id, Tag = v.[value]
FROM dbo.Posts AS p
CROSS APPLY string_split(REPLACE(p.Tags, N'><', N'>;<'), N';') AS v
WHERE Tags IS NOT NULL;
SET STATISTICS IO ON;
SELECT TOP 100 *
FROM dbo.Posts AS p
WHERE EXISTS
(
SELECT 1/0
FROM dbo.PostTags AS pt
WHERE pt.Tag = N'<sql-server>'
AND pt.PostId = p.Id
)
ORDER BY p.Score DESC;
--73111 + 104852 logical reads
SELECT TOP 100 *
FROM dbo.Posts AS p
WHERE EXISTS
(
SELECT 1/0
FROM dbo.PostTags AS pt
WHERE pt.Tag = N'<progress-reports>'
AND pt.PostId = p.Id
)
ORDER BY p.Score DESC;
--6 + 8 logical reads
SET STATISTICS IO OFF;
DROP TABLE IF EXISTS dbo.PostTags;
GO
/*
Okay, that fixed the outlier query, but made the non-outlier query so much worse.
So let's modify the with the score field this time ... More denormalization \o/
*/
DROP TABLE IF EXISTS dbo.PostTags;
CREATE TABLE dbo.PostTags
(
PostId int NOT NULL
,Tag nvarchar(50) NOT NULL
,Score int NOT NULL
,PRIMARY KEY CLUSTERED (Score ASC, Tag ASC, PostId ASC)
);
INSERT INTO dbo.PostTags
SELECT PostId = p.Id, Tag = v.[value], Score = p.Score
FROM dbo.Posts AS p
CROSS APPLY string_split(REPLACE(p.Tags, N'><', N'>;<'), N';') AS v
WHERE Tags IS NOT NULL;
CREATE INDEX idx_PostTags ON dbo.PostTags (Tag ASC);
SET STATISTICS IO ON;
SELECT TOP 100 *
FROM dbo.Posts AS p
CROSS APPLY
(
SELECT pt.Score
FROM dbo.PostTags AS pt
WHERE pt.Tag = N'<sql-server>'
AND pt.PostId = p.Id
) AS pt
ORDER BY pt.Score DESC;
--415 + 4 logical reads
SELECT TOP 100 *
FROM dbo.Posts AS p
CROSS APPLY
(
SELECT pt.Score
FROM dbo.PostTags AS pt
WHERE pt.Tag = N'<progress-reports>'
AND pt.PostId = p.Id
) AS pt
ORDER BY p.Score DESC;
--5 + 5 logical reads
SET STATISTICS IO OFF;
DROP TABLE IF EXISTS dbo.PostTags;
GO
/*
That fixed up both the outlier and the non-outlier query, with ... quite a "few" database changes that will also require app changes to handle.
The bussiness users will be pleased because great response times, but the application team will not like this solution as it is not very fault tolerate and requires significate changes to the app code.
The database team might not like the extra denormalization, but since it is mostly the application code to handle it, they will go "it's not our problem".
3. Could you change the indexes to perform better, without changing the table structure?
A simple solution would be to filter the index made by our DBA
*/
DROP INDEX IF EXISTS Score_Tags ON dbo.Posts;
CREATE INDEX Score_Tags ON dbo.Posts(Score, Tags) WHERE Tags IS NOT NULL;
/* Note: Could also be PostTypeId = 1 */
SET STATISTICS IO ON;
SELECT TOP 100 *
FROM dbo.Posts
WHERE Tags LIKE '%<sql-server>%'
AND Tags IS NOT NULL
ORDER BY Score DESC;
--1817 logical reads
SELECT TOP 100 *
FROM dbo.Posts
WHERE Tags LIKE '%<progress-reports>%'
AND Tags IS NOT NULL
ORDER BY Score DESC;
--56228 logical reads
SET STATISTICS IO OFF;
/*
That knocked some logical read off from the query, but nothing substantial
Time to get creative, let's try an indexed view where we split the tags column into individual parts.
Thanks Thomas Franz for the general idea.
*/
DROP VIEW IF EXISTS dbo.view_PostTags
GO
CREATE OR ALTER VIEW dbo.view_PostTags
WITH SCHEMABINDING
AS
SELECT p.Id, p.Tags, p.Score
,Tag1 = SUBSTRING(p.Tags, 1, CHARINDEX(N'>', p.Tags, 1))
,Tag2 = CASE
WHEN LEN(p.Tags) - LEN(REPLACE(p.Tags, N'<', N'')) = 2
THEN REVERSE(SUBSTRING(REVERSE(p.Tags), 1, CHARINDEX(N'<', REVERSE(p.Tags), 1)))
WHEN LEN(p.Tags) - LEN(REPLACE(p.Tags, N'<', N'')) > 2
THEN SUBSTRING(p.Tags, CHARINDEX(N'>', p.Tags, 1) + 1, CHARINDEX(N'>', SUBSTRING(p.Tags, CHARINDEX(N'>', p.Tags, 1) + 1, 150)))
ELSE NULL
END
,Tag3 = CASE
WHEN LEN(p.Tags) - LEN(REPLACE(p.Tags, N'<', N'')) = 3
THEN REVERSE(SUBSTRING(REVERSE(p.Tags), 1, CHARINDEX(N'<', REVERSE(p.Tags), 1)))
WHEN LEN(p.Tags) - LEN(REPLACE(p.Tags, N'<', N'')) > 3
THEN SUBSTRING(
SUBSTRING(p.Tags, CHARINDEX(N'>', SUBSTRING(p.Tags, CHARINDEX(N'>', p.Tags, 1) + 1, 150), 1) + CHARINDEX(N'>', p.Tags, 1) + 1, 150)
,1
,CHARINDEX(N'>', SUBSTRING(p.Tags, CHARINDEX(N'>', SUBSTRING(p.Tags, CHARINDEX(N'>', p.Tags, 1) + 1, 150), 1) + CHARINDEX(N'>', p.Tags, 1) + 1, 150), 1)
)
ELSE NULL
END
,Tag4 = CASE
WHEN LEN(p.Tags) - LEN(REPLACE(p.Tags, N'<', N'')) = 4
THEN REVERSE(SUBSTRING(REVERSE(p.Tags), 1, CHARINDEX(N'<', REVERSE(p.Tags), 1)))
WHEN LEN(p.Tags) - LEN(REPLACE(p.Tags, N'<', N'')) > 4
THEN REVERSE(SUBSTRING(REVERSE(p.Tags), CHARINDEX(N'<', REVERSE(p.Tags), 1) + 1, CHARINDEX(N'<', SUBSTRING(REVERSE(p.Tags), CHARINDEX(N'<', REVERSE(p.Tags), 1) + 1, 150))))
ELSE NULL
END
,Tag5 = CASE
WHEN LEN(p.Tags) - LEN(REPLACE(p.Tags, N'<', N'')) = 5
THEN REVERSE(SUBSTRING(REVERSE(p.Tags), 1, CHARINDEX(N'<', REVERSE(p.Tags), 1)))
ELSE NULL
END
FROM dbo.Posts AS p
WHERE p.Tags IS NOT NULL;
GO
CREATE UNIQUE CLUSTERED INDEX idx_view_PostTags_Tag1 ON view_PostTags (Tag1, Score, Id);
CREATE NONCLUSTERED INDEX idx_view_PostTags_Tag2 ON view_PostTags (Tag2, Score, Id);
CREATE NONCLUSTERED INDEX idx_view_PostTags_Tag3 ON view_PostTags (Tag3, Score, Id);
CREATE NONCLUSTERED INDEX idx_view_PostTags_Tag4 ON view_PostTags (Tag4, Score, Id);
CREATE NONCLUSTERED INDEX idx_view_PostTags_Tag5 ON view_PostTags (Tag5, Score, Id);
GO
DROP VIEW IF EXISTS dbo.view_TagsPerPost
GO
CREATE OR ALTER VIEW dbo.view_TagsPerPost
AS
SELECT Id, Score, Tag = Tag1
FROM dbo.view_PostTags WITH (NOEXPAND)
UNION ALL
SELECT Id, Score, Tag = Tag2
FROM dbo.view_PostTags WITH (NOEXPAND)
WHERE Tag2 IS NOT NULL
UNION ALL
SELECT Id, Score, Tag = Tag3
FROM dbo.view_PostTags WITH (NOEXPAND)
WHERE Tag3 IS NOT NULL
UNION ALL
SELECT Id, Score, Tag = Tag4
FROM dbo.view_PostTags WITH (NOEXPAND)
WHERE Tag4 IS NOT NULL
UNION ALL
SELECT Id, Score, Tag = Tag5
FROM dbo.view_PostTags WITH (NOEXPAND)
WHERE Tag5 IS NOT NULL
GO
SET STATISTICS IO ON;
SELECT TOP 100 p.*
FROM dbo.view_TagsPerPost AS tpp
INNER JOIN dbo.Posts AS p
ON p.Id = tpp.Id
WHERE tpp.Tag = '<sql-server>'
ORDER BY tpp.Score DESC;
--1139 + 20807 logical reads
SELECT TOP 100 *
FROM dbo.view_TagsPerPost AS tpp
INNER JOIN dbo.Posts AS p
ON p.Id = tpp.Id
WHERE tpp.Tag = '<progress-reports>'
ORDER BY tpp.Score DESC;
--5 + 24 logical reads
SET STATISTICS IO OFF;
DROP VIEW IF EXISTS dbo.view_PostTags
GO
DROP VIEW IF EXISTS dbo.view_TagsPerPost
GO
/*
That fixed up both the outlier and the non-outlier query, with some database changes that will the app will need to do nothing with. However the query from the app will need to changed to use the view.
The bussiness users will be pleased because great response times.
For the application team it is a small and managable change, but the ways tags are recorded can't be easily changed in the future.
However the database team might oppose the solution, as the view is mostly a database solution and they might be the team to maintain it.
*/
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment