(originally posted on blogs.msdn.com/b/sqlcat – http://bit.ly/1ut2tlu)
Reviewers: Lubor Kollar; Conor Cunningham; Steve Howard; Kun Cheng; James Podgorski; Jimmy May; Joseph Sack; Welly Lee; Neeraj Joshi; Keith Bauer
We recently assisted a global ISV to address a performance issue related to a poor performing recursive CTE (Common Table Expression). The ISV wanted the query that was running in excess of 3 minutes to run in less than 15 seconds on their servers. The end result of our efforts was a 3,600% performance improvement.
Note: Validation for this post was performed in the SQL CAT Customer Lab on an HP Proliant DL380 G7, Westmere-EP X5680 3.33GHz 2 socket, 6 physical cores, 12 logical cores for a total of 24 cores; 144GB RAM.
A CTE (http://msdn.microsoft.com/en-us/library/ms186243.aspx), amongst other things, is a very effective single statement for handling hierarchial data, and more specifically, parent-child relationships. A CTE is a reasonable choice when the intention is to iterate through the data, particularly in cases where the schema cannot be changed but recursive queries are required. If you can design a different schema layout, there are often much faster ways to execute queries over hierarchies. CTEs are also a natural choice when converting a CONNECT BY statement from alternative RDBMS platforms (http://download.oracle.com/docs/cd/B19306_01/server.102/b14200/queries003.htm).
There is a scenario where using a CTE construct is significantly less efficient than the traditional approach to traversing the Parent-Child construct. This blog will describe how and when a WHILE loop performs better than the CTE. The characteristics that can contribute to CTEs being a less than optimal choice are:
- Non-Unique Parent/Child Key
- Complex Predicates
In our scenario the ISV was attempting to extract all the items that adhered to a certain set of characteristics. An item could exist on its own or be a component (child) within a larger item (parent) and there were no constraints to the number of parents a child could have. For this scenario, the query was recursively traversing the hierarchy in a non-unique fashion.
Below is the original CTE code:
WITH CTE(ItemID, StartPt, StartID, EndPt, EndID, level)
( — Anchor member definition
SELECT anc.ItemID, anc.StartPt, anc.StartID, anc.EndPt, anc.EndID,0
FROM dbo.CTEIssue AS anc
WHERE anc.StartPt = -1512034855
AND anc.StartID = 1809069910
AND anc.Category IN(-379491138,-379518300,-379520626,472917509,-379494005,-379506005,-2134852210,-379515627,-379515134,-379484415)
— Recursive member definition
SELECT chi.ItemID, chi.StartPt, chi.StartID, chi.EndPt, chi.EndID, par.level+1
dbo.CTEIssue AS chi
CTE AS par ON(par.EndPt = chi.StartPt AND par.EndID = chi.StartID)
SELECT DISTINCT cti.*
dbo.CTEIssue AS cti
CTE AS cte ON(cti.ItemID = cte.ItemID)
WHERE (cti.EndPt <> -1512034855 OR cti.Category = -379484415)
Identifying the issue
Firstly, we understand that there are cases where an inefficient query is acceptable, provided that it meets the intended objectives. It only becomes an issue when the intended objectives are not attained. In this case, the inefficient CTE meets the expected response times for small data sets or simple predicates and only became an issue once the data set became large and the throughput increased. Another concept to appreciate is that there are cases where a slow operation for small data sets is significantly faster when dealing with larger data sets. For example, nested joins are great for small sets of data but are typically no match for hash joins when the data is large.
It’s relatively easy to recognize that using CTEs is not the most efficient approach when the statement immediately following the CTE expression relies on the DISTINCT of all the columns. Essentially this implies that the recursive logic doesn’t have a primary key in the anchor, and thereby allowing each recursive member to not be unique. This creates a scenario where a huge number of duplicate rows are generated.
Another way to recognize that a CTE is inefficient is to compare the estimated query plan statistics to the actual statistics. The below table highlights the difference between the query plan Estimates versus Actuals.
|Estimate Rows||Estimate Executions||Rows||Executes||TruncatedOperatorText (First 60 characters)|
|183.53||NULL||3||1||WITH CTE(ItemID, StartPt, StartID, EndPt, EndID, level) AS|
|35.22541||1||3||1|||–Nested Loops(Inner Join, OUTER REFERENCES:([Recr1014]|
|35.22541||1||299197||1|||–Hash Match(Aggregate, HASH:([Recr1014]), RESIDUAL:|
|1240.829||1||4129317||1||| |–Index Spool(WITH STACK)|
|1||1241.829||0||0||| |–Compute Scalar(DEFINE:([Expr1021]=(|
|1237.773||1||0||0||| | |–Compute Scalar(DEFINE:([Expr10|
|1237.773||1||1723||1||| | |–Nested Loops(Inner Join,|
|1237.773||1||1723||1||| | |–Index Seek(OBJECT:([|
|1||1237.773||1723||1723||| | |–Clustered Index Seek|
|1.002469||1241.829||4127594||1||| |–Assert(WHERE:(CASE WHEN [Expr1023]>|
|1.002469||1241.829||4127594||1||| |–Nested Loops(Inner Join, OUTER|
|1||1241.829||0||0||| |–Compute Scalar(DEFINE:([E|
|1||1241.829||4129317||1||| | |–Table Spool(WITH STA|
|3.055948||1240.829||0||0||| |–Compute Scalar(DEFINE:([E|
|3.055948||1240.829||4127594||4129317||| |–Nested Loops(Inner J|
|3.055948||1240.829||4127594||4129317||| |–Index Seek(OBJE|
|1||3791.91||4127594||4127594||| |–Clustered Index|
|1||35.22541||3||299197|||–Clustered Index Seek(OBJECT:([MCC2].[dbo].[CTEIssu|
As you can see – in some cases the estimated rows versus actual rows were significantly skewed.
Improving the performance
The objective of that query was to identify unique rows and remove them from the recursive iteration. If this can be done in the CTE, then that would be ideal, otherwise a more performant alternative is to convert the query into WHILE loop. The CTE is far more efficient than WHILE loop if there is a unique parent/child key. However let me emphasize that when you have non-unique parent/child keys, estimations used to create the query plan is likely to be significantly skewed.
Below is the code that meets the objective of been more restrictive during each iteration. It is followed by an explanation of the reasoning behind the construction of the query, and why it performs better:
SET NOCOUNT ON DECLARE @StartPt int = -1512034855; DECLARE @StartID int = 1809069910; DECLARE @Counter tinyint = 0; DECLARE @MaxRecursion tinyint = 50; DECLARE @CategoryList table(Category int NOT NULL) DECLARE @PtIDPair table ( Pt int NOT NULL ,ID int NOT NULL ,lvl tinyint NOT NULL ,PRIMARY KEY CLUSTERED (lvl, Pt, ID) ) CREATE TABLE #Item ( ItemID int NOT NULL ,Category int NOT NULL ,StartPt int NOT NULL ,StartID int NOT NULL ,EndPt int NOT NULL ,EndID int NOT NULL ,lvl tinyint NOT NULL ) CREATE CLUSTERED INDEX CDX_#Item_Lvl ON #Item ([lvl]) CREATE NONCLUSTERED INDEX NDX_#Item_ItemID ON #Item (ItemID) INSERT INTO @CategoryList VALUES (-379491138) ,(-379518300) ,(-379520626) ,(472917509) ,(-379494005) ,(-379506005) ,(-2134852210) ,(-379515627) ,(-379515134) ,(-379484415); --Gather the anchors -- There are a set of ItemID values, which happens to contain a common -- Pt/Id combination. These are level 0 (Anchors) INSERT INTO #Item (ItemID, Category, StartPt, StartID, EndPt, EndID, lvl) SELECT ItemID, cti.Category, StartPt, StartID, EndPt, EndID, @Counter FROM dbo.CTEIssue AS cti INNER JOIN @CategoryList AS lst ON(lst.Category = cti.Category) WHERE StartPt = @StartPt AND StartID = @StartID -- We want to remove the duplicates, from the final result set, so we do our filtering to create a unique set of anchor pairs INSERT INTO @PtIDPair(Pt, ID, lvl) SELECT EndPt, EndID, @Counter FROM #Item WHERE lvl = @Counter GROUP BY EndPt, EndID WHILE @@ROWCOUNT > 0 BEGIN SET @Counter += 1 IF @Counter = 50 BEGIN RAISERROR('Max Recursion reached', 16, 1) BREAK END INSERT INTO #Item(ItemID, Category, StartPt, StartID, EndPt, EndID, lvl) SELECT cti.ItemID, cti.Category, cti.StartPt, cti.StartID, cti.EndPt, cti.EndID, @Counter FROM dbo.CTEIssue AS cti INNER JOIN @CategoryList AS lst ON(lst.Category = cti.Category) INNER JOIN @PtIDPair AS pr ON(cti.StartPt = pr.Pt AND cti.StartID = pr.ID) LEFT OUTER JOIN #Item AS itm ON(itm.ItemID = cti.ItemID) WHERE itm.ItemID IS NULL AND pr.lvl = @Counter - 1 INSERT INTO @PtIDPair(Pt, ID, lvl) SELECT itm.EndPt, itm.EndID, @Counter FROM #Item AS itm LEFT OUTER JOIN @PtIDPair AS pr ON(itm.EndPt = pr.Pt AND itm.EndID = pr.ID) WHERE pr.ID IS NULL AND itm.lvl = @Counter GROUP BY itm.EndPt, itm.EndID END SELECT cti.* FROM dbo.CTEIssue AS cti INNER JOIN #Item AS itm ON(cti.ItemID = itm.ItemID) WHERE cti.EndPt <> -1512034855 OR cti.Category = -379484415 GO
Side Note: Our investigations revealed that in this case table variables significantly outperformed local temp tables.
The first step was to declare variables and create a temporary table to store the equivalent of the CTE result set. A table variable was used to store the unique JOIN clause rows generated during the iterations. Just before entering the WHILE…END, we created the first record (a.k.a the Anchor Member). In the WHILE…END loop we found all the children based on the JOIN criteria, further filtered by the predicates, and inserted them into the list of unique records. The next iteration would not collect the same children again. This was the most important distinction.
Improving the CTE Query Plan
Compared to the CTE, the WHILE loop is a fair amount of coding. Prior to re-writing the CTE as a WHILE loop, we attempted to optimize the query plan itself, within the constraint of not being able to generate unique values in the join. We modified the CTE to use a table variable versus using an IN clause. We created indexes and updated the statistics, and achieved a 50% performance improvement, but unfortunately and more importantly missing the objective of improving by at least a factor of 12.
Below are the STATISTICS IO and TIME demonstrating the improvement of adding the TVP and indexes:
(3 row(s) affected)
Table ‘CTEIssue’. Scan count 41,592,377, logical reads 137,303,190, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘Worktable’. Scan count 2, logical reads 24,576,797, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times: CPU time = 209197ms, elapsed time = 217344ms.
(3 row(s) affected)
Table ‘CTEIssue’. Scan count 4,428,524, logical reads 41,930,102, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘Worktable’. Scan count 2, logical reads 24,575,641, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘#6EF57B66’. Scan count 1, logical reads 19,007,602, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times: CPU time = 143646ms, elapsed time = 151268ms.
Side Note: Table variables do not maintain statistics and are generally not recommended when there are cost-based plan choices that can result in significant performance variability; one would typically leverage a local temporary table in that case. We used a TVP here because the ISV application was creating an IN clause that could reach hundreds of values, and a table variable data type was more robust, and in some cases more efficient.
When we compare the query plans between the CTE and the WHILE loop, we realized that each iteration of the WHILE loop, has the luxury of a different query plan. This is unlike the CTE which only executed against one query plan.
A significant contribution to the long running nature of the CTE query is this portion of the query plan that estimates less than 10 rows on the outer and inner query, but ends up having to iterate through millions of rows on both sides of the query. For a table that only has a few hundred thousand rows, this is a function of the lack of unique join criteria.
The Common Table Expression (CTE) is a powerful syntax in its ability to address the historically challenging objective of recursive queries using T-SQL. It is a natural and often times appropriate syntax when migrating from the CONNECT BY syntax. Recursive CTE queries do have a reliance on the unique parent/child keys in order to get the best performance. If this is not possible to achieve, then a WHILE loop is potentially a much more efficient approach to handling the recursive query.
Our final results completed this frequently run query within 4 seconds from 144 seconds, a 3600% performance improvement!!!