Optimize Recursive CTE Query

(originally posted on blogs.msdn.com/b/sqlcat – http://bit.ly/1ut2tlu)


Reviewers: Lubor Kollar; Conor Cunningham; Steve Howard; Kun Cheng; James Podgorski; Jimmy May; Joseph Sack; Welly Lee; Neeraj Joshi; Keith Bauer


We recently assisted a global ISV in addressing a performance issue caused by a poorly performing recursive CTE (Common Table Expression). The ISV needed a query that was running in excess of 3 minutes to complete in less than 15 seconds on their servers. The end result of our efforts was a 3,600% performance improvement.

Note: Validation for this post was performed in the SQL CAT Customer Lab on an HP ProLiant DL380 G7 (Westmere-EP X5680 3.33GHz, 2 sockets, 6 physical cores and 12 logical cores per socket, for a total of 24 logical cores) with 144GB RAM.

A CTE (http://msdn.microsoft.com/en-us/library/ms186243.aspx) is, amongst other things, a very effective single-statement construct for handling hierarchical data, and more specifically parent-child relationships. A CTE is a reasonable choice when the intention is to iterate through the data, particularly in cases where the schema cannot be changed but recursive queries are required. If you can design a different schema layout, there are often much faster ways to execute queries over hierarchies. CTEs are also a natural choice when converting a CONNECT BY statement from alternative RDBMS platforms (http://download.oracle.com/docs/cd/B19306_01/server.102/b14200/queries003.htm).
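As a minimal illustration of the pattern (the table and column names below are hypothetical, not from the ISV schema), a recursive CTE that walks a parent-child hierarchy has an anchor member for the roots and a recursive member that joins each level back to the previous one:

```sql
-- Hypothetical parts hierarchy: each row links a child item to its parent.
CREATE TABLE dbo.ItemHierarchy
(
    ChildID   int NOT NULL
    ,ParentID int NULL      -- NULL marks a top-level (root) item
);

WITH Hierarchy (ChildID, ParentID, lvl)
AS
(
    -- Anchor member: the root items
    SELECT ChildID, ParentID, 0
    FROM dbo.ItemHierarchy
    WHERE ParentID IS NULL

    UNION ALL

    -- Recursive member: join children back to the previous level
    SELECT h.ChildID, h.ParentID, p.lvl + 1
    FROM dbo.ItemHierarchy AS h
    INNER JOIN Hierarchy AS p ON h.ParentID = p.ChildID
)
SELECT ChildID, ParentID, lvl
FROM Hierarchy
OPTION (MAXRECURSION 100);  -- guard against runaway recursion (100 is the default)
```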

There are scenarios, however, where a CTE is significantly less efficient than the traditional approach to traversing a parent-child structure. This blog will describe how and when a WHILE loop performs better than the CTE. The characteristics that can make a CTE a less-than-optimal choice are:

  • Non-Unique Parent/Child Key
  • Complex Predicates

In our scenario the ISV was attempting to extract all the items that adhered to a certain set of characteristics. An item could exist on its own or be a component (child) within a larger item (parent) and there were no constraints to the number of parents a child could have. For this scenario, the query was recursively traversing the hierarchy in a non-unique fashion.

Below is the original CTE code:

WITH CTE(ItemID, StartPt, StartID, EndPt, EndID, level)
AS
(    -- Anchor member definition
SELECT anc.ItemID, anc.StartPt, anc.StartID, anc.EndPt, anc.EndID, 0
FROM dbo.CTEIssue AS anc
WHERE anc.StartPt = -1512034855
AND anc.StartID = 1809069910
AND anc.Category IN(-379491138,-379518300,-379520626,472917509,-379494005,-379506005,-2134852210,-379515627,-379515134,-379484415)
UNION ALL
-- Recursive member definition
SELECT chi.ItemID, chi.StartPt, chi.StartID, chi.EndPt, chi.EndID, par.level+1
FROM dbo.CTEIssue AS chi
INNER JOIN CTE AS par ON(par.EndPt = chi.StartPt AND par.EndID = chi.StartID)
WHERE chi.Category IN(-379491138,-379518300,-379520626,472917509,-379494005,-379506005,-2134852210,-379515627,-379515134,-379484415)
)
SELECT DISTINCT cti.ItemID
FROM dbo.CTEIssue AS cti
INNER JOIN CTE AS cte ON(cti.ItemID = cte.ItemID)
WHERE (cti.EndPt <> -1512034855 OR cti.Category = -379484415)

Identifying the issue

Firstly, we understand that there are cases where an inefficient query is acceptable, provided that it meets the intended objectives. It only becomes an issue when those objectives are not attained. In this case, the inefficient CTE met the expected response times for small data sets or simple predicates, and only became an issue once the data set became large and the throughput increased. Another concept to appreciate is that an operation that is slow for small data sets can be significantly faster for larger data sets, and vice versa. For example, nested loop joins are great for small sets of data but are typically no match for hash joins when the data is large.
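You can observe this trade-off for yourself by forcing each physical join type with a hint and comparing the plans and timings (the query below is a generic sketch over hypothetical Orders/Customers tables, not the ISV query):

```sql
-- Compare physical join strategies on the same logical query.
SET STATISTICS TIME ON;

-- Nested loops: cheap when the outer input is small.
SELECT o.OrderID, c.CustomerName
FROM dbo.Orders AS o
INNER LOOP JOIN dbo.Customers AS c ON c.CustomerID = o.CustomerID;

-- Hash match: typically wins once both inputs are large.
SELECT o.OrderID, c.CustomerName
FROM dbo.Orders AS o
INNER HASH JOIN dbo.Customers AS c ON c.CustomerID = o.CustomerID;
```

Join hints like these are for investigation only; in production you would normally let the optimizer choose.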

It’s relatively easy to recognize that a CTE is not the most efficient approach when the statement immediately following the CTE expression relies on a DISTINCT over all the columns. Essentially this implies that the recursive logic has no primary key in the anchor, thereby allowing recursive members to repeat. This creates a scenario where a huge number of duplicate rows are generated.
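A contrived repro makes the problem visible: when a child can have more than one parent, every recursive step multiplies the paths, and only a trailing DISTINCT collapses them, after all the work is done (the table and values below are hypothetical):

```sql
-- A child (3) with two parents (1 and 2): the join key is not unique.
CREATE TABLE dbo.Edge (ParentID int NOT NULL, ChildID int NOT NULL);
INSERT INTO dbo.Edge VALUES (1, 3), (2, 3), (3, 4), (3, 5);

WITH Walk (ChildID)
AS
(
    SELECT ChildID FROM dbo.Edge WHERE ParentID IN (1, 2)  -- anchors: 3, 3
    UNION ALL
    SELECT e.ChildID
    FROM dbo.Edge AS e
    INNER JOIN Walk AS w ON e.ParentID = w.ChildID         -- both copies of 3 expand
)
SELECT ChildID FROM Walk;
-- Six rows come back (3, 4 and 5 each appear twice): every duplicate path
-- was traversed. With deeper hierarchies the duplication grows multiplicatively.
-- SELECT DISTINCT ChildID FROM Walk;  -- collapses them, but only at the end
```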

Another way to recognize that a CTE is inefficient is to compare the estimated query plan statistics to the actual statistics. The below table highlights the difference between the query plan Estimates versus Actuals.

EstimateRows  EstimateExecutions  Rows  Executes  Truncated operator text (first 60 characters)
183.53 NULL 3 1 WITH CTE(ItemID, StartPt, StartID, EndPt, EndID, level) AS
35.22541 1 3 1 |–Nested Loops(Inner Join, OUTER REFERENCES:([Recr1014]
35.22541 1 299197 1 |–Hash Match(Aggregate, HASH:([Recr1014]), RESIDUAL:
1240.829 1 4129317 1 | |–Index Spool(WITH STACK)
1240.829 1 4129317 1 | |–Concatenation
1 1241.829 0 0 | |–Compute Scalar(DEFINE:([Expr1021]=(
1237.773 1 0 0 | | |–Compute Scalar(DEFINE:([Expr10
1237.773 1 1723 1 | | |–Nested Loops(Inner Join,
1237.773 1 1723 1 | | |–Index Seek(OBJECT:([
1 1237.773 1723 1723 | | |–Clustered Index Seek
1.002469 1241.829 4127594 1 | |–Assert(WHERE:(CASE WHEN [Expr1023]>
1.002469 1241.829 4127594 1 | |–Nested Loops(Inner Join, OUTER
1 1241.829 0 0 | |–Compute Scalar(DEFINE:([E
1 1241.829 4129317 1 | | |–Table Spool(WITH STA
3.055948 1240.829 0 0 | |–Compute Scalar(DEFINE:([E
3.055948 1240.829 4127594 4129317 | |–Nested Loops(Inner J
3.055948 1240.829 4127594 4129317 | |–Index Seek(OBJE
1 3791.91 4127594 4127594 | |–Clustered Index
1 35.22541 3 299197 |–Clustered Index Seek(OBJECT:([MCC2].[dbo].[CTEIssu

As you can see, in some cases the estimated row counts were significantly skewed from the actual row counts.
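The table above comes from the per-operator counters SQL Server emits when a statement is run with SET STATISTICS PROFILE enabled; comparing EstimateRows and EstimateExecutions against Rows and Executes for each operator is a quick skew check:

```sql
-- Emit the actual plan with per-operator row counts alongside the estimates.
SET STATISTICS PROFILE ON;

-- ... run the recursive CTE query here ...

SET STATISTICS PROFILE OFF;
-- The result grid gains columns including:
--   EstimateRows, EstimateExecutions  (the optimizer's guesses)
--   Rows, Executes                    (what actually happened)
-- Large gaps between the two flag the operators worth investigating.
```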

Improving the performance

The objective of the query was to identify unique rows and remove them from the recursive iteration. If this can be done within the CTE, that is ideal; otherwise, a more performant alternative is to convert the query into a WHILE loop. The CTE is far more efficient than a WHILE loop when there is a unique parent/child key. However, let me emphasize that when you have non-unique parent/child keys, the estimates used to create the query plan are likely to be significantly skewed.

Below is the code that meets the objective of being more restrictive during each iteration. It is followed by an explanation of the reasoning behind the construction of the query, and why it performs better:


DECLARE @StartPt int = -1512034855;
DECLARE @StartID int = 1809069910;
DECLARE @Counter tinyint = 0;
DECLARE @MaxRecursion tinyint = 50;

DECLARE @CategoryList table(Category int NOT NULL)

DECLARE @PtIDPair table
(
    Pt      int     NOT NULL
    ,ID     int     NOT NULL
    ,lvl    tinyint NOT NULL
)

CREATE TABLE #Item
(
    ItemID      int     NOT NULL
    ,Category   int     NOT NULL
    ,StartPt    int     NOT NULL
    ,StartID    int     NOT NULL
    ,EndPt      int     NOT NULL
    ,EndID      int     NOT NULL
    ,lvl        tinyint NOT NULL
)

INSERT INTO @CategoryList(Category)
VALUES(-379491138),(-379518300),(-379520626),(472917509),(-379494005),
      (-379506005),(-2134852210),(-379515627),(-379515134),(-379484415)

-- Gather the anchors
-- There are a set of ItemID values, which happen to contain a common
-- Pt/ID combination.  These are level 0 (anchors)
INSERT INTO #Item(ItemID, Category, StartPt, StartID, EndPt, EndID, lvl)
SELECT cti.ItemID, cti.Category, cti.StartPt, cti.StartID, cti.EndPt, cti.EndID, @Counter
FROM dbo.CTEIssue AS cti
INNER JOIN @CategoryList AS lst ON(lst.Category = cti.Category)
WHERE cti.StartPt = @StartPt
AND cti.StartID = @StartID

-- We want to remove the duplicates from the final result set, so we do our
-- filtering to create a unique set of anchor pairs
INSERT INTO @PtIDPair(Pt, ID, lvl)
SELECT EndPt, EndID, @Counter
FROM #Item
WHERE lvl = @Counter
GROUP BY EndPt, EndID

WHILE @@ROWCOUNT > 0
BEGIN
    SET @Counter += 1
    IF @Counter = @MaxRecursion
        RAISERROR('Max Recursion reached', 16, 1)

    INSERT INTO #Item(ItemID, Category, StartPt, StartID, EndPt, EndID, lvl)
    SELECT cti.ItemID, cti.Category, cti.StartPt, cti.StartID, cti.EndPt, cti.EndID, @Counter
    FROM
        dbo.CTEIssue AS cti
        INNER JOIN
        @CategoryList AS lst ON(lst.Category = cti.Category)
        INNER JOIN
        @PtIDPair AS pr ON(cti.StartPt = pr.Pt AND cti.StartID = pr.ID)
        LEFT OUTER JOIN
        #Item AS itm ON(itm.ItemID = cti.ItemID)
    WHERE itm.ItemID IS NULL
        AND pr.lvl = @Counter - 1

    INSERT INTO @PtIDPair(Pt, ID, lvl)
    SELECT itm.EndPt, itm.EndID, @Counter
    FROM
        #Item AS itm
        LEFT OUTER JOIN
        @PtIDPair AS pr ON(itm.EndPt = pr.Pt AND itm.EndID = pr.ID)
    WHERE pr.ID IS NULL
        AND itm.lvl = @Counter
    GROUP BY itm.EndPt, itm.EndID
END

SELECT DISTINCT cti.ItemID
FROM
    dbo.CTEIssue AS cti
    INNER JOIN
    #Item AS itm ON(cti.ItemID = itm.ItemID)
WHERE cti.EndPt <> -1512034855
    OR cti.Category = -379484415

Side Note: Our investigations revealed that in this case table variables significantly outperformed local temp tables.

The first step was to declare the variables and create a temporary table (#Item) to store the equivalent of the CTE result set. A table variable (@PtIDPair) was used to store the unique join-key rows generated during the iterations. Just before entering the WHILE loop, we inserted the first records (a.k.a. the anchor members). Inside the loop we found all the children that satisfied the join criteria, further filtered by the predicates, and inserted them into the list of unique records so that the next iteration would not collect the same children again. This was the most important distinction.

Improving the CTE Query Plan

Compared to the CTE, the WHILE loop is a fair amount of coding. Prior to re-writing the CTE as a WHILE loop, we attempted to optimize the query plan itself, within the constraint of not being able to generate unique values in the join. We modified the CTE to use a table variable instead of an IN clause, created indexes, and updated the statistics. This achieved a 50% performance improvement, but unfortunately still missed the objective of improving by at least a factor of 12.
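In outline, the IN-list change looked like this: the literal list becomes a joined table variable, which also lets the application supply the categories as a table-valued parameter instead of building a long IN clause (this is a sketch of the kind of change, not the ISV's exact revised query):

```sql
-- Replace the long IN(...) list with a join against a table of categories.
DECLARE @CategoryList table (Category int NOT NULL PRIMARY KEY);
INSERT INTO @CategoryList(Category)
VALUES (-379491138), (-379518300), (-379520626), (472917509), (-379494005),
       (-379506005), (-2134852210), (-379515627), (-379515134), (-379484415);

WITH CTE (ItemID, StartPt, StartID, EndPt, EndID, level)
AS
(
    SELECT anc.ItemID, anc.StartPt, anc.StartID, anc.EndPt, anc.EndID, 0
    FROM dbo.CTEIssue AS anc
    INNER JOIN @CategoryList AS lst ON lst.Category = anc.Category
    WHERE anc.StartPt = -1512034855
      AND anc.StartID = 1809069910

    UNION ALL

    SELECT chi.ItemID, chi.StartPt, chi.StartID, chi.EndPt, chi.EndID, par.level + 1
    FROM dbo.CTEIssue AS chi
    INNER JOIN CTE AS par ON par.EndPt = chi.StartPt AND par.EndID = chi.StartID
    INNER JOIN @CategoryList AS lst ON lst.Category = chi.Category
)
SELECT DISTINCT cti.ItemID
FROM dbo.CTEIssue AS cti
INNER JOIN CTE AS cte ON cti.ItemID = cte.ItemID
WHERE cti.EndPt <> -1512034855 OR cti.Category = -379484415;
```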

Below are the STATISTICS IO and TIME outputs demonstrating the improvement from adding the TVP and indexes:


Without TVP:

(3 row(s) affected)

Table 'CTEIssue'. Scan count 41,592,377, logical reads 137,303,190, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'Worktable'. Scan count 2, logical reads 24,576,797, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times: CPU time = 209,197 ms, elapsed time = 217,344 ms.

With TVP:

(3 row(s) affected)

Table 'CTEIssue'. Scan count 4,428,524, logical reads 41,930,102, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'Worktable'. Scan count 2, logical reads 24,575,641, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table '#6EF57B66'. Scan count 1, logical reads 19,007,602, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times: CPU time = 143,646 ms, elapsed time = 151,268 ms.

Side Note: Table variables do not maintain statistics and are generally not recommended when there are cost-based plan choices that can result in significant performance variability; one would typically leverage a local temporary table in that case. We used a TVP here because the ISV application was creating an IN clause that could reach hundreds of values, and a table variable data type was more robust, and in some cases more efficient.

When we compared the query plans of the CTE and the WHILE loop, we realized that each iteration of the WHILE loop has the luxury of its own query plan. This is unlike the CTE, which executes against a single plan for the entire recursion.
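One standard way to lean into that advantage (not something the post's final solution required) is to add OPTION (RECOMPILE) to the statements inside the loop, so each iteration's plan is costed against the table variable's actual current row count rather than a fixed low guess:

```sql
-- Inside the WHILE loop: recompile per iteration so the optimizer sees
-- the current size of @PtIDPair instead of a stale single-row estimate.
INSERT INTO #Item(ItemID, Category, StartPt, StartID, EndPt, EndID, lvl)
SELECT cti.ItemID, cti.Category, cti.StartPt, cti.StartID, cti.EndPt, cti.EndID, @Counter
FROM dbo.CTEIssue AS cti
INNER JOIN @CategoryList AS lst ON lst.Category = cti.Category
INNER JOIN @PtIDPair AS pr ON cti.StartPt = pr.Pt AND cti.StartID = pr.ID
LEFT OUTER JOIN #Item AS itm ON itm.ItemID = cti.ItemID
WHERE itm.ItemID IS NULL
  AND pr.lvl = @Counter - 1
OPTION (RECOMPILE);
```

The trade-off is a compilation cost on every iteration, so this helps most when the per-iteration row counts vary widely.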

A significant contributor to the long-running nature of the CTE query is the portion of the query plan that estimates fewer than 10 rows on both the outer and inner inputs, but ends up iterating through millions of rows on both sides. For a table that only has a few hundred thousand rows, this is a direct consequence of the non-unique join criteria.

Query Plan Statistics

In Conclusion…

The Common Table Expression (CTE) is a powerful construct for addressing the historically challenging objective of writing recursive queries in T-SQL, and it is a natural, oftentimes appropriate choice when migrating from the CONNECT BY syntax. Recursive CTE queries do, however, rely on unique parent/child keys for best performance. If that is not achievable, a WHILE loop is potentially a much more efficient approach to handling the recursive query.

Our final result brought this frequently run query down from 144 seconds to 4 seconds, a 3,600% performance improvement!
