Common Table Expressions (CTEs) are a bit complex and difficult to understand at first blush. Many of the tutorials and examples on the net don't make it any easier for those just starting out. I thought I'd put together a quick gist that tries to simplify the concept and demonstrate how to do recursive queries using CTEs.
Keep in mind that CTEs have other uses besides just recursive queries but this gist is just about how they can be used to create recursive searches.
I'm using SQLite in this example but any SQL language that implements the WITH keyword should be able to do the same thing. If you've never used SQLite before, you are missing out on an amazing, open source, stand alone, SQL engine. I encourage you to check it out.
In the sample data we have a basic employee directory. The first column is the users name. The second column is their user ID, the third is their title, and the last column is the user ID of their manager. This creates a simple flat file database that encapsulates hierarchical data. In this case, the management structure of the company. You can find anyone's position in the company by simply following the chain of managers until you reach the top (a person with no manager listed).
First, of course, download the files from the gist to a directory.
Use the initialization.sql
file to create the table and import the data as follows:
sqlite3 test.sqlite < initialization.sql
Now you can access the data via either the users table or via the view created using the CTE. The CTE allows you to do an arbitrary depth recursive scan of the data to create a management chain or path back to the top of the organization.
sqlite> select * from users order by name limit 10;
Adelina Marsha|admarsha|Change Magician|shcarroll
Akilah Audie|akaudie|Retail Jedi|hetennille
Alena Patrick|alpatrick|New Media Guru|auandrea
Alisia Caterina|alcaterina|Happiness Advocate|suhobert
Allen Cory|alcory|Digital Overlord|chleopoldo
Alphonso Fredrick|alfredrick|Director of First Impressions|wamitch
Amos Chia|amchia|Digital prophet|dehong
Angelic Arline|anarline|Under Secretary to the Sub-Committee|tacathie
Angie Cecilia|ancecilia|Cheese Sprayer|alcory
Anika Rob|anrob|Ambassador of buzz|debertie
This is just the stock query you can do of any table. The last field is the manager user ID (UID) so all we can see is the UID of the users's manager. We could write a query to see the manager's name instead of just their UID by JOINing our current query with the same table using the manager UID. For example
sqlite> select u.*,m.name from users u join users m on u.manager=m.cn order by u.name limit 10;
Adelina Marsha|admarsha|Change Magician|shcarroll|Shanita Carroll
Akilah Audie|akaudie|Retail Jedi|hetennille|Hermine Tennille
Alena Patrick|alpatrick|New Media Guru|auandrea|Audria Andrea
Alisia Caterina|alcaterina|Happiness Advocate|suhobert|Sulema Hobert
Allen Cory|alcory|Digital Overlord|chleopoldo|Ching Leopoldo
Alphonso Fredrick|alfredrick|Director of First Impressions|wamitch|Wava Mitch
Amos Chia|amchia|Digital prophet|dehong|Dee Hong
Angelic Arline|anarline|Under Secretary to the Sub-Committee|tacathie|Tajuana Cathie
Angie Cecilia|ancecilia|Cheese Sprayer|alcory|Allen Cory
Anika Rob|anrob|Ambassador of buzz|debertie|Dedra Bertie
We could continue this process to see the next manager up the chain as well by using a subquery. However, the query will get more and more complex and, without examining the data, we don't really know how many subqueries we need to add to be sure we get all the way to the top of the tree.
However by using our recursive CTE we can get a complete list of the management chain, all the way to the top of the tree. Using our recursive CTE we get the following
sqlite> select * from bp order by name limit 10;
Adelina Marsha|admarsha|Change Magician|shcarroll|Shanita Carroll > Ching Leopoldo
Akilah Audie|akaudie|Retail Jedi|hetennille|Hermine Tennille > Allen Cory > Ching Leopoldo
Alena Patrick|alpatrick|New Media Guru|auandrea|Audria Andrea > Helena Darius > Ching Leopoldo
Alisia Caterina|alcaterina|Happiness Advocate|suhobert|Sulema Hobert > Helena Darius > Ching Leopoldo
Allen Cory|alcory|Digital Overlord|chleopoldo|Ching Leopoldo
Alphonso Fredrick|alfredrick|Director of First Impressions|wamitch|Wava Mitch > Ching Leopoldo
Amos Chia|amchia|Digital prophet|dehong|Dee Hong > Taylor Harvey > Ching Leopoldo
Angelic Arline|anarline|Under Secretary to the Sub-Committee|tacathie|Tajuana Cathie > Helena Darius > Ching Leopoldo
Angie Cecilia|ancecilia|Cheese Sprayer|alcory|Allen Cory > Ching Leopoldo
Anika Rob|anrob|Ambassador of buzz|debertie|Dedra Bertie > Wava Mitch > Ching Leopoldo
As we can see Allen Cory reports directly to the top boss (Ching Leopoldo). Alphonso Fredrick reports to Wava Mitch who reports to Ching. Alena Patrick reports to Audria, and Alisia Caterina reports to Sulema. Both Audria and Sulema report to Helena who then reports to Ching.
As you can see, there are a number of different levels of reports in the company. Some report directly to the boss, while others are two or more levels away from the boss. It would be nearly impossible to create a sufficiently deep set of joins to create the 'boss path' back to the root of the organization. And even if you did, it would be very fragile and need constant maintenance to keep up with a growing organization.
On the other hand, by using a recursive CTE, we are able to handle any level of recursion (up to the limits of the server's recursion setting).
Creating a view in SQLite give us an easy way to store our CTE in the database so we don't have to type all that out each time we want to return the management path in a query. Now, instead of specifying the users table, we instead specify the view name (bp). The view works just like a real table except that you can't insert into a view.
So how does all the recursive magic happen? Let's take a look at the body of the view using the recursive common table expression.
CREATE VIEW bp AS
-- create a CTE (common table expression)
-- think of this as creating a temporary table that only exists during this query
-- works somewhat like CREATE TEMPORARY TABLE bosspath(cn, path)
WITH RECURSIVE bosspath(cn,path) AS
(
-- initial select statement to get started
-- this is only executed once
-- in this case we are starting at the 'root' of the management tree
-- the record with no manager
SELECT cn,name FROM users WHERE manager=''
-- merge those results with the following query
UNION ALL
-- recursive select statement
-- executed repeatedly until there are no more results
SELECT users.cn,users.name||' > '||bosspath.path
FROM users
JOIN bosspath ON users.manager=bosspath.cn
)
-- now query the CTE to produce results
-- think of 'bosspath' as being a (very) temporary table
SELECT users.*,bosspath.path FROM users JOIN bosspath ON users.manager=bosspath.cn;
As you can see from the embedded comments (lines with "--"), the CTE starts off using the 'WITH RECURSIVE' keywords. Next we have the name of the CTE (bosspath) followed by the columns (cn, and path) that this CTE returns. Think of it as defining a (very temporary) table named bosspath with two columns, 'cn' and 'path'. Following this is the body of the CTE which will be discussed in some detail below. And finally there is the main SELECT statement that returns results. Note that in the SELECT we are joining data from the main users table with data from the CTE to create our final result.
Within a recursive CTE there are two major components. The first is an initialization query to get the ball rolling. The second is the recursive query that will be called repeatedly.
The initialization query must return at least one result, but it can return more. Each of those results goes on a queue to be fed into the recursive query until no new results are returned.
The recursive query (the one after UNION ALL) is the real workhorse of the recursive CTE. This query is executed on everything in the queue and all rows output go on that same queue to be further processed by the same query. That's where the recursive magic comes in.
Note in particular how we build up the path using concatenation of the previous results with the current name. This is what allows the query to return the full management chain as a single value / column (see below for a discussion of how to reverse the order).
At the bottom we finally have a query against the CTE we've constructed. This works just like a normal query against a table and you can include joins, limits, and pretty much any other normal SQL constructs.
We can easily reverse the order of the boss list (going from top down instead of bottom up) by reversing the order that we paste the string together in the recursive query. Instead of
SELECT users.cn,users.name||' > '||bosspath.path
we use
SELECT users.cn,bosspath.path||' > '||users.name
so that the previous path comes first, followed by the current user (manager) name.
Another advantage of building the management chain path as a string is searching for everyone under a particular manager. Because the CTE acts as a temporary table, and since we created the management path as a single string using concatenation, we can now perform queries on the path itself.
For example, if we want to see everyone who reports up through Allen Cory, we can do a query like this.
sqlite> SELECT * FROM bp WHERE path like '%Allen Cory%' order by name limit 10;
Akilah Audie|akaudie|Retail Jedi|hetennille|Hermine Tennille > Allen Cory > Ching Leopoldo
Angie Cecilia|ancecilia|Cheese Sprayer|alcory|Allen Cory > Ching Leopoldo
Ashly Farah|asfarah|Chief Inspiration Officer|hetennille|Hermine Tennille > Allen Cory > Ching Leopoldo
Dewayne Darius|dedarius|Chief Inspiration Officer|hetennille|Hermine Tennille > Allen Cory > Ching Leopoldo
Dot Molly|domolly|Oyster Floater|mamaurice|Mabelle Maurice > Gina Brook > Allen Cory > Ching Leopoldo
Gina Brook|gibrook|Problem Wrangler|alcory|Allen Cory > Ching Leopoldo
Hermine Tennille|hetennille|New Media Guru|alcory|Allen Cory > Ching Leopoldo
Jone Kenna|jokenna|Oyster Floater|alcory|Allen Cory > Ching Leopoldo
Kary Lin|kalin|Director of First Impressions|gibrook|Gina Brook > Allen Cory > Ching Leopoldo
Laurena Leone|laleone|Beverage Dissemination Officer|hetennille|Hermine Tennille > Allen Cory > Ching Leopoldo
We can also use all the normal aggregation tools, such as COUNT, on our CTE based view.
sqlite> SELECT COUNT(*) FROM bp WHERE path like '%Allen Cory%';
27
A CTE is easily understood if you think of it as just creating a temporary table, with the specified fields, that only lasts for the duration of the query. Recursive CTEs give us a way to express in pure SQL what would normally be done in an external programming language (search for a person then recursively search for the manager up the chain). By expressing and returning the recursive data as a string, we can do further queries on the recursive data that would be nearly impossible to do in SQL any other way (e.g. find everyone who reports up through a given person).
Hopefully this has given you a better understanding of how recursive CTEs work and how they might be helpful in projects involving hierarchical information.
Thanks for this great description - it was exactly the type of clear explanation I was looking for!