Laravel & Recursive, Relation-Constrained, Tree-Shaken Queries
Disclaimer
The syntax for recursive SQL CTE queries varies depending on what you are using. While this will probably work with other DB engines, I've only used this with PostgreSQL. I am also using a PostgreSQL-exclusive SQL statement for optimal efficiency; however, I do note an alternative for non-PostgreSQL servers as well.
Use Case
Let's say we have a Category model with self-referential Parent & Children relationships set up. In addition, there is also a hasMany relationship set up for a Video model. So maybe we have some categories set up with their own sub-categories, and those sub-categories also have sub-categories, and those sub-sub-categories have a video.
Now, let's say we need to fetch only the categories that have at least one published video. That's easy enough - but if we're building out a hierarchical tree, we'll need to also query the ancestral tree all the way up. While this can be done in a few different ways within Laravel, it's typically not very efficient and requires multiple individual queries to be made.
With a recursive CTE (Common Table Expression) query, we can fetch only the required Categories to complete the hierarchy tree in a single query, resulting in much more efficient computation cycles for larger datasets which is my professional assumption... use at your own peril.
The Query
Dataset Example
What we're essentially going for here is to tree-shake the entire group of Categories so that only the following remains:
- Categories that have at least one video
- OR Categories that have a descendant at any depth with at least one video
Given the use case above, the visual representation of the categories is as follows:
Original
- Example 1 (unpublished video ❌)
- Example 2
- Example 3 (published video ✔️)
- Example 4
- Example 5 (published video ✔️)
- Example 6
- Example 7
- Example 8 (unpublished video ❌)
- Example 9
- Example 10 (published video ✔️)
- Example 8 (unpublished video ❌)
- Example 2
- Example 11
- Example 12 (published video ✔️)
- Example 13 (published video ✔️)
- Example 14 (unpublished video ❌)
Tree-shaken
- Example 1
- Example 2
- Example 3 (published video ✔️)
- Example 5 (published video ✔️)
- Example 7
- Example 10 (published video ✔️)
- Example 7
- Example 2
- Example 11
- Example 12 (published video ✔️)
- Example 13 (published video ✔️)
The problem here is that there is no limit to how many levels deep a given Category's descendants might go. In practice, I'd imagine the upper limit for most applications would be about 10 levels deep (and that would be a lot of levels), but the idea of having to know the maximum depth of sub-Categories for a given dataset at the time of executing the query feels like an anti-pattern. Instead, I wanted a solution that would work dynamically, regardless of the depth of subcategories we're dealing with.
Enter PostgreSQL CTE
Before looking into this, I hadn't worked with recursive queries before. I recommend reading about them more in-depth on PostgreSQL's website. Essentially what this does is utilize temporary tables for just this query while recursively executing the SQL statements specified within.
Our query will look something like this:
WITH RECURSIVE Hierarchy AS (
-- Anchor member: Categories directly related to videos OR the root category for tree
SELECT c.*, v.id AS video_id
FROM categories c
LEFT JOIN videos v
ON c.id = v.category_id
WHERE v.published IS NOT NULL
UNION
-- Recursive member: Categories that are descendants of the directly related categories
SELECT c.*, h.video_id
FROM categories c
JOIN Hierarchy h ON c.id = h.parent_id
)
-- Final select: Retrieve all categories in the hierarchy
SELECT DISTINCT ON (h.id) *
FROM Hierarchy h
WHERE video_id IS NOT NULL
Lets break down what's happening here.
The Common Table Expression
WITH RECURSIVE Hierarchy AS (
-- Anchor member: Categories directly related to videos OR the root category for tree
SELECT c.*, v.id AS video_id
FROM categories c
LEFT JOIN videos v
ON c.id = v.category_id
WHERE v.published IS NOT NULL
UNION
-- Recursive member: Categories that are descendants of the directly related categories
SELECT c.*, h.video_id
FROM categories c
JOIN Hierarchy h ON c.id = h.parent_id
)
-- Final select: Retrieve all categories in the hierarchy
SELECT DISTINCT ON (h.id) *
FROM Hierarchy h
WHERE video_id IS NOT NULL
The first line is what tells the SQL server that we're doing a recursive CTE.
Selecting the initial Categories
WITH RECURSIVE Hierarchy AS (
-- Anchor member: Categories directly related to videos OR the root category for tree
SELECT c.*, v.id AS video_id
FROM categories c
LEFT JOIN videos v
ON c.id = v.category_id
WHERE v.published IS NOT NULL
UNION
-- Recursive member: Categories that are descendants of the directly related categories
SELECT c.*, h.video_id
FROM categories c
JOIN Hierarchy h ON c.id = h.parent_id
)
-- Final select: Retrieve all categories in the hierarchy
SELECT DISTINCT ON (h.id) *
FROM Hierarchy h
WHERE video_id IS NOT NULL
In the next chunk we're querying for Catagories that have at least one published video. You'll notice that in additional to selecting all the columns in for the Categories, I've added on a video_id column; with this column added it creates a unique trail from a given video's related Category down that category's furthest ancestor.
Note that it actually doesn't matter if the video_id is for a published video or not - we're already specifying that the category needs to have a published video in the WHERE clause, so the value of video_id itself only needs to remain constant in the trail. It's a very minor thing, but I like to imagine it adds to the overall efficiency of the SQL query but honestly I have no idea if it really matters.
Chaining their ancestors
WITH RECURSIVE Hierarchy AS (
-- Anchor member: Categories directly related to videos OR the root category for tree
SELECT c.*, v.id AS video_id
FROM categories c
LEFT JOIN videos v
ON c.id = v.category_id
WHERE v.published IS NOT NULL
UNION
-- Recursive member: Categories that are descendants of the directly related categories
SELECT c.*, h.video_id
FROM categories c
JOIN Hierarchy h ON c.id = h.parent_id
)
-- Final select: Retrieve all categories in the hierarchy
SELECT DISTINCT ON (h.id) *
FROM Hierarchy h
WHERE video_id IS NOT NULL
The next SELECT within the recursive statement is the part that strings together the hierarchal trail; if a given video belongs to a root Category (i.e. one that has no parent_id), then this SELECT will return nothing. However if the Category does have a parent_id, then will return that parent. This is the key recursive element of this query; it works regardless of how deep the hierarchy is on a give Category trail.
Returning a clean dataset
WITH RECURSIVE Hierarchy AS (
-- Anchor member: Categories directly related to videos OR the root category for tree
SELECT c.*, v.id AS video_id
FROM categories c
LEFT JOIN videos v
ON c.id = v.category_id
WHERE v.published IS NOT NULL
UNION
-- Recursive member: Categories that are descendants of the directly related categories
SELECT c.*, h.video_id
FROM categories c
JOIN Hierarchy h ON c.id = h.parent_id
)
-- Final select: Retrieve all categories in the hierarchy
SELECT DISTINCT ON (h.id) *
FROM Hierarchy h
WHERE video_id IS NOT NULL
The results of the recursive CTE will potentially have multiple instances of a given Category; the higher the Category is in the hierarchy, the more likely it will have duplicated rows in the dataset here. That's because for each video, we'll get the full Category hierarchy. So if a given Category has multiple descendant Categories with a video, it will have as many duplicate rows in the returned dataset.
Obviously, we don't want a bunch of duplicate rows in the dataset, so we'll need ot purge them. Thankfully, PostgreSQL has a cool SELECT DISTINCT ON(COLUMN) construct. While SELECT DISTINCT will select unique rows of data, SELECT DISTINCT ON(COLUMN) will only consider that column for distinct results.
When paired with WHERE video_id IS NOT NULL, the final dataset returned is exactly what we want; every Category needed to build a Category-with-video(s) hierarchy tree and no duplicates.
id | parent_id | name | video_id |
---|---|---|---|
1 | "Example 1" | 1 | |
2 | 1 | "Example 2" | 1 |
3 | 2 | "Example 3" | 1 |
5 | 1 | "Example 5" | 2 |
7 | 5 | "Example 7" | 3 |
10 | 7 | "Example 10" | 3 |
11 | "Example 11" | 4 | |
12 | 11 | "Example 12" | 4 |
13 | "Example 13" | 5 |
However, of course this data is flat - In order to get an actual hierarchy tree, we'll need to do some restructuring.
Tip: For Non-PostgreSQL Servers
INFO
If you're using PostgreSQL, you can skip to the next section
As of this writing (February 2024), the SELECT DISTINCT ON() is exclusive to PostgreSQL. However, if you adjust the final SQL statement to a normal DISTINCT, you'll get results like this:
SELECT DISTINCT *
FROM Hierarchy gh
WHERE video_id IS NOT NULL
id | parent_id | name | video_id |
---|---|---|---|
1 | "Example 1" | 1 | |
2 | 1 | "Example 2" | 1 |
3 | 2 | "Example 3" | 1 |
1 | "Example 1" | 2 | |
5 | 1 | "Example 5" | 2 |
1 | "Example 1" | 3 | |
5 | 1 | "Example 5" | 3 |
7 | 5 | "Example 7" | 3 |
10 | 7 | "Example 10" | 3 |
11 | "Example 11" | 4 | |
12 | 11 | "Example 12" | 4 |
13 | "Example 13" | 5 |
The issue here is that while we have enough data to do what we want, we actually have too much; each video has the full trail, leading to duplicate rows for ids 1 & 5.
This can be cleaned up with a post-query filtering step; the snippet below will filter out any duplicates and fill the $unique variable with only the unique rows (unique by ID), which will end with the same results as a PostgreSQL SELECT DISTINCT ON() statement.
$uniqueIds = [];
$unique = [];
foreach($results as $result)
{
if (!in_array($result->id, $uniqueIds))
{
$uniqueIds[] = $result->id;
$unique[] = $result;
}
}
Structuring into a proper hierarchy tree
Now that we have the data we want, let's hop over to the Laravel & PHP side of things and start using it.
First, let's wrap our CTE query into a proper Laravel Eloquent query builder statement:
use Illuminate\Support\Facades\DB;
public static function getHierarchyWithVideos()
{
$results = DB::select(
DB::raw('
WITH RECURSIVE Hierarchy AS (
-- Anchor member: Categories directly related to videos OR the root category for tree
SELECT c.*, v.id AS video_id
FROM categories c
LEFT JOIN videos v
ON c.id = v.category_id
WHERE v.published IS NOT NULL
UNION
-- Recursive member: Categories that are descendants of the directly related categories
SELECT c.*, h.video_id
FROM categories c
JOIN Hierarchy h ON c.id = h.parent_id
)
-- Final select: Retrieve all categories in the hierarchy
SELECT DISTINCT ON (h.id) *
FROM Hierarchy h
WHERE video_id IS NOT NULL
')
->getValue(DB::connection()->getQueryGrammar())
);
}
INFO
After initially having some trouble getting the output of this query with Eloquent, I stumbled upon an answer in a Laracast forum post that mentioned using the getQueryGrammar() here as the argument for getValue(). It wasn't obvious to me what exactly getQueryGrammar() was doing, and others have asked the same question - for now though, since this same syntax is used within the source code of Laravel itself I'm not really worried about it.
Since this is a recursive CTE and beyond what Eloquent supports, we can wrap it in a DB::raw() within a DB::select(), and the getValue() call will return the dataset as stdClass objects, i.e. not related to the actual Model class the table is for.
We now need to build out the multi-level tree and cast each object into the Model class that it actually is. To do this, we'll add an internal function within this one that we'll recursively call:
function parseHierarchy($categories, int $parentId = null)
{
$hierarchy = [];
foreach ($categories as $category) {
if ($category->parent_id == $parentId) {
// Call self for recursive build out
$children = parseHierarchy($categories, $category->id);
// $children will be null if not categories exist with the current category's id as their parent_id;
if ($children) $category->children = $children;
// Cast to our Laravel Model class; since we already populated the children
// parameter, no eager loading is necessary for that relationship
$category = new Category((array) $category);
// Eager load any existing videos for current category
$category->load(['videos' => function($query) {
// Add any eager loading constraints
$query->where('published', '<>', null);
}]);
// Push into main array
$hierarchy[] = $category;
}
}
return $hierarchy;
}
TL;DR Final Results
use Illuminate\Support\Facades\DB;
public static function getHierarchyWithVideos()
{
$results = DB::select(
DB::raw('
WITH RECURSIVE Hierarchy AS (
-- Anchor member: Categories directly related to videos OR the root category for tree
SELECT c.*, v.id AS video_id
FROM categories c
LEFT JOIN videos v
ON c.id = v.category_id
WHERE v.published IS NOT NULL
UNION
-- Recursive member: Categories that are descendants of the directly related categories
SELECT c.*, h.video_id
FROM categories c
JOIN Hierarchy h ON c.id = h.parent_id
)
-- Final select: Retrieve all categories in the hierarchy
SELECT DISTINCT ON (h.id) *
FROM Hierarchy h
WHERE video_id IS NOT NULL
')
->getValue(DB::connection()->getQueryGrammar())
);
function parseHierarchy($categories, int $parentId = null)
{
$hierarchy = [];
foreach ($categories as $category) {
if ($category->parent_id == $parentId) {
// Call self for recursive build out
$children = parseHierarchy($categories, $category->id);
// $children will be null if not categories exist with the current category's id as their parent_id;
if ($children) $category->children = $children;
// Cast to our Laravel Model class; since we already populated the children
// parameter, no eager loading is necessary for that relationship
$category = new Category((array) $category);
// Eager load any existing videos for current category
$category->load(['videos' => function($query) {
// Add any eager loading constraints
$query->where('published', '<>', null);
}]);
// Push into main array
$hierarchy[] = $category;
}
}
return $hierarchy;
}
return parseHierarchy($results);
}
There you have it; this will return you a tree-shaken & relation-constrained hierarchal tree of Models which have a self-referential parent/child relationship. Of course, you will need to update the table names, Model class, etc. to fit your application.
There are undoubtedly more optimizations you could make to this; the primary thing that comes to mind would be to eager-load the videos in the CTE itself to remove any eager loading at all.
I also played around with abstracting this functionality to be easily reusable for any Model & relationship; however I quickly realized the complexity that would required in order to support all the different relationship types in Laravel and dynamically build out the CTE.