Often it useful to be able to sort content by relevance. For instance, imagine you owned a site that published articles about the city of New York. Imagine this site contained a search page, with a form like this:
These fields hold taxonomy terms. Users can input the terms they want to search for, and the resulting list should be ranked by relevance. That is to say, an article tagged with "new york", "brooklyn", and "nightlife" should appear in the results, but it should rank lower than an article tagged with "new york", "brooklyn", "nightlife", and "community." In short, you want to know which content has been tagged the most with relevant search terms.
This can be done in an extremely performant way, with pure SQL. There is no need to involve a 3rd party search solution, such as Apache Solr. What follows is a set of benchmarks revealing the most performant way to generate this ranked list.
For these bench marks, I generated a sample Drupal database with 1 million nodes and 1 million taxonomy terms. The taxonomy terms were evenly divided among 3 vocabularies, called "tags", "test", and "test2". Each node referenced an average of 15 taxonomy terms—5 in each vocabulary. The taxonomy_index
table, the largest in the database, contained just over 15 million records. Unless otherwise stated, all bench marks are for a single query run 5,000 times per test, across 10 tests.
I ran these tests on a 64 bit Macbook Pro with 8GB of DDR3 memory, 2.3GHz Intel Core i7 processor, SSD hard drive. I used PHP 5.5 and mySQL 5.6.23.
One way to find out which nodes are tagged the most is by querying the taxonomy_index table. This table receives a new row every time any node references any taxonomy term. This allows us to do a very simple query, but only by using a very large table.
For this test, we ran the following query 5,000 times. Note that a new list of taxonomy term IDs was selected on each iteration, so for this test we are running 5,000 unique queries (i.e. never running the same query twice). The values 1, 2, 3...
are presented here as a visual example.
SELECT nid,count(tid)
FROM taxonomy_index
WHERE tid
IN (1, 2, 3, 4, 5, 6, 7, 8)
GROUP BY nid
ORDER BY count(tid) DESC
LIMIT 5
This gives us a result like this:
Node ID | Number of Matches |
---|---|
7996 | 5 |
77791 | 3 |
529431 | 1 |
1006185 | 1 |
426174 | 1 |
In this result set, the node 7996
had 5 of the tags we searched for. Thus 7996
outranks node 77791
which had only three of the tags we searched for. Here is how this query performed over 5,000 iterations. Remember each "run" is 5,000 unique queries, so 50,000 unique queries were run in total:
- Run 1: 32.245 seconds
- Run 2: 29.395 seconds
- Run 3: 29.552 seconds
- Run 4: 30.586 seconds
- Run 5: 30.01 seconds
- Run 6: 30.472 seconds
- Run 7: 30.634 seconds
- Run 8: 29.643 seconds
- Run 9: 29.485 seconds
- Run 10: 29.451 seconds
See: code used to run Simple Select
#Test 2: Union Select
In the previous test, we used the taxonomy_index
table to lookup nid-tid pairs. This is actually one of two tables in a Drupal database where this data can be found. Each taxonomy term reference field is maintained in its own table. For instance, given a term reference field named "tags" we can find all the relevant nid-tid pairs in the table field_data_field_tags
. For most datasets, these tables are much smaller.
For this test, we are going to independently examine the tables of three term reference fields. These fields are called "tags", "test", and "test2". Each table contains about 5 million rows, as compared to the taxonomy_index table, which contains about 15 million rows. This is a trade off, because we must merge 3 separate SELECT
statements (1 for each table) and then sort the result. We are doing more SELECTS
but we are selecting from a smaller dataset.
Perhaps more importantly, we are able to use a much smaller IN
statement. Whereas our previous test used a comma separated list of of 8 term ID values to examine each row in taxonomy_index, here we will only compare any given field to a list of just 3 tids. We are able to do this because taxonomy terms are organized into vocabularies, which allows us to do a more targeted search.
SELECT nodeid, SUM(tagcount)
FROM (
SELECT node.nid nodeid, field_data_field_tags.field_tags_tid tag, COUNT(node.nid) tagcount
FROM node
LEFT JOIN field_data_field_tags
ON node.nid=field_data_field_tags.entity_id
WHERE field_data_field_tags.field_tags_tid
IN (1, 2, 3)
GROUP BY nid
UNION ALL SELECT node.nid nodeid, field_data_field_test.field_test_tid tag, COUNT(node.nid) tagcount
FROM node
LEFT JOIN field_data_field_test
ON node.nid=field_data_field_test.entity_id
WHERE field_data_field_test.field_test_tid
IN (4, 5, 6)
GROUP BY nid
UNION ALL SELECT node.nid nodeid, field_data_field_test2.field_test2_tid tag, COUNT(node.nid) tagcount
FROM node
LEFT JOIN field_data_field_test2
ON node.nid=field_data_field_test2.entity_id
WHERE field_data_field_test2.field_test2_tid
IN (7, 8, 9)
GROUP BY nid
) subquery
GROUP BY nodeid
ORDER BY sum(tagcount) DESC
LIMIT 5
Here is how this query performs:
- Run 1: 11.145 seconds
- Run 2: 11.084 seconds
- Run 3: 11.137 seconds
- Run 4: 11.826 seconds
- Run 5: 11.701 seconds
- Run 6: 11.087 seconds
- Run 7: 11.107 seconds
- Run 8: 11.434 seconds
- Run 9: 12.783 seconds
- Run 10: 12.314 seconds
A surprising result. Despite the additional selects, the complex subquery, the multiple calls to COUNT()
and SUM()
we still outperform Simple Select by a factor of three. A fun fact: even if we use 8 tids in each IN
statement for a query of this style, it still out performs Simple Select by roughly a factor of 2.
See: code used to run Union Select
In our idealized dataset, Union Select is the clear winner. The problem comes when you attempt to run a query with lots of matches. Union Select falls down quickly for taxonomy terms which are used frequently. For instance, imagine a taxonomy term called "news." This term might be applied to thousands upon thousands of articles. If you were to run the two queries above and included "news" as one of your chosen terms, Union Select would perform much worse.
So while Union Select might win for tags that are used only 10 or 20 times, Simple Select ultimately wins for realistic datasets.
To see how these two techniques stack up, I created different quantities of Drupal nodes with a randomly selected tag. For instance, I created 10 nodes with the tag 123, 40 nodes with the tag 456, 75 nodes with the tag 789 and so on. Then I searched for nodes which have that tag using the two queries described above. While Union Select performed better for tags with few corresponding nodes, Simple Select performed better for any tag that was used more that 1,000 times. Note, these bench marks are how long it took to run each query 15 times, unlike the other bench marks in this document.
Number Of Matches | Union Select | Simple Select |
---|---|---|
10 | 0.1021 | 0.2925 |
40 | 0.1043 | 0.2994 |
75 | 0.1064 | 0.2986 |
150 | 0.1241 | 0.3206 |
312 | 0.1624 | 0.3128 |
625 | 0.2185 | 0.3022 |
1250 | 0.3232 | 0.3077 |
2500 | 0.5528 | 0.3129 |
5000 | 1.3039 | 0.3026 |
7500 | 1.5887 | 0.3062 |
The difference is very stark, especially when you see these values graphed:
Load an unsorted list of relevant node IDs with SQL and then sort and rank them with PHP, 5000 times
I include this for the sake of completeness, but there's probably no good reason to ever do this. In this scenario, we'll reporpose the query from Simple Select removing the GROUP BY
and ORDER BY
statements. Removing these statements provides literally zero performance advantage as it turns out, so trying to replicate native SQL sorting functions in PHP can only hurt us.
For this, we'll run a query like this:
SELECT nid
FROM taxonomy_index
WHERE tid
IN (1, 2, 3, 4, 5, 6, 7, 8)
We'll take the results and do a pretty simple PHP sort like so:
$results = db_query("SELECT....")->fetchAll();
$result_array = array();
foreach ($results as $result) {
$result_array[] = $result->nid;
}
arsort(array_count_values($result_array));
$result_array = array_slice($result_array, 0, 5, true);
Predictably, this performs just slightly worse than Simple Select, making it the overall worst-of-breed:
- Run 1: 37.086 seconds
- Run 2: 36.753 seconds
- Run 3: 33.964 seconds
- Run 4: 33.647 seconds
- Run 5: 33.68 seconds
- Run 6: 33.215 seconds
- Run 7: 34.263 seconds
- Run 8: 35.078 seconds
- Run 9: 36.019 seconds
- Run 10: 36.003 seconds
See: code used to run Simple Select with PHP Sort
Load an unsorted list of relevant node IDs with SQL, using a union select statement, and then sort and rank them with PHP, 5000 times
I will omit this discussion for the sake of brevity, but sufficient to say, this performs identically to traditional Union Select, and suffers from the same scaling problem.
See: code used to run Union Select with PHP Sort
Using a simple select on the taxonomy_index table is performant and effective for most real-world datasets. This solution is preferable to other solutions.