MKorostoff/gist:bf1704f8c307c3a35443

## gistfile1.md

      
    Raw
  

              gistfile1.md
            
          
    Sorting By Relevance With Drupal Taxonomy

Often it useful to be able to sort content by relevance.  For instance, imagine you owned a site that published articles about the city of New York.  Imagine this site contained a search page, with a form like this:

These fields hold taxonomy terms.  Users can input the terms they want to search for, and the resulting list should be ranked by relevance.  That is to say, an article tagged with "new york", "brooklyn", and "nightlife" should appear in the results, but it should rank lower than an article tagged with "new york", "brooklyn", "nightlife", and "community."  In short, you want to know which content has been tagged the most with relevant search terms.
This can be done in an extremely performant way, with pure SQL.  There is no need to involve a 3rd party search solution, such as Apache Solr.  What follows is a set of benchmarks revealing the most performant way to generate this ranked list.
The Sample Data Set

For these bench marks, I generated a sample Drupal database with 1 million nodes and 1 million taxonomy terms.  The taxonomy terms were evenly divided among 3 vocabularies, called "tags", "test", and "test2".  Each node referenced an average of 15 taxonomy terms—5 in each vocabulary.  The taxonomy_index table, the largest in the database, contained just over 15 million records.  Unless otherwise stated, all bench marks are for a single query run 5,000 times per test, across 10 tests.
I ran these tests on a 64 bit Macbook Pro with 8GB of DDR3 memory, 2.3GHz Intel Core i7 processor, SSD hard drive.  I used PHP 5.5 and mySQL 5.6.23.
Test 1: Simple Select

Load a ranked list of node IDs with SQL 5,000 times

One way to find out which nodes are tagged the most is by querying the taxonomy_index table.  This table receives a new row every time any node references any taxonomy term.  This allows us to do a very simple query, but only by using a very large table.
For this test, we ran the following query 5,000 times.  Note that a new list of taxonomy term IDs was selected on each iteration, so for this test we are running 5,000 unique queries (i.e. never running the same query twice).  The values 1, 2, 3... are presented here as a visual example.
SELECT nid,count(tid)
FROM taxonomy_index
WHERE tid
IN (1, 2, 3, 4, 5, 6, 7, 8)
GROUP BY nid
ORDER BY count(tid) DESC
LIMIT 5
This gives us a result like this:


Node ID
Number of Matches


7996
5


77791
3


529431
1


1006185
1


426174
1


In this result set, the node 7996 had 5 of the tags we searched for.  Thus 7996 outranks node 77791 which had only three of the tags we searched for.  Here is how this query performed over 5,000 iterations.  Remember each "run" is 5,000 unique queries, so 50,000 unique queries were run in total:

Run 1: 32.245 seconds
Run 2: 29.395 seconds
Run 3: 29.552 seconds
Run 4: 30.586 seconds
Run 5: 30.01 seconds
Run 6: 30.472 seconds
Run 7: 30.634 seconds
Run 8: 29.643 seconds
Run 9: 29.485 seconds
Run 10: 29.451 seconds

See: code used to run Simple Select
#Test 2: Union Select
Load a ranked list of node IDs with SQL 5,000 times (alternate strategy)

In the previous test, we used the taxonomy_index table to lookup nid-tid pairs.  This is actually one of two tables in a Drupal database where this data can be found.  Each taxonomy term reference field is maintained in its own table.  For instance, given a term reference field named "tags" we can find all the relevant nid-tid pairs in the table field_data_field_tags.  For most datasets, these tables are much smaller.
For this test, we are going to independently examine the tables of three term reference fields.  These fields are called "tags", "test", and "test2".  Each table contains about 5 million rows, as compared to the taxonomy_index table, which contains about 15 million rows.  This is a trade off, because we must merge 3 separate SELECT statements (1 for each table) and then sort the result.  We are doing more SELECTS but we are selecting from a smaller dataset.
Perhaps more importantly, we are able to use a much smaller IN statement.  Whereas our previous test used a comma separated list of of 8 term ID values to examine each row in taxonomy_index, here we will only compare any given field to a list of just 3 tids.  We are able to do this because taxonomy terms are organized into vocabularies, which allows us to do a more targeted search.
SELECT nodeid, SUM(tagcount)
FROM (
  SELECT node.nid nodeid, field_data_field_tags.field_tags_tid tag, COUNT(node.nid) tagcount
  FROM node
  LEFT JOIN field_data_field_tags
  ON node.nid=field_data_field_tags.entity_id
  WHERE field_data_field_tags.field_tags_tid
  IN (1, 2, 3)
  GROUP BY nid

  UNION ALL SELECT node.nid nodeid, field_data_field_test.field_test_tid tag, COUNT(node.nid) tagcount
  FROM node
  LEFT JOIN field_data_field_test
  ON node.nid=field_data_field_test.entity_id
  WHERE field_data_field_test.field_test_tid
  IN (4, 5, 6)
  GROUP BY nid

  UNION ALL SELECT node.nid nodeid, field_data_field_test2.field_test2_tid tag, COUNT(node.nid) tagcount
  FROM node
  LEFT JOIN field_data_field_test2
  ON node.nid=field_data_field_test2.entity_id
  WHERE field_data_field_test2.field_test2_tid
  IN (7, 8, 9)
  GROUP BY nid
) subquery

GROUP BY nodeid
ORDER BY sum(tagcount) DESC
LIMIT 5
Here is how this query performs:

Run 1: 11.145 seconds
Run 2: 11.084 seconds
Run 3: 11.137 seconds
Run 4: 11.826 seconds
Run 5: 11.701 seconds
Run 6: 11.087 seconds
Run 7: 11.107 seconds
Run 8: 11.434 seconds
Run 9: 12.783 seconds
Run 10: 12.314 seconds

A surprising result.  Despite the additional selects, the complex subquery, the multiple calls to COUNT() and SUM() we still outperform Simple Select by a factor of three.  A fun fact: even if we use 8 tids in each IN statement for a query of this style, it still out performs Simple Select by roughly a factor of 2.
See: code used to run Union Select
Union Select vs. Simple Select: The Problem of Scale

In our idealized dataset, Union Select is the clear winner.  The problem comes when you attempt to run a query with lots of matches.  Union Select falls down quickly for taxonomy terms which are used frequently.  For instance, imagine a taxonomy term called "news."  This term might be applied to thousands upon thousands of articles.  If you were to run the two queries above and included "news" as one of your chosen terms, Union Select would perform much worse.
So while Union Select might win for tags that are used only 10 or 20 times, Simple Select ultimately wins for realistic datasets.
Test 2a: Searching for the most tagged nodes, with a large number of results

To see how these two techniques stack up, I created different quantities of Drupal nodes with a randomly selected tag.  For instance, I created 10 nodes with the tag 123, 40 nodes with the tag 456, 75 nodes with the tag 789 and so on.  Then I searched for nodes which have that tag using the two queries described above.  While Union Select performed better for tags with few corresponding nodes, Simple Select performed better for any tag that was used more that 1,000 times.


Number Of Matches
Union Select
Simple Select


10
0.1021
0.2925


40
0.1043
0.2994


75
0.1064
0.2986


150
0.1241
0.3206


312
0.1624
0.3128


625
0.2185
0.3022


1250
0.3232
0.3077


2500
0.5528
0.3129


5000
1.3039
0.3026


7500
1.5887
0.3062


The difference is very stark, especially when you see these values graphed:

Test 3: Simple Select with PHP Sort

Load an unsorted list of relevant node IDs with SQL and then sort and rank them with PHP, 5000 times

I include this for the sake of completeness, but there's probably no good reason to ever do this.  In this scenario, we'll reporpose the query from Simple Select removing the GROUP BY and ORDER BY statements.  Removing these statements provides literally zero performance advantage as it turns out, so trying to replicate native SQL sorting functions in PHP can only hurt us.
For this, we'll run a query like this:
SELECT nid
FROM taxonomy_index
WHERE tid
IN (1, 2, 3, 4, 5, 6, 7, 8)
We'll take the results and do a pretty simple PHP sort like so:
$results = db_query("SELECT....")->fetchAll();

$result_array = array();
foreach ($results as $result) {
  $result_array[] = $result->nid;
}

arsort(array_count_values($result_array));
$result_array = array_slice($result_array, 0, 5, true);
Predictably, this performs just slightly worse than Simple Select, making it the overall worst-of-breed:

Run 1: 37.086 seconds
Run 2: 36.753 seconds
Run 3: 33.964 seconds
Run 4: 33.647 seconds
Run 5: 33.68 seconds
Run 6: 33.215 seconds
Run 7: 34.263 seconds
Run 8: 35.078 seconds
Run 9: 36.019 seconds
Run 10: 36.003 seconds

See: code used to run Simple Select with PHP Sort
Test 4: Union Select with PHP Sort

Load an unsorted list of relevant node IDs with SQL, using a union select statement, and then sort and rank them with PHP, 5000 times

I will omit this discussion for the sake of brevity, but sufficient to say, this performs identically to traditional Union Select, and suffers from the same scaling problem.
See: code used to run Union Select with PHP Sort
Analysis and Findings

Using a simple select on the taxonomy_index table is performant and effective for most real-world datasets.  This solution is preferable to other solutions.
Number Of Matches	Union Select	Simple Select
10	0.1021	0.2925
40	0.1043	0.2994
75	0.1064	0.2986
150	0.1241	0.3206
312	0.1624	0.3128
625	0.2185	0.3022
1250	0.3232	0.3077
2500	0.5528	0.3129
5000	1.3039	0.3026
7500	1.5887	0.3062