Tiering Notes

Tiering Data Structures and Algorithms

Problem Statement

In the design of a tiering solution, the first problem one encounters is the definition of an ideal end condition. As a first approximation, consider this:

Sort all files by (descending) time of last access.  The hot tier should contain the files at the top of the list, up to capacity.  The cold tier should contain everything else.

This definition practically requires two kinds of records, because crawling even the top tier - let alone both tiers - to gather the same information would be prohibitively expensive, and so slow that the answers would be wrong by the time you got them. (Both records are sketched below.)

  • A database of last-access times.

  • A manifest of what's currently in the top tier.
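
As a rough sketch of these two records - the names AccessDB and HotTierManifest are purely illustrative, not part of any existing implementation:

    import time

    class AccessDB:
        """Database of last-access times, keyed by file identifier."""
        def __init__(self):
            self.last_access = {}          # file_id -> unix timestamp of last access

        def record_access(self, file_id):
            self.last_access[file_id] = time.time()

        def sorted_by_recency(self):
            # Most recently accessed first, per the first-approximation definition.
            return sorted(self.last_access, key=self.last_access.get, reverse=True)

    class HotTierManifest:
        """Manifest of what currently lives in the hot tier."""
        def __init__(self, capacity_bytes):
            self.capacity_bytes = capacity_bytes
            self.resident = {}             # file_id -> size in bytes

        def used_bytes(self):
            return sum(self.resident.values())

        def contains(self, file_id):
            return file_id in self.resident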

A lively debate could be had about whether these two records should be combined or kept separate, but there's an even more important problem: the goal is wrong. The real ideal end condition is the one that results in the greatest probability of future I/O going to the hot tier. That probability is proportional not to how recently a file was accessed, but to how often. Therefore, let's tweak our definition of an ideal end state.

Sort all files by (descending) "heat level" - total accesses or sum of accesses with decaying weight over time.  The hot tier should contain the files at the top of the list, up to capacity.  The cold tier should contain everything else.

Applying this definition will yield better hit ratios than its predecessor. The rest of this document is about how to do that.
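
To make the end state concrete, here is a minimal sketch of the split itself, assuming per-file heat and size values are already available (how to compute heat is the subject of the rest of the document):

    def ideal_split(files, hot_capacity_bytes):
        """files: list of (file_id, heat, size_bytes) tuples.

        Returns (hot, cold) lists of file_ids: the hottest files are packed into
        the hot tier up to capacity, and everything else goes to the cold tier.
        """
        ranked = sorted(files, key=lambda f: f[1], reverse=True)  # hottest first
        hot, cold, used = [], [], 0
        for file_id, heat, size in ranked:
            if used + size <= hot_capacity_bytes:
                hot.append(file_id)
                used += size
            else:
                cold.append(file_id)
        return hot, cold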

Standard Solution: Generations

Another fundamental problem with tiering is that it's infeasible to retain information about every file that was ever accessed. This problem appears in two ways:

  • Older data, which is assumed to have lost its value, needs to be garbage-collected.

  • Sorting every record since the "beginning of time" is less efficient than sorting several smaller data sets.

A common approach to both problems is to divide time up into generations - defined either by time or by metadata storage space - and to maintain access time/count information only within a generation instead of globally. This immediately addresses the sorting-time problem. It also makes garbage collection very simple and efficient; just remove the file or drop the table corresponding to the oldest generation, instead of deleting one record at a time.
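
As a toy sketch of this generational bookkeeping - in-memory counters stand in for what would really be per-generation tables or files:

    from collections import Counter, deque

    class Generations:
        def __init__(self, max_generations):
            self.max_generations = max_generations
            self.gens = deque([Counter()])   # gens[0] is the current generation

        def record_access(self, file_id):
            self.gens[0][file_id] += 1

        def rotate(self):
            # Start a new generation. Garbage collection is simply dropping the
            # oldest generation, not deleting records one at a time.
            self.gens.appendleft(Counter())
            if len(self.gens) > self.max_generations:
                self.gens.pop()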

Another benefit of the generational approach is that it provides an obvious way to apply the "decaying weight" heat metric (which is generally more accurate than the "total access count" method). After calculating heat within the current generation, files that are "on the edge" can be distinguished from one another by adding the data from the previous generation (with a "decay factor" to give it less weight). This process can be repeated with successively older generations until some desired tradeoff between time spent and margin of error is achieved. In fact, it's often sufficient for information about older generations to be approximate, even down to binary "was/wasn't accessed" bits in a Bloom or cuckoo filter.
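
Continuing the sketch above, a decaying-weight heat calculation across generations could look like the following; max_depth is a hypothetical knob for the time-versus-accuracy tradeoff, and the approximate (Bloom/cuckoo filter) representation of old generations is left out for brevity:

    def heat(gens, file_id, decay=2.0, max_depth=None):
        # Sum access counts across generations, dividing the count from
        # N generations ago by decay**N so older activity carries less weight.
        # Stopping at max_depth trades accuracy for evaluation time.
        depth = len(gens) if max_depth is None else min(max_depth, len(gens))
        total = 0.0
        for age in range(depth):
            total += gens[age].get(file_id, 0) / (decay ** age)
        return total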

Proactive Demotion

So far, we've assumed that promotion/demotion occurs only periodically. Alternatively, we might want to maintain a pool of free space in the hot tier and place new files there on the (quite reasonable) assumption that files recently written will also be frequently accessed. In addition to defining a free-space threshold, this means that after each generational shift we should retain the list of demotion candidates in sorted (coldest first) order. This list can be used both immediately to make space for promotion candidates and later to make space for newly created files.
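
A sketch of proactive demotion against the manifest from the earlier sketch, assuming the coldest-first candidate list is retained between passes; the demote callback (which would actually move data to the cold tier) is hypothetical:

    def make_room(manifest, demotion_candidates, bytes_needed, demote):
        """Demote coldest files until at least bytes_needed of hot-tier space is free.

        manifest            -- HotTierManifest from the earlier sketch
        demotion_candidates -- list of file_ids, coldest first
        demote              -- callback that moves one file to the cold tier
        """
        free = manifest.capacity_bytes - manifest.used_bytes()
        while free < bytes_needed and demotion_candidates:
            victim = demotion_candidates.pop(0)
            size = manifest.resident.pop(victim, 0)
            demote(victim)
            free += size
        return free >= bytes_needed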

Hysteresis

Some workloads involve periodic access to certain files, or a predictable delay between production/consumption of data. "Big Data" workloads particularly tend to be this way, and it would be a shame if we consistently found ourselves demoting files just before they were needed. Therefore, it might also be necessary to define a "minimum residency time" during which files should remain in the hot tier before being considered for demotion.
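
A minimal residency check, assuming the hot tier records a promotion timestamp for each file (promoted_at is a hypothetical field):

    import time

    def may_demote(promoted_at, min_residency_seconds, now=None):
        # A file only becomes a demotion candidate once it has been resident
        # in the hot tier for at least the minimum residency time.
        now = time.time() if now is None else now
        return (now - promoted_at) >= min_residency_seconds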

Alternate Solution: Running Totals

For completeness, we may also consider how to implement the decaying-weight model within a single generation. Instead of periodically creating new tables and then having to traverse them when evaluating promotion/demotion candidates, we could just do the following:

  1. Divide all current file access counts by the generational decay factor.

  2. Sort by (now adjusted) access counts.

  3. Delete the records with the lowest counts, either immediately (for files already in the cold tier) or after demotion.

This is probably simpler to implement, but it's important to note that it does not solve the sorting-time problem. Record-by-record deletion is also likely to be less efficient than bulk deletion (as noted previously).
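
A sketch of this variant, assuming a single dict of running counts and a keep_fraction cutoff (both names are illustrative):

    def evaluate_running_totals(counts, decay, keep_fraction):
        """counts: dict of file_id -> running access count (mutated in place)."""
        # Step 1: age every count by the decay factor.
        for file_id in counts:
            counts[file_id] /= decay
        # Step 2: sort by the adjusted counts, hottest first.
        ranked = sorted(counts, key=counts.get, reverse=True)
        # Step 3: the coldest records become demotion/deletion candidates; note
        # that this is record-by-record work, unlike dropping a whole generation.
        cutoff = int(len(ranked) * keep_fraction)
        return ranked[:cutoff], ranked[cutoff:]   # (keep, demote or delete)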

Effect on Tuning Parameters

Based on the discussion above, we can finally consider which parameters could be used to tune tiering behavior; they're collected into a configuration sketch after the list.

  • Frequency with which we re-evaluate promotion/demotion candidates.

  • Duration/size of each generation (need not match the evaluation frequency, though that might be a convenient simplification).

  • Decay factor (greater than one). Access counts from N generations ago are divided by decay^N to give them less weight.

  • Free-space threshold, if using proactive demotion.

  • Minimum residency time, if implementing hysteresis.
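
Collected into one hypothetical configuration record - field names and default values are illustrative only, not an existing interface:

    from dataclasses import dataclass

    @dataclass
    class TieringConfig:
        eval_interval_seconds: int = 3600      # how often candidates are re-evaluated
        generation_length_seconds: int = 3600  # duration of one generation
        decay_factor: float = 2.0              # counts from N generations ago are divided by decay**N
        free_space_threshold: float = 0.10     # fraction of hot tier kept free for proactive demotion
        min_residency_seconds: int = 7200      # hysteresis: minimum time in hot tier before demotion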
