- Feature Name: hash_drain_if
- Start Date: 2016-10-06
- RFC PR: (leave this empty)
- Rust Issue: (leave this empty)
# Summary

Add a new API to `std::collections::HashSet` and `std::collections::HashMap`.

For `HashSet`:

```rust
/// Clears the set of all selected elements, returning them in an iterator.
fn drain_if<P>(&mut self, predicate: P) -> DrainIf<T, P>
    where P: FnMut(&mut T) -> bool;
```

For `HashMap`:

```rust
/// Clears the map of all selected key-value pairs, returning them in an iterator.
fn drain_if<P>(&mut self, predicate: P) -> DrainIf<K, V, P>
    where P: FnMut(&K, &mut V) -> bool;
```
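Since `drain_if` does not exist yet, its intended observable behavior on a `HashSet` can be sketched with stable APIs. The helper below is a hypothetical emulation: it clones and allocates, which the real API would avoid, and it takes `&T` rather than `&mut T` because a stable `HashSet` cannot safely hand out mutable references to its elements.

```rust
use std::collections::HashSet;
use std::hash::Hash;

// Hypothetical emulation of the proposed `drain_if` semantics for HashSet.
// The real API would return a lazy iterator and avoid the extra allocation;
// this sketch only demonstrates the intended end state.
fn drain_if_emulated<T, P>(set: &mut HashSet<T>, mut predicate: P) -> Vec<T>
where
    T: Hash + Eq + Clone,
    P: FnMut(&T) -> bool,
{
    // Collect the elements selected for removal...
    let drained: Vec<T> = set.iter().filter(|&t| predicate(t)).cloned().collect();
    // ...then remove them from the set.
    for t in &drained {
        set.remove(t);
    }
    drained
}

fn main() {
    let mut set: HashSet<i32> = (1..=6).collect();
    let mut drained = drain_if_emulated(&mut set, |&t| t % 2 == 0);
    drained.sort();
    assert_eq!(drained, vec![2, 4, 6]);
    assert_eq!(set.len(), 3); // 1, 3, 5 remain
}
```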
# Motivation

Specifically, I've run into the following scenario: I have two `HashMap`s, one of `ready_items` and one of `unready_items`. Every once in a while, I want to go through all the `unready_items`, remove the ones that are now ready, and add them to `ready_items`. Right now, the only ways of doing that involve a temporary collection of some sort.
Option 1: Drain and reinsert

```rust
let new_unready: Vec<_> = unready_items.drain().filter_map(
    |(k, v)| {
        if v.is_ready() {
            ready_items.insert(k, v);
            None
        } else {
            Some((k, v))
        }
    }
).collect();
for (k, v) in new_unready { unready_items.insert(k, v); }
```
Option 2: Filter and remove

```rust
let new_ready: Vec<_> = unready_items.iter().filter_map(
    |(k, v)| if v.is_ready() { Some(k.clone()) } else { None }
).collect();
for k in new_ready {
    let v = unready_items.remove(&k).unwrap();
    ready_items.insert(k, v);
}
```
On top of the temporary collection, Option 1 requires reinserting all the values that were not supposed to be drained, and Option 2 requires cloning keys, both of which are logically unnecessary and potentially expensive operations.

The new API avoids all of these issues:

```rust
for (k, v) in unready_items.drain_if(|_, v| v.is_ready()) {
    ready_items.insert(k, v);
}
```
# Detailed design

The first step is to duplicate most of the implementation of `drain`, excluding `ExactSizeIterator`, which `DrainIf` does not implement. `DrainIf` will also need a new field to store the predicate.

For the implementation of `Iterator`, the simplest option just shifts the remaining entries back after every removal. However, this option is not very performant: in the worst case it requires O(n²) shifts.
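To see where the quadratic worst case comes from, consider draining a table where all n entries form one contiguous collision cluster: removing each entry shifts every remaining entry in the cluster back by one. A toy count of the shifts (an illustration of the cost model, not the real table code):

```rust
// Counts the backward shifts performed when draining an n-entry collision
// cluster front to back: removing the i-th entry shifts the n - 1 - i
// entries behind it, for a total of n(n-1)/2, i.e. O(n^2).
fn shifts_for_full_cluster(n: usize) -> usize {
    (0..n).map(|i| n - 1 - i).sum()
}

fn main() {
    assert_eq!(shifts_for_full_cluster(4), 6); // 3 + 2 + 1 + 0
    assert_eq!(shifts_for_full_cluster(100), 4950); // 100 * 99 / 2
}
```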
A more performant option is available, with the tradeoff that one additional bit per bucket of space is required. There are a couple of options as to where that bit could go:

- It could be stored in another bit of the hash, reducing hashes to 62 bits.
- It could be stored in `buffer` as another offset.
- It could be stored in a separate buffer in `RawTable` (optionally lazily initialized).
- It could be stored in a temporary buffer in `DrainIf`.

This bit is used to record whether a bucket has had its displacement minimized. This is needed because, for the duration of the `DrainIf` operation, the displacement invariant is broken; the bit is used to rebuild and restore the invariant. For the purpose of this document, the two states the bit can be in will be described as `DEFAULT` and `TOGGLED`:

- If the bit is stored in the hash, `DEFAULT` is 1 and `TOGGLED` is 0.
- For all other cases, `DEFAULT` is 0 and `TOGGLED` is 1.
The choice of how to store the minimized marker changes how the following functions work internally, but their general function remains the same:

- `mark_processed`
- `is_minimized`
- `mark_minimized`
- `mark_unminimized`
- `get_unminimized_buckets`

If the buffer is being initialized in the constructor for `DrainIf`, all the bits are initialized to `DEFAULT`. Otherwise, they should already be in this state.
The bit works in conjunction with the address/index of the highest processed bucket in the following way:

- If a bucket's index is less than or equal to the highest processed index:
  - A value of `DEFAULT` represents minimized.
  - A value of `TOGGLED` represents unminimized.
- If a bucket's index is greater than the highest processed index:
  - A value of `DEFAULT` represents unminimized.
  - A value of `TOGGLED` represents minimized.

This representation allows us to avoid having to touch empty buckets. It also means we exit the function with all bits set to `DEFAULT`, which could be useful if the buffer is permanent.
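This state machine can be modeled with a plain bit vector. The sketch below is a standalone model for illustration only: the names, the `Vec<bool>` storage, and the index-based interface are assumptions, whereas the real implementation would pack bits and live alongside `RawTable` or `DrainIf`.

```rust
// Standalone model of the minimized-bit bookkeeping. `false` is DEFAULT and
// `true` is TOGGLED; a bucket's meaning depends on whether its index is at
// or below the highest processed index (`hp`).
struct MinimizedBits {
    bits: Vec<bool>,
}

impl MinimizedBits {
    fn new(capacity: usize) -> Self {
        MinimizedBits { bits: vec![false; capacity] } // all DEFAULT
    }

    // Flip the bit as the drain cursor passes a bucket, returning whether
    // the bucket was already minimized (TOGGLED while still unprocessed).
    // The flip preserves the bucket's meaning across the hp boundary.
    fn mark_processed(&mut self, idx: usize) -> bool {
        let was_minimized = self.bits[idx];
        self.bits[idx] = !self.bits[idx];
        was_minimized
    }

    // At or below `hp`, DEFAULT means minimized; above `hp`, TOGGLED does.
    fn is_minimized(&self, idx: usize, hp: usize) -> bool {
        if idx <= hp { !self.bits[idx] } else { self.bits[idx] }
    }

    fn mark_minimized(&mut self, idx: usize, hp: usize) {
        self.bits[idx] = idx > hp;
    }

    fn mark_unminimized(&mut self, idx: usize, hp: usize) {
        self.bits[idx] = idx <= hp;
    }
}

fn main() {
    let mut m = MinimizedBits::new(8);
    // An unprocessed bucket starts out unminimized...
    assert!(!m.is_minimized(5, 2));
    // ...until a displacement fix-up marks it minimized ahead of the cursor.
    m.mark_minimized(5, 2);
    assert!(m.is_minimized(5, 2));
    // When the cursor reaches it, the flip leaves it reading as minimized.
    assert!(m.mark_processed(5));
    assert!(m.is_minimized(5, 5));
    // Everything ends at DEFAULT, matching the observation above.
    assert_eq!(m.bits, vec![false; 8]);
}
```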
The function `mark_processed` transitions between these two states by flipping the bit for a given bucket, returning whether the bucket was already minimized.

The function `is_minimized` takes a bucket and the highest processed index/address and returns true if the bucket is marked as minimized.

The functions `mark_minimized` and `mark_unminimized` take a bucket and the highest processed index/address and mark the bucket as minimized or unminimized respectively. They might also record bounds information to accelerate other steps.

The function `get_unminimized_buckets` iterates through all unminimized buckets until none remain. It is likely this function will actually just be implemented inline as loops. This function will likely benefit from any bounds information stored by `mark_unminimized`. If the bit is stored in a separate buffer, this function can skip 64 bits at a time, with a logarithmic search inside those 64 bits should any of them be unminimized (depending on performance numbers).
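If the bit lives in a separate buffer, scanning for unminimized buckets reduces to scanning for set bits in 64-bit words. A minimal sketch of that scan (using `trailing_zeros` rather than the binary search mentioned above, which the real code would tune against performance numbers):

```rust
// Visits the index of every set ("unminimized") bit in a packed bit buffer,
// skipping zero words 64 bits at a time.
fn for_each_set_bit<F: FnMut(usize)>(words: &[u64], mut f: F) {
    for (wi, &word) in words.iter().enumerate() {
        let mut w = word;
        while w != 0 {
            let bit = w.trailing_zeros() as usize;
            f(wi * 64 + bit);
            w &= w - 1; // clear the lowest set bit
        }
    }
}

fn main() {
    let words = [0b1010u64, 0, 1 << 63];
    let mut seen = Vec::new();
    for_each_set_bit(&words, |i| seen.push(i));
    assert_eq!(seen, vec![1, 3, 191]); // word 2, bit 63 -> index 191
}
```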
We then define the iterator implementation as follows:

```rust
impl<'a, K, V, P: FnMut(&K, &mut V) -> bool> Iterator for DrainIf<'a, K, V, P> {
    type Item = (SafeHash, K, V);

    #[inline]
    fn next(&mut self) -> Option<(SafeHash, K, V)> {
        // Loop through buckets
        while let Some(bucket) = self.iter.next() {
            // Record that this bucket has been processed
            let minimized = self.mark_processed(bucket);
            unsafe {
                // Skip buckets that have gone through displacement minimization.
                // Return items that the predicate says to remove.
                // Attempt to minimize buckets that are not being removed.
                if minimized {
                    continue;
                } else if (self.predicate)(&*bucket.key, &mut *(bucket.val as *mut V)) {
                    (**self.table).size -= 1;
                    return Some((
                        SafeHash { hash: ptr::replace(bucket.hash, EMPTY_BUCKET) },
                        ptr::read(bucket.key),
                        ptr::read(bucket.val),
                    ));
                } else {
                    self.minimize_displacement(bucket, bucket.hash as usize);
                }
            }
        }
        // We are done, so fix up the displacement of everything that was
        // deferred but not yet fixed up
        for bucket in self.get_unminimized_buckets() {
            self.minimize_displacement(bucket, usize::max_value());
        }
        None
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        let size = unsafe { (**self.table).size() };
        (0, Some(size))
    }
}
```
The helper `minimize_displacement` attempts to minimize the displacement of a `RawBucket`. It assumes the bucket passed is not empty, and that only buckets that have not been marked as "minimized" will be passed. It requires the map's displacement invariant to hold for all buckets marked as "minimized", but does not require this invariant for any other buckets. If displacement minimization succeeds, the invariant has been satisfied for the written bucket, which is then marked as minimized.

```rust
// `hpa` is the highest processed index/address.
fn minimize_displacement(&mut self, bucket: RawBucket, hpa: usize) {
    // Start by looking at the ideal index for this bucket
    let mut target = Bucket::new(self.table, bucket.hash());
    // Keep going until we reach our own index
    while target.raw.hash != bucket.hash {
        match target.peek() {
            // If we find an empty bucket, migrate into it and mark it as minimized
            Empty(empty) => {
                unsafe {
                    *empty.raw.hash = mem::replace(&mut *bucket.hash, EMPTY_BUCKET);
                    ptr::copy_nonoverlapping(bucket.key, empty.raw.key as *mut K, 1);
                    ptr::copy_nonoverlapping(bucket.val, empty.raw.val as *mut V, 1);
                }
                return self.mark_minimized(empty.raw, hpa);
            },
            // If we find a full bucket, check whether it is minimized. If so,
            // check the next cell; otherwise we cannot minimize this bucket
            // right now.
            Full(full) => {
                if !self.is_minimized(full, hpa) {
                    return self.mark_unminimized(bucket, hpa);
                }
                target.next();
            },
        }
    }
    // If everything between our ideal index and our current index is minimized,
    // that means we're minimized
    self.mark_minimized(bucket, hpa);
}
```
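The overall effect of `minimize_displacement` can be illustrated with a toy table. The sketch below models buckets as `Option<u64>` with identity "hashing" (an assumption for the demo, not the real `RawTable`), and slides a full bucket back toward its ideal slot through the first empty bucket it finds:

```rust
// Toy model of displacement minimization: an entry's displacement is its
// distance from the slot its hash maps to. Sliding it back into the first
// empty slot between its ideal and current position minimizes that distance
// given the occupied slots in between.
fn minimize_slot(buckets: &mut [Option<u64>], idx: usize) -> usize {
    let cap = buckets.len();
    let hash = buckets[idx].expect("bucket must be full");
    let ideal = (hash as usize) % cap; // identity "hashing" for the demo
    let mut target = ideal;
    // Walk from the ideal slot toward the current slot.
    while target != idx {
        if buckets[target].is_none() {
            // Migrate into the empty slot, leaving the old slot empty.
            buckets[target] = buckets[idx].take();
            return target;
        }
        target = (target + 1) % cap;
    }
    idx // already as close to its ideal slot as the occupancy allows
}

fn main() {
    // Entry with hash 3 was displaced to slot 5; slot 3 has since emptied.
    let mut buckets: Vec<Option<u64>> = vec![None; 8];
    buckets[2] = Some(2);
    buckets[4] = Some(4);
    buckets[5] = Some(3);
    let new_idx = minimize_slot(&mut buckets, 5);
    assert_eq!(new_idx, 3); // moved back to its ideal slot
    assert_eq!(buckets[5], None);
    assert_eq!(buckets[3], Some(3));
}
```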
# Drawbacks

More APIs, more complexity. Another bit of storage is required.

# Alternatives

- As always, do nothing.
- A different name (e.g. `drain_filter`).

# Unresolved questions

Is there a more efficient algorithm for minimizing displacement?