- Feature Name: hash_drain_if
- Start Date: 10/6/2016
- RFC PR: (leave this empty)
- Rust Issue: (leave this empty)
# Summary
New API for `std::collections::HashSet` and `std::collections::HashMap`.

For `HashSet`:

```rust
/// Clears the set of all selected elements, returning them in an iterator.
fn drain_if<P>(&mut self, predicate: P) -> DrainIf<T, P>
    where P: FnMut(&mut T) -> bool;
```

For `HashMap`:

```rust
/// Clears the map of all selected key-value pairs, returning them in an iterator.
fn drain_if<P>(&mut self, predicate: P) -> DrainIf<K, V, P>
    where P: FnMut(&K, &mut V) -> bool;
```
# Motivation
Specifically, I've run into the following scenario: I have two `HashMap`s, one of `ready_items` and one of `unready_items`. Every once in a while, I want to go through all the `unready_items`, remove all the ones that are now ready, and add them to `ready_items`. Right now, the only ways of doing that involve a temporary collection of some sort.
Option 1: Drain and reinsert
```rust
let new_unready: Vec<_> = unready_items.drain().filter_map(
    |(k, v)| {
        if v.is_ready() {
            ready_items.insert(k, v);
            None
        } else {
            Some((k, v))
        }
    }
).collect();
for (k, v) in new_unready { unready_items.insert(k, v); }
```
Option 2: Filter and remove
```rust
let new_ready: Vec<_> = unready_items.iter().filter_map(
    |(k, v)| if v.is_ready() { Some(k.clone()) } else { None }
).collect();
for k in new_ready {
    let v = unready_items.remove(&k).unwrap();
    ready_items.insert(k, v);
}
```
On top of the temporary collection, Option 1 requires reinserting all values which were not supposed to be drained, and Option 2 requires cloning keys; both are logically unnecessary and potentially expensive operations.
The new API avoids all of these issues.
```rust
for (k, v) in unready_items.drain_if(|_, v| v.is_ready()) {
    ready_items.insert(k, v);
}
```
# Detailed design
The first step is to duplicate most of the implementation of `drain`, excluding `ExactSizeIterator`, which `DrainIf` does not implement. `DrainIf` will also need a new field to store the predicate.
For the implementation of `Iterator`, the simplest option just shifts entries back after every removal. However, this option is not very performant: in the worst case it requires O(n²) shifts.
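The shift-after-every-removal strategy can be illustrated on a toy linear-probing table (a sketch only; the names `ideal`, `insert`, and `backward_shift` are hypothetical and the hash is simply `key % capacity`, unlike the real `RawTable` internals). When keys 0, 8, and 16 all hash to slot 0, removing one forces every displaced successor to shift back one slot, which is O(n) per removal and O(n²) for a full drain:

```rust
// Toy linear-probing table demonstrating naive backward-shift deletion.
const CAP: usize = 8;

fn ideal(key: usize) -> usize { key % CAP }

fn insert(table: &mut [Option<usize>; CAP], key: usize) {
    let mut slot = ideal(key);
    while table[slot].is_some() { slot = (slot + 1) % CAP; }
    table[slot] = Some(key);
}

// After emptying `slot`, move each displaced successor back one position
// until an empty or undisplaced slot is found.
fn backward_shift(table: &mut [Option<usize>; CAP], mut slot: usize) {
    loop {
        let next = (slot + 1) % CAP;
        match table[next] {
            Some(k) if ideal(k) != next => {
                table[slot] = Some(k);
                table[next] = None;
                slot = next;
            }
            _ => return,
        }
    }
}

fn main() {
    let mut table: [Option<usize>; CAP] = [None; CAP];
    // 0, 8, and 16 all hash to slot 0, so they occupy slots 0, 1, 2.
    for k in [0, 8, 16] { insert(&mut table, k); }
    // Removing key 0 shifts both 8 and 16 back by one slot.
    table[0] = None;
    backward_shift(&mut table, 0);
    assert_eq!(&table[..3], &[Some(8), Some(16), None]);
}
```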
A more performant option is available, with the tradeoff that one additional bit per bucket of space is required. There are a couple of options as to where that bit could go:
- It could be stored in another bit of the hash, reducing hashes to 62 bits
- It could be stored in `buffer` as another offset
- It could be stored in a separate buffer in `RawTable` (optionally lazily initialized)
- It could be stored in a temporary buffer in `DrainIf`
This bit is used to record whether a bucket has had its displacement minimized. This is needed because the displacement invariant is broken for the duration of the `DrainIf` operation; the bit is used to rebuild and restore the invariant. For the purposes of this document, the two states the bit can be in will be described as `DEFAULT` and `TOGGLED`.
- If the bit is stored in the hash, `DEFAULT` is 1 and `TOGGLED` is 0.
- For all other cases, `DEFAULT` is 0 and `TOGGLED` is 1.
The choice of how to store the minimized marker changes how the following functions work internally, but their general behavior remains the same:

- `mark_processed`
- `is_minimized`
- `mark_minimized`
- `mark_unminimized`
- `get_unminimized_buckets`
If the buffer is being initialized in the constructor for `DrainIf`, all the bits are initialized to `DEFAULT`. Otherwise, they should already be in this state.
The bit works in conjunction with the address/index of the highest processed bucket in the following way:
- If a bucket's index is less than or equal to the highest processed index:
    - A value of `DEFAULT` represents minimized
    - A value of `TOGGLED` represents unminimized
- If a bucket's index is greater than the highest processed index:
    - A value of `DEFAULT` represents unminimized
    - A value of `TOGGLED` represents minimized
This representation allows us to avoid having to touch empty buckets. It also means we exit the function with all bits set to `DEFAULT`, which could be useful if the buffer is permanent.
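The interpretation above can be sketched as a small free function (a hypothetical helper for illustration; here `toggled` is the stored bit normalized so that `false` means `DEFAULT`, and `hpa` is the highest processed index):

```rust
// Interpret a bucket's bit relative to the highest processed index (hpa).
fn is_minimized(toggled: bool, index: usize, hpa: usize) -> bool {
    if index <= hpa {
        // At or below the processed frontier: DEFAULT means minimized.
        !toggled
    } else {
        // Above the frontier: the meaning is inverted.
        toggled
    }
}

fn main() {
    assert!(is_minimized(false, 3, 5));  // below frontier, DEFAULT
    assert!(!is_minimized(true, 3, 5));  // below frontier, TOGGLED
    assert!(!is_minimized(false, 7, 5)); // above frontier, DEFAULT
    assert!(is_minimized(true, 7, 5));   // above frontier, TOGGLED
}
```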
The function `mark_processed` transitions a bucket between these two states by flipping its bit, returning whether the bucket was already minimized.
The function `is_minimized` takes a bucket and the highest processed index/address, and returns true if the bucket is marked as minimized.
The functions `mark_minimized` and `mark_unminimized` take a bucket and the highest processed index/address to mark the bucket as minimized or unminimized respectively. They might also record bounds information to accelerate other steps.
The function `get_unminimized_buckets` iterates through all unminimized buckets until none remain. It is likely this function will actually be implemented inline as loops. It will likely benefit from any bounds information stored by `mark_unminimized`. If the bit is stored in a separate buffer, this function can skip 64 bits at a time, with a logarithmic search inside those 64 bits should any of them be unminimized (depending on performance numbers).
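The word-at-a-time scan could look like the following sketch (an assumption, since the RFC leaves the storage choice open; here a set bit marks an unminimized bucket, and `trailing_zeros` plays the role of the in-word search):

```rust
// Scan a separate bit buffer 64 bits at a time; all-zero words are
// skipped in a single comparison.
fn unminimized_indices(words: &[u64]) -> Vec<usize> {
    let mut out = Vec::new();
    for (w, &word) in words.iter().enumerate() {
        let mut bits = word;
        while bits != 0 {
            let bit = bits.trailing_zeros() as usize;
            out.push(w * 64 + bit);
            bits &= bits - 1; // clear the lowest set bit
        }
    }
    out
}

fn main() {
    // Buckets 67, 104, and 128 are unminimized; word 0 is skipped outright.
    let words = [0u64, (1 << 3) | (1 << 40), 1];
    assert_eq!(unminimized_indices(&words), vec![67, 104, 128]);
}
```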
We then define the iterator implementation as follows:
```rust
impl<'a, K, V, P: FnMut(&'a K, &'a mut V) -> bool> Iterator for DrainIf<'a, K, V, P> {
    type Item = (SafeHash, K, V);

    #[inline]
    fn next(&mut self) -> Option<(SafeHash, K, V)> {
        // Loop through buckets
        while let Some(bucket) = self.iter.next() {
            // Record that this bucket has been processed
            let minimized = self.mark_processed(bucket);
            unsafe {
                // Skip buckets that have gone through displacement minimization.
                // Return items that the predicate says to remove.
                // Attempt to minimize buckets that are not being removed.
                if minimized {
                    continue;
                } else if (self.predicate)(&*bucket.key, &mut *(bucket.val as *mut V)) {
                    (**self.table).size -= 1;
                    return Some((
                        SafeHash { hash: ptr::replace(bucket.hash, EMPTY_BUCKET) },
                        ptr::read(bucket.key),
                        ptr::read(bucket.val),
                    ));
                } else {
                    self.minimize_displacement(bucket, bucket.hash as usize);
                }
            }
        }
        // We are done, so fix up the displacement of everything that was saved
        // but not fixed up
        for bucket in self.get_unminimized_buckets() {
            self.minimize_displacement(bucket, usize::max_value());
        }
        None
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        let size = unsafe { (**self.table).size() };
        (0, Some(size))
    }
}
```
The helper `minimize_displacement` attempts to minimize the displacement of a `RawBucket`. It assumes the bucket passed is not empty, and only buckets that have not been marked as "minimized" will be passed. It requires the map displacement invariant to hold for all buckets marked as "minimized", but does not require this invariant for any other buckets. If displacement minimization succeeds, the invariant has been satisfied for the written bucket, which is then marked as minimized.
```rust
// Attempts to minimize the displacement of `bucket`, marking it as
// minimized on success and as unminimized otherwise
fn minimize_displacement(&mut self, bucket: RawBucket, hpa: usize) {
    // Start by looking at the ideal index for this bucket
    let mut target = Bucket::new(self.table, bucket.hash());
    // Keep going until we reach our own index
    while target.raw.hash != bucket.hash {
        match target.peek() {
            // If we find an empty bucket, migrate into it and mark it as minimized
            Empty(target) => {
                unsafe {
                    *target.raw.hash = mem::replace(&mut *bucket.hash, EMPTY_BUCKET);
                    ptr::copy_nonoverlapping(bucket.key, target.raw.key as *mut K, 1);
                    ptr::copy_nonoverlapping(bucket.val, target.raw.val as *mut V, 1);
                }
                return self.mark_minimized(target.raw, hpa);
            },
            // If we find a full bucket, check if it is minimized. If so, we should
            // check the next cell; otherwise we cannot minimize this bucket right now
            Full(full) => {
                if !self.is_minimized(full.raw, hpa) {
                    return self.mark_unminimized(bucket, hpa);
                }
                target.next();
            },
        }
    }
    // If everything between our ideal index and our current index is minimized,
    // that means we're minimized
    self.mark_minimized(bucket, hpa);
}
```
# Drawbacks
More APIs, more complexity. Another bit of storage per bucket is required.
# Alternatives

- As always, do nothing.
- A different name (e.g. `drain_filter`)
# Unresolved questions

Is there a more efficient algorithm for minimizing displacement?