- Feature Name: hash_drain_if
- Start Date: 2016-10-06
- RFC PR: (leave this empty)
- Rust Issue: (leave this empty)
# Summary

Add a new API to `std::collections::HashSet` and `std::collections::HashMap`.

For `HashSet`:

```rust
/// Clears the set of all selected elements, returning them in an iterator.
fn drain_if<P>(&mut self, predicate: P) -> DrainIf<T, P>
    where P: FnMut(&mut T) -> bool;
```

For `HashMap`:

```rust
/// Clears the map of all selected key-value pairs, returning them in an iterator.
fn drain_if<P>(&mut self, predicate: P) -> DrainIf<K, V, P>
    where P: FnMut(&K, &mut V) -> bool;
```
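Since `drain_if` does not exist yet, its intended observable behavior on a `HashSet` can be sketched with stable APIs. The helper below is a hypothetical emulation: it clones and allocates, which the real API would avoid, and it takes `&T` rather than `&mut T` because a stable `HashSet` cannot safely hand out mutable references to its elements.

```rust
use std::collections::HashSet;
use std::hash::Hash;

// Hypothetical emulation of the proposed `drain_if` semantics for HashSet.
// The real API would return a lazy iterator and avoid the extra allocation;
// this sketch only demonstrates the intended end state.
fn drain_if_emulated<T, P>(set: &mut HashSet<T>, mut predicate: P) -> Vec<T>
where
    T: Hash + Eq + Clone,
    P: FnMut(&T) -> bool,
{
    // Collect the elements selected for removal...
    let drained: Vec<T> = set.iter().filter(|&t| predicate(t)).cloned().collect();
    // ...then remove them from the set.
    for t in &drained {
        set.remove(t);
    }
    drained
}

fn main() {
    let mut set: HashSet<i32> = (1..=6).collect();
    let mut drained = drain_if_emulated(&mut set, |&t| t % 2 == 0);
    drained.sort();
    assert_eq!(drained, vec![2, 4, 6]);
    assert_eq!(set.len(), 3); // 1, 3, 5 remain
}
```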
# Motivation

Specifically, I've run into the following scenario: I have two `HashMap`s, one of `ready_items` and one of `unready_items`. Every once in a while, I want to go through all the `unready_items`, remove the ones that are now ready, and add them to `ready_items`. Right now, the only ways of doing that involve a temporary collection of some sort.
Option 1: Drain and reinsert

```rust
let new_unready: Vec<_> = unready_items.drain().filter_map(
    |(k, v)| {
        if v.is_ready() {
            ready_items.insert(k, v);
            None
        } else {
            Some((k, v))
        }
    }
).collect();
for (k, v) in new_unready { unready_items.insert(k, v); }
```
Option 2: Filter and remove

```rust
let new_ready: Vec<_> = unready_items.iter().filter_map(
    |(k, v)| if v.is_ready() { Some(k.clone()) } else { None }
).collect();
for k in new_ready {
    let v = unready_items.remove(&k).unwrap();
    ready_items.insert(k, v);
}
```
On top of the temporary collection, Option 1 requires reinserting all the values that were not supposed to be drained, and Option 2 requires cloning keys, both of which are logically unnecessary and potentially expensive operations.

The new API avoids all of these issues:

```rust
for (k, v) in unready_items.drain_if(|_, v| v.is_ready()) {
    ready_items.insert(k, v);
}
```
# Detailed design

The first step is to duplicate most of the implementation of `drain`, excluding `ExactSizeIterator`, which `DrainIf` does not implement. `DrainIf` will also need a new field to store the predicate.

For the implementation of `Iterator`, the simplest option just shifts the remaining entries back after every removal. However, this option is not very performant: in the worst case it requires O(n²) shifts.
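To see where the quadratic worst case comes from, consider draining a table where all n entries form one contiguous collision cluster: removing each entry shifts every remaining entry in the cluster back by one. A toy count of the shifts (an illustration of the cost model, not the real table code):

```rust
// Counts the backward shifts performed when draining an n-entry collision
// cluster front to back: removing the i-th entry shifts the n - 1 - i
// entries behind it, for a total of n(n-1)/2, i.e. O(n^2).
fn shifts_for_full_cluster(n: usize) -> usize {
    (0..n).map(|i| n - 1 - i).sum()
}

fn main() {
    assert_eq!(shifts_for_full_cluster(4), 6); // 3 + 2 + 1 + 0
    assert_eq!(shifts_for_full_cluster(100), 4950); // 100 * 99 / 2
}
```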
A more performant option is available, with the tradeoff that one additional bit per bucket of space is required. There are a couple of options as to where that bit could go:

- It could be stored in another bit of the hash, reducing hashes to 62 bits.
- It could be stored in `buffer` as another offset.
- It could be stored in a separate buffer in `RawTable` (optionally lazily initialized).
- It could be stored in a temporary buffer in `DrainIf`.

This bit is used to record whether a bucket has had its displacement minimized. This is needed because, for the duration of the `DrainIf` operation, the displacement invariant is broken; the bit is used to rebuild and restore the invariant. For the purpose of this document, the two states the bit can be in will be described as `DEFAULT` and `TOGGLED`:

- If the bit is stored in the hash, `DEFAULT` is 1 and `TOGGLED` is 0.
- For all other cases, `DEFAULT` is 0 and `TOGGLED` is 1.
The choice of how to store the minimized marker changes how the following functions work internally, but their general function remains the same:

- `mark_processed`
- `is_minimized`
- `mark_minimized`
- `mark_unminimized`
- `get_unminimized_buckets`

If the buffer is being initialized in the constructor for `DrainIf`, all the bits are initialized to `DEFAULT`. Otherwise, they should already be in this state.
The bit works in conjunction with the address/index of the highest processed bucket in the following way:

- If a bucket's index is less than or equal to the highest processed index:
  - A value of `DEFAULT` represents minimized.
  - A value of `TOGGLED` represents unminimized.
- If a bucket's index is greater than the highest processed index:
  - A value of `DEFAULT` represents unminimized.
  - A value of `TOGGLED` represents minimized.

This representation allows us to avoid having to touch empty buckets. It also means we exit the function with all bits set to `DEFAULT`, which could be useful if the buffer is permanent.
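This state machine can be modeled with a plain bit vector. The sketch below is a standalone model for illustration only: the names, the `Vec<bool>` storage, and the index-based interface are assumptions, whereas the real implementation would pack bits and live alongside `RawTable` or `DrainIf`.

```rust
// Standalone model of the minimized-bit bookkeeping. `false` is DEFAULT and
// `true` is TOGGLED; a bucket's meaning depends on whether its index is at
// or below the highest processed index (`hp`).
struct MinimizedBits {
    bits: Vec<bool>,
}

impl MinimizedBits {
    fn new(capacity: usize) -> Self {
        MinimizedBits { bits: vec![false; capacity] } // all DEFAULT
    }

    // Flip the bit as the drain cursor passes a bucket, returning whether
    // the bucket was already minimized (TOGGLED while still unprocessed).
    // The flip preserves the bucket's meaning across the hp boundary.
    fn mark_processed(&mut self, idx: usize) -> bool {
        let was_minimized = self.bits[idx];
        self.bits[idx] = !self.bits[idx];
        was_minimized
    }

    // At or below `hp`, DEFAULT means minimized; above `hp`, TOGGLED does.
    fn is_minimized(&self, idx: usize, hp: usize) -> bool {
        if idx <= hp { !self.bits[idx] } else { self.bits[idx] }
    }

    fn mark_minimized(&mut self, idx: usize, hp: usize) {
        self.bits[idx] = idx > hp;
    }

    fn mark_unminimized(&mut self, idx: usize, hp: usize) {
        self.bits[idx] = idx <= hp;
    }
}

fn main() {
    let mut m = MinimizedBits::new(8);
    // An unprocessed bucket starts out unminimized...
    assert!(!m.is_minimized(5, 2));
    // ...until a displacement fix-up marks it minimized ahead of the cursor.
    m.mark_minimized(5, 2);
    assert!(m.is_minimized(5, 2));
    // When the cursor reaches it, the flip leaves it reading as minimized.
    assert!(m.mark_processed(5));
    assert!(m.is_minimized(5, 5));
    // Everything ends at DEFAULT, matching the observation above.
    assert_eq!(m.bits, vec![false; 8]);
}
```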
The function `mark_processed` transitions between these two states by flipping the bit for a given bucket, returning whether the bucket was already minimized.

The function `is_minimized` takes a bucket and the highest processed index/address and returns true if the bucket is marked as minimized.

The functions `mark_minimized` and `mark_unminimized` take a bucket and the highest processed index/address and mark the bucket as minimized or unminimized respectively. They might also record bounds information to accelerate other steps.

The function `get_unminimized_buckets` iterates through all unminimized buckets until none remain. It is likely this function will actually just be implemented inline as loops. This function will likely benefit from any bounds information stored by `mark_unminimized`. If the bit is stored in a separate buffer, this function can skip 64 bits at a time, with a logarithmic search inside those 64 bits should any of them be unminimized (depending on performance numbers).
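If the bit lives in a separate buffer, scanning for unminimized buckets reduces to scanning for set bits in 64-bit words. A minimal sketch of that scan (using `trailing_zeros` rather than the binary search mentioned above, which the real code would tune against performance numbers):

```rust
// Visits the index of every set ("unminimized") bit in a packed bit buffer,
// skipping zero words 64 bits at a time.
fn for_each_set_bit<F: FnMut(usize)>(words: &[u64], mut f: F) {
    for (wi, &word) in words.iter().enumerate() {
        let mut w = word;
        while w != 0 {
            let bit = w.trailing_zeros() as usize;
            f(wi * 64 + bit);
            w &= w - 1; // clear the lowest set bit
        }
    }
}

fn main() {
    let words = [0b1010u64, 0, 1 << 63];
    let mut seen = Vec::new();
    for_each_set_bit(&words, |i| seen.push(i));
    assert_eq!(seen, vec![1, 3, 191]); // word 2, bit 63 -> index 191
}
```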
We then define the iterator implementation as follows:

```rust
impl<'a, K, V, P: FnMut(&K, &mut V) -> bool> Iterator for DrainIf<'a, K, V, P> {
    type Item = (SafeHash, K, V);

    #[inline]
    fn next(&mut self) -> Option<(SafeHash, K, V)> {
        // Loop through buckets
        while let Some(bucket) = self.iter.next() {
            // Record that this bucket has been processed
            let minimized = self.mark_processed(bucket);
            unsafe {
                // Skip buckets that have gone through displacement minimization.
                // Return items that the predicate says to remove.
                // Attempt to minimize buckets that are not being removed.
                if minimized {
                    continue;
                } else if (self.predicate)(&*bucket.key, &mut *(bucket.val as *mut V)) {
                    (**self.table).size -= 1;
                    return Some((
                        SafeHash { hash: ptr::replace(bucket.hash, EMPTY_BUCKET) },
                        ptr::read(bucket.key),
                        ptr::read(bucket.val),
                    ));
                } else {
                    self.minimize_displacement(bucket, bucket.hash as usize);
                }
            }
        }
        // We are done, so fix up the displacement of everything that was
        // deferred but not yet fixed up
        for bucket in self.get_unminimized_buckets() {
            self.minimize_displacement(bucket, usize::max_value());
        }
        None
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        let size = unsafe { (**self.table).size() };
        (0, Some(size))
    }
}
```
The helper `minimize_displacement` attempts to minimize the displacement of a `RawBucket`. It assumes the bucket passed is not empty, and that only buckets that have not been marked as "minimized" will be passed. It requires the map's displacement invariant to hold for all buckets marked as "minimized", but does not require this invariant for any other buckets. If displacement minimization succeeds, the invariant has been satisfied for the written bucket, which is then marked as minimized.

```rust
// `hpa` is the highest processed index/address.
fn minimize_displacement(&mut self, bucket: RawBucket, hpa: usize) {
    // Start by looking at the ideal index for this bucket
    let mut target = Bucket::new(self.table, bucket.hash());
    // Keep going until we reach our own index
    while target.raw.hash != bucket.hash {
        match target.peek() {
            // If we find an empty bucket, migrate into it and mark it as minimized
            Empty(empty) => {
                unsafe {
                    *empty.raw.hash = mem::replace(&mut *bucket.hash, EMPTY_BUCKET);
                    ptr::copy_nonoverlapping(bucket.key, empty.raw.key as *mut K, 1);
                    ptr::copy_nonoverlapping(bucket.val, empty.raw.val as *mut V, 1);
                }
                return self.mark_minimized(empty.raw, hpa);
            },
            // If we find a full bucket, check whether it is minimized. If so,
            // check the next cell; otherwise we cannot minimize this bucket
            // right now.
            Full(full) => {
                if !self.is_minimized(full, hpa) {
                    return self.mark_unminimized(bucket, hpa);
                }
                target.next();
            },
        }
    }
    // If everything between our ideal index and our current index is minimized,
    // that means we're minimized
    self.mark_minimized(bucket, hpa);
}
```
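The overall effect of `minimize_displacement` can be illustrated with a toy table. The sketch below models buckets as `Option<u64>` with identity "hashing" (an assumption for the demo, not the real `RawTable`), and slides a full bucket back toward its ideal slot through the first empty bucket it finds:

```rust
// Toy model of displacement minimization: an entry's displacement is its
// distance from the slot its hash maps to. Sliding it back into the first
// empty slot between its ideal and current position minimizes that distance
// given the occupied slots in between.
fn minimize_slot(buckets: &mut [Option<u64>], idx: usize) -> usize {
    let cap = buckets.len();
    let hash = buckets[idx].expect("bucket must be full");
    let ideal = (hash as usize) % cap; // identity "hashing" for the demo
    let mut target = ideal;
    // Walk from the ideal slot toward the current slot.
    while target != idx {
        if buckets[target].is_none() {
            // Migrate into the empty slot, leaving the old slot empty.
            buckets[target] = buckets[idx].take();
            return target;
        }
        target = (target + 1) % cap;
    }
    idx // already as close to its ideal slot as the occupancy allows
}

fn main() {
    // Entry with hash 3 was displaced to slot 5; slot 3 has since emptied.
    let mut buckets: Vec<Option<u64>> = vec![None; 8];
    buckets[2] = Some(2);
    buckets[4] = Some(4);
    buckets[5] = Some(3);
    let new_idx = minimize_slot(&mut buckets, 5);
    assert_eq!(new_idx, 3); // moved back to its ideal slot
    assert_eq!(buckets[5], None);
    assert_eq!(buckets[3], Some(3));
}
```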
# Drawbacks

More APIs, more complexity. Another bit of storage is required.

# Alternatives

- As always, do nothing.
- A different name (e.g. `drain_filter`).

# Unresolved questions

Is there a more efficient algorithm for minimizing displacement?