Popog/retain_iterator.md

## retain_iterator.md

      
    Raw
  

              retain_iterator.md
            
          
Feature Name: retain_iterator
Start Date: 10/6/2016
RFC PR: (leave this empty)
Rust Issue: (leave this empty)

Summary

New API for std::collections::HashMap and std::collections::HashSet. Modification to the retain API for std::vec::Vec and std::collections::vec_deque::VecDeque.
For HashMap:
/// Retains only the elements specified by the predicate.
///
/// In other words, remove all key-value pairs `(k, v)` such that `predicate(&k, &v)` returns false.
/// Removed key-value pairs are yielded by the returned iterator.
fn retain<P>(&mut self, predicate: P) -> Retain<K, V, P>
where P: FnMut(&K, &V) -> bool;

fn retain_mut<P>(&mut self, predicate: P) -> RetainMut<K, V, P>
where P: FnMut(&K, &mut V) -> bool;
For HashSet:
/// Retains only the elements specified by the predicate.
///
/// In other words, remove all elements `e` such that `predicate(&e)` returns false.
/// Removed elements yielded by the returned iterator.
fn retain<P>(&mut self, predicate: P) -> Retain<T, P>
where P: FnMut(&T) -> bool;

fn retain_mut<P>(&mut self, predicate: P) -> RetainMut<T, P>
where P: FnMut(&mut T) -> bool;
For Vec and VecDeque:
/// Retains only the elements specified by the predicate.
///
/// In other words, remove all elements `e` such that `predicate(&e)` returns false.
/// This method operates in place and preserves the order of the retained
/// elements.
/// Removed elements yielded by the returned iterator.
fn retain<P>(&mut self, predicate: P) -> Retain<T, P>
where P: FnMut(&T) -> bool;

fn retain_mut<P>(&mut self, predicate: P) -> RetainMut<T, P>
where P: FnMut(&mut T) -> bool;
Motivation

For HashMap & HashSet

Specifically I've run into the following scenario, I have two HashMap, one of ready_items and one of unready_items. Every once in a while, I want to go through all the unready_items, remove all the ones that are now ready, and add them to ready_items. Right now the only ways of doing that involve a temporary collection of some sort.
Option 1: Drain and reinsert
let new_unready: Vec<_> = unready_items.drain().filter_map(
  |(k, v)| {
    if v.is_ready() {
      ready_items.insert(k, v);
      None
    } else {
      Some((k, v))
    }
  }
).collect();

for (k, v) in new_unready { unready_items.insert(k, v); }
Option 2: Filter and remove
let new_ready = Vec<_> unready_items.iter().filter_map(
  |(k, v)| if v.is_ready() { Some(k.clone()) } else { None }
).collect();

for k in new_ready {
  let v = unready_items.remove(k).unwrap();
  ready_items.insert(k, v);
}
On top of the temporary collection, Option 1 requires reinserting all values which were supposed to be retained, and Option 2 requires copying keys, both logically unnecessary and potentially expensive operations.
The new API avoids all of these issues.
for (k, v) in unready_items.retain(|_, v| !v.is_ready()) {
  ready_items.insert(k, v);
}

For retain_mut, as we already have have mutable ownership of the buffer, there does not seem to be a reason not to pass &mut. As for why not just a single function that always does &mut, this is done to minimize breakage with the current Vec API.
For Vec & VecDeque

The motivation to return an iterator is similar to the above reasons for HashMap and HashSet. Also, to keep retain functionality consistent across containers.
Detailed design

For HashMap & HashSet

The first step is to duplicate most of the boilerplate for existing iterator functions (e.g. drain). We then define our table::Retain struct as follows with some simple trait implementations:
pub struct Retain<'a, K: 'a, V: 'a, P: FnMut(&'a K, &'a V) -> bool> {
    table: Shared<RawTable<K, V>>,
    predicate: P,
    /// Something to iterate through the table
    iter: ...,
    marker: marker::PhantomData<&'a RawTable<K, V>>,
}

unsafe impl<'a, K: Sync, V: Sync, P: FnMut(&'a K, &'a V) -> bool> Sync for Retain<'a, K, V, P> {}
unsafe impl<'a, K: Send, V: Send, P: Send+FnMut(&'a K, &'a V) -> bool> Send for Retain<'a, K, V, P> {}

impl<'a, K: 'a, V: 'a, P: FnMut(&'a K, &'a V) -> bool> Drop for Retain<'a, K, V, P> {
    fn drop(&mut self) {
        for _ in self {}
    }
}
For the implementation of Iterator, the simplest option just uses shift after every removal. However this option is not very performant. Worst case, requires O(n²) shifts. But, by simply reversing our iteration order, and starting at an empty bucket, this problem can be eliminated.
To accomplish this, we'll make a new helper iterator type (drain will also probably need to use this):
struct RevRawBuckets<'a, K: 'a, V: 'a> {
    raw: RawBucket<K, V>,
    hashes_start: *const HashUint,
    end: RawBucket<K, V>,
    elems_left: usize,
    marker: marker::PhantomData<&'a (K, V)>,
}

impl<K, V> RawTable<K, V> {
    fn rev_raw_buckets(&mut self) -> RevRawBuckets<K, V> {
        let mut raw = self.first_bucket_raw();
        let end = unsafe { raw.offset(self.capacity as isize); }

        // Find the first empty bucket
        if self.size > 0 {
          while let Full(_) = raw.peek() {
              debug_assert!(raw.hash != end.hash);
              raw = unsafe { raw.offset(1) };
          }
        }

        RevRawBuckets {
            raw: raw,
            hashes_start: self.hashes,
            end: end,
            elems_left: self.size,
            marker: marker::PhantomData,
        }
    }
}


impl<'a, K, V> Iterator for RevRawBuckets<'a, K, V> {
    type Item = RawBucket<K, V>;

    fn next(&mut self) -> Option<RawBucket<K, V>> {
        if self.elems_left == 0 {
            return None;
        }

        loop {
            if self.raw.hash == self.hashes_start {
                self.raw = self.end;
            }

            self.raw = unsafe { self.raw.offset(-1) };

            if *self.raw.hash != EMPTY_BUCKET {
                self.elems_left -= 1;
                return Some(self.raw);
            }
        }
    }
}

impl<'a, K, V> ExactSizeIterator for RevRawBuckets<'a, K, V> {
    fn len(&self) -> usize { self.elems_left }
}
We can now define the type of the field Retain.iter to be RevRawBuckets<'a, K, V>, which lets us define the constructor and Iterator implementation:
impl<K, V> RawTable<K, V> {
    pub fn retain<P: FnMut(&'a K, &'a V) -> bool>(&mut self, predicate: P) -> Retain<K, V, P> {
        Retain {
            table: unsafe { Shared::new(self) },
            predicate: predicate,
            iter: self.rev_raw_buckets(),
            marker: marker::PhantomData,
        }
    }
}

impl<'a, K, V, P: FnMut(&'a K, &'a mut V) -> bool> Iterator for Retain<'a, K, V, P> {
    type Item = (SafeHash, K, V);

    fn next(&mut self) -> Option<(SafeHash, K, V)> {
        while let Some(bucket) = self.iter.next() {
            let pair_mut = raw.pair as *mut (K, V);
            if !(self.predicate)(&(*pair_mut).0, &mut (*pair_mut).1) {
                // Extract the values from the table
                let (hash, (k, v)) = unsafe {
                    (**self.table).size -= 1;
                    (ptr::replace(bucket.hash, EMPTY_BUCKET), ptr::read(bucket.pair))
                }
                let hash = SafeHash { hash: hash };

                // Shift to restore the invariant
                let empty = EmptyBucket {
                  raw: bucket,
                  idx: bucket.hash as usize - table.hashes_start as usize,
                  table: &self.table,
                };
                let mut gap = match empty.gap_peek() {
                    Some(b) => b,
                    None => return Some((hash, k, v)),
                };

                while gap.full().displacement() != 0 {
                    gap = match gap.shift() {
                        Some(b) => b,
                        None => break,
                    };
                }

                return Some((hash, k, v));
            }
        }

        None
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        (0, Some(self.iter.len()))
    }
}
For Vec & VecDeque

Replace the existing retain function with a version which sets the length to 0 to start with and returns an iterator object Retain. This is done for much the same reason it's done in drain, to ensure that removed items are not accessible if Drop is not called.
The Retain object implements Sync, Send, and Drop in much the same ways that Drain does.
pub struct Retain<'a, T: 'a, P: FnMut(&T) -> bool> {
    /// Iterator to track current position in the
    iter: Enumerate<slice::Iter<'a, T>>,
    vec: Shared<Vec<T>>,
    predicate: P,
}

#[stable(feature = "drain", since = "1.6.0")]
unsafe impl<'a, T: Sync, P: FnMut(&T) -> bool> Sync for Drain<'a, T> {}
#[stable(feature = "drain", since = "1.6.0")]
unsafe impl<'a, T: Send: P: Send + FnMut(&T) -> bool> Send for Drain<'a, T> {}
pub fn retain<P: FnMut(&T) -> bool>(&mut self, predicate: P) -> Retain<T> {
    unsafe {
        // Use the borrow in the IterMut to indicate borrowing behavior of the
        // whole Retain iterator (like &mut T).
        let range_slice = slice::from_raw_parts_mut(self.as_mut_ptr(), self.len());

        // set self.vec length's to 0, to be safe in case Retain is leaked
        self.set_len(0);

        Retain {
           iter: range_slice.iter().enumerate(),
           vec: Shared::new(self as *mut _),
           predicate: predicate,
        }
    }
}
Unlike the current retain implementation, which uses swap, we will be returning items each iteration, so our Iterator implementation loops through checking if items satisfy the predicate. Items which fail, are returned by value, items which succeed are written to the correct index in the buffer, and the vec size is updated. Like the current implementation of retain, items which satisfy the predicate before any deletions take place don't incur a write penalty, as they're already in the correct index in the buffer.
impl<'a, T> Iterator for Retain<'a, T> {
    type Item = T;

    #[inline]
    fn next(&mut self) -> Option<T> {
        while let Some((i, v)) = self.next() {
            let len = self.vec.len();
            if !self.predicate(v) {
                let v = unsafe { ptr::read(v as *const _) };
                return Some(v);
            } else if i > len {
              unsafe {
                let dst = self.vec.get_unchecked_mut(len);
                ptr::copy_nonoverlapping(dst as *mut T, v as *const T, 1);
              }
            }
            unsafe { self.vec.set_len(len + 1); }
        }
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        (0, self.iter.len())
    }
}
Drawbacks


More APIs, more complexity.
Technically breaks mem::forget(vec.retain(bar)), as currently retain returns (). I seriously doubt anyone is doing this.
Like iter and drain, we expose the HashMap iteration order and thus enable O(n²) blowup.

A potential mitigation is to randomly alternate between 2+ disjoint iterators, though the performance cost might be too high.


Alternatives


As always, do nothing.
Merge retain and retain_mut.

Unresolved questions


Is it worth the duplication of retain_mut to avoid breakages? Would it break any existing code?