@rob-p
Last active July 10, 2023 19:56
Rust Iterator Question

Note: This is cross-posted from reddit.

I've been trying to determine (a) whether it's possible to achieve something in Rust and (b) if so, how. I will try to abstract the problem as much as possible, since the details of this in my code are rather boring and not essential to describing the problem.

I have a program that extracts information from the combination of a polars data frame and a paired file. This is for a genomics application, and for those interested, the data frame contains the locations of sequence features (exons, transcripts, etc.) and the file contains the genome sequence (chromosome by chromosome). The genome is large, so we prefer not to load the whole thing into memory, and instead iterate over it chromosome by chromosome and then feature by feature, yielding each sequence feature one at a time.

It turns out it's relatively simple to write a "single sequence" iterator (i.e. an iterator that yields the sequences of all of the features of one chromosome). It looks something like this:

struct ChrRowSeqIter<'a> {
    iters: Vec<polars::series::SeriesIter<'a>>,
    record: &'a Record,
}

where the lifetime 'a is the lifetime of a polars data frame that is being referenced and record is the paired sequence record for a specific chromosome. This iterator works fine.
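As a rough illustration of that shape (not the actual polars-based code), here is a minimal std-only sketch. The `Record` definition, the names `ChrFeatureIter` and `features_of`, and the use of `(start, end)` pairs in place of polars series are all assumptions made for the example.

```rust
/// Hypothetical stand-in for the sequence record of one chromosome.
struct Record {
    seq: String,
}

/// Simplified analog of `ChrRowSeqIter`: it borrows the feature
/// coordinates and the record, both for the same lifetime `'a`.
struct ChrFeatureIter<'a> {
    spans: std::slice::Iter<'a, (usize, usize)>,
    record: &'a Record,
}

impl<'a> Iterator for ChrFeatureIter<'a> {
    // Each item is an owned String, so nothing yielded borrows from the iterator.
    type Item = String;

    fn next(&mut self) -> Option<String> {
        let &(start, end) = self.spans.next()?;
        Some(self.record.seq[start..end].to_owned())
    }
}

/// Collect the sequences of all features on one chromosome.
fn features_of(record: &Record, spans: &[(usize, usize)]) -> Vec<String> {
    ChrFeatureIter { spans: spans.iter(), record }.collect()
}
```

For example, calling `features_of` on a record holding `"ACGTTGCA"` with the spans `(0, 4)` and `(4, 8)` yields the two owned strings `"ACGT"` and `"TGCA"`.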

Now, the problem: I'd like to have an iterator that essentially chains together many of the above iterators transparently to yield, in turn, all features over the entire genome. So this would basically create a ChrRowSeqIter for chromosome 1, yield its entries, then move on to chromosome 2, etc. The problem, then, is that I will have to create a ChrRowSeqIter that first references a data frame for chromosome 1, then chromosome 2, etc., and at the same time the sequence record for chromosome 1, then chromosome 2, etc.

The way things are currently structured in the program is that this "outer" iterator takes ownership of a data frame, let's say X, from which we will create the per-chromosome data frame that is borrowed out to each ChrRowSeqIter. The outer iterator should do the following:

* While the current ChrRowSeqIter has entries left, yield them one by one (note that the yielded entry is returned by move, so it's not lending out anything dependent on its own lifetime).
* When the current ChrRowSeqIter is exhausted, read the next chromosome from file, and prepare a ChrRowSeqIter with the features corresponding to this chromosome. This data frame is created from X, and then borrowed out to ChrRowSeqIter.
* When there are no more records to read from file, return None.
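The three steps above can be sketched with std types only. Note that this sketch sidesteps the borrowing question rather than solving it: instead of lending a per-chromosome view of X to an inner iterator, it copies the coordinates for the current chromosome into owned state. The names `Feature`, `OwningOuterIter` and `all_features`, and the `(name, sequence)` pairs standing in for the file reader, are assumptions made for the example.

```rust
/// Hypothetical stand-in for one row of the owned data frame X:
/// (chromosome name, start, end).
type Feature = (String, usize, usize);

struct OwningOuterIter<R> {
    features: Vec<Feature>, // plays the role of X
    records: R,             // plays the role of the sequence file reader
    // Owned state for the current chromosome: its sequence and its spans.
    current: Option<(String, std::vec::IntoIter<(usize, usize)>)>,
}

impl<R: Iterator<Item = (String, String)>> Iterator for OwningOuterIter<R> {
    type Item = String;

    fn next(&mut self) -> Option<String> {
        loop {
            // Step 1: yield from the current chromosome while entries remain.
            if let Some((seq, spans)) = self.current.as_mut() {
                if let Some((start, end)) = spans.next() {
                    return Some(seq[start..end].to_owned());
                }
            }
            // Step 3: no more records means the whole iteration is done.
            let (name, seq) = self.records.next()?;
            // Step 2: prepare the spans that belong to the next chromosome.
            let spans: Vec<(usize, usize)> = self
                .features
                .iter()
                .filter(|f| f.0 == name)
                .map(|f| (f.1, f.2))
                .collect();
            self.current = Some((seq, spans.into_iter()));
        }
    }
}

/// Drive the outer iterator to completion and collect every feature sequence.
fn all_features(features: Vec<Feature>, records: Vec<(String, String)>) -> Vec<String> {
    OwningOuterIter { features, records: records.into_iter(), current: None }.collect()
}
```

The cost of this design is one extra copy of the coordinates per chromosome; whether that is acceptable depends on how large the per-chromosome data frame is.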

For the life of me, I cannot build such an iterator. The fundamental challenge seems to be that ChrRowSeqIter refers to a data frame created from X, but X is owned by the outer iterator. Therefore, this is a form of a self-referential struct, I guess. When I implement next(), I get something like the following:

impl<'a> Iterator for OuterIter<'a> {

  fn next(&mut self) -> Option<Self::Item> {
    // ... stuff 
    self.next_chr_row_seq_iter = Some( ChrRowSeqIter(&self.X, &self.next_chromosome) );
  }
}

At that inner assignment (of self.next_chr_row_seq_iter), the compiler complains that '1 must outlive 'a, where '1 is the anonymous lifetime of the &mut self borrow. I can't figure out how to shake this.

Strangely, if I don't attempt to implement the actual iterator trait, I can convince the compiler with the following:

impl<'a> OuterIter<'a> {

  fn next_entry<'b>(&'b mut self) -> Option<Self::Item> 
  where 'b: 'a {
    // ... stuff
    self.next_chr_row_seq_iter = Some( ChrRowSeqIter(&self.X, &self.next_chromosome) );
  }

}

The reasoning is that this is safe because the borrow of self ('b) outlives the thing we are creating inside self (which is true, because self owns X). Of course, the signature of next() in the Iterator trait is fixed as fn next(&mut self), with no room for an extra bound like 'b: 'a, so this isn't an option for implementing the real Iterator trait.
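One caveat with the bound `'b: 'a` is worth spelling out. Since `&'b mut OuterIter<'a>` already requires `'a: 'b`, adding `'b: 'a` makes the two lifetimes equal, so the first call keeps the value mutably borrowed for as long as it exists. The minimal std-only sketch below shows this; `Outer`, its fields and `first_entry_only` are hypothetical stand-ins, with a `Vec<u32>` playing the role of X.

```rust
struct Outer<'a> {
    data: Vec<u32>,                             // owned, like X
    current: Option<std::slice::Iter<'a, u32>>, // borrows from `data`
}

impl<'a> Outer<'a> {
    // Same trick as `next_entry` above: require the borrow of `self`
    // to last at least as long as `'a`.
    fn next_entry<'b>(&'b mut self) -> Option<u32>
    where
        'b: 'a,
    {
        if self.current.is_none() {
            // Accepted only because of the `'b: 'a` bound.
            self.current = Some(self.data.iter());
        }
        self.current.as_mut()?.next().copied()
    }
}

fn first_entry_only() -> Option<u32> {
    let mut outer = Outer { data: vec![10, 20, 30], current: None };
    let first = outer.next_entry();
    // A second `outer.next_entry()` here would be rejected with E0499
    // (cannot borrow `outer` as mutable more than once at a time),
    // because the first call borrowed `outer` for all of `'a`.
    first
}
```

So the method compiles, but a caller can invoke it only once per value, which means it could not drive a loop the way `Iterator::next` needs to.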

I've been using Rust for a while now, and usually have been able to easily reorganize my code to satisfy the borrow checker. This one really threw me for a loop though, because it's more "library" type code than the "application" type code I normally write, and I've been unable to really figure this one out. I'd really appreciate feedback, input and suggestions from the community here on the best path forward. Is something like this simply impossible (in which case, how could one reorganize the code to allow something like an OuterIter)? If not, what must I do to convince the borrow checker to allow what I am trying to do? Thanks!
