Skip to content

Instantly share code, notes, and snippets.

@nikomatsakis
Last active January 19, 2019 12:21
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save nikomatsakis/6dd99a0126e3f82c64a6d4a6f9989c63 to your computer and use it in GitHub Desktop.
Save nikomatsakis/6dd99a0126e3f82c64a6d4a6f9989c63 to your computer and use it in GitHub Desktop.
polonius-blog-posts

In my previous post about Polonius and subregion obligations, I mentioned that there needs to be a follow-up to deal with higher-ranked subregions. This post doesn't really provide a solution. Instead, it just kinds of digs a bit into what the problem is in the first place.

The subset relation in Polonius is not enough

In my original post on Polonius, I assumed that when we computed a subtype relation T1 <: T2 between two types, the result was either a hard error or a set of subset relations between various regions. So, for example, if we had a subtype relation between two references:

&'a u32 <: &'b u32

the result would be a subset relation 'a: 'b (or, "'a contains a subset of the loans in 'b").

For a more complex case, consider the relationship of two fn types:

fn(&'a u32) <: fn(&'b u32)
// ^^^^^^^     ^^^^^^^^^^
// |           A fn expecting a `&'b u32` as argument.
// |
// A fn expecting a `&'a u32` as argument.

If we imagine that we have some variable f of type fn(&'a u32) -- that is, a fn that can be called with a 'a reference -- then this subtype relation is saying that f can be given the type fn(&'b u32) -- that is, a fn that can be called with a 'b reference. That is fine so long as that 'b reference can be used as a 'a reference: that is, &'b u32 <: &'a u32. So, we can say that the two fn types are subtypes so long as 'b: 'a (note that the order is reversed from the first example; this is because fn types are contravariant in their argument type).

Unfortunately, this structure isn't flexible enough to accommodate a subtyping question involving higher-ranked types. Consider a subtype relation like this:

fn(&'a u32) <: for<'b> fn(&'b u32)
//             ^^^^^^^
//             Unlike before, the supertype
//             expects a reference with *any*
//             lifetime as argument.

What subtype relation should come from this? We can't say 'b: 'a as before, because the lifetime 'b isn't some specific region -- rather, the supertype says that the function has to accept a reference with any lifetime 'b. (In fact, this subtyping relation should ultimately yield an error -- but we would like that error to arise not during subtyping, but rather later, when )

To express a constraint like this, we need a richer set of constraints than just subset. In fact, what rustc implements is something more like this:

Constraint = Subset
           | Constraint, Constraint
           | forall<R1> { Constraint }
           | exists<R1> { Constraint }

Subset = R1: R2  

Now we can say that

fn(&'a u32) <: for<'b> fn(&'b u32)

holds if the constraint forall<'b> { 'b: 'a } holds, which implies that 'a has to contain all possible loans. (This would of course be possible if 'a were an "infinite set" covering the whole program or some such thing, but we can treat it as an error.)

Background: universes

In fact, rustc has recently changed how it handles type checks that involve forall-style constraints. In PR #55517, we adopted a system based on the concept of universes, which is something I first learned about when studying up on how Lambda Prolog is implemented. If you're interested in some background material, I gave a talk at a recent Papers We Love meetup where I covered the paper that explains how it works. I also wrote a section in the rustc-guide on the topic some time back.

The basic idea of a universe is that it represents a set of names. There is a notion of the "root universe" U0 that contains the "global" names in the program. It's easiest to explain in terms of types: the global names might be the structs from the program. When we want to reason about a "forall" constraint, though, we do that by introducing a fresh, placeholder name that is distinct from all other names in the program. To do this, we create a new universe U1 that is a child of the root univese -- the idea is that U1 contains all the global names from U0, but also this additional placeholder name.

Next, we associate each inference variable with a universea. The inference variable is only allowed to be inferred to values that contain names from that universe.

So, as a simple example, imagine that we have a constraint like the following (this is not a region constraint, just a general logic equation):

forall<A> { exists<B> { A = B } }

Here we are saying, for all types A, there exists some type B where A = B. Well, that's clearly true, we can just infer B to be A. In terms of universes, we would say that the name A is bound to a placeholder in U1, and the inference variable B is created in that same universe U1.

But if we reverse the order of "forall" and "exists", we want to get an error:

exists<B> { forall<A> { A = B } }

In this case, the variable B would wind up in the root universe U0, but the placeholder A is still in the child universe U1. Therefore, the value for B cannot be A, because A is not nameable from U0.

Conclusion (for now)

So now we have this problem. To encode the "solutions" to higher-ranked subtyping and trait matching, we need to use this richer notion of constraints that include forall and exists quantifiers. In fact, I think we will eventually want something even stronger, that includes implication and maybe even "or", so we might as well "scale up" to hereditary harrop constraints (also a notion I learned about through Lambda Prolog):

Constraint = Subset
           | Constraint, Constraint // and
           | Constraint; Constraint // or
           | forall<R1> { Constraint }
           | exists<R1> { Constraint }
           | if (Assumption) { Constraint }
           
Assumption = Subset
           | forall<R1> { Assumption }
           | if (Constraint) { Assumption }

Subset = R1: R2  

But now we have to map these complex constraints down to the simple subset relation that Polonius uses. I'll talk about how to do that in the next post.

In my previous post, I introduced this notion of more complex region constraints, which corresponded to hereditary harrop constraints:

Constraint = Subset
           | Constraint, Constraint // and
           | Constraint; Constraint // or
           | forall<R1> { Constraint }
           | exists<R1> { Constraint }
           | if (Assumption) { Constraint }
           
Assumption = Subset
           | forall<R1> { Assumption }
           | if (Constraint) { Assumption }

Subset = R1: R2  

But we need a way to translate these constraints into the simpler subset relation of Polonius.

My proposal is to do this by leveraging the logic programming solver. Essentially, we will write some Prolog-like rules that dictate when a Subset is provable -- and those rules will generate a set of simpler, first-order constraints that can be fed to polonius. (Actually, the rules will generate multiple sets of first-order constraints, but we'll get to that particular problem later.)

Notation

I'm going to use a kind of special goal in my notation:

GenerateConstraint[R1: R2] :-
  Universe(R1, U0), 
  Universe(R2, U0).

As the prolog rule states, this rule is only applicable if R1 and R2 are in the root univese. What makes this rule special is that it is (in some sense) "side-effecting" -- basically, each time we prove this rule, we generate a first-order constraint that must be fed to polonius.

From a different perspective, once we've found a proof tree for the constraint, we can create the Polonius subset constraints by extracting all GenerateConstraint[R1: R2] nodes from the proof tree.

(An alternative would be to formulate the "output constraints" as an explicit logic variable, but that notation is a lot more tedious.)

Reflexive rule and root universe subsets

The first rule is the simplest. Every region outlives itself. Therefore:

R1: R1.

Next, Any time we have a relation between two root universe variables, we can prove it by "recording" the constraint:

(R1: R2) :-
  GenerateConstraint[R1: R2].

So, as a simple example, if we had to solve the subtyping constraint

&'a u32 <: &'b u32

this would give rise to the facts a region constraint 'a: 'b, where 'a and 'b are in the root universe U0. We can then treat this as a goal to prove, resulting in the following proof tree:

- 'a: 'b
  - GenerateConstraint['a: 'b]
    - Universe('a, U0).
    - Universe('b, U0).

So now we know that we need the polonius constraint of 'a: 'b. Wow.

Outside of the root universe

Where things get more interesting is when either R1 or R2 is a placeholder and hence outside the root universe. In that case, we can't relate them directly -- what we need to do instead is to relate them through intermediaries.

(R1: R3) :-
  (not { RootUniverse(R1) }; not { RootUniverse(R3) }),
  (R1: R3),
  (R2: R3).

mentioned that there needs to be a follow-up to deal with higher-ranked subregions. This post kinds of digs into one approach that I've been thinking about. The idea is actually not to deal with higher-ranked problems within Polonius, but rather to solve such problems before-hand, basically reducing them to the sorts of subset relations that Polonius can solve with ease. Along the way, we'll discuss some of what makes how we deal with region constraints in Chalk, and some of the challenges they pose to an efficient implementation (plus the workarounds we presently use in the compiler).

The subset relation in Polonius is not enough

In my original post on Polonius, I assumed that when we computed a subtype relation T1 <: T2 between two types, the result was either a hard error or a set of subset relations between various regions. So, for example, if we had a subtype relation between two references:

&'a u32 <: &'b u32

the result would be a subset relation 'a: 'b (or, "'a contains a subset of the loans in 'b").

For a more complex case, consider the relationship of two fn types:

fn(&'a u32) <: fn(&'b u32)
// ^^^^^^^     ^^^^^^^^^^
// |           A fn expecting a `&'b u32` as argument.
// |
// A fn expecting a `&'a u32` as argument.

If we imagine that we have some variable f of type fn(&'a u32) -- that is, a fn that can be called with a 'a reference -- then this subtype relation is saying that f can be given the type fn(&'b u32) -- that is, a fn that can be called with a 'b reference. That is fine so long as that 'b reference can be used as a 'a reference: that is, &'b u32 <: &'a u32. So, we can say that the two fn types are subtypes so long as 'b: 'a (note that the order is reversed from the first example; this is because fn types are contravariant in their argument type).

Unfortunately, this structure isn't flexible enough to accommodate a subtyping question involving higher-ranked types. Consider a subtype relation like this:

fn(&'a u32) <: for<'b> fn(&'b u32)
//             ^^^^^^^
//             Unlike before, the supertype
//             expects a reference with *any*
//             lifetime as argument.

What subtype relation should come from this? We can't say 'b: 'a as before, because the lifetime 'b isn't some specific region -- rather, the supertype says that the function has to accept a reference with any lifetime 'b. (In fact, this subtyping relation should ultimately yield an error -- but we would like that error to arise not during subtyping, but rather later, when )

To express a constraint like this, we need a richer set of constraints than just subset. In fact, what rustc implements is something more like this:

Constraint = Subset
           | Constraint, Constraint
           | forall<R1> { Constraint }
           | exists<R1> { Constraint }

Subset = R1: R2  

Now we can say that

fn(&'a u32) <: for<'b> fn(&'b u32)

holds if the constraint forall<'b> { 'b: 'a } holds, which implies that 'a has to contain all possible loans. (This would of course be possible if 'a were an "infinite set" covering the whole program or some such thing, but we can treat it as an error.)

Side note: universes

In fact, rustc has recently changed how it handles type checks that involve forall-style constraints. In PR #55517, we adopted a system based on the concept of universes, which is something I first learned about when studying up on how Lambda Prolog is implemented. If you're interested in some background material, I gave a talk at a recent Papers We Love meetup where I covered the paper that explains how it works. I also wrote a section in the rustc-guide on the topic some time back.

The basic idea of a universe is that it represents a set of names. There is a notion of the "root universe" U0 that contains the "global" names in the program. It's easiest to explain in terms of types: the global names might be the structs from the program. When we want to reason about a "forall" constraint, though, we do that by introducing a fresh, placeholder name that is distinct from all other names in the program. To do this, we create a new universe U1 that is a child of the root univese -- the idea is that U1 contains all the global names from U0, but also this additional placeholder name.

Next, we associate each inference variable with a universea. The inference variable is only allowed to be inferred to values that contain names from that universe.

So, as a simple example, imagine that we have a constraint like the following (this is not a region constraint, just a general logic equation):

forall<A> { exists<B> { A = B } }

Here we are saying, for all types A, there exists some type B where A = B. Well, that's clearly true, we can just infer B to be A. In terms of universes, we would say that the name A is bound to a placeholder in U1, and the inference variable B is created in that same universe U1.

But if we reverse the order of "forall" and "exists", we want to get an error:

exists<B> { forall<A> { A = B } }

In this case, the variable B would wind up in the root universe U0, but the placeholder A is still in the child universe U1. Therefore, the value for B cannot be A, because A is not nameable from U0.

A richer notion of region constraints

So now we have this problem. We have this richer notion of constraints, that include forall and exists quantifiers, and we need to map this to polonius constraints, which are just sets of subset edges. In fact, I think we actually want a somewhat richer notion of constraints, one that includes implication as well (I'll use the Chalk notation of "if" for that):

Constraint = Subset
           | Constraint, Constraint
           | forall<R1> { Constraint }
           | exists<R1> { Constraint }
           | if (Constraint) { Constraint }

Subset = R1: R2  

This maps pretty closely to the notion of hereditary harrop clauses from Lambda Prolog.

Including implication is somewhat interesting. It implies we might have a constraint like this, for example:

exists<'a> {
  forall<'b> {
    if ('a: 'b) {
      'a: 'b
    }
  }
}

Here, even though the variable 'a is in a universe which cannot name 'b, we still know that 'a: 'b is true (because we assumed it), and so 'a: 'b is provable.

Let's look at another example:

exists<'a> {
  forall<'b> {
    if ('static: 'b) {
      'a: 'b
    }
  }
}

Here, 'a cannot name 'b, but it can name 'static. And so for 'a to outlive

As I wrote, rustc adopted a universe-based approach in #55517.a `'static, bbu "forall" things by introducing a fresh univese

it encodes a set of loans. Each region (like 'a) is assigned to a specific universe and that encodes the loans that it can contain. Regions are structured into a tree, where a child universe U2 extends its parent U1 with additional loans that are not included in U1. In other words, the child U2 is a strict superset of the parent U1.

For our purposes, we will always have a root universe that contains all the loans that appear in the current function being checked. We'll create new universes to reason about higher-ranked functions and trait matching. Just as in rustc, we can represent universes as a simple integer. We call the root universe U0, and thus a child universe might be U1 -- because U0 < U1, we know that U1 is a child of U0.

So, let's go back to that example of

fn(&'a u32) <: for<'b> fn(&'b u32)

If you recall, that generated a constraint forall<'b> { 'b: 'a }. We might encode that by saying that the region variable 'a is in the universe U0, but the region variable 'b is in a child universe U1. The idea is that because 'b corresponds to "any set of loans", we model that as potentially containing some (as yet unknown) loans that couldn't be named by 'a.

Now that NLL has been shipped, I've been doing some work revisiting the Polonius project. Polonius is the project that implements [the "alias-based formulation" described in my older blogpost][pp]. Polonius has come a long way since that post; it's now quite fast and also experimentally integrated into rustc, where it passes the full test suite.

[pp]: {{ site.baseurl }}/blog/2018/04/27/an-alias-based-formulation-of-the-borrow-checker/

However, polonius as described is not complete. It describes the core "borrow check" analysis, but there are a number of other checks that the current implementation checks which polonius ignores:

  • Polonius does not account for moves and initialization.
  • Polonius does not check for relations between named lifetimes.

This blog post is focused on the second of those bullet points. It covers the simple cases; hopefully I will soon post a follow-up that targets some of the more complex cases that can arise.

Brief Polonius review

If you've never read the [the original Polonius post][pp], you should probably do so now. But if you have, let me briefly review some of the key details that are relevant to this post:

  • Instead of interpreting the 'a notation as the lifetime of a reference (i.e., a set of points), we interpret 'a as a set of loans. We refer to 'a as a "region"1 in order to emphasize this distinction.
  • We call 'a: 'b a subset relation; it means that the loans in 'a must be a subset of the loans in 'b. We track the required subset relations at each point in the program.
  • A loan comes from some borrow expression like &foo. A loan L0 is "live" if some live variable contains a region 'a whose value includes L0. When a loan is live, the "terms of the loan" must be respected: for a shared borrow like &foo, that means the path that was borrowed (foo) cannot be mutated. For a mutable borrow, it means that the path that was borrowed cannot be accessed at all.
    • If an access occurs that violates the terms of a loan, that is an error.

Running Example 1

Let's give a quick example of some code that should result in an error, but which would not if we only considered the errors that polonius reports today:

fn foo<'a, 'b>(x: &'a [u32], y: &'b [u32]) -> &'a u32 {
    &y[0]
}

Here, we declared that we are returning a &u32 with lifetime 'a (i.e., borrowed from x) but in fact we are returning data with lifetime 'b (i.e., borrowed from y).

Slightly simplified, the MIR for this function looks something like this:

fn foo(_1: &'a [u32], _2: &'b [u32]) -> &'a [u32] {
  _0 = &'X (*_2)[const 0usize]; // S0
  return;                       // S1
}  

As you can see, there's only really one interesting statement; it borrows from _2 and stores the result into _0, which is the special "return slot" in MIR.

In the case of the parameters _1 and _2, the regions come directly from the method signature. For regions appearing in the function body, we create fresh region variables -- in this case, only one, 'X. 'X represents the region assigned to the borrow.

The relevant polonius facts for this function are as follows:

  • base_subset('b, 'X, mid(S0)) -- as described in the NLL RFC, "re-borrowing" the referent of a reference (i.e., *_2) creates a subset relation between the region of the region (here, 'b) and the region of the borrow (here, 'X). Written in the notation of the [NLL RFC], this would be the relation 'X: 'b @ mid(S0).
  • base_subset('X, 'a, mid(S0)) -- the borrow expression in S0 produces a result of type &'X u32. This is then assigned to _0, which has the type &'a [u32]. The subtyping rules require that 'X: 'a.

Combining the two base_subset relations allows us to conclude that the full subset relation includes subset('b, 'a, mid(S0)) -- that is, for the function to be valid, the region 'b must be a subset of the region 'a. This is an error because the regions 'a and 'b are actually parameters to foo; in other words, foo must be valid for any set of regions 'a and 'b, and hence we cannot know if there is a subset relationship between them. This is a different sort of error than the "illegal access" errors that Polonius reported in the past: there is no access at all, in fact, simply subset relations.

Placeholder regions and loans

There is an important distinction between named regions like 'a and 'b and the region 'X we created for a borrow. The definition of foo has to be true for all regions 'a and 'b, but for a region like 'X there only has to be some valid value. This difference is often called being universally quantified (true for all regions) versus existentially quantified (true for some region).

In this post, I will call universally quantified regions like 'a and 'b "placeholder" regions. This is because they don't really represent a known quantity of loans, but rather a kind of "placeholder" for some unknown set of loans.

Next, we'll introduce the idea of a placeholder loan. Whereas previously all the Polonius loans represented some restrictions created by borrow expressions that appeared within the source function, placeholder loans represent restrictions created by borrow expressions we can't see. There is one placeholder loan created for each of the placeholder regions.

To represent these placeholders, we'll add a relation placeholder_region that relates the region and its corresponding loan:

.decl placeholder_region(R: region, L: loan)
.input placeholder_region

This fact is true for any placeholder region. So in our example we might have

placeholder_region('a, loan_a).
placeholder_region('b, loan_b).

Here loan_a and loan_b represent the placeholder loans for 'a and 'b respectively.

(This placeholder_region is similar but different to an existing relation in the polonius implementation, called universal_region. The main difference is that we now include the placeholder loan -- these loans will be particularly useful in the next post, when we go beyond named parameters like 'a or 'b to talk about the higher-ranked regions that appear in types like for<'a> fn(&'a u32).)

Renaming the "requires" relation from Polonius

The [original polonius formulation][pp] included a relation requires:

.decl requires(R:region, L:loan, P:point)

which can be read as "the region R requires the terms of the loan L to be enforced at the point P".

In retrospect, though, I think I should have called this relation contains, as it basically indicates set membership -- that is, it indicates that -- at the point P -- the loan L is a member of the region R (L ∈ R). So I'm going to retroactively rename it to contains:

.decl contains(R:region, L:loan, P:point)

Extending the "contains" relation with placeholder

We can extend this to include placeholders like so:

contains(R, L, P) :- placeholder_region(R, L).

Note that P here from the header is actually unbound in the body. This is not in fact valid datalog. We can make it valid datalog by having an input relation for CFG nodes:

.decl cfg_node(P: point)

Then we can rewrite our requires rule as to use cfg_node:

contains(R, L, P) :- placeholder_region(R, L), cfg_node(P).

If we go back to our running example function, we can derive the following clauses:

contains('a, loan_a, mid(S0))
contains('b, loan_b, mid(S0))

If you recall, we were also able to derive a subset relation between 'b and 'a:

subset('b, 'a, mid(S0))

Using the rules from the [previous polonius post][pp], we can combine the subset and contains relations to conclude that loan_b is a member of 'a (as 'a is a superset of 'b):

contains('a, loan_b, mid(S0))

And this precisely brings us to the point of our error: based on the signature of our function, we do not expect 'a to have loan_b as a member. But how do we express that in datalog?

Representing known relations

In practice, the contents of placeholder regions are not always totally unknown. The function signature will often include where clauses (or implied bounds) that indicate some known relationships between placeholder regions. For example, if our example foo had included a where clause like where 'b: 'a, then there would be no error.

We can represent the known relationships using an input:

.decl known_subset(R1: region, R2: region)
.input known_subset

Then we can define a known_contains rules to use these "known subset" relations to figure out which placeholder regions are known to contain which loans:

.decl known_contains(R: region, L: region)

known_contains(R, L) :- placeholder_region(R, L).
known_contains(R2, L) :- known_contains(R1, L), known_subset(R1, R2).

In our example of foo, there are no where clauses nor implied bounds, so these relations are empty. If there were a where clause like where 'b: 'a, however, then we would have a known_subset('b, 'a) fact, which would allow us to conclude the following known_contains relation:

known_contains('a, loan_a) // from `placeholder_region`
known_contains('b, loan_b) // from `placeholder_region`
known_contains('a, loan_b) // derived from `known_subset`

Detecting illegal subset relations

We can now extend the polonius rules to report errors for cases like our running example. The basic idea is this: if the function requires a subset relationship 'r1: 'r2 between two placeholder regions 'r1 and 'r2, then it must be a "known subset", or else we have an error. We can encode this like so:

.decl placeholder_error(R1: region, R2: region, P:point)

placeholder_error(R1, R2, P) :-
  contains(R1, L2, P),        // `L1` is an element of `R1`
  placeholder_region(R1, _),  // `R1` is a placeholder
  placeholder_region(R2, L2), // `L2` is the placeholder loan for `R2`
  !known_contains(R1, L2).    // `R1` is not known to contain `L2`.

In our example program, we can clearly derive placeholder_error('a, 'b, mid(S0)), and hence we have an error:

  • we saw earlier that contains('a, loan_b, mid(S0)) holds
  • as 'a is a placeholder region, placeholder_region('a, _) will appear in the input
  • similarly, placeholder_region('b, loan_b) appears in the input, giving us the connection to 'b
  • finally, the known_contains relation in our example is empty (since known_subset is empty), so !known_contains('a, loan_b) is true.

Sidenote on negative reasoning and stratification. This rule makes use of negative reasoning in the form of the !known_contains(R1, R2) predicate. Negative reasoning is fine in datalog so long as the program is "stratified" -- in particular, we must be able to compute the entire known_contains relation without having to compute subset_error. In this case, the program is trivialy stratified -- known_contains depends only on the input relation known_base_subset.)

Conclusion

This post describes a simple extension to the polonius rules that covers errors arising from subset relations. Unlike the prior rules, these errors are not triggered by any "access", but rather simply the creation of a (transitive) subset relation between two placeholder regions.

Unfortunately, this is not the complete story around region checking errors. In particular, this post ignored subset relations that can arise from "higher-ranked" types like for<'a> fn(&'a u32). Handling these properly requires us to introduce a bit more logic and will be covered in a follow-up.

Comments, if any, should be posted in the internals thread dedicated to my previous polonius post

Footnotes

Footnotes

  1. The term "region" is not an especially good fit, but it's common in academia.

Now that NLL has been shipped, I've been doing some work revisiting the Polonius project. Polonius is the project that implements [the "alias-based formulation" described in my older blogpost][pp]. Polonius has come a long way since that post; it's now quite fast and also experimentally integrated into rustc, where it passes the full test suite.

[pp]: {{ site.baseurl }}/blog/2018/04/27/an-alias-based-formulation-of-the-borrow-checker/

However, polonius as described is not complete. It describes the core "borrow check" analysis, but there are a number of other checks that the current implementation checks which polonius ignores:

  • Polonius does not account for moves and initialization.
  • Polonius does not check for relations between named lifetimes.

This blog post is focused on the second of those bullet points. It covers the simple cases; hopefully I will soon post a follow-up that targets some of the more complex cases that can arise (specifically, dealing with higher-ranked things).

Brief Polonius review

If you've never read the [the original Polonius post][pp], you should probably do so now. But if you have, let me briefly review some of the key details that are relevant to this post:

  • Instead of interpreting the 'a notation as the lifetime of a reference (i.e., a set of points), we interpret 'a as a set of loans. We refer to 'a as a "region"1 in order to emphasize this distinction.
  • We call 'a: 'b a subset relation; it means that the loans in 'a must be a subset of the loans in 'b. We track the required subset relations at each point in the program.
  • A loan comes from some borrow expression like &foo. A loan L0 is "live" if some live variable contains a region 'a whose value includes L0. When a loan is live, the "terms of the loan" must be respected: for a shared borrow like &foo, that means the path that was borrowed (foo) cannot be mutated. For a mutable borrow, it means that the path that was borrowed cannot be accessed at all.
    • If an access occurs that violates the terms of a loan, that is an error.

Running Example 1

Let's give a quick example of some code that should result in an error, but which would not if we only considered the errors that polonius reports today:

fn foo<'a, 'b>(x: &'a [u32], y: &'b [u32]) -> &'a u32 {
    &y[0]
}

Here, we declared that we are returning a &u32 with lifetime 'a (i.e., borrowed from x) but in fact we are returning data with lifetime 'b (i.e., borrowed from y).

Slightly simplified, the MIR for this function looks something like this.

fn foo(_1: &'a [u32], _2: &'b [u32]) -> &'a [u32] {
  _0 = &'X (*_2)[const 0usize]; // S0
  return;                       // S1
}  

As you can see, there's only really one interesting statement; it borrows from _2 and stores the result into _0, which is the special "return slot" in MIR.

In the case of the parameters _1 and _2, the regions come directly from the method signature. For regions appearing in the function body, we create fresh region variables -- in this case, only one, 'X. 'X represents the region assigned to the borrow.

The relevant polonius facts for this function are as follows:

  • base_subset('b, 'X, mid(S0)) -- as described in the NLL RFC, "re-borrowing" the referent of a reference (i.e., *_2) creates a subset relation between the region of the region (here, 'b) and the region of the borrow (here, 'X). Written in the notation of the [NLL RFC], this would be the relation 'X: 'b @ mid(S0).
  • base_subset('X, 'a, mid(S0)) -- the borrow expression in S0 produces a result of type &'X u32. This is then assigned to _0, which has the type &'a [u32]. The subtyping rules require that 'X: 'a.

Combining the two base_subset relations allows us to conclude that the full subset relation includes subset('b, 'a, mid(S0)) -- that is, for the function to be valid, the region 'b must be a subset of the region 'a. This is an error because the regions 'a and 'b are actually parameters to foo; in other words, foo must be valid for any set of regions 'a and 'b, and hence we cannot know if there is a subset relationship between them. This is a different sort of error than the "illegal access" errors that Polonius reported in the past: there is no access at all, in fact, simply subset relations.

Placeholder regions

There is an important distinction between named regions like 'a and 'b and the region 'X we created for a borrow. The definition of foo has to be true for all regions 'a and 'b, but for a region like 'X there only has to be some valid value. This difference is often called being universally quantified (true for all regions) versus existentially quantified (true for some region).

In this post, I will call universally quantified regions like 'a and 'b "placeholder" regions. This is because they don't really represent a known quantity of loans, but rather a kind of "placeholder" for some unknown set of loans.

We will include a base fact that helps us to identify placeholder regions:

.decl placeholder_region(R1: region)
.input placeholder_region

This fact is true for any placeholder region. So in our example we might have

placeholder_region('a).
placeholder_region('b).

Note that the actual polonius impl already includes a relation like this2, because we need to account for the fact that placeholder regions are "live" at all points in the control-flow graph, as we always assume there may be future uses of them that we cannot see.

Representing known relations

Even placeholder regions are not totally unknown though. The function signature will often include where clauses (or implied bounds) that indicate some known relationships between placeholder regions. For example, if foo included a where clause like where 'b: 'a, then it would be perfectly legal.

We can represent the known relationships using an input:

.decl known_base_subset(R1: region, R2: region)
.input known_base_subset

Naturally these known relations are transitive, so we can define a known_subset rule to encode that:

.decl known_subset(R1: region, R2: region)

known_subset(R1, R2) :- known_base_subset(R1, R2).
known_subset(R1, R3) :- known_base_subset(R1, R2), known_subset(R2, R3).

In our example of foo, there are no where clauses nor implied bounds, so these relations are empty. If there were a where clause like where 'b: 'a, however, then we would have a known_base_subset('b, 'a) fact. Similarly, per out implied bounds rules, such an input fact might be derived from an argument with a type like &'a &'b u32, where there are 'nested' regions.

Detecting illegal subset relations

We can now extend the polonius rules to report errors for cases like our running example. The basic idea is this: if the function requires a subset relationship 'r1: 'r2 between two placeholder regions 'r1 and 'r2, then it must be a "known subset", or else we have an error. We can encode this like so:

.decl subset_error(R1: region, R2: region, P:point)

subset_error(R1, R2, P) :-
  subset(R1, R2, P),      // `R1: R2` required at `P`
  placeholder_region(R1), // `R1` is a placeholder
  placeholder_region(R2), // `R2` is also a placeholder
  !known_subset(R1, R2).  // `R1: R2` is not a "known subset" relation.

In our example program, we can clearly derive subset_error('b, 'a, mid(S0)), and hence we have an error:

  • we saw earlier that subset('a, 'b, mid(S0)) holds
  • as 'a is a placeholder region, placeholder_region('a) will appear in the input (same for 'b)
  • finally, the known_base_subset (and hence known_subset) relation in our example is empty

Sidenote on negative reasoning and stratification. This rule makes use of negative reasoning in the form of the !known_subset(R1, R2) predicate. Negative reasoning is fine in datalog so long as the program is "stratified" -- in particular, we must be able to compute the entire known_subset relation without having to compute subset_error. In this case, the program is trivialy stratified -- known_subset depends only on the input relation known_base_subset.)

Observation about borrowing local data

It is interesting to walk through a different example. This is another case where we expect an error, but in this case the error arises because we are returning a reference to the stack:

fn bar<'a>(x: &'a [u32]) -> &'a u32 {
    let stack_slot = x[0];
    &stack_slot
}

Polonius will report an error for this case, but not because of the mechanisms in this blog post. What happens instead is that we create a loan for the borrow expression &stack_slot, we'll call it L0. When the borrow is returned, this loan L0 winds up being a member of the 'a region. It is therefore "live" when the storage for stack_slot is popped from the stack, which is an error: you can't pop the storage for a stack slot where there are live loans that have reference it.

Conclusion

This post describes a simple extension to the polonius rules that covers errors arising from subset relations. Unlike the prior rules, these errors are not triggered by any "access", but rather simply the creation of a (transitive) subset relation between two placeholder regions.

Unfortunately, this is not the complete story around region checking errors. In particular, this post ignored subset relations that can arise from "higher-ranked" types like for<'a> fn(&'a u32). Handling these properly requires us to introduce a bit more logic and will be covered in a follow-up.

Comments, if any, should be posted in the internals thread dedicated to my previous polonius post

Appendix: A (potentially) more efficient formulation

The subset_error formulation above relied on the transitive subset relation to work, because we wanted to report errors any time that one placeholder wound up being forced to be a subset of another. In the more optimized polonius implementations, we don't compute the full transitive relation, so it might be useful to create a new relation subset_placeholder that is specific to placeholder regions:

.decl subset_placeholder(R1: region, R2: region, P:point)

The idea is that subset_placeholder(R1, R2, P) means that, at the point P, we know that R1: R2 must hold, where R1 is a placeholder. You can express this via a "base" rule:

subset_placeholder(R1, R2, P) :-
  subset(R1, R2, P),      // `R1: R2` required at `P`
  placeholder_region(R1). // `R1` is a placeholder

and a transitive rule:

subset_placeholder(R1, R3, P) :-
  subset_placeholder(R1, R2, P), // `R1: R2` at P where `R1` is a placeholder
  subset(R2, R3, P).      // `R2: R3` required at `P`

Then we reformulate the subset_error rule to be based on subset_placeholder:

.decl subset_error(R1: region, R2: region, P:point)

subset_error(R1, R2, P) :-
  subset_placeholder(R1, R2, P), // `R1: R2` required at `P`
  placeholder_region(R2), // `R2` is also a placeholder
  !known_subset(R1, R2).  // `R1: R2` is not a "known subset" relation.

Footnotes

Footnotes

  1. The term "region" is not an especially good fit, but it's common in academia.

  2. Currently called universal_region, though I plan to rename it.

Following up on my previous post, let's take a look at checking subregion relationships with higher-ranked types.

The subset relation in Polonius is not enough

In my original post on Polonius, I assumed that when we computed a subtype relation T1 <: T2 between two types, the result was either a hard error or a set of subset relations between various regions. So, for example, if we had a subtype relation between two references:

&'a u32 <: &'b u32

the result would be a subset relation 'a: 'b (or, "'a contains a subset of the loans in 'b").

For a more complex case, consider the relationship of two fn types:

fn(&'a u32) <: fn(&'b u32)
// ^^^^^^^     ^^^^^^^^^^
// |           A fn expecting a `&'b u32` as argument.
// |
// A fn expecting a `&'a u32` as argument.

If we imagine that we have some variable f of type fn(&'a u32) -- that is, a fn that can be called with a 'a reference -- then this subtype relation is saying that f can be given the type fn(&'b u32) -- that is, a fn that can be called with a 'b reference. That is fine so long as that 'b reference can be used as a 'a reference: that is, &'b u32 <: &'a u32. So, we can say that the two fn types are subtypes so long as 'b: 'a (note that the order is reversed from the first example; this is because fn types are contravariant in their argument type).

Unfortunately, this structure isn't flexible enough to accommodate a subtyping question involving higher-ranked types. Consider a subtype relation like this:

fn(&'a u32) <: for<'b> fn(&'b u32)
//             ^^^^^^^
//             Unlike before, the supertype
//             expects a reference with *any*
//             lifetime as argument.

What subtype relation should come from this? We can't say 'b: 'a as before, because the lifetime 'b isn't some specific region -- rather, the supertype says that the function has to accept a reference with any lifetime 'b. (In fact, this subtyping relation should ultimately yield an error -- but we would like that error to arise not during subtyping, but rather later, when )

To express a constraint like this, we need a richer set of constraints than just subset. In fact, what rustc implements is something more like this:

Constraint = Subset
           | Constraint, Constraint
           | forall<R1> { Constraint }
           | exists<R1> { Constraint }

Subset = R1: R2  

Now we can say that

fn(&'a u32) <: for<'b> fn(&'b u32)

holds if the constraint forall<'b> { 'b: 'a } holds, which implies that 'a has to contain all possible loans. (This would of course be possible if 'a were an "infinite set" covering the whole program or some such thing, but we can treat it as an error.)

Encoding these constraints with universes

OK, so we want this richer family of constraints, but how can we encode those sorts of constraints into the Polonius subset relation?

Actually, the way we check these things in rustc has been changing too. We've recently (in #55517) adopted a system based on the concept of universes, which is something I first learned about when studying up on how Lambda Prolog is implemented. If you're interested in some background material, I gave a talk at a recent Papers We Love meetup where I covered the paper that explains how it works. I also wrote a section in the rustc-guide on the topic some time back.

The basic idea of a universe, in this context at least, is that it encodes a set of loans. Each region (like 'a) is assigned to a specific universe and that encodes the loans that it can contain. Regions are structured into a tree, where a child universe U2 extends its parent U1 with additional loans that are not included in U1. In other words, the child U2 is a strict superset of the parent U1.

For our purposes, we will always have a root universe that contains all the loans that appear in the current function being checked. We'll create new universes to reason about higher-ranked functions and trait matching. Just as in rustc, we can represent universes as a simple integer. We call the root universe U0, and thus a child universe might be U1 -- because U0 < U1, we know that U1 is a child of U0.

So, let's go back to that example of

fn(&'a u32) <: for<'b> fn(&'b u32)

If you recall, that generated a constraint forall<'b> { 'b: 'a }. We might encode that by saying that the region variable 'a is in the universe U0, but the region variable 'b is in a child universe U1. The idea is that because 'b corresponds to "any set of loans", we model that as potentially containing some (as yet unknown) loans that couldn't be named by 'a.

To encode these relationships into Polonius facts, I propose to extend the subset relation. Whereas before we had a subset(R1, R2, P) relation, we will now include a universe:

.decl subset_base(R1: region, U: universe, R2: region, P:point)
.input subset_base

The meaning is something like this:

  • For every loan L in R1,
    • there must be some loan L' in R2, where
      • L' is nameable in U, and
      • L is a subset of L'.

For subset_base, this universe will always be the universe associated with the "superset" region -- so to encode forall<'b> { 'b: 'a }, we would use the universe U0, since 'a is in U0:

subset_base('b, U0, 'a, P)

We can carry the universe into the full subset relation unchanged:

subset(R1, U, R2, P) :-
  subset_base(R1, U, R2, P).

And hence we get the relation subset('b, U0, 'a, P).

Transitive rule with universes

The transitive subset rule is where things get interesting. The idea here is that we can combine two subset relations, but we need to also combine their universes by taking the "least upper bound" (which is just the minimum of the two indices in practice):

subset(R1, U3, R3, P) :-
  subset(R1, U1, R2, P),
  subset(R2, U2, R3, P),
  least_upper_bound(U1, U2, U3).

Let's work through an example to show how this works. Imagine that we have a region constraint like this:

exists<R1> {
  forall<R2> { R2: R1 }
  forall<R3> { R1: R3 }
}

We will see that this constraint is unsolveable. No single region R1 can be both a subset and a superset of all other regions. (Nonetheless, we can still construct a transitive subset relation using the rules above.)

Let's say that the universe of R1 is U1 and the universes for R2 and R3 are U2 and U3 respectively. These universes form a tree, where U2 and U3 are siblings and U1 is their parent:

  • U1
    • U2
    • U3

We can translate this to two subset_base facts (here I am ignoring the position, which doesn't play a role):

subset_base(R2, U1, R1, P)
subset_base(R1, U3, R3, P)

Remember that we always use the univese of the "superregion" -- in the first relation, R2: R1, the superregion is R1 so we use U1. In the second, we use the universe U3 for the region R3.

Now we can create a transitive subset relation, but we have to combine U1 and U3 -- the minimum of which is U1. The resulting subset relation is then:

subset(R2, U1, R3, P)

Accommodating approximation

subset(R1b, E, R2, P) :-
  subset(R1a, E, R2, P),
  placeholder_region(R1),
  approximate(R1a, E1, R1b).

This is a key point. If we are going to approximate, do we need to generate a true subset relation? It's awfully unfortunate, going to complicate optimization. My key fear is that we establish some subset that can force a loan to be extended. Trying to come up with some realistic code where that is the case. Trait matching is an obvious candidate, but it's tricky, because it is invariant. But I feel like there must be a way to "force" a region to 'static.

Well, there is a semi-obvious one:

fn foo<'a>(x: &'a u32)
where for<'b> 'a: 'b { }

fn main() {
  let data: u32 = 22;
  let p: &'1 u32 = &'0 data;
  foo::<'2>(data); // requires:
  // &'1 u32 <: &'2 u32
  // '1: '2
  // forall<'3> { '2: '3 }
}

Hmm but this isn't the best example. What happens here is that we get an error because we push the loan of data into '3, and that fails the "known contains" check (this assumes that we don't generate loans of static data). Or... alternatively... we get an error because '3 (as a placeholder..) is live, and that makes the loan live until the slot is freed?

(What are the liveness rules around subuniverse placeholders? I assume they are not live anywhere, actually.)

Maybe if we add if?

fn foo<'a>(x: &'a u32)
where for<'b> where ('a: 'b) { 'a: 'b } { }

fn main() {
  let data: u32 = 22;
  let p: &'1 u32 = &'0 data;
  foo::<'2>(data); // requires:
  // &'1 u32 <: &'2 u32
  // '1: '2
  // forall<'3> { '2: '3 }
}

Errors

placeholder_error(R1, E, R2, P) :-
  subset(R1, E, R2, P),
  placeholder_region(R1),
  inexpressible(R1, E1).
@lqd
Copy link

lqd commented Dec 22, 2018

Super nice and enjoyable to read!

A comment: if I understand correctly, it seems we'd have a subset_error for each point where r1: r2 is required ? (I thought we'd be having only one error per "missing" r1: r2 subset (A.K.A !known_subset(r1, r2)) if r1: r2 is required at any point, à la subset_error(R1, R2))

(there's a tiny typo: "per out implied bounds rules")

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment