@kmicinski
Created January 30, 2024 19:04
Notes for our 2PM reading group tomorrow. (Also CCing Darshana and Oliver, two DB collaborators of mine from UBuffalo) Very off-the-cuff thoughts here. I read this paper a few weeks ago and finally went through it again this afternoon to write up my thoughts.
Wow, very interesting paper. I am excited to discuss this. My highest-level thoughts are that the restriction to circuits is doing a lot of the heavy lifting here; the authors say they are working on a compiler, but it's not obvious to me that this restriction won't hamper incrementality in practice--especially as you write richer programs that need to mix various data structures with varying concerns and space/materialization trade-offs.
PAPER TITLE: "DBSP: Automatic Incremental View Maintenance for Rich Query Languages"
SUMMARY
This paper presents a new approach to incremental view maintenance (IVM), provided to the user as a Rust library. The high-level idea of incrementalization is being able to write programs that automatically work in terms of deltas--instead of having to replay the entire program in response to a change, we want to recompute as little as possible. This idea manifests in various guises; in Datalog, for example, it shows up as semi-naive evaluation.
The "programmer's view" of DBSP is as a Rust library which gives you the ability to build "circuits" which are powered by magical data structures which take care of all of the IVM stuff for you (e.g., OrdIndexedZSet). If you work in terms of these data structures, you get incrementalization "for free."
More formally, the authors formalize operations over streams. Streams naturally pop up in databases via, e.g., transactions. It is important to note that the streams are of "small" values (otherwise you push all of the interesting updating to just updating a giant transaction, say...). Of course, you can make an incremental version of an operator by just materializing all of the history and operating over that (in terms of program analysis: when a single token changes, you can rerun the analysis from scratch on the whole program). But you don't just want that--you want to reuse as much work as possible.
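For reference, the paper's two key stream operators are integration and differentiation: I(s)[t] = s[0] + ... + s[t], and D(s)[t] = s[t] - s[t-1] (with s[-1] = 0); the incremental version of an operator Q is then built as, roughly, Q^Δ = D ∘ Q ∘ I. A minimal sketch of I and D over finite integer streams (integers standing in for any abelian group):

```rust
// Minimal sketch of DBSP's integration (I) and differentiation (D)
// over finite integer streams; integers stand in for any abelian group.

/// I(s)[t] = sum of s[0..=t] (running total of all deltas so far).
fn integrate(s: &[i64]) -> Vec<i64> {
    let mut acc = 0;
    s.iter().map(|x| { acc += x; acc }).collect()
}

/// D(s)[t] = s[t] - s[t-1], with s[-1] = 0 (recover deltas from totals).
fn differentiate(s: &[i64]) -> Vec<i64> {
    let mut prev = 0;
    s.iter().map(|x| { let d = x - prev; prev = *x; d }).collect()
}

fn main() {
    let deltas = vec![3, -1, 4, 0, -2];
    let totals = integrate(&deltas);
    assert_eq!(totals, vec![3, 2, 6, 6, 4]);
    // D and I are mutually inverse: D(I(s)) = s.
    assert_eq!(differentiate(&totals), deltas);
}
```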
The answer is to recognize that your relations satisfy algebraic laws and to use those laws to help you do less work. Formally, this is done via abelian groups: commutative operations with inverses. You can then build stream circuits carrying values from these abelian groups. Operations on these groups give you the natural add/remove (bag semantics) you might expect--the ability to remove alongside adding is interesting and distinct from Datalog, but common in IVM.
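Concretely, the workhorse group in DBSP is the Z-set: a finite map from elements to integer weights, where positive weights mean multiset insertions and negative weights mean deletions. A minimal sketch of the group structure (my own illustration, not DBSP's actual representation):

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// A Z-set: finite map from elements to integer weights. Positive
/// weights are (multi-set) insertions, negative weights are deletions.
#[derive(Debug, Clone, PartialEq)]
struct ZSet<T: Eq + Hash>(HashMap<T, i64>);

impl<T: Eq + Hash + Clone> ZSet<T> {
    fn zero() -> Self { ZSet(HashMap::new()) }

    /// Group addition: pointwise sum of weights, dropping zeros.
    fn add(&self, other: &Self) -> Self {
        let mut out = self.0.clone();
        for (k, w) in &other.0 {
            *out.entry(k.clone()).or_insert(0) += w;
        }
        out.retain(|_, w| *w != 0);
        ZSet(out)
    }

    /// Group inverse: negate every weight (turns inserts into deletes).
    fn neg(&self) -> Self {
        ZSet(self.0.iter().map(|(k, w)| (k.clone(), -w)).collect())
    }
}

fn main() {
    let mut a = ZSet::zero();
    a.0.insert("x", 2);          // two copies of "x"
    let mut delta = ZSet::zero();
    delta.0.insert("x", -1);     // retract one copy of "x"
    delta.0.insert("y", 1);      // insert "y"
    let b = a.add(&delta);
    assert_eq!(b.0.get("x"), Some(&1));
    assert_eq!(b.0.get("y"), Some(&1));
    // a + (-a) = 0: every Z-set has an inverse.
    assert_eq!(a.add(&a.neg()), ZSet::zero());
}
```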
Using these principles, the paper presents a language for building circuits--including recursive circuits--from stream operators. The paper then presents rules for mechanically deriving the incremental version of such a circuit, handling recursion via nested streams (streams of streams).
BIG IDEAS / MY KNEE-JERK THOUGHTS ON THEM
- Although the system has a beautiful theory for updating streams, the way that *programs* are built is quite ugly, requiring them to basically be built as circuits. You can nest circuits (giving you `while`, basically), but there is not any interesting higher-order control or anything like that. Most programs are not nested while circuits--they consist of things like coroutines, etc...
- Nested streams really are a big part of the novelty here, I think--otherwise it's just Kahn networks (pulling from the paper)? I know nothing about Kahn networks; maybe David can explain them to us?
- "Here is a set of data structures that help you build incremental stream operators" is nice but in practice, many interesting applications don't deal with streaming operations over "small" data. So you end up contorting your problem to fit into these operators. Clearly this is fine and normal for things like stream computations the authors anticipate--I feel less convinced it will work for incremental program analysis, say, where there additional application-specific structure that you may be nontrivial to encode in terms of DBSP's abstractions.
- The core offering is a Rust DSL that allows you to write an apparently non-incremental query; the library's implementation then relies upon data-structure properties to enact the magic of incrementalization for free.
- The idea of using algebraic properties of the data structures is nice, but there are pros and cons to abstraction. This paper is not a silver bullet--it's just one abstraction. It doesn't immediately make everything incremental; it provides a reasonable set of abstractions such that--if your application can be coded up using the provided data structures--you get incrementality for free. But I expect that to make a system truly end-to-end incremental, there may be application-specific concerns that don't play well with these data structures. Let's go back again to the idea of context-sensitive static analysis: maybe there would be a way to code that up using DBSP, but if you take an off-the-shelf context-sensitive analysis and code it up in Datalog (then implement it using DBSP), the analysis itself may just plain not be compositional by design. Making it compositional using the DBSP data structures may be possible--but it would be an additional constraint--and it's not obvious to me that DBSP's abstractions would beat finer-grained, application-specific structure in those cases.
- Nice that it's a Rust DSL and not yet another ad-hoc Datalog variant that will bitrot and die over time.
- Originally I viewed this work as a competitor to vanilla Datalog. I have come to realize that when I think about Datalog engines, I have a fairly "batch-oriented" perspective: I am assuming you can do things like index selection, etc... because you know everything that lives in the fixed-point, and you can leverage that for fast range-indexed joins everywhere.
-> It seems like this paper really doesn't try to compete with SOTA Datalog engines, instead this paper really lives in the space of stream processing ("data flow") languages.
QUESTIONS / CRITICISM
- Section six is where I feel the most lost, but it also feels like the most important section for me to really grok. Everything else seems like "okay, non-recursive queries surely can't be too hard, and this all makes sense." Can we go through section six together and really discuss it in detail?
- What are the ramifications / expressivity limitations of DBSP's design with respect to forcing you to structure things in terms of abelian groups? (See the sketch after this list.)
- Is forcing you to write your program as a circuit really as ugly as it seems to me? Is everything else just details? To me, it would seem like they're sweeping a lot under the rug (indeed, some of the hardest problems) by requiring things to be contorted into circuits.
- Follow-up question: is DBSP a compilation target down-the-line? E.g., for the tabled incremental lambda calculus, let's say? What would happen if we tried to do that?
- There's been a ton of work on streams in PL--is anything there relevant?
- How should we compare this system to batch, in-memory Datalogs? Are they orthogonal contributions? Would it be the case that--in its limit--DBSP could be used to build the same kinds of programs as batch-oriented Datalogs, but that its focus is really on the streaming case?
- I'm a bit surprised the Rust crate has almost no users (only 250)--I was under the impression this paper was pretty popular, but the crate hasn't caught on. Why?
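On the abelian-groups question above: one concrete expressivity wrinkle is that group-based IVM needs inverses. A sum-like aggregate has one (subtract the retracted value from the running total), but min/max do not--once the current minimum is deleted, no running aggregate alone can tell you the new one, so you must fall back on maintained state. A small illustration of the general point (mine, not code from the paper):

```rust
// Why inverses matter: SUM forms a group, MIN does not.
// (An illustration of the general IVM point, not code from the paper.)

/// Incrementally maintain SUM: deletion is just subtraction,
/// because (Z, +) has inverses.
fn update_sum(sum: &mut i64, value: i64, weight: i64) {
    *sum += weight * value; // weight -1 cleanly undoes a prior +1
}

/// Incrementally maintain MIN: insertions are easy, but deleting the
/// current minimum leaves us stuck -- min has no inverse, so we must
/// fall back to rescanning the maintained multiset.
fn update_min(data: &mut Vec<i64>, min: &mut Option<i64>, value: i64, weight: i64) {
    if weight > 0 {
        data.push(value);
        *min = Some(min.map_or(value, |m| m.min(value)));
    } else {
        if let Some(pos) = data.iter().position(|&v| v == value) {
            data.swap_remove(pos);
        }
        // No algebraic shortcut: recompute from the retained state.
        *min = data.iter().copied().min();
    }
}

fn main() {
    let (mut data, mut min) = (Vec::new(), None);
    for v in [5, 3, 8] { update_min(&mut data, &mut min, v, 1); }
    assert_eq!(min, Some(3));
    update_min(&mut data, &mut min, 3, -1); // delete the minimum...
    assert_eq!(min, Some(5));               // ...requires a rescan
}
```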