Last active
September 17, 2019 03:59
Cancellation vs Structured Concurrency
Chris Foster 4 days ago | |
Current notes are in my fork of the Juleps repo at | |
https://github.com/c42f/Juleps/blob/cjf/structured-concurrency/StructuredConcurrency.md | |
but most people seem to ignore the main Juleps repo so I haven't sent a PR yet (also it's very WIP). | |
We could move the notes to Jameson's julep wiki or somewhere else more conducive to collaboration? (edited) | |
Kiran Pamnany 3 days ago | |
This is nicely written -- good literature survey. | |
Kiran Pamnany 3 days ago | |
I'd be very interested in your (or anyone's) thoughts on Erlang's approach. | |
Chris Foster 3 days ago | |
I'm definitely interested in learning from everywhere, including the way that Erlang and Elixir supervisors work on OTP | |
Chris Foster 3 days ago | |
(Note that people like njsmith and elizarov seem quite aware of the Erlang prior art.) | |
Chris Foster 3 days ago | |
What would be best would be to learn from other language developers directly. So once we feel like we've got a decent survey I suggest we cross post it to the Structured Concurrency forum. | |
Jameson 3 days ago | |
That does look nice. Although I continue to reject the premise that Tasks need to be cancellable | |
Takafumi Arakaki 3 days ago | |
I think it is important to discuss cancellation and what njsmith calls the "black box rule" (which is IMHO more fundamental) separately. I think that's why he discussed cancellation in a separate (also long) blog post.
Takafumi Arakaki 3 days ago | |
I posted a longer comment on the "black box rule" here: https://github.com/c42f/Juleps/pull/1 | |
Stefan Karpinski 2 days ago | |
Great writeup | |
Stefan Karpinski 2 days ago | |
@vtjnash what’s your issue with tasks being cancellable? | |
Stefan Karpinski 2 days ago | |
I’m curious what the counterargument to structured concurrency is | |
Stefan Karpinski 2 days ago | |
As in what are you giving up? | |
Stefan Karpinski 2 days ago | |
I feel like all the discussions focus on the benefits, which are compelling, but what are you giving up? It must make some kinds of programs harder to write, so getting into that side would be helpful
Stefan Karpinski 2 days ago | |
Some examples of patterns that can’t be written anymore and how you express them instead | |
Jameson 2 days ago | |
I just don’t think it has anything to do with structured concurrency | |
Stefan Karpinski 2 days ago | |
It’s not necessary, since the essential aspect is nesting of function execution, but two of the key benefits are that errors flow up the task tree and cancellation flows down | |
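Editor's sketch of "errors flow up the task tree and cancellation flows down", using only the Python standard library. The names `worker` and `run_children` are hypothetical, and the helper only approximates a Trio nursery; it is an illustration under those assumptions, not anyone's proposed design for Julia.

```python
import asyncio

async def worker(name, log, fail_after=None):
    """Child task: sleeps in small steps and optionally fails part-way."""
    try:
        for step in range(100):
            if fail_after is not None and step == fail_after:
                raise RuntimeError(f"{name} failed")
            await asyncio.sleep(0.01)
    except asyncio.CancelledError:
        log.append(f"{name} cancelled")
        raise  # re-raise so the task really ends as cancelled

async def run_children(log):
    """Hypothetical nursery-like helper: returns only after every child has finished."""
    children = [
        asyncio.ensure_future(worker("a", log)),
        asyncio.ensure_future(worker("b", log, fail_after=2)),
        asyncio.ensure_future(worker("c", log)),
    ]
    done, pending = await asyncio.wait(children, return_when=asyncio.FIRST_EXCEPTION)
    for task in pending:
        task.cancel()  # cancellation flows down to the siblings
    # Wait for the cancelled siblings to unwind before leaving the scope.
    await asyncio.gather(*pending, return_exceptions=True)
    for task in done:
        if task.exception() is not None:
            raise task.exception()  # the error flows up to the parent

def demo():
    log = []
    try:
        asyncio.run(run_children(log))
    except RuntimeError as err:
        log.append(f"parent saw: {err}")
    return log
```

Calling `demo()` records that siblings "a" and "c" were cancelled (in either order) before the parent sees the error from "b". The parent never returns while a child is still running, which is the nesting property the thread treats as essential.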
Stefan Karpinski 2 days ago | |
Given structured concurrency, why not support systematic cancellation? (edited) | |
Jameson 2 days ago | |
I should write a blog post counterpoint titled “cancellation considered harmful” that talks about how structured concurrency doesn’t need cancellation | |
Jameson 2 days ago | |
Or maybe not, because njs (Trio) already seems to have written it: https://vorpus.org/blog/timeouts-and-cancellation-for-humans/#an-escape-hatch | |
Stefan Karpinski 2 days ago | |
You keep saying that one doesn’t need cancellation to do structured concurrency, which is clearly true, but you haven’t made the case for why you shouldn’t have cancellation given structured concurrency, which provides a sensible behavior for it | |
Jameson 2 days ago | |
njs says it at the bottom of that post under the “asyncio” analysis | |
Jameson 2 days ago | |
Also, this is pretty good, if very long https://trio.discourse.group/t/graceful-shutdown/93 (after reading the whole thing, I think the top post covers everything, but the subsequent dialog is still a good discussion) | |
Chris Foster 1 day ago | |
My current feeling is that cancellation at arbitrary points is deeply problematic in most systems which have tried to do it. But there are some exceptions; for example, normal OS processes can generally be killed without much consequence. And I've got a suspicion that Erlang supervisors might achieve something similar. What are the essential aspects of isolation which are required to make hard cancellation work? (edited)
Chris Foster 1 day ago | |
I do agree with @vtjnash that it's not obvious we must support timely cancellation for every task for us to reap many of the benefits of SC. For example, we could support it only for IO operations. But by the same token this feels incomplete to me. | |
Jameson 1 day ago | |
kill is a horrible PITA and basically impossible to use safely. It’s a great hammer, but hard to use well. Mostly you just need to depend on the system (kernel) running garbage collection after you die (RIP) | |
Jameson 1 day ago | |
All IO objects inherently support cancellation already, we just don’t currently implement the ability to group them conveniently with @sync | |
Jameson 1 day ago | |
FWIW, I don’t know of any cancellation system (sans kill and Java terminate) that actually provides timely cancellation. They all seem to discuss how it should be done and why it is required, but then find that it is impossible.
Stefan Karpinski 1 day ago | |
Is the hangup here the guarantee? | |
Jameson 1 day ago | |
the other problem is that all of the cleanup you actually would want to do (closing I/O objects, deleting files, logging the error) would also get canceled, so no cleanup would actually occur | |
Jameson 1 day ago | |
Oh, and much of the time it won’t actually stop. It’s very easy to write something that just gets stuck unaware. | |
Jameson 1 day ago | |
kill -9 is basically fine, that’s rather like exit(137), and the kernel garbage collector will be left to clean up the pieces | |
Chris Foster 1 day ago | |
So what is it that makes kill -9 work when other things don't? It's the strong isolation between resources of one process and another allowing them to be gc'd. | |
But even kill -9 doesn't work sometimes. I mean, it can stop the process but that might take other parts of the system down when the process is unexpectedly missing. | |
Jameson 1 day ago | |
Right. I think the difference in practice is that sending kill -9 actually has the effect of closing all of the open resources. That’s a scenario the other processes are usually written to handle in some way. | |
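Editor's sketch of the `kill -9` behaviour described above, for POSIX systems only. The child program and the helper name `kill_child` are made up for illustration. The child's `finally` block never runs, yet the parent still sees end-of-file on the pipe because the kernel closes the dead process's file descriptors.

```python
import signal
import subprocess
import sys

# Child program: announces readiness, then sleeps inside try/finally.
CHILD = r"""
import time
print("ready", flush=True)
try:
    time.sleep(60)
finally:
    print("cleanup ran", flush=True)  # not reached when the process gets SIGKILL
"""

def kill_child():
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdout=subprocess.PIPE,
        text=True,
    )
    first = proc.stdout.readline().strip()  # wait until the child is really running
    proc.send_signal(signal.SIGKILL)        # the "kill -9"
    rest = proc.stdout.read()               # returns at EOF once the kernel closes the pipe
    code = proc.wait()
    proc.stdout.close()
    return first, rest, code
```

Python reports the return code as the negated signal number, so `code` is `-9` here; a shell would report the same death as status 137 (128 + 9), which matches the `exit(137)` comparison earlier in the thread.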
Jameson 1 day ago | |
That it also stops running the code is somewhat irrelevant, because it’s already been cut off from all side-effect channels
Jameson 1 day ago | |
But cancel sort of does exactly the opposite of what you would ever want. It’s like you sent kill to the garbage collector, then resumed executing the code until it had finished cleaning up manually. | |
Stefan Karpinski 1 day ago | |
Can we maybe think constructively about how it could be made to work? | |
Chris Foster 2 hours ago | |
Yeah, that's the intention. I'm trying to understand Jameson's point of view :slightly_smiling_face: | |
What I think is interesting is that in the real world cancellation is definitely a thing: if your process starts behaving weirdly you kill -9 it. | |
If that locks up some parts of your OS, you restart the machine. | |
If that causes a distributed system to fail, you might need to restart a whole bunch of machines.
Outside of computing, if a business unit is failing to make a profit, it's restructured or terminated and its duties redistributed. If that causes the company to fail to service its debts, it goes into bankruptcy and gets restructured.
It's all very messy, but there's some kind of interesting informal hierarchy of supervision going on, and I've got a feeling that this is where I need to start reading about Erlang supervisors.
Chris Foster 20 minutes ago | |
"the other problem is that all of the cleanup you actually would want to do (closing I/O objects, deleting files, logging the error) would also get canceled, so no cleanup would actually occur"
@vtjnash I think you mean that all IO operations will return errors because cancellation is level-triggered? So you can't even call close, or send a goodbye message down a socket? Trio has a kind-of answer to this: cleanup scopes exist to explicitly protect against cancellation for the duration of the cleanup.
I think this is a consistent answer because only people who are thinking carefully about cleanup will use this feature, so they are in the correct mindset when they enable it explicitly. I assume it can also be nested with timeouts, etc. (edited)
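Editor's sketch of what "level-triggered" means here and why a shield changes the outcome. This is a toy synchronous model: `Scope`, `checkpoint` and `run` are hypothetical names and do not reproduce Trio's real implementation or API.

```python
class Cancelled(Exception):
    """Raised at a checkpoint when the enclosing scope has been cancelled."""

class Scope:
    """Hypothetical cancel scope used only for this illustration."""
    def __init__(self, parent=None, shield=False):
        self.parent = parent
        self.shield = shield
        self.cancel_called = False

    def cancel(self):
        self.cancel_called = True  # the state persists: that is the "level"

    def effectively_cancelled(self):
        # Walk outward; a shielded scope hides cancellation of outer scopes.
        scope = self
        while scope is not None:
            if scope.cancel_called:
                return True
            if scope.shield:
                return False
            scope = scope.parent
        return False

def checkpoint(scope):
    """Stand-in for any blocking operation (send, recv, sleep)."""
    if scope.effectively_cancelled():
        raise Cancelled()

def run(use_shield):
    log = []
    outer = Scope()
    try:
        outer.cancel()      # e.g. the timeout fired
        checkpoint(outer)   # the pending operation raises Cancelled
    except Cancelled:
        log.append("work cancelled")
        cleanup = Scope(parent=outer, shield=use_shield)
        try:
            checkpoint(cleanup)  # e.g. sending a goodbye message
            log.append("goodbye sent")
        except Cancelled:
            log.append("goodbye cancelled too")
    return log
```

In this model `run(False)` ends with the goodbye being cancelled as well, because the outer scope stays cancelled and every later checkpoint raises again. `run(True)` lets the cleanup checkpoint through, which is the role the shielded cleanup scope plays in the discussion.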
Jameson 13 hours ago | |
Right, IO is generally shorthand for any externally visible side effect | |
Jameson 13 hours ago | |
But now you’ve just directly contradicted everything you claimed earlier about cancellation. That’s not a consistent answer. | |
Jameson 13 hours ago | |
Just consider the example in Trio about how to use it correctly: | |
    with trio.move_on_after(TIMEOUT):
        conn = make_connection()
        try:
            await conn.send_hello_msg()
        finally:
            with trio.move_on_after(CLEANUP_TIMEOUT) as cleanup_scope:
                cleanup_scope.shield = True
                await conn.send_goodbye_msg()
Jameson 13 hours ago | |
It’s a rambling mess complaining about how this is bad (“shooting yourself in the foot”) and incompatible with the trio design | |
Jameson 13 hours ago | |
But also that you need to do it. | |
Chris Foster 2 hours ago | |
Yeah, I think I've written a few boneheaded things above, but that's just part of the process of understanding the nature of the problem. (edited)
Chris Foster 2 hours ago | |
BTW, I don't mean to claim the trio way you've quoted above is bad or inconsistent; it makes perfect sense to me. Do you think the Trio cancellation semantics make sense for Julia? (edited) | |
Chris Foster 2 hours ago | |
I do realize (now) that preemptive hard cancellation of the kill -9 variety is a completely different approach to the problem of cancelling things. (edited) | |
Chris Foster 1 hour ago | |
There seem to be at least three different flavors of cancellation:
1. Cooperative with task-defined handlers (Trio style) where cancel points are explicit, and tasks may run cleanup code. | |
2. Hard preemptive cancellation (kill -9 style) where you need enough isolation so that resources can be GC'd. Incompatible with task-defined cleanup. Has several known in-process APIs which have been a historical disaster; Thread.stop etc. | |
3. Preemptive with task-defined cleanup handlers (InterruptException style) - a weird mixture where cancel points are "almost" everywhere, but task-defined cleanup can still run. Known to be a reliability disaster if just tossed into a language without thinking very hard about what "almost" means. | |
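Editor's sketch of flavor 1 only (cooperative cancellation with explicit cancel points), using a plain thread and a `threading.Event` as the cancel token. The names `worker` and `demo` are hypothetical. Flavors 2 and 3 are not modelled here.

```python
import threading

def worker(cancel_requested, log):
    """Does work in steps and checks for cancellation only at explicit points."""
    try:
        for _ in range(1000):
            # Explicit cancel point: the only place this task can be stopped.
            if cancel_requested.wait(timeout=0.01):
                log.append("cancel observed")
                return
    finally:
        log.append("cleanup ran")  # task-defined cleanup still runs

def demo():
    cancel_requested = threading.Event()
    log = []
    thread = threading.Thread(target=worker, args=(cancel_requested, log))
    thread.start()
    cancel_requested.set()  # request cancellation; nothing is forced
    thread.join(timeout=5)
    return log, thread.is_alive()
```

The cost of this flavor is the one raised earlier in the thread: a task that never reaches a cancel point simply keeps running, so timeliness depends on how the task is written.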
Jameson 20 minutes ago | |
Yes, the dialog here has been good. I’m just very opinionated about this subject, haha. | |
Jameson 19 minutes ago | |
1. Yes, but I think it’s faulty to say it’s explicit in Trio. It’s perhaps explicit in Go, but in Trio it’s merely limited. This design seems to be most nearly a clone of pthread_cancel?
Jameson 13 minutes ago | |
2. Is known to excel in certain cases, and might also be called the Erlang model. As you point out, it’s also a known foot-gun if the language isn’t sufficiently restricted.
Jameson 9 minutes ago | |
3. This seems uncommon, although C does allow for defining it. It notes, however, that the functionality permitted on such a child thread is very limited.
Jameson 6 minutes ago | |
I think you’re missing 4: Julia-style cancellation, aka OO cancellation. Any IO object can be cancelled by calling close, which initiates termination of outstanding requests. |
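Editor's sketch of the "cancel the object, not the task" idea, written in Python as an analogy. The `Channel` class below is hypothetical and is not Julia's `Channel` or any real library API; it has no `put` because only the pieces needed to show `close` ending an outstanding request are included.

```python
import asyncio

class ChannelClosed(Exception):
    """Delivered to a pending take() when the channel is closed."""

class Channel:
    """Hypothetical closable channel used only for this illustration."""
    def __init__(self):
        self._waiters = []
        self._closed = False

    async def take(self):
        if self._closed:
            raise ChannelClosed()
        fut = asyncio.get_running_loop().create_future()
        self._waiters.append(fut)
        return await fut  # blocks until close() is called

    def close(self):
        self._closed = True
        for fut in self._waiters:
            if not fut.done():
                fut.set_exception(ChannelClosed())  # end each outstanding request
        self._waiters.clear()

async def reader(chan, log):
    try:
        await chan.take()
    except ChannelClosed:
        log.append("take failed: channel closed")
    finally:
        log.append("reader cleanup ran")

async def main():
    log = []
    chan = Channel()
    task = asyncio.ensure_future(reader(chan, log))
    await asyncio.sleep(0)  # let the reader block in take()
    chan.close()            # cancel the outstanding request, not the task
    await task
    return log, task.cancelled()
```

Running `asyncio.run(main())` shows the reader receiving an ordinary error from its pending operation and finishing normally; the task itself is never marked as cancelled, so its cleanup code runs with no cancellation state to fight against.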