kachayev/aleph-planning.md

# Aleph, Async, HTTP, Clojure

I've been working with Aleph for roughly the last 5 years, actively contributing to the library for the last 2 (or so). I've also put some effort into spreading the word about it, including educational tech talks like "Deep HTTP Dive Through Aleph & Netty". But the more I talk to people, the more confusion I find, mostly about how Aleph works and what you can expect when adding it to your stack. Clojurists Together has recently announced that Aleph will get Q1 funding, so I think it's a good time to share my priorities and thoughts on the development plans that were mentioned in the blog post. I hope the community finds it interesting and helpful.

Aleph describes itself as an "asynchronous communication for Clojure" library, and you should probably pay a good portion of your attention to the first word: "asynchronous". We're living in the JVM world, and the thread-per-connection model (also known as "blocking") fits this world way better. It's simpler to develop, to manage, to understand, to operate. It has a long track record of being used in production. "Asynchronous" solutions were born for a reason, for sure. We certainly want to push boundaries to get more from the systems we're implementing. But it comes with a price: complexity. We took over part of the OS's functionality to manage it ourselves, so no wonder things got trickier. You would probably think that after so many years with async servers & clients I should go out and persuade everyone to switch to my side... but in fact I'm sure that it's better for you to stick with thread-per-connection while it works for you. You need a good reason to give it up and switch to async. At the very least, that will save your nerves for a while.

## Reasoning

What might be a good reason? I feel like the most popular answer is "large scale" (whatever people mean by "large"). But that's arguable. Let's imagine the job of your server is to fire a SQL query using JDBC and render the response to the client. JDBC is inherently blocking. Would you gain something from switching to an async server interface? Well, it depends... You need to think through your workloads carefully to get the correct answer. On the other hand, there are quite a few situations where you can skyrocket your implementation by introducing async I/O. Some of them are obvious, some of them almost never get attention. Let's check them one by one.


Persistent connections. Well, it's 2019. Almost all of our connections are persistent now. But at the same time most HTTP servers and clients are implemented in such a way that you just don't see them. Why? I think it's an attempt to hide low-level details that might not be very helpful: you can send an HTTP/1.1 request and get back a response without thinking about the specific underlying TCP connection, and you don't care whether that connection was already open or is a newly spawned one. But it's different when we talk about protocols that expose the connection as an essential part, e.g. Websockets, SSE, HTTP/2 etc. If you don't need a new thread for each new client... it opens a lot of doors for you.


I/O-heavy operations, e.g. proxy servers, webcrawlers or load-testing frameworks. I’m talking about those cases where most of your work is juggling bits over network connections. And I don’t mean “proxy” in terms of HTTP proxy infrastructure but in a more general sense. Like API gateways that are so common nowadays (because of microservices, right). Or an HTTP endpoint that sends a message to Kafka or Kinesis Data Streams. Or a lot of other similar systems. I personally don’t have a lot of experience with webcrawlers, but we recently added support for a fully async name resolution mechanism, so you can throw away the blocking DNS resolver that the JDK ships. That gives surprisingly good results when you rely heavily on DNS resolution.


Large files or payloads. Are you about to implement your own CDN, or just a way for your clients to upload their photos to Amazon S3? Welcome to the async world, fasten your seat belts. Technically this is also about I/O-heavy workloads. I'm listing it separately to draw your attention to this specific use case, as it's pretty demanding for both server and client implementations. If you think that file uploading is such a commoditized feature in modern applications that we've already created at least a few flawless solutions for it... I have some bad news for you.


A client for an API or database or whatever external system. Let’s imagine you want to implement a client library for ElasticSearch, which itself exposes an HTTP/1.1 interface. But you want to make it more natural to work with by hiding the underlying HTTP layer as something irrelevant (at least in most cases). Please, don’t act like JDBC. Let the users of your library decide whether they want to block or not. Switching from async to blocking is trivial. The opposite is not true. And I'm not talking specifically about HTTP APIs: Aleph is one of the best choices out there for raw TCP APIs (think Zookeeper, Kafka etc). I would love to see more of this happening in the Clojure ecosystem. So if you're thinking about starting a new library or adapting your existing library to the async model, please feel free to reach out to me. I will be glad to help.
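
To make the "async to blocking is trivial, the opposite is not" point concrete, here is a minimal sketch that uses only `clojure.core`. The function names (`fetch-async`, `legacy-fetch` and friends) and the fake 50 ms latency are made up for illustration. A client built on Aleph would hand back a Manifold deferred rather than a promise, but a deferred can be dereferenced with `@` as well.

```clojure
;; Hypothetical async client call: returns immediately with a promise that is
;; delivered later. The `future` here only simulates a non-blocking request.
(defn fetch-async [id]
  (let [result (promise)]
    (future
      (Thread/sleep 50)                      ; pretend network latency
      (deliver result {:id id :status 200}))
    result))

;; Async -> blocking: the caller simply derefs, here with a 1 s timeout.
(defn fetch-blocking [id]
  (deref (fetch-async id) 1000 ::timeout))

;; Hypothetical blocking call, in the spirit of a JDBC query.
(defn legacy-fetch [id]
  (Thread/sleep 50)
  {:id id :status 200})

;; Blocking -> async: the only generic option is to run the call on another
;; thread. That thread stays occupied for the whole duration of the call.
(defn legacy-fetch-async [id]
  (future (legacy-fetch id)))
```

The first wrapper costs one `deref`. The second one only moves the blocking somewhere else, so a library that starts out blocking cannot give its users the benefits of async I/O after the fact.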


I'm sure there's more. If I missed something important, please ping me in the comments. With all that being said, let's move on to my current priorities and future plans.

## Plans

First of all, I want to mention a weird misconception that's so popular (or let me say "trending") in the Clojure community right now: if a library is production ready... it doesn't require any additional work. As if Clojure somehow magically empowers us to create feature-complete, bug-free software that we never need to touch. It just works. Well... maybe that's a valid position for some specific use cases or domains, but it's far from true for Aleph (and for most libraries I've ever dealt with, but that might be my personal tough luck). Networking is a deep and dirty tar pit. Leveraging thousands of man-years of effort put into making the TCP/HTTP/etc stacks actually work, we have obviously moved the needle. But as an industry we're still in the middle of the tar pit. The situation with async programming on top of the JVM feels pretty much the same. Is Aleph production ready? Yes. Is there room for improvement? Definitely.

Apart from the ongoing maintenance and bug fixes, I want to focus on the features that cover the use cases I've mentioned earlier. If a specific technical requirement encourages you to look into Aleph, we don't want the library to fall short of your expectations in terms of the quality of the solutions provided, or to let you down after you release your software to production.

Websockets. Aleph has decent support for the Websocket protocol, exposing a websocket connection wrapped in the pretty powerful manifold.stream abstraction. There are still some parts missing, and I really want to fill those gaps as soon as possible.


- Ping/Pong frames support (#364)
- Handling handshake errors and timeouts (#442)
- Fix the buffer allocation/release cycle (#430)
- Close connection with a custom status code (#470)
- The idle state timeouts feature requires more testing


Large requests & responses.


- The multipart encoder/decoder is a long story for Aleph. A lot of work is already done. I think #432 is the final piece of the puzzle (even tho' the underlying Netty implementation is not perfect, and I'm working to improve it too)
- Proper error handling for the FileRegion and ChunkedFile APIs (#459)
- HTTP client to decompress the response body automatically when necessary (#444); this is also about clj-http API parity
- API for 100-Continue handlers (#462); the current implementation covers only trivial use cases
- API to send a part of a file to deal with Range requests (#469)
- Safety limits on max body size and read timeouts to guard against misbehaving clients (#452, #458)
- Premature server response handling (#454); this one might be quite tricky to get right (we probably need to fix Netty too)
- RFC enforcement and error handling for transfer-encoding: chunked


Operational improvements. The biggest elephant in the room is obviously graceful shutdown. The idea here is that when you're switching your server off, let's say to update it to a newer version, you don't want to just drop all connections in some random state, hoping your clients will be able to recover. You want to stop accepting new connections, send everything that was already enqueued to be sent, and close TCP connections normally. Only after all of that is done do you stop the server. The implementation is kinda "doable" for a typical HTTP/1.1 stack. But websockets and raw TCP connections are a bit more painful: since Aleph has no clue in what state it's safe to close the connection, the user of the library has to decide. The current implementation, which I wrote quite some time ago, was generally inspired by Go's net/http package, with a special mechanism for connection "hijacking" (a way to say that you're the one in charge of shutting this connection down). The complexity of the solution, tho', requires a lot of testing and benchmarking prior to merging it into master.


- The well-known and super annoying issue with Clojure's dynamic class loaders that led to long and scary WARN messages in your log after shutting down the server (#425); already fixed but still waiting to be merged
- Better API to deal with SNI endpoints (#445)
- KQueue transport (#330)
- Unix domain sockets (#465)
- Being a good JVM citizen by not trying to suppress your OOME (#446)
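
The ordering in the graceful shutdown description above (stop accepting new work, let everything already enqueued finish, then stop) can be illustrated with a plain `java.util.concurrent` thread pool. This is only an analogy and not Aleph's implementation: `graceful-stop!` and `demo` are hypothetical names, and a real server also has to deal with per-connection state, which is exactly the hard part for websockets and raw TCP.

```clojure
(import '(java.util.concurrent ExecutorService Executors TimeUnit
                               RejectedExecutionException))

(defn graceful-stop!
  "Hypothetical helper. Stops accepting new tasks, waits up to timeout-ms for
  the already submitted ones to finish, and only then forces a stop.
  Returns true when everything drained within the timeout."
  [^ExecutorService pool timeout-ms]
  (.shutdown pool)                          ; refuse new work from now on
  (let [drained? (.awaitTermination pool (long timeout-ms)
                                    TimeUnit/MILLISECONDS)]
    (when-not drained?
      (.shutdownNow pool))                  ; interrupt whatever is still running
    drained?))

(defn demo []
  (let [^ExecutorService pool (Executors/newFixedThreadPool 2)
        sent      (atom 0)
        _         (dotimes [_ 5]            ; five "responses" waiting to be sent
                    (.execute pool (fn [] (Thread/sleep 20) (swap! sent inc))))
        drained?  (graceful-stop! pool 2000)
        rejected? (try (.execute pool (fn [])) false
                       (catch RejectedExecutionException _ true))]
    {:drained? drained? :sent @sent :rejected? rejected?}))

;; (demo) => {:drained? true, :sent 5, :rejected? true}
```

All five queued tasks complete before the pool stops, and anything submitted afterwards is rejected. For a connection the library cannot reason about, the "is it safe to close now?" decision has no such generic answer, which is why the paragraph above leaves it to the user.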


Pluggable clients. If you want to ship Aleph as a part of your API client, you need to make sure that the library does not do more harm than is absolutely necessary. Meaning, you need to be thoughtful and transparent about allocating and deallocating resources (like thread pools and file descriptors). Described here.

Wow, that's a lot. But that's not everything.

During the last year, I backported some of our fixes and improvements to Netty and clj-http. I'm going to do more of that, treating it as ongoing work. From time to time it's good to see that Aleph is part of a greater ecosystem and that it behaves as a nice neighbour.

After so many years with the library, I have a few high-level (almost philosophical) observations about the underlying complexity of the async model in general and of the specific implementation in Aleph/Manifold. The more I work on PRs, the more use cases I have to turn these observations into a rough understanding of how things might be done differently. I truly hope to find some time (at least a couple of free weekends) to talk about those ideas or even to prototype something, e.g. merging deferred chains into higher-level tasks to keep the boundaries of async computations, a better story for cancellation, "brackets" to deal with resource allocation/deallocation cycles, and more.

I also want to talk more about Aleph and networking. My last tech talk got some attention from the community, and now I'm thinking about more general topics (based on my experience with Aleph): async & the JVM, pitfalls of the HTTP-over-TCP abstraction etc. It would be cool to get > 24 hours a day, but that probably won't happen to me, so I'll do my best to squeeze out whatever is possible from what I actually have.

## Misc

I want to say a huge "Thanks" to Zach Tellman (author of Aleph and the underlying stack of Manifold, Dirigiste and Byte-Streams), to Clojurists Together and all of you who support this initiative, and, obviously, to all of you who use Aleph.
Thanks!

Oleksii Kachaiev, Jan 2019
Contact me on Twitter or email me at kachayev <$> gmail.com.