@virusdave · Last active February 7, 2021 20:27

WTF?

TL;DR: I wrote a Scala program (Scala 2.12, Bazel 3.2, JDK zulu8) that uses the Presto JDBC driver to connect to a company-internal Presto cluster.
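
For context, the Presto JDBC side of the program boils down to roughly the following (a minimal sketch; the hostname, catalog, and user are made up, and it assumes the classic com.facebook.presto.jdbc driver on the classpath, which registers itself with DriverManager):

```scala
import java.sql.DriverManager
import java.util.Properties

object PrestoSmokeTest {
  def main(args: Array[String]): Unit = {
    // Hypothetical cluster endpoint; SSL=true since the cluster is reached over HTTPS
    // (in my case via an Amazon ELB terminating TLS).
    val url = "jdbc:presto://presto.internal.example.com:443/hive/default"
    val props = new Properties()
    props.setProperty("user", "dave")
    props.setProperty("SSL", "true")

    val conn = DriverManager.getConnection(url, props)
    try {
      val rs = conn.createStatement().executeQuery("SELECT 1")
      while (rs.next()) println(rs.getInt(1))
    } finally conn.close()
  }
}
```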

It works fine.

Then I upgraded some stuff and it stopped working. Pinning down exactly when it works and when it fails took me MANY hours.

Details

This particular choice of tech versions was dictated by wanting a completely pinned, deterministic build stack (using Nix on Mac OSX Catalina currently).

Until recently, nixpkgs had Bazel 3.3 pinned, which capped me at a rules_scala version that supported only Scala 2.12. Bleh.

But recently Bazel 3.7 support landed in nixpkgs, so I upgraded to that nixpkgs version and bumped my Bazel to match. Doing so enabled updating the codebase to Scala 2.13 (yay!), but the newer Bazel Scala support apparently wanted to run on JDK 11 (something about using -XXCompactStrings during compile, which didn't work on JDK 8). OK, that's simple in Nix, so I swapped the system JDK 8 for JDK 11 and, after a little work, had everything compiling and the tests passing.

Yay.

But my program now seemed to hang when making requests to the Presto cluster!

Debugging this was a pain, because the Presto JDBC driver apparently uses a vendored (or maybe shaded) copy of OkHttp to make its requests, so I couldn't easily drop in debugging hooks in any straightforward way.

Many hours of debugging later, I decided it was related to executing the uberjar on JDK 11. Running the same uberjar on JDK 8 worked fine, as did OpenJDK 13 and JDK 14. Hmm.

Some super-hacky debugging code revealed that failures would sometimes happen during the initial request, and sometimes in the one immediately after that. When it failed during the initial request, I'd see the Amazon ELB report a 502. When it failed afterwards... Nothing. Just a frozen process with a lingering connection established.

Using jstack during this time showed a thread blocked trying to write to the socket.

Trying to debug the TLS behavior on JDK 11 was... interesting. I used a MITM proxy called Proxyman to try to inspect the traffic, but then... everything worked just fine! Disable the proxy and it failed again.
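
(If you want to reproduce the MITM step, one way is the standard JVM proxy system properties, which stock OkHttp honors via the default ProxySelector; whether the shaded copy inside the driver behaves the same is an assumption, the port below is a placeholder for whatever Proxyman actually listens on, and the proxy's root CA has to be trusted by the JVM. A sketch:)

```scala
object ViaMitmProxy {
  def main(args: Array[String]): Unit = {
    // Route HTTPS through a local MITM proxy. Passing these as -D flags on the java
    // command line works just as well and avoids any property-caching subtleties.
    sys.props("https.proxyHost") = "127.0.0.1"
    sys.props("https.proxyPort") = "9090" // placeholder: use whatever port Proxyman reports

    // The proxy's root CA also needs to be trusted, e.g. via
    //   -Djavax.net.ssl.trustStore=/path/to/truststore-containing-proxyman-ca.jks

    // ... then run the Presto JDBC code as usual and watch the decrypted traffic in Proxyman.
  }
}
```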

I tried swapping out the entire TLS implementation as suggested by Yuri, using Conscrypt instead, but I saw the same behavior.
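
(For anyone who wants to try the same swap: registering Conscrypt ahead of the default provider before any SSLContext is created is roughly all it takes. A minimal sketch, assuming the org.conscrypt conscrypt-openjdk-uber artifact is on the classpath:)

```scala
import java.security.Security
import org.conscrypt.Conscrypt

object UseConscrypt {
  def main(args: Array[String]): Unit = {
    // Make Conscrypt the most-preferred security provider so its TLS implementation
    // is used instead of the JDK's SunJSSE. Must run before the first TLS connection.
    Security.insertProviderAt(Conscrypt.newProvider(), 1)

    // ... then connect to Presto as usual; OkHttp will now handshake via Conscrypt.
  }
}
```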

Using -Djavax.net.debug=ssl:handshake for debugging (when using the default crypto stack) gave me a ton of info, including that TLS 1.2 was being negotiated. I tried forcing both 1.3 and 1.1, but the remote server rejected both. I also noticed that in both the JDK 11 no-proxy (fail) and JDK 14 no-proxy (success) cases, ALPN was offered and accepted, yet the default TLS implementation logged Ignore impact of unsupported extension: application_layer_protocol_negotiation. In the JDK 11 with-proxy (success) case, however, the proxy didn't offer ALPN during the ServerHello at all.
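
(Both of those are standard JSSE knobs rather than anything Presto-specific. A sketch of one way to set them; passing them as -D flags on the java command line is the more reliable route, since JSSE reads them when its classes initialize:)

```scala
object TlsDebugFlags {
  def main(args: Array[String]): Unit = {
    // Equivalent to:
    //   java -Djavax.net.debug=ssl:handshake -Djdk.tls.client.protocols=TLSv1.2 -jar app.jar
    sys.props("javax.net.debug") = "ssl:handshake"    // dump ClientHello/ServerHello, extensions, etc. to stderr
    sys.props("jdk.tls.client.protocols") = "TLSv1.2" // or "TLSv1.3" / "TLSv1.1" to force a version

    // ... then connect to Presto as usual and read the handshake log.
  }
}
```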

That seemed at least a little suspicious.

Trying the inverse of Moses' suggestion, I disabled ALPN on JDK 11 and, sure enough, everything works.
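
(For anyone wanting to reproduce the mitigation from outside the shaded OkHttp: recent JDK builds (JDK-8217633) have a jdk.tls.client.disableExtensions system property that can strip individual extensions, ALPN included, out of the ClientHello. A sketch, assuming your JDK 11 update is new enough to have it:)

```scala
object DisableAlpn {
  def main(args: Array[String]): Unit = {
    // Equivalent to:
    //   java -Djdk.tls.client.disableExtensions=application_layer_protocol_negotiation -jar app.jar
    // Strips the ALPN extension out of the ClientHello entirely (requires JDK-8217633 in your build).
    sys.props("jdk.tls.client.disableExtensions") = "application_layer_protocol_negotiation"

    // ... then connect to Presto as usual; without ALPN the connection falls back to plain HTTP/1.1 over TLS.
  }
}
```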

I don't yet know what the exact bug is, but it definitely seems related to ALPN on JDK 11, across multiple JDK vendors. Or maybe it's on Amazon's side, in a TLS-terminating ELB or something. Or just a weird interaction between the two implementations? Who knows.

But at least I'm unblocked, and now enjoying Bazel 3.7 and Scala 2.13 like God intended.

Followups

  • The initial problem occurred on Mac OSX (Catalina).
  • I've reproduced it (and the mitigation) in a Docker container running Alpine / OpenJDK 11.