@virusdave · Last active February 7, 2021 20:27

WTF?

TL;DR: I wrote a Scala program (Scala 2.12, Bazel 3.2, JDK zulu8) that uses the Presto JDBC driver to connect to a company-internal Presto cluster.
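
For context, the Presto JDBC side of the program boils down to roughly the following (a minimal sketch; the hostname, catalog, and user are made up, and it assumes the classic com.facebook.presto.jdbc driver on the classpath, which registers itself with DriverManager):

```scala
import java.sql.DriverManager
import java.util.Properties

object PrestoSmokeTest {
  def main(args: Array[String]): Unit = {
    // Hypothetical cluster endpoint; SSL=true since the cluster is reached over HTTPS
    // (in my case via an Amazon ELB terminating TLS).
    val url = "jdbc:presto://presto.internal.example.com:443/hive/default"
    val props = new Properties()
    props.setProperty("user", "dave")
    props.setProperty("SSL", "true")

    val conn = DriverManager.getConnection(url, props)
    try {
      val rs = conn.createStatement().executeQuery("SELECT 1")
      while (rs.next()) println(rs.getInt(1))
    } finally conn.close()
  }
}
```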

It works fine.

Then I upgraded some stuff and it stopped working. Pinning down exactly when it works and when it fails took me MANY hours.

Details

This particular choice of tech versions was dictated by wanting a completely pinned, deterministic build stack (using Nix on Mac OSX Catalina currently).

Until recently, nixpkgs had Bazel 3.3 pinned, which capped me at a rules_scala version that supported only Scala 2.12. Bleh.

But recently Bazel 3.7 support landed in nixpkgs, so I upgraded to that nixpkgs version and bumped my Bazel to match. Doing so enabled updating the codebase to Scala 2.13 (yay!), but the newer Bazel Scala support apparently wanted to run on JDK 11 (something about using -XXCompactStrings during compile, which didn't work on JDK 8). OK, that's simple in Nix, so I swapped the system JDK 8 for JDK 11 and, after a little work, had everything compiling and the tests passing.

Yay.

But my program now seemed to hang when making requests to the Presto cluster!

Debugging this was a pain, because the Presto JDBC driver apparently uses a vendored (or maybe shaded) copy of OkHttp to make its requests, so I couldn't easily drop in debugging hooks in any straightforward way.

Many hours of debugging later, I decided it was related to executing the uberjar on JDK 11. Running the same uberjar on JDK 8 worked fine, as did OpenJDK 13 and JDK 14. Hmm.

Some super-hacky debugging code revealed that failures would sometimes happen during the initial request, and sometimes in the one immediately after that. When it failed during the initial request, I'd see the Amazon ELB report a 502. When it failed afterwards... Nothing. Just a frozen process with a lingering connection established.

Using jstack during this time showed a thread blocked trying to write to the socket.

Trying to debug the TLS behavior on JDK 11 was... interesting. I used a MITM proxy called Proxyman to try to inspect the traffic, but then... everything worked just fine! Disable the proxy and it failed again.
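
(If you want to reproduce the MITM step, one way is the standard JVM proxy system properties, which stock OkHttp honors via the default ProxySelector; whether the shaded copy inside the driver behaves the same is an assumption, the port below is a placeholder for whatever Proxyman actually listens on, and the proxy's root CA has to be trusted by the JVM. A sketch:)

```scala
object ViaMitmProxy {
  def main(args: Array[String]): Unit = {
    // Route HTTPS through a local MITM proxy. Passing these as -D flags on the java
    // command line works just as well and avoids any property-caching subtleties.
    sys.props("https.proxyHost") = "127.0.0.1"
    sys.props("https.proxyPort") = "9090" // placeholder: use whatever port Proxyman reports

    // The proxy's root CA also needs to be trusted, e.g. via
    //   -Djavax.net.ssl.trustStore=/path/to/truststore-containing-proxyman-ca.jks

    // ... then run the Presto JDBC code as usual and watch the decrypted traffic in Proxyman.
  }
}
```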

I tried swapping out the entire TLS implementation as suggested by Yuri, using Conscrypt instead, but I saw the same behavior.
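
(For anyone who wants to try the same swap: registering Conscrypt ahead of the default provider before any SSLContext is created is roughly all it takes. A minimal sketch, assuming the org.conscrypt conscrypt-openjdk-uber artifact is on the classpath:)

```scala
import java.security.Security
import org.conscrypt.Conscrypt

object UseConscrypt {
  def main(args: Array[String]): Unit = {
    // Make Conscrypt the most-preferred security provider so its TLS implementation
    // is used instead of the JDK's SunJSSE. Must run before the first TLS connection.
    Security.insertProviderAt(Conscrypt.newProvider(), 1)

    // ... then connect to Presto as usual; OkHttp will now handshake via Conscrypt.
  }
}
```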

Using -Djavax.net.debug=ssl:handshake for debugging (when using the default crypto stack) gave me a ton of info, including that TLS 1.2 was being negotiated. I tried forcing both 1.3 and 1.1, but the remote server rejected both. I also noticed that in both the JDK 11 no-proxy (fail) and JDK 14 no-proxy (success) cases, ALPN was offered and accepted, yet the default TLS implementation logged Ignore impact of unsupported extension: application_layer_protocol_negotiation. In the JDK 11 with-proxy (success) case, however, the proxy didn't offer ALPN during the ServerHello at all.
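
(Both of those are standard JSSE knobs rather than anything Presto-specific. A sketch of one way to set them; passing them as -D flags on the java command line is the more reliable route, since JSSE reads them when its classes initialize:)

```scala
object TlsDebugFlags {
  def main(args: Array[String]): Unit = {
    // Equivalent to:
    //   java -Djavax.net.debug=ssl:handshake -Djdk.tls.client.protocols=TLSv1.2 -jar app.jar
    sys.props("javax.net.debug") = "ssl:handshake"    // dump ClientHello/ServerHello, extensions, etc. to stderr
    sys.props("jdk.tls.client.protocols") = "TLSv1.2" // or "TLSv1.3" / "TLSv1.1" to force a version

    // ... then connect to Presto as usual and read the handshake log.
  }
}
```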

That seemed at least a little suspicious.

Trying the inverse of Moses' suggestion, I disabled ALPN on JDK 11 and, sure enough, everything works.
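
(For anyone wanting to reproduce the mitigation from outside the shaded OkHttp: recent JDK builds (JDK-8217633) have a jdk.tls.client.disableExtensions system property that can strip individual extensions, ALPN included, out of the ClientHello. A sketch, assuming your JDK 11 update is new enough to have it:)

```scala
object DisableAlpn {
  def main(args: Array[String]): Unit = {
    // Equivalent to:
    //   java -Djdk.tls.client.disableExtensions=application_layer_protocol_negotiation -jar app.jar
    // Strips the ALPN extension out of the ClientHello entirely (requires JDK-8217633 in your build).
    sys.props("jdk.tls.client.disableExtensions") = "application_layer_protocol_negotiation"

    // ... then connect to Presto as usual; without ALPN the connection falls back to plain HTTP/1.1 over TLS.
  }
}
```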

I don't yet know what the exact bug is, but it definitely seems related to ALPN on JDK 11, across multiple JDK vendors. Or maybe it's on Amazon's side, in a TLS-terminating ELB or something. Or just a weird interaction between the two implementations? Who knows.

But at least I'm unblocked, and now enjoying Bazel 3.7 and Scala 2.13 like God intended.

Followups

  • The initial problem occurred on Mac OSX (Catalina).
  • I've reproduced it (and the mitigation) in a Docker container running Alpine / OpenJDK 11.