Skip to content

Instantly share code, notes, and snippets.

@nchammas
Last active January 5, 2024 16:17
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save nchammas/5eb46cbbcc8f5fc197cefbc2b0add819 to your computer and use it in GitHub Desktop.
Save nchammas/5eb46cbbcc8f5fc197cefbc2b0add819 to your computer and use it in GitHub Desktop.

I have a very simple Spark Connect Python script that is showing very strange behavior that I am having trouble debugging. Spark Connect uses gRPC under the hood to communicate with a remote Apache Spark cluster. The cluster I'm pointed at is on a private network, so sharing the script is probably not helpful since you won't be able to run it.

The gist of the issue is that this script connects to the cluster and runs a simple test query. (Think SELECT 1 against a database.)

When I run the script from my home network, it takes 20-30 seconds to run. When I tether my workstation to my phone and run the same script, it takes 1-2 seconds. So the issue is somehow connected to my home network.

I suspected DNS at first, but switching between Cloudflare, Quad9, and Google DNS all yield the same result on my home network.

I profiled my script using cProfile and the call that came up as taking all the time is this one:

method 'next_event' of 'grpc._cython.cygrpc.SegregatedCall'

This call is made via this Spark Connect method.

So I enabled gRPC tracing as described in the troubleshooting guide to dig further:

GRPC_VERBOSITY=debug GRPC_TRACE=all python -u test.py &> home-network.txt
GRPC_VERBOSITY=debug GRPC_TRACE=all python -u test.py &> lte-network.txt

I would share the full output but it's a lot of noise mixed with some secrets and keys that I would have to redact. But I did find what I believe to be the key difference between these two traces.

home-network.txt has a bunch of lines that lte-network.txt doesn't have, which look like they are related to a network issue:

I0104 16:17:04.205512000 140704379816832 completion_queue.cc:965]      grpc_completion_queue_next(cq=0x7fb61df17470, deadline=gpr_timespec { tv_sec: 1704403024, tv_nsec: 405510000, clock_type: 1 }, reserved=0x0)
D0104 16:17:04.249492000 140704379816832 grpc_ares_wrapper.cc:366]     (c-ares resolver) request:0x7fb63e1d38f0 readable on c-ares fd: 8
D0104 16:17:04.249705000 140704379816832 grpc_ares_wrapper.cc:670]     (c-ares resolver) request:0x7fb63e1d38f0 on_hostbyname_done_locked qtype=A host=dbc-REDACTED.cloud.databricks.com ARES_SUCCESS
D0104 16:17:04.249733000 140704379816832 grpc_ares_wrapper.cc:712]     (c-ares resolver) request:0x7fb63e1d38f0 c-ares resolver gets a AF_INET result: 
  addr: 44.REDACTED
  port: 443

D0104 16:17:04.249763000 140704379816832 grpc_ares_wrapper.cc:193]     (c-ares resolver) request:0x7fb63e1d38f0 Ref ev_driver 0x7fb63e1fcd60
D0104 16:17:04.249781000 140704379816832 grpc_ares_wrapper.cc:449]     (c-ares resolver) request:0x7fb63e1d38f0 notify read on: c-ares fd: 8
D0104 16:17:04.249794000 140704379816832 grpc_ares_wrapper.cc:204]     (c-ares resolver) request:0x7fb63e1d38f0 Unref ev_driver 0x7fb63e1fcd60
D0104 16:17:04.267750000 140704379816832 grpc_ares_wrapper.cc:366]     (c-ares resolver) request:0x7fb63e1d38f0 readable on c-ares fd: 8
D0104 16:17:04.267957000 140704379816832 grpc_ares_wrapper.cc:193]     (c-ares resolver) request:0x7fb63e1d38f0 Ref ev_driver 0x7fb63e1fcd60
D0104 16:17:04.268002000 140704379816832 grpc_ares_wrapper.cc:449]     (c-ares resolver) request:0x7fb63e1d38f0 notify read on: c-ares fd: 8
D0104 16:17:04.268026000 140704379816832 grpc_ares_wrapper.cc:204]     (c-ares resolver) request:0x7fb63e1d38f0 Unref ev_driver 0x7fb63e1fcd60
# NOTE: Both traces are more or less identical up to this point.
# The following part, however, is specific to home-network.txt.
I0104 16:17:04.407392000 140704379816832 completion_queue.cc:1069]     RETURN_EVENT[0x7fb61df17470]: QUEUE_TIMEOUT
I0104 16:17:04.407523000 140704379816832 completion_queue.cc:965]      grpc_completion_queue_next(cq=0x7fb61df17470, deadline=gpr_timespec { tv_sec: 1704403024, tv_nsec: 607516000, clock_type: 1 }, reserved=0x0)
I0104 16:17:04.609106000 140704379816832 completion_queue.cc:1069]     RETURN_EVENT[0x7fb61df17470]: QUEUE_TIMEOUT
I0104 16:17:04.609209000 140704379816832 completion_queue.cc:965]      grpc_completion_queue_next(cq=0x7fb61df17470, deadline=gpr_timespec { tv_sec: 1704403024, tv_nsec: 809195000, clock_type: 1 }, reserved=0x0)
I0104 16:17:04.809394000 140704379816832 completion_queue.cc:1069]     RETURN_EVENT[0x7fb61df17470]: QUEUE_TIMEOUT
I0104 16:17:04.809430000 140704379816832 completion_queue.cc:965]      grpc_completion_queue_next(cq=0x7fb61df17470, deadline=gpr_timespec { tv_sec: 1704403025, tv_nsec: 9426000, clock_type: 1 }, reserved=0x0)
I0104 16:17:05.011529000 140704379816832 completion_queue.cc:1069]     RETURN_EVENT[0x7fb61df17470]: QUEUE_TIMEOUT
I0104 16:17:05.011597000 140704379816832 completion_queue.cc:965]      grpc_completion_queue_next(cq=0x7fb61df17470, deadline=gpr_timespec { tv_sec: 1704403025, tv_nsec: 211592000, clock_type: 1 }, reserved=0x0)
I0104 16:17:05.201301000 123145868943360 timer_manager.cc:207]         wait ended: was_timed:1 kicked:0
D0104 16:17:05.201404000 123145868943360 timer_generic.cc:696]         TIMER CHECK BEGIN: now=2615 next=9223372036854775807 tls_min=1609 glob_min=2610

There are 113 instances of RETURN_EVENT[0x7fb61df17470]: QUEUE_TIMEOUT in home-network.txt and none at all in lte-network.txt.

So I have a lead, but I don't know where to go from here. None of my coworkers have this issue.

What's the meaning of QUEUE_TIMEOUT? How can I debug this further?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment