Skip to content

Instantly share code, notes, and snippets.

@askmeegs
Created September 19, 2020 19:46
Show Gist options
  • Save askmeegs/9952b498588dc230b183f7dabaed9481 to your computer and use it in GitHub Desktop.
Save askmeegs/9952b498588dc230b183f7dabaed9481 to your computer and use it in GitHub Desktop.
BoA Python Services Latency - Log (Sept. 2020)

debugging python latency

curling home / of bank-of-anthos.xyz (running v0.2.0) has 85ms latency

master / latest manifests - curling home / has 4.5s latency running v0.2.0 on the same test clster --> 27ms latency

some code change between v0.2.0 and now is causing the latency

possibilities

hypothesis is: opentelemetry tracing export

run 6/17 last commit 06dbf66d61f55b95f6c9e098b4b9585059811647 --> (after logging, before tracing) 27ms latency

6/30 (PR right before Python OT tracing) dc4fa3b4ccde4df97ee2b2ae55fe17f2f22c0bd5 --> 27 ms

7/1 (Python OT tracing commit) cf756bb45b870217c478e84ec50c1ca4f45af5e3 --> 2-5 seconds

confirmed - opentelemetry is causing ~4 extra seconds of latency to the python services.

Debugging which part of OT tracing is causing latency

Process: now switched back to the latest master. Removed the following 3 things from all the python services at once, then redeployed + timed the "/" (login screen) endpoint.

toggle "ENABLE_TRACING=false" for all python services

27ms

re-toggled to "true" to test nuances ....

Old OT package version?

Yes - the exporter is outdated. Google Cloud moved their exporters to a separate repo / package.

https://google-cloud-opentelemetry.readthedocs.io/en/latest/

my solution - new starting point

  1. Upgrade everything OT to 13b0 (including the flaskinstrumentory), switching library names to conform to latest OT package changes / where the Cloud Ops exporter now lives on Github.
  2. [POSSIBLY] if Flask instrumentor continues to add multiple seconds of latency, remove it from all the python services for now, and file an OT bug.

links for up-to-date package versions

https://opentelemetry-python.readthedocs.io/en/latest/instrumentation/flask/flask.html https://opentelemetry-python.readthedocs.io/en/latest/instrumentation/sqlalchemy/sqlalchemy.html

https://github.com/GoogleCloudPlatform/opentelemetry-operations-python https://pypi.org/project/opentelemetry-exporter-google-cloud/ https://pypi.org/project/opentelemetry-tools-google-cloud/

https://google-cloud-opentelemetry.readthedocs.io/en/stable/examples/cloud_trace_exporter/README.html https://google-cloud-opentelemetry.readthedocs.io/en/stable/examples/tools/cloud_trace_propagator/README.html

still slow after upgrade ... started commented things out

Determined that it's the GCP Cloud Trace exporter causing latency. Continuing to test by just trying to hit the frontend home (login screen):

With cloud trace export setup commented out, all 3 python services -

➜  ~ hey -n 100 -c 10 http://104.196.36.38:80

Summary:
  Total:	0.9168 secs
  Slowest:	0.3131 secs
  Fastest:	0.0224 secs
  Average:	0.0865 secs
  Requests/sec:	109.0697

  Total data:	658900 bytes
  Size/request:	6589 bytes

With cloud trace export included in all services --

➜  ~ hey -n 100 -c 10 http://104.196.36.38:80

Summary:
  Total:	53.4578 secs
  Slowest:	7.4205 secs
  Fastest:	3.5342 secs
  Average:	5.1236 secs
  Requests/sec:	1.8706
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment