Bugzilla 1624408: Downloading TMX through curl interrupts midway
2020-06-06 21:18:35 +0200
About the bug
curl can't download TMX files from Pontoon when the file contains an enormous number of entries.
```
$ time curl -o de.all-projects.tmx https://pontoon.mozilla.org/de/all-projects/de.all-projects.tmx
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 12.2M    0 12.2M    0     0   249k      0 --:--:--  0:00:50 --:--:--  212k
curl: (18) transfer closed with outstanding read data remaining

real    0m50,538s
user    0m1,262s
sys     0m1,905s
```
The download of this TMX file takes around 50 seconds on average, and the operation finishes abruptly with the following error:
curl: (18) transfer closed with outstanding read data remaining
TMX data is being streamed through the following layers:
curl/wget > Heroku router > Gunicorn > Pontoon backend
Heroku implements HTTP/1.1 and supports streaming as well. Importantly, Heroku times out the HTTP connection when the application doesn't send any bytes during a 50-second window; the timeout window resets after every sent byte. The H18 error suggests that Pontoon's backend raises an error.
The Gunicorn documentation suggests using an asynchronous worker when an application needs to stream data, yet Pontoon uses the default synchronous Gunicorn worker. Hypothesis: Gunicorn aborts the HTTP request because of its worker `timeout` setting, since generating the TMX file blocks a synchronous worker for too long.
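For reference, the relevant knobs live in Gunicorn's config file, which is plain Python. A minimal sketch (the file name and values here are illustrative assumptions, not Pontoon's actual configuration):

```python
# gunicorn.conf.py -- hypothetical example, not Pontoon's actual config.
# A synchronous worker that doesn't notify the arbiter within `timeout`
# seconds is killed with "WORKER TIMEOUT"; async worker classes are far
# less likely to trip this while a long response is being generated.
worker_class = "gevent"  # default is "sync"
workers = 2
timeout = 30             # Gunicorn's default, in seconds
```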
The algorithm behind the TMX download view is fairly simple: it iterates over a queryset of Translation Memory entries and returns data to the HTTP client for every TMX entry separately. The application also doesn't raise any visible error or exception that could be related to the TMX download view.
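The iteration described above can be sketched as a plain generator. Everything below is illustrative rather than Pontoon's actual code: `entries` stands in for the queryset, and the TMX skeleton is trimmed to the essentials.

```python
from xml.sax.saxutils import escape

def stream_tmx(entries):
    """Yield a TMX document piece by piece: one <tu> per translation entry.

    `entries` is a stand-in for the queryset of Translation Memory
    entries; each item is a (source, target) pair in this sketch.
    """
    yield '<?xml version="1.0" encoding="utf-8"?>\n<tmx version="1.4">\n<body>\n'
    for source, target in entries:
        # Each entry becomes its own chunk, so the HTTP client receives
        # data continuously instead of waiting for the whole file.
        yield (
            "<tu>"
            f'<tuv xml:lang="en"><seg>{escape(source)}</seg></tuv>'
            f'<tuv xml:lang="de"><seg>{escape(target)}</seg></tuv>'
            "</tu>\n"
        )
    yield "</body>\n</tmx>\n"

# In a Django view this generator would typically be wrapped in
# StreamingHttpResponse(stream_tmx(entries), content_type="text/xml").
```

With millions of entries this means millions of tiny writes, one per entry, which is the behavior under suspicion here.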
- The problem probably isn't related to Heroku's limits, because the transfer time window resets after every sent byte. Wireshark shows that Pontoon's backend sends data constantly until the connection breaks after around 50 seconds.
- Gunicorn currently uses a synchronous worker, which isn't the best choice for streaming-friendly applications. Switching to an asynchronous worker could potentially help.
- What's strange: with `--compressed`, the download works faster and without errors (?)
- Maybe the TMX download view produces too many bytes and they (?)
In an attempt to debug the problem, I created my own instance of Pontoon and populated the database with thousands of Translation Memory entries. The issue occurred again (which is good) and was fully reproducible. Thanks to that, I spotted the following log entry in Heroku's application logs:
```
2020-06-06T23:30:14.219137+00:00 app[web.1]: [2020-06-06 23:30:14 +0000] [CRITICAL] WORKER TIMEOUT (pid:10)
2020-06-06T23:30:14.219184+00:00 heroku[router]: sock=backend at=error code=H18 desc="Server Request Interrupted" method=GET path="/af/test1/af.test1.tmx" host=pontoon-tmx.herokuapp.com request_id=1af27c1d-8d8c-4c4f-89c3-37af8ea1f1bb fwd="220.127.116.11" dyno=web.1 connect=0ms service=30109ms status=503 bytes= protocol=https
```
Possible solutions:
- Increase the worker `timeout` in the Gunicorn config
  - Helps, however it may introduce new problems.
- Change the Gunicorn worker type to an asynchronous one
  - Switching to gevent didn't help.
- Implement buffering in the TMX download view
  - I implemented a basic buffering mechanism, but it didn't solve the issue.
- Regenerate static files periodically and serve them via a CDN (?)
  - That's probably the least feasible idea.
- Increase the network throughput from Heroku somehow (?)
  - Not sure how to do that, and I couldn't find anything on the
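The buffering idea from the list above can be sketched as a wrapper generator that coalesces small per-entry chunks into larger writes. This is a minimal illustration of the approach, not the mechanism I actually implemented; the 64 KB threshold is an arbitrary assumption.

```python
def buffered(chunks, min_size=64 * 1024):
    """Accumulate small string chunks and yield them in batches of at
    least `min_size` characters, flushing the remainder at the end."""
    buffer = []
    buffered_len = 0
    for chunk in chunks:
        buffer.append(chunk)
        buffered_len += len(chunk)
        if buffered_len >= min_size:
            # Enough data accumulated: emit one large chunk downstream.
            yield "".join(buffer)
            buffer = []
            buffered_len = 0
    if buffer:
        # Flush whatever is left after the source generator is exhausted.
        yield "".join(buffer)
```

Wrapping the TMX generator in `buffered(...)` cuts the number of writes dramatically, though as noted above it didn't make the timeout go away in my tests.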