saicitus/Steps_to_reproduce_tcp_keepalive_issue.txt

## Steps_to_reproduce_tcp_keepalive_issue.txt
**Issue**:
1) Start a long running query say ~5 minutes. Ex:
SELECT count(*) from github_events g1,github_events g2 WHERE g1.repo_id=g2.repo_id;
2) Terminate a worker whilst the query is running.
3) Now the system hangs indefinitely.

**Reasoning**
I think what's happening is that removing the network interface doesn't kill the connection. As a result, the coordinator node waits for a response from the worker node, which never comes.

Citus uses PostgreSQL's client connection library (libpq) to communicate between machines; and Postgres uses system defaults to timeout in this scenario. On Linux, I think the default TCP keepalive time (tcp_keepalive_time) is set to 7200 seconds.

An easy way to validate this hypothesis is by tweaking tcp related settings at the OS or database level. Postgres allows you to override them in postgresql.conf  (tcp_keepalives_idle, tcp_keepalives_interval, tcp_keepalives_count).

**Solution**
1) OS Level Change - https://www.gnugk.org/keepalive.html
For linux, see below
Use sysctl -A to get a list of available kernel variables
and grep this list for net.ipv4 settings (sysctl -A | grep net.ipv4).
There should exist the following variables:
- net.ipv4.tcp_keepalive_time - time of connection inactivity after which
                                the first keep alive request is sent
- net.ipv4.tcp_keepalive_probes - number of keep alive requests retransmitted
                                  before the connection is considered broken
- net.ipv4.tcp_keepalive_intvl - time interval between keep alive probes

You can manipulate with these settings using the following command:

sysctl -w net.ipv4.tcp_keepalive_time=60 net.ipv4.tcp_keepalive_probes=3 net.ipv4.tcp_keepalive_intvl=40

This sample command changes TCP keepalive timeout to 60 seconds with 3 probes,
40 seconds gap between each. With this, your application will detect dead TCP
connections after 180 seconds (60 + 40 + 40 + 40).


2) In postgresql.conf on coordinator:
tcp_keepalives_idle = 60        # TCP_KEEPIDLE, in seconds;
                                        # 0 selects the system default
tcp_keepalives_interval = 40            # TCP_KEEPINTVL, in seconds;
                                        # 0 selects the system default
tcp_keepalives_count = 3                # TCP_KEEPCNT;
                                        # 0 selects the system default

3) Restart the server.
4) Start a long running query say ~5 minutes long running query. Ex:
SELECT count(*) from github_events g1,github_events g2 WHERE g1.repo_id=g2.repo_id;
5) Terminate a worker whilst the query is running
6) After sometime either an error is thrown(if replication factor is 1) or a warning followed by result is shown(if replication factor >1)
postgres=# SELECT count(*) from github_events g1,github_events g2 WHERE g1.repo_id=g2.repo_id;
WARNING:  could not consume data from worker node
   count
------------
 1743895296
(1 row)

Time: 252927.032 ms
	Issue:
	1) Start a long running query say ~5 minutes. Ex:
	SELECT count(*) from github_events g1,github_events g2 WHERE g1.repo_id=g2.repo_id;
	2) Terminate a worker whilst the query is running.
	3) Now the system hangs indefinitely.

	Reasoning
	I think what's happening is that removing the network interface doesn't kill the connection. As a result, the coordinator node waits for a response from the worker node, which never comes.

	Citus uses PostgreSQL's client connection library (libpq) to communicate between machines; and Postgres uses system defaults to timeout in this scenario. On Linux, I think the default TCP keepalive time (tcp_keepalive_time) is set to 7200 seconds.

	An easy way to validate this hypothesis is by tweaking tcp related settings at the OS or database level. Postgres allows you to override them in postgresql.conf (tcp_keepalives_idle, tcp_keepalives_interval, tcp_keepalives_count).

	Solution
	1) OS Level Change - https://www.gnugk.org/keepalive.html
	For linux, see below
	Use sysctl -A to get a list of available kernel variables
	and grep this list for net.ipv4 settings (sysctl -A \| grep net.ipv4).
	There should exist the following variables:
	- net.ipv4.tcp_keepalive_time - time of connection inactivity after which
	the first keep alive request is sent
	- net.ipv4.tcp_keepalive_probes - number of keep alive requests retransmitted
	before the connection is considered broken
	- net.ipv4.tcp_keepalive_intvl - time interval between keep alive probes

	You can manipulate with these settings using the following command:

	sysctl -w net.ipv4.tcp_keepalive_time=60 net.ipv4.tcp_keepalive_probes=3 net.ipv4.tcp_keepalive_intvl=40

	This sample command changes TCP keepalive timeout to 60 seconds with 3 probes,
	40 seconds gap between each. With this, your application will detect dead TCP
	connections after 180 seconds (60 + 40 + 40 + 40).


	2) In postgresql.conf on coordinator:
	tcp_keepalives_idle = 60 # TCP_KEEPIDLE, in seconds;
	# 0 selects the system default
	tcp_keepalives_interval = 40 # TCP_KEEPINTVL, in seconds;
	# 0 selects the system default
	tcp_keepalives_count = 3 # TCP_KEEPCNT;
	# 0 selects the system default

	3) Restart the server.
	4) Start a long running query say ~5 minutes long running query. Ex:
	SELECT count(*) from github_events g1,github_events g2 WHERE g1.repo_id=g2.repo_id;
	5) Terminate a worker whilst the query is running
	6) After sometime either an error is thrown(if replication factor is 1) or a warning followed by result is shown(if replication factor >1)
	postgres=# SELECT count(*) from github_events g1,github_events g2 WHERE g1.repo_id=g2.repo_id;
	WARNING: could not consume data from worker node
	count
	------------
	1743895296
	(1 row)

	Time: 252927.032 ms