Skip to content

Instantly share code, notes, and snippets.

@frsyuki
Created July 25, 2014 01:38
Show Gist options
  • Save frsyuki/3ebc4b56b32cbdc614bc to your computer and use it in GitHub Desktop.
Save frsyuki/3ebc4b56b32cbdc614bc to your computer and use it in GitHub Desktop.
Reading data from a socket hangs for at least 1 hour even if I set Socket.setSoTimeout to non-0 value. It happens very rarely (once a month under 100 or more connections in parallel with continuously 10MB/s or more data transfer).
Although it rarely happens, it leads significant impact to applications if it happens.
When it hanged, thread dump includes following stack trace:
----
"pool-3-thread-6" prio=10 tid=0x00007fd43006e800 nid=0xdf23 runnable [0x00007fd424e56000]
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:204)
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:182)
at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:138)
at java.io.FilterInputStream.read(FilterInputStream.java:116)
at java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:221)
- locked <0x00000000fed8a010> (a java.lang.Object)
at com.treasure_data.storage.s3.S3ReadableByteChannel.read(S3ReadableByteChannel.java:80)
at com.treasure_data.storage.prefetch.Prefetch$Prefetcher.run(Prefetch.java:48)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
----
When it hanged, I attached gdb to the java process and found that O_NONBLOCK is not set to the socket's file descriptor. SO_RCVTIMEO is set to 0 which means no timeout.
I downloaded source code of OpenJDK 6, 7, 8 and 9 and found that SocketInputStream/socketRead0 calls poll(2) and then call recv(2) on Linux. It's a commonly known issue on Linux that poll(2) could report that a socket file descriptor is ready to read even if it's actually not ready to read. This could happen when data has arrived but a packet has wrong checksum and is discarded. Here is the source code from OpenJDK 9 commit 2df45ac1bf49:
----
JNIEXPORT jint JNICALL
Java_java_net_SocketInputStream_socketRead0(JNIEnv *env, jobject this,
jobject fdObj, jbyteArray data,
jint off, jint len, jint timeout)
{
...
if (timeout) {
nread = NET_Timeout(fd, timeout);
if (nread <= 0) {
...
return -1;
}
}
nread = NET_Read(fd, bufP, len);
...
}
----
NET_Timeout, which calls poll(2), could return with return code 1 even if following read blocks if a packet's checksum is wrong. Then NET_Read, which calls recv(2), blocks because O_NONBLOCK flag is not set to the file descriptor and SO_RCVTIMEO is 0.
According to source code, this problem exists in Java 6, 7, 8 and 9.
I found similar reports on OpenJDK bug database:
* https://bugs.openjdk.java.net/browse/JDK-7132648
* https://bugs.openjdk.java.net/browse/JDK-8032910
I found another recent report on Apache HTTP Client's mailing list:
* 2014-04-14 "I have set read time out but thread hang on socketRead0": http://mail-archives.apache.org/mod_mbox/hc-dev/201404.mbox/%3CCAFAd71WYx0UX7Mpgof2Mj=f1080CqEmR+=bRDZ9fjivZQUqV+g@mail.gmail.com%3E
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment