This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Reading data from a socket hangs for at least 1 hour even if I set Socket.setSoTimeout to non-0 value. It happens very rarely (once a month under 100 or more connections in parallel with continuously 10MB/s or more data transfer). | |
Although it rarely happens, it leads significant impact to applications if it happens. | |
When it hanged, thread dump includes following stack trace: | |
---- | |
"pool-3-thread-6" prio=10 tid=0x00007fd43006e800 nid=0xdf23 runnable [0x00007fd424e56000] | |
java.lang.Thread.State: RUNNABLE | |
at java.net.SocketInputStream.socketRead0(Native Method) | |
at java.net.SocketInputStream.read(SocketInputStream.java:129) | |
at org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:204) | |
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:182) | |
at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:138) | |
at java.io.FilterInputStream.read(FilterInputStream.java:116) | |
at java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:221) | |
- locked <0x00000000fed8a010> (a java.lang.Object) | |
at com.treasure_data.storage.s3.S3ReadableByteChannel.read(S3ReadableByteChannel.java:80) | |
at com.treasure_data.storage.prefetch.Prefetch$Prefetcher.run(Prefetch.java:48) | |
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) | |
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) | |
at java.lang.Thread.run(Thread.java:619) | |
---- | |
When it hanged, I attached gdb to the java process and found that O_NONBLOCK is not set to the socket's file descriptor. SO_RCVTIMEO is set to 0 which means no timeout. | |
I downloaded source code of OpenJDK 6, 7, 8 and 9 and found that SocketInputStream/socketRead0 calls poll(2) and then call recv(2) on Linux. It's a commonly known issue on Linux that poll(2) could report that a socket file descriptor is ready to read even if it's actually not ready to read. This could happen when data has arrived but a packet has wrong checksum and is discarded. Here is the source code from OpenJDK 9 commit 2df45ac1bf49: | |
---- | |
JNIEXPORT jint JNICALL | |
Java_java_net_SocketInputStream_socketRead0(JNIEnv *env, jobject this, | |
jobject fdObj, jbyteArray data, | |
jint off, jint len, jint timeout) | |
{ | |
... | |
if (timeout) { | |
nread = NET_Timeout(fd, timeout); | |
if (nread <= 0) { | |
... | |
return -1; | |
} | |
} | |
nread = NET_Read(fd, bufP, len); | |
... | |
} | |
---- | |
NET_Timeout, which calls poll(2), could return with return code 1 even if following read blocks if a packet's checksum is wrong. Then NET_Read, which calls recv(2), blocks because O_NONBLOCK flag is not set to the file descriptor and SO_RCVTIMEO is 0. | |
According to source code, this problem exists in Java 6, 7, 8 and 9. | |
I found similar reports on OpenJDK bug database: | |
* https://bugs.openjdk.java.net/browse/JDK-7132648 | |
* https://bugs.openjdk.java.net/browse/JDK-8032910 | |
I found another recent report on Apache HTTP Client's mailing list: | |
* 2014-04-14 "I have set read time out but thread hang on socketRead0": http://mail-archives.apache.org/mod_mbox/hc-dev/201404.mbox/%3CCAFAd71WYx0UX7Mpgof2Mj=f1080CqEmR+=bRDZ9fjivZQUqV+g@mail.gmail.com%3E |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment