RHEL 3.0: Not getting EOF on a TCP/IP socket connection after the peer machine shuts down

Posted on 2004-09-23
Medium Priority
Last Modified: 2012-05-05


I wrote a simple TCP/IP client/server program that currently runs successfully on both HPUX 11.0 and RH8.0 machines. However, I have a problem running it on a RHEL 3.0 machine under certain conditions. The following scenario causes my problem.

The Server program is executing on NODE A, which is running RHEL 3.0.
The Client program is executing on NODE B, which is also running RHEL 3.0.
The Client program continuously sends text messages to the Server through a TCP/IP socket connection. The Server program receives these messages through a read() on the socket and prints them out. Pretty straightforward, right?
However, if the NODE B machine is shut down, the TCP/IP connection between the Client and the Server programs is lost. After a short period of time (less than 1 minute), the Server program running on NODE A should receive a zero from its read() call on the socket (indicating end-of-file [EOF]). On RedHat 8.0 machines, the Server receives the expected EOF, but on RedHat Enterprise 3.0 machines, the read() does not return zero. Instead, every call to read() returns negative one (-1) with an error indicating that the resource is temporarily unavailable. My Server program expects this EOF so that it can terminate cleanly under these conditions.

Has anyone experienced something similar? If so, any solution?
What should I do in researching this problem further?
What I am really looking for is why this happens and a possible solution.

I am in the process of contacting RedHat to see if there are any known bugs in their kernel w.r.t. TCP/IP software.

BTW, you may ask why I am concerned about a problem in a simple piece of software. Well, I wrote the TCP/IP program as a test program to isolate and mimic the behavior of a similar problem in a much larger piece of software.

Thanks in advance for your help,

Question by:iharpsoln

Expert Comment

ID: 12138527
Have you examined the man page for read() (man 2 read)? It pretty clearly states "On success, the number of bytes read is returned (zero indicates end of file), and the file position is advanced by this number." and "On error, -1 is returned, and errno is set appropriately."

In the case you describe I'd expect a -1, not an EOF, since the other end of the connection has "gone away" and the read() can't successfully complete. What is in errno at that point?

Author Comment

ID: 12139223
Hi jlevie:

Thank you for your reply.

I have examined the man page on read() and I accept what it says. However, when a socket connection is established and there is nothing to read, it should return -1 and set errno to some value. In this situation, errno is set to 11 ("Resource temporarily unavailable", which is EAGAIN on Linux), as stated in my original posting. When the machine that the Client is running on (also RHEL 3.0) is shut down, I expect the Server program on RHEL 3.0 to behave the same way it does on the RH8.0 and HPUX 11.0 platforms. That is, read() should return zero with errno set to zero (end-of-file, EOF). This seems to be the case for RH8.0 and HPUX 11.0, but unfortunately I am not seeing it for RHEL 3.0. Furthermore, I see nothing in the man pages that indicates that read() behavior should be any different between RH8.0 and RHEL 3.0.

One thing I neglected to mention in my original posting is that the Server program is set up to perform reads in a NON_BLOCKING fashion.
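
For readers following along, the three outcomes being discussed (data, EOF, and -1 with errno 11) can be told apart as in the sketch below. This is not the poster's code: `read_status`, `classify_read`, and `read_once` are hypothetical names introduced for illustration, and it assumes a POSIX system where the descriptor has already been put into non-blocking mode.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical names, for illustration only. */
enum read_status {
    READ_DATA,        /* n > 0: bytes arrived */
    READ_EOF,         /* n == 0: the peer closed its side in an orderly way */
    READ_WOULD_BLOCK, /* -1 with EAGAIN/EWOULDBLOCK: nothing to read right now */
    READ_INTERRUPTED, /* -1 with EINTR: a signal interrupted the call */
    READ_ERROR        /* -1 with any other errno, e.g. ECONNRESET */
};

/* Map a read() return value and the errno captured right after it. */
enum read_status classify_read(ssize_t n, int err)
{
    if (n > 0)
        return READ_DATA;
    if (n == 0)
        return READ_EOF;
    if (err == EAGAIN || err == EWOULDBLOCK)
        return READ_WOULD_BLOCK;
    if (err == EINTR)
        return READ_INTERRUPTED;
    return READ_ERROR;
}

/* Perform one read() and classify the result. */
enum read_status read_once(int fd, char *buf, size_t len, ssize_t *nread)
{
    ssize_t n = read(fd, buf, len);
    int err = errno; /* only meaningful when n == -1 */

    *nread = n;
    return classify_read(n, err);
}
```

Note that errno is only defined after a failed call; a successful read(), including one that returns 0, does not reset it to zero. On a non-blocking socket, READ_WOULD_BLOCK by itself says nothing about whether the peer is still alive.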

Expert Comment

ID: 12139606
Hmm, it would seem to me that the RH 8.0 & HPUX 11.0 behavior would be wrong. EOF is usually taken to mean that all of the data to be expected has been read, not an error condition caused by the client "disappearing" as in this case. Certainly in the case of files, as compared to a network socket, EOF means you've read all of the data in a file, and an I/O error on the file should not return an EOF. Given that, the RHEL result, where a read() on a broken network link returns -1 and sets errno, would be consistent.

It's been a while, but I seem to remember that Solaris and Irix behave like RHEL 3.0. And I could be misinterpreting it, but my reading of the POSIX standard would imply that the RHEL behavior is correct.

Author Comment

ID: 12143605

Hi jlevie:

As a follow-up to your last comment, I would like to say that I find it difficult to accept that the RH 8.0 & HPUX 11.0 behavior is wrong. Keep in mind that I wrote the simple TCP/IP Client/Server program to isolate and identify a problem I encountered in my API software. This API software has been running with this behavior for many years on an HPUX platform and for the last couple of years on RH7.1 and RH8.0 platforms. Now that the API software has been ported to RHEL 3.0, I have encountered a problem where the behavior is different, which causes the API to function incorrectly.

So, the bottom line question is:
Why does a Server read() on a socket connected to a shut-down Client return zero and set errno to zero when running on HPUX and RH8.0 platforms, but not on a RHEL 3.0 machine?

One last thing....
I just want to stress that the problem I am having only happens when the client's machine is shut down or rebooted. If we just terminate the Client process (e.g. kill -9), the programs behave the same across all platforms, i.e. the Server read() returns zero (indicating EOF) and sets errno to zero.

Expert Comment

ID: 12155938
While I can't yet cite an authoritative reference as to what POSIX would require in this situation, I can offer the results of an informal poll. I posed the question:

"Given that read() returns the number of bytes read, 0 on End Of File, or -1 (and sets errno), what should happen when reading from a network socket with non-blocking I/O if the client connection is severed?"

The unanimous result from people heavily involved with network application development on Solaris, Irix, & RedHat platforms was that read() should return -1 if the client connection is lost for any reason (powerdown, reboot, cutting the wire, etc.), since the server would not receive a FIN from the client. On the other hand, a well-behaved client OS will properly terminate the connection with a FIN if the client application is terminated, which would result in read() returning EOF.

Viewed in the context of the presence or absence of a FIN the behavior of RHEL would seem to be correct.

Assisted Solution

aleric earned 500 total points
ID: 12159999
I think there is a difference at the low level of TCP/IP. In one case, the operating system detects that the connection has been broken, and in the other case it doesn't. Especially because it DOES work when you kill -9 the client, we can conclude that in the case where you don't get an EOF, you actually get no packets at all anymore, and the reaction of the operating system to that is to keep returning -1 with errno == 11. This is understandable, because rebooting could mean that the networking is shut down without any more packets being sent.
The other case, in which you DO get an EOF, might then be caused by either of two things:
1) The rebooting machine reacts differently: it explicitly terminates all connections (by sending out packets), and/or
2) The receiving machine has a 'keep alive' mechanism that times out and figures out that the connection is unreachable that way.
Since you speak about a '1 minute' timeout, I'd guess it is the latter.
The scenario is then: the receiving machine sends packets that the other machine should reply to when no packets have been seen for too long a time. After that machine reboots, no packets are being received AND the machine doesn't reply to those 'keep alive' packets, therefore a timeout is the result and the connection is terminated by the local machine (by returning 0 when calling read(2)).

The reason for this difference might be a difference in socket options on the two machines. I'd suggest you explicitly set the socket options (with getsockopt(2)) and request SO_KEEPALIVE and/or SO_RCVTIMEO. See socket(7).


Expert Comment

ID: 12160023
Obviously I meant setsockopt(2), not getsockopt(2).
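
A minimal sketch of the setsockopt(2) suggestion follows, assuming Linux. `enable_keepalive` is a hypothetical helper, not code from this thread, and the TCP_KEEPIDLE, TCP_KEEPINTVL, and TCP_KEEPCNT options it tunes are Linux-specific. Without that tuning, SO_KEEPALIVE alone uses the system defaults, which on Linux mean two hours of idle time before the first probe.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: turn on TCP keepalive for a connected socket and
 * tune the probe timing. Returns 0 on success, or -1 with errno set by
 * the setsockopt() call that failed.
 */
int enable_keepalive(int fd, int idle_secs, int interval_secs, int probes)
{
    int on = 1;

    /* Ask the kernel to probe the peer when the connection is idle. */
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) < 0)
        return -1;

    /* Idle time, in seconds, before the first probe is sent. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,
                   &idle_secs, sizeof idle_secs) < 0)
        return -1;

    /* Seconds between successive probes. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL,
                   &interval_secs, sizeof interval_secs) < 0)
        return -1;

    /* Unanswered probes tolerated before the connection is dropped. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,
                   &probes, sizeof probes) < 0)
        return -1;

    return 0;
}
```

For example, `enable_keepalive(fd, 30, 5, 3)` would have the kernel give up on a silent peer after roughly 45 seconds. When that happens on Linux, the failure is normally reported as read() returning -1 with errno set to ETIMEDOUT rather than as a zero-length read, so the server still has to treat errors other than EAGAIN as "connection lost".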

Author Comment

ID: 12169790


In regard to your post, you mention that a timeout may have something to do with it. There is no timeout in my original code; I was referring to the fact that the rebooting machine took less than 1 minute to shut down and reboot.

Anyways, since my last posting, I have contacted RedHat and I am in the process of getting a complete and satisfactory answer (I hope). What RedHat has informed me of so far is the following:

The behavior of RHEL 3.0 in my problem scenario (continuously returning -1 and setting errno to 11) is correct, and the behavior of RH8.0 (eventually returning 0 and setting errno to 0) is incorrect. They (RedHat) quoted the RHEL 3.0 documentation on the read() function in support of their claim.
I can accept what the documentation is saying, but it does not adequately address my original question, which I have stated to RedHat and stated here in a previous reply to this topic.

In the light of this new information, let me rephrase the question. In RHEL 3.0 the system read() CONTINUOUSLY returns -1 and sets errno to 11 when the client machine at the other end of a TCP/IP connection is shut down or rebooted (not kill -9 or Ctrl-C on the client). Then HOW will the RHEL 3.0 read() ever detect that the other end of the connection is dead/lost, if it CONTINUOUSLY returns -1 and sets errno to 11?


Accepted Solution

jlevie earned 500 total points
ID: 12170412
What I do in network applications is start a timer when I see the read error on the socket. When the timer expires I try the read again, and if that also fails I assume that the connection has failed and close the socket. Depending on the nature of the application and the network paths involved, the timer gets adjusted to allow for transient failures, like routing problems.
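
The timer approach described above could be sketched roughly as follows. This is one possible reading of it, not jlevie's actual code: `idle_timer` and `idle_timer_update` are hypothetical names, the limit is an application-chosen value, and the caller is assumed to pass in the current time (e.g. from time(NULL)) after each read attempt.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical bookkeeping for "how long have reads been failing?". */
struct idle_timer {
    bool   running;    /* true while a run of failed reads is in progress */
    time_t started;    /* time of the first failed read in the current run */
    time_t limit_secs; /* seconds of continuous failure to tolerate */
};

/*
 * Call after every read attempt. read_failed is true when read() returned
 * -1 (including EAGAIN on a non-blocking socket). Returns true once reads
 * have kept failing for at least limit_secs, at which point the caller
 * would assume the connection is dead and close the socket.
 */
bool idle_timer_update(struct idle_timer *t, bool read_failed, time_t now)
{
    if (!read_failed) {
        t->running = false; /* data or EOF arrived: stop the timer */
        return false;
    }
    if (!t->running) {
        t->running = true;  /* first failure: start the timer */
        t->started = now;
        return false;
    }
    return (now - t->started) >= t->limit_secs;
}
```

This only works as a liveness check when the client is expected to send something within the limit, as the continuously sending Client in this thread does. A connection that is legitimately quiet for longer than the limit would need an application-level heartbeat message to avoid being closed by mistake.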
