Problems with pselect() versus select()?

Years ago (in the early development of a system I am still working on), we encountered an issue concerning select().

We used to use select() for a "select server" that is embedded in an application running on a network of Linux computers. This select server allowed an arbitrary number of GUIs to connect to the computers to display a control panel for the application's engineering process.

Back then, we were using a 2.4 kernel. At that time pselect() was not part of the kernel, but was available in glibc. We had mysterious application crashes that plagued us. After weeks of study, we determined that the crashes coincided with user terminals disconnecting their ssh client connections. The problem appeared to be an occasional race condition in the handling of SIGHUP signals, which caused the process running the select server to crash by executing the default action for SIGHUP (which is to immediately terminate the process).

The cure for that problem was to stop using select() and use pselect() instead. It has two advantages:
1. it allows a timeout to prevent select from hanging if no data arrives, and
2. it saves and restores the signal mask as an atomic operation

Item 2. allowed me to craft a signal handler that did nothing other than set an action specifier variable to 'terminate' if the SIGHUP actually arrived at the process.

The termination problem went away for a long time, and now I have discovered two things:

My select server, which uses pselect(), prepares the signal mask this way:
   sigaddset(sigsetptr, SIGUSR1);
   sigaddset(sigsetptr, SIGUSR2);

but does NOT do this:  
   sigaddset(sigsetptr, SIGHUP);

In retrospect, it seems that I should be masking the SIGHUP signal handler when calling pselect()? (Because I don't want to use the default action of 'terminate')

Otherwise, I am not sure I understand why this code fixed our earlier problem with SIGHUP.

In other words, is just providing pselect() a sigmask to use for saving and restoring the sigmask context a solution to the problem?

Or does 'sigaddset(sigsetptr, SIGHUP);' block the default action for SIGHUP and enable the application's SIGHUP handler?
Or does 'sigaddset(sigsetptr, SIGHUP);' block the application's SIGHUP handler and enable the default system action for SIGHUP?

I also noticed that a new driver, which I just studied closely for the first time only recently, uses select() instead of pselect().

The latest symptom is that on occasion, our GUIs simply stop being able to connect to the application, and on rare occasions the application just stops running for reasons that are still unclear, and the application must be manually restarted.

Does anyone out there know if the pselect() versus select() problem that was such an issue in the 2.4 kernel is still of concern?
Should I be using pselect() instead of select() in the new driver? The new driver is running in a different thread than the original select server.
Duncan Roe (Software Developer) Commented:
I'd say you were lucky that pselect() appeared to fix anything at Kernel 2.4, because it wasn't a kernel call until 2.6.16. Previous to that glibc would emulate it, but that was still vulnerable to the race condition that the kernel call addresses. (I got this information from the SELECT(2) man page dated 2010-08-3, accessed by typing man pselect).
From your Q I wonder whether you might perhaps be slightly confused regarding the separate actions of blocking and handling signals. The effect of blocking signals during select() is that signal handlers for these signals will not execute during the select() call. Quite possibly, it doesn't matter if the SIGHUP handler runs (I have little doubt that your program already has a SIGHUP handler).
The action of blocking any signal is that the signal, if it occurs, is not passed to the program until it is unblocked.
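To make the difference between blocking and handling concrete, here is a minimal sketch of the usual pselect() pattern. It is an illustration under assumptions, not the poster's code: the names on_sighup, setup_sighup and wait_for_input are hypothetical. SIGHUP is handled (so the default terminate action never applies) and is kept blocked everywhere except inside pselect(), which installs the wait mask and restores the old one atomically.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/select.h>
#include <time.h>

/* Set by the handler, examined by the main loop (hypothetical name). */
static volatile sig_atomic_t got_sighup = 0;

static void on_sighup(int signo)
{
    (void)signo;
    got_sighup = 1;              /* nothing else: keep the handler trivial */
}

/* Install the handler, then block SIGHUP for normal execution.
 * *wait_mask receives the previous mask with SIGHUP removed, i.e. the
 * mask that should be in effect only while pselect() is waiting. */
int setup_sighup(sigset_t *wait_mask)
{
    struct sigaction sa;
    sigset_t block;

    sa.sa_handler = on_sighup;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGHUP, &sa, NULL) == -1)
        return -1;

    sigemptyset(&block);
    sigaddset(&block, SIGHUP);
    if (sigprocmask(SIG_BLOCK, &block, wait_mask) == -1)
        return -1;
    sigdelset(wait_mask, SIGHUP);
    return 0;
}

/* Wait for readable descriptors. A pending or newly arriving SIGHUP is
 * delivered only during this call, which then fails with EINTR. */
int wait_for_input(int nfds, fd_set *readfds, const sigset_t *wait_mask,
                   long timeout_sec)
{
    struct timespec ts;

    ts.tv_sec = timeout_sec;
    ts.tv_nsec = 0;
    return pselect(nfds, readfds, NULL, NULL, &ts, wait_mask);
}
```

Note that the mask passed to pselect() lists the signals that stay blocked during the wait, so adding SIGHUP to that mask would postpone the handler rather than enable it. In a multi-threaded program, pthread_sigmask() would normally take the place of sigprocmask().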
Should I be using pselect() instead of select() in the new driver? That depends on what the handlers for SIGUSR1 and SIGUSR2 do. You mention that your program is multi-threaded - prior to kernel implementation of native threads, thread management was done using SIGUSR1 and possibly SIGUSR2. If that was the reason for blocking them in your code, it would no longer apply when building with modern threads libraries. But you should audit your code for any explicit use of these signals.
fklein23 (Author) Commented:

Thank you for this thoughtful reply. I am still digesting it.

I have not revisited the select/pselect issue since 2006, and when I visited it then, I didn't have enough time to spend on it. I was the sole programmer on the project and was juggling dozens of issues simultaneously while working 80 hours a week under enormous pressure, so when I got a "solution" to a thorny problem, I put it to bed instantly and moved on to the next most pressing issue.

While revisiting this particular issue, it is clear to me that the original "solution" may have had more to do with migrating to the 2.6 kernel than solving the basic problem.

We are not using EITHER SIGUSR1 or SIGUSR2 at this time. They went away as the result of several alternative methods. Their appearance in the code snippets is an obsolete artifact (I think). So not seeing SIGHUP referenced in the current code gave me pause.

So I am re-reading everything I can find about select() and pselect() in the context of a re-emergence of a mysterious application crash for which we have no other explanation. When these crashes happen, we have no evidence of anything from our syslogs except a dead end at a specific time where our application logs simply stop. The next message I see is when the application is restarted, and I see messages related to initializations.

In ONE (and only one) instance, we saw a syslog message at one of these times that said "select error". Sadly I don't have that syslog anymore because the log file got rotated out before I could finish looking into this matter.

But that reference to a "select error" caused me to want to revisit this problem, and so I once again have very limited time to devote to a solution, due to downsizing and an impending bankruptcy, and once again am virtually alone in re-investigating this problem.

Because of these constraints, I am certain I don't understand this issue completely. That is why I am going back and reading what I can.

When I re-read the man page you suggested, it was helpful, but I would still like to just make certain my interpretation is correct.

I specifically read, back in 2006, that the default action for SIGHUP for POSIX-compliant Linux (like RedHat 9) was to terminate a process that did not specifically implement a SIGHUP handler.  TRUE  FALSE

I have a couple of other questions I will add to this comment later today. Out of time for this issue until later today.

I deeply appreciate your input. More later.

But briefly, I have been running a stress test of the application, and burying the damn thing with HUPs, USR1s and USR2s.
I notice that I get an occasional message that says
select: Interrupted system call

Not sure where it is coming from, but I can't help but think it is important!

Thanks again...


fklein23 (Author) Commented:
I DID find this, however, and have not had a chance to think about it.

This leads me to think that perhaps I really DO need to block SIGHUP during the pselect() call (by using sigprocmask())

One of my questions was:
By using sigprocmask(), I can in fact block the corresponding signal from executing ANYTHING during the pselect() call.

Later - Frank
Duncan Roe (Software Developer) Commented:
"The default action for SIGHUP for POSIX-compliant Linux (like RedHat 9) was to terminate a process that did not specifically implement a SIGHUP handler." TRUE, from man 7 signal. The default action is the third column (Term):
SIGHUP        1       Term    Hangup detected on controlling terminal or death of controlling process
By using sigprocmask(), I can in fact block the corresponding signal from executing ANYTHING during the pselect() call. TRUE

The effect of not blocking SIGHUP during a select() call is that you can get error EINTR returned. You can also get EINTR from any signal that is handled or doesn't cause termination so it's best to program for it by re-issuing the select()
  do
    retcod = select(nfds, &readfds, &writefds, &exceptfds, &timeout);
  while (retcod == -1 && errno == EINTR);
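A slightly fuller sketch of the same retry idea, for a single descriptor. The function name wait_readable is hypothetical. One assumption worth stating: on Linux the timeout becomes undefined after an error return, so this version rebuilds the descriptor set and the timeout on every pass. The simplification is that each retry waits the full timeout again.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Wait until fd is readable or timeout_sec elapses, retrying on EINTR.
 * Returns select()'s result: 1 if readable, 0 on timeout, -1 on a
 * real error (errno set to something other than EINTR). */
int wait_readable(int fd, long timeout_sec)
{
    int retcod;

    do {
        fd_set readfds;
        struct timeval timeout;

        /* Rebuild both arguments each time: select() may modify them. */
        FD_ZERO(&readfds);
        FD_SET(fd, &readfds);
        timeout.tv_sec = timeout_sec;
        timeout.tv_usec = 0;

        retcod = select(fd + 1, &readfds, NULL, NULL, &timeout);
    } while (retcod == -1 && errno == EINTR);

    return retcod;
}
```

With a loop like this, a "select: Interrupted system call" message should no longer be reported as an error; only other errno values reach the normal error path.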

Duncan Roe (Software Developer) Commented:
Had a quick look at the link you posted. They suggest looking for "the" signal that causes EINTR but, as above, I think that's a mistake. After my code above, you would of course do normal error processing if any other error was raised. It's just that EINTR isn't really an error in the normal sense of the word.
fklein23 (Author) Commented:
We have a new theory.
We started shutting down unnecessary cron scripts in the cron.daily directory.
Several are resource intensive, and we made the scripts for several non-executable so cron will ignore them.
I didn't eliminate logwatch, because that is necessary. Instead I put a "sleep(300)" in the Perl script for logwatch.
After doing this, one of the processes shut down the very next day. The others were unaffected.
This means the elimination of cron tasks was not the smoking gun.
Last week, I noticed my xterm client was a bit sluggish, so I ran htop and noticed that the hal-daemon was using 62% of available memory. We only have 500 MB of RAM, so that seems problematic.
We are trying another experiment: turn hald and its helper services off completely and watch for a while.
Duncan Roe (Software Developer) Commented:
All I know about HAL is that if you don't have it, various odd things stop working. man hald points to the full documentation. Hald clocking high CPU suggests to me a hardware problem. The man page instructs you how to debug hald (which you might like to try)
fklein23 (Author) Commented:
hald is not clocking high CPU, it is hogging memory. Many descriptions of this same problem pepper the web. The symptom that others have experienced is that when hald is first started, it uses almost no memory. Then, over time, it gradually uses more and more. As far as I know, hald observes things like device changes (as when USB devices are inserted and removed) and provides you with the names of the devices, which can be read from the syslog. In our system, we operate remotely and cannot physically insert or remove any devices at all.

Furthermore, EVERY one of our 20 Linux computers in the plant has exhibited this problem, so it is very unlikely to be a hardware problem. Somehow, 4 of the computers were accidentally configured without hald running, and NONE of those computers has experienced the midnight crash during log file rotation. All the ones running hald have had that problem. The memory problem has reached about 60% of memory assigned to hal and its helpers!!!

All the advice I have read on the Internet has led me to conclude that, since we are not using any of these machines as multi-user computers on a regular basis, never add or remove devices, and do not run X-Windows on them, hald is not essential. We only have 500 MB of RAM on each of the machines, so giving up 10 to 60% of memory to a questionably essential service seems risky. I have concluded that it isn't worth any more research: since the machines that do not run hald are the ones behaving best of the whole set, turning hal off seems a reasonable strategy. I had considered just restarting haldaemon once per week with a cron script, but this seems a safer approach.

Now I can focus on the other problem: the issue of running out of sockets and having the select server stop being able to communicate with GUIs. I am thinking the reason is that there are timers used in our select server that are supposed to be in a one-to-one correspondence with the bits in the bitset tracking the sockets as they come and go, and that these timers are being mis-managed. Still haven't proved it, but I have definitely tracked down evidence that just before each failure, there is a log message telling us that a socket timed out. So Monday morning, I am going to get down to trying to recreate this problem in a controlled, accessible lab environment and put a SIGUSR1 handler (currently that one is not being used) that will artificially time out a socket, and then I can trace the resultant behavior.

If you'd like to stay in the loop, stay tuned. I will post a solution when I find it, because this has been incredibly frustrating. So far 6 people have read through the socket server code for hours and can't find the flaw.  I think it is going to be something pretty subtle, and it would be nice to save anyone else from going through this ridiculous ordeal over such a simple thing as managing socket connections! I am hoping that what comes out of this will be a good example of how NOT to do a socket server!
Duncan Roe (Software Developer) Commented:
Look forward to it
fklein23 (Author) Commented:
I guess I have to do something, because a couple of "abandoned" questions are preventing me from posting more questions.

I am not intending to abandon these, it is just that no solution has been found.
I do not have the luxury of leaving this question open. I HAVE to find a solution.
It is possible that this problem may have been solved by a separate question posed by me and answered by user 'xterm', but our plant is down for several weeks and it could be many weeks before I know for sure, since the problem was so intermittent.
fklein23 (Author) Commented:
No solutions have emerged, and I can't post any additional questions as long as this one is considered "abandoned".
Duncan Roe (Software Developer) Commented:
This question started off concerning pselect() and has somehow wandered off to consider HAL.
To summarise an answer to the original Q:
Use select()
Write signal handlers for all likely troublesome signals. Follow my instructions in this post
fklein23 (Author) Commented:
This question spawned a different question which I think had a solution involving hal.
I already closed the spawned question.

But the original problem with our SIGHUP handler is still a concern, so this question is still open.

I liked your post from question Q_27656605, about signal handlers and think it contains good information.

Here is one comment that I think may be important:
If I have a signal handler (which is effectively an interrupt process) set a flag as a message to a foreground process, wouldn't it be better to have the signal handler post to a semaphore? Then the foreground process, instead of checking a flag, could simply test the semaphore by doing a "non-blocking wait" to see if the semaphore was posted. This would avoid any possibility of a race condition around the flag.

Thanks - Frank
Duncan Roe (Software Developer) Commented:
I think a semaphore would be overkill. I don't see any possible race condition here - at the hardware level you load the flag to the accumulator and skip if it is zero. In the cycle after you load it, the flag may have been set, but so what? - you'll see it next time around. Because you didn't attempt to write to the flag, there is no race.

If the flag was set there is still no race as long as you zeroise the flag before taking action (such as re-reading config). If the flag was set again after you zeroised it then you will see it next time through. A semaphore would be no different. (You don't care how often the flag was set before you take action and you wouldn't have used a counting semaphore).
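As a rough sketch of the flag approach described above (the names reload_requested, install_flag_handler and take_reload_request are hypothetical, and the "action" is left to the caller), the handler only sets the flag and the main loop clears it before acting:

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* sig_atomic_t can be read and written without being torn by a signal. */
static volatile sig_atomic_t reload_requested = 0;

static void on_signal(int signo)
{
    (void)signo;
    reload_requested = 1;        /* the handler does nothing else */
}

/* Route signo to the flag-setting handler. Returns 0 on success. */
int install_flag_handler(int signo)
{
    struct sigaction sa;

    sa.sa_handler = on_signal;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

/* Called from the main loop. Returns 1 if a request was pending.
 * The flag is cleared BEFORE the caller acts, so a signal arriving
 * during the action sets it again and is seen on the next pass. */
int take_reload_request(void)
{
    if (!reload_requested)
        return 0;
    reload_requested = 0;
    return 1;
}
```

This sketch assumes a single consumer of the flag. Several signals arriving before the check collapse into one request, which matches the point that it does not matter how often the flag was set.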

fklein23 (Author) Commented:
Actually, I am referring to this text from a different answer you posted and referred to as "this post", namely:

"What I'm saying here is that it would be ideal if you could have your signal handler simply set a flag (that a signal has been received), and that code which depends on the configuration should re-read it and clear the flag (when the flag is set)."

There are multiple respondents, in separate threads that manage sockets in our application, so if the signal handler behaves as you describe above, there is indeed a flag being set by the one signal handler and being cleared by the multiple threads that might call select(). Select() neither sets nor clears the flag you describe. It is the sigaction() that sets it and the application that clears it. Although a race condition is not likely, it is possible, and so I would actually be more comfortable with a semaphore.
fklein23 (Author) Commented:
Duncan, you are a patient and persistent person and I really do think that your suggestion will help me fix the original problem. So even though I haven't had time to verify it yet, I am going to consider the solution complete and if I run into problems with it, I will post a new question. It may be another month or more before I get back to this problem (reduction in force problems here have made my "to do" list almost irrelevant!)

So thanks for your help with this.