SIGALRM lost?

I'm developing a multithreaded C++ app on a 2.4.19 Linux system. Occasionally (say 1% of the time), the process gets stuck and I don't understand why.

I have 2 threads. The master thread checks an event list every minute; when an event's start time approaches, it processes the event and removes it from the event queue. The worker thread monitors a message queue and reloads the event list when requested to do so.

After much debugging, I saw on one of our systems that alarm( 60 ) was called but sigwait() didn't catch SIGALRM. How could this have happened?
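//------------------ main.cpp ----------------------
#include <pthread.h>
#include <signal.h>
#include <time.h>
#include <unistd.h>
// Schedule, MsgQueue and Debug are project classes (headers not shown)

void* MsgHandlerFunc( void* pData ); // defined below

pthread_mutex_t DataLock = PTHREAD_MUTEX_INITIALIZER;
Schedule MySchedule;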
 
void ProcessSchedule()
{
    // run the next event once its start time has arrived
    if ( MySchedule.nextEvent().startTime() <= time( NULL ))
    {
        MySchedule.nextEvent().run(); // launch a child process
        MySchedule.pop(); 
    }
 
    // seconds until the next event is due, clamped to [1, 60]
    int waitAgain = static_cast<int>( MySchedule.top().startTime() - time( NULL ));
    if ( waitAgain > 60 ) waitAgain = 60;
    if ( waitAgain < 1 )  waitAgain = 1;
 
    alarm( waitAgain ); // check schedule again in at most 1 min
}
 
 
int main()
{
    // ---- block all signals ---
    sigset_t allSignals;
    sigfillset( &allSignals );
    sigdelset( &allSignals, SIGCHLD );
    pthread_sigmask( SIG_BLOCK, &allSignals, NULL );
 
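    struct sigaction ignore;
    ignore.sa_handler = SIG_IGN; 
    sigemptyset( &ignore.sa_mask ); // sa_mask and sa_flags must be initialized too
    ignore.sa_flags = 0;
    sigaction( SIGCHLD, &ignore, NULL );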
 
    // --- launch message queue listener thread ---
    pthread_t threadId;
    pthread_create( &threadId, NULL, MsgHandlerFunc, NULL );
 
    while( true )
    {
        int signalCaught = 0;
        sigwait( &allSignals, &signalCaught );
 
        if ( signalCaught == SIGALRM ) 
        {
            if ( pthread_mutex_lock( &DataLock ) == 0 )
            {
                ProcessSchedule();
                pthread_mutex_unlock( &DataLock );
            }
        }
        else if ( signalCaught == SIGUSR1 ) // for debug only
        {
            if ( pthread_mutex_lock( &DataLock ) == 0 )
            {
                MySchedule.Dump(); // MySchedule is an object, not a class name
                pthread_mutex_unlock( &DataLock );
            }
        }
        else // unexpected signals
        {
            Debug::Out( "caught signal %d, exit application", signalCaught );
            break;
        }
    }
 
    return 0;
}
 
 
 
// ------------------- Msg queue handler thread ---------------------
void* MsgHandlerFunc( void* pData )
{
    MsgQueue myMsgQueue = MsgQueue( MY_KEY );
 
    while( true ) 
    {
        // receive message
        if ( myMsgQueue.receiveMsg() == MsgQueue::MSG_RELOAD_SCHEDULE )
        {
            if ( pthread_mutex_lock( &DataLock ) == 0 )
            {
                MySchedule.reloadFromFile();                
                ProcessSchedule();
                pthread_mutex_unlock( &DataLock );
            }
            myMsgQueue.sendMsg( MsgQueue::MSG_OK );
        }
    }
 
    pthread_exit( NULL );
    return NULL;
}

LogicInnovations Asked:

Infinity08 Commented:
The first thing I think of is that signals don't queue, i.e. if a second SIGALRM is sent before the first is handled, there will still be only one SIGALRM pending, not two.
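To illustrate the point about standard signals not queuing, here is a minimal sketch. It is not taken from the code in the question, and the function name `secondAlarmStillPending` is made up for illustration. It blocks SIGALRM, raises it twice, consumes one instance with sigwait(), and then checks whether another instance is still pending.

```cpp
#include <cassert>
#include <pthread.h>
#include <signal.h>

// Hypothetical helper: raises SIGALRM twice while it is blocked, consumes
// it once with sigwait(), and reports whether a second instance is still
// pending afterwards.
bool secondAlarmStillPending()
{
    sigset_t alarmOnly;
    sigemptyset( &alarmOnly );
    sigaddset( &alarmOnly, SIGALRM );
    pthread_sigmask( SIG_BLOCK, &alarmOnly, NULL );

    // both signals are generated before anything consumes them
    raise( SIGALRM );
    raise( SIGALRM );

    int caught = 0;
    sigwait( &alarmOnly, &caught ); // takes the pending SIGALRM
    assert( caught == SIGALRM );

    sigset_t pending;
    sigemptyset( &pending );
    sigpending( &pending );
    return sigismember( &pending, SIGALRM ) == 1;
}
```

On Linux, where SIGALRM is a standard (non-real-time) signal, I'd expect this to return false: the two sends collapse into a single pending signal. Note that the sketch leaves SIGALRM blocked in the calling thread.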

Would that explain your problem?

LogicInnovations (Author) Commented:
Theoretically speaking it's possible, but it's hard to believe my first thread gets starved for over a minute.

LogicInnovations (Author) Commented:
Actually, even a single SIGALRM is enough to keep the app going, so no, that doesn't explain my problem.

Infinity08 Commented:
Sorry for the delay - this question somehow disappeared off my radar.

Ok, I had a look over your code (after my initial shot in the dark), and I notice that you don't block the signals in the MsgHandlerFunc thread (you only block the signals in the main thread).

You need to make sure that the SIGALRM (and other signals you want to handle with sigwait) are blocked in every thread, except the main thread with the sigwait, or the signal might be delivered to the wrong thread.

Infinity08 Commented:
>> You need to make sure that the SIGALRM (and other signals you want to handle with sigwait) are blocked in every thread, except the main thread with the sigwait

That was an unfortunate way of saying it. Of course the signals have to be blocked in the main thread too, but they will be handled in that thread by sigwait.

So, to avoid confusion: block the signals in all threads, and have the sigwait in your main thread, which will then be the only thread that gets the signals.

LogicInnovations (Author) Commented:
Actually, according to the POSIX standard, a thread inherits the signal mask of the thread that created it. So as long as a signal is blocked first thing in main, it will be blocked in every thread.
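A small sketch of that inheritance rule, assuming a POSIX-conforming thread library. The names `inheritedMaskWorker` and `waitForWorkerSignal` are hypothetical and not part of the code in the question. The creating thread blocks SIGUSR1 before calling pthread_create(); the worker then checks its own mask and sends a process-directed SIGUSR1, which the creating thread picks up with sigwait().

```cpp
#include <cassert>
#include <pthread.h>
#include <signal.h>
#include <unistd.h>

// Hypothetical worker: records whether SIGUSR1 is blocked in the mask it
// inherited, then sends a process-directed SIGUSR1.
static void* inheritedMaskWorker( void* pData )
{
    bool* blockedInWorker = static_cast<bool*>( pData );

    sigset_t current;
    sigemptyset( &current );
    pthread_sigmask( SIG_BLOCK, NULL, &current ); // NULL set: query only
    *blockedInWorker = ( sigismember( &current, SIGUSR1 ) == 1 );

    kill( getpid(), SIGUSR1 );
    return NULL;
}

// Blocks SIGUSR1, starts the worker, and waits for the signal in the
// calling thread. Returns the signal number that sigwait() reported.
int waitForWorkerSignal( bool& blockedInWorker )
{
    sigset_t usr1Only;
    sigemptyset( &usr1Only );
    sigaddset( &usr1Only, SIGUSR1 );
    pthread_sigmask( SIG_BLOCK, &usr1Only, NULL ); // before pthread_create

    pthread_t threadId;
    int rc = pthread_create( &threadId, NULL, inheritedMaskWorker,
                             &blockedInWorker );
    assert( rc == 0 );

    int caught = 0;
    sigwait( &usr1Only, &caught );
    pthread_join( threadId, NULL ); // worker's write is visible after join
    return caught;
}
```

With a conforming implementation the worker should see SIGUSR1 as blocked, and the only thread that consumes the signal is the one sitting in sigwait(). If the mask were not inherited, the kill() in the worker would hit the default action instead.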

Infinity08 Commented:
You're right. So, that doesn't explain the problem either.

Maybe I'm misunderstanding your problem.

>> Occasionally( say 1% of the time ), the process got stuck and I don't understand why.

Could you give a bit more detail? Under what conditions does it block? Does it "unblock" again? After how much time? How does this behavior manifest itself to the user?

LogicInnovations (Author) Commented:
It turned out that the pthread library on the system is LinuxThreads, not NPTL.
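The thread doesn't spell out the mechanism, so take this as one plausible reading: LinuxThreads implements each thread as a separate kernel process, and the pthreads(7) manual page lists several resulting deviations from POSIX, among them that interval timers are not shared between threads and that signals are handled per thread rather than per process. An alarm() armed from the message queue thread could then deliver its SIGALRM to that thread, where it is blocked and nobody calls sigwait(). For anyone wanting to check which library a program is running against, here is a hedged sketch using glibc's confstr(). `threadLibraryVersion` is a made-up name, and `_CS_GNU_LIBPTHREAD_VERSION` is a GNU extension that only exists on reasonably recent glibc, so the code falls back to "unknown" elsewhere.

```cpp
#include <cassert>
#include <string>
#include <vector>
#include <unistd.h>

// Hypothetical helper: returns the threading library as reported by glibc
// (typically a string starting with "NPTL" or "linuxthreads"), or
// "unknown" if the C library does not provide this information.
std::string threadLibraryVersion()
{
#ifdef _CS_GNU_LIBPTHREAD_VERSION
    // first call with a NULL buffer asks for the required size
    size_t needed = confstr( _CS_GNU_LIBPTHREAD_VERSION, NULL, 0 );
    if ( needed == 0 )
        return "unknown";

    std::vector<char> buffer( needed );
    size_t written = confstr( _CS_GNU_LIBPTHREAD_VERSION, &buffer[0],
                              buffer.size() );
    assert( written <= buffer.size() );
    return std::string( &buffer[0] );
#else
    return "unknown";
#endif
}
```

From a shell, `getconf GNU_LIBPTHREAD_VERSION` should report the same information on systems that support it.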