Solved

Linux Cluster Application not running

Posted on 2013-11-25
Last Modified: 2013-12-22
We have a computational cluster. The OS is RHEL 5.10 Server and the cluster software is ROCKS 4.3. The issue we are having is that NX Nastran jobs will no longer run. The details follow.

Basically, when a user submits a run request, the run just doesn't start. No error is thrown. Usually intermediate files are created as a run kicks off, but they aren't in this case. If you do a qstat, the job shows up but it just sits there. There is no indication on the license server (FlexLM) that a license was ever requested. These are analysis jobs, and they used to work fine.

Thanks for the help!
Question by:capperdog13
18 Comments
 
LVL 61

Expert Comment

by:gheist
Is FlexLM running on the same server?
Did you cnabge something so DNS times out?
Has something added network latency, so that FlexLM cannot respond within 1/100th of a second?
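
The DNS and latency questions above can be checked from a compute node with a few lines of Python. This is a minimal sketch, not a FlexLM tool: the host name and port passed in are whatever your site uses, and `budget_s` is an assumed time budget used only for comparison (it is not read from any FlexLM setting).

```python
import socket
import time


def time_call(func, *args):
    """Run func(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def check_license_server(host, port, budget_s=0.1):
    """Time name resolution and a plain TCP connect to host:port.

    budget_s is an assumption for illustration, not a value taken
    from the license manager's configuration.
    """
    # Step 1: how long does resolving the name take?
    addr, dns_s = time_call(socket.gethostbyname, host)
    # Step 2: how long does opening a TCP connection take (5 s cap)?
    sock, connect_s = time_call(socket.create_connection, (addr, port), 5.0)
    sock.close()
    return {
        "address": addr,
        "dns_s": dns_s,
        "connect_s": connect_s,
        "within_budget": (dns_s + connect_s) <= budget_s,
    }

# Example call (host name and port are hypothetical):
# print(check_license_server("license-server.example.com", 27000))
```

A slow `dns_s` with a fast `connect_s` would point at name resolution rather than the network path itself.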

Author Comment

by:capperdog13
I am looking into the answers to your questions. Will follow up ASAP.

Thanks for the response!

Author Comment

by:capperdog13
Forgot to ask: you typed "cnabge". Did you mean "change"?
 
LVL 61

Expert Comment

by:gheist
Yes, change.
Basically, FlexLM has short timeouts, so if the traffic goes through a router there is already a 50/50 chance that the client times out.
 

Author Comment

by:capperdog13
FlexLM does not run on the same server.
Nothing has changed that we know of that would cause a DNS timeout.
Our average ping time to the license server is 0.268 ms.

Does this get us headed in the right direction?
 
LVL 61

Expert Comment

by:gheist
Check the FlexLM server. The default timeout on the client is 0.1 s (100 ms).
 

Author Comment

by:capperdog13
There seems to be no attempt from the cluster to request a license from the FlexLM server. No request from the cluster frontend is recorded on the FlexLM server: no logs, nothing queued, no errors, etc.
 
LVL 61

Expert Comment

by:gheist
 

Author Comment

by:capperdog13
We can certainly try to increase the timeout on the FlexLM server, but on the cluster there are no errors being thrown.

That is, when the user submits a run request on the Linux cluster, the run just doesn't start. No error is thrown. Usually intermediate files are created as a run kicks off, but they aren't in this case. If you do a qstat, the job shows up but it just sits there.

Since this happens locally on the cluster and FlexLM never receives a request, do you still consider a timeout between the two to be the problem?

Author Comment

by:capperdog13
Additionally, this only started happening last week, and both machines are on our local network. Before that, the cluster ran these NX Nastran jobs and the FlexLM server provided the license without any problem.
 
LVL 61

Expert Comment

by:gheist
There are no errors or logs because licensing is a murky corner of IT...
Maybe the Windows licence server got an antivirus update...
 

Author Comment

by:capperdog13
I would say that could be an issue, but our FlexLM server hands out licenses for many applications, including NX Nastran, to other machines. If an antivirus update on the server had broken licensing for the Linux cluster, we would see the same problem on the other machines making requests to FlexLM.

I'm not sure what is happening on the cluster with NX Nastran, but it seems to me that is where the problem lies. Would you agree?
 
LVL 61

Expert Comment

by:gheist
I have had a similar issue with Windows clients over VPN...
Basically, the logs say nothing.
 

Author Comment

by:capperdog13
Was out for holiday. The Cluster is accessed locally from our network. VPN is not used.

Will try to track down the issue with Nastran on the cluster.

Thanks!
 
LVL 61

Expert Comment

by:gheist
Check whether the connection on the licence server remains in the TIME_WAIT state. That is a clear sign that the licence check was stopped halfway and the timeout needs to be raised...
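
One way to look for such connections is to filter saved `netstat -an` output for TIME_WAIT entries on the license port. The following is a minimal sketch under stated assumptions: it expects the common layout where the last three columns of a TCP row are local address, foreign address, and state, and the port number and sample rows are made up for illustration.

```python
def time_wait_connections(netstat_text, port):
    """Return (local, foreign) address pairs in TIME_WAIT on the given local port.

    Assumes each TCP row ends with: local address, foreign address, state.
    """
    matches = []
    for line in netstat_text.splitlines():
        fields = line.split()
        # Skip headers, blank lines, and non-TCP rows.
        if len(fields) < 4 or not fields[0].lower().startswith("tcp"):
            continue
        local, foreign, state = fields[-3], fields[-2], fields[-1]
        if state != "TIME_WAIT":
            continue
        # Compare the port after the last ':' in the local address.
        if local.rsplit(":", 1)[-1] == str(port):
            matches.append((local, foreign))
    return matches


# Illustrative sample only; addresses and port 27000 are invented.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:27000           0.0.0.0:*               LISTEN
tcp        0      0 10.0.0.5:27000          10.0.0.9:51234          TIME_WAIT
tcp        0      0 10.0.0.5:22             10.0.0.7:40022          ESTABLISHED
"""

if __name__ == "__main__":
    for local, foreign in time_wait_connections(SAMPLE, 27000):
        print(local, "<-", foreign)
```

An empty result would fit what was observed in this thread, where the license server saw no request at all.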
 

Author Comment

by:capperdog13
K. We may have an environment variable issue on the Cluster. Just found out this has been an issue in the past. Will run a Nastran job myself and see if it hits the FlexLM server. If it does not, which is the case for all others, I will set the variable in my account on the cluster and run again. If successful we have our solution.

Will check the time out as well once we are successful in hitting FlexLM. Many thanks!
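
The environment-variable check described above could be scripted roughly as follows. This is only a sketch: the thread does not say which variable was involved, so `LM_LICENSE_FILE` (the variable FlexLM clients commonly read) is an assumption, as is the colon-separated `port@host` format used on Unix-like systems.

```python
import os


def parse_license_servers(value):
    """Split a colon-separated value and keep only port@host entries.

    Entries without '@' (for example license file paths) are ignored.
    """
    servers = []
    for entry in value.split(":"):
        entry = entry.strip()
        if "@" not in entry:
            continue
        port, host = entry.split("@", 1)
        servers.append((port, host))
    return servers


def report_license_env(var_name="LM_LICENSE_FILE", environ=None):
    """Describe what the given variable points at, or say it is unset.

    var_name is an assumed default; pass the variable your site uses.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(var_name)
    if not value:
        return f"{var_name} is not set"
    servers = parse_license_servers(value)
    if not servers:
        return f"{var_name} is set but has no port@host entries"
    listed = ", ".join(f"{host} (port {port or 'default'})" for port, host in servers)
    return f"{var_name} points at: {listed}"


if __name__ == "__main__":
    print(report_license_env())
```

Running the same script both from a login shell and from inside a submitted job would show whether the two environments differ.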
 

Accepted Solution

by:capperdog13
The problem ended up being an expired license for the PBS Grid software running on the cluster. Jobs are running again. Thanks!
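
The symptom reported earlier in the thread (jobs that show up in qstat but just sit there) can be spotted by listing jobs still in the queued state. The sketch below parses saved `qstat` output; it assumes the usual default layout where the second-to-last column is the one-letter job state and `Q` means queued, and the sample rows are invented. It does not detect an expired scheduler license by itself; it only shows which jobs never started.

```python
def queued_jobs(qstat_text):
    """Return the IDs of jobs whose state column is 'Q' (queued).

    Assumes rows of the form: job id, name, user, time use, state, queue,
    so the state is the second-to-last whitespace-separated field.
    """
    jobs = []
    for line in qstat_text.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and the dashed separator row.
        if len(fields) < 3 or fields[0] == "Job" or fields[0].startswith("-"):
            continue
        if fields[-2] == "Q":
            jobs.append(fields[0])
    return jobs


# Illustrative sample only; job IDs, names, and queue are invented.
SAMPLE_QSTAT = """\
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
101.frontend      nastran_a        user1                    0 Q workq
102.frontend      nastran_b        user2                    0 Q workq
103.frontend      postproc         user3             00:01:12 R workq
"""

if __name__ == "__main__":
    print(queued_jobs(SAMPLE_QSTAT))
```

If every submitted job stays queued and none ever runs, the scheduler itself is worth checking before the application's license server.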

Author Closing Comment

by:capperdog13
The problem had nothing to do with FlexLM. It was an expired license for the PBS Grid software.
