  • Status: Solved
  • Priority: Medium
  • Security: Public

Linux Cluster Application not running

We have a computational cluster. The OS is RHEL 5.10 Server and the cluster software is ROCKS 4.3. The issue we are having is that NX Nastran jobs will not run anymore. The details of the problem follow.

Basically, when a user issues a run request, the job just doesn't run. No error is thrown. Usually intermediate files are created as a run kicks off, but they aren't in this case. If you do a qstat, the job shows up but it just sits there. There is no indication on the license server (FlexLM) that a license was ever requested. These are analysis jobs. It used to work fine.

Thanks for the help!
Asked by: capperdog13
1 Solution
 
gheistCommented:
Is flexlm running on same server?
Did you cnabge something so DNS times out?
Has something added network latency, so that FlexLM cannot respond within 1/100th of a second?
 
capperdog13Author Commented:
I am looking into the answers to your questions. Will follow up ASAP.

Thanks for the response!
 
capperdog13Author Commented:
Forgot to ask you. You typed cnabge. Did you mean "change"?
 
gheistCommented:
Yes, change.
Basically, FlexLM has short timeouts, so if the traffic goes through a router there is already a 50/50 chance that the client times out.
 
capperdog13Author Commented:
FlexLM does not run on the same server.
Nothing has changed that we know of that would cause a DNS timeout.
Our average ping time to the license server is 0.268 ms.

Does this get us headed in the right direction?
 
gheistCommented:
Check the FlexLM server. The default timeout on the client is 0.1 s (100 ms).
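
One way to test the timeout theory from a cluster node is to time a plain TCP connection to the license server against the same 0.1 s budget. This is only a rough sketch: the function name tcp_connect_time is made up for illustration, the host and port in the usage comment are placeholders, and a successful TCP connect does not prove that a full license checkout would finish in time.

```python
import socket
import time


def tcp_connect_time(host, port, timeout=0.1):
    """Return the seconds taken to open a TCP connection, or None on failure or timeout."""
    start = time.monotonic()
    try:
        # create_connection raises OSError (including timeouts and DNS failures).
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None


# Example call (hypothetical host and port, take the real ones from your license file):
#   elapsed = tcp_connect_time("license-server.example.com", 27000)
#   print("no connection within 0.1 s" if elapsed is None else f"{elapsed * 1000:.1f} ms")
```

If this returns None while ping looks healthy, a firewall or a name-resolution delay is a more likely suspect than raw network latency.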
 
capperdog13Author Commented:
There seems to be no attempt from the cluster to request a license from the FlexLM server. No request from the cluster frontend is recorded on the FlexLM server: no logs, nothing queued, no errors, etc.
 
gheistCommented:
0
 
capperdog13Author Commented:
We can certainly try to increase the timeout on the FlexLM server, but on the cluster there are no errors being thrown.

I.e., when the user issues a run request (on the Linux cluster), the job just doesn't run. No error is thrown. Usually intermediate files are created as a run kicks off, but they aren't in this case. If you do a qstat, the job shows up but it just sits there.

Since this is happening locally on the cluster and FlexLM never receives a request, do you still consider a timeout between the two to be the problem?
 
capperdog13Author Commented:
Additionally, this just started happening last week, and these two machines are both on our local network. Before that, there was no problem with the cluster running these NX Nastran jobs and the FlexLM server providing the license.
 
gheistCommented:
There are no errors or logs because licensing is a murky corner of IT...
Maybe the Windows license server got an antivirus update...
 
capperdog13Author Commented:
I would say that could be an issue, but our FlexLM server hands out licenses for many applications, including NX Nastran, to other machines. If an antivirus update on the server had caused the licensing issue on the Linux cluster, we would be seeing the same type of problem across the board on the other machines making requests to FlexLM.

Not sure what is happening on the cluster with NX Nastran, but it seems to me that is where the problem lies. Would you agree?
 
gheistCommented:
I have had a similar issue with Windows clients over VPN...
Basically, the logs say nothing.
 
capperdog13Author Commented:
Was out for the holiday. The cluster is accessed locally from our network. VPN is not used.

Will try to track down the issue with Nastran on the cluster.

Thanks!
 
gheistCommented:
Check whether the connection on the license server remains in the TIME_WAIT state. That is a clear sign that the license check was stopped halfway and the timeout needs to be raised...
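
The TIME_WAIT check can be done by eye, but on a busy server a small script helps. Below is a minimal sketch that counts TCP states for one port from saved netstat output. It assumes the Linux `netstat -tan` column layout; Windows netstat prints different columns and would need different parsing. The function name tcp_states_for_port, the addresses, and port 27000 in the sample are all illustrative, not taken from this setup.

```python
from collections import Counter

# Hypothetical sample in the Linux `netstat -tan` layout.
SAMPLE = """\
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 10.0.0.5:27000          10.0.0.21:51514         TIME_WAIT
tcp        0      0 10.0.0.5:27000          10.0.0.22:40022         ESTABLISHED
tcp        0      0 10.0.0.5:22             10.0.0.9:53311          ESTABLISHED
"""


def tcp_states_for_port(netstat_text, port):
    """Count TCP states for connections whose local or remote address uses the port."""
    suffix = f":{port}"
    states = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Columns assumed: Proto Recv-Q Send-Q Local Foreign State
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, remote, state = fields[3], fields[4], fields[5]
        if local.endswith(suffix) or remote.endswith(suffix):
            states[state] += 1
    return states


if __name__ == "__main__":
    print(dict(tcp_states_for_port(SAMPLE, 27000)))
```

To use it on real data, save the netstat output to a file and pass the file contents in place of SAMPLE.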
 
capperdog13Author Commented:
OK. We may have an environment variable issue on the cluster. Just found out this has been an issue in the past. I will run a Nastran job myself and see if it hits the FlexLM server. If it does not, which is the case for everyone else, I will set the variable in my account on the cluster and run again. If that succeeds, we have our solution.

Will check the timeout as well once we are successful in hitting FlexLM. Many thanks!
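
For the environment variable check, here is a rough sketch of how the value could be inspected before submitting a job. The thread does not say which variable was involved; LM_LICENSE_FILE is the generic FlexLM variable and is used here purely as an assumption, and parse_license_servers is a made-up helper name. Entries are separated by ":" on Linux (";" on Windows) and are either port@host or a path to a license file.

```python
import os


def parse_license_servers(value, sep=":"):
    """Split an LM_LICENSE_FILE-style value into (port, host) pairs and file paths."""
    servers, files = [], []
    for entry in value.split(sep):
        entry = entry.strip()
        if not entry:
            continue
        if "@" in entry:
            port, host = entry.split("@", 1)
            # "@host" with no port is allowed; record the port as None in that case.
            servers.append((int(port) if port else None, host))
        else:
            files.append(entry)
    return servers, files


if __name__ == "__main__":
    # Assumption: the job reads the generic FlexLM variable.
    value = os.environ.get("LM_LICENSE_FILE", "")
    print(parse_license_servers(value) if value else "LM_LICENSE_FILE is not set")
```

Running this inside the job script, rather than in an interactive shell, shows what the batch environment actually sees, which can differ from the login environment.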
 
capperdog13Author Commented:
The problem ended up being an expired license for the PBS grid software running on the cluster. Jobs are running again. Thanks!
 
capperdog13Author Commented:
The problem had nothing to do with FlexLM. It was an expired license for the PBS grid software.
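
Since the symptom was jobs that showed up in qstat but never started, a quick scan for jobs stuck in the queued state may help spot this sooner next time. The sketch below assumes the classic default PBS qstat table (job id, name, user, time use, state, queue); the thread never shows the actual output, so the sample and the helper name queued_jobs are hypothetical.

```python
# Hypothetical sample in the default PBS `qstat` table layout.
SAMPLE = """\
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
101.frontend      nastran_a        user1             00:12:41 R workq
102.frontend      nastran_b        user2                    0 Q workq
"""


def queued_jobs(qstat_text):
    """Return (job_id, queue) for every job whose state column is Q (queued)."""
    jobs = []
    for line in qstat_text.splitlines():
        fields = line.split()
        # A job row has at least six fields; state and queue are the last two.
        if len(fields) >= 6 and fields[-2] == "Q":
            jobs.append((fields[0], fields[-1]))
    return jobs


if __name__ == "__main__":
    for job_id, queue in queued_jobs(SAMPLE):
        print(f"{job_id} is still queued in {queue}")
```

If every job stays in Q and none ever reaches R, the scheduler itself (including its own license) is worth checking before the application's license server.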
