Solved

Linux Cluster Application not running

Posted on 2013-11-25
Last Modified: 2013-12-22
We have a computational cluster. The OS is RHEL 5.10 Server and the cluster software is ROCKS 4.3. The issue is that NX Nastran jobs no longer run. Details of the problem follow.

When a user submits a run request, the job never starts. No error is thrown. Normally intermediate files are created as a run kicks off, but in this case they are not. If you run qstat, the job shows up but just sits there. There is no indication on the license server (FlexLM) that a license was ever requested. These are analysis jobs, and they used to work fine.

Thanks for the help!
Question by:capperdog13
 
LVL 61

Expert Comment

by:gheist
ID: 39676587
Is FlexLM running on the same server?
Did you cnabge something so DNS times out?
Has something added network latency, so that FlexLM cannot respond within 1/10th of a second?
 

Author Comment

by:capperdog13
ID: 39677808
I am looking into the answers to your questions. Will follow up ASAP.

Thanks for the response!
 

Author Comment

by:capperdog13
ID: 39678128
Forgot to ask you. You typed cnabge. Did you mean "change"?
 
LVL 61

Expert Comment

by:gheist
ID: 39678269
Yes, change.
Basically, FlexLM has short timeouts, so if the traffic goes through a router there is already a 50/50 chance that the client times out.
 

Author Comment

by:capperdog13
ID: 39678428
FlexLM does not run on the same server.
Nothing has changed that we know of that would cause a DNS timeout.
Our average ping time to the license server is 0.268 ms.

Does this get us headed in the right direction?
 
LVL 61

Expert Comment

by:gheist
ID: 39679340
Check the FlexLM server. The default timeout on the client is 0.1 s (100 ms).
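One way to test the timeout theory from the cluster side is to time a plain TCP connection to the license server and compare it with the 100 ms figure above. The following is a minimal Python sketch, not a FlexLM tool: the function names are made up for illustration, and the host and port you pass in are whatever your site uses.

```python
import socket
import time


def connect_time_ms(host, port, timeout=5.0):
    """Return how long it takes, in milliseconds, to open a TCP connection."""
    start = time.perf_counter()
    # create_connection resolves the host name as well, so slow DNS
    # lookups are included in the measured time.
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000.0


def exceeds_client_timeout(elapsed_ms, limit_ms=100.0):
    """Compare a measured time with the client timeout (100 ms assumed)."""
    return elapsed_ms > limit_ms
```

For example, `connect_time_ms("licsrv.example.com", 27000)` would time a connection to a hypothetical license server; both the host name and the port are assumptions, so substitute the values from your own license file. Running it a few times from a compute node and from the frontend shows whether either comes close to the limit.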
 

Author Comment

by:capperdog13
ID: 39681375
There seems to be no attempt from the cluster to request a license from the FlexLM server. The FlexLM server has no record of a request from the cluster frontend: no logs, nothing queued, no errors.
 
LVL 61

Expert Comment

by:gheist
ID: 39681513
 

Author Comment

by:capperdog13
ID: 39681530
We can certainly try to increase the time on the Flexlm server, but on the Cluster there are no errors being thrown.

That is, when the user issues a run request (on the Linux cluster), the job never starts. No error is thrown. Normally intermediate files are created as a run kicks off, but in this case they are not. If you run qstat, the job shows up but just sits there.

Since this happens locally on the cluster and FlexLM never receives a request, do you still consider a timeout between the two to be the problem?

 

Author Comment

by:capperdog13
ID: 39681563
Additionally, this only started happening last week, and both machines are on our local network. Before that, there was no problem with the cluster running these NX Nastran jobs or with the FlexLM server providing the license.
 
LVL 61

Expert Comment

by:gheist
ID: 39681875
There are no errors or logs because licensing is a murky corner of IT...
Maybe the Windows license server got an antivirus update...
 

Author Comment

by:capperdog13
ID: 39682103
I would say that could be an issue, but our Flexlm server hands out licenses for many applications including NX Nastran to other machines... If a virus update to the server has caused the issue with licensing to the Linux Cluster we would be seeing the same type of problem across the board to other machines making requests to Flexlm...

Not sure what is happening on the cluster with NX Nastran, but it seems to me that is where the problem lies. Would you agree?
 
LVL 61

Expert Comment

by:gheist
ID: 39682921
I have had a similar issue with Windows clients over VPN...
Basically, the logs say nothing.
 

Author Comment

by:capperdog13
ID: 39689980
Was out for holiday. The Cluster is accessed locally from our network. VPN is not used.

Will try to track down the issue with Nastran on the cluster.

Thanks!
 
LVL 61

Expert Comment

by:gheist
ID: 39690062
Check whether the connection on the license server remains in the TIME_WAIT state. That is a clear sign that the license check was stopped halfway and the timeout needs to be raised...
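The TIME_WAIT check can be done by eye in `netstat -an` output on the license server, or filtered with a few lines of Python. This is a rough sketch that assumes the usual netstat layout, where the state is the last column and addresses are written as `address:port`; the function name and the port number in the example are illustrative only.

```python
def time_wait_connections(netstat_output, port):
    """Return the netstat lines in TIME_WAIT state that involve a TCP port."""
    suffix = ":" + str(port)
    matches = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip blank lines, headers, and connections in any other state.
        if not fields or fields[-1] != "TIME_WAIT":
            continue
        # Keep the line if the local or remote address uses the port.
        if any(field.endswith(suffix) for field in fields[:-1]):
            matches.append(line.strip())
    return matches
```

Save the output of `netstat -an` to a file right after a job is submitted, read it in, and call `time_wait_connections(text, 27000)`, replacing 27000 (an assumed port) with the port your license daemon listens on. An empty result fits the observation that the cluster never contacts the license server at all.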
 

Author Comment

by:capperdog13
ID: 39690085
K. We may have an environment variable issue on the Cluster. Just found out this has been an issue in the past. Will run a Nastran job myself and see if it hits the FlexLM server. If it does not, which is the case for all others, I will set the variable in my account on the cluster and run again. If successful we have our solution.
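The thread does not say which environment variable is meant. For FlexLM clients the generic one is `LM_LICENSE_FILE`, which holds `port@host` entries and license file paths, so a quick check of what a user account actually sees could look like the sketch below. Treat the variable name as an assumption, since an application may read a vendor-specific variable instead; the function name is made up for illustration.

```python
import os


def license_servers(env=None, name="LM_LICENSE_FILE"):
    """Return (host, port) pairs found in a FlexLM-style license variable.

    LM_LICENSE_FILE is assumed here; the application may use another name.
    """
    env = os.environ if env is None else env
    value = env.get(name, "")
    servers = []
    # Entries are separated like PATH entries (":" on Linux).
    for entry in value.split(os.pathsep):
        if "@" in entry:
            port, host = entry.split("@", 1)
            servers.append((host, port))
    return servers
```

Calling `license_servers()` in the account that submits the job, and again inside a job script on a compute node, shows whether the variable is set in both places. An empty list means no `port@host` entry is visible to that environment.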

Will check the time out as well once we are successful in hitting FlexLM. Many thanks!
 

Accepted Solution

by:capperdog13
ID: 39724170
The problem ended up being an expired license for the PBS Grid SW running on the cluster. Jobs are running again. Thanks!
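Because the fault was in the scheduler, the visible symptom was jobs that stayed queued in qstat. A small Python sketch for picking those out of saved qstat output is shown here. It assumes the default PBS-style listing, where the job state is the second-to-last column and `Q` means queued; the function name is hypothetical and other schedulers or qstat options format things differently.

```python
def queued_jobs(qstat_output):
    """Return the IDs of jobs whose state column is Q (queued)."""
    jobs = []
    for line in qstat_output.splitlines():
        fields = line.split()
        # Skip short lines, the "Job id ..." header, and the dashed rule.
        if len(fields) < 5 or fields[0].lower() == "job":
            continue
        if set(line.strip()) <= set("- "):
            continue
        # Assumed layout: ... Time-Use, State, Queue as the last columns.
        if fields[-2] == "Q":
            jobs.append(fields[0])
    return jobs
```

If every submitted job stays in this list and none ever moves to a running state, the scheduler itself (including its license, as it turned out here) is worth checking before the application license server.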
 

Author Closing Comment

by:capperdog13
ID: 39734442
The problem had nothing to do with FlexLM. It was an expired license for the PBS Grid SW.