Lotfi BOUCHERIT

centos 7.0 runs correctly but then disconnects

Hello,
We have CentOS 7.0 installed on an old HP G6 server (which does not have iLO!) that hosts a very important system.
The server hangs twice a day, and we have to restart it to get it working again.
How can we find the root cause of these hangs? Are there any logs or events we can consult to find the errors?
Thank you in advance,
Linux* CentOS

David Favor

One common debugging pattern for this type of problem:

1) In one window run...

inotifywait -mrq --format '%e %w%f' -e access -e modify --exclude '\.(bak|sw[xp])$' /var/log > ~/fs.log 2>&1

2) Now in another window run this command.

while : ; do date && sleep 1 ; done

3) Then wait...

At some point the machine will die.

Hopefully the while loop will show the exact time of machine death... then...

The ~/fs.log file will show last few log files touched, which will help narrow down where to look for the problem.
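
A rough sketch for reading that file after the reboot, assuming each line has the '%e %w%f' shape from step 1 (event name, one space, then the path). The function name last_touched is made up for illustration, and tac assumes GNU coreutils, which CentOS ships:

```shell
# Print the most recently touched distinct paths from an inotifywait log,
# newest first. Hypothetical helper, not part of inotify-tools.
last_touched() {
    # $1 = inotifywait log file, $2 = number of distinct paths to show (default 5)
    local log="$1" count="${2:-5}"
    # Read the log backwards, drop the leading event field,
    # and keep only the first (newest) occurrence of each path.
    tac "$log" | awk '{ sub(/^[^ ]+ /, ""); if (!seen[$0]++) print }' | head -n "$count"
}

# Example call once the machine is back up:
#   last_touched ~/fs.log 5
```

Whatever shows up at the top of that list is only a hint about which log to open first, not proof of the cause.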

4) Another likely command to run will be this...

while : ; do (echo "#####" && date && ps auxwwf) >> ~/ps.log 2>&1 ; sleep 1 ; done

This will correlate process table entries with machine death time.

For example, if someone has made the mistake of calling out to some 3rd party service (like to a CRM to record an opt-in), if the 3rd party API is slower than your machine, or 3rd party API just goes down periodically, then your machine will start piling up processes... likely PHP processes... which never end...

Once you have 100s-1000s+ of processes that never end, machine death will occur.
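
To spot that kind of pile-up without reading the whole process table, one possible sketch is to count processes per command name. The helper name top_commands is hypothetical; it only wraps standard sort, uniq, sed and head:

```shell
# Count how many processes share each command name, busiest first.
# Reads one command name per line on stdin. Hypothetical helper.
top_commands() {
    # $1 = how many rows to print (default 10)
    sort | uniq -c | sort -rn | sed 's/^ *//' | head -n "${1:-10}"
}

# Example call on a live system (ps from procps):
#   ps -eo comm= | top_commands 10
```

If one name (say a PHP worker) keeps climbing between samples, that is the process to look at.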

5) More data to capture is swap data, which is simple to capture...

while : ; do (echo "#####" && date && top -b -n 1 && sleep 1) >> ~/top.log 2>&1 ; done
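
If only the swap trend matters, a lighter sketch is to read the SwapTotal and SwapFree fields of /proc/meminfo instead of a full top snapshot. The function name swap_used_kb and the ~/swap.log path are assumptions for illustration:

```shell
# Print swap in use, in kB, from a meminfo-style file. Hypothetical helper.
swap_used_kb() {
    # $1 = file to read (defaults to /proc/meminfo)
    awk '/^SwapTotal:/ { total = $2 }
         /^SwapFree:/  { free = $2 }
         END           { print total - free }' "${1:-/proc/meminfo}"
}

# Example logging loop, one "epoch-seconds used-kB" line per second:
#   while : ; do echo "$(date +%s) $(swap_used_kb)" >> ~/swap.log ; sleep 1 ; done
```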

If you see swap usage trend up and max out your swap space, the OOM Killer (Out Of Memory Killer) will trigger and start killing processes (chosen by a per-process badness score, not at random) to try to keep the machine alive.

This usually ends badly, as some important process always seems to get killed off.
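
To check whether the OOM Killer actually fired before a hang, a minimal sketch is to search the syslog file for its messages. This assumes the default CentOS 7 rsyslog setup, where kernel messages land in /var/log/messages; the exact wording of the kernel lines varies between kernel versions, so the pattern is deliberately loose. The name oom_events is hypothetical:

```shell
# Print log lines that look like OOM Killer activity. Hypothetical helper.
oom_events() {
    # $1 = log file to search (defaults to /var/log/messages)
    grep -Ei 'out of memory|oom-killer' "${1:-/var/log/messages}"
}

# Example call after a reboot (needs root to read the file):
#   oom_events /var/log/messages
```

No matches does not rule out memory pressure; the machine may have frozen before anything could be written.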
David Favor

Aside: If you get completely stumped, just hire someone to help with this.

Generally a long term Linux Savant will be able to... sense... "disturbances in the Force"... so they'll be able to check many things quickly, to have a good guess about problems.

The primary starting point is how the machine feels upon login: how responsive ssh is, down to the key repeat speed.

It's difficult to explain, but if you've been working on Linux for decades, across various distro versions, that sensation points to where to start debugging.
David Johnson

Are you sure there is no iLO on the server? My HP DL380 SFF G6s have iLO.
Lotfi BOUCHERIT

ASKER

@David Johnson,
It does have an iLO port, but nothing is plugged into it.
arnold

"Not plugged" meaning no Ethernet cable is connected?

It might be simpler to ask what the system is running, monitor its memory utilization, and restart the service to release the memory.

Does the freeze occur around the same time each day?

Do you have an SNMP-type polling monitor collecting data that could help?
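
Without a monitoring system, one way to answer the "same time?" question is a heartbeat file: append the epoch time every second, then look for gaps after each reboot. This is only a sketch; the ~/heartbeat.log path and the find_gaps name are assumptions, and date -d "@N" assumes GNU date:

```shell
# Heartbeat writer, left running in the background:
#   while : ; do date +%s >> ~/heartbeat.log ; sleep 1 ; done

# Print "last-seen next-seen" epoch pairs wherever the heartbeat stopped
# for longer than the threshold. Hypothetical helper.
find_gaps() {
    # $1 = heartbeat file (one epoch timestamp per line)
    # $2 = minimum gap in seconds to report (default 60)
    awk -v min="${2:-60}" 'NR > 1 && $1 - prev > min { print prev, $1 }
                           { prev = $1 }' "$1"
}

# Example: show the wall-clock time of the last heartbeat before each hang
#   find_gaps ~/heartbeat.log 60 | while read -r last _ ; do date -d "@$last" '+%F %T' ; done
```

If the last-seen times cluster around the same hours, a scheduled job or backup becomes a prime suspect.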
Scott Silva

There is also the possibility that the hardware is starting to give up...
Or it just needs some TLC by cleaning out dust... Reseating all memory and cards...
Old servers need a bit of work at least once in a while...
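
Before opening the case, it may be worth checking whether the kernel has already logged hardware complaints. A loose sketch, assuming kernel messages are in /var/log/messages (the CentOS 7 default); the pattern is a guess at common wording and will miss some errors and match some harmless lines. The name hw_errors is hypothetical:

```shell
# Print log lines that hint at failing hardware. Hypothetical helper.
hw_errors() {
    # $1 = log file to search (defaults to /var/log/messages)
    grep -Ei 'hardware error|machine check|I/O error|EDAC' "${1:-/var/log/messages}"
}

# Example call (needs root to read the file):
#   hw_errors /var/log/messages
```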
Lotfi BOUCHERIT

ASKER

Hello,
Thank you all for your replies,
This morning, we found the error message below displayed.
After searching the internet, I found that it is related to a storage RAID controller malfunction.
For information, this server has a 10 TB raw partition that is mapped as an iSCSI device on another machine. VMware vCenter Converter does not seem to see this partition, which we need so that we can at least P2V this machine (the actual data, with thin provisioning, is only about 80 GB). Could you please tell us how we can do this?
Thank you in advance,
User generated image
arnold

Check first: a similar message sometimes appears when the CD/DVD-ROM drive is scanned.

A message about the absence of media in a removable drive will not lead to a system halt.

Confirm which device sd 0:1:0:0 points to.
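
One way to do that lookup, sketched under the assumption that the kernel exposes the device under /sys/class/scsi_device/<H:C:T:L>/device/block/, as Linux sysfs normally does. The function name scsi_block_dev and its optional second argument (an alternate sysfs root, handy for testing) are made up for illustration:

```shell
# Map a SCSI address (host:channel:target:lun) to its block device name.
# Hypothetical helper built on the sysfs layout described above.
scsi_block_dev() {
    # $1 = SCSI address such as 0:1:0:0, $2 = sysfs root (defaults to /sys)
    local dir="${2:-/sys}/class/scsi_device/$1/device"
    if [ ! -d "$dir/block" ]; then
        echo "no block device found for $1" >&2
        return 1
    fi
    ls "$dir/block"    # kernel name, e.g. sda for a disk or sr0 for an optical drive
}

# Example call:
#   scsi_block_dev 0:1:0:0
```

If the answer is an sr device, the message is about the optical drive; if it is an sd device, it is a disk or a RAID logical volume and deserves a closer look.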
arnold

As to your second question,
This is an external resource not managed by VMware. What do you mean by vCenter cannot see it?
1) Only one system should access an iSCSI target at a time.
Check the server presenting this 10 TB volume to see which system is authorized to access it. Authorization is commonly an IP-based restriction.