RHEL 5.6 VMs on vSphere 4 going read-only and not allowing logins

Hello, we have several vSphere VMs running RHEL 5.6 that run fine, then all of a sudden the file system goes read-only and/or won't allow logins of any type until the VM is rebooted.  We have upgraded the BIOS so far with no effect, and are wondering if anyone has heard of this happening and has any suggestion of a root cause.  Could it be a RHEL/VMware combination issue, or something related to either side alone?  Thank you for your help!
emjay180 (VMware Engineer) asked:

IanTh commented:
Is the file system local, or iSCSI/NFS?
Why aren't you running 4.1 U2?
simonseztech commented:
Usually when RHEL falls into read-only mode, it's because it lost access to its storage for a certain amount of time. If you use shared storage (iSCSI or NFS), try using two NICs to reach your storage (multipathing for iSCSI; for NFS, two NICs active/active with IP hash load balancing). If you have local storage available, try moving that VM to it to rule out a network timeout.
emjay180 (VMware Engineer, Author) commented:
Yes, it's EMC VNX storage, and they may be vSphere 4.1 U2 boxes; I just put 4 because I know it's not 5.

Does the degraded storage connection cause the login lockout for all users too?  It always comes back when the server is rebooted, but it happens every day.

Is there any way to adjust the threshold so it doesn't trigger read-only mode so easily?

Thank you for your help!
simonseztech commented:
Please see this KB from VMware:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=51306

If your RHEL VM lost its storage, the KB won't help you. CentOS and RHEL remount read-only as a means of preserving the integrity of the OS. As per the KB, you can try the following command:
mount -o remount /
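
If the remount appears to succeed, it's worth verifying that it actually took; a quick check (a minimal sketch, nothing distribution-specific assumed) is:

grep ' / ' /proc/mounts

If the root entry still shows ro among its options, the block device itself has been flagged read-only and a reboot with fsck is the only way back.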


Look at /var/log/vmkernel on the CLI of the ESX host for issues.

Do you have a failover path?
simonseztech commented:
Are you using NFS or iSCSI?

If you're using iSCSI, check the errors in your vmkernel log and follow the suggestions here:
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1030381
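
For example, something along these lines will pull recent SCSI-related entries from the host log (a rough sketch; tune the pattern to the errors you actually see):

grep -i scsi /var/log/vmkernel | tail -n 50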
Danny McDaniel (Clinical Systems Analyst) commented:
You may be suffering this because of poor storage performance, too.  If there is a long wait for storage, it will look to the Linux machine like a loss of, or disconnect from, storage, and it will then mark the drive(s) read-only to prevent corruption.  Look at your storage performance monitors.
Mysidia commented:
" As per KB you can try the following command :
mount -o remount /"
That won't actually work if CentOS has remounted the device read-only due to errors detected;  the media will be marked read-only.   Anyways, there is no safe hitless remount: you need to reboot, and go through the filesystem checks.

Please run the "dmesg" command after it happens and capture a sample of the output, including the earliest error and the text immediately before it, to show _which_ condition is resulting in the remount.
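
A quick way to capture that, for example (the grep pattern here is just a guess at the relevant keywords; adjust as needed):

dmesg > /tmp/dmesg-after-remount.txt
grep -iE 'ext3|i/o error|scsi|remount' /tmp/dmesg-after-remount.txt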

Here's the main thing I have to say about it:
This is caused by EITHER (1) a filesystem error (independent of the hardware), or (2) a hardware or software error somewhere on the path between your VM and the array's disk spindles, causing critical I/O operations either to time out or to fail with an unrecoverable error.

If it's caused by a corrupt filesystem, e.g. a sudden ext3 fault, the thing to do is force a full fsck, e.g.  shutdown -r -F now
In the worst case, a re-install with a restore of data from backup might be required to put an end to repeated ext3 faults, if there is no SCSI I/O command error associated with the remounts.
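
For reference, two equivalent ways to force the full check on the next boot (assuming RHEL 5's stock init scripts, which honor the /forcefsck flag file):

shutdown -r -F now          # -F forces fsck on reboot
touch /forcefsck && reboot  # same effect via the flag file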


Also, it's possible to have BOTH problems, and in fact (2) can cause (1): failures of your storage system, 'loss of connectivity', or 'intermittent connectivity' can sometimes lead to filesystem damage, causing recurring problems.

If it's just happening to one relatively idle VM, then corruption would be the first thing I'd suspect.  Any ext3 fault may cause a remount or panic, not just unrecoverable SCSI errors.

You have the option of editing /etc/fstab and changing the mount options from defaults to one of:
  defaults,errors=continue
  defaults,errors=panic
  defaults,errors=remount-ro
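
As a concrete illustration, a root entry using errors=panic might look like this (the device path here is hypothetical; keep whatever device your fstab already names):

/dev/VolGroup00/LogVol00  /  ext3  defaults,errors=panic  1 1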

Panic will force a kernel panic, which should lead to a reboot, assuming you have an entry in /etc/sysctl.conf that says something like
  kernel.panic = 300
or something in /etc/rc.d/rc.local that says
  echo 300 > /proc/sys/kernel/panic
  (i.e. "reboot 5 minutes after a panic")
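
To apply the sysctl.conf entry without waiting for a reboot (standard sysctl usage):

sysctl -p            # reload /etc/sysctl.conf
sysctl kernel.panic  # confirm the new value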

remount-ro is the default.

'continue' essentially means "just ignore the errors and continue as if the failed operations succeeded".  It is strongly discouraged, because it may result in serious data loss / filesystem damage: your write operations are failing, and this can cause immediate corruption of system data.  remount-ro and panic are basically your reasonable options.

Next,
(1) Make sure you are not experiencing a storage array outage.  An array crash or a failover event may very well cause this; you may want to contact your vendor's support.

A hard drive, array, or other component may be failing and not performing as it should; check that all arrays are healthy with no HDD failures.

If it's happening during a controller failover, I would say the failover is not working correctly, if it occurs and takes so long that your VMs encounter significant delays or unrecoverable read/write events.  Check the status of the storage; it's the prime suspect.

In case the problem is caused by failover delay, my top recommendation (other than resolving it with the storage array vendor's support) is:
    (a)  Make sure the SCSI controller type for your virtual machines is Paravirtual SCSI (PVSCSI).  (You will need VMware Tools installed and the PV driver included in your initrd.)  Deploy all new VMs with the PVSCSI driver instead of the LSI driver.

     The LSI driver has in the past had bugs, or lacked sufficient tolerance for failover delays or other access delays that are more common in a SAN/NAS environment.  Consider that the stock LSI driver supports hardware normally used in a direct-attach environment, whereas the PVSCSI driver was written from scratch specifically for VMs, where NAS/SAN and other I/O delays are more likely and do not necessarily indicate a hardware component failure.

    (b)  Increase the command timeout.  I place the following in /etc/rc.d/rc.local on CentOS 5:

# Raise the SCSI command timeout to 180 seconds for every sd* device
for dev in /sys/block/sd*; do
    echo 180 > "$dev/device/timeout"
done
     Note, however, that this timeout only applies to certain operations (such as seeks), not read/write ops.
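
You can confirm the setting took effect with:

cat /sys/block/sd*/device/timeout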

(2) Make sure the storage network design is correct.  Unless your hosts are using 10-gigabit networking with Network I/O Control on a dvSwitch, iSCSI/NFS storage networking must be isolated from virtual machine networking: do not allow a VM to send or receive traffic over the same LAN that is used for storage access.  Make sure at least one pair of NICs is dedicated to storage networking, that both NICs work, and that they are configured for teaming, either active/active or failover (active/active is preferred).  If costs allow, three NICs are preferred, with at least one pair of dedicated storage Ethernet switches connected together and the ESX hosts' storage NICs spread across the redundant storage switches, so you have a "tie breaker".  Three or more NICs also let you choose beacon probing as the failure-detection mechanism for your storage port group, which can detect failures that simple 'carrier status' failure detection will not find.


The entire storage LAN should exist on a separate VLAN or on separate physical infrastructure with its own switches (separate physical infrastructure is very strongly recommended), with dedicated same-subnet VMkernel ports on the ESX hosts for access.  If there is a network bottleneck, it might result in loss of ESX(i) host access to storage, which can cause very serious problems, this one specifically.
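
If you want to sanity-check the current layout from an ESX 4.x host's console, the classic esxcfg tools cover it (standard commands on classic ESX; exact output varies by build):

esxcfg-nics -l      # physical NICs, link state, speed
esxcfg-vswitch -l   # vSwitches, port groups, uplinks
esxcfg-vmknic -l    # VMkernel ports, including storage interfaces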

(3) Make sure your storage network is not failing.  I would recommend using SNMP-based graphing tools to monitor the amount of traffic crossing the storage network devices, switch port error counters, discards, etc.

(4) Make sure your storage sizing is adequate.  For example, make sure you have a sufficient number of disk spindles for the number of read and write requests against each array, and sufficient bandwidth on the disk systems and network to service the number of requests.

This is hard.  You might use a tool such as Veeam Monitor Free Edition (or another edition) to graph and alert on the latency of your virtual disks and datastores, or other monitoring tools such as VOps Storage Monitor or vCenter Ops Manager.

You can also check this through vCenter:

Click a virtual machine in the inventory, go to the Performance tab, and pick advanced mode.  Check the graphs for Datastore and Virtual Disk.

You will want to check LATENCY to make sure your storage is performing adequately.  You will have to click 'Chart Options' under the Performance tab to see which advanced counters are available.


You can only display two kinds of units in the same graph, so it's a little cumbersome, but you want to look under Datastore > Real-time, then pick Chart Options and look into the performance of these counters (you won't be able to put all of them on the same graph):
      Write Latency
      Read Latency
      Average Read Requests per Second
      Average Write Requests per Second
      Read Rate, Write Rate

Now go to the virtual machine and check the counters for
    Read Latency
    Write Latency
    Average Reads/second
    Average Writes/second


If you see average read latency or write latency consistently exceeding 30 ms, then you have a performance issue.

If it is going over 100 ms, then I would say you have a critical sizing problem: either too much load in terms of the number of read/write operations per second, or queued operations (i.e. the IOPS being requested exceed the IOPS rate the storage can sustain for your kind of workload), or too much load in terms of bandwidth (i.e. your total storage throughput is too close to the bandwidth capacity present).
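
If you prefer the host console to the vCenter graphs, esxtop shows the same latencies live (standard esxtop views on ESX 4.x):

esxtop   # then press 'u' for the disk-device view;
         # DAVG/cmd = array latency, KAVG/cmd = kernel latency,
         # GAVG/cmd = total guest-visible latency (ms)

The same 30 ms / 100 ms rules of thumb above apply.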

emjay180 (VMware Engineer, Author) commented:
Thank you for all the comprehensive ideas!  I'll get to work trying to shake something out from the processes you've suggested.
Luciano Patrão (ICT Senior Infrastructure Engineer) commented:
Hi,

Besides adding a second NIC on the iSCSI network (and using port binding), if it is iSCSI, or a second FC card to the host, to prevent these issues, most of the time a simple restart of the VM fixes the problem in Linux.

But you need to detail your VMware vs. storage environment more for us to understand it and help you prevent this in the best way.

With VMware vs. storage it is always a good option to have everything highly available: double cards, double switches (FC or LAN), and connections should always be balanced between cards and switches (card 1 port 1 to switch 1, card 1 port 2 to switch 2).

Also try to check the logs from the EMC; you should see some disconnections and reconnections in them.

Some screenshots of the vSwitch, networking, and storage adapter configurations would help.

Hope this can help,

Jail
emjay180 (VMware Engineer, Author) commented:
Hello, the answer seems to have been upgrading VMware Tools to the latest version (VMware Tools 8.6.5 build-731933).  It has been operating well, without dropping into read-only mode, for several days now.
emjay180 (VMware Engineer, Author) commented:
The final actual solution was to upgrade VMware Tools, but this was the best troubleshooting overview.