Solved

Read-only file system

Posted on 2011-02-25
Last Modified: 2012-06-27
My system (CentOS 5) has gone into read-only file system mode. I'm not sure what caused this. My guess is a hard drive fault, or perhaps I/O activity got too high.

According to SMART, the drive is fine.

I'm now trying to figure out how to fix this. Any help would be appreciated. Thanks!
Question by:Julian Matz
51 Comments
 
LVL 6

Assisted Solution

by:de2Zotjes
de2Zotjes earned 279 total points
ID: 34986300
If you have data you want to keep on the ro-filesystem: copy it to another disk!

After that: where is the filesystem that is now mounted read-only? If it is not the root fs, you can try switching to runlevel 1, unmounting it, and repairing it with fsck.
If it is the root fs that is read-only, you are going to have to reboot the system. My preference would be to boot the installation CD you used to install the system, choose repair mode, and run fsck on the root fs of your system.
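
To see which filesystems the kernel currently has mounted read-only, something like the following could be used. It is only a sketch: the function name list_ro_mounts is made up for illustration, and it assumes the usual /proc/mounts layout (device, mount point, type, options).

```shell
#!/bin/sh
# list_ro_mounts [FILE]: print "device mountpoint fstype" for every entry whose
# mount options contain the exact option "ro". FILE defaults to /proc/mounts.
# (Hypothetical helper, not part of any package.)
list_ro_mounts() {
    awk '{
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] == "ro") { print $1, $2, $3; break }
    }' "${1:-/proc/mounts}"
}

# Show the read-only mounts of the running system, if /proc is available.
if [ -r /proc/mounts ]; then
    list_ro_mounts
fi
```

Options such as errors=remount-ro are not matched, because only an option that is exactly "ro" counts.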
0
 
LVL 21

Author Comment

by:Julian Matz
ID: 34986524
It is the root fs. I tried earlier to remount by doing this:

$ mount -o remount /

However, this locked me out of my SSH session, and I wasn't able to re-connect. Since I no longer have access to the system, it's now out of my hands, unfortunately.

AFAIK, the data center staff are now waiting for fsck to complete, which usually takes up to 2 hours.

However, this seems to be a recurring issue. I mean this is the first time I saw the filesystem go into read-only, but I do have a feeling there's something wrong with the drive. Every time the server needs to be rebooted, it will force a file system check. And that has been several times in the past 2 months.

What would you recommend? Clone and swap the drive maybe? Or would cloning also clone potential errors (if it isn't hardware related, of course)? What other causes could there be besides hard drive failure?

Below is the output from smartctl.

Thanks for your help.

# smartctl -a /dev/sda
smartctl version 5.38 [i686-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Second Generation Serial ATA family
Device Model:     WDC WD3200AAKS-00VYA0
Serial Number:    WD-WCARW0979534
Firmware Version: 12.01B02
User Capacity:    320,072,933,376 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sat Feb 26 07:02:50 2011 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (8760) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 104) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x303f) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   201   155   021    Pre-fail  Always       -       2941
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       105
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000e   100   253   051    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       11159
 10 Spin_Retry_Count        0x0012   100   100   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       104
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       204
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       215
194 Temperature_Celsius     0x0022   107   089   000    Old_age   Always       -       40
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   051    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged


0
 
LVL 4

Assisted Solution

by:FastSi
FastSi earned 25 total points
ID: 34986591
CentOS ships a config file that can enable a read-only root file system, so it might be worth checking what it contains.

Could you paste the contents of /etc/sysconfig/readonly-root?
0
 
LVL 21

Author Comment

by:Julian Matz
ID: 34986636
Data center staff just informed me that the server is back online, and that the drive is healthy. I'm sure they are basing that on the same SMART results as above.
0
 
LVL 21

Author Comment

by:Julian Matz
ID: 34986641
[~]# cat /etc/sysconfig/readonly-root

# Set to 'yes' to mount the system filesystems read-only.
READONLY=no
# Set to 'yes' to mount various temporary state as either tmpfs
# or on the block device labelled RW_LABEL. Implied by READONLY
TEMPORARY_STATE=no
# Place to put a tmpfs for temporary scratch writable space
RW_MOUNT=/var/lib/stateless/writable
# Label on local filesystem which can be used for temporary scratch space
RW_LABEL=stateless-rw
# Label for partition with persistent data
STATE_LABEL=stateless-state
# Where to mount to the persistent data
STATE_MOUNT=/.snapshot
0
 
LVL 6

Assisted Solution

by:de2Zotjes
de2Zotjes earned 279 total points
ID: 34986773
I don't trust these SMART readings; there are a few too many neat 200 values.

A good test for your hard drive is to just dd all data to /dev/null (it is a copy, you don't lose anything :) ):
dd if=/dev/sda of=/dev/null bs=4096 &


You can get intermediate statistics by giving
kill -USR1 %1


Put that command in a loop with a sleep and you should get a fair idea of the state of the disk (read performance should not show any weird dips)
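
The loop could look roughly like the sketch below. The function name watch_dd and the 60-second default are assumptions for illustration; it relies on GNU dd, which prints its transfer statistics to stderr when it receives SIGUSR1.

```shell
#!/bin/sh
# watch_dd SOURCE [INTERVAL]: read SOURCE to /dev/null in the background and
# ask dd for statistics every INTERVAL seconds (default 60) until it finishes.
# Returns dd's exit status. (Hypothetical helper; assumes GNU dd.)
watch_dd() {
    src=$1
    interval=${2:-60}
    dd if="$src" of=/dev/null bs=4096 &
    dd_pid=$!
    # kill -0 only tests whether the process still exists.
    while kill -0 "$dd_pid" 2>/dev/null; do
        sleep "$interval"
        # GNU dd reports records in/out and throughput on USR1.
        kill -USR1 "$dd_pid" 2>/dev/null
    done
    wait "$dd_pid"
}

# Example (needs root, reads the whole disk):
#   watch_dd /dev/sda 60
```

A throughput figure that collapses between two reports, or a dd that exits with an I/O error, is the kind of dip to look for.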

0
 
LVL 1

Expert Comment

by:praveen_expert
ID: 34986868
Remount the drives to overcome this problem
0
 
LVL 7

Expert Comment

by:droyden
ID: 34987406
mount -o remount,rw /
0
 
LVL 21

Author Comment

by:Julian Matz
ID: 34989254
> Remount the drives
If you read my comment (34986524), you'll see I tried to remount, which caused SSH to crash.

Thanks, de2Zotjes. The first command you suggested - could running that have any negative side effect? There's about 250GB of data on the drive. I understand it's just copying, but would you recommend securing a backup first?

I have a secondary disk installed, but it will take me a while to complete a full backup.
0
 
LVL 21

Author Comment

by:Julian Matz
ID: 34989264
Oh yes, I forgot to ask: what exactly does the second command do and how would I monitor or determine the results of the first command?
0
 
LVL 7

Assisted Solution

by:droyden
droyden earned 146 total points
ID: 34989278
The first command reads every sector of the hard drive to check for read errors.
The second command sends the USR1 signal to the background dd process (job %1), which makes dd print how much data it has read so far and at what speed. That gives you an indication of progress.
0
 
LVL 21

Author Comment

by:Julian Matz
ID: 34994753
Cheers! I'll give it a try.

I actually just found the following in some older logs. It is from a date when my server seemingly crashed. Hopefully this helps determine what the problem is, or at least provides some more clues. Unfortunately, all I can really make of it is that there was some kind of I/O error and that the filesystem was remounted read-only.
Dec  3 04:03:10 kernel: ata1.00: exception Emask 0x10 SAct 0x3 SErr 0x10000 action 0xe frozen
Dec  3 04:03:10 kernel: ata1.00: irq_stat 0x00400000, PHY RDY changed
Dec  3 04:03:10 kernel: ata1: SError: { PHYRdyChg }
Dec  3 04:03:10 kernel: ata1.00: cmd 60/08:00:16:41:f6/00:00:01:00:00/40 tag 0 ncq 4096 in
Dec  3 04:03:10 kernel:          res 40/00:04:16:41:f6/00:00:01:00:00/40 Emask 0x10 (ATA bus error)
Dec  3 04:03:10 kernel: ata1.00: status: { DRDY }
Dec  3 04:03:10 kernel: ata1.00: cmd 60/08:08:a6:d3:3a/00:00:0b:00:00/40 tag 1 ncq 4096 in
Dec  3 04:03:10 kernel:          res 40/00:04:16:41:f6/00:00:01:00:00/40 Emask 0x10 (ATA bus error)
Dec  3 04:03:10 kernel: ata1.00: status: { DRDY }
Dec  3 04:03:10 kernel: ata1: hard resetting link
Dec  3 04:03:16 kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Dec  3 04:03:46 kernel: ata1.00: qc timeout (cmd 0xec)
Dec  3 04:03:46 kernel: ata1.00: failed to IDENTIFY (I/O error, err_mask=0x5)
Dec  3 04:03:51 kernel: ata1.00: revalidation failed (errno=-5)
Dec  3 04:03:51 kernel: ata1: failed to recover some devices, retrying in 5 secs
Dec  3 04:03:51 kernel: ata1: hard resetting link
Dec  3 04:03:51 kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Dec  3 04:03:51 kernel: ata1.00: configured for UDMA/133
Dec  3 04:03:51 kernel: ata1: EH complete
Dec  3 04:03:51 kernel: SCSI device sda: 625142448 512-byte hdwr sectors (320073 MB)
Dec  3 04:03:51 kernel: sda: Write Protect is off
Dec  3 04:03:51 kernel: SCSI device sda: drive cache: write back
Dec  3 04:04:47 kernel: ata1.00: exception Emask 0x10 SAct 0x3f SErr 0x10000 action 0xe frozen
Dec  3 04:04:47 kernel: ata1.00: irq_stat 0x00400000, PHY RDY changed
Dec  3 04:04:47 kernel: ata1: SError: { PHYRdyChg }
Dec  3 04:04:47 kernel: ata1.00: cmd 60/00:00:8e:4e:03/01:00:0c:00:00/40 tag 0 ncq 131072 in
Dec  3 04:04:47 kernel:          res 40/00:2c:66:e1:37/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
Dec  3 04:04:47 kernel: ata1.00: status: { DRDY }
Dec  3 04:04:47 kernel: ata1.00: cmd 60/28:08:d6:e0:37/00:00:08:00:00/40 tag 1 ncq 20480 in
Dec  3 04:04:47 kernel:          res 40/00:2c:66:e1:37/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
Dec  3 04:04:47 kernel: ata1.00: status: { DRDY }
Dec  3 04:04:47 kernel: ata1.00: cmd 60/00:10:8e:4d:03/01:00:0c:00:00/40 tag 2 ncq 131072 in
Dec  3 04:04:47 kernel:          res 40/00:2c:66:e1:37/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
Dec  3 04:04:47 kernel: ata1.00: status: { DRDY }
Dec  3 04:04:47 kernel: ata1.00: cmd 60/10:18:2e:e1:37/00:00:08:00:00/40 tag 3 ncq 8192 in
Dec  3 04:05:29 kernel:          res 40/00:2c:66:e1:37/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
Dec  3 04:05:29 kernel: ata1.00: status: { DRDY }
Dec  3 04:05:29 kernel: ata1.00: cmd 60/08:20:46:e1:37/00:00:08:00:00/40 tag 4 ncq 4096 in
Dec  3 04:05:29 kernel:          res 40/00:2c:66:e1:37/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
Dec  3 04:05:29 kernel: ata1.00: status: { DRDY }
Dec  3 04:05:29 kernel: ata1.00: cmd 60/50:28:66:e1:37/00:00:08:00:00/40 tag 5 ncq 40960 in
Dec  3 04:05:29 kernel:          res 40/00:2c:66:e1:37/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
Dec  3 04:05:29 kernel: ata1.00: status: { DRDY }
Dec  3 04:05:29 kernel: ata1: hard resetting link
Dec  3 04:05:29 kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Dec  3 04:05:29 kernel: ata1.00: qc timeout (cmd 0xec)
Dec  3 04:05:29 kernel: ata1.00: failed to IDENTIFY (I/O error, err_mask=0x5)
Dec  3 04:05:29 kernel: ata1.00: revalidation failed (errno=-5)
Dec  3 04:05:29 kernel: ata1: failed to recover some devices, retrying in 5 secs
Dec  3 04:05:29 kernel: ata1: hard resetting link
Dec  3 04:05:29 kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Dec  3 04:05:29 kernel: ata1.00: configured for UDMA/133
Dec  3 04:05:29 kernel: ata1: EH complete
Dec  3 04:05:29 kernel: SCSI device sda: 625142448 512-byte hdwr sectors (320073 MB)
Dec  3 04:05:29 kernel: sda: Write Protect is off
Dec  3 04:05:29 kernel: SCSI device sda: drive cache: write back
Dec  3 04:05:51 kernel: EXT3-fs error (device sda2): ext3_lookup: unlinked inode 1612 in dir #2
Dec  3 04:05:51 kernel: Aborting journal on device sda2.
Dec  3 04:05:51 kernel: EXT3-fs error (device sda2): ext3_lookup: unlinked inode 1612 in dir #2
Dec  3 04:05:51 last message repeated 3 times
Dec  3 04:05:51 kernel: ext3_abort called.
Dec  3 04:05:51 kernel: EXT3-fs error (device sda2): ext3_journal_start_sb: Detected aborted journal
Dec  3 04:05:51 kernel: Remounting filesystem read-only
Dec  3 17:35:18 syslogd 1.4.1: restart.
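
For a quick overview of how often these events occur, the log can be summarized with a small script. This is a sketch only: the function name summarize_disk_errors is invented here, the patterns are taken from the messages in the excerpt above, and /var/log/messages is assumed to be where the kernel messages end up.

```shell
#!/bin/sh
# summarize_disk_errors [LOGFILE]: count the disk-related kernel messages seen
# in the excerpt above. LOGFILE defaults to /var/log/messages.
# (Hypothetical helper; patterns copied from the log lines in this thread.)
summarize_disk_errors() {
    awk '
        /hard resetting link/             { resets++ }
        /ATA bus error/                   { bus++ }
        /EXT3-fs error/                   { fs++ }
        /Remounting filesystem read-only/ { ro++ }
        END {
            printf "link_resets=%d bus_errors=%d ext3_errors=%d ro_remounts=%d\n",
                   resets, bus, fs, ro
        }' "${1:-/var/log/messages}"
}

# Example:
#   summarize_disk_errors /var/log/messages
```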


0
 
LVL 7

Expert Comment

by:droyden
ID: 34994776
Can you attach? The drive might actually be failing after all.
0
 
LVL 7

Expert Comment

by:droyden
ID: 34994785
Ah sorry on phone, that looks like a hw fault. Maybe controller, can you run diagnostics on it at all?
0
 
LVL 6

Assisted Solution

by:mohansahu
mohansahu earned 50 total points
ID: 34995901
Hi,

Run fsck command-line utility on the failed Linux hard drive. It detects and repairs minor file system errors.

http://www.cyberciti.biz/tips/linux-filesytem-goes-read-only.html
http://www.cyberciti.biz/tips/linux-find-out-if-harddisk-failing.html
(OR)
mount -o remount,rw X / where X is the filesystem
0
 
LVL 21

Author Comment

by:Julian Matz
ID: 35000418
I have full root access but no physical access to the system. On 23/01/2011, I had another similar issue, after which the motherboard was swapped. However, on the 3rd of February, the same messages were logged as the ones above from the 3rd of December. Attached is some info about my SATA controller and motherboard.
*-core
       description: Motherboard
       product: PDSBM
       vendor: Supermicro
       physical id: 0
       version: PCB Version
       serial: 0123456789
     *-firmware
          description: BIOS
          vendor: Phoenix Technologies LTD
          physical id: 0
          version: 6.00
          date: 12/18/2007
          size: 107KiB
          capacity: 960KiB
          capabilities: isa pci pnp upgrade shadowing escd cdboot bootselect socketedrom edd int13floppy1200 int13floppy720 int13floppy2880 int5printscreen int9keyboard int14serial int17printer int10video usb smartbattery biosbootspecification

description: SATA controller
             product: N10/ICH7 Family SATA AHCI Controller
             vendor: Intel Corporation
             physical id: 1f.2
             bus info: pci@0000:00:1f.2
             logical name: scsi0
             logical name: scsi2
             version: 01
             width: 32 bits
             clock: 66MHz
             capabilities: storage msi pm ahci_1.0 bus_master cap_list emulated
             configuration: driver=ahci latency=0
             resources: irq:233 ioport:30f0(size=8) ioport:30e4(size=4) ioport:30e8(size=8) ioport:30e0(size=4) ioport:30b0(size=16) memory:d0500400-d05007ff


0
 
LVL 21

Author Comment

by:Julian Matz
ID: 35138831
Sorry, I know this has been open for a while, but I'm still trying to figure this out... Just a shot in the dark: is there any possibility this could be caused by a "virus"?

By virus I mean trojan, backdoor, malicious script, malicious executable/binary, etc. etc.
0
 
LVL 7

Expert Comment

by:droyden
ID: 35138960
Anything hdd related in dmesg?
0
 
LVL 21

Author Comment

by:Julian Matz
ID: 35139234
Yes, but nothing relevant from what I can tell. Can I attach the output for you to take a look also?

The reason I asked about the virus is because I looked through my mail account that receives root mail, etc. and I noticed that every time my server has crashed, one of the last (if not the last) mails I receive from the server is a suspicious file alert from lfd. I looked back, and it's always the same message (i.e. the same "suspicious" file):

Time:   Tue Mar 15 04:06:12 2011 +0000
File:   /tmp/bds
Reason: Linux Binary
Owner:  xxx:xxx(534:537)
Action: No action taken

After I identified this, I submitted that file to a virus scanner, and it looks like it's some kind of backdoor shell. I don't know what it does because it appears to be a binary file. Here are the scan results:
http://goo.gl/dqQfV

0
 
LVL 7

Expert Comment

by:droyden
ID: 35139326
Yes that looks like a backdoor, what is the ownership on the file?
0
 
LVL 21

Author Comment

by:Julian Matz
ID: 35139378
It's one of the non-privileged users on the server who has a Joomla website running on it. The server's running suPHP and suexec, and I'm sure that file ended up there through some vulnerability on his website.
0
 
LVL 21

Author Comment

by:Julian Matz
ID: 35139534
I doubt it's related actually. Probably just a coincidence since lfd seems to send these alerts every hour.
0
 
LVL 7

Expert Comment

by:droyden
ID: 35139602
If the user does not have shell access, then his Joomla install probably has a security hole in it. There have been quite a few over the last few years.
0
 
LVL 21

Author Comment

by:Julian Matz
ID: 35139785
Yes, I agree. The probability of that is pretty close to 100%. Joomla sites are constantly giving me trouble. I'm the only one with shell access.

I noticed those hardware errors almost always occur at about 4:05 AM..
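
To check whether the errors really cluster around one time of day, the link resets can be counted per hour. The sketch below assumes the classic syslog timestamp format shown in the logs above (month, day, HH:MM:SS); the function name errors_by_hour is made up for illustration. On a stock CentOS 5 install, /etc/crontab typically starts the cron.daily jobs at 04:02, so heavy nightly I/O is one thing worth ruling out.

```shell
#!/bin/sh
# errors_by_hour [LOGFILE]: print "HOUR COUNT" for every hour of the day in
# which a "hard resetting link" message was logged, sorted by hour.
# LOGFILE defaults to /var/log/messages. (Hypothetical helper.)
errors_by_hour() {
    awk '/hard resetting link/ {
             split($3, t, ":")      # $3 is the HH:MM:SS field
             n[t[1]]++
         }
         END { for (h in n) print h, n[h] }' "${1:-/var/log/messages}" | sort
}

# Example:
#   errors_by_hour /var/log/messages
#   grep -v '^#' /etc/crontab     # see what is scheduled around that time
```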
0
 
LVL 21

Author Comment

by:Julian Matz
ID: 35140083
I tried running this command, by the way:

dd if=/dev/sda of=/dev/null bs=4096 &

... but it was causing a very high load, so I didn't want to keep it running for too long.
0

 
LVL 21

Author Comment

by:Julian Matz
ID: 35140204
This has just happened again, about 15 minutes after I ran the dd command:

Mar 15 16:50:25 kernel: ata1: failed to recover some devices, retrying in 5 secs
Mar 15 16:50:25 kernel: ata1: hard resetting link
Mar 15 16:50:25 kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Mar 15 16:50:25 kernel: ata1.00: configured for UDMA/133
Mar 15 16:50:25 kernel: ata1: EH complete
Mar 15 16:50:25 kernel: SCSI device sda: 625142448 512-byte hdwr sectors (320073 MB)
Mar 15 16:50:25 kernel: sda: Write Protect is off
Mar 15 16:50:25 kernel: SCSI device sda: drive cache: write back
Mar 15 16:51:48 kernel: EXT3-fs error (device sda5): ext3_free_blocks_sb: bit already cleared for block 27944845



I tried to remount, but again I got locked out of SSH, and server is unresponsive!
0
 
LVL 7

Assisted Solution

by:droyden
droyden earned 146 total points
ID: 35141013
Aha, that looks like sda is on its way out. Does it pass a SMART test? It could also be cabling, although if it's a proper server I doubt that.
0
 
LVL 21

Author Comment

by:Julian Matz
ID: 35141149
I was told the cables were thoroughly checked and all OK. SMART test passed. I attached the results above, in comment ID 34986524.

Those are the results from about 2-3 weeks ago. I don't have access to any more recent results right now as I cannot access the server while it's currently running fsck.
0
 
LVL 7

Assisted Solution

by:droyden
droyden earned 146 total points
ID: 35141376
Try running a SMART self-test to make sure:

smartctl -t short /dev/sda

That starts a short one (~2 min). You can check the log with: smartctl -l selftest /dev/sda
Then try a long one:

smartctl -t long /dev/sda

Use the same log command to check the status. You could also check the temperature of the drive with:
hddtemp /dev/sda
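
If the self-test log should be checked from a script rather than by eye, a filter along these lines might do. It is a sketch under assumptions: the function name selftest_ok is invented, and it expects log lines in the format that smartctl -l selftest prints (entries starting with "# 1", "# 2", and so on).

```shell
#!/bin/sh
# selftest_ok: read the output of "smartctl -l selftest DEVICE" on stdin.
# Exit 0 only if at least one test is logged and every logged test reads
# "Completed without error"; a test still in progress counts as not OK.
# (Hypothetical helper, not part of smartmontools.)
selftest_ok() {
    awk '
        /^# *[0-9]+ / {
            total++
            if ($0 !~ /Completed without error/) bad++
        }
        END { exit (total > 0 && bad == 0) ? 0 : 1 }
    '
}

# Example (needs root):
#   smartctl -l selftest /dev/sda | selftest_ok && echo "self-tests clean"
```

A clean self-test log does not rule out a failing drive or controller; it only shows that the drive's own tests found nothing.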

0
 
LVL 21

Author Comment

by:Julian Matz
ID: 35141531
Attached are the results of SMART test I just performed.

This is what I just received from the support tech:

"Your server is now online. The FSCK has ended correcting many file
system errors. Please confirm that everything is working properly from
your side."

So, apparently, there were a lot of errors. Yet SMART says drive is healthy. What else could be causing the errors?

If the most likely reason is a failing drive, regardless of SMART, I'll just have to try a drive clone.
# smartctl -a /dev/sda
smartctl version 5.38 [i686-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Second Generation Serial ATA family
Device Model:     WDC WD3200AAKS-00VYA0
Serial Number:    WD-WCARW0979534
Firmware Version: 12.01B02
User Capacity:    320,072,933,376 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Tue Mar 15 19:19:40 2011 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (8760) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 104) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x303f) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   200   155   021    Pre-fail  Always       -       2958
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       123
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000e   200   200   051    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       11567
 10 Spin_Retry_Count        0x0012   100   100   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       122
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       222
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       233
194 Temperature_Celsius     0x0022   105   089   000    Old_age   Always       -       42
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   051    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


0
 
LVL 21

Author Comment

by:Julian Matz
ID: 35141581
Short test
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     11568         -


0
 
LVL 21

Author Comment

by:Julian Matz
ID: 35142615
Long and short test

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     11569         -
# 2  Short offline       Completed without error       00%     11568         -


0
 
LVL 7

Expert Comment

by:droyden
ID: 35142744
Hmm, it definitely shows nothing bad. Can you check hddtemp? If it's not the drive failing and the cables are OK, I'm not sure :(
0
 
LVL 21

Author Comment

by:Julian Matz
ID: 35142799
hddtemp /dev/sda
/dev/sda: WDC WD3200AAKS-00VYA0: 41°C



That is after 3 hours uptime plus ~2 hour fsck ... so, about 5 hours running.
0
 
LVL 7

Expert Comment

by:droyden
ID: 35142947
41°C is quite high; mine are always between 23 and 26°C. Is this in a server room? If so, that could be the cause. Server rooms should be no more than ~20°C.

/dev/sda: WDC WD2500BEVT-00ZCT0: 23°C
/dev/sdb: SAMSUNG HD203WI: 24°C
/dev/sdc: ST32000542AS: 26°C
/dev/sdd: SAMSUNG HD203WI: 25°C
Those are my drives at home, to give you some comparison. No special cooling etc.
0
 
LVL 21

Author Comment

by:Julian Matz
ID: 35143001
Yes, the server is located in a professional datacenter, in a server room. I mean I have no visual confirmation of that or anything but to the best of my knowledge ... :)

23 - 26 °C seems extremely cool to me. I thought ~40°C was about the average running temperature of a HDD. Out of interest I will test the drives in my other servers (at different datacenters).

By the way, I have two drives in this specific server. The other one (/dev/sdb) is around the same, 40°C.
0
 
LVL 21

Author Comment

by:Julian Matz
ID: 35143100
They are all pretty much the same to be honest...

bravo:~#  hddtemp /dev/hda
/dev/hda: WDC WD800JB-00JJC0: 39°C

alpha:~# hddtemp -w /dev/sda
/dev/sda: WDC WD800JD-00LSA0: 48 C

alpha:~# hddtemp -w /dev/sdb
/dev/sdb: WDC WD800JD-60JRA0: 50 C

And my last server uses raid ... and I'm not sure how to check that.
0
 
LVL 21

Author Comment

by:Julian Matz
ID: 35143114
The second drive above has clocked 70342 hours at that temperature :) Well, I don't know that for sure, but it has clocked that many hours, according to SMART.
0
 
LVL 6

Accepted Solution

by:de2Zotjes
de2Zotjes earned 279 total points
ID: 35145042
I would swap that HD for a new one. In my experience, when the dd command I advised you to run causes such a high load that you lose the machine, it is a clear indication that the disk is about to die. As for SMART not reporting any suspicious figures, that is quite common. I have seen it happen on numerous occasions.
0
 
LVL 21

Author Comment

by:Julian Matz
ID: 35148774
Thanks, I think that is what I will go with. One last question then... What would be my options for cloning the drive?

I don't have physical access to the hardware personally, but I know the support staff will boot from a LiveCD and then use dd to copy the drive on to a new one.

Are there any other solutions in case plan A fails?
0
 
LVL 6

Assisted Solution

by:de2Zotjes
de2Zotjes earned 279 total points
ID: 35149945
I would start by packing the essential data into a tarball and writing the tarball directly to another system:
tar cf - /important-path | ssh remote-system 'cat - > tarballname'

That way if plan A fails you still have something...
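
Before relying on such a tarball, it is worth checking that it can be read back. A minimal sketch follows, with made-up function names (make_tarball, verify_tarball); the host name backup-host in the usage comment is a placeholder, not a real system.

```shell
#!/bin/sh
# make_tarball DIR: write a tar stream of DIR to stdout, with member names
# relative to DIR's parent directory. (Hypothetical helper.)
make_tarball() {
    tar cf - -C "$(dirname "$1")" "$(basename "$1")"
}

# verify_tarball FILE: exit 0 if tar can list every member of FILE.
verify_tarball() {
    tar tf "$1" > /dev/null
}

# Example, streaming to another machine and checking the result there:
#   make_tarball /important-path | ssh backup-host 'cat > important.tar'
#   ssh backup-host 'tar tf important.tar > /dev/null && echo readable'
```

Listing the archive only proves that it is structurally readable, not that every file in it is intact.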

There are no easy ways to get the data out if the clone fails; your best bet is to send the drive to a data recovery firm. They will use their own disk heads and logic to recover as much as possible.
0
 
LVL 21

Author Comment

by:Julian Matz
ID: 35150028
Sorry, what I meant was, are there any other options for trying to clone it (besides using dd copy). I do have a backup of files on another drive.

Restoring accounts and files wouldn't be the worst thing. It's everything else that caused me the headaches last time - re-configuring the server (mail, apache, php, extensions, modules, SSH, firewall, etc. etc.) and making everything work in harmony again.

That's why I really need the clone to work. Last time, the clone didn't work and we ended up re-installing the OS and then restoring accounts on to a new drive. I'd prefer not to have to go through that again.
0
 
LVL 6

Assisted Solution

by:de2Zotjes
de2Zotjes earned 279 total points
ID: 35150386
A dd copy is the best way. It copies at block level, so filesystem errors will not prevent data from being transferred. There really is no alternative other than the data recovery firm.
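
A block-level copy of this kind could be sketched as below. The function name clone_image is invented for illustration, and the device names in the usage comment are placeholders that must be double-checked before running anything. The conv=noerror,sync options are standard dd options: noerror continues after a read error, and sync pads a short or failed block with zeros so that later data keeps its offset on the target.

```shell
#!/bin/sh
# clone_image SOURCE TARGET: block-level copy that keeps going past read errors.
# With conv=sync every block is padded to the full block size, so bs should
# divide the source size evenly; 4096 does for the 320 GB drive in this thread
# (625142448 sectors of 512 bytes). (Hypothetical helper.)
clone_image() {
    dd if="$1" of="$2" bs=4096 conv=noerror,sync
}

# Example, from a rescue/live CD with neither disk mounted (placeholders!):
#   clone_image /dev/sdX /dev/sdY
# Afterwards, run fsck on the partitions of the new disk before booting from it.
```

Blocks that could not be read end up as zeros on the new disk, which is why a filesystem check on the clone is still needed.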
0
 
LVL 7

Expert Comment

by:droyden
ID: 35150648
jebus, 50C is way too hot for a server room imo, even wikipedia agrees!
http://en.wikipedia.org/wiki/Data_center#Environmental_control

(16-24C)

Additionally, if it is not the hard drive, I would suggest checking the rest of the hardware, starting with the RAID controller. Depending on what type of server it is, there may be Linux userspace diagnostic tools. Failing that, you should be able to get the server techs to run the diagnostics on it. This all assumes the server has a support contract or warranty.
0
 
LVL 6

Expert Comment

by:de2Zotjes
ID: 35150829
A loaded hard disk will easily go to about 20 degrees (K or C) above the ambient temperature.

50 degrees C is indeed way too hot for a server room. The measurement is for the disk, not the air in the datacenter.
0
 
LVL 21

Author Comment

by:Julian Matz
ID: 35150873
That's for room temperature though, no? Not the temperature of the actual drive.

I trust the techs have already run the majority of diagnostic tests. They informed me that, as it stands, a new drive is the best or only option right now. They're giving me some time to secure backups and transfer a couple of accounts, and then will wait for my go-ahead to start cloning.
0
 
LVL 21

Author Comment

by:Julian Matz
ID: 35150946
I mean 50 C is the drive temp. I'd imagine the temperature of the server room is much lower. As far as I know, at least 2 of their DCs have brand new, state-of-the-art air-conditioning, power backup and even biometric sensors and security cameras. In total, I think they have at least 4 DCs, so I'm sure they're running according to standards.
0
 
LVL 21

Author Comment

by:Julian Matz
ID: 35409362
I've just had a brand new drive installed. The old one was cloned on to a new one.

I'm still having some trouble though. Today, I noticed that /tmp and /var/tmp were in read-only mode. I was able to fix it by unmounting and remounting using the following commands:

root [/]# mount -o noexec,nosuid,rw /tmp
root [/]# mount -o defaults,usrquota,bind,noauto /tmp /var/tmp



However, the syslog reports there are still FS errors on /dev/sda2:

Apr 17 00:08:39 kernel: EXT3-fs warning (device sda2): ext3_clear_journal_err: Filesystem error recorded from previous mount: IO failure
Apr 17 00:08:39 kernel: EXT3-fs warning (device sda2): ext3_clear_journal_err: Marking fs in need of filesystem check.
Apr 17 00:08:39 kernel: EXT3-fs warning: mounting fs with errors, running e2fsck is recommended
Apr 17 00:08:39 kernel: EXT3 FS on sda2, internal journal
Apr 17 00:08:39 kernel: EXT3-fs: recovery complete.
Apr 17 00:08:39 kernel: EXT3-fs: mounted filesystem with ordered data mode.



I also got the following messages:

Apr 15 09:17:15 kernel: EXT3-fs warning (device sda2): ext3_unlink: Deleting nonexistent file (366), 0
Apr 15 10:10:01 kernel: EXT3-fs warning (device sda2): ext3_unlink: Deleting nonexistent file (356), 0
Apr 15 10:50:13 kernel: EXT3-fs warning (device sda2): ext3_unlink: Deleting nonexistent file (134), 0
Apr 16 05:58:38 kernel: EXT3-fs error (device sda2): ext3_lookup: unlinked inode 383 in dir #2
Apr 16 05:58:38 kernel: EXT3-fs error (device sda2): ext3_journal_start_sb: Detected aborted journal



What could be causing this? Could it be that some still corrupt files were copied over to the new drive, or is it more likely an issue not directly related to the drive?
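
To keep an eye on whether these messages keep coming back, something along these lines could be used. It is only a rough sketch: the function name count_ext3_problems is made up for illustration, and the default log path assumes the stock CentOS 5 syslog location, /var/log/messages.

```shell
# count_ext3_problems DEVICE [LOGFILE]
# Print how many kernel "EXT3-fs error" or "EXT3-fs warning" lines
# mention DEVICE (e.g. sda2). LOGFILE defaults to /var/log/messages.
count_ext3_problems() {
    dev=$1
    log=${2:-/var/log/messages}
    # -c prints the number of matching lines, -E enables extended regex
    grep -cE "EXT3-fs (error|warning) \(device $dev\)" "$log"
}

# Hypothetical usage:
#   count_ext3_problems sda2
```

If the count stops growing after a full fsck, that would suggest the errors were leftovers copied over from the old drive rather than something new.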

0
 
LVL 6

Assisted Solution

by:de2Zotjes
de2Zotjes earned 279 total points
ID: 35409833
It is possible that filesystem errors were copied along during the cloning. I would expect the datacenter techs to be savvy enough to know that and run the full set of fsck's on the filesystems of a cloned disk. Then again, these are people that do some manual labour and for some reason there is this great divide between head and hands :)

I recommend restarting the system and forcing an fsck during boot (issue "touch /forcefsck" before the reboot).

By the way, to get from mounted ro to rw you do not need to umount on a recent Linux box. You can use the mount command and add remount as one of the mount options, e.g.: mount -o remount,noexec,nosuid,rw /tmp
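
To check whether a filesystem is currently mounted read-only before remounting, a small helper like the one below might do. This is a sketch under assumptions: the name mount_is_ro is hypothetical, and it simply reads the options column of /proc/mounts (or of another file in the same format, passed as the second argument).

```shell
# mount_is_ro MOUNTPOINT [MOUNTS_FILE]
# Exit status: 0 if MOUNTPOINT is mounted read-only,
#              1 if it is mounted read-write,
#              2 if it does not appear in the mounts table.
mount_is_ro() {
    awk -v mp="$1" '
        $2 == mp { opts = $4; found = 1 }   # last matching entry wins
        END {
            if (!found) exit 2
            n = split(opts, o, ",")
            for (i = 1; i <= n; i++)
                if (o[i] == "ro") exit 0
            exit 1
        }' "${2:-/proc/mounts}"
}

# Hypothetical usage:
#   mount_is_ro /tmp && mount -o remount,noexec,nosuid,rw /tmp
```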
0
 
LVL 21

Author Comment

by:Julian Matz
ID: 35435409
I know what you mean!

Thanks! In that case, it does look to me like the errors might have been copied over. I haven't had the server crash on me since, so that's a good sign. Just before the hard drive was replaced, it had come to the stage where the server was crashing every second day. On that note, it does appear to have been the drive.

The reason I first unmounted the partition using umount was because I was tempted to run fsck myself, but then changed my mind.

I would have also rebooted the server except for the fact that I wouldn't have been able to monitor boot progress. The server wouldn't immediately have come back online, and then I'd have been wondering if it would come back at all without some kind of user interaction. I might ask for a KVM switch though.

I've asked the DC staff to look into a few things for me. Usually they get back to me very quickly, but it's taking them a bit longer this time.

I'll definitely wrap this up very soon though.
0
 
LVL 21

Author Closing Comment

by:Julian Matz
ID: 35708685
Thanks for all the great help. It was obviously a problem with the drive itself. I haven't had any crashes since the swap :)
0
