Chireru

asked on

ReiserFS causing kernel panic

About a week ago, my 40 GB drive "died" on me... it was running ReiserFS. The symptoms: on access to certain files, the system would dump error messages, give I/O errors, or kernel panic. I managed to get a lot of what I wanted off it through a painful cycle: copy until it kernel panics, reboot the box, and continue, skipping the file that set it off.

Now my 120 GB drive (the one I backed everything up to) is exhibiting the same symptoms. Running reiserfsck gives the odd error no matter how many times it's run (is ReiserFS too dumb to mark bad sectors, so that it might be moving data onto, or rebuilding indexes on, bad sectors?). A --rebuild-tree kernel panicked on me, but on the second run it came out fine. Following the Maxtor RMA procedure, I downloaded their diagnostics program, and it says the drive is fine.

Anyone have any idea what is going on? Is it a bad version of ReiserFS? Is Maxtor lying? The kernel panic mentions it was during a swap... how can I fsck swap?
There are 4 drives in this machine, and this is the second one to do this (neither of them has been the root or swap disk).
Linux osaka 2.6.10-gentoo-r6-OSAKA #10 Fri Feb 4 22:04:26 EST 2005 i686 AMD Athlon(tm) Processor AuthenticAMD GNU/Linux


Here's the kernel panic... it's during a reiserfsck:
Mar 30 22:12:47 osaka printing eip:
Mar 30 22:12:47 osaka c0151d0e
Mar 30 22:12:47 osaka *pde = 00000000
Mar 30 22:12:47 osaka Oops: 0000 [#1]
Mar 30 22:12:47 osaka PREEMPT
Mar 30 22:12:47 osaka Modules linked in:
Mar 30 22:12:47 osaka CPU:    0
Mar 30 22:12:47 osaka EIP:    0060:[<c0151d0e>]    Not tainted VLI
Mar 30 22:12:47 osaka EFLAGS: 00010202   (2.6.10-gentoo-r6-OSAKA)
Mar 30 22:12:47 osaka EIP is at sync_buffer+0xe/0x50
Mar 30 22:12:47 osaka eax: 41129601   ebx: d7e55d1c   ecx: c0011354   edx: 00000002
Mar 30 22:12:47 osaka esi: d7e55d24   edi: c16ffdc8   ebp: 00000000   esp: d7e55ccc
Mar 30 22:12:47 osaka ds: 007b   es: 007b   ss: 0068
Mar 30 22:12:47 osaka Process reiserfsck (pid: 23072, threadinfo=d7e54000 task=dd2dc020)
Mar 30 22:12:47 osaka Stack: c16ffdc8 00000000 c04f0892 c0011354 c0151d00 d7e55d30 dd2dc020 d7e55d18
Mar 30 22:12:47 osaka c0151d00 c04f0931 00000002 c16ffdc8 c0011354 00000002 00000000 dd2dc020
Mar 30 22:12:47 osaka c0129c10 d7e55d30 d7e55d30 00000018 c0011354 00000002 00000001 dd2dc020
Mar 30 22:12:47 osaka Call Trace:
Mar 30 22:12:47 osaka [<c04f0892>] __wait_on_bit_lock+0x52/0x60
Mar 30 22:12:47 osaka [<c0151d00>] sync_buffer+0x0/0x50
Mar 30 22:12:47 osaka [<c0151d00>] sync_buffer+0x0/0x50
Mar 30 22:12:47 osaka [<c04f0931>] out_of_line_wait_on_bit_lock+0x91/0xa0
Mar 30 22:12:47 osaka [<c0129c10>] wake_bit_function+0x0/0x60
Mar 30 22:12:47 osaka [<c0129c10>] wake_bit_function+0x0/0x60
Mar 30 22:12:47 osaka [<c0151d83>] __lock_buffer+0x33/0x40
Mar 30 22:12:47 osaka [<c01538a7>] block_invalidatepage+0xb7/0xe0
Mar 30 22:12:47 osaka [<c0132ee7>] find_get_pages+0x37/0x70
Mar 30 22:12:47 osaka [<c013d0a7>] do_invalidatepage+0x27/0x30
Mar 30 22:12:47 osaka [<c013d12b>] truncate_complete_page+0x7b/0x80
Mar 30 22:12:47 osaka [<c013d310>] truncate_inode_pages+0xf0/0x2c0
Mar 30 22:12:47 osaka [<c01581bc>] kill_bdev+0x3c/0x50
Mar 30 22:12:47 osaka [<c01592ab>] blkdev_put+0x13b/0x140
Mar 30 22:12:47 osaka [<c0151ab3>] __fput+0x163/0x180
Mar 30 22:12:47 osaka [<c01500f9>] filp_close+0x59/0x90
Mar 30 22:12:47 osaka [<c011754c>] put_files_struct+0x5c/0xd0
Mar 30 22:12:47 osaka [<c01182f2>] do_exit+0x1a2/0x460
Mar 30 22:12:47 osaka [<c0118625>] do_group_exit+0x35/0xb0
Mar 30 22:12:47 osaka [<c0121521>] get_signal_to_deliver+0x1f1/0x2f0
Mar 30 22:12:47 osaka [<c0102524>] do_signal+0x94/0x120
Mar 30 22:12:47 osaka [<c0119d4f>] do_setitimer+0x1af/0x1e0
Mar 30 22:12:47 osaka [<c011e96f>] sys_alarm+0x3f/0x70
Mar 30 22:12:47 osaka [<c010dc70>] do_page_fault+0x0/0x5c7
Mar 30 22:12:47 osaka [<c01025e5>] do_notify_resume+0x35/0x38
Mar 30 22:12:47 osaka [<c0102766>] work_notifysig+0x13/0x15
Mar 30 22:12:47 osaka Code: cd b8 00 e0 ff ff 21 e0 ff 48 14 8b 40 08 a8 08 75 04 31 c0 eb dc e8 72 e4 39 00 eb f5 83 ec 08 8b 44 24 0c 8b 40 20 85 c0 74 1b <8b> 40 04 8b 80 94 00 00 00 85 c0 74 0e 8b 40 38 85 c0 74 07 8b
Mar 30 22:12:47 osaka su(pam_unix)[22092]: session closed for user root
Mar 30 22:12:48 osaka init: PANIC: segmentation violation at 0xffffe420! sleeping for 30 seconds.
Mar 30 22:12:48 osaka <1>Unable to handle kernel paging request at virtual address ffffe02c
Mar 30 22:12:48 osaka printing eip:
Mar 30 22:12:48 osaka c017d0ad
Mar 30 22:12:48 osaka *pde = 00001067
Mar 30 22:12:48 osaka *pte = 0f78f062
Mar 30 22:12:48 osaka Oops: 0000 [#2]
Mar 30 22:12:48 osaka PREEMPT
Mar 30 22:12:48 osaka Modules linked in:
Mar 30 22:12:48 osaka CPU:    0
Mar 30 22:12:48 osaka EIP:    0060:[<c017d0ad>]    Not tainted VLI
Mar 30 22:12:48 osaka EFLAGS: 00010282   (2.6.10-gentoo-r6-OSAKA)
Mar 30 22:12:48 osaka EIP is at elf_core_dump+0x29d/0xc7a
Mar 30 22:12:48 osaka eax: f14c9ba8   ebx: 00000000   ecx: dadcc420   edx: 0000007b
Mar 30 22:12:48 osaka esi: f2492000   edi: f2493fc4   ebp: f2492000   esp: f2493d6c
Mar 30 22:12:48 osaka ds: 007b   es: 007b   ss: 0068
Mar 30 22:12:48 osaka Process init (pid: 23117, threadinfo=f2492000 task=d99c4500)
Mar 30 22:12:48 osaka Stack: f14c9b60 d99c4500 0000000b 00000048 f76b8354 00000006 00000000 f7dc06b4
Mar 30 22:12:48 osaka f5c8a0d8 00000000 00000002 c02780a4 00000048 f619d400 f26f7860 f26f7d60
Mar 30 22:12:48 osaka f26f7760 f14c9b60 dadcc420 00000000 f619d400 f26f7860 f26f7760 f14c9b60
Mar 30 22:12:48 osaka Call Trace:
Mar 30 22:12:48 osaka [<c02780a4>] inotify_dentry_parent_queue_event+0x64/0xa0
Mar 30 22:12:48 osaka [<c015cc01>] do_coredump+0x1b1/0x1ce
Mar 30 22:12:48 osaka [<c011ef0f>] free_uid+0x1f/0x80
Mar 30 22:12:48 osaka [<c011f7f5>] __dequeue_signal+0xe5/0x1a0
Mar 30 22:12:48 osaka [<c011f8e5>] dequeue_signal+0x35/0x90
Mar 30 22:12:48 osaka [<c012153a>] get_signal_to_deliver+0x20a/0x2f0
Mar 30 22:12:48 osaka [<c0102524>] do_signal+0x94/0x120
Mar 30 22:12:48 osaka [<c0122578>] sys_rt_sigaction+0xa8/0xc0
Mar 30 22:12:48 osaka [<c010dc70>] do_page_fault+0x0/0x5c7
Mar 30 22:12:48 osaka [<c01025e5>] do_notify_resume+0x35/0x38
Mar 30 22:12:48 osaka [<c0102766>] work_notifysig+0x13/0x15
Mar 30 22:12:48 osaka Code: 8c 68 28 8b 57 24 89 50 2c 8b 57 28 89 50 30 8b 57 2c 89 50 34 8b 57 30 89 50 38 8b 57 34 89 50 3c 8b 57 38 89 50 40 8b 4c 24 48 <0f> b7 3d 2c e0 ff ff 8b 06 8b 40 70 8b 50 28 89 0c 24 c7 44 24
Mar 30 22:12:55 osaka <4>ReiserFS: warning: is_tree_node: node level 256 does not match to the expected one 2
Mar 30 22:12:55 osaka ReiserFS: hdg1: warning: vs-5150: search_by_key: invalid format found in block 12898932. Fsck?
Mar 30 22:12:55 osaka ReiserFS: hdg1: warning: vs-13070: reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [6 37 0x0 SD]
Mar 30 22:12:55 osaka ReiserFS: warning: is_tree_node: node level 59099 does not match to the expected one 1
Mar 30 22:12:55 osaka ReiserFS: hdg1: warning: vs-5150: search_by_key: invalid format found in block 12714251. Fsck?
Mar 30 22:12:55 osaka ReiserFS: hdg1: warning: vs-13070: reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [6 84 0x0 SD]
Mar 30 22:12:55 osaka ReiserFS: warning: is_tree_node: node level 256 does not match to the expected one 1
Mar 30 22:12:55 osaka ReiserFS: hdg1: warning: vs-5150: search_by_key: invalid format found in block 13081635. Fsck?
Mar 30 22:12:55 osaka ReiserFS: hdg1: warning: vs-13070: reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [6 39 0x0 SD]
Mar 30 22:12:55 osaka ReiserFS: warning: is_tree_node: node level 256 does not match to the expected one 2
Mar 30 22:12:55 osaka ReiserFS: hdg1: warning: vs-5150: search_by_key: invalid format found in block 12898932. Fsck?
Mar 30 22:12:55 osaka ReiserFS: hdg1: warning: vs-13070: reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [6 28 0x0 SD]
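
To check whether these warnings keep hitting the same spots (as one bad region would) or land all over the disk, the block numbers can be pulled out of the log and counted. A minimal Python sketch, assuming syslog lines in the format shown above; the function name `tally_bad_blocks` is made up for illustration.

```python
import re
from collections import Counter

# Matches the vs-5150 warning, e.g.
# "ReiserFS: hdg1: warning: vs-5150: search_by_key: invalid format found in block 12898932. Fsck?"
BLOCK_RE = re.compile(r"ReiserFS: (\w+): warning: vs-5150: .*? in block (\d+)")

def tally_bad_blocks(lines):
    """Count how often each (device, block) pair is reported with an invalid format."""
    counts = Counter()
    for line in lines:
        m = BLOCK_RE.search(line)
        if m:
            counts[(m.group(1), int(m.group(2)))] += 1
    return counts

if __name__ == "__main__":
    sample = [
        "Mar 30 22:12:55 osaka ReiserFS: hdg1: warning: vs-5150: search_by_key: invalid format found in block 12898932. Fsck?",
        "Mar 30 22:12:55 osaka ReiserFS: hdg1: warning: vs-5150: search_by_key: invalid format found in block 12714251. Fsck?",
        "Mar 30 22:12:55 osaka ReiserFS: hdg1: warning: vs-5150: search_by_key: invalid format found in block 13081635. Fsck?",
        "Mar 30 22:12:55 osaka ReiserFS: hdg1: warning: vs-5150: search_by_key: invalid format found in block 12898932. Fsck?",
    ]
    # Most frequently reported blocks first.
    for (dev, block), n in tally_bad_blocks(sample).most_common():
        print(f"{dev} block {block}: {n}")
```

On the excerpt above, block 12898932 is reported twice and the other two blocks once each.
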
ASKER CERTIFIED SOLUTION
wesly_chen
Chireru

ASKER

Thanks for the reply.

I also thought that it was a hardware problem...  I hate hardware problems :(

1. Possible, but a little extreme to start with.

2. The hard disk cable is a possibility, but the two hard drives were on separate cables. I don't have any spares around right now, so I'll hold off on that one. I've had bad luck with power supplies before; even buying very good ones has landed me with sneaky problems like this. I've put the drive in a different machine, and so far so good (it's about a quarter of the way through a copy, with no errors yet).

3. I am one minor revision behind, and kernel.org's changelog doesn't indicate any changes that would affect this, so I doubt this is the cause.

If the copy succeeds, that narrows it down to the PSU, the bus, the controller, a bad ReiserFS driver or tools, or a kernel problem.
The PSU is a 420W Enlight, and the box had grown into my file server (4 HDs, 2 CD-ROMs, video, 2 NICs, floppy)... but now that I step back, that is quite a lot for a 420W PSU. For now I've unplugged the CD-ROMs, the floppy and one NIC, which should ease the load a bit. We'll see how it behaves once my copying is done.
> (4 HD's, 2 CDroms, video, 2 NICs, floppy)....
Wooooo! You probably need a 500W PSU.
The CPU, the video card (especially a newer one with a fan on the card), hard disks and CD-ROM drives all draw significant current.
Chireru

ASKER

The copy completed successfully, and fsck shows the drive clean in the other computer, so the disk itself and its filesystem are fine.

That's a bit of a relief.

However, when I plug it back into the offending system, I'm getting these errors (for starters... I haven't tried to do much with it yet):
Mar 31 22:42:33 osaka attempt to access beyond end of device
Mar 31 22:42:33 osaka hdh1: rw=0, want=14140282800, limit=78155279
Mar 31 22:42:33 osaka attempt to access beyond end of device
Mar 31 22:42:33 osaka hdh1: rw=0, want=14140282800, limit=78155279
Mar 31 22:42:33 osaka attempt to access beyond end of device
Mar 31 22:42:33 osaka hdh1: rw=0, want=14140282800, limit=78155279
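
These numbers are easier to judge as sizes. A quick Python check, assuming (as I understand the 2.6 block layer's message) that `want` and `limit` are counts of 512-byte sectors:

```python
SECTOR_BYTES = 512  # ASSUMED unit of the want/limit fields

want = 14140282800  # values copied from the "hdh1: rw=0, ..." lines above
limit = 78155279

def sectors_to_gb(sectors):
    """Convert a count of 512-byte sectors to decimal gigabytes."""
    return sectors * SECTOR_BYTES / 1e9

print(f"partition size: {sectors_to_gb(limit):.1f} GB")
print(f"requested end:  {sectors_to_gb(want):.1f} GB")
print(f"overshoot:      {want / limit:.0f}x the partition size")
```

Under that assumption the read asks for data about 7.2 TB into a partition of about 40 GB, roughly 181 times past its end. No real file offset looks like that, which suggests the block number itself was garbage read back from the disk path rather than a location the filesystem genuinely meant to read.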

The BIOS shows decent power levels (3.3V rail at 3.5, 5V at 5.05, 12V at 12.18, all steady). I'm gonna try some more things tomorrow when I get a chance: investigate these new errors, borrow the IDE cables from the other box, and upgrade the kernel.
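
Those readings can be checked against a tolerance window. A Python sketch, assuming the ±5% window that ATX power supply design guides commonly give for these three rails, and taking the BIOS figures at face value (motherboard sensors are themselves only approximate):

```python
TOLERANCE = 0.05  # ASSUMED +/-5% regulation window for the +3.3 V, +5 V and +12 V rails

def rail_status(nominal, measured, tolerance=TOLERANCE):
    """Return (relative deviation, within tolerance?) for one supply rail."""
    deviation = (measured - nominal) / nominal
    return deviation, abs(deviation) <= tolerance

# Nominal voltage -> BIOS reading quoted in the post above.
readings = {3.3: 3.5, 5.0: 5.05, 12.0: 12.18}

for nominal, measured in readings.items():
    deviation, ok = rail_status(nominal, measured)
    print(f"{nominal:4.1f} V rail: {measured:.2f} V ({deviation:+.1%}) {'ok' if ok else 'OUT OF RANGE'}")
```

Taken literally, 3.5 V is about 6% above 3.3 V, just outside that window, while the 5 V and 12 V readings are within 1.5%. If the 3.5 figure is not a typo, it's worth confirming with a multimeter before trusting the sensor either way.
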
> Mar 30 22:12:48 osaka <1>Unable to handle kernel paging request at virtual address ffffe02c
Run the memory diagnostics, too.
http://www.memtest86.com/
Or download the Ultimate Boot CD, which has a lot of diagnostic tools.
Chireru

ASKER

I saw that earlier too... I had thought it was the physical hard drive doing it, so I turned off swap, and still got errors. I will try memtest; I have it on CD around here... somewhere...
Chireru

ASKER

After a bit of checking, it looks like my onboard ATA100 controller is the culprit.  Makes sense... both of the problematic drives were on it.

I need to do more testing, but otherwise I'm gonna have to head out tomorrow and buy myself an IDE controller.
> it looks like my onboard ATA100 controller is the culprit.
Have you upgraded the motherboard BIOS?
In some cases it's a motherboard bug that a BIOS upgrade can fix.
Chireru

ASKER

After confirming that the hard drive works fine in another computer and on another controller, I put it back on its original controller with its original cable, and it's no longer giving me any problems.

Any suggestions on how I can test the bus and controller for errors without just moving data on and off the hard drive and waiting for it to corrupt the filesystem (and possibly lose more data)?
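
One non-destructive approach is to read the whole disk several times without writing anything and compare what comes back: hard I/O errors show up as exceptions, while a flaky controller or cable may also return data that silently differs between passes. A minimal Python sketch, assuming Linux, root access, and that the filesystem on the disk is unmounted (or mounted read-only) so its contents can't legitimately change between passes. The device path is whatever your disk is (e.g. /dev/hdg for the hdg1 partition in this thread's logs); the function names are hypothetical.

```python
import hashlib
import os
import sys

CHUNK = 1024 * 1024  # bytes read per request (1 MiB)

def read_pass(path, chunk=CHUNK):
    """Read the device once, read-only.

    Returns ({offset: sha1 digest}, [offsets whose read raised an I/O error]).
    """
    digests, errors = {}, []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)  # total size in bytes
        if hasattr(os, "posix_fadvise"):
            # Ask the kernel to drop cached pages so this pass really goes to the disk.
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        for offset in range(0, size, chunk):
            try:
                data = os.pread(fd, chunk, offset)
            except OSError:
                errors.append(offset)
                continue
            digests[offset] = hashlib.sha1(data).hexdigest()
    finally:
        os.close(fd)
    return digests, errors

def compare_passes(path, passes=2):
    """Run several read passes; return (offsets with I/O errors, offsets whose data changed)."""
    first, errors = read_pass(path)
    error_set, changed = set(errors), set()
    for _ in range(passes - 1):
        again, more_errors = read_pass(path)
        error_set.update(more_errors)
        changed.update(off for off, digest in first.items()
                       if off in again and again[off] != digest)
    return sorted(error_set), sorted(changed)

if __name__ == "__main__" and len(sys.argv) > 1:
    io_errors, changed = compare_passes(sys.argv[1], passes=3)
    print(f"I/O errors at byte offsets: {io_errors}")
    print(f"chunks that read back differently: {changed}")
```

Run it as, for example, `python3 readcheck.py /dev/hdg` (the script name is arbitrary). Any offset in the second list means the same sectors came back with different contents on different reads, which points at the path between disk and memory (cable, controller, driver, or RAM) rather than the platters. Running it once with the disk on each controller gives a like-for-like comparison.
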
> I put it back on it's original controller with it's original cable, it's no longer giving me any problems.
Did you plug all the devices back into the power supply?

In my experience, the BIOS and the power supply are the most likely causes.
Chireru

ASKER

OK, after some testing, it looks like it's the controller. (Each test attempted to copy data off the drive; "failed" means there were I/O errors.)

Test1: Original configuration = failed
Test2: Removed as much power draw as I could = failed
Test3: Moved the hard drive to another controller and IDE cable = passed
Test4: Used the new IDE cable from Test3 on the original controller = failed
Test5: Moved the hard drive back to the controller from Test3 and added as much power draw as I could find, including all hard drives on a single cable = passed
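
The elimination in these tests can be written out explicitly: a component stays a suspect only if its setting alone predicts pass or fail in every test. A small Python sketch encoding the five tests; where a test doesn't say which cable or how much load it used, the sketch assumes the previous test's setting carried over (an assumption, flagged in the comments).

```python
# Factor settings for each test in the list above. Where a test doesn't state a
# setting, it is ASSUMED to carry over from the previous test.
tests = {
    "Test1": {"controller": "original", "cable": "original", "load": "normal", "passed": False},
    "Test2": {"controller": "original", "cable": "original", "load": "reduced", "passed": False},
    "Test3": {"controller": "other", "cable": "new", "load": "reduced", "passed": True},   # load assumed
    "Test4": {"controller": "original", "cable": "new", "load": "reduced", "passed": False},  # load assumed
    "Test5": {"controller": "other", "cable": "new", "load": "maximal", "passed": True},  # cable assumed
}

def consistent_factors(tests):
    """Return the factors whose setting alone predicts pass/fail in every test."""
    factors = [k for k in next(iter(tests.values())) if k != "passed"]
    suspects = []
    for factor in factors:
        outcome = {}  # factor setting -> pass/fail first seen with that setting
        if all(outcome.setdefault(t[factor], t["passed"]) == t["passed"]
               for t in tests.values()):
            suspects.append(factor)
    return suspects

print(consistent_factors(tests))
```

Only the controller survives: cable and power load each appear in both a passing and a failing test. Since the controller's driver changes along with the controller here, these tests alone can't separate the two.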

So, at this point it could be the controller, or maybe the driver for the controller... either way, I'm happy to go out and buy a new controller to solve this and not lose any more data.

Thanks for your help, Wesly. I'll keep you updated on whether the new controller fixes it.

Good luck; I'll keep my fingers crossed for you.
Chireru

ASKER

Looks like that solved it.
Thanks