Solved

Memory Fault(coredump) - Bad memory?

Posted on 2009-04-07
Last Modified: 2013-12-06
I am running "vgscan -v" to try to fix some volume groups that won't activate after a hard shutdown. It runs for a few seconds and then gives "Memory Fault(coredump)".

This might be related: previously I was getting the error below, but I fixed it by increasing the kernel parameter maxssiz from 8 MB to 16 MB:
received a SIGSEGV for stack growth failure.
Possible causes: insufficient memory or swap space,
or stack size exceeded maxssiz.

This may need another topic, but what I need to do is activate 5 volume groups that are each giving these errors (the disks reside on an EMC Clariion):
"vgchange -a y /dev/vge28"
vgchange: Warning: Couldn't attach to the volume group physical volume "/dev/dsk/c10t3d4":
Cross-device link
vgchange: Warning: couldn't query physical volume "/dev/dsk/c10t3d4":
The specified path does not correspond to physical volume attached to
this volume group
vgchange: Warning: couldn't query all of the physical volumes.
vgchange: Couldn't activate volume group "/dev/vge28":
Quorum not present, or some physical volume(s) are missing.

"ioscan -fn" confirms that the disk is there:
disk       71  8/12/1/0.98.4.19.0.3.4   sdisk       CLAIMED     DEVICE       DGC     CX700WDR5

                                       /dev/dsk/c10t3d4   /dev/rdsk/c10t3d4

"strings /etc/lvmtab" looks correct:
/dev/vge28
/dev/dsk/c10t3d4

Question by:piggly
11 Comments
 
LVL 1

Expert Comment

by:Lensters
ID: 24094698
Check your /var/adm/syslog/syslog.log for I/O errors. I'm not familiar with the EMC Clariions, but you should also run their diagnostic self-tests.
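
A quick first pass would be something like this (just a rough filter - adjust the pattern to whatever your syslog actually records):

grep -i error /var/adm/syslog/syslog.log | more    # scan the log for anything mentioning an error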
 
LVL 20

Assisted Solution

by:tfewster
tfewster earned 500 total points
ID: 24095186
The "Memory Fault" is vgscan failing rather than a real memory issue. This might be due to a corrupt LVM header on a disk - which could be fixed with `vgcfgrestore` _IF_ you are sure that this disk belongs to this server! I've seen many cases where a LUN was visible to more than one server and had been reused without cleaning up the original "owner" system.

My general diagnostic approach would be:

As Lensters says, check syslog.log for hardware errors

Check that the disk is readable with `dd if=/dev/dsk/c10t3d4 of=/dev/null`; let it run for a couple of minutes before interrupting it with ^C

Check if it has a readable LVM header, e.g. with `xd -j8200 -N16 -tu /dev/rdsk/c10t3d4`; For help in interpreting the output, please see http://forums13.itrc.hp.com/service/forums/questionanswer.do?threadId=1224440 or http://forums11.itrc.hp.com/service/forums/questionanswer.do?threadId=1250141  Please also note the warnings there.
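
For reference, the two checks together would look something like this (a sketch only, using the device paths from your post):

# 1. Basic readability check - read the raw disk sequentially and throw the data away;
#    let it run for a couple of minutes, then interrupt it with ^C
dd if=/dev/dsk/c10t3d4 of=/dev/null

# 2. Dump 16 bytes from the LVM header area (see the ITRC threads above for how to interpret it)
xd -j8200 -N16 -tu /dev/rdsk/c10t3d4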


Double-check what the configuration _should_ be before making any repairs or changes - if you have regular system config captures or take Ignite images, you'll have this information somewhere. Otherwise you'll have to map out your system's architecture and storage.
 

Author Comment

by:piggly
ID: 24098243
Thanks for the responses.

I did see a few errors in the /var/adm/syslog/syslog.log:

Apr  6 14:24:17 mars vmunix: DIAGNOSTIC SYSTEM WARNING:
Apr  6 14:24:17 mars vmunix:    The diagnostic logging facility has started receiving excessive
Apr  6 14:24:17 mars vmunix:    errors from the I/O subsystem.  I/O error entries will be lost
Apr  6 14:24:17 mars vmunix:    until the cause of the excessive I/O logging is corrected.
Apr  6 14:24:17 mars vmunix:    If the DEMLOG daemon is not active, use the DIAGSYSTEM command
Apr  6 14:24:17 mars vmunix:    in SYSDIAG to start it.
Apr  6 14:24:17 mars vmunix:    If the DEMLOG daemon is active, use the LOGTOOL utility in SYSDIAG
Apr  6 14:24:17 mars vmunix:    to determine which I/O subsystem is logging excessive errors.
Apr  6 14:24:17 mars LVM[2507]: lvlnboot -v

I ran 'dd if=/dev/dsk/c10t3d4 of=/dev/null' and also 'dd if=/dev/rdsk/c10t3d4 of=/dev/null'; the disk was showing activity on the SAN. No problems here.

I then did a 'vgcfgrestore -n /dev/vge28 /dev/rdsk/c10t3d4'

Next, I tried activating the volume group again with 'vgchange -a y /dev/vge28'.
A few of the errors regarding the physical disk are gone, but I'm still receiving this:

vgchange: Couldn't activate volume group "/dev/vge28":
Quorum not present, or some physical volume(s) are missing.


I would also like to mention that these disks are just EMC clones of our production server. These specific LUNs are used for the Informix database. There are a total of 33 volume groups like these, and I'm only having the issue on the latest 5 that were allocated. An option would be to blow away the LUNs/disks and all the configurations. Removing the LUN/disk is easy to do, but I would be stuck on wiping HP-UX clean of any configuration, as I haven't had to do this before. I follow a specific set of instructions to make these volume groups/logical volumes available to Informix.
 
LVL 1

Expert Comment

by:Lensters
ID: 24100477
It is definitely a hardware problem. Running dd probably confirmed that the controller is OK, but LVM also checks data integrity, so the problem is most likely media. I think that the LVM messages are confusing; remember, there is an "abstract level of hardware" that it is looking at.
 
LVL 20

Accepted Solution

by:tfewster
tfewster earned 500 total points
ID: 24101461
I agree that there's a hardware issue, but it might not be directly related to the issue with this volume group. It may be that LUNs have been deallocated from this system while active - Use Support Tools Manager to locate the issue: http://docs.hp.com/en/diag/logtool/lgt_startm.htm
(Or for a quick pointer, `ioscan -fnC disk` and look to see if any devices are showing a status of NO_HW). A reboot may be needed.
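
For example (the grep is just a convenience if the full listing is long):

ioscan -fnC disk                  # list all disk devices and their states
ioscan -fnC disk | grep NO_HW     # anything showing NO_HW was seen before but has since disappeared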

What version of HP-UX are you using? It may make a difference to the LVM and diagnostic tools you have.

Back to the vg issue - I suspect that the Production system has more than one disk in the volume group that you're trying to clone & import, which doesn't match the configuration in lvmtab on "mars".  So let's start afresh, as if this volume group had never been used on "mars" before.

On the production system, create a mapfile of the volume group & disks it contains with `vgexport -p -v -m vge28_mapfile vge28`
I'm assuming the volume group name will be the same on "mars" as it is on Production.

Re-clone the EMC LUNs, as we've modified one of them with the `vgcfgrestore`

Copy the mapfile to "mars"

Clean up mars: Copy the vg "group" file, /dev/vge28/group; Blow away the existing vge28 configuration with `vgexport vge28`; Recreate  /dev/vge28 and copy the "group" file back in place.
(if this was truly a new vg, the "group" file would have to be created with `mknod`)

Import the volume group using the mapfile:
/usr/sbin/vgimport -m vge28_mapfile  -v  -s vge28
The "-s" searches attached disks for a Volume Group ID that matches the one in the mapfile; The (difficult) alternative is to specify the disk paths to be imported, but they may be different between Prod and "mars"

This is probably the same as your existing procedure, apart from the cleanup on mars? If you want to attach a copy of your instructions, I'll double-check them to make sure I haven't made any unwarranted assumptions about your setup...

 

Author Comment

by:piggly
ID: 24107036
Unfortunately I wasn't able to bring up xstm, cstm, or mstm. We are using a very old release of HP-UX, 10.20, and because of the applications these servers run, we aren't able to upgrade. It's quite unfortunate, I know, as this OS level is close to 15 years old.

The volume groups are indeed the same between both servers. There is one disk (LUN) per volume group. The volume group is then chunked up into five logical volumes: four logical volumes of 2 GB each and one logical volume of 520 MB. The reason for this is our older release of Informix.

I did attach the instructions/notes I've always followed; they are from the previous administrator. I also attached a 'vgdisplay -v' from the test (mars) server as well as from production. This 'vgdisplay -v' was done before any problems occurred. On prod, vge30, 31, and 32 have the 5th logical volume of 520 MB named incorrectly, but the size is still okay.

The question I have with saving the mapfile from production is about the physical disk names. For example, on vge28 that we've been dealing with, the disk is c12t3d4 on production and the cloned LUN is c10t3d4 on test (mars). If I save the mapfile, will it try to use the same disk (c12t3d4)?

Thanks for assisting with this so far. I really appreciate it. It has been a great learning experience so far!
notes-hpux.txt
mars-vgdisplay.txt
prod-vgdisplay.txt
 
LVL 20

Assisted Solution

by:tfewster
tfewster earned 500 total points
ID: 24136225
I've not forgotten this problem, but I'm off work this week and don't have access to an HP-UX system. In any case, I don't have any HP-UX 10.20 systems to test on - Though I know a company that still runs a 10.01 system ;-)  Anyway, LVM hasn't changed very much between 10.20 and 11i

I don't think I'm seeing the full picture yet; Although the old `vgdisplay` listing shows "mars" could see all these VGs at one time, you mentioned that "i'm only having the issue on the latest 5 that were allocated", so it sounds like something has changed. You also mentioned a "hard shutdown" - what was the problem there and what did you have to do to resolve it?

A couple of additional checks to do:
Please check the kernel parameter MAXVGS is set to at least 64 (using SAM -> kernel configuration)

Please check you don't have any duplicate minor numbers for the volume groups:
`ls -l /dev/vg*/group`
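
Each "group" file should have its own unique minor number (the 0xNN0000 value in the listing). If the list is long, a one-liner along these lines will flag any value that appears twice - a sketch only, assuming the minor number lands in the 6th field of `ls -l` output on your system:

ls -l /dev/vg*/group | awk '{print $6}' | sort | uniq -d    # prints nothing if all minor numbers are unique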


The instructions you attached look fine for setting up new VGs on the Production system, but they're not very good for dealing with EMC "clone" disks; You wouldn't want to recreate the VG info on the cloned LUN attached to "mars" (pvcreate/vgcreate/lvcreate), as the cloned LUN already contains valid LVM header info (and every time it was re-cloned, the production LUN header info would overwrite that on the Copy LUN). The re-usable way to do it would be to vgimport the LVM header info onto "mars".

My suggestion to do a `vgimport -s` with the mapfile was to get around the issue of the device paths changing, but as your EMC configuration tool reports back the actual device address that the HP-UX server allocates, that means you can do the vgimport by specifying the device path.

Whichever way you want to try, (Your instructions or `vgimport`), running `vgexport vge28` on "mars" should remove the vge28 VG info on that system so you can retry from a clean state.

By the way, how are you doing your cloning? Using the EMC to make a Business Copy "snapshot" of the Production LUN and then detaching the Copy and allocating/activating it on "mars"?
 

Author Comment

by:piggly
ID: 24142504
Tfewster,

Thank you so much for your help on this issue. Here are the exact steps (in order) I took to get everything working. I did this on volume groups vge28, 29, 30, 31, and 32:

On test:
vgexport vge28
mkdir /dev/vge28
ls -l /dev/vg*/group
mknod /dev/vge28/group c 64 0x380000

On prod:
vgexport -s -p -v -m vge28_mapfile vge28
ftp the vge28_mapfile file over to test

On clariion:
synchronize snapview clones
fracture once sync'd
update host info to view new device names

On test:
vgimport -m vge28_mapfile -v -s vge28
vgchange -a y /dev/vge28



To answer your questions,

"I'm only having the issue on the latest 5 that were allocated", so it sounds like something has changed. You also mentioned a "hard shutdown" - what was the problem there and what did you have to do to resolve it?"

The last 5 volume groups on each of these servers were added by me, using the directions given to me. These were the only volume groups I was having issues with.

The hard shutdown occurred when we had to reboot the storage processors on our Clariion for maintenance. When bringing the system back up after the maintenance was completed, we waited 45 minutes for it to boot before deciding to "hit the button". It booted back up normally afterwards, except for the problem with these 5 volume groups.


"Please check the kernel parameter MAXVGS is set to at least 64 (using SAM -> kernel configuration)"

I checked this and verified that its setting is at 80.


"By the way, how are you doing your cloning? Using the EMC to make a Business Copy "snapshot" of the Production LUN and then detaching the Copy and allocating/activating it on "mars"?"

Our cloning is done right on the EMC Clariion using the SnapView clones. We have, for example, a 10 GB LUN assigned to production, and we create an additional 10 GB LUN as a clone and assign it to test. To refresh the data, we set the LUNs to synchronize and, once sync'd up, we fracture them. The cloned LUN is always active on mars.
 
LVL 20

Expert Comment

by:tfewster
ID: 24143488
Hi piggly - I'm glad to hear you've fixed the problem even if we didn't pin down exactly what the issue was! It's time to expand those instructions in case you decide to move on...

I think the vgchange error messages you were seeing actually meant "I can see the LUN (/dev/dsk/c10t3d4) that should belong to this VG, but the LVM header info on the disk (e.g. the VGID) doesn't match the copy in my lvmtab".

The vgimport method means that problem won't re-occur, as "mars" uses the info that's already on the LUN to create its internal copy.  

You probably only saw this problem when the server was rebooted, as it hadn't detected that the disks' LVM headers had been changed by the resynching during normal running. The long boot time while it tries to activate all the VGs is understandable, especially if activation runs into problems.

In operation, you should halt apps and `vgchange -a n` the volume groups that are being re-synched, to avoid "mars" getting confused by the disk content changing. In fact, you should really do the same on the Prod system to ensure a consistent copy - Though that would mean arranging for application service downtime :-(    Anyway, I expect you are already aware of the potential problems on mars if the data isn't consistent and have a suitable process for doing this.
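
In other words, the refresh cycle would look roughly like this (a sketch only - the Clariion steps are whatever you already do with SnapView, and you'd repeat the vgchange for each of the affected VGs):

# On "mars", before re-synching: stop anything using the data, then deactivate the VG(s)
vgchange -a n /dev/vge28

# On the Clariion: synchronize the SnapView clones, then fracture once sync'd

# On "mars", after the fracture: reactivate and carry on
vgchange -a y /dev/vge28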

I would keep an eye on syslog for the next few days, and the next time you synch the LUNs, to make sure those errors don't recur. It's possible that the Diagnostic system warnings occurred because "mars" was trying to access its LUNs while they were being re-synched - the EMC Clariion would probably have automatically suspended (write) access to the SnapView clones during the synch.

I'm happy to carry on this discussion if there's anything you're still not happy about, but I think we've found likely explanations for all the weird behaviours.

Regards,
tfewster
 

Author Comment

by:piggly
ID: 24245740
Thanks tfewster for all your help. We were able to get a good clone of the production system by halting Informix and doing a 'vgchange -a n' on all the Informix disks. The test (mars) system was down for the entire duration of the sync. I made a few changes to the logical volumes on prod before the sync and duplicated the exact setup using the export/import method once again. Everything went very smoothly.

Syslog still puts out the I/O error while shutting down and booting:

Apr 22 18:32:10 mars vmunix: DIAGNOSTIC SYSTEM WARNING:
Apr 22 18:32:10 mars vmunix:    The diagnostic logging facility has started receiving excessive
Apr 22 18:32:10 mars vmunix:    errors from the I/O subsystem.  I/O error entries will be lost
Apr 22 18:32:10 mars vmunix:    until the cause of the excessive I/O logging is corrected.
Apr 22 18:32:10 mars vmunix:    If the DEMLOG daemon is not active, use the DIAGSYSTEM command
Apr 22 18:32:10 mars vmunix:    in SYSDIAG to start it.
Apr 22 18:32:10 mars vmunix:    If the DEMLOG daemon is active, use the LOGTOOL utility in SYSDIAG
Apr 22 18:32:10 mars vmunix:    to determine which I/O subsystem is logging excessive errors.

 
LVL 20

Expert Comment

by:tfewster
ID: 24246169
Thanks for the feedback, it's always nice to hear good news!

I'm surprised that the test system should give those errors on shutdown & startup if there was nothing happening on the Clariion to cause a "disk denial"; a quick Google shows that the earlier HP-UX 10.20 diagnostics were indeed invoked with `sysdiag` before the `*stm` package came out, and the interface looks similar to STM, so it would be worth using `sysdiag` to try to pin down the error messages.

The output of `dmesg` may show the "raw" errors even if the diagnostic logging tool is losing some of the messages.
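
For example (the grep is just a convenience):

dmesg | more              # dump the kernel message buffer
dmesg | grep -i error     # or filter for the obvious entries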

It's conceivable that a non-HP-UX add-on driver or software layer (e.g. EMC Dynamic Multipathing) is installed, but without a deep examination of the system and startup process (e.g. what loads after the Diagnostics?), it would be hard to find.

However, if the diagnostics and "DEMLOG" daemon are running (as described in the syslog messages) and sysdiag/syslog/dmesg don't show any issues during normal running, I wouldn't worry too much about it.

Regards,
tfewster
