Domino 8.5.3 running on Dell servers attached to RDM IBM SAN with disk problems

Our problem is that our Domino databases keep becoming corrupt. Not all of them, just a few at a time. Domino reports that it cannot read the database, and in the filesystem we can see that the DB is 0 bytes.
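To catch these 0-byte files before users report them, the data directory can be scanned on a schedule. The following is only a minimal sketch: the function name `find_zero_byte_dbs` and the suffix list are illustrative assumptions, not part of Domino, and it only inspects file sizes from the operating system side without opening any database.

```python
import os
from pathlib import Path

# Assumed set of suffixes to check: NSF and NTF files plus mail.box.
DB_SUFFIXES = {".nsf", ".ntf", ".box"}


def find_zero_byte_dbs(data_dir):
    """Return a sorted list of database-like files under data_dir whose size is 0 bytes."""
    hits = []
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            path = Path(root) / name
            if path.suffix.lower() not in DB_SUFFIXES:
                continue
            try:
                if path.stat().st_size == 0:
                    hits.append(path)
            except OSError:
                # The file may be locked or removed mid-scan; skip it.
                continue
    return sorted(hits)
```

Calling `find_zero_byte_dbs(r"D:\Lotus\Domino\data")` (the data path that appears in the log lines later in this thread) would return the truncated files, which can then be logged with a timestamp to see when each one first went to 0 bytes.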

This all started about 3 weeks ago, and the first corrupt DB showed up after we upgraded to the latest version of BE from 12.5. We've worked a lot with Symantec and haven't gotten anywhere. We've removed the remote agent from the Domino server and corruption still occurs.

We were running Domino Defrag but we've since disabled that.

We've sent SAN diagnostics to IBM and they didn't say anything was wrong.

The Domino server is running on a Server 2008 R2 VM with the operating system on C:; the D: drive is a Raw Device Mapping to another dedicated SAN array.

This was working perfectly for 6 months or so in this setup without a hitch...

We are perplexed as to what can cause this, and we aren't getting anywhere with our vendors.

I'm looking for advice on what to investigate and whether anyone else is running Domino in this way successfully.

FYI, the last reboot of the virtual Domino server triggered Checkdisk, and it said it found a lot of empty space that was marked as allocated... I'm guessing those are the DBs disappearing.

Sjef Bosman

What types of databases are corrupted? Is it always the same database? Is it only mail databases, or also application databases?

Could it be that someone used Notes to access the databases directly, i.e. bypassing the server?

ITDharam

ASKER

We haven't been able to find a pattern to the corruption, and I should clarify that it isn't just NSF files; some NTF files, mail.box, and full text indices also become corrupt.

It isn't the same databases, although it has happened that 1 database becomes corrupted again.

For now we've just been deleting the file remnant, and then pulling down a fresh replica from another cluster member and it doesn't appear that we've ever lost anything at this point.

We have never been in the habit of opening files on the server file system through anything other than the Domino Administrator, so no, I think it is safe to say that nobody is opening the files directly.

One oddity, and I haven't been able to get a 100% satisfactory answer on whether this is a problem or not: Server1 used to be a physical server with the D: drive consisting of a dedicated RAID 10 array on our IBM MD3400 SAN. We decided to virtualize the mail server, so we set up a new OS install with Domino, repointed the LUN to our VMware cluster, and attached the disk as an RDM to that VM. It worked fine for several months, but then Symantec pointed this out. They did not say it is actually a problem; in fact, they said that it is not a problem for Symantec BE. OK, I'm getting to the point here: the actual thing they pointed out was that the Domino server LUN was/is visible to our Backup Exec server. It shows up in Disk Management, but isn't initialized and doesn't have a drive letter assigned. The reason for this is that in a SAN, BE, VMware environment, the BE server is supposed to have access to the VMware LUNs for SAN backups. Because of the grouping limitations, BE also sees the Domino LUN... I'm just throwing this out there because we're pretty much at a loss.

Thanks for the response.

Sjef Bosman

Hope you find a working solution!

ITDharam

ASKER

Ha ha, thanks for the help. I knew it was a long shot. I'll keep this open for a bit and see if anyone else responds; otherwise you get the points for taking a moment out of your day.

Sjef Bosman

Hehe :-) Can you move back some steps? They all say that you're not supposed to do what you did, but apparently it worked for you. I can imagine you're reluctant to downgrade from 12.5, but it could be a good test in order to prove it is or it isn't the BE version you now use. A lot of work, you say... Yep, sorry...

SOLUTION

akhafaf

ITDharam

ASKER

Gentlemen, sorry for the delay. I have multiple clients and I don't get to visit this problem daily.

sjef_bosman, I'd appreciate it if you could expand on the part where you say "They all say you aren't supposed to do what you did..." I'm sure that applies to a lot of things I've done, but can you clarify what you're referring to? Also, we removed the BE agent from this particular Domino server and are now backing up from one of our other Domino servers, but the problem continues, so I don't think it is a Symantec problem...

akhafaf, here are your answers:
The problem does not occur daily. It first started 1 month ago, and since then we find DBs unavailable on a somewhat random basis: we may go up to 3 days without any files becoming corrupt, and then we might have 8 in a day. We've also noticed that the corruption will happen on the same file multiple times and so far hasn't touched many others.

We'll get a series of messages:
01/13/2013 03:19:20 PM Warning: Fixup purged corrupt document UNID (534F80F6:425C50A3:87257A41:007D09EA) from D:\Lotus\Domino\data\mail\jgunn.nsf
01/13/2013 03:19:20 PM Document NT001B92E6 in database D:\Lotus\Domino\data\mail\jgunn.nsf is damaged: This database cannot be read due to an invalid on disk structure
01/13/2013 03:19:20 PM Document (UNID OF534F80F6:425C50A3-ON87257A41:007D09EA) in database D:\Lotus\Domino\data\mail\jgunn.nsf has been deleted

And another message that says "cannot allocate space"

At this point the file will show as 0 KB and these messages will repeat if you attempt to open it.
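Since the corruption seems to hit some files repeatedly, it may help to count the "is damaged" messages per database and per day from an exported copy of the console log. This is a rough sketch that assumes the log lines look exactly like the three samples above (mm/dd/yyyy timestamp with AM/PM); the function names are made up for illustration and the regular expression would need adjusting for other message formats.

```python
import re
from collections import Counter
from datetime import datetime

# Matches lines such as:
# 01/13/2013 03:19:20 PM Document NT001B92E6 in database D:\...\jgunn.nsf is damaged: ...
DAMAGED = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2} [AP]M)\s+"
    r"Document \S+ in database (?P<db>.+?) is damaged:"
)


def damaged_events(lines):
    """Return (timestamp, database path) pairs for every 'is damaged' line."""
    events = []
    for line in lines:
        match = DAMAGED.match(line.strip())
        if match:
            ts = datetime.strptime(match.group("ts"), "%m/%d/%Y %I:%M:%S %p")
            events.append((ts, match.group("db")))
    return events


def count_by_database(events):
    """Count damage events per database, ignoring case in the path."""
    return Counter(db.lower() for _ts, db in events)


def count_by_day(events):
    """Count damage events per calendar day."""
    return Counter(ts.date() for ts, _db in events)
```

Feeding the lines of the log file to `damaged_events` and then looking at `count_by_database(...).most_common()` would show which files are hit repeatedly, and `count_by_day` can be compared against backup, compact, or SAN maintenance schedules.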

Compact was scheduled daily, and Defrag was set as a scheduled program but was disabled. The primary Domino admin believes Defrag was making the problem worse.

I've been told that in some cases running fixup -f will fix access, but only at a certain stage, and I don't have details on that. I'm told this tends to be a temporary fix.

What we end up having to do is dbcache flush, delete the file from Domino Administrator, and then create an accelerated replica, and we're back up and running.

Here is something else interesting: we're getting found.000, found.001 folders in the root of the data drive. I always thought this was associated with disk corruption, so we contacted IBM and sent them the logs of our SAN. They said that while they found one disk that was reporting 'predictive failure warnings', they didn't see anything that could cause problems. We replaced the drive anyway and still encountered problems. We sent the logs again, and this time they said one of the HBAs in the server isn't connected. We determined that the HBA was bad and that it was only connected via one SAN path, which happens to be its default path, so I'm not sure that the HBA was failing on an active connection. If that had happened, I'd imagine a few corrupt DBs would be a blessing.
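The found.000, found.001 folders are where Checkdisk puts recovered file fragments (as .chk files), so their timestamps give a rough record of when each repair ran. A small sketch for listing them follows; the function name `list_found_dirs` is an illustrative assumption, and because these folders are normally hidden system folders, the script may need to run with administrative rights.

```python
import re
from datetime import datetime
from pathlib import Path

# Checkdisk names its recovery folders found.000, found.001, and so on.
FOUND_DIR = re.compile(r"^found\.\d{3}$", re.IGNORECASE)


def list_found_dirs(drive_root):
    """Return (folder name, last modified time, number of .chk files) per found.NNN folder."""
    results = []
    for entry in Path(drive_root).iterdir():
        if entry.is_dir() and FOUND_DIR.match(entry.name):
            chk_count = sum(1 for p in entry.iterdir() if p.suffix.lower() == ".chk")
            modified = datetime.fromtimestamp(entry.stat().st_mtime)
            results.append((entry.name, modified, chk_count))
    return sorted(results)
```

Running `list_found_dirs("D:\\")` and comparing the dates with the days on which databases went to 0 bytes could show whether the repairs line up with reboots or with the corruption itself.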

Thanks for the response, I hope to hear back on some brilliant ideas!

SOLUTION

Sjef Bosman

ITDharam

ASKER

Well, I wasn't specifically told it was a problem. I asked Symantec and they said that it is specifically NOT a problem for BE. I haven't found anything worthwhile, besides my intuition, that says this is a problem.

The setup was working for approx 6 months with no errors in this configuration, and then we updated from 12 to 12.5 and 2 days later this corruption started. We've removed the BE Remote Agent, and I've just confirmed there is nothing in the notes.ini referring to Symantec or Extension Manager, so in effect we have reverted to a prior configuration that was known to be working.

I spoke with the Domino admin and she is saying that as this problem continues, it is becoming apparent that the corruption does occur on the same DBs; however, there is no apparent reason or connection between those DBs. It is still the case that one DB will corrupt, and then 5 more will corrupt the next day, that type of thing...

Thanks,

ASKER CERTIFIED SOLUTION

Andrew_Luder

ITDharam

ASKER

Andrew_Luder, thanks for that, this may be the most relevant direction to take, but we'll never know.

After nearly a month of fairly regular corruptions I did go ahead and rebuild the server from scratch, and the corruption hasn't happened since (about 4 days now, but we're pretty hopeful).

I took the RDM volume from the original VM, wiped it, formatted it with VMFS, and built the new VM on that same storage (10 disks in a RAID 10 array). So the other suggestion, that allocating an RDM to a virtual machine while the BE server can still see that LUN is the cause of this problem, is high on my list as well.

Thanks for the help gentlemen, until next time...