Domino 8.5.3 running on Dell servers attached to RDM IBM SAN with disk problems

Our problem is that our Domino databases keep becoming corrupt. Not all of them, just a few at a time. Domino will report that it cannot read the database, and then in the filesystem we can see that the DB is 0 bytes.

This all started about 3 weeks ago, and the first corrupt DB showed up after we upgraded to the latest version of BE from 12.5. We've worked a lot with Symantec and haven't gotten anywhere; we've removed the remote agent from the Domino server and corruption still occurs.

We were running Domino Defrag but we've since disabled that.

We've sent SAN diagnostics to IBM and they didn't say anything was wrong.

The Domino server runs on a Server 2008 R2 VM with the operating system on C:; D: is a Raw Device Mapping to another dedicated SAN array.

This was working perfectly for 6 months or so in this setup without a hitch...

We are perplexed as to what can cause this, and we aren't getting anywhere with our vendors.

I'm looking for advice on what to investigate and whether anyone else is running Domino in this way successfully.

FYI, the last reboot of the virtual Domino server triggered Check Disk, which reported a lot of empty space marked as allocated. I'm guessing those are the DBs disappearing.
Andrew_Luder commented:
This may help. I found the Windows 2008 Microsoft hotfix discussed below fixed my Domino 8.5.3 "cannot allocate space" and "insufficient memory" issues  with large and/or heavily fragmented databases.

DominoDefrag - News: How fragmentation on incorrectly formatted NTFS volumes affects Domino!!Projects%5Cpmt.nsf&documentId=027517F9D756864D86257A670069EC1E&action=openDocument

Also make sure Domino data and temp directories are excluded from any Windows file level anti-virus scanning (which you've covered I think)
Sjef Bosman (Groupware Consultant) commented:
What types of databases are corrupted? Is it always the same database? Is it only mail databases, or also application databases?

Could it be that someone used Notes to access the databases directly, i.e. bypassing the server?
ITDharam (Author) commented:
We haven't been able to find a pattern to the corruption, and I should clarify that it isn't just NSF files; some NTF files and full-text indices also become corrupt.

It isn't always the same databases, although the same database has occasionally become corrupted again.

For now we've just been deleting the file remnant, and then pulling down a fresh replica from another cluster member and it doesn't appear that we've ever lost anything at this point.

We have never been in the habit of opening files on the server's file system through anything other than the Domino Administrator, so I think it is safe to say that nobody is opening the files directly.

One oddity, and I haven't been able to get a fully satisfactory answer on whether it is a problem: Server1 used to be a physical server with the D: drive on a dedicated RAID 10 array on our IBM MD3400 SAN. We decided to virtualize the mail server, so we set up a new OS install with Domino, repointed the LUN to our VMware cluster, and attached the disk to that VM as an RDM. It worked fine for several months. Then Symantec pointed something out, though they did not say it is actually a problem; in fact, they said it is not a problem for Symantec BE. What they pointed out was that the Domino server LUN was, and still is, visible to our Backup Exec server: it shows up in Disk Management but isn't initialized and doesn't have a drive letter assigned. The reason is that in a SAN, BE, and VMware environment, the BE server is supposed to have access to the VMware LUNs for SAN backups. Because of the grouping limitations, BE also sees the Domino LUN. I'm just throwing this out there because we're pretty much at a loss.

Thanks for the response.
Sjef Bosman (Groupware Consultant) commented:
Hope you find a working solution!
ITDharam (Author) commented:
Ha ha, thanks for the help.  I knew it was a long shot.  I'll keep this open for a bit and see if anyone else responds otherwise you get the points for taking a moment out of your day.
Sjef Bosman (Groupware Consultant) commented:
Hehe :-)  Can you move back some steps? They all say that you're not supposed to do what you did, but apparently it worked for you. I can imagine you're reluctant to downgrade from 12.5, but it could be a good test in order to prove it is or it isn't the BE version you now use. A lot of work, you say... Yep, sorry...
akhafaf commented:
Hi there ITDharam,

You mentioned:
>>>Domino will report that it cannot read the database, and then in the filesystem we can see that the DB is 0 bytes<<<
Let me ask the following:
- Does this problem occur daily or just sometimes? If daily, does it happen at a certain time of day, e.g. at 10:00 AM every day?
- What do you get in the Domino log files when you attempt to access these corrupted databases?
- Did you run the maintenance commands (fixup, updall and compact) on these databases? Do you have these commands scheduled? (On the Configuration tab, go to Programs and configure them.) Then check what happens.

Best Wishes
ITDharam (Author) commented:
Gentlemen, sorry for the delay; I have multiple clients and I don't get to visit this problem daily.

sjef_bosman, I'd appreciate if you could expand on the part where you say "They all say you aren't supposed to do what you did..." I'm sure that applies to a lot of things I've done, can you clarify what you're referring to though?  Also, we removed the BE agent from this particular domino server, we're now backing up from one of our other domino servers, but this problem continues so I don't think it is a Symantec problem...

akhafaf, here are your answers:
The problem does not occur daily, this first started 1 month ago, and since then we'll find DBs unavailable on a somewhat random basis, we may go up to 3 days without any files becoming corrupt, and then we might have 8 in a day.  We've also noticed that the corruption will happen on the same file multiple times and so far hasn't touched many others.

We'll get a series of messages
01/13/2013 03:19:20 PM  Warning: Fixup purged corrupt document UNID (534F80F6:425C50A3:87257A41:007D09EA) from D:\Lotus\Domino\data\mail\jgunn.nsf
01/13/2013 03:19:20 PM  Document NT001B92E6 in database D:\Lotus\Domino\data\mail\jgunn.nsf is damaged: This database cannot be read due to an invalid on disk structure
01/13/2013 03:19:20 PM  Document (UNID OF534F80F6:425C50A3-ON87257A41:007D09EA) in database D:\Lotus\Domino\data\mail\jgunn.nsf has been deleted

And another message that says "cannot allocate space"

At this point the file will show as 0 KB, and these messages will repeat if you attempt to open it.
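
A minimal sketch for spotting these 0-byte files before users report them. It only walks a directory tree and checks file sizes; the function name and the extension list are illustrative assumptions, and it does not look at full-text index folders.

```python
import os
from typing import Iterator, Tuple

# Extensions discussed in this thread (databases and templates); an assumption,
# extend as needed.
DOMINO_EXTENSIONS = (".nsf", ".ntf")


def find_zero_byte_files(
    root: str, extensions: Tuple[str, ...] = DOMINO_EXTENSIONS
) -> Iterator[str]:
    """Yield paths under root that match the extensions and are 0 bytes long."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            try:
                if os.path.getsize(path) == 0:
                    yield path
            except OSError:
                # The file vanished or could not be stat'ed; skip it
                # rather than abort the whole scan.
                continue
```

Run on a schedule against the data directory (for example `D:\Lotus\Domino\data`, the path shown in the log lines above), the resulting list gives a timestamped record of which files were hit and when.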

Compact was scheduled daily, and Defrag was set as a scheduled program but was disabled.  The primary domino admin believes Defrag was making the problem worse.

I've been told that in some cases running fixup -f will fix access but only at a certain stage and I don't have details on that, I'm told this tends to be a temporary fix.

What we end up having to do is dbcache flush, delete the file from domino administrator, and then we create an accelerated replica and we're back up and running.

Here is something else interesting: we're getting found.000 and found.001 folders in the root of the data drive. I always thought this was associated with disk corruption, so we contacted IBM and sent them our SAN logs. They said that while one disk was reporting 'predictive failure warnings', they didn't see anything that could cause problems. We replaced the drive anyway and still encountered problems. We sent the logs again, and this time they said one of the HBAs in the server isn't connected. We determined that the HBA was bad and that it was connected via only one SAN path, which happens to be its default path, so I'm not sure the HBA was failing on an active connection. If that had happened, I'd imagine a few corrupt DBs would be a blessing.
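
As a rough aid for lining up those found.NNN folders with the corruption dates, the sketch below lists each such folder in a drive root with its modification time and the number of files inside. The function name is hypothetical, and reading these folders may require an elevated account since they are normally hidden system folders.

```python
import os
import re
from datetime import datetime
from typing import List, Tuple

# Matches folder names such as found.000 and found.001.
FOUND_DIR_RE = re.compile(r"^found\.\d{3}$", re.IGNORECASE)


def list_found_dirs(drive_root: str) -> List[Tuple[str, datetime, int]]:
    """Return (name, last-modified time, file count) for each found.NNN folder."""
    results = []
    for entry in os.scandir(drive_root):
        if entry.is_dir() and FOUND_DIR_RE.match(entry.name):
            modified = datetime.fromtimestamp(entry.stat().st_mtime)
            file_count = sum(1 for e in os.scandir(entry.path) if e.is_file())
            results.append((entry.name, modified, file_count))
    # Folder names are unique, so sorting orders the list by name.
    return sorted(results)
```

If the modification times cluster around reboots or around the days with many corrupt databases, that would be worth passing along to the storage vendor.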

Thanks for the response, I hope to hear back on some brilliant ideas!
Sjef Bosman (Groupware Consultant) commented:
About the non-standard thing: that's what I read in your last paragraph here, the way you virtualised Domino. Hence my remark: if it worked, and then someone advised you to modify your configuration, you could revert to an earlier configuration that worked.

Are you backing up from Domino server B, and the file corruption occurs on A? Did you remove Symantec from server A, including the Extension Manager inserts in notes.ini ?
ITDharam (Author) commented:
Well, I wasn't specifically told it was a problem.  I asked Symantec and they said that it is specifically NOT a problem for BE.  I haven't found anything worthwhile, besides my intuition, that says this is a problem.

The setup was working for approx 6 months with no errors in this configuration, and then we updated from 12 to 12.5 and 2 days later this corruption started.  We've removed BE Remote agent, and I've just confirmed there is nothing in the notes.ini referring to Symantec or Extension Manager so in effect we have reverted to a prior configuration that was known to be working.

I spoke with the Domino admin and she says that as this problem continues, it is becoming apparent that the corruption does recur on the same DBs; however, there is no apparent reason or connection between those DBs. It is still the case that one DB will corrupt, and then 5 more will corrupt the next day.
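
One way to check the "same DBs" impression against the logs is to tally the "is damaged" messages per database. This is only a sketch: it assumes the console or log output has been saved as plain text, and the regular expression is modelled solely on the message format quoted earlier in this thread.

```python
import re
from collections import Counter
from typing import Iterable

# Modelled on: "Document NT... in database <path> is damaged: ..."
DAMAGED_RE = re.compile(
    r"in database (?P<db>.+?\.(?:nsf|ntf)) is damaged", re.IGNORECASE
)


def count_damaged_databases(lines: Iterable[str]) -> Counter:
    """Count 'is damaged' log messages per database path (case-insensitive)."""
    counts: Counter = Counter()
    for line in lines:
        match = DAMAGED_RE.search(line)
        if match:
            # Windows paths are case-insensitive, so normalise before counting.
            counts[match.group("db").lower()] += 1
    return counts
```

Feeding it the lines of a saved log file and calling `most_common()` on the result would show which databases recur most often, which can then be compared by size, age, template, or location on disk.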

ITDharam (Author) commented:
Andrew_Luder, thanks for that, this may be the most relevant direction to take, but we'll never know.

After nearly a month of fairly regular corruptions I did go ahead and rebuild the server from scratch and the corruption hasn't happened since (about 4 days now but we're pretty hopeful)

I took the RDM volume from the original VM, and just wiped it and formatted it with VMFS and built the new VM on that same storage (10 disks in RAID 10 array), so the other suggestion that allocating an RDM to a virtual machine while the BE server can still see that LUN is the cause of this problem is high on my list as well.

Thanks for the help gentlemen, until next time...
Question has a verified solution.
