Solved

Domino 8.5.3 running on Dell servers attached to RDM IBM SAN with disk problems

Posted on 2013-01-10
11
1,179 Views
Last Modified: 2016-11-23
Our problem is that our Domino databases keep becoming corrupt. Not all of them, just a few at a time. Domino will report that it cannot read the database, and then in the filesystem we can see that the DB is 0 bytes.

This all started about 3 weeks ago, and the first corrupt DB showed up after we upgraded to the latest version of BE from 12.5. We've worked a lot with Symantec and haven't gotten anywhere. We've removed the remote agent from the Domino server and corruption still occurs.

We were running Domino Defrag but we've since disabled that.

We've sent SAN diagnostics to IBM and they didn't say anything was wrong.

The Domino server is running on a Server 2008 R2 VM with the operating system on C:; the D: drive is a Raw Device Mapping to another dedicated SAN array.

This was working perfectly for 6 months or so in this setup without a hitch...

We are perplexed as to what can cause this, and we aren't getting anywhere with our vendors.

I'm looking for advice on what to investigate and whether anyone else is running Domino in this way successfully.

FYI, the last reboot of the virtual Domino server triggered Checkdisk, and it said it found a lot of empty space that was marked as allocated. I'm guessing those are the DBs disappearing.
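
For anyone wanting to check for the same symptom, here is a minimal sketch in Python that walks a Domino data directory and lists database files that are 0 bytes. The function name, the extension list (NSF, NTF and mail.box) and the example path are assumptions for illustration; adjust them to your own server.

```python
import os
from pathlib import Path

# Assumed set of extensions to check: NSF/NTF databases plus mail.box.
DOMINO_EXTENSIONS = {".nsf", ".ntf", ".box"}


def find_zero_byte_databases(data_dir):
    """Return a sorted list of Domino database files under data_dir that are 0 bytes."""
    hits = []
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            path = Path(root) / name
            if path.suffix.lower() not in DOMINO_EXTENSIONS:
                continue
            try:
                if path.stat().st_size == 0:
                    hits.append(str(path))
            except OSError:
                # The file may be locked or removed mid-scan; skip it.
                continue
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical data directory; change to match your installation.
    for p in find_zero_byte_databases(r"D:\Lotus\Domino\data"):
        print(p)
```

Run on a schedule, something like this would at least timestamp when a file drops to 0 bytes, which may help correlate the corruption with backups, compaction or other jobs.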
0
Comment
Question by:ITDharam
11 Comments
 
LVL 46

Expert Comment

by:Sjef Bosman
ID: 38763760
What types of databases are corrupted? Is it always the same database? Is it only mail databases, or also application databases?

Could it be that someone used Notes to access the databases directly, i.e. bypassing the server?
0
 
LVL 8

Author Comment

by:ITDharam
ID: 38764255
We haven't been able to find a pattern to the corruption, and I should clarify that it isn't just NSF files: some NTF files, mail.box, and full text indices also become corrupt.

It isn't always the same databases, although it has happened that a database becomes corrupted more than once.

For now we've just been deleting the file remnant, and then pulling down a fresh replica from another cluster member and it doesn't appear that we've ever lost anything at this point.

We have never been in the habit of opening files on the server's file system through anything other than the Domino Administrator, so no, I think it is safe to say that nobody is opening the files directly.

One oddity, and I haven't been able to get a 100% satisfactory answer on whether it is a problem or not: Server1 used to be a physical server with the D: drive consisting of a dedicated RAID 10 array on our IBM MD3400 SAN. We decided to virtualize the mail server, so we set up a new OS install with Domino, repointed the LUN to our VMware cluster, and attached the disk as an RDM to that VM. It worked fine for several months. Then Symantec pointed something out, but did not say it is actually a problem; in fact, they said that it is not a problem for Symantec BE. What they pointed out was that the Domino server LUN was/is visible to our Backup Exec server: it shows up in Disk Management, but isn't initialized and doesn't have a drive letter assigned. The reason is that in a SAN, BE, VMware environment, the BE server is supposed to have access to the VMware LUNs for SAN backups. Because of the grouping limitations, BE also sees the Domino LUN. I'm just throwing this out there because we're pretty much at a loss.

Thanks for the response.
0
 
LVL 46

Expert Comment

by:Sjef Bosman
ID: 38764768
Hope you find a working solution!
0
 
LVL 8

Author Comment

by:ITDharam
ID: 38765273
Ha ha, thanks for the help. I knew it was a long shot. I'll keep this open for a bit and see if anyone else responds; otherwise you get the points for taking a moment out of your day.
0
 
LVL 46

Expert Comment

by:Sjef Bosman
ID: 38765313
Hehe :-)  Can you move back some steps? They all say that you're not supposed to do what you did, but apparently it worked for you. I can imagine you're reluctant to downgrade from 12.5, but it could be a good test in order to prove it is or it isn't the BE version you now use. A lot of work, you say... Yep, sorry...
0
 
LVL 15

Assisted Solution

by:akhafaf
akhafaf earned 150 total points
ID: 38767161
Hi there ITDharam,

You mentioned:
>>>Domino will report that it cannot read the database, and then in the filesystem we can see that the DB is 0 bytes<<< Let me ask the following:
- Does this problem occur on a daily basis or only sometimes? If it does occur daily, does it happen at a certain time of day, e.g. at 10:00 AM every day?
- What do you see in the Domino log files when you attempt to access these corrupted databases?
- Did you run the maintenance commands (fixup, updall and compact) on these databases? Do you have these commands scheduled? (On the Configuration tab, go to Programs and configure them.) Then check what happens.

Best Wishes
0
 
LVL 8

Author Comment

by:ITDharam
ID: 38779021
Gentlemen, sorry for the delay, I have multiple clients and I don't get to visit this problem daily.

sjef_bosman, I'd appreciate if you could expand on the part where you say "They all say you aren't supposed to do what you did..." I'm sure that applies to a lot of things I've done, can you clarify what you're referring to though?  Also, we removed the BE agent from this particular domino server, we're now backing up from one of our other domino servers, but this problem continues so I don't think it is a Symantec problem...

akhafaf, here are your answers:
The problem does not occur daily. It first started 1 month ago, and since then we've found DBs unavailable on a somewhat random basis: we may go up to 3 days without any files becoming corrupt, and then we might have 8 in a day. We've also noticed that the corruption will happen on the same file multiple times and so far hasn't touched many others.

We'll get a series of messages:
01/13/2013 03:19:20 PM  Warning: Fixup purged corrupt document UNID (534F80F6:425C50A3:87257A41:007D09EA) from D:\Lotus\Domino\data\mail\jgunn.nsf
01/13/2013 03:19:20 PM  Document NT001B92E6 in database D:\Lotus\Domino\data\mail\jgunn.nsf is damaged: This database cannot be read due to an invalid on disk structure
01/13/2013 03:19:20 PM  Document (UNID OF534F80F6:425C50A3-ON87257A41:007D09EA) in database D:\Lotus\Domino\data\mail\jgunn.nsf has been deleted

And another message that says "cannot allocate space"

At this point the file will show as 0 KB and these messages will repeat if you attempt to open it.
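
To look for a time-of-day pattern or repeat offenders, the "is damaged" lines quoted above can be tallied per database and per hour. This is only a sketch in Python: the regular expression is inferred from the three sample lines in this post, it assumes US-style MM/DD/YYYY timestamps and paths without spaces, and the function name is made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Pattern inferred from the sample log lines; real logs may vary.
LINE_RE = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2} [AP]M)\s+.*?"
    r"in database (?P<db>\S+\.(?:nsf|ntf|box)) is damaged",
    re.IGNORECASE,
)


def tally_damage(lines):
    """Count 'is damaged' messages per database and per hour of day."""
    by_db = Counter()
    by_hour = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        ts = datetime.strptime(m.group("ts"), "%m/%d/%Y %I:%M:%S %p")
        by_db[m.group("db").lower()] += 1  # normalize case of Windows paths
        by_hour[ts.hour] += 1
    return by_db, by_hour


if __name__ == "__main__":
    sample = [
        r"01/13/2013 03:19:20 PM  Warning: Fixup purged corrupt document UNID (534F80F6:425C50A3:87257A41:007D09EA) from D:\Lotus\Domino\data\mail\jgunn.nsf",
        r"01/13/2013 03:19:20 PM  Document NT001B92E6 in database D:\Lotus\Domino\data\mail\jgunn.nsf is damaged: This database cannot be read due to an invalid on disk structure",
        r"01/13/2013 03:19:20 PM  Document (UNID OF534F80F6:425C50A3-ON87257A41:007D09EA) in database D:\Lotus\Domino\data\mail\jgunn.nsf has been deleted",
    ]
    by_db, by_hour = tally_damage(sample)
    for db, count in by_db.most_common():
        print(count, db)
    for hour, count in sorted(by_hour.items()):
        print(f"{hour:02d}:00", count)
```

Fed a few weeks of console log, the per-hour counts would show whether the damage clusters around a scheduled job such as the daily compact.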

Compact was scheduled daily, and Defrag was set as a scheduled program but was disabled.  The primary domino admin believes Defrag was making the problem worse.

I've been told that in some cases running fixup -f will fix access but only at a certain stage and I don't have details on that, I'm told this tends to be a temporary fix.

What we end up having to do is dbcache flush, delete the file from domino administrator, and then we create an accelerated replica and we're back up and running.


Here is something else interesting: we're getting found.000, found.001 folders in the root of the data drive. I always thought this was associated with disk corruption, so we contacted IBM and sent them the logs of our SAN. They said that while they found one disk that was reporting 'predictive failure warnings', they didn't see anything that could cause problems. We replaced the drive anyway and still encountered problems. We sent the logs again, and this time they said one of the HBAs in the server wasn't connected. We determined that the HBA was bad and that it was only connected via one SAN path, which happens to be its default path, so I'm not sure that the HBA was failing on an active connection. If that had happened, I'd imagine a few corrupt DBs would be a blessing.

Thanks for the response, I hope to hear back on some brilliant ideas!
0
 
LVL 46

Assisted Solution

by:Sjef Bosman
Sjef Bosman earned 175 total points
ID: 38779421
About the non-standard thing: that's what I read in your last paragraph here, the way you virtualised Domino. Hence my remark: if it worked, and then someone advised you to modify your configuration, you could revert to an earlier configuration that worked.

Are you backing up from Domino server B, and the file corruption occurs on A? Did you remove Symantec from server A, including the Extension Manager inserts in notes.ini ?
0
 
LVL 8

Author Comment

by:ITDharam
ID: 38779480
Well, I wasn't specifically told it was a problem.  I asked Symantec and they said that it is specifically NOT a problem for BE.  I haven't found anything worthwhile, besides my intuition, that says this is a problem.

The setup was working for approx 6 months with no errors in this configuration, and then we updated from 12 to 12.5 and 2 days later this corruption started.  We've removed BE Remote agent, and I've just confirmed there is nothing in the notes.ini referring to Symantec or Extension Manager so in effect we have reverted to a prior configuration that was known to be working.

I spoke with the Domino admin and she is saying that as this problem continues, it is becoming apparent that the corruption does occur on the same DBs; however, there is no apparent reason or connection between those DBs. And it is still the case that one DB will corrupt, and then 5 more will corrupt the next day.

Thanks,
0
 

Accepted Solution

by:
Andrew_Luder earned 175 total points
ID: 38797501
This may help. I found the Windows 2008 Microsoft hotfix discussed below fixed my Domino 8.5.3 "cannot allocate space" and "insufficient memory" issues  with large and/or heavily fragmented databases.


DominoDefrag - News: How fragmentation on incorrectly formatted NTFS volumes affects Domino

http://www.openntf.org/internal/home.nsf/news.xsp?databaseName=CN=NotesOSS2/O=NotesOSS!!Projects%5Cpmt.nsf&documentId=027517F9D756864D86257A670069EC1E&action=openDocument


Also make sure Domino data and temp directories are excluded from any Windows file level anti-virus scanning (which you've covered I think)
0
 
LVL 8

Author Closing Comment

by:ITDharam
ID: 38803582
Andrew_Luder, thanks for that, this may be the most relevant direction to take, but we'll never know.

After nearly a month of fairly regular corruptions I did go ahead and rebuild the server from scratch and the corruption hasn't happened since (about 4 days now but we're pretty hopeful)

I took the RDM volume from the original VM, and just wiped it and formatted it with VMFS and built the new VM on that same storage (10 disks in RAID 10 array), so the other suggestion that allocating an RDM to a virtual machine while the BE server can still see that LUN is the cause of this problem is high on my list as well.

Thanks for the help gentlemen, until next time...
0
