sync957p

Mail db's are getting corrupt each day

Hi there.

My setup:

- 2 servers in a cluster, identical machines (Compaq ProLiant)
- About 350 users
- Data directory with only 2.5 GB free on both servers (mail files use about 70 GB)
- No errors in Event Viewer except "The driver detected that the device \Device\Harddisk0\DR0 has its write cache enabled. Data corruption may occur. " once in a while (our hardware expert told us this is a normal event)
- Transaction logging was enabled... I disabled it
- Scheduled programs on both servers: Compact -B and Fixup (no switches) every day (off hours)
- Some pretty large mail files (about 1 GB), which I'm hoping to reduce shortly
- Servers are R5.0.11 (R5 Portuguese clients)
- The machine with problems is also an Active Directory, DNS and DHCP server
- Backup Exec 8.1 with the Lotus Notes agent
- No antivirus for Domino enabled (planning for GroupShield)
- Antivirus enabled for the file system: NAI NetShield

The $Inbox folders in our mail files get corrupted at random, in some mail files only. This is now happening on a daily basis.
Sometimes the error is replicated across the cluster, but it happens almost exclusively on one of the servers.

The error reads "Invalid attachment ID in database.... document will be deleted"

3 days ago a user lost his inbox. I saw that the other server in the cluster still had a good copy of the mail file, so I forced a replication (good mail file to bad mail file, pushing documents), and the result was that the user lost all of his documents (?!?!?)

Help?

HemanthaKumar

It could be your cluster that is causing this problem. Clustering is something you have to understand in detail before implementing it. For the time being, switch off clustering and monitor whether any corruption still occurs.

When you replicate, remove the replication cutoff date in the corrupted DB and then do a push replication.

~Hemanth
sync957p

ASKER

I joined this company about 5 months ago.

Clustering was implemented by a major Lotus Notes reseller and developer in Portugal (IBS, based in Sweden).
The configuration looks all right. We only use it for failover, with no load balancing involved. The errors occur on only
one server (except when errors replicate from that server to the good one)... are you sure
this could be due to cluster problems? I'm going to deactivate it and wait to see what happens.

thanks

p.s.

also... the servers have been running fine for the last 3 years... they've been running fine for 5 months
now since the last upgrade (5.0.4 to 5.0.11)... this has happened only in the last 2 weeks.
Are both servers at the same 5.0.11 level?
Look at your antivirus at the file level.  Do its logs show activity on the corrupt mail files?

cheers,

Tom
What you describe sounds like the file-level antivirus detecting a virus in an attachment in the mail file and deleting it.  But because it works at the file level, the Notes database isn't updated.

Install the Domino-based antivirus solution ASAP, and stop scanning your databases at the file system level.

cheers,

Tom
Can you explain more about what you mean by corruption?
on to something here, but I'm suspicious.  Domino does NOT normally write mail attachments to disk as part of the delivery process, so the file system antivirus should NOT be touching the mail file (unless the user was SENDING a virus and saved it to the mail box).  When this sort of thing happens, it is almost always MAIL.BOX that becomes corrupt.

Also, Domino rarely writes attachments to disk at all, and most file antivirus solutions can't really check NSF files. Where you sometimes run into this is when you have both file and Domino AV installed and the file AV catches files as the Domino AV is scanning them (some of the Domino AV solutions do detach the file to disk, instead of scanning it in memory off the data note).
I would also look at the backup agent.  I don't remember if Backup Exec uses the Domino backup API, its own open file solution, no open file support, or something else.

By the way, anything using the Domino backup API is going to need transaction logging turned on.  When did you turn it off, and why?
Hi, and thanks for your answers.

- The file system antivirus skips NSF files (I set it up that way)

- I disabled transaction logging because it was not being used (if we don't use it for restores/recoveries, what's really the point
of using it?... my boss doesn't really like incremental backups)

- I really don't think worms are getting into our system. Description:

* No user is an administrator of their own machine, except IT staff
* kSpam is being used to stop all potentially dangerous file extensions before they reach the mailboxes
* ePolicy Orchestrator with VirusScan Enterprise 7.0 on each machine, with definitions updated almost daily
* All service packs and patches are up to date on our Win2k machines (or at least they were until yesterday)

- Both servers run exactly the same release

- The antivirus log doesn't show entries for NSF files but... it does show entries for TMP files related to Notes (I'm going
to take a good look at this and get back to you guys)

- By corruption I mean the following (I get these different errors):

* the mail database won't open (error in the client: view is corrupted or doesn't exist)
* the database opens but $Inbox is missing
* an error on the console related to the 2 client errors above: invalid attachment ID in db xxxx.nsf ... document will be deleted

- qwaletee: I don't think Backup Exec needs transaction logging... only if we were using incremental backups, right? (correct me if I'm wrong, please)

Thanks all
I think I have found the answer in the lotus support site.

http://www-1.ibm.com/support/docview.wss?uid=swg21093844

It seems your design task and cluster replication are colliding!
In order to back up open files using the Domino API, you would still use transaction logs.  What happens is that the backup software tells Domino that it should not write transactions to the database until it has finished backing up the DB. The DB becomes R/O and gets backed up, followed by the tLogs.  Once the tLog data is backed up, the DB is released.

Without tLogs, the database file can't be made R/O, because there is no place to record user changes or mail delivery or anything else.

---------

There wouldn't be any correspondence between the installation of kSpam and the corruption problem, would there?

---------

Set the file AV to skip Notes temp files.  If it interferes with them, Notes can easily corrupt data.  This could be true even if there was no virus in the file, since Notes may be unable to complete a transaction correctly while the AV has the file open.
Wrong link, Hemantha.  That only results in duplicate design elements, as the design refresh adds all the "missing" design elements before the replicator has a chance to copy them in from the source replica, and of course the "design task" design elements have different IDs from the originals.  Replication ultimately adds the original design elements in, without displacing the ones that design added.

Note that this can happen even in a non-clustered environment.  It is just typically more pernicious in a clustered environment, mostly because admins typically rely on the cluster's engineering environment to create the replicas.  It wreaks havoc if you do a late-day create-replica-via-adminp!!!!

The problem here is a different one.  The inbox is getting corrupt, not duplicated.

However, a dup inbox could explain ONE thing... why replication causes docs to "disappear."  If they disappear only from the inbox, not from All Documents ($All), then one of two things happened:
1) The docs were moved out of the inbox folder

2) The docs were seen in one inbox folder, but later you saw a different inbox folder, and some docs are in one of the two, some in the other

You can easily see if you have dup inboxes.  Either look at the folder list in Designer and see if there are two named ($Inbox), or use Shift-Ctrl to see hidden folders.  Notes changes one ($Inbox) to the "special" inbox view.  If there is a second, it treats it as a hidden, non-system view.  So, if there were a dup, you would see one folder named Inbox (normal Notes treatment), and one named ($Inbox).
qwaletee :

From Backup Exec 8.6 Manual >

"The following types of Lotus Domino R5 database configurations can be backed up using
the Lotus Domino Agent:

• Domino Server Databases - Domino Server databases can be Logged or Unlogged.
They are located in a folder in the Domino data directory, typically
Lotus\Domino\Data, but may also be linked to the Domino data directory using
Lotus Linked Databases.

- Logged Domino Server Databases - A logged Domino Server database logs
transactions for one or more Lotus databases. If transaction logging is enabled on
the server, all database transactions go into a single transaction log. For more
information, see gAbout Lotus Domino Transaction Logsh on page 885.

- Unlogged Domino Server Databases - An unlogged Domino Server database
does not have transaction logging enabled, or the transaction logging has been
disabled for specific server databases. Unlogged Domino Server databases will be
backed up when a full, differential, or incremental backup is performed, but the
database can only be restored to the point of the latest database backup.

• Local Databases - Lotus databases are considered Local when they cannot be found in
the Domino data directory, cannot be shared, and cannot be logged. This type of
database requires a backup of the database itself when using any of the Lotus Domino
R5 backup methods. The database can be restored only to the point of the latest
database backup."

which means that the software will back up open files in unlogged DBs, but has no support
for differential or incremental unlogged backups/restores (full backups are always needed), right?

Well, I don't wish to be pesky or anything, but the truth is that the server was running just fine and no changes
have been made (except for DBs growing in size, and yes, kSpam was installed a long time ago).

If you are sure about BE (I may have misunderstood the manual) I will activate transaction logging again (archive style).

HemanthaKumar: the symptoms are different... I have neither "$inbox" nor "inbox"... the technote mentions 2.
Well, BE will certainly back up unlogged DBs that are not open.  I just don't know what it would do if it encountered an unlogged DB that WAS currently open.  It could:

1) skip it
2) attempt to back it up, knowing that the backup may be bad (if the data changed while being backed up)
3) use some other type of locking mechanism on the file
4) do one of the pseudo-logging tricks that many backup packages used to do (at the OS level, hook any reads/writes on the DB, and place them in a sort of disk sector transaction log, while backing up a DB that now has direct changes prevented)
I'd place my money on a disk space error. If Backup Exec is set to back up open files (without the Domino Agent) it will make a copy first, then back it up.

Also, I have some experience with Notes doing strange things when disk space starts to run low. When you start to have NSFs in the 1 GB range, big chunks are going to be needed for housekeeping, indexing, compacting, etc.

Also, we had a problem at one time using NT drive mirroring. When space ran low, disk access started running very slowly, which led to other problems.

Kind of a simplistic answer, but worth a look.

Dave
Dave,

Good thoughts.  I have seen inboxes disappear in full-disk conditions.  Of course, the server usually also crashes, and this did not happen here.
So you think a 70 GB partition with 2 GB free (I know I'm on the edge here, but I should be
fixing this shortly) could cause the corruption? What about the other server
in the cluster that's not corrupting DBs? (same disk space)


thanks
Could this be related to your ODS?

You mentioned that this started happening after a migration, so it could be related to the issue described in this article:

http://www-1.ibm.com/support/docview.wss?rs=899&uid=swg21095979&loc=en_US&cs=utf-8&lang=en+en
I didn't mention that. I mentioned that a few months ago I upgraded the servers from 5.0.4b to 5.0.11.
I don't think it's related.

I deactivated the compact -B task (which ran every night) and the problem went away (I still haven't re-enabled transaction
logging)... but I really need to run compact every night (or at least every week)... does this info help?
Hemanth!  Please don't post URLs like that!  The scrolling can make one go bonkers.  All you need is the UID parameter.
A bit of explanation:
Compact -B usually does an in-place compact.
This would require no additional space.
But there are situations where it may do a "new copy" type of compact, even with -B.
It depends on the state of some indexes, the ODS version, and some voodoo I don't understand.

During the compact, if a file needs to be copied instead of compacted in place, you may run out of disk space, preventing mail delivery.
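
A rough way to spot that risk before a run, as a sketch only. It assumes a copy-style compact can need about as much free space as the database itself, which is a rule of thumb and not a documented Domino figure. The function name is made up for illustration, and you pass in your own data directory.

```python
import os
import shutil


def databases_too_big_for_copy_compact(data_dir, free_bytes=None):
    """Return (path, size) for every .nsf larger than the free space on the volume.

    Assumption: a copy-style compact needs roughly the database's own size
    in free space. That is a rule of thumb, not a documented Domino figure.
    """
    if free_bytes is None:
        # Free space on the volume that holds the data directory.
        free_bytes = shutil.disk_usage(data_dir).free
    risky = []
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            if name.lower().endswith(".nsf"):
                path = os.path.join(root, name)
                size = os.path.getsize(path)
                if size > free_bytes:
                    risky.append((path, size))
    # Largest database first, since it is the most likely to fill the disk.
    return sorted(risky, key=lambda item: item[1], reverse=True)
```

An empty result does not prove a compact is safe (several copies can be in flight, and mail keeps arriving), but a non-empty one is a clear warning.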

-----

You've got a real catch-22.

You need compact so that your tiny bit of available disk remains available.
But compact itself, depending on the options and database status, can eat huge amounts of disk space.
The real answer is that you need more disk, ASAP.

What you might want to do is develop a script that runs compact on the files in size order, with the smallest file running first.
That may allow you to save enough space before hitting a file that would be big enough to zap all your space.
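
Something along these lines would do the ordering, as a minimal sketch. Assumptions: the mail files sit in a "mail" subfolder of the data directory, "ncompact" is the compact executable on a Windows Domino server, and it accepts a path relative to the data directory followed by the switch. The helper name is hypothetical. It only builds the command list so you can review it; nothing is executed.

```python
import os


def compact_commands_smallest_first(data_dir, mail_subdir="mail", switch="-B"):
    """Build one compact command per mail file, smallest database first.

    Hypothetical helper: "ncompact" is assumed to be the compact executable
    on a Windows Domino server, and paths are given relative to data_dir.
    Nothing is executed here; the commands are only returned for review.
    """
    mail_dir = os.path.join(data_dir, mail_subdir)
    sized = []
    for name in os.listdir(mail_dir):
        if name.lower().endswith(".nsf"):
            size = os.path.getsize(os.path.join(mail_dir, name))
            sized.append((size, name))
    sized.sort()  # ascending by size, then by name
    return [["ncompact", os.path.join(mail_subdir, name), switch]
            for _size, name in sized]
```

Each returned entry is an argument list, so it could be handed to subprocess.run one at a time from the Domino program directory, checking free disk space between runs.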
Qwal, you are not the only one who doesn't have time to read all the posts.
I just do a quick copy/paste of the URL and worry less about the formatting.

Hey, lighten up a little.


I'm not worried about your formatting... I'm worried about mine.  Well, everyone's.  It's just hard to read with those darn URLs stretching out EE's table layout.

Whatever.
Ok.

I just upgraded my array. I now have a 135 Gb partition for mail files (about 65 Gb free space)

I ran compact and the damn databases are getting corrupted again! :(

What now? (damn, I wish I was a surf teacher instead)
For now I have all scheduled server programs stopped "until further orders".

I was running updall, fixup and compact on different schedules, but I am afraid to do ANYTHING
to the DBs
I was thinking about this problem ...

I had a situation once where (and my recollection is fuzzy on this) the template for the PAB was updated beyond the server release level.
This really screwed things up, but only to the point where DBs were misbehaving in strange ways. The problem was caused by the fact that we
have an environment with two R5 servers and an old 4.5 server (don't EVEN get me going on why).  Could it be that your templates
are somehow mixed up, the servers are refreshing asynchronously, and bad things happen when they replicate?

Dave
sync957p, are there any error messages on the console while compacting?
Did stopping the programs work?

A thought occurred to me...

Perhaps redesign was running at the same time as fixup, or as update, due to the large file sizes forcing overlap in the time periods.
This can cause index corruption (btree errors), which might cause fixup to delete the "bad note" (view/folder).
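
If you can pull start and finish times for each task out of the server log, checking for overlap is simple interval arithmetic. A small sketch, with made-up task names and times purely for illustration (they are not taken from this server):

```python
from datetime import datetime


def overlapping_tasks(runs):
    """Return pairs of task names whose [start, end) windows intersect.

    runs: list of (name, start, end) tuples with datetime values, for example
    copied by hand from the start/finish lines in the server log.
    """
    ordered = sorted(runs, key=lambda run: run[1])
    pairs = []
    for i, (name_a, _start_a, end_a) in enumerate(ordered):
        for name_b, start_b, _end_b in ordered[i + 1:]:
            if start_b >= end_a:
                break  # sorted by start, so no later run can overlap this one
            pairs.append((name_a, name_b))
    return pairs


# Hypothetical night: design overruns into fixup, compact runs alone later.
runs = [
    ("design", datetime(2000, 1, 1, 1, 0), datetime(2000, 1, 1, 3, 30)),
    ("fixup", datetime(2000, 1, 1, 2, 0), datetime(2000, 1, 1, 4, 0)),
    ("compact", datetime(2000, 1, 1, 5, 0), datetime(2000, 1, 1, 6, 0)),
]
overlaps = overlapping_tasks(runs)
```

With the sample data, the only pair reported is design with fixup.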
HemanthaKumar :

"Document NT000002CA in database E:\Lotus\Domino\Data\Mail\username.nsf is damaged: Document attachment is invalid"
is the error on the console for each corrupted database... sometimes it says "Database xxx.nsf is CORRUPT! Now Read-Only!"
In the client i can have these errors


Ddenu :

I don't have a mixed environment, but I'm going to take a look at the templates, thanks.

Qwaletee :

You might be on to something there. It's too early to say whether it worked... so far I've had no corruptions since the last compact.
How do I find out whether redesign is running at the same time as other programs? I don't have design scheduled.
There are five ways to schedule tasks on a Domino server:

1) Using Program documents in the Directory/NAB, as I believe you did

2) TasksAtXXXX

3) OS-level scheduler

4) Scheduled agent that executes a command (or that creates a program document on the fly)

5) External entity that pushes a command to the console (a management console program might do this, like BMC or Intelliwatch)

3/4/5 are relatively rare.  But almost everyone uses 2, since Domino installs with it.

Look in Notes.ini for lines like ServerTasksAt1=, ServerTasksAt2=, etc.  When Domino starts up, it automatically adds items from these lines to its schedule.

The standard Lotus install places, as I recall, Catalog and Design in ServerTasksAt1.
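
To list what those lines schedule without eyeballing the whole file, here is a minimal sketch. It takes the text of notes.ini as a string (reading the file and choosing its encoding is left to you), and the function name is made up for illustration. The number after ServerTasksAt is the hour of day.

```python
import re

# Matches "ServerTasksAt<hour>=<comma-separated tasks>" but not plain "ServerTasks=".
_TASKS_AT = re.compile(r"^\s*ServerTasksAt(\d{1,2})\s*=\s*(.*)$", re.IGNORECASE)


def scheduled_server_tasks(ini_text):
    """Map hour -> list of task names from ServerTasksAtN= lines in notes.ini text."""
    schedule = {}
    for line in ini_text.splitlines():
        match = _TASKS_AT.match(line)
        if match:
            hour = int(match.group(1))
            tasks = [t.strip() for t in match.group(2).split(",") if t.strip()]
            schedule[hour] = tasks
    return schedule
```

Comparing the result against the Program documents in the Directory shows quickly whether something like Design is scheduled twice, or close to fixup or compact.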
Hi.

Back again to this issue after spending long hours implementing a document management system in
our insurance company.

I reported this problem to Lotus and still have no valid answer. They asked me to send names.nsf, log.nsf and notes.ini
to their FTP site, which I have just done.

Compact, fixup and updall are still on hold.

Last night, around 5:00 am, when I was finally done and ready to go home, I started receiving errors
for ALL databases and templates, stating corruption...
(Unable to update activity document in log database for xxxx.nsf: Database is corrupt -- Cannot allocate space)
I got really scared :)

In my notes.ini there was an entry for ServerTasksAt5=Statlog.

I deleted log.nsf (it was corrupt) and statrep.nsf and restarted the server. The problem
was solved, as none of the databases reported were really corrupt (only log.nsf).

At this moment I have all scheduled programs on hold and I have removed all ServerTasksAt entries from notes.ini...
guess I won't be able to do any maintenance on this server until the problem is solved.. :(

The Lotus support team pointed me to technote #1095372, but it only covers compact, and my DBs are getting corrupted
with compact stopped. The technote doesn't give any definitive solution; it only suggests a way to repair DBs.

Damn, I wish I was surfing..
Statlog updates the "Database Sizes" section of log.nsf.
Well, I guess this is going nowhere... should I split points and close the question, or leave it open
for some time to see what happens?

(still no Lotus Support feedback)

Note about splitting: from my point of view, the experts most active/interested in this thread, by activity:

1st - qwaletee
2nd - HemanthaKumar
3rd - .... ?

We didn't fix it for you.  Ask for a refund.
Yeah, that would be logical.... I thought about waiting, but since the question is quite old now, I guess no one else is going
to check it.

(Don't you guys think that every time a question gets a comment it should be moved to the top of the list again?)


Thanks for your contributions and effort.
********************* Email from Lotus Support :  **********************************

Regarding the above mentioned PMR.

It will not be possible for us to tell you what caused the corruption in the databases provided, since no specific debug parameters were set to capture the DBMARK corrupt messages in the log.
These will usually give us an indication of what caused it on the server.

In order to resolve the current possible corruption on the server, please run the following maintenance tasks while the server is down:

nfixup -f -j
nUpdall -r
nCompact -c -i

Please note that if you are running Transactional Logging, you will have to make a new backup of the DBs, since compact -c will change the DBIID of the DBs.

In order for us to capture the DBMark corrupt messages that accompany the database corruption, please enable the following settings in the Notes.ini on the relevant server. Please remember to press Enter after each line in the Notes.ini and also to restart the server for the changes to take effect.

debug_threadid=1
debug_outfile=c:\debug.txt

**************** END ***************************

Any comments?
Yep, wait and see.  And don't forget to remind the support tech that management is considering dumping Notes over this issue :)
ASKER CERTIFIED SOLUTION
Lunchy
Lunchy,

Above and beyond the call of duty!

Thanks.
Hello,
João, I have a problem identical to the one you reported at the link below:
https://www.experts-exchange.com/questions/20745034/Mail-db's-are-getting-corrupt-each-day.html?query=notes+database+corrupt+cluster&searchType=all

So far I haven't been able to find the solution. Did you manage to identify what was corrupting the databases?

If you can help me, I'd appreciate it.

Marcelo Sodré Plachevski
The problem was probably due to a combination of causes: lack of disk space, a corrupted server log database (log.nsf), etc...

But it isn't good to discuss this in Portuguese in a question that is already closed. Please open a new question or send mail to:  joao DOT pimenta AT mail DOT telepac DOT pt

Email spelled out to confuse spammers :)