techworksjh

Serious Windows XP network problem - delayed write errors and missing files and folders

This is a very serious problem and I really appreciate any help.

Here is the scenario:
I have been working on a network consisting of 6 computers running Windows XP Pro.  One of these machines acts as a file server and a database server for ACT 2005.  The server sits by itself in a closet and nobody touches it except for me.  The server is just a basic Windows box and is fast, but not fancy or unusual in any way.  It has a RAID 1 configuration on two 80 GB SATA drives with only a single partition.  Everything is on this partition.  

The client computers are all just HP desktop computers connecting to the server through a mapped drive linked through the server's static IP (i.e., the mapped drive connects to \\192.168.1.10\fileshare).  All computers are using TCP/IP as the only networking protocol and all have static IPs.

The networking gear includes a new Linksys 16-port 10/100 switch, a Cisco 800 series router, and a Cisco 800 series DSL modem.  There is also a single Linksys wireless access point (WAP54g) connected to the switch.  The access point is using WPA with a very long password.  

The client computers are all running ACT 2005 and access the server for the primary database.  As I said, they use mapped drives to the server (1 mapped drive per client PC) and they use Paperport 11 to view the shared files.  I recognize the issues regarding using Paperport in a network environment, but have had an impossible time convincing the owner of the business to stop using it.  

I flew from NY to Montana to completely redo this network in January and here is what I did and why.  The server was always the server for the ACT database, but the files accessed by Paperport were actually stored on and shared from one of the user's PCs.  I moved all of those files to the server and reset the permissions and ownership on all files and subfolders.  I reconfigured the network so that computers used static IP addresses instead of DHCP, and I configured mapped drives to connect using the server IP instead of the netBIOS name.  I disabled automatic discovery of network shares and printers on all computers.  I removed IPX protocols from 2 of the clients.  I also did extensive spyware and virus scans on all computers and installed Avast antivirus on all machines, including the server.  I did a lot to improve the speed of all computers through disabling system restore, error reporting, indexing, etc.  I spent 4 days from 8am to midnight completely redoing this network.  I also changed the connection speed setting in the driver for the server's network card.  It was set to 100 Mbps half-duplex for some crazy reason.  I set it to full-duplex obviously.  I also formatted two of the computers and did very careful full reinstalls of Windows and all software.  

When I left, everything was a million times better than before.  Network access was much faster across the board.  Access to the ACT database was significantly improved, and people were not having Paperport crash all the time.  Then, about five weeks after I left, one user (using one of the computers I formatted and reinstalled) was having periodic freezes in both ACT and Paperport, but the freeze-ups were not affecting anyone else at all.  I suspected that she was just being impatient and clicking on things like crazy when she experienced even slight network lag (she seems like that type of user).  Then, a few days after that, she got a delayed-write error while saving a shared Excel spreadsheet that was on the server.  Then, several days later, two other users got a delayed-write error while saving the same document.  I ran a chkdsk /r on the server and nobody got any more errors for about two weeks.  I also completely remade the Excel file that caused the original error because I thought it was corrupt.  More than two weeks went by without any freeze-ups or delayed-write errors, but today someone called and said that they got a delayed-write error on a Word doc they were saving and then the strangest thing happened:

The new Excel spreadsheet that I remade after the first delayed-write errors mysteriously disappeared.  As I was asking all the users if they might have deleted it or moved it by accident (everyone swears they didn't and I couldn't find the file even after searching all computers), someone else noticed that an entire directory had disappeared (in a folder separate from the Word file).  I ran 'chkdsk' on the server and it gave me the following errors:

CHKDSK discovered free space marked as allocated in the master file table bitmap
CHKDSK discovered free space marked as allocated in the volume bitmap

I got everyone out of the server and ran a chkdsk /r and rebooted the server.  The logfile read as follows:

A disk check has been scheduled.
Windows will now check the disk.                        
Cleaning up minor inconsistencies on the drive.
Cleaning up 6 unused index entries from index $SII of file 0x9.
Cleaning up 6 unused index entries from index $SDH of file 0x9.
Cleaning up 6 unused security descriptors.
CHKDSK is verifying file data (stage 4 of 5)...
File data verification completed.
CHKDSK is verifying free space (stage 5 of 5)...
Free space verification is complete.

  78140128 KB total disk space.
  15393476 KB in 58857 files.
     24300 KB in 9884 indexes.
         0 KB in bad sectors.
    161900 KB in use by the system.
     65536 KB occupied by the log file.
  62560452 KB available on disk.

      4096 bytes in each allocation unit.
  19535032 total allocation units on disk.
  15640113 allocation units available on disk.
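
As a quick sanity check, the totals in this report can be cross-checked against each other. The sketch below is only an illustration: it assumes the "in use by the system" figure already includes the 65536 KB log file, which is how the numbers appear to add up, and the function names are made up for the example.

```python
# Figures copied from the chkdsk report above.
ALLOCATION_UNIT_BYTES = 4096
TOTAL_UNITS = 19535032
FREE_UNITS = 15640113

REPORTED_KB = {
    "total": 78140128,
    "files": 15393476,
    "indexes": 24300,
    "bad_sectors": 0,
    "system": 161900,  # assumption: this already includes the 65536 KB log file
    "available": 62560452,
}


def units_to_kb(units, unit_bytes=ALLOCATION_UNIT_BYTES):
    """Convert a count of allocation units to kilobytes."""
    return units * unit_bytes // 1024


def report_is_consistent(report=REPORTED_KB):
    """Return True if the unit counts and the KB figures agree with each other."""
    used = (report["files"] + report["indexes"]
            + report["bad_sectors"] + report["system"])
    return (
        units_to_kb(TOTAL_UNITS) == report["total"]
        and units_to_kb(FREE_UNITS) == report["available"]
        and used + report["available"] == report["total"]
    )


if __name__ == "__main__":
    print(report_is_consistent())
```

If the figures agree, the report itself is internally consistent and shows no bad sectors, so it neither explains the missing files nor rules out a problem below the file system (for example, in the RAID layer).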

However, after I ran the chkdsk, the Excel file and the other folder are still gone.  Because I am obsessive about backing everything up each night (and online) we were able to just grab the missing files from the backup, but it is not good that this is happening and I know it points to a more serious problem.  I considered that one of the RAID drives might be failing, I disabled the wireless network in case someone got on and was deleting files, and I told everyone they would lose a hand if they so much as touched Paperport.  For now I am allowing everyone access to the ACT database and the shared files, but I really need to figure out what's up.  Thanks for taking the time to read all this and thanks very much for your help.

Justin
Kenneniah

I would do a full diagnostic on the hard drives using the manufacturer's utility.
Sounds fishy. Are you sure a user isn't messing with the files? Do you have auditing turned on?
You may also get some write delays with lots of users and a cheap SATA RAID card or onboard SATA RAID.

ASKER

Here are a couple of additional thoughts:

First, the computer acting as a server was used as a client computer for a while.  I never changed any hardware out and I never had trouble like this with the computer while it was a client PC.  

Second, about 20 minutes after I wrote my initial post, the user (whom I will now call Jane) who had the delayed-write error initially (and who is raising my suspicions by the minute) called and said that "something happened" while she was using ACT and now the database has reverted back to 3/15/07.  All notes, changes, and entries past that point do not exist.  She said that she clicked on the Lookup menu and chose to look up by last name and "all of a sudden" she got a status screen indicating that the database was merging.  She said she hit Cancel and then ctrl-alt-del and killed ACT because "it wasn't cancelling".  I just got off the phone with ACT tech support and they had me go through all the database repair/rebuild steps, but nothing is better.  The database seems fine otherwise, but there is nothing past 3/15/07 as far as I can tell.  The ACT guy seemed quite sharp and said he couldn't think of anything she could easily do to cause this type of problem.  His only thought was that she tried to restore a database from a previous date, but you can't do that in ACT without logging out all connected users first.  WTF?

Shayneg is right that something seems fishy, but I also don't want to rule out hardware/software problems, as a delayed-write error seems like a difficult thing to cause intentionally.  I would certainly entertain the possibility of a combined problem; that is, a hardware/software problem combined with Jane deleting files and otherwise sabotaging the network in some way.  This woman's husband is Cisco certified and knows his stuff, and he is currently on unemployment (as he has been for more than a year now).  So, he could have the skills to hack into the network and he certainly has the time and the inside source necessary.  I just don't know what the motivation would be.

Furthermore, the router in place at this office is provided by Ameriprise Financial (formerly American Express Financial Advisors).  Unfortunately, this router is quite well protected with passwords because Ameriprise doesn't want anyone messing with the VPN settings, etc.  

I just don't know what to say.
Now to really blow things out of the water.  I just got another call from this same office.  I had the users look through the shared files on the server, and it seems that all the files have reverted back to how they were on 3/15/07.  Files and folders created after 3/15/07 do not exist in the file structure, and files already in existence by 3/15/07 have reverted back to their 3/15/07 state.  For example, one Excel spreadsheet made before 3/15/07 has all of the edits up to that date, but a worksheet added to the workbook after 3/15/07 is missing from the file.  When I do a file properties check on these crazy files, it shows last modified dates of 3/14/07 and 3/15/07, depending on the file.  Thus, I find it vital to include more information about the network configuration:

First, I found out that another completely trusted user was watching Jane's screen while the ACT destruction occurred.  This witness says that she watched Jane click on the "Lookup" menu in ACT and that ACT proceeded to declare that it was "Merging".  Merging what exactly, we may never know.  This witness says she didn't see Jane do anything questionable, but that Jane was "clicking really fast on things".  She saw Jane hit ctrl-alt-del and kill ACT.

So, on with the system configuration.  The backup system is as follows:  we have two external USB drives that have folders on each labeled Monday through Friday.  Each day, a user (the one who witnessed Jane mess up ACT) trades the hard drives out at 5:00pm and takes the extra drive home at night.  We use Acronis TrueImage 9 to create a full image of the server in the folder of the day the backup is being performed.  So, the user connects drive1 to the server Monday at 5, a full archive is created Monday night at 11:55pm and is stored in the "Monday" folder on the backup drive, overwriting the backup currently in that folder.  (Because of the drive swapping pattern, next Monday's backup would be saved to the Monday folder on drive2).  On Tuesday at 5, she swaps the drives and the Tuesday backup is written to drive2 (and again, because of the drive swapping pattern, next Tuesday's backup would be saved to the Tuesday folder on the opposite drive, drive1).  With this system, we get almost 2 weeks of backups spread between two drives.  This is the only backup system we currently use, although we were planning on adding online backup in the next few weeks to add to the redundancy.  Also, system restore is disabled on the server, so its use isn't even possible.  
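
The rotation described above can be sketched in a few lines. This is only an illustration of the pattern, not anything Acronis runs: the anchor date (a Monday on which drive1 is assumed to be connected), the function names, and the assumption that no weekday is ever skipped are all made up for the example.

```python
from datetime import date, timedelta

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]


def business_days_between(start, end):
    """Count Monday-Friday days in the half-open range [start, end)."""
    count = 0
    day = start
    while day < end:
        if day.weekday() < 5:
            count += 1
        day += timedelta(days=1)
    return count


def backup_target(day, anchor_monday):
    """Return (drive, folder) for the backup taken on the night of `day`.

    Assumes drive1 is connected on `anchor_monday` (hypothetical) and the
    drives are swapped every weekday without exception.
    """
    if day.weekday() >= 5:
        raise ValueError("no backup is scheduled on weekends")
    swaps = business_days_between(anchor_monday, day)
    drive = "drive1" if swaps % 2 == 0 else "drive2"
    return drive, WEEKDAYS[day.weekday()]


if __name__ == "__main__":
    anchor = date(2007, 3, 5)  # hypothetical Monday with drive1 connected
    for offset in range(0, 12):
        day = anchor + timedelta(days=offset)
        if day.weekday() < 5:
            print(day.isoformat(), backup_target(day, anchor))
```

Because there are five weekday folders and the drives alternate daily, a given weekday lands on the opposite drive the following week, so ten distinct (drive, folder) slots exist before anything is overwritten.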

Because some of the users refuse to keep their vital files on the server (and instead keep them on their desktops), I also have Acronis running on each client computer.  I have Acronis back up just the My Docs folder on each client to the backup drive connected to the server.  This backup occurs only once a day and happens when the user shuts their computer down (which all do at the end of the day).  Those backups are very small (< 20 MB) and take only a moment.  Those backups are stored in folders according to user names on the backup drives mentioned above.  More to come...

I just checked the Acronis log files and the last entries are on 3/15/07.  
ASKER CERTIFIED SOLUTION
Alan Huseyin Kayahan
This solution is only available to members.
Not sure when the craziness will stop, but I went and looked at the system logs for Windows and there is a complete lack of log entries from 3/15/07 at 10:46 pm to 3/27/07 at 12:29 pm.
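
A gap like this is easy to spot by eye here, but for longer logs a small script can flag it. The sketch below assumes the event timestamps have already been exported and parsed into `datetime` objects (how they are exported is left open); the sample values other than the two endpoints quoted above are invented for illustration.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, threshold=timedelta(hours=24)):
    """Return (previous, next) pairs where consecutive entries are
    further apart than `threshold`."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > threshold]


if __name__ == "__main__":
    # Only the 3/15/07 10:46 pm and 3/27/07 12:29 pm entries come from the
    # post; the others are hypothetical filler.
    sample = [
        datetime(2007, 3, 15, 9, 0),
        datetime(2007, 3, 15, 22, 46),
        datetime(2007, 3, 27, 12, 29),
        datetime(2007, 3, 27, 12, 35),
    ]
    for before, after in find_gaps(sample):
        print(before, "->", after, "gap:", after - before)
```

A server that is running normally writes events at least at every boot and service start, so a gap of more than eleven days suggests that the running system was replaced by an older copy of the disk rather than that nothing happened.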

A few hours have passed since I typed the last sentence, and I have come to a few conclusions, but I need some opinions.  I was scheduled for a VNC appointment (to update some software) for this company at 12 noon today.  Jane called me right around 11:20 (I remember looking at the clock and wondering why she would call me 40 minutes early) and said that the entire network locked up and that Paperport and ACT froze on all the computers.  This kind of system-wide freeze has not happened for some time.  We rescheduled for about 12:45 and spoke again at about 12:40, which was a few minutes before people started noticing that files were missing.

My theory is that Jane restored our Acronis backup from 3/15/07, which would explain the missing files, the messed-up database, and the break in the Windows log files.  Restoring an Acronis backup would take about 20-40 minutes, which, if she started it right before she called me, would have frozen the whole network (seemingly, because it would have restarted the server to complete the backup restoration), and that would explain the first 3/27/07 Windows log entry occurring right around 12:30, which was about 10 minutes before people started noticing that the files were missing.  I guess it all fits.  I'm guessing she didn't realize that the server would restart if she restored the system from Acronis, which would explain her somewhat frantic call to me before our appointment.  I also found out that she was probably the only person in the office around the time that she called me initially to let me know the network froze up.

This brings me to the final series of questions in my long-winded chronicle.  I've already presented this information in real-time to the employer (who confessed an hour ago that he decided to fire her a week ago), but I like Jane and I don't want to be the jerk who gets her fired if she doesn't deserve it.  Thus:

1) How can I collect more evidence about this?  This is also urgent, because I have to restore last night's backup tomorrow morning so that the office can continue to function.
2) Is there any way to prove that she did (or did not) copy any of the confidential client data from the server?  She may have copied such information between last night's backup and restoring the 3/15/07 backup, which would explain doing the 3/15/07 restore to cover her tracks (it did wipe everything out after all).
3) Would any of this have to do with the delayed-write errors we were getting, or is that likely a coincidence?  My theory is coincidence.
4) I am still concerned about her husband hacking into the network in some way.  Obviously the wireless AP was a weak point (it is currently powered off) and she had access to the WPA passphrase as she was in charge of the password lists (not my decision or recommendation).  He could potentially have hacked in through the router, but I really doubt that because I'm guessing Ameriprise has some hard-core security on those things.  Is there any way to determine whether he got in or not?  I doubt it because I have no access to the configuration or logs for that router.  

Thanks for all of those who took the time to read through this mess.  I have definitely learned from this, especially in regards to the physical security of networking equipment and servers.  The server is not in a locked room.  Stupid.  But then again, they probably would have given her the key to it.  The only computer with VNC access to the server (it has no keyboard, mouse, or screen) was kept in a locked room, but that person stupidly left her office open when she left for lunch (right around when I believe the backup restoration started), and for some reason the office decided to keep the VNC password for the server on the password list (even though only one person aside from me was supposed to know it).  I suppose I could have done MAC filtering on the AP because there are only a few computers that connect to it, but that wouldn't have stopped a clever hacker from spoofing the MAC and getting in.    

This will end fairly well considering that our backups are complete and up-to-date, but let this be a lesson to all.  Thanks again.

Justin
MrHusy,
  I overnighted two new drives to their office and we will replace them one at a time (rebuilding the RAID after each drive goes in).  This will rule out any drive problems.  Also, good call on the defrag.  It's easy to overlook.
Also, I'm sure the drive temps are OK because the server closet is cool and I have excellent cooling in the PC (and all the fans are working for sure).
There is no way a whole server just defaults back to an older day unless it has been imaged back with something like Acronis. I would say Jane has sabotaged your system; however, getting proof of this will be impossible, as when it was imaged back all logs etc. would have been overwritten. This is a very difficult situation indeed.
MrHusy - thanks for that good bit of info on relocating the log file. I can see many uses for something like that.  I have some serious damage control to deal with this morning for this office, but I'll test that out later today.

Shayneg - I'm glad you agree that the only way this could have happened is through Acronis because this sort of problem certainly doesn't happen by accident.
More exciting news from the homefront.  This morning I went to restore the Acronis backup to the RAID mirror, but when it came time to select the partition to install to, there were two drives available.  This should not be, as the RAID mirror has only a single partition.  This means of course that the RAID mirror has broken, and most likely it broke at 10:49pm on 3/15/07.  Why the RAID controller decided to only use one of the drives and then just suddenly switch back to the drive it stopped using on 3/15/07 is completely beyond me.  This sounds more like a RAID controller/driver issue than a hard drive failure issue, but I still am not sure.

When I run the nVidia Mediashield (Windows RAID util), it actually seems to show two single-drive mirrors, one each for each of the drives installed.  The screen kind of looks like this:

MIRRORING - Degraded - 74.53 GB
      WDC WD800JD-00LSA0      -     Healthy     -     74.53 GB     -     SATA Primary Master
MIRRORING - Degraded - 74.53 GB
      WDC WD800JD-00LSA0      -     Healthy     -     74.53 GB     -     SATA Secondary Master

When I go into My Computer and open the C:\ (drive being booted to), it is the 3/15/07 drive, but when I open drive d:\ (the other former RAID drive) it contains all the data up to yesterday morning right before the system reverted back to 3/15/07.  This is pretty interesting.  My theory now is that the mirror broke on 3/15/07 and that the RAID controller continued on using just one drive.  Then for some reason (perhaps the RAID controller itself is toast) it decided to use the drive it quit using on the 15th.  
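
With both halves of the former mirror visible as separate drives, one way to document exactly how far they have diverged is to compare the two copies of the shared folder file by file. The following is a rough sketch only: the folder paths in the usage note are hypothetical, and it compares size and modification time rather than file contents.

```python
import os


def snapshot(root):
    """Map each file's path (relative to root) to its (size, mtime)."""
    files = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            info = os.stat(full)
            files[os.path.relpath(full, root)] = (info.st_size, int(info.st_mtime))
    return files


def compare_trees(old_root, new_root):
    """Report files that exist on only one side or differ in size/mtime."""
    old, new = snapshot(old_root), snapshot(new_root)
    return {
        "only_in_new": sorted(set(new) - set(old)),
        "only_in_old": sorted(set(old) - set(new)),
        "changed": sorted(p for p in set(old) & set(new) if old[p] != new[p]),
    }


if __name__ == "__main__":
    # Hypothetical paths: the stale copy on C: and the current copy on D:.
    result = compare_trees(r"C:\fileshare", r"D:\fileshare")
    for label, paths in result.items():
        print(label, len(paths))
```

If the mirror simply split on 3/15/07, everything created or edited since then should show up under `only_in_new` or `changed`, and nothing should appear under `only_in_old`. Saving that listing before rebuilding the array preserves a record of the state of both drives.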

Any thoughts on this one?  And could this still be related to Jane restoring a backup?  Perhaps she could have restored the backup from Acronis onto just one of the drives.  Acronis does allow for lower-level access to drives than Windows provides.  
What do the event logs say?
The plan is as follows, after running diagnostics:

1) Take the 3/15/07 drive out and replace it with a new drive
2) rebuild the RAID array using the new drive and the drive that was last updated before the switch-over to the 3/15/07 drive.
3) after the array is back up, remove the second drive and replace it with a new drive as well
4) rebuild the array using both brand-new drives.  Assuming the new drives work, this will rule out the old drives as the problem.  

Any thoughts?
Hi Justin,
What is the current status of your issue(s)?
Regards
You could try adding syslogging from the server to another PC, either on another site or in a secure location. The Kiwi stuff is free and worth a try: http://www.kiwisyslog.com/index.php
Thanks everyone.  I gave most of the points to Mr Husy because he was there from the beginning, gave a lot of good advice, and followed up at the end.  iccadmin will receive some points as well for the idea of rebuilding the RAID array.  I think we got things all figured out with this problem and the office in question will be getting a new server soon as I believe the RAID controller on the mobo is toast.  Thanks again everyone.

Justin