
Netware 5 optimization?

S Connelly asked
There are a number of questions here, so I'm willing to be generous with points to more than one person! I'll start with 200pts... but for real results, I'm willing to hand out 1000 pts in total.

Questions breakdown:
- I'm looking for ways to get this server functioning faster, ideally without hardware changes
- I need to add an additional NIC (to 'double' bandwidth); what is the proper procedure?
- I'm looking for tools that will help with benchmarking, performance tuning and monitoring/trends of Netware functions.

I have read just about every document on Novell performance tuning and none of them address/help with my problems.

My dirty cache buffers are staying around 1300, total cache buffers is at 89000 and original cache buffers is at 130000 (rounding off all amounts).
Further, utilization is around 45% and the Novell server is very, very slow!!
In fact, it is so slow that accessing the console monitor is difficult!

This is completely ridiculous; my NT servers have slower processors (PII 250s) and are doing as much work (Exchange and MRP servers) with much better response times! I know that Novell can do better!!

Netware specs:
Intel 440BX-2 motherboard, 650MHz PIII (100MHz bus)
512MB DIMM (PC133)
DPT Smart Cache IV w/RAID and 64MB cache options
4 x 18.9GB Seagate SCSI-II fast/wide (RAID 5 configured)
IDE 20GB drive used as a common shared volume
Novell Netware 5.0 w/SP6
1 - 3COM 3C980B-TXM (want to put in another but I need details on how this is done - Extra bonus points!)
Server connected to a HP 4000M switch
45 connections using (mostly) 3COM 3C905TX NICs, Win9x, WinNT SP6a, Win2K SP2 all with latest Netware clients (some with SP3 client patch)
Novell protocol is IPX (clients configured as IPX only) and IPX traffic is 70% of all traffic (according to Sniffer); the rest is primarily IP (for Internet access and NT server connections)

I made the following changes last night (based on Novell TIDs):
Max concurrent disk cache writes: was 750 -> changed to 4000
Min packet receive buffers: was 128 -> changed to 1000
Max packet receive buffers: was 500 -> changed to 5000
Min. dir cache buffers: was 500 -> changed to 1500
Max dir cache buffers: was 2000 -> changed to 5000
Days untouched before compression: 14 no change
Dirty disk cache delay time: was 3s/6 ticks -> changed to 1s/6 ticks
Maximum service processes: was 500 -> changed to 1000

The results of the above changes?   WORSE!  Much worse!

Upon arrival this morning, I found the backup had only completed 10% of 4.5GB (incremental backup). Normally, incremental backups only require 2 hours to complete. The estimated time to completion for this ArcserveIT (ver 6.6 w/latest patches) job was around 431 hours!!!
Further, my attempt to kill the job was not only difficult (because of the extremely slow console), it also caused the server to ABEND (the 2nd one this year, both due to cancelling a misbehaving ArcserveIT job). Please keep in mind, the server may be slow but it's rock-solid, with less than 10 minutes of downtime a year (off-hour service time excluded, of course).

BTW, a simple deletion process (with no other clients connected at the time) of 120MB and 2400 files requires over 5 minutes to complete and is enough to slow the server as already described.

So what do I do? How do I make this server run as fast as Novell is supposed to be? What tools are available to help diagnose problems and/or monitor most Novell parameters with trend graphs?

I have had fast speeds since Netware 4.11, but now it's so slow that it's adversely affecting its 45 clients!!

HELP PLEASE!!

Commented:
This is for your cache tuning only.

Your Dirty Cache Buffers should go from any number back to 0 after a few seconds. The same goes for "Current Disk Requests". The first is the queue of changes waiting to be written to disk; the second is the queue of requests waiting on the disk.

First, check if your server has enough RAM to function properly. Check Monitor|System Resources (in the menu): the percentage for Cache Buffer Memory should CERTAINLY be above 20, and ideally above 50. If this is not the case, add RAM, the sooner the better.
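As a rough cross-check of this rule of thumb against the figures in the question, the Cache Buffer Memory percentage can be approximated as Total Cache Buffers divided by Original Cache Buffers. This is an assumption for illustration, not necessarily the exact formula Monitor uses; `cache_buffer_percent` and `verdict` are hypothetical names.

```python
# Rough sketch: approximate Monitor's "Cache Buffer Memory" percentage as
# total cache buffers / original cache buffers. Assumption only; MONITOR.NLM
# may compute its figure differently.


def cache_buffer_percent(total_buffers: int, original_buffers: int) -> float:
    """Return total cache buffers as a percentage of original cache buffers."""
    if original_buffers <= 0:
        raise ValueError("original_buffers must be positive")
    return 100.0 * total_buffers / original_buffers


def verdict(percent: float) -> str:
    """Apply the rule of thumb above: at least 20%, ideally above 50%."""
    if percent < 20:
        return "add RAM"
    if percent < 50:
        return "acceptable, but more RAM would help"
    return "ok"


if __name__ == "__main__":
    # Figures from the question (already rounded by the poster).
    pct = cache_buffer_percent(89_000, 130_000)
    print(f"{pct:.1f}% -> {verdict(pct)}")  # 68.5% -> ok
```

With the poster's numbers this comes to about 68%, above the 50% guideline, which suggests the slowdown is not simply a shortage of cache memory.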

Then, presuming you have some RAM to "organise better", you should decide what is more important: reading or writing.  You devoted almost all caching to write-operations, leaving the server insufficient time for read-operations.  Bad reading from cache = a lot of reading directly from disk = high CPU.

There is no real number I can give from here, since it all depends on what your people are doing on the server (reading a lot of Word docs? accessing big databases, mostly to make changes?) and how your directories are organised. A few large directories (thousands of files) require more directory cache buffers, for example.

While you make changes, check the Dirty Cache Buffers and Current Disk Requests all of the time, after every change.  Again, they should return to zero every so many seconds.
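If you jot down the Dirty Cache Buffers (or Current Disk Requests) reading from Monitor every few seconds, a tiny script can tell whether the counter really keeps returning to zero. This is a minimal Python sketch; the sample readings are made up (the "stuck" series only mimics the ~1300 figure in the question), the 10-second limit is an assumed interpretation of "every so many seconds", and `longest_nonzero_run` and `drains_regularly` are hypothetical helpers.

```python
# Hypothetical helper: given Dirty Cache Buffers (or Current Disk Requests)
# readings noted from MONITOR at a fixed interval, check whether the counter
# keeps returning to zero instead of staying high.
from typing import Sequence


def longest_nonzero_run(samples: Sequence[int]) -> int:
    """Length of the longest streak of consecutive non-zero readings."""
    longest = current = 0
    for value in samples:
        current = current + 1 if value > 0 else 0
        longest = max(longest, current)
    return longest


def drains_regularly(samples: Sequence[int], interval_s: float, max_wait_s: float) -> bool:
    """True if the counter never stays above zero for longer than max_wait_s."""
    return longest_nonzero_run(samples) * interval_s <= max_wait_s


if __name__ == "__main__":
    healthy = [0, 120, 340, 0, 0, 80, 0]                # made-up readings
    stuck = [1300, 1290, 1310, 1305, 1298, 1302, 1300]  # resembles the question
    print(drains_regularly(healthy, interval_s=2, max_wait_s=10))  # True
    print(drains_regularly(stuck, interval_s=2, max_wait_s=10))    # False
```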

So, let's give it a shot.  All of these are in Monitor|Server Parameters.  
**In Directory Caching (remember, these buffers store a kind of FAT per directory: which file is where), the Dirty Directory Cache Delay Time indicates how fast a change should be written. If this is extremely short (0.5 seconds or lower), combined with a high Max Concurrent Directory Cache Writes, your server will write a lot of changes very soon, and hence has less time to read. You may have to balance this a bit. Directory Cache Allocation Wait Time indicates how smoothly (=short delay) you want the server to create buffers when needed. You may set this pretty short during a test phase, just to see how many the server thinks it needs. Don't leave it extremely short permanently, since a peak in usage may then create buffers that are never needed again. The Non-Referenced Delay indicates how long a buffer stays in memory. The first user to come in and look for a file will read from disk. The second will benefit from the first, since NetWare put that info in cache, provided that user comes within the indicated delay. So, if you have a few large directories, you may want to set this pretty high (like 15 minutes or longer), so that the same info does not have to be read from disk every 5 minutes, using CPU power. However, with small directories this may work against you, since the server consumes cache without really re-using it.
**File Caching then. This is for the actual data of files. Max Concurrent Disk Cache Writes indicates how many changes you want to write in one go. More writes = fewer reads. You may need to balance this one a bit more too, certainly in combination with the Dirty Disk Cache Delay Time. This one too you may want to set extremely low just to see how many are needed (the lower the figure, the more readily NW will allocate blocks, but the higher the chance it will allocate based on unusual circumstances, like someone copying a few hundred MB).
By the way, you set Max Dir Cache Buffers to 5000. How many are in use (check the main screen of Monitor)? Maybe 5000 is not enough and your server would like to allocate more, but you limited it. Again, this depends on your directory structure.
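The Non-Referenced Delay point in the directory caching item above can be illustrated with a toy model. This Python sketch is a deliberate simplification, not NetWare's actual cache algorithm: an entry counts as cached only if it was last referenced within `delay_s` seconds, and cache capacity is ignored. `count_hits` and the access trace are hypothetical.

```python
# Simplified model (not NetWare's actual algorithm): an entry stays cached
# only while it has been referenced within `delay_s` seconds. Capacity limits
# are ignored.
from typing import Dict, List, Tuple


def count_hits(accesses: List[Tuple[float, str]], delay_s: float) -> Tuple[int, int]:
    """Return (hits, misses) for time-ordered (timestamp, directory) accesses."""
    last_seen: Dict[str, float] = {}
    hits = misses = 0
    for t, key in accesses:
        prev = last_seen.get(key)
        if prev is not None and t - prev <= delay_s:
            hits += 1    # still cached: served from memory
        else:
            misses += 1  # never cached or aged out: read from disk
        last_seen[key] = t
    return hits, misses


if __name__ == "__main__":
    # Hypothetical trace: someone opens the same large directory every 6 minutes.
    trace = [(i * 360.0, "SYS:DATA") for i in range(5)]
    print(count_hits(trace, delay_s=5 * 60))   # (0, 5)
    print(count_hits(trace, delay_s=15 * 60))  # (4, 1)
```

With a delay shorter than the gap between visits, every access goes to disk; raising it past that gap turns all but the first access into cache hits, at the cost of holding the buffer in memory longer.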

I don't know how much space I have here, but on first sight, these are the things to check. Try figuring out how many files you have (check the "used directory entries" on your volumes, in Monitor|Volumes) and how many are open at the same time, on average. It is also important to have an idea of the directory structure.

Jim

Commented:
That is a fairly high number of Dirty cache buffers for a RAID controller w/ 64 meg of memory.  What is your LRU sitting time?

Commented:
Listening until I get to work tomorrow

Commented:
I don't believe adding the second NIC at this time will help. That only doubles the outbound path and will not affect the high utilization. Off hours, I might try remming out as many NLMs or programs in AUTOEXEC.NCF as possible, bringing the server up, and then adding back one NLM or application at a time; you might very well identify the culprit this way. Given the hardware specs, it would be unusual for utilization to stay this high; like the dirty cache buffers, it should drop back down after short peaks. If this were my server, I would put all the SET changes back to default, perform the test above, and then implement other changes only one at a time. Something also seems off with your Original Cache Buffers and Total Cache Buffers. I don't have the TID to calculate these, but I have very similar hardware and memory, and my Original Cache Buffers is 169,000 against your 130,000. Also, my Total Cache Buffers is 135,000 against your 89,000, although this last figure varies dramatically with loaded NLMs.
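One way to follow the "one change at a time" advice is to write the sequence down before touching the console. The Python sketch below only prints SET commands; it does not talk to the server. It assumes the "was" values from the question are the values to revert to (they may or may not be the NetWare defaults), and it leaves out Minimum Packet Receive Buffers, which may only take effect from STARTUP.NCF, and Dirty Disk Cache Delay Time, whose value format differs from the plain integers used here. `tuning_plan` and `set_line` are hypothetical names.

```python
# Hypothetical helper: print SET commands that restore the pre-change values
# from the question, then re-apply the new values one parameter per step.

PREVIOUS = {
    "Maximum Concurrent Disk Cache Writes": 750,
    "Maximum Packet Receive Buffers": 500,
    "Minimum Directory Cache Buffers": 500,
    "Maximum Directory Cache Buffers": 2000,
    "Maximum Service Processes": 500,
}

CHANGED = {
    "Maximum Concurrent Disk Cache Writes": 4000,
    "Maximum Packet Receive Buffers": 5000,
    "Minimum Directory Cache Buffers": 1500,
    "Maximum Directory Cache Buffers": 5000,
    "Maximum Service Processes": 1000,
}


def set_line(name: str, value: int) -> str:
    """Format one console SET command."""
    return f"SET {name} = {value}"


def tuning_plan(previous: dict, changed: dict) -> list:
    """Step 0 reverts everything; each later step changes exactly one parameter."""
    steps = [[set_line(n, v) for n, v in previous.items()]]
    for name, value in changed.items():
        if previous.get(name) != value:
            steps.append([set_line(name, value)])
    return steps


if __name__ == "__main__":
    for i, step in enumerate(tuning_plan(PREVIOUS, CHANGED)):
        print(f"# Step {i}: apply, then watch utilization and dirty cache buffers")
        for line in step:
            print(line)
```

Keeping each step to a single parameter means that if utilization gets worse, the last step printed is the suspect.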

Commented:
You might want to check the auto-negotiate settings of the NIC and the switch. I've seen similar problems if one is set to full-duplex and the other to half. It'd also cause some pretty high utilization due to excessive packet re-sending.

Commented:
Thought of this later: load SERVER -NA (which skips AUTOEXEC.NCF) and watch the utilization, then manually load the NLMs that AUTOEXEC.NCF usually loads and see if you can find the guilty party this way.
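To make the SERVER -NA test repeatable, the LOAD lines can be pulled out of AUTOEXEC.NCF in their original order and worked through one by one. A minimal Python sketch, assuming a copy of AUTOEXEC.NCF has been saved as plain text on a workstation; it treats lines starting with REM, # or ; as comments (an assumption about the file's conventions), only recognizes explicit LOAD lines, and ignores modules loaded by bare name. `modules_to_test` is a hypothetical helper.

```python
# Hypothetical sketch: list the modules AUTOEXEC.NCF loads, in order, so they
# can be loaded by hand one at a time after starting with SERVER -NA.
import re
from typing import List

COMMENT_RE = re.compile(r"^(REM\b|#|;)", re.IGNORECASE)
LOAD_RE = re.compile(r"^LOAD\s+(\S+)", re.IGNORECASE)


def modules_to_test(ncf_text: str) -> List[str]:
    """Return module names from LOAD lines, skipping blank and comment lines."""
    modules = []
    for raw in ncf_text.splitlines():
        line = raw.strip()
        if not line or COMMENT_RE.match(line):
            continue
        match = LOAD_RE.match(line)
        if match:
            modules.append(match.group(1))
    return modules


if __name__ == "__main__":
    sample = "REM sample only\nMOUNT ALL\nLOAD CONLOG\nload arcserve\n; LOAD OLDTOOL\n"
    for i, nlm in enumerate(modules_to_test(sample), 1):
        print(f"{i}. LOAD {nlm}  (then re-test the file copy)")
```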

Commented:
I am curious - does the Dirty Cache Buffers jump to 1300 immediately after the server is brought up or does it ramp up slowly?  Trying to figure out if there is a lot of network traffic to the file system or just a bunch of junk on the wire that is causing NetWare to suck up buffers...
Just check one thing out before you go too far. We had this problem over and over on NetWare 4, and it certainly still happens on 5. Check for bindery contexts. If you have some, disable them. (Hide from the phone at this point!) You need to watch the server for about 5 minutes to see if the utilization goes down. You may have to re-enable the bindery context then, because the users are about to lynch you, but if it's like our server, %util will become brilliant! Have a witness, because nobody will believe you if it resolves the server issues.
If that works, your issue is that the bindery process is single-threaded and can take unlimited processor time. What we had done, in all innocence, when a user required a new printer was to go to the network, select the server, browse it (the printers happened to live in a bindery context of the server, and so were visible), and select the printer. Windows then HAMMERS this bindery process, not only when you are printing, but also when it gets bored and decides to check whether the printer is still there! You can tell how the printers are connected on the desktop: \\servername\printername, or a full NDS-style path through the tree.
Bindery logons would do the same!
It only takes a few users, especially if they are heavy printers, to take down a server this way, with no visible signs.

Commented:
Did you ever try bringing up the server -na or DSpoole's suggestion??
S Connelly, Technical Writer (Author)

Commented:
Gosh... what was I smoking when I asked this question? I can't even remember writing it!

To re-state the problem more clearly....
Any time a large file (500MB and up, for example) is copied to my main Netware 5 (SP3) server, access to the server becomes difficult. For instance, a 2.8GB file takes about 40 minutes to copy from an NT server to the Netware server, which is too slow for a 100Mbps network. During this copy, the network slows to a crawl. In fact, it is so bad that if I attempt to access a volume on the Netware server, the workstation appears to lock up because it reacts so slowly. However, accessing the NT server seems as fast as if it were not involved in a copy at all.
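The reported copy rate can be checked with a little arithmetic. A Python sketch, assuming 1 GB = 1024 MB of 2^20 bytes and ignoring protocol overhead; `effective_mbps` is a hypothetical name.

```python
# Back-of-the-envelope check of the copy rate reported above.
# Assumptions: 1 GB = 1024 MB, 1 MB = 2**20 bytes; protocol overhead ignored.


def effective_mbps(size_gb: float, minutes: float) -> float:
    """Average throughput in megabits per second (10**6 bits/s)."""
    bits = size_gb * 1024 * 2**20 * 8
    return bits / (minutes * 60) / 1e6


if __name__ == "__main__":
    rate = effective_mbps(2.8, 40)
    print(f"{rate:.1f} Mbps, {rate / 100:.0%} of a 100 Mbps link")
    # 10.0 Mbps, 10% of a 100 Mbps link
```

Roughly 10 Mbps is about a tenth of the link's nominal rate, so a link-level problem (speed or duplex) is a reasonable thing to rule out before blaming the server's tuning.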

From the Netware console, things are no better. Switching between screens can sometimes take several minutes; however, Monitor shows only nominal utilization (20-40% on average).

Since this copy testing occurred during the evening, there was little or no other traffic on the network.

Any ideas about how I can fix this problem?

What tools are recommended to help me diagnose this problem (cost is no object)?

Although I am considering a move to Netware 6, I still want to know what can be done to fix this problem, if only for the sake of knowledge.

Commented:
It almost sounds like you have a network bottleneck. What kind of network hardware are you using?

Commented:
Oops, HP4000.  That is their low end chassis solution isn't it?  Does the switch report errors or duplex mismatch?

S Connelly, Technical Writer (Author)

Commented:
HP4000 is their flagship switch with a 1gbps backbone.
No significant errors. Duplex is full.

Commented:
It still sounds like a duplex issue to me; the symptoms are an EXACT match to a duplex issue I had a while back. Carefully check client:switch, server:switch, and all switch:switch connections in the line. Don't trust Netware's STARTUP.NCF (or Windows, for the client); see how the card is really set (card limitations, OS configs or hardware failure can keep it from going to the mode you tell it to). If possible, use a hardware tester like the Fluke LANMeter.

Commented:
Methinks melchioe is on to something. Some folks report that setting the NICs to 100 and full duplex, or even 100 and half duplex, works better than autoconfigure. Your switch probably allows setting each port this way as well. I have not had this problem, but it does appear now and then on the Novell forums. I am still curious about bringing up the server -na and trying to copy the file, then loading each optional NLM one at a time and testing the results.

daveM

Commented:
yeah, I'd check the switch (usually telnet into it) and see if the ports attached to the server are really 100Mbps with Full Duplex - then I'd verify the desktop ports are also set properly.

If not, I'd use the setup disk that ships with the NICs and lock the NICs down to 100Base-T/Full Duplex.

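Then set the switch ports to match. Don't count on the switch auto-negotiating on its end: a port left on auto-negotiate facing a hard-set NIC generally can't detect the duplex setting and falls back to half duplex, which is exactly the mismatch you're trying to avoid.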
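Checking both ends of every link by hand is error-prone; a small table of what each end actually reports makes a mismatch obvious. A Python sketch with hypothetical names (`LinkEnd`, `mismatches`) and made-up port labels; the values would be typed in from the switch's port status screen and the NIC's own diagnostics, not from the configuration files.

```python
# Hypothetical checker: compare what each end of a link actually reports
# (NIC vs. switch port) and flag speed or duplex disagreements.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LinkEnd:
    name: str
    speed_mbps: int
    duplex: str  # "full" or "half", as actually running, not as configured


def mismatches(pairs: List[Tuple[LinkEnd, LinkEnd]]) -> List[str]:
    """Return one message per speed or duplex disagreement."""
    problems = []
    for nic, port in pairs:
        if nic.speed_mbps != port.speed_mbps:
            problems.append(f"{nic.name} <-> {port.name}: speed {nic.speed_mbps} vs {port.speed_mbps}")
        if nic.duplex != port.duplex:
            problems.append(f"{nic.name} <-> {port.name}: duplex {nic.duplex} vs {port.duplex}")
    return problems


if __name__ == "__main__":
    links = [
        (LinkEnd("server 3C980B", 100, "full"), LinkEnd("switch port A1", 100, "half")),
        (LinkEnd("client 3C905", 100, "full"), LinkEnd("switch port B4", 100, "full")),
    ]
    for problem in mismatches(links):
        print(problem)  # server 3C980B <-> switch port A1: duplex full vs half
```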

Commented:
HP does not make the best switches and that model shipped with 48 10/100 ports and you can add more.  With a 1 gig backplane, that means the switch is blocking at 25% load.  Have you checked utilization?
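The blocking estimate can be reproduced with simple arithmetic. A Python sketch under stated assumptions: all 48 ports busy at 100 Mbps in one direction, and the 1 Gbps backplane figure quoted in the thread; `backplane_share` is a hypothetical name.

```python
# Rough oversubscription estimate under stated assumptions: every port busy
# at full line rate in one direction, backplane figure as quoted in the thread.


def backplane_share(ports: int, port_mbps: float, backplane_mbps: float) -> float:
    """Fraction of aggregate port bandwidth the backplane can carry."""
    return backplane_mbps / (ports * port_mbps)


if __name__ == "__main__":
    share = backplane_share(ports=48, port_mbps=100, backplane_mbps=1000)
    print(f"backplane covers {share:.0%} of aggregate port bandwidth")  # 21%
```

That is about 21%, in the same ballpark as the 25% quoted above; counting both directions on full-duplex ports would halve it again. In practice the switch only blocks if enough ports are busy at once, which was not the case during the evening copy test.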

Commented:
This came out today:

High Utilization Caused by NDS

Error: "-252 DSERR_NO_SUCH_OBJECT" in DSTRACE while running the +EMU filter.

Unload DS.NLM and utilization drops immediately

Cause
One workstation or device is repeatedly making bindery calls for an unknown object.

Fix
SET DSTRACE = +EMU (Emulate Bindery) to track down Bindery calls to NDS which are known to cause high utilization.

SET DSTRACE = ON
SET DSTRACE = NODEBUG
SET TTF = ON
SET DSTRACE = *R
SET DSTRACE = +EMU
 
Watch for the same request repeating over and over.
Let it run for at least two or three syncs, then:
 
SET TTF = OFF
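Spotting "the same request over and over" in a long trace is easier with a counter. A Python sketch, assuming the output captured with TTF = ON has been copied off the server as a text file. The exact line format is not assumed: lines are compared verbatim, so any timestamps or connection numbers in them would need to be stripped first. `repeated_requests` is a hypothetical helper, and the sample lines are made up, not real DSTRACE output.

```python
# Hypothetical sketch: count identical lines in a captured DSTRACE file and
# report the ones that repeat most. Lines are compared verbatim after
# stripping surrounding whitespace.
from collections import Counter
from typing import Iterable, List, Tuple


def repeated_requests(lines: Iterable[str], threshold: int = 50) -> List[Tuple[str, int]]:
    """Return (line, count) pairs seen at least `threshold` times, most frequent first."""
    counts = Counter(line.strip() for line in lines if line.strip())
    return [(text, n) for text, n in counts.most_common() if n >= threshold]


if __name__ == "__main__":
    # Made-up lines standing in for a real capture; an open file works too.
    sample = ["EMU: bindery request for object PRN1"] * 60 + ["Sync done"] * 3
    for text, n in repeated_requests(sample):
        print(f"{n:6d}  {text}")
```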

Commented:
I had a similar issue with slow network and server response. Try this:
SET CLIENT FILE CACHING ENABLED=OFF
good luck.

Some of your 23 open questions are current, but many are not. ADMINISTRATION WILL BE CONTACTING YOU SHORTLY. Moderators Computer101, Netminder or Mindphaser will return to finalize these if they are still open in 7 days. Experts, please post closing recommendations before that time.

Below are your open questions as of today.  Questions which have been inactive for 21 days or longer are considered to be abandoned and for those, your options are:
1. Accept a Comment As Answer (use the button next to the Expert's name).
2. Close the question if the information was not useful to you, but may help others. You must tell the participants why you wish to do this, and allow for Expert response.  This choice will include a refund to you, and will move this question to our PAQ (Previously Asked Question) database.  If you found information outside this question thread, please add it.
3. Ask Community Support to help split points between participating experts, or just comment here with details and we'll respond with the process.
4. Delete the question (if it has no potential value for others).
   --> Post comments for expert of your intention to delete and why
   --> YOU CANNOT DELETE A QUESTION with comments; special handling by a Moderator is required.

For special handling needs, please post a zero point question in the link below and include the URL (question QID/link) that it regards with details.
http://www.experts-exchange.com/jsp/qList.jsp?ta=commspt
 
Please click this link for Help Desk, Guidelines/Member Agreement and the Question/Answer process.  http://www.experts-exchange.com/jsp/cmtyHelpDesk.jsp

Click your Member Profile to view your question history, and please keep them updated. If you are a KnowledgePro user, use the Power Search option to find them.

Questions which are LOCKED with a Proposed Answer but do not help you, should be rejected with comments added.  When you grade the question less than an A, please comment as to why.  This helps all involved, as well as others who may access this item in the future.  PLEASE DO NOT AWARD POINTS TO ME.

To view your open questions, please click the following link(s) and keep them all current with updates.
http://www.experts-exchange.com/questions/Q.20022366.html
http://www.experts-exchange.com/questions/Q.20022363.html
http://www.experts-exchange.com/questions/Q.20142011.html
http://www.experts-exchange.com/questions/Q.20163170.html
http://www.experts-exchange.com/questions/Q.20163198.html
http://www.experts-exchange.com/questions/Q.20165526.html
http://www.experts-exchange.com/questions/Q.20165532.html
http://www.experts-exchange.com/questions/Q.20175313.html
http://www.experts-exchange.com/questions/Q.20176088.html
http://www.experts-exchange.com/questions/Q.20188199.html
http://www.experts-exchange.com/questions/Q.20188197.html
http://www.experts-exchange.com/questions/Q.20191317.html
http://www.experts-exchange.com/questions/Q.20191318.html
http://www.experts-exchange.com/questions/Q.20191319.html
http://www.experts-exchange.com/questions/Q.20194139.html
http://www.experts-exchange.com/questions/Q.20251645.html
http://www.experts-exchange.com/questions/Q.20263647.html
http://www.experts-exchange.com/questions/Q.20266099.html
http://www.experts-exchange.com/questions/Q.20271348.html
http://www.experts-exchange.com/questions/Q.20274640.html
http://www.experts-exchange.com/questions/Q.20265767.html
http://www.experts-exchange.com/questions/Q.20298168.html
http://www.experts-exchange.com/questions/Q.20298175.html



*****  E X P E R T S    P L E A S E  ******  Leave your closing recommendations.
If you are interested in the cleanup effort, please click this link
http://www.experts-exchange.com/jsp/qManageQuestion.jsp?ta=commspt&qid=20274643 
POINTS FOR EXPERTS awaiting comments are listed in the link below
http://www.experts-exchange.com/commspt/Q.20277028.html
 
Moderators will finalize this question if the Asker has not responded in about 7 days. This will be moved to the PAQ (Previously Asked Questions) at zero points, deleted or awarded.
 
Thanks everyone.
Moondancer
Moderator @ Experts Exchange

S Connelly, Technical Writer (Author)

Commented:
I need to go over this one again.... expect a response by end of week.

Will return in one week, thanks.
Moondancer - EE Moderator

S Connelly, Technical Writer (Author)

Commented:
Just about two months later...my how time flies!
I really will try my very best to close this question within the next week.  If only because I don't want to write something like this again.

Per recommendation, points NOT refunded and question closed.

Netminder
CS Moderator

S Connelly, Technical Writer (Author)

Commented:
Where is the EE policy that states question points will not be refunded if not answered and why is Netminder rewarded?

This question was never closed because it wasn't resolved.  Even if I was forced to give someone points, I would have picked JimBb for his effort.

Like I said, I still don't have a satisfactory solution.

Commented:
Your posts stated 45% utilization.  While the cache buffers are related to available memory, the utilization is generally related to NLMs that are running.  I still believe that you have some NLM running that is dragging down your server and that starting up server -na and manually adding your applications may assist in finding the culprit.

A question is considered abandoned after there is no activity for three weeks. When Moondancer listed all of your open questions, you responded that you would have an answer in one week; that was in June. You repeated that in August. In late September, an Expert doing cleanup made the recommendation that since you had abandoned the question, you forfeit the points as if an answer had been accepted for you. The Expert also felt that while there was information in this question worth keeping, no one Expert's comment merited being selected as an answer. As such, we removed the points from your account and from the question, and closed it.

Netminder
CS Moderator
