tullhead asked:

CNTK model crashes - but only on some PCs

I am evaluating various neural networks created with CNTK from C++. Some are essentially AlexNet, others are V3-inception.

Everything runs fine on dozens and dozens of PCs. But I have found two PCs on which the same code will crash when attempting to evaluate the AlexNet model -- but the V3-inception model will always work - even on these two 'problem PCs'.

So, I figure it must be something about these 2 PCs -- some prerequisites missing, or something. I have checked the obvious things (like vc_redist… )

Going crazy! What could it be? Any ideas?
tullhead

ASKER

I guess nobody on EE has an idea for me?
phoffric

I had a similar problem. The good PC had 4 cores and 28 GB of RAM. One bad machine had 6 cores and 24 GB of RAM. Another bad machine had 8 logical cores and only 16 GB of RAM.

We had to spend a couple of hours releasing large arrays that were no longer being used in order to get them to work.
In another crazy example, a guy built a program and it worked OK. He rebuilt it with no changes to the code or system, and the compiler crashed. The problem was related to the disk drive; I can't quite remember the details. It was either a disk fragmentation issue, so that an allocation request could not be satisfied, or some undetected bad sectors on the hard disk.

So, size matters. You should check Task Manager and Resource Monitor on the good and bad systems for comparison.
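
Since the machines in the example above differed in logical core count as well as RAM, it may help to log what the program itself sees on each PC. Below is a minimal sketch in standard C++. The name `describeLogicalCores` is a hypothetical helper, not part of any library, and it reports only the core count; memory figures would need a platform-specific API that is not shown here.

```cpp
#include <sstream>
#include <string>
#include <thread>

// Hypothetical helper (illustration only): summarizes how many logical
// cores the C++ runtime reports for this machine.
std::string describeLogicalCores() {
    // hardware_concurrency() is only a hint and may return 0 when the
    // value cannot be determined.
    unsigned int cores = std::thread::hardware_concurrency();
    std::ostringstream out;
    if (cores == 0) {
        out << "logical cores: unknown";
    } else {
        out << "logical cores: " << cores;
    }
    return out.str();
}
```

Writing the result of `describeLogicalCores()` to a log at startup gives one line that can be compared between a good PC and a bad one.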

tullhead

ASKER

phoffric - thanks, I checked, but that does not seem to be the problem. One PC it fails on has 64 GB of memory (and no spikes are seen in Task Manager) -- it will run fine on an 8 GB machine. But thanks for suggesting - I keep trying to figure it out....

phoffric

I assume you also have plenty of free disk space. Just to rule out another variable, it wouldn't hurt to run a disk check.

https://www.google.com/amp/s/howtofixwindows.com/fix-disk-drive-errors-with-chkdsk/amp/

You have ruled out missing prerequisite software components. But perhaps there is a hardware difference or a hardware driver difference that can cause the problem.

If you provide the error code and any error messages that show up when your program crashes, I will see if I can get any other experts to help you.

tullhead

ASKER

Still stuck....

phoffric

In my previous post I wrote that if you provide the error code and any error messages that show up when your program crashes, I will see if I can get any other experts to help you.

tullhead

ASKER

OK, I extracted the simplest possible test program. I have both a debug and a release version of it, and I have a VS 2017 project for it if any expert is willing to look at it. I can run this test on various PCs, and it runs fine on most, but I have two PCs on which it fails.

Now, I am not very good at using the debugger -- but I ran it in the debugger and captured some info when it failed, which I will paste below.

Does any expert have an idea of what it may mean?   I doubt that I want to try to debug ntdll.dll - perhaps a crash in that DLL means something?      Thanks for any help.

Some details from the debugger...
TestCNTK.exe has triggered a breakpoint.
Critical error detected c0000374

Unhandled exception at 0x00007FFF64499229 (ntdll.dll) in TestCNTK.exe:
0xC0000374: A heap has been corrupted (parameters: 0x00007FFF645027F0).

The top of the call stack...
 ntdll.dll!00007fff644991b2()                          Unknown
 vcomp140d.dll!00007fff40179ac5()                  Unknown
 Cntk.Core-2.8-rc0.dev20200509d.dll!00007ffed99f698c()   Unknown
----------------------------------------------------------------------------
Version of ntdll.dll
10.0.18362.778 (WinBuild.160101.0800)
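
Because the report above is a heap corruption (0xC0000374), one thing that might narrow it down in a debug build is the heap checking built into the MSVC debug C runtime. Below is a minimal sketch; `enableHeapChecks` and `heapLooksIntact` are hypothetical helper names, while `_CrtSetDbgFlag` and `_CrtCheckMemory` come from `<crtdbg.h>` and only take effect when `_DEBUG` is defined. It is an assumption that these checks would catch this particular corruption, since they only see memory allocated through the debug CRT heap.

```cpp
#ifdef _MSC_VER
#include <crtdbg.h>
#endif

// Hypothetical helper (illustration only): in an MSVC debug build, asks
// the CRT to validate its heap on every allocation and free. This is
// slow, but a failure then shows up closer to the code that caused it.
// Returns false when the checks are not available in this build.
bool enableHeapChecks() {
#if defined(_MSC_VER) && defined(_DEBUG)
    int flags = _CrtSetDbgFlag(_CRTDBG_REPORT_FLAG);  // read current flags
    _CrtSetDbgFlag(flags | _CRTDBG_ALLOC_MEM_DF | _CRTDBG_CHECK_ALWAYS_DF);
    return true;
#else
    return false;
#endif
}

// Hypothetical helper (illustration only): explicit heap check that can
// be called before and after a suspect step. Without the debug CRT there
// is nothing to check, so it reports true.
bool heapLooksIntact() {
#if defined(_MSC_VER) && defined(_DEBUG)
    return _CrtCheckMemory() != 0;
#else
    return true;
#endif
}
```

Calling `enableHeapChecks()` at the start of the test program, then `heapLooksIntact()` before and after the model evaluation, may show whether the heap is already damaged before the crash in ntdll.dll is reported.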
