  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 353

Same calculation, different results once in 100 times

I'm refactoring some code, and I'm trying to squeeze three equal-length arrays into a matrix.

I'm getting very frustrated with this, because the code below fails its assert once every hundred runs or so.
void func(float** newVar, float* oldVar) {
  /* Code to calculate Result */
 
  newVar[i][j] = 1000 - Result;
  oldVar[j] = 1000 - Result;
  assert(newVar[i][j] == oldVar[j]);
}

Asked by: letharion
12 Solutions
 
Steven Kirkland (Retired) Commented:
AMD had a bad batch of Opterons a few years back that caused a similar problem, so it would be interesting to know whether these anomalous results show up only on your dev box or on other machines as well.

letharion (Author) Commented:
I tried it on another computer, same problem unfortunately.

Infinity08 Commented:
I don't think I'd consider a hardware problem just yet ;)

My guess would be that the values i and/or j are out of range for the allocated memory for newVar and/or oldVar.

Could you show a bit more of the code ?
 
jkr Commented:
How is 'Result' calculated? How are the indices created? And finally: How are the arrays declared/filled/used?

itsmeandnobodyelse Commented:
>>>> assert(newVar[i][j] == oldVar[j]);

you cannot reliably compare floating point values on equality. Even if they were calculated with the same algorithm as in your sample above. The problem is that one float value may have different internal binary representations, i.e. different pairs of mantissa and exponent. You can imagine that by noting that 1.9999999... is the same as 2.000000..., and both would print as 2.

So you must change the assert to something like

   assert(fabs(newVar[i][j] - oldVar[j]) < 0.000001);
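
For values around 1000 (as in this question) a fixed epsilon like 0.000001 can be too strict, and for tiny values too loose, so a combined absolute/relative tolerance is a common refinement. Here is a minimal host-side sketch of such a helper (not from the question's code; the name nearlyEqual is made up for illustration):

#include <algorithm>
#include <cmath>

// Treat a and b as equal if they differ by less than an absolute tolerance
// or by less than a small fraction of the larger magnitude.
static bool nearlyEqual(float a, float b,
                        float absEps = 1e-6f, float relEps = 1e-5f) {
   float diff  = std::fabs(a - b);
   float scale = std::max(std::fabs(a), std::fabs(b));
   return diff <= std::max(absEps, relEps * scale);
}

// Usage: assert(nearlyEqual(newVar[i][j], oldVar[j]));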


letharion (Author) Commented:
Infinity/jkr:
I'm gonna try to "clean out" a lot of code and try to answer more about what happens.

itsmeandnobodyelse:
That's interesting. This time however, I don't think that's the issue.

When the assert fails, Result is expected to have the value 0.
newVar would then be expected to have the value 1000, but newVar is either 0, or something like: 18910402543280383426838921216.000000 or similarly huge.

I also must apologize, because with a fresh look at the problem, I see that the problem isn't quite what I have presented above.
Thank you for your time ppl :)

evilrix (Senior Software Engineer, Avast) Commented:
>> you cannot reliably compare floating point values on equality. Even if they were calculated with the same algorithm

It would seem wrong to me that if I ran the same calculation 10000 times I would get differing results. I accept (and know) that floats cannot always be represented exactly due to their internal representation, but if two floats defined in an identical manner produced differing (non-deterministic) results, I would consider that erroneous.

As an experiment I tried coding a test (below), I could not get it to fail.

Could you point to the location in the C/C++ standard where this is stated so that I may update my understanding of this?
#include <cassert>
int main()
{
	float a = 1.999f;
	float b = 8.349f;
	float c = 17534.1f;
 
	for(int i = 0 ; i < 999999 ; ++i)
	{
		float x = ((float)(1000 * i)) / a * b + c;
		float y = ((float)(1000 * i)) / a * b + c;
 
		assert(x==y);
	}
 
	return 0;
}

 
Infinity08 Commented:
>> As an experiment I tried coding a test (below), I could not get it to fail.

And with good reason :) It's not supposed to fail. If a floating point value is calculated in exactly the same way twice, then both results should be exactly the same.
You know that - I'm just confirming it for anyone else reading this ;)


>> I also must apologize, because with a fresh look at the problem, I see that the problem isn't quite what I have presented above.

Does that mean you fixed it ? Or do you want us to look into this further ? If so, can you give a bit more information ?

evilrix (Senior Software Engineer, Avast) Commented:
>> You know that - I'm just confirming it for anyone else reading this ;)
Yeah I know, but I was hedging my bets just in case I'd overlooked something in the standard :)

letharion (Author) Commented:
>Does that mean you fixed it? Or do you want us to look into this further ? If so, can you give a bit more information ?

Not quite. I'm just rethinking it myself before I post again :)
I'm trying to build a minimal sample showing the problem. It's quite possible I'll figure it out in the process. Otherwise I'll post it here :)

Infinity08 Commented:
Ok. :)

itsmeandnobodyelse Commented:
>>>> If a floating point value is calculated in exactly the same way twice, then both results should be exactly the same.

Is there any standard which defines that?

I agree that a compiler should be able to see that the right-hand result can be used twice and therefore avoid recalculating it. But assume the optimizations were switched off and the two calculations were done on two different cores; then it depends on the floating point processor what it returns as an acceptable result, and that might differ depending on whether a result was taken from cache or not.
 
>>>> Not quite. I'm just rethinking it myself before I'm posting again
I would like to know whether, in the failing case, the results differ only by a minimal amount or whether one of the numbers is simply corrupt. If the latter, we don't need to check whether that could be possible by the standard or not.

itsmeandnobodyelse Commented:
>>>> newVar would then be expected to have the value 1000, but newVar is either 0, or something like: 18910402543280383426838921216.000000 or similarly huge.


Sorry, I overlooked your answer. Yes, if the numbers were significantly different, one of the operands probably was corrupt.

itsmeandnobodyelse Commented:
>>>> one of the operands probably was corrupt.

I had such issues when I passed local variables or stack variables by reference to callback functions. As a callback runs asynchronously, all arguments passed by address must live beyond the lifetime of the calling function(s).
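
To illustrate that failure mode, here is a minimal sketch (unrelated to the question's code) that uses a detached std::thread as a stand-in for an asynchronous callback; the names are made up:

#include <chrono>
#include <cstdio>
#include <thread>

// The "callback" receives an address and uses it later, asynchronously.
static void asyncCallback(const float* data) {
   std::this_thread::sleep_for(std::chrono::milliseconds(10));
   std::printf("%f\n", data[0]); // may read stack memory that no longer exists
}

static void startWork() {
   float local[4] = { 1.0f, 2.0f, 3.0f, 4.0f };  // lives on startWork's stack
   std::thread(asyncCallback, local).detach();   // hand the address onwards
}  // startWork returns here and 'local' dies, but the callback may still run

int main() {
   startWork();
   std::this_thread::sleep_for(std::chrono::milliseconds(100));
   return 0;
}

Whether anything like this happens in the real code is pure speculation; it is just the pattern described above.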

Infinity08 Commented:
>> Is there any standard which defines that?

Deterministic behavior ?

Sure, if you perform the calculation on two different machines (or possibly compiled with two different compilers and/or with different compiler flags/settings), the result might be different, depending on how compliant they are with the IEEE754 standard (the most common floating point standard).

But doing the exact same calculation twice on the same machine with the same executable, will give you the same result. Even better : performing the same calculation on two IEEE754 compliant platforms should also yield the same result.

That's what's so nice about computers - the fact that you know how it behaves, and that it behaves predictably (and deterministically). Otherwise, I'd say it would be quite difficult to write software heh, since today 1+1 might be 2, but tomorrow it might be 3.

For more info, see chapter 11 in the IEEE754 standard.


>> and there were two cores where the calculation was done, then it depends on the floating point processor what it returns as an acceptable result

So, you're saying that the two cores would return a different result for the same exact calculation ? If you find one that does that, I'd return it immediately as defective heh (don't forget to ask for your money back).

itsmeandnobodyelse Commented:
>>>> So, you're saying that the two cores would return a different result for the same exact calculation ?

Hmmm. From a mathematical point of view there is no unique representation of a floating point number that could be defined as *the* result of a computation. Hence, if for example you calculate PI, you must accept that different results are possible, at least when using different algorithms. When using the same algorithm, it depends on the process whether it could go different ways for optimization purposes or not. Assume an intermediate result, say a logarithm value, could be obtained either by table lookup or by an algorithm, and which way is taken depends on the current resources. Then those two ways could produce slightly different results, though within the range of precision, and all following results may differ as well. If a calculation is always done using the same process, with no heuristic parts involved, we will always get the same result.
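
A small sketch of that point (values chosen only for illustration): the same sum, grouped two different ways, rounds differently in single precision.

#include <cstdio>

int main() {
   float big = 100000000.0f;  // exactly representable as a float
   float one = 1.0f;

   float left = one + big;    // the 1 is absorbed: rounds back to 100000000.0f
   left = left - big;         // -> 0.0f

   float right = big - big;   // -> 0.0f
   right = one + right;       // -> 1.0f

   // On a typical IEEE754 single-precision setup this prints "left=0 right=1".
   std::printf("left=%g right=%g\n", left, right);
   return 0;
}

Which is why regrouping a floating point sum is not a harmless optimization.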

>>>> I'd return it immediately as defective heh

In 1993/1994 I bought a computer with a Pentium 60 MHz processor which had the FDIV bug. A few months later, when I had a warranty issue because of the RAM, I made a request to exchange the processor as well, and they finally granted it. The bug was that roughly one in some millions of floating point computations could fail. I actually never noticed any issue with it.

Infinity08 Commented:
Alex, we're still talking about these two lines, aren't we ?

>>   newVar[i][j] = 1000 - Result;
>>   oldVar[j] = 1000 - Result;

I don't see any different algorithms here. I just see two calculations that are exactly the same, and thus should produce the exact same result :) If that's not the case, then you've got a (seriously) non-compliant (or defective) system, and I'm surprised you can get any work done on it correctly.

Obviously there is always the possibility of non-compliance or defects, but I think it's more likely that there simply was a buffer overflow, where writing the first value actually overwrote some other important value (like Result for example).

letharion (Author) Commented:
> Obviously there is always the possibility of non-compliance or defects, but I think it's more likely that there simply was a buffer overflow, where writing the first value actually overwrote some other important value (like Result for example).

I have done millions and millions of other calculations (in fact, a lot more I think), and I've never noticed numbers not making sense like this before.

I wrote some new code that was supposed to make things easier, and every once in a while things go haywire. The problem is quite likely me. And for a while I thought Infinity hit it with "overwrote some other important value (like Result for example)." But if I print Result before, between and after the calculations, the value remains the same.

Like I said above, I misrepresented the problem. It turns out I do an assert on the same values, but further down in the code, so I mistook the line where the problem occurs. (Hence I couldn't figure out what was wrong) I'm trying to narrow it down, and hopefully doing that alone will fix it. But I'll be back soon.

Gonna take a break however. The party leader of the Swedish Pirate Party is speaking pretty much right outside the building in a few minutes. Gotta hear that :)

Infinity08 Commented:
If you can post a more complete code sample, we can always have a look at it for you ...

itsmeandnobodyelse Commented:
>>>> Alex, we're still talking about these two lines, aren't we ?
No. For these two lines the compiler hopefully has created only code for one single computation.

In my last comments I referred to

>>>> So, you're saying that the two cores would return a different result for the same exact calculation ?

and what I said is that if the internal representation of a floating point number as the result of a calculation isn't uniquely defined, it might depend on the process used for that calculation whether the results are always binary equal. For example, I read about parallel processing and that some multi-core processors are able to divide a task into more or less parallel subtasks depending on the current capacity. Assume an algorithm like (a+b+c+d). If that is calculated as (a + (b + (c + d))), the result could differ from ((a + b) + (c + d)). That is what I mean by 'process'. If (a+b+c+d) is always resolved the same way, differences can hardly occur. If the current workload is a factor in choosing one way or the other, differences are more likely.

>>>> If you can post a more complete code sample, we can always have a look at it for you ...

Yes, the sample code posted would hardly throw the assert, because it is very unlikely that between the two statements there is an interrupt in which a buffer overflow or a wrong address in a concurrent thread overwrites just the right-hand variable used in the assignment statements. That surely isn't impossible, but it wouldn't occur 'once every hundred runs or so'.

Infinity08 Commented:
>> it might depend on the process used for that calculation whether the results are always binary equal.

That's where the IEEE754 standard comes to the rescue ;) The same calculations performed on different compliant systems yield the same results. It doesn't matter how it's implemented, as long as the result is the same.

That said, there is still a lot of hardware that isn't fully IEEE754 compliant, and some hardware that isn't even following the standard at all (although that's not common), so you never know. But, for the example of two cores of the same CPU, for the same system, running the same executable ... I would be surprised to see different results. Even in a situation where the CPU does some extensive runtime optimizations like you describe (which is more and more common these days), if the CPU is IEEE754 compliant, the result still has to be the same, despite the optimizations.

letharion (Author) Commented:
Ok, here we go.
I hope I got all the relevant code. There's unfortunately quite a bit of code involved, but this "should" be the interesting part.
This code runs 5 times, with size = 32.
The 6th time (still 32), the 5th element of the arrays fails the equality assert.

Everything works great for thousands and thousands of runs until I use the "newVar" CudaArray template, so probably I'm doing something bad with it (during CopyToHost?)

Just ask if anything is unclear and I'll try to answer.
CudaArray<float>* newVar;
float* oldVar, kernel_oldVar;
srand(1000); //So the error doesn't move around.
 
func1(float* inData, int size) {
  newVar = new CudaArray<float>(size)
  oldVar = inData;
  func2(newVar, oldVar, size);
}
 
func2(CudaArray<float>* newVar, float* oldVar, int size) {
  /* Snip */
  kernel_oldVar = CudaMalloc<float>(size);
 
  CudaKernel<<<>>>(newVar->DevicePtr(), oldVar, size, /*Other params*/)
  //Strange if you're unfamiliar with Cuda, I know. But the <> should be there.
 
  newVar->CopyToHost();
  CudaCopyDtH<float>(kernel_oldVar, oldVar, size);
 
  for(int i = 0; i < popsize; i++) {
    if(newVar->HostPtr()[0][i] != result[i]) {
      printf("\n%d\n", i);
      assert(newVar->HostPtr()[0][i] == oldVar[i]);  //This was the real failing assert. i == 4. 6th run.
    }
  }
  /* Snip */
}
 
__global__ static void CudaKernel(float** newVar, float* oldVar, /* Other params */) {
  /* Lots of calculations */
 
  newVar[0][i] = 1000000 - result[i];
  oldVar[i] = 1000000 - result[i];
  assert(newVar[0][i] == oldVar[i]);  //This one doesn't fail. My mistake.
}
 
//The templates used:
template <class T>
T* CudaMalloc(const size_t size) {
   T* devicePtr = NULL;
   cudaError_t error = cudaMalloc((void**)&devicePtr, size * sizeof(T));
   if (error == cudaErrorMemoryAllocation) {
      cout << "CudaMalloc error: " << error << " (" << cudaGetErrorString(error) << ")" << ". Exiting" << endl;
      assert(error == 0);
   }
   return devicePtr;
}
template <class T>
void CudaCopyHtD(const T* hostPtr, T* devicePtr, const size_t size) {
   cudaError_t error = cudaMemcpy(devicePtr, hostPtr, sizeof(T)*size, cudaMemcpyHostToDevice);
   if(error != 0) {
      cout << "Copy HtD error: " << error << " (" << cudaGetErrorString(error) << ")" << ". Exiting" << endl;
      assert(error == 0);
   }
}
template <class T>
void CudaCopyDtH(const T* devicePtr, T* hostPtr, const size_t size) {
   cudaError_t error = cudaMemcpy(hostPtr, devicePtr, sizeof(T)*size, cudaMemcpyDeviceToHost);
   if(error != 0) {
      cout << "Copy DtT error: " << error << " (" << cudaGetErrorString(error) << ")" << ". Exiting" << endl;
      assert(error == 0);
   }
}
 
template <class T>
class CudaArray {
   public:
   CudaArray(unsigned int size) {
      arrays = 3;
      arraySize = size;
 
      ArrayHost = (T**)malloc(arrays * sizeof(T*));
      for(int i = 0; i < arrays; i++)
         ArrayHost[i] = (T*)malloc(arraySize * sizeof(T));
 
      ArrayDevice = CudaMalloc<T*>(arrays * sizeof(T*));
      PtrHolder = (T**)malloc(arrays * sizeof(T*));
      for(int i = 0; i < arrays; i++)
         PtrHolder[i] = CudaMalloc<T>(arraySize);
 
      CudaCopyHtD<T*>(PtrHolder, ArrayDevice, arrays);
   }
   T** DevicePtr() { return ArrayDevice; }
   T** HostPtr() { return ArrayHost; }
   void CopyToHost() {
      for(int i = 0; i < arrays; i++) CudaCopyDtH<T>(PtrHolder[i], ArrayHost[i], arraySize);
   }
   void CopyToDevice() {
      for(int i = 0; i < arrays; i++) CudaCopyHtD<T>(ArrayHost[i], PtrHolder[i], arraySize);
   }
 
   private:
   unsigned int arrays;
   unsigned int arraySize;
 
   T **ArrayHost;
   T **ArrayDevice;
   T **PtrHolder;
};

 
Infinity08 Commented:
A few questions :

1) When/where is func1 called, and with which parameters ?

2) The CudaKernel call in func2 overwrites the oldVar (and thus also the inData originally passed to func1). Is that intentional ?

3) The CudaCopyDtH call in func2 again overwrites oldVar (and thus also the inData originally passed to func1) - this time with random data, since kernel_oldVar was just allocated. Is that intentional ?

4) What is result ?

5) What is popsize ? Can it ever be >= arraySize ?

My guess would be that you're corrupting the oldVar array, which causes the indicated assert to fail.

letharion (Author) Commented:
1) func1, really called "CudaCallC", is a JNI (Java->C) function, and called from a separate java program.
Signature:
JNIEnv * jenv, jclass jc, jint lastrun, jfloatArray calcs, jint maxcalcsize, jint size, jfloatArray oldVar, jintArray hits, jstring inDataFile, jint currentRun)

2) Yes, that is intentional.
But now that you ask, I get uncertain as to why any data is assigned to oldVar in the first place. It's done during the conversion from Java to C, but I'm gonna think it through. The purpose of passing inData in the first place is to provide a pointer to write data back to when the calculation is finished.

3) My bad. The CudaKernel call takes kernel_oldVar, not oldVar, as a parameter. Thus the data is not random, as it has been set by the kernel.

4) float result[3]. *sigh* The assignment is result[0], not [i].

5) popsize is a typo for size. So no, that should never be possible.

Infinity08 Commented:
I guess I can ignore typos, like :

>> float* oldVar, kernel_oldVar;

where kernel_oldVar is a float, rather than a float*, and several missing ;'s, and things like that. I assume that's just copy-paste errors, rather than problems with the code.


So, to summarize :

1) The CudaKernel call initializes newVar->DevicePtr()[0] and kernel_oldVar to the same values, checked by an assert (at the end of the CudaKernel function). That assert succeeds.

2) Then, newVar->CopyToHost() copies newVar->DevicePtr() to newVar->HostPtr().

3) Then, kernel_oldVar is copied to oldVar using CudaCopyDtH.

4) Then another set of asserts checks whether newVar->HostPtr()[0] and oldVar also contain the same values. Which you expect to be true, since they are copies of newVar->DevicePtr()[0] and kernel_oldVar resp., and those were already verified to contain the same values in 1). But it sometimes fails.

If the above is right, then the conclusion would be that either step 2) or 3) has a problem. I'm not familiar with what cudaMemcpy does exactly, and what the difference is between cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost, so I can't really comment on that. But you can always use a debugger to track all these copies, and try to see what goes wrong, and where.

itsmeandnobodyelse Commented:
>>>>  if the CPU is IEEE754 compliant, the result still has to be the same, despite the optimizations
but the results of ((a+b) + (c+d)) and (a + (b + (c+d))) are mathematically different if the intermediate sums are rounded values.

>>>> assert(newVar->HostPtr()[0][i] == oldVar[i]);
I have problems to see why these values should be equal.

Is there any multi-threading involved? What does the kernel do with the given oldVar ?



letharion (Author) Commented:
Yes, the code does compile; those errors are there because I started renaming variables from the beginning, thinking it would make more sense. Not so now. That caused lots of stupid rewriting and I messed up.

Your summary is correct.
I agree with your assessment that steps 2/3 have a problem. I'm fairly certain that it's 2.

cudaMemcpy copies data between two pointers. One pointer is supposed to point to the computer's RAM, and the other to a CUDA-capable graphics card's memory, hence the CudaMalloc.
The "cudaMemcpyHostToDevice" is the direction that memory should be copied.

Yes, I should try that. I'm having problems debugging the native function however, since it's called from java, and I can't get the debugging to follow over the JNI call.

letharion (Author) Commented:
>I have problems to see why these values should be equal. Is there any multi-threading involved?
Yes, the kernel runs 32 parallel threads. I can see why that isn't really obvious.
The i in the kernel can be thought of as thread-ID. So each thread assigns one of the values.
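
For readers unfamiliar with CUDA, a minimal sketch of how such a per-thread index is usually derived (this is not the question's actual kernel):

__global__ void fillKernel(float* out, int size) {
   // Each thread computes its own global index from its block and thread IDs.
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   if (i < size) {
      out[i] = 1000.0f - (float)i;  // each thread writes exactly one slot
   }
}

// Launched, for example, as: fillKernel<<<1, 32>>>(devPtr, 32);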

I don't understand the last sentence.

Infinity08 Commented:
>> >>>>  if the CPU is IEEE754 compliant, the result still has to be the same, despite the optimizations
>> but the results of ((a+b) + (c+d)) and (a + (b + (c+d))) are mathematically different if the intermediate sums are rounded values.

If that's the case, then if the CPU wants to be IEEE754 compliant, it can't perform that kind of optimization :)


>> cudaMemcpy copies data between two pointers. One pointer is supposed to point to the computer's RAM, and the other to a CUDA-capable graphics card's memory, hence the CudaMalloc.

That's what I guessed from the names ;) But what I am unsure about, is whether it does a direct memory copy (just like memcpy would do), or whether it performs a "translation" on the data at the same time (for example change byte order, change size, change floating point format, add meta data, ...).


>> I'm having problems debugging the native function however, since it's called from java, and I can't get the debugging to follow over the JNI call.

Does it only have the problem when doing it through JNI ?
Or does it also have the problem when doing it directly in native code ?

Did you make sure that the inData received by func1 has enough memory allocated to it (i.e. before the JNI call was made)?

itsmeandnobodyelse Commented:
>>>> What does the kernel do with the given oldVar ?
>>>> I don't understand the last sentence.

I mean, whether the kernel operates on the data given *after* the call has returned.

How do you assure that all 32 threads are done with their slot before the call was finished?

 

letharion (Author) Commented:
>>I mean, whether the kernel operates on the data given *after* the call has returned.
No, those are very last lines of code.
>>How do you assure that all 32 threads are done with their slot before the call was finished?
The kernel is launched asynchronously, but when a cudaMemcpy is called, the program freezes, awaiting the return of the kernel.

Basically, I "fixed" it.
Some of these threads return "early", and never reach the mentioned calculations.
I tried setting all the values to 0, both before the kernel is called at all and as the very first instruction inside it, i.e. before the early-return check occurs, but the assert still fails. That makes no sense whatsoever to me, but as long as I avoid asserting on those threads, everything is fine.
That solution works perfectly (even saving some otherwise wasted clock cycles), if I can look beyond the fact that I don't understand it.
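
For what it's worth, one way to guarantee that every slot holds a defined value even if some threads return early is to clear the device buffers before the launch. A sketch (the pointer names are made up, and whether this matches what was actually tried is an assumption):

// Zero the device-side buffers before launching the kernel, so slots that
// no thread writes still compare equal afterwards.
cudaMemset(devNewVarRow0, 0, size * sizeof(float));
cudaMemset(devKernelOldVar, 0, size * sizeof(float));
CudaKernel<<<1, 32>>>(devNewVar, devKernelOldVar, size /*, other params */);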

itsmeandnobodyelse Commented:
>>>> but when a cudaMemcpy is called, the program freezes, awaiting the return of the kernel.

That is only one condition which needs to be met. More important is that the kernel only returns if *all threads* have finished their work regarding their slot.

Did I understand correctly that you currently have no control when the threads were finished?

letharion (Author) Commented:
>>Did I understand correctly that you currently have no control when the threads were finished?

Yes, that's correct.
I added a __syncthreads() to the end of the kernel run.
It does sound quite reasonable of course that there should be some control.
I never really thought about it until now, because all the other copying always worked flawlessly, but that could be "luck" I guess.

letharion (Author) Commented:
Or it didn't work so flawlessly, but at least I never had a problem with it.

itsmeandnobodyelse Commented:
>>>> but that could be "luck" I guess

Assume one of the threads is 'late' in filling its slot and the array pointing to all slots has already been returned to the caller. Then the kernel thread might change data that has already been evaluated and/or write to already freed memory. The latter could corrupt any other data currently on the heap.

>>>>  added a __syncthreads() to the end of the kernel run.

I googled for __syncthreads() and found the following statement:

__syncthreads() is your garden variety thread barrier. Any thread reaching the barrier waits until all of the other threads in that block also reach it. It is designed for avoiding race conditions when loading shared memory, and the compiler will not move memory reads/writes around a __syncthreads().

----------
It is nothing more and nothing less. Unless you are writing to a shared memory location in thread i then reading that same location in thread j, you don't need __syncthreads().
----------

If that is right, __syncthreads doesn't help for your purpose. You would need some kind of semaphore with an initial count of 32. The return must not happen before the count is 0, i.e. before all threads have finished their work and acknowledged it by taking another slot of the semaphore.
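
For illustration, here is a minimal kernel (unrelated to the question) showing the block-barrier role described in that quote. Note that __syncthreads() only synchronizes the threads of one block on the device; it does not make the host wait for the kernel, which is what cudaDeviceSynchronize() or the implicit wait of cudaMemcpy is for.

__global__ void reverseInBlock(float* data, int n) {
   __shared__ float tile[32];             // assumes a single block of 32 threads
   int i = threadIdx.x;
   if (i < n) tile[i] = data[i];          // every thread loads at most one element
   __syncthreads();                       // barrier: the whole tile is loaded now
   if (i < n) data[i] = tile[n - 1 - i];  // safe to read slots written by other threads
}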




letharion (Author) Commented:
I've been trying to find something in the docs that supports this, but I can't. I'll try asking in the nvidia forum.

Because this is executed on a GPU rather than a CPU, all threads must follow the same execution path.
For example, if we hit an if/else and half the threads want to go each way, one half must stop and wait for the rest of the threads; when "if" is done, "else" runs, and when they converge again, they all run concurrently again.

To my knowledge, it isn't possible for one thread to "finish" before the others. (Then again, I have this weird issue here that might say otherwise.)

I'll ask in nvidia's forum.

letharion (Author) Commented:
>>Since kernel-launches are asynchronous, I wonder: how do I know/ensure that memcpy doesn't copy data that the kernel hasn't had time to touch yet?
> You don't need to do anything. A cudaMemcpy to/from the host will implicitly wait for all previous asynchronous operations to complete. A device to device memcpy will be inserted in the queue of async operations and run in order.
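
Putting that together, a sketch of the host-side pattern (the variable names are placeholders, not the question's code): the copy back to the host is what waits for the kernel, and the error checks make a failed launch visible instead of silently leaving stale data.

CudaKernel<<<1, 32>>>(devNewVar, devOldVar, size /*, other params */);
cudaError_t err = cudaGetLastError();  // reports launch/configuration errors
assert(err == cudaSuccess);

// Blocks until the kernel has finished, then copies the results back.
err = cudaMemcpy(hostOldVar, devOldVar, size * sizeof(float),
                 cudaMemcpyDeviceToHost);
assert(err == cudaSuccess);

// An explicit barrier is also available when no copy is needed yet:
// assert(cudaDeviceSynchronize() == cudaSuccess);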

letharion (Author) Commented:
Just splitting the points to close the question. Raise any objections if you want and we'll work it out. :)