Be seen. Boost your question’s priority for more expert views and faster solutions

Hey guys,

I have a function that loops through an int array and returns true and breaks out if it finds a 0.

However, is there any way to improve it and detect the 0 using only ONE comparison, to reduce overhead?

I have a function that loops through an int array and returns true and breaks out if it finds a 0.

However, is there any way to improve it and detect the 0 using only ONE comparison, to reduce overhead?

```
int LookForOne (int n , int x[]) {
int i;
for (i = 0; i < n; i++) {
if (x[i] == 0) {
return (TRUE);
}
}
return (FALSE);
}
```

I was wondering if you could possibly do some math (add / subtract) to improve it.

What I meant was : what will this function be used for ? When will you call it ? How often ? For what purpose ?

There might be other ways of optimizing this.

Modern CPUs have a feature called "branch prediction" that, together with pipelining, eliminates much of the overhead of a conditional jump -- which is needed to handle the "if" construct. Try to benchmark your loop vs. a loop that ANDs all elements and checks the result, like in my code snippet. I'm afraid the run time will be dominated by the time needed to fetch the array values, at least for large arrays bigger than the L2 cache size.

Moreover, for example, x86 CPUs have special loop instructions that can implement tight loops that check arrays.. If my memory doesn't fail me, the core of your loop can be implemented using one (!) assembly instruction like

REPNE SCASD

with a consecutive check if the end of the array has been reached. The instruction reads like "REPeat while Not Equal (zero) SCAn the String Double-word-wise"

If you have a good compiler, it will issue an instruction like this for your code.

As Infinity08 already said, without pre-sorting or some other preprocessing, no substantial speed-ups are possible.

```
int LookForOneMaybe(int n , int x[]) {
int i;
int res = -1;
for (i = 0; i < n; i++) {
res &= x[i];
}
if (res != 0) {
return (FALSE);
}
return (MAYBE);
}
```

But that way doesn't reduce overhead.

```
int CheckForOne(int x[], int n) {
long int test = 1;
int i;
int temp = 1;
int retVal = 1;
for (i = 0; i < n; i++) {
temp = test;
test = test * x[i];
while (test > temp) {
test--;
}
printf("test = %d\n",test);
}
if ( test != 0 ) {
retVal = 0;
}
return retVal;
}
```

test--;

}

that could be a lot of comparisons

And if test ever goes negative, it would stay negative even if it later sees x[i] = 0

unsigned int temp = x[i];

temp |= temp>>1;

temp |= temp>>2;

temp |= temp>>4;

temp |= temp>>8;

temp |= temp>>16;

temp |= temp>>32; // depending on sizeof(int)*CHAR_BIT;

temp &=1;

then

test *= temp is guaranteed not to overflow, or test &= temp can also be used

Although it could be faster than

while (test > temp) {

test--;

}

The problem with the code is that the data dependencies are tight: Each access to temp depends on the previous line. This kind of makes the CPU's pipeline useless. A good compiler would do some loop unrolling and interleaving, like I did in the second example. Only possible if there are enough CPU registers available.

Anyway, I suggest to do some benchmarks to check if the effort pays off.

Koraxen: Is multithreading an option for you? Multi-core CPUs become more and more common, and it should be easy to parallelize the code. Just need to check which access pattern would be best:

- Thread 1 works on [0..N/2), thread 2 works on [N/2..N)

- Thread 1 works on even indexes of [0..N), thread 2 works on odd indexes of [0..N)

```
int LookForOne(int n , int x[]) {
int i;
unsigned int res = -1;
for (i = 0; i < n; i++) {
unsigned int temp = x[i];
temp |= temp>>1;
temp |= temp>>2;
temp |= temp>>4;
temp |= temp>>8;
temp |= temp>>16;
temp |= temp>>32; // depending on sizeof(int)*CHAR_BIT;
// Now, the LSB of temp is set iff x[i] <> 0
// We use a "full" AND for res,
// it's one machine instruction anyway,
// but we check the LSB after the loop
res &= temp;
}
if ( (res & 1) ) {
return (FALSE);
}
return (TRUE);
}
int LookForOneUnrolled(int n , int x[]) {
int i;
unsigned int res = -1;
for (i = 0; i < n; i += 6) {
unsigned int temp1 = x[i];
unsigned int temp2 = x[i+1];
unsigned int temp3 = x[i+2];
unsigned int temp4 = x[i+3];
unsigned int temp5 = x[i+4];
unsigned int temp6 = x[i+5];
temp1 |= temp1>>1;
temp2 |= temp2>>1;
temp3 |= temp3>>1;
temp4 |= temp4>>1;
temp5 |= temp5>>1;
temp6 |= temp6>>1;
temp1 |= temp1>>2;
temp2 |= temp2>>2;
temp3 |= temp3>>2;
temp4 |= temp4>>2;
temp5 |= temp5>>2;
temp6 |= temp6>>2;
temp1 |= temp1>>4;
temp2 |= temp2>>4;
temp3 |= temp3>>4;
temp4 |= temp4>>4;
temp5 |= temp5>>4;
temp6 |= temp6>>4;
temp1 |= temp1>>8;
temp2 |= temp2>>8;
temp3 |= temp3>>8;
temp4 |= temp4>>8;
temp5 |= temp5>>8;
temp6 |= temp6>>8;
temp1 |= temp1>>16;
temp2 |= temp2>>16;
temp3 |= temp3>>16;
temp4 |= temp4>>16;
temp5 |= temp5>>16;
temp6 |= temp6>>16;
temp1 |= temp1>>32;
temp2 |= temp2>>32;
temp3 |= temp3>>32;
temp4 |= temp4>>32;
temp5 |= temp5>>32;
temp6 |= temp6>>32;
temp1 &= temp2;
temp3 &= temp4;
temp5 &= temp6;
res &= temp1;
temp3 &= temp5;
res &= temp3;
}
if ( (res & 1) ) {
return (FALSE);
}
return (TRUE);
}
```

x[n]=0;

for( int *p=&n[0]; *p++; ){}

return( p <= &x[n] );

If your machine has hardware instuctions that strlen takes advantage of, it could be less overhead to use it to scan for candidate 0's dependng on how the int values are distributed.

Koraxen: What is your target architecture, anyway? :-)

```
if (n == 0) return FALSE;
if (x[n-1] == 0) return TRUE;
int save = x[n-1];
x[n-1] = 0;
// ...
x[n-1] = save;
return RETVAL;
```

I'm pretty sure it won't. Be serious. The original code was a loop with 1 compare in it. Now you've made it a loop with multiple shifts, ors and movs.

I thought the goal was "to reduce overhead" ?

>> What will this function be used for ? When will you call it ? How often ? For what purpose ?

Yes, but as I said : it replaced 1 compare with multiple shifts, ors and movs. I doubt that that will be faster on any machine.

Btw, when the compiler does a good job, it will generate code that has only two jumps : 1 for the loop, and one to get out of the loop when a 0 is found. The first will obviously be taken multiple times, but the second will only be taken once. It might look something like this (pseudo assembler) :

```
sl n, 2 ; n <<= 2
add n, x ; n += x
lab1: cmp x, n ;
jge lab2 ; if (x >= n) goto lab2
cmp 0, (x) ;
je lab3 ; if (0 == *x) goto lab3
add x, 4 ; x++
jmp lab1 ; goto lab1
lab2: ret 0 ; return 0
lab3: ret 1 ; return 1
```

```
int LookForOneWatchdog(int n, int x[]) {
int i;
if (n == 0) return (FALSE);
int watchdog = x[n-1];
int ret = FALSE;
for (i = 0; ; i++) {
if (x[i] == 0) {
ret = TRUE;
break;
}
}
x[n-1] = watchdog;
return ret;
}
```

I'm not sure what you mean by that, but I'd love you to show me how you can optimize assembler like what I posted.

btw, this is not true :

>> the watchdog suggestion by ozo reduced it to one compare.

because you still need a compare for the position at the end. Plus an extra compare at the beginning for n. That's 3 compares.

Btw, the code you posted will not work for several reasons :

1) what if there's no 0 at all in any memory location after the beginning of the array ? You'll have a pretty long loop that might even crash the program.

2) what if there isn't a 0 inside the array, but there is one AFTER the array ? You'll return TRUE anyway.

As I mentioned in one of my previous posts, the most efficient code on i386 for this task is

REPNE SCASD

together with some initialization and termination code.

I have posted optimized code that can be compiled directly to code that uses only one conditional jump in the loop (plus three conditional jumps outside the loop)..

```
int LookForOneWatchdog(int n, int x[]) {
int i;
if (n == 0) return (FALSE);
int watchdog = x[n-1];
x[n-1] = 0; // forgot this line
int ret = FALSE;
for (i = 0; ; i++) {
if (x[i] == 0) {
ret = TRUE;
break;
}
}
x[n-1] = watchdog;
return ret;
}
int LookForOneWatchdogOptimized(int n, int x[]) {
int i;
if (n == 0) return (FALSE);
if (x[n-1] == 0) return (TRUE);
int watchdog = x[n-1];
x[n-1] = 0;
int ret = FALSE;
for (i = 0; ; i++) {
if (x[i] == 0) {
break;
}
}
x[n-1] = watchdog;
return (i < (n - 1)) ? TRUE : FALSE;
}
```

In other words : it will always return TRUE now, unless n == 0. That doesn't meet the requirements either ;) You need to add a few more compares.

And what if the value at n-1 is already 0 ?

No matter how you look at it, adding a watchdog, will get you MORE compares, because you need to check for more special cases.

>> I have posted optimized code that can be compiled directly to code that uses only one conditional jump in the loop (plus three conditional jumps outside the loop)..

But the overall performance will most likely still be worse than the original code.

Anyway ... let's stop here ... I'll be waiting for Koraxen's feedback.

I'll be waiting for the author'S feedback, too.

So the statement is trivial, it might give you a way to optimize: add new entries to the array so that it keeps always sorted.

#define BLOCK_SIZE 64

void insertToArray(int ** parr, int * ps, int n, int i)

{

int * p;

int b, e, m;

int s = *ps;

if (n >= s)

{

p = (int*)malloc((s+BLOCK_SIZE

if (s > 0)

{

memcpy(p, *parr, s*sizeof(int));

free(*parr);

}

memset(p+s, 0, BLOCK_SIZE*sizeof(int));

*parr = p;

*ps = s + BLOCK_SIZE;

n = s;

}

p = *parr;

b = 0;

e = n-1;

while (b <= e)

{

m = (b + e)>>1; // middle

if (i < *(p + m))

e = m - 1;

else

b = m + 1;

}

/* now b is the pos to insert */

if (n > b) /* use memmove for overlapping copy */

memmove(p+b+1, p + b, (n - b)*sizeof(int));

*(p + b) = i;

}

Note, if you have negative integers you may do a binary search at the sorted array (look at the while loop) to check for 0 entry.

Regards, Alex

The question doesn't specify when this function is called, or what architecture it will be used on. The only requirement was to get around the conditional. I assume multi-threading was not an option either.

Again, thank you.

I still don't get the point though ... Was this an assignment ? If so, you should have said so from the beginning, and we could have helped you a lot better ...

Regards, Alex

http://en.wikipedia.org/wiki/Grover's_algorithm

Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.

Interesting question. :)

You're going to have to do at least two compares. One for the loop iterator, and another for the actual data.

But a bit of cleverness can reduce the number of loop iterators that you need to compare in exchange for slightly more programming.

Assume that you know the data well enough that you expect the loop to iterate over dozens (or even hundreds) of objects. The conventional loop looks like this:

int FindZero (int *Array, int *Length)

{

int idx;

for (idx = 0; idx < Length; ++idx)

if (Array[idx] == 0)

return idx;

return -1

}

You may get better performance by checking multiple items per iteration.

int FindZero (int *Array, int *Length)

{

int idx;

int Limit;

int *Item;

// Process most of the array in 8 word chunks

Limit = Length && ~3;

Item = Array;

for (idx = 0; idx < Limit; idx += 8)

{

if (*(Item++) == 0)

return Limit;

if (*(Item++) == 0)

return Limit + 1;

if (*(Item++) == 0)

return Limit + 2;

if (*(Item++) == 0)

return Limit + 3;

if (*(Item++) == 0)

return Limit+ 4;

if (*(Item++) == 0)

return Limit + 5;

if (*(Item++) == 0)

return Limit + 6;

if (*(Item++) == 0)

return Limit + 7;

}

// Check the items after the last 8 word chunk

for (idx = Limit; idx < Length; ++idx)

if (Array[idx] == 0)

return idx;

return -1

}

And there may be minor improvements in this by consistently using array notation or consistently dereferencing the array as a pointer, but this gives you the ida.

Good Luck,

Kent