# Can you make this code faster?

Hi Experts,

In a particular calculation that I do, I have three strings (composed of 0s and 1s) that I use to produce a number. All 3 strings (fp1, fp2 and fp3) are of the same length. I compare the 0s and 1s along the length of the string. If fp3 is "1" and either fp1 or fp2 is "1", add one to nA.
If all three strings have "1" at a particular position, add one to nB.
Return value is (double)nB / nA *100

(see post http://www.experts-exchange.com/Programming/Programming_Languages/Cplusplus/Q_20738573.html for some previous history, although the way I'm doing things has changed slightly since then).

In the code below, SimCalc works and gives me the result I require.

I'm having issues now with the lengths of my strings - they are in the tens of thousands in length, and so I've been trying to compress the initial strings with the following sequence of code, in attempt to save some storage space on my hard drive, and also in an attempt to speed things up. The following code works, but I've also suceeded in slowing down the calculation several fold in the function SimCalc2.

I include below my attempt at compression, with some explanation throughout the code of what I'm trying to do.

//CUT HERE ============================================
#include <iostream>
#include <string>
#include <iomanip>
#include <windows.h>            //for GetTickCount function
using namespace std;

double SimCalc(string& fp1, string& fp2, string& fp3);  //prototypes
unsigned char string2char(string& binstring);
string CompressString(string& tfp);
int numbits (unsigned char b);
double SimCalc2(string& fp1, string& fp2, string& fp3, int* pCS, string& fstring);

void main()
{
string fp1 = "00110010111010010010100111110011";
string fp2 = "00001101111010110100100110100101";
string fp3 = "11101001100001101001100110000111";

int maxit = 10000;      //max iterations for timing
double n = 0.0;
unsigned long tic = GetTickCount();
for ( int k = 0; k<maxit; k++) {
n= SimCalc( fp1, fp2, fp3);      //standard calculation
//strings not compressed
}
unsigned long tic2 = GetTickCount();
cout << "Took simcalc  " << (tic2 -tic) <<" ms for " << maxit << " iterations\n";
cout << "Answer is "<< n << " " << fp1.length() << " units long\n\n";

//building a string to contain all unsigned chars - use this in SimCalc2
string fstring = "";
for (int z =0; z<= 255;++z) {
unsigned char w = z;
fstring += w;
}

//Compress all three strings - see function CompressString for details.
string fp1c = CompressString(fp1);
string fp2c = CompressString(fp2);
string fp3c = CompressString(fp3);

//create array to store the number of bits set in an unsigned char - used
//in SimCalc2
int* pCS = new int[255];
for (int x =0; x<=255;++x) {
*(pCS + x) = numbits(x);
}

//perform SimCalc2 calc with new compressed strings
n = 0.0;
tic = GetTickCount();
for ( k = 0; k<maxit; k++) {
n= SimCalc2(fp1c, fp2c, fp3c, pCS, fstring);
}
tic2 = GetTickCount();
cout << "Took simcalc2 " << (tic2 -tic) <<" ms for " << maxit << " iterations\n";
cout << "Answer is "<< n << " " << fp1c.length() << " units long\n\n";
}

double SimCalc(string& fp1, string& fp2, string& fp3)
{
const char* p1 = fp1.c_str();  //pointer directly to data
const char* p2 = fp2.c_str();
const char* p3 = fp3.c_str();

int nA= 0;
int nB= 0;
int nLen= fp1.length();
for ( int j=0; j< nLen; ++j, ++p1, ++p2, ++p3 ) {
if (*p3 == '1')      {
if ((*p1 | *p2)== '1') {
++nA;
if ((*p1 + *p2) == ('1'+'1')){
++nB;
}
}
}
}
if (nA == 0) return 0.0;
return( ((double)nB / nA) *100 );

}
unsigned char string2char(string& binstring)
{
//converts a strong of 0s and 1s (8 in length)
//to an unsigned char
int x = 128;
int result = 0;
string BS = "";
for (int i = 0; i <8; ++i, x /= 2) {
BS = binstring.substr(i, 1);
if (BS == "1") {
result += x;
}
}
unsigned char crap = result;
return result;
}

string CompressString(string& tfp)
{
//takes a strings of 0s and 1s and changes it,
//eight characters at a time to a string of unsigned
//chars
string result = "";
while (tfp.length() >7)      {
string temp = tfp.substr(0,8);
result += string2char(temp);
tfp = tfp.substr(8);
}
return result;
}

int numbits (unsigned char b)
{
//counts number of bits set in an unsigned char
int result;
++result;
}
}
return result;
}

double SimCalc2(string& fp1, string& fp2, string& fp3, int* pCS, string& fstring)
{
int nA= 0;
int nB= 0;
int p31, p32;
unsigned char p1, p2, p3;
for ( int j=0; j< fp1.length(); ++j)      {
//following gives me index of the character
p1 = fstring.find(fp1.substr(j,1));
p2 = fstring.find(fp2.substr(j,1));
p3 = fstring.find(fp3.substr(j,1));

p31 = (p3 & p1);
p32 = (p3 & p2);
//link back to pCS array, to give number of bits set
nA += *(pCS + (p31 | p32));
nB += *(pCS + (p32 & p31));
}
if (nA == 0) return 0.0;
return( ((double)nB / nA) *100 );

}
//CUT TO HERE   ===============================================

Would love your advice on how I can speed up the SimCalc2 calculation, or the way I'm trying to do things.
###### Who is Participating?

x
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

Commented:
hmmm, remind me, why weren't you using std::bitset or something like that but strings instead?  'cause, that would save a lot of space without extra code

-- ba
Author Commented:
ba-

Am pretty new to C++  - first time I've of heard of bitset. Will have to read up on it.
Commented:

std::bitset stores the data in a special structure where you can access individual bits.  you can store 8 bits in a single char, which means your storage requirement is going to be 1/8th of the original.  furthermore, std::bitset is typically optimized nicely

the only drawback is, std::bitset is a static structure, i.e. you can't change its size during runtime.  for that, i suggest using boost's dynamic_set:

http://www.boost.org/libs/dynamic_bitset/dynamic_bitset.html

one final note.  you can achieve significant speeds gain by building a pre-calculated table for comparing string snippets instead of individual bits.  consider this simple example:  let's say we have only two bitsets, each with 16 bits.  let's say, we want to calculate the mismatch between them ( i.e. whenever the corresponding bit in each bitset is different, we want to increase the distance by 1 ).  this is a simplified version of your problem.  then, you can simply divide the bitsets into 4 snippets ( each containing 4 bits )

it gets a tad more complex after this.  as these 4 bit snippets are nothing but binary numbers, you can actually convert them to their decimal equivalents.  that means for each unique bit combination, there is a unique decimal number that can be used as an index.  once you have these indices, you can pre-calculate a difference table once at the very beginning and then compare strings snippet by snippet instead of bit by bit

let me know if you are interested in this.  i can try to change the code i already have to suit your distance function

-- ba

[ about boost.org :: boost.org is an organization supported by many c++ standards committee members and provides 100% free, peer-reviewed, cross-platform libraries.  many of the boost libraries, such as their smart pointer library, are already in the drafts of the next revision of the c++ standard ]
Commented:
It helps to map these things out.  There are 8 possibilities at each position:

fp1 fp2 fp3 action
--- --- --- ---------
0   0   0  (nothing)
1   0   0  (nothing)
0   1   0  (nothing)
1   1   0  (nothing)
0   0   1  (nothing)
1   0   1  nA++
0   1   1  nA++
1   1   1  nA++, nB++

Inspection shows that
if fp3 is 0, nothing happens so if we AND against that, the results will be 0
if we OR fp1 with fp2 we will get a 1 everywhere we want to increment nA
If we AND all three, we will get a 1 everywhere we want to increment nB

This boils down nicely to:
If  (fp1 |  fp2)  &  fp3  is non-zero, nA++
If  (fp1 & fp2)  &  fp3  is non-zero, nB++

By using straight binary operations like this, we can process 32 bits at a time, which offers fantastic speed advantage over examining each character of three string individually.   So the solution is to take the incomming string composted of '1' and '0' in groups of 32 and convert into a 32-bit binary value.  Then do the ORs and ANDs, then count up the results.  Here is a sanity check on the idea:

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

void main()
{
DWORD n1= strtoul( "11111111111111111111110111111011", 0, 2 );  // base 2
DWORD n2= strtoul( "00000000000010001000000001000000", 0, 2 );  // base 2
DWORD n3= strtoul( "11111111111111111111111111111111", 0, 2 );  // base 2

DWORD A= (n1 | n2) & n3;
DWORD B= (n1 & n2) & n3;

int nA= 0, nB= 0;
DWORD nBit= 1;
for (int j=0; j<32; j++ ) {
if (A & nBit ) nA++;
if (B & nBit ) nB++;
nBit <<= 1;  // move over to next position
}
printf( "nA=%d nB=%d\r\n", nA, nB );
}

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
As to your storage problem... it now becomes trivial...
Read in the strings and convert them to binary by taking 32 characters at a time (the strtoul library function will serve nicely) then output the binary values.  The resulting file will be one-eigth the size of the original.  During your conversion, if you reach the end of the file and have less than 32 characters, append a series of '0's to the end until it is 32 long.

Once the files are binary, you will just read the file in 4-byte chunks (directly into DWORD values) and use the OR and AND sequences shown to accumulate nA and nB.  Actually, for performance, you should probably read all three files into memory and just use DWORD pointers to access the data.

Incidently, this lends itself to an additional -- quite significant -- speedup.  The Pentium has a set of instructions that process 128-bit chunks like this -- the MMX and SSE opcodes.  In fact, they are designed for exactly this type of operation.  Got 500 points ? :)

-- Dan
Author Commented:
Dan, ba...

While I've been learning about bitsets, I actually figured out that formula.. honest gov!

double SimCalc3(bitset<bsLength>& p1, bitset<bsLength>& p2, bitset<bsLength>& p3)
{
int nA = (p3 & (p1 | p2)).count();
int nB = (p3 & (p1 & p2)).count();
if (nA == 0) return 0.0;
return( ((double)nB / nA) *100 );

}

In anycase, I'll have to have a look at what you've both suggested over the next few days, to make sure I understand the new stuff, and am able to play with it a little. I've only got an extra 5 points at the moment, as soon as I join, I'll post for the 500 point solution!
Commented:
> "I'll have to have a look at what you've both suggested over the next few days..."

actually, what dan posted is an implementation of what i suggested ( yet, without using stl as is expected from dan ;-) ).  because he went through the trouble of showing implementation examples, he should get the points

-- ba
Commented:
I was joking about the points.  I love this kind of problem :)
Commented:
I tried to speed up your program by changing only SimCalc2. My last version was as fast as SimCalc1 (16 ms / 10000).

double SimCalc2(string& fp1, string& fp2, string& fp3, int* pCS, string& fstring)
{
int nA= 0;
int nB= 0;
int p31, p32;
unsigned char p1, p2, p3;
const char* pp1 = fp1.c_str();
const char* pp2 = fp2.c_str();
const char* pp3 = fp3.c_str();
for ( int j=0; j< fp1.length(); ++j)
{
/*
//following gives me index of the character
p1 = fstring.find(fp1.substr(j,1));
p2 = fstring.find(fp2.substr(j,1));
p3 = fstring.find(fp3.substr(j,1));

p31 = (p3 & p1);
p32 = (p3 & p2);
//link back to pCS array, to give number of bits set
nA += *(pCS + (p31 | p32));
nB += *(pCS + (p32 & p31));
*/
p1 = pp1[j];
p2 = pp2[j];
p3 = pp3[j];

unsigned int t = 0;
for (int i= 0; i < 8; i++)
{
t = 1<<i;
if ((p3 & t))
{
if ((p1 & t) || (p2 & t))
{
nA++;
if ((p1 & t) && (p2 & t))
{
nB++;
}
}
}
}
}
if (nA == 0) return 0.0;
return( ((double)nB / nA) *100 );

}

The point is to avoid any string member functions.

Hope, that helps

Alex

BTW, you could speed up CompressString same way but it runs only once here.

Commented:
I changed my last approach from bytes to integers and got it to 0ms for 10000 iterations.

Took simcalc  15 ms for 10000 iterations
Answer is 45.4545 32 units long

Took simcalc2 469 ms for 10000 iterations
Answer is 45.4545 4 units long

------------------ code using cfpx[j] in SimCalc2 --------------------
J:\test\virt\Debug>virt
Took simcalc  16 ms for 10000 iterations
Answer is 45.4545 32 units long

Took simcalc2 63 ms for 10000 iterations
Answer is 45.4545 4 units long

------------------ code using const char* in loop (see last answer) --------------------
J:\test\virt\Debug>virt
Took simcalc  16 ms for 10000 iterations
Answer is 45.4545 32 units long

Took simcalc2 16 ms for 10000 iterations
Answer is 45.4545 4 units long

----------------- last code (see below) -----------------------------------------------------

J:\test\virt\Debug>virt
Took simcalc  16 ms for 10000 iterations
Answer is 45.4545 32 units long

Took simcalc2 0 ms for 10000 iterations
Answer is 45.4545 4 units long

double SimCalc2(string& fp1, string& fp2, string& fp3, int* pCS, string& fstring)
{
int nA= 0;
int nB= 0;
// int p31, p32;
// unsigned char   p1, p2, p3;
unsigned int    u1, u2, u3;
const char*     pp1 = fp1.c_str();
const char*     pp2 = fp2.c_str();
const char*     pp3 = fp3.c_str();
const unsigned* pu1 = (const unsigned*)pp1;
const unsigned* pu2 = (const unsigned*)pp2;
const unsigned* pu3 = (const unsigned*)pp3;

for ( int j=0; j< fp1.length(); j += 4)
{
/*
//following gives me index of the character
p1 = fstring.find(fp1.substr(j,1));
p2 = fstring.find(fp2.substr(j,1));
p3 = fstring.find(fp3.substr(j,1));

p31 = (p3 & p1);
p32 = (p3 & p2);
//link back to pCS array, to give number of bits set
nA += *(pCS + (p31 | p32));
nB += *(pCS + (p32 & p31));
p1 = pp1[j];
p2 = pp2[j];
p3 = pp3[j];
*/
u1 = pu1[j/4];
u2 = pu2[j/4];
u3 = pu3[j/4];

unsigned int t = 0;
for (int i= 0; i < 32; i++)
{
t = 1<<i;
if ((u3 & t))
{
if ((u1 & t) || (u2 & t))
{
nA++;
if ((u1 & t) && (u2 & t))
{
nB++;
}
}
}
}
}
if (nA == 0) return 0.0;
return( ((double)nB / nA) *100 );

}

So, if your strings have a length that is a multiple of 32 you might take this code.

Regards, Alex
Commented:
itsmeandnobodyelse,
>>got it to 0ms for 10000 iterations
The resolution of the GetTickCount()-based timer seems to be about 15ms or 16ms on many machines.  And any process that runs for more than 30ms is likely to be interrupted by a task switch one or many times.

If you ever get a time of 0ms, that means you need to increase th size of your test run by (at least) an order of magitude.

-- Dan
Commented:
not to mention, of course, the test is highly dependent on the test program itself.  depending on how you are testing stuff, the compiler might be actually optimizing certain parts away ( although they won't be optimized away in the actual code ).  also, i'd suggest testing with at least a million if not 10 million

oh, and, when used properly, std::string functions are as fast as their counterparts ( had to say it ;-) )

-- ba
Commented:
Dan,
you are right. i never got a resolution greater 0 and less than 15 ms.

Increasing to 1000000 gives time numbers like these:

Took simcalc  688 ms for 1000000 iterations
Answer is 45.4545 32 units long

Took simcalc2 844 ms for 1000000 iterations
Answer is 45.4545 4 units long

So, optimized compressed solution was about 20 % slower than non-optimized uncompressed one (on my machine).

double SimCalc2(string& fp1, string& fp2, string& fp3) //, int* pCS, string& fstring)
{
static unsigned  t[32] =
{   1,     2,     4,     8,     16,    32,    64,    128,   256,   512,   1024,  2048,  4096,  8192,  16384, 32768,
1<<16, 1<<17, 1<<18, 1<<19, 1<<20, 1<<21, 1<<22, 1<<23, 1<<24, 1<<25, 1<<26, 1<<27, 1<<28, 1<<29, 1<<30, 1<<31,
};

int nA= 0;
int nB= 0;
unsigned int    u1, u2, u3;
const char*     pp1 = fp1.c_str();
const char*     pp2 = fp2.c_str();
const char*     pp3 = fp3.c_str();
const unsigned* pu1 = (const unsigned*)pp1;
const unsigned* pu2 = (const unsigned*)pp2;
const unsigned* pu3 = (const unsigned*)pp3;

int len = fp1.length() / 4;
for ( int j=0; j< len; j++)
{

u3 = pu3[j];
u1 = pu1[j];
u2 = pu2[j];

unsigned* pt = t;
for (int i= 0; i < 32; i++, pt++)
{
if (u3 & *pt)
{
if ((u1 & *pt) || (u2 & *pt))
nA++;
{
if ((u1 & *pt) && (u2 & *pt))
nB++;
}
}
}
}
if (nA == 0) return 0.0;
return( ((double)nB / nA) *100 );

}

That gives these times

Took simcalc  671 ms for 1000000 iterations
Answer is 45.4545 32 units long

Took simcalc2 672 ms for 1000000 iterations
Answer is 45.4545 4 units long

Regards, Alex

Author Commented:
Been working away and impementing your various suggestions - these are the timings I get on my laptop for 10 million iterations, and orginal strings 32 in length.

That took simcalc  (ORIGINAL) 5888 ms
Answer is 28.5714 32 units long

That took simcalc3 (bitsets)3295 ms

That took simcalc4 (ALEX) 5478 ms

That took simcalc5 (DAN)4436 ms

So, for the time being, bitsets look the best. I create them from the orginal strings just once with:

bitset<bsLength> bs1(fp1);
bitset<bsLength> bs2(fp2);
bitset<bsLength> bs3(fp3);
where bsLength is 32

and then the following function is called..

double SimCalc3(bitset<bsLength>& p1, bitset<bsLength>& p2, bitset<bsLength>& p3)
{
int nA = (p3 & (p1 | p2)).count();
int nB = (p3 & (p1 & p2)).count();
if (nA == 0) return 0.0;
return( ((double)nB / nA) *100 );

}

Going to keep working away on this for a few more days, as I've still got plenty to learn, and I've also got to download the boost stuff. Thanks for hte help guys - I'll probably post back after the holiday weekend....
Commented:
I stole the algorithm from bitset::count and voilĂ

Took simcalc  344 ms for 1000000 iterations  Origin
Answer is 45.4545 32 units long

Took simcalc2 281 ms for 1000000 iterations  My best
Answer is 45.4545 4 units long

Took simCalc3 62 ms for 1000000 iterations   bitset

Took simCalc4 282 ms for 1000000 iterations  Dan DWORD only
Answer is 45.4545 1 DWORD32 bits

Took simCalc5 62 ms for 1000000 iterations   Dan's + count from bitset
Answer is 45.4545 1 DWORD and count from bitset

The last code was:

double SimCalc3(bitset<bsLen>& p1, bitset<bsLen>& p2, bitset<bsLen>& p3)
{
int nA = (p3 & (p1 | p2)).count();
int nB = (p3 & (p1 & p2)).count();
if (nA == 0) return 0.0;
return( ((double)nB / nA) *100 );

}

double SimCalc4(DWORD d1, DWORD d2, DWORD d3)
{
DWORD da= (d1 | d2) & d3;
DWORD db= (d1 & d2) & d3;

int nA= 0, nB= 0;
DWORD nBit= 1;
for (int j=0; j<32; j++ )
{
if (da & nBit ) nA++;
if (db & nBit ) nB++;
nBit <<= 1;  // move over to next position
}
if (nA == 0) return 0.0;
return( ((double)nB / nA) *100 );

}

double SimCalc5(DWORD d1, DWORD d2, DWORD d3)
{
DWORD da= (d1 | d2) & d3;
DWORD db= (d1 & d2) & d3;

int nA = 0;
int nB = 0;
for (unsigned long _X = da, _Y = db; _X != 0; _X >>= 4, _Y >>=4)
{
nA += "\0\1\1\2\1\2\2\3"
"\1\2\2\3\2\3\3\4"[_X & 0xF];
nB += "\0\1\1\2\1\2\2\3"
"\1\2\2\3\2\3\3\4"[_Y & 0xF];
}

if (nA == 0) return 0.0;
return( ((double)nB / nA) *100 );

}

it's very tricky using half words that have the bit count statically.
I never saw such thing like "ABCD"[i].

Regards, Alex

Commented:
Sorry, half bytes (4 bits) not half words

Regards, Alex
Commented:
clever trick.  I think it could just as easily be written:

BYTE OnesInNybble[16] = {0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4};
....
nA +=  OnesInNybble[ _X & 0xF ];
...etc...

But the authors of STL are the Kings Of Obfuscation.

Presumably a 256-byte table could be used to make that part of it twice as fast -- count 'em up 8-bits at a time, like:
nA +=  OnesInByte[ _X & 0xFF ];

-- Dan
Commented:
Here it is:

Took simcalc  328 ms for 1000000 iterations
Answer is 45.4545 32 units long

Took simcalc2 281 ms for 1000000 iterations
Answer is 45.4545 4 units long

Took simCalc3 63 ms for 1000000 iterations

Took simCalc4 312 ms for 1000000 iterations
Answer is 45.4545 1 DWORD32 bits

Took simCalc5 47 ms for 1000000 iterations
Answer is 45.4545 1 DWORD and count from bitset using bytes

The code:

double SimCalc5(DWORD d1, DWORD d2, DWORD d3)
{
static BYTE bits[256] =
{   0, 1, 1, 2, 1, 2, 2, 3,
1, 2, 2, 3, 2, 3, 3, 4,
1, 2, 2, 3, 2, 3, 3, 4,
2, 3, 3, 4, 3, 4, 4, 5,
1, 2, 2, 3, 2, 3, 3, 4,
2, 3, 3, 4, 3, 4, 4, 5,
2, 3, 3, 4, 3, 4, 4, 5,
3, 4, 4, 5, 4, 5, 5, 6,
1, 2, 2, 3, 2, 3, 3, 4,
2, 3, 3, 4, 3, 4, 4, 5,
2, 3, 3, 4, 3, 4, 4, 5,
3, 4, 4, 5, 4, 5, 5, 6,
2, 3, 3, 4, 3, 4, 4, 5,
3, 4, 4, 5, 4, 5, 5, 6,
3, 4, 4, 5, 4, 5, 5, 6,
4, 5, 5, 6, 5, 6, 6, 7,
1, 2, 2, 3, 2, 3, 3, 4,
2, 3, 3, 4, 3, 4, 4, 5,
2, 3, 3, 4, 3, 4, 4, 5,
3, 4, 4, 5, 4, 5, 5, 6,
2, 3, 3, 4, 3, 4, 4, 5,
3, 4, 4, 5, 4, 5, 5, 6,
3, 4, 4, 5, 4, 5, 5, 6,
4, 5, 5, 6, 5, 6, 6, 7,
2, 3, 3, 4, 3, 4, 4, 5,
3, 4, 4, 5, 4, 5, 5, 6,
3, 4, 4, 5, 4, 5, 5, 6,
4, 5, 5, 6, 5, 6, 6, 7,
3, 4, 4, 5, 4, 5, 5, 6,
4, 5, 5, 6, 5, 6, 6, 7,
4, 5, 5, 6, 5, 6, 6, 7,
5, 6, 6, 7, 6, 7, 7, 8,
};

DWORD da= (d1 | d2) & d3;
DWORD db= (d1 & d2) & d3;

int nA = 0;
int nB = 0;
for (unsigned long _X = da, _Y = db; _X != 0; _X >>= 8, _Y >>=8)
{
nA += bits[_X & 0xFF];
nB += bits[_Y & 0xFF];
}

if (nA == 0) return 0.0;
return( ((double)nB / nA) *100 );

}

You see, it is 25% better.

Regards, Alex

Regards, Alex

Experts Exchange Solution brought to you by

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Commented:
very simple suggestion , as I haven't  seen it suggested yet

have you changed your project setting from Debug to Release. Under the Build menu select Set Active Configuration.

Steve

Author Commented:
Thanks for everybodys efforts. Here is the final code that I'm going to use:

//CUT HERE==========================================================
#include <iostream>
#include <string>
#include <windows.h>            //for GetTickCount function
#include <vector>

using namespace std;

double SimCalc(string& fp1, string& fp2, string& fp3);
double SimCalc9(vector<DWORD>& d1, vector<DWORD>& d2, vector<DWORD>& d3);

void main()
{
string fp1l = "1001010010101100100100101010010111001";
string fp2l = "0100101001010110010010010101001011100";
string fp3l = "1001000101011001001001010100101110010";
//add zeros to end of strings to get a multiple of
//32 characters in length.
int theSize = fp1l.length() / 32;
int numtoadd = (32 - (fp1l.length() % 32));
if (fp1l.length() % 32 != 0)
{
++theSize;
int tempint = fp1l.length() / 32;
for (int q = 0; q< numtoadd; ++q)
{
fp1l += "0";
fp2l += "0";
fp3l += "0";
}
}

vector<DWORD> vfp1l(theSize);
vector<DWORD> vfp2l(theSize);
vector<DWORD> vfp3l(theSize);

for (int z = 0; z<(fp1l.length() / 32); ++z)
{
string temp1 = fp1l.substr(z*32, 32);
string temp2 = fp2l.substr(z*32, 32);
string temp3 = fp3l.substr(z*32, 32);

vfp1l[z] = strtoul(temp1.c_str(), 0, 2);
vfp2l[z] = strtoul(temp2.c_str(), 0, 2);
vfp3l[z] = strtoul(temp3.c_str(), 0, 2);
}
int maxit = 10000000;

double n = 0.0;
unsigned long tic = GetTickCount();
for ( int k = 0; k<maxit; k++) {
n= SimCalc9(vfp1l, vfp2l, vfp3l);
}
unsigned long tic2 = GetTickCount();
cout << "That took simcalc9 (dword+stolen count+256+vector)" << (tic2 -tic) <<" ms \n";
cout << "Answer is "<< n << " " << vfp1l.size() << endl<<endl;

tic = GetTickCount();
for ( k = 0; k<maxit; k++) {
n= SimCalc( fp1l, fp2l, fp3l);
}
tic2 = GetTickCount();
cout << "That took simcalc  (ORIGINAL)  " << (tic2 -tic) <<" ms \n";
cout << "Answer is "<< n << " " << fp1l.length() << " units long\n\n";
}
double SimCalc(string& fp1, string& fp2, string& fp3)
{
const char* p1 = fp1.c_str();
const char* p2 = fp2.c_str();
const char* p3 = fp3.c_str();

int nA= 0;
int nB= 0;
int nLen= fp1.length();
for ( int j=0; j< nLen; ++j, ++p1, ++p2, ++p3 ) {
if (*p3 == '1')      {
if ((*p1 | *p2)== '1') {
++nA;
if ((*p1 + *p2) == ('1'+'1')){
++nB;
}
}
}
}
if (nA == 0) return 0.0;
return( ((double)nB / nA) *100 );

}
double SimCalc9(vector<DWORD>& d1, vector<DWORD>& d2, vector<DWORD>& d3)
{
static BYTE bits[256] =
{   0, 1, 1, 2, 1, 2, 2, 3,
1, 2, 2, 3, 2, 3, 3, 4,
1, 2, 2, 3, 2, 3, 3, 4,
2, 3, 3, 4, 3, 4, 4, 5,
1, 2, 2, 3, 2, 3, 3, 4,
2, 3, 3, 4, 3, 4, 4, 5,
2, 3, 3, 4, 3, 4, 4, 5,
3, 4, 4, 5, 4, 5, 5, 6,
1, 2, 2, 3, 2, 3, 3, 4,
2, 3, 3, 4, 3, 4, 4, 5,
2, 3, 3, 4, 3, 4, 4, 5,
3, 4, 4, 5, 4, 5, 5, 6,
2, 3, 3, 4, 3, 4, 4, 5,
3, 4, 4, 5, 4, 5, 5, 6,
3, 4, 4, 5, 4, 5, 5, 6,
4, 5, 5, 6, 5, 6, 6, 7,
1, 2, 2, 3, 2, 3, 3, 4,
2, 3, 3, 4, 3, 4, 4, 5,
2, 3, 3, 4, 3, 4, 4, 5,
3, 4, 4, 5, 4, 5, 5, 6,
2, 3, 3, 4, 3, 4, 4, 5,
3, 4, 4, 5, 4, 5, 5, 6,
3, 4, 4, 5, 4, 5, 5, 6,
4, 5, 5, 6, 5, 6, 6, 7,
2, 3, 3, 4, 3, 4, 4, 5,
3, 4, 4, 5, 4, 5, 5, 6,
3, 4, 4, 5, 4, 5, 5, 6,
4, 5, 5, 6, 5, 6, 6, 7,
3, 4, 4, 5, 4, 5, 5, 6,
4, 5, 5, 6, 5, 6, 6, 7,
4, 5, 5, 6, 5, 6, 6, 7,
5, 6, 6, 7, 6, 7, 7, 8,
};

DWORD da, db;

int nA = 0;
int nB = 0;
for (int z = 0; z<d1.size();++z)
{
da= (d1[z] | d2[z]) & d3[z];
db= (d1[z] & d2[z]) & d3[z];
for (unsigned long _X = da, _Y = db; _X != 0; _X >>= 8, _Y >>=8)
{
nA += bits[_X & 0xFF];
nB += bits[_Y & 0xFF];
}
}
if (nA == 0) return 0.0;
return( ((double)nB / nA) *100 );

}
//CUT TO HERE=====================================================

Finally, could someone explain to me, in English, what the line..

nA += bits[_X & 0xFF];

is actually doing?

Commented:
>Finally, could someone explain to me, in English, what the line..
>
>      nA += bits[_X & 0xFF];
>
>is actually doing?

0xFF means all bits set of a byte. By making (_X & 0xFF) we strip down a 4byte unsigned long word to the last byte (or lowest byte of 4)
So, we have a byte that is between 0..255. That byte is used as an index to an array - bits - where the counts of bits of all numbers (byte values) from 0 to 255 is stored, e. g. 7 == bit 0 + bit 1 + bit 2 == 3 bits ==  bits[7] == bits[_X where lowest byte is 7].

By using _X<<8 the loop switches _X 8 bits to the right - filling up zero bits on the left side - that makes the next byte of _X become lowest. At least after 4 loop steps _X is 0 and lthe loop terminates. If one of the higher bytes of _X is 0 the loop terminates earlier.

So, the algorithm is to count the bits of all 4 bytes of a longword by using a predefined count table for all possible bit counts.

Hope, that was clear enough ;-)

Regards, Alex

BTW, i would recommend to split points.

Commented:
itsmeandnobodyelse, very nice work on this question!
s_lakshma, thanks for the points and the grade :)
Author Commented:
I think thats about as close to English as I'm going to get!

I think I understand it now though, enough to maybe use it somewhere else in the future.

Thanks again.
Commented:
Oops, there is a bug!

change

for (unsigned long _X = da, _Y = db; _X != 0; _X >>= 8, _Y >>=8)
to

for (unsigned long _X = da, _Y = db; (_X + _Y)  != 0; _X >>= 8, _Y >>=8)

The first line wouldn't evaluate _Y correctly if one of the highest bytes of _X turns to 0.

Regards, Alex

Author Commented:
Hi Alex,

I had noticed that. However I think its ok in its original form. It is impossible for nB to be incremented without nA being incremented. So doing the test for continuation with _X is good enough.

S
Commented:
>>  could someone explain to me, in English, what the line.., nA += bits[_X & 0xFF];... is actually doing?

I'll give it a try:
Rather than counting up the ones (as in my original code), this system uses a table that says how many 1-bits exist in each possible byte (it is easy to know this in advance).

To learn how many 1s exist in a particular byte, just use its value as an index into the table.

To calculate how many 1s are in a 4-byte DWORD, we can just use each byte as an index into the table and add the looked-up value (number of 1s in that byte) to an accumulator.

To obtain each of four bytes of a DWORD, one at a time, we can AND the DWORD with 0x000000FF and shift it right by 8 bits.  If we do this four times, we will get our hands on four values each of which will be between 0 and 255 inclusive.  That makes them perfect for use with the lookup table.

Incidently, I'm seeing about a 5% speedup by unrolling the loop:

int nA = 0;
int nB = 0;
for ( int z = 0; z<d1.size(); ++z )     {
DWORD da= (d1[z] | d2[z]) & d3[z];
DWORD db= (d1[z] & d2[z]) & d3[z];
nA += bits[ da & 0x000000FF ];
nB += bits[ db & 0x000000FF ];
da >>= 8; db >>= 8;
nA += bits[ da & 0x000000FF ];
nB += bits[ db & 0x000000FF ];
da >>= 8; db >>= 8;
nA += bits[ da & 0x000000FF ];
nB += bits[ db & 0x000000FF ];
da >>= 8; db >>= 8;
nA += bits[ da & 0x000000FF ];
nB += bits[ db & 0x000000FF ];
}
if (nA == 0) return 0.0;
return( ((double)nB / nA) *100 );

-- Dan
Commented:
Finally English and 5% faster.  ;-)

Regards, Alex
Author Commented:
OK, had to do it. I set up the [256*256] byte table, unrolled the loops, and looked at 16 at a time. Following timings for one million iterations, strings 1280 in length.

That took simcalc  (ORIGINAL)  35341 ms
Answer is 65.3846 1280 units long

That took simcalc9 (DAN+stolen count+256+vector)3796 ms

That took simcalc10 (DAN+stolen count+256+vector+unroll)3234 ms