EvilGuru
asked on
Calculating Pi using BBP forumula - optimisations
Hi, currently I am working on a distributed computing project to work out the mathematical constant 'pi'. After taking a look at many examples and formulas for pi I have choson to use a common implementation of the BBP formula. This is a digit extraction formula which is able to compute any digit in Pi without having to know the n-1 digit. However, I have a problem with the script, it is quite slow, each time it is run it returns 9 digits of Pi from the value of n. If n was 1 it would return digits 1-9 . If in my basic test program I set n to be 10,000 it takes around 85 seconds to work out nine digits starting at position 10,000 of pi.
This is a problem for, as when larger values are reached (billions) it could takes weeks of cpu time just to get nine more digits. However, I am quite sure there is room for optimisation in the current formula I use. I a currently not sure which part of the formula is causing the slowdown, as my attempt to add in some debugging code failed. I have also been told by some people that it would be faster if the program used hardware cpu optimisations (sse?), however I am unsure of this. I have a feeling that is it the calculating of the prime numbers that is taking the time but I am still unsure. Here is a link to my current code as I did not want to post it in this topic.
http://www.dcrez.co.uk/create/pi.c
Thank you for any help/advice you can give .
This is a problem for, as when larger values are reached (billions) it could takes weeks of cpu time just to get nine more digits. However, I am quite sure there is room for optimisation in the current formula I use. I a currently not sure which part of the formula is causing the slowdown, as my attempt to add in some debugging code failed. I have also been told by some people that it would be faster if the program used hardware cpu optimisations (sse?), however I am unsure of this. I have a feeling that is it the calculating of the prime numbers that is taking the time but I am still unsure. Here is a link to my current code as I did not want to post it in this topic.
http://www.dcrez.co.uk/create/pi.c
Thank you for any help/advice you can give .
EvilGuru,
http://mathworld.wolfram.com/QuadraticSieve.html
has a better prime checking algorithm... but I don't know how to implement that :(
http://mathworld.wolfram.com/QuadraticSieve.html
has a better prime checking algorithm... but I don't know how to implement that :(
ASKER
Thanks! That got me down a good few seconds here as well. I am currently not sure what is taking up the time in the program, I have an idea that it is the primes so I thought about 'spoon feeding' the program a list of prime numbers in an array, however I am not totally sure how to do that or if there will be a large gain from it.
Surely prime checking is the delay factor
If you want that... look at these
http://primes.utm.edu/nthprime/algorithm.php
http://www.troubleshooters.com/codecorn/primenumbers/primenumbers.htm
If you want that... look at these
http://primes.utm.edu/nthprime/algorithm.php
http://www.troubleshooters.com/codecorn/primenumbers/primenumbers.htm
If you want to spoon feed them...
http://www.nist.gov/dads/HTML/sieve.html
http://www.nist.gov/dads/HTML/sieve.html
C Implementations....
http://primes.utm.edu/links/programs/sieves/Eratosthenes/C_source_code/
http://primes.utm.edu/links/programs/sieves/Eratosthenes/C_source_code/
ASKER
Can anyone help implement that amazing C prime finder into the int is_prime function? I have had a few tries but cant seem to find out which bit actually does it. Thanks for all of your help, it is nothing short of amazing.
You might want to look into how "Super pi" does its calculations. According to its benchmarks file, ti can do ONE MILLION digits of Pi in under 300 seconds, and that's on an ancient 90MHz PC.
ASKER
Currently, using different algorithms I can do around 10 million in about 60 seconds. However, that method is both memory intensive and can not be used in distributed computng (which is what this project is about), so while this method may be slower, it is the only method and so must be optimisted as best as we can make it.
I got:
Position: 10000 Digits: 856672279 Time: 111.81 seconds
Press any key to continue . . .
It is compiled under VC++. I had to change:
#define mul_mod(a,b,m) fmod( (double) a * (double) b, m)
to
#define mul_mod(a,b,m) (int)fmod( (double) a * (double) b, m)
because I was getting loads of warnings but I dont think that was the problem. I am going to use one of my code timing tools on it to get a better idea of where the time is going.
Back in a while.
Paul
Position: 10000 Digits: 856672279 Time: 111.81 seconds
Press any key to continue . . .
It is compiled under VC++. I had to change:
#define mul_mod(a,b,m) fmod( (double) a * (double) b, m)
to
#define mul_mod(a,b,m) (int)fmod( (double) a * (double) b, m)
because I was getting loads of warnings but I dont think that was the problem. I am going to use one of my code timing tools on it to get a better idea of where the time is going.
Back in a while.
Paul
ASKER
Whoever manages to integrates a fast prime system (and allows it to work with numbers up to 2 trillion) into the code will probably be the accepted answer, along with mgh_mgharish who found the great information on primes.
Maybe it's just me, but it seems silly to use an algorithm that is TWO MILLION times slower than another algorithm, just because it's distributable. If it's that slow, it's obvious this is an experiment in distributed processing, not a real attempt to make a fast Pi-finder. So why spend much time on this aspect?
And it's risky to use (double) arithmetic--- with these algorithms you need EXACT integer results, which is something floating-point numbers cannot guarantee.
And it's risky to use (double) arithmetic--- with these algorithms you need EXACT integer results, which is something floating-point numbers cannot guarantee.
ASKER
The accuracy of the formula is 100%, the use of double gives us 15 digits of accuracy when we only need 9. The slowness come from the fact that to work out large amounts huge amounts of ram would be needed (looking in tb here) as well as a cluster of many cpu's. The idea was not to make it fast, but most other forumulas have a headroom of around 17 billion, the idea of this is to make it faster, in which we have found out what is causing the slowdown and are not hoping for integration. Just take a look at pihex, they managed to find the quadillionth bit of Pi in around two years using a few slow 400mhz pc's.
>the use of double gives us 15 digits of accuracy when we only need 9.
You may be right.
But have you considered what happens along these lines:
15 minus 9 is six. So you seem to have 5 (plus or minus one) digits of extra accuracy.
But that means that somewhere between one in ten thousand, to one in a million times, the extra digits are going to be all nines, or all zeroes, plus or minus a bit. Which is going to generate a carry and affect the significant digits. And the significant digits could be anything up to "999999999" to start with, so the last bit of inaccuracy can totally bollix your result.
Even if type double gives you 15 digits of accuracy, or 1000 digits of accuracy, or a MILLION digits, it's not enough for doing truly, really, actually, verifiable, precise integer math.
At least the way I see it.
You may be right.
But have you considered what happens along these lines:
15 minus 9 is six. So you seem to have 5 (plus or minus one) digits of extra accuracy.
But that means that somewhere between one in ten thousand, to one in a million times, the extra digits are going to be all nines, or all zeroes, plus or minus a bit. Which is going to generate a carry and affect the significant digits. And the significant digits could be anything up to "999999999" to start with, so the last bit of inaccuracy can totally bollix your result.
Even if type double gives you 15 digits of accuracy, or 1000 digits of accuracy, or a MILLION digits, it's not enough for doing truly, really, actually, verifiable, precise integer math.
At least the way I see it.
any way the prime is not eating time..
the suspected code is
for(k=1;k<=N;k++) {
t=k;
if (kq >= a) {
do {
t=t/a;
v--;
} while ((t % a) == 0);
kq=0;
}
kq++;
num=mul_mod(num,t,av);
t=((k*2)-1);
if (kq2 >= a) {
if (kq2 == a) {
do {
t=t/a;
v++;
} while ((t % a) == 0);
}
kq2-=a;
}
den=mul_mod(den,t,av);
kq2+=2;
if (v > 0) {
t=inv_mod(den,av);
t=mul_mod(t,num,av);
t=mul_mod(t,k,av);
for(i=v;i<vmax;i++) t=mul_mod(t,a,av);
s+=t;
if (s>=av) s-=av;
}
}
Try commenting this part of the code..
it excecutes in .01 seconds
Thanxs,
Joju
the suspected code is
for(k=1;k<=N;k++) {
t=k;
if (kq >= a) {
do {
t=t/a;
v--;
} while ((t % a) == 0);
kq=0;
}
kq++;
num=mul_mod(num,t,av);
t=((k*2)-1);
if (kq2 >= a) {
if (kq2 == a) {
do {
t=t/a;
v++;
} while ((t % a) == 0);
}
kq2-=a;
}
den=mul_mod(den,t,av);
kq2+=2;
if (v > 0) {
t=inv_mod(den,av);
t=mul_mod(t,num,av);
t=mul_mod(t,k,av);
for(i=v;i<vmax;i++) t=mul_mod(t,a,av);
s+=t;
if (s>=av) s-=av;
}
}
Try commenting this part of the code..
it excecutes in .01 seconds
Thanxs,
Joju
ASKER
grg99, interesting point. To test it I went to http://www.angio.net/pi/bigpi.cgi and typed in "999999" and at point 762 that way found. From this I set the client up to 753 (+9 = 762, right where the nines start) and it gave me the result "870721134", not rounding up the 4 > 5 even though there was a string of nines. Is one of us seeing this wrong?
EvilGuru,
I've been a bit long over this one as I had to reengineer my millisecond timer to use QueryPerformanceFrequency under Windoze. Here's some results:
Timer Time Count Mean(Secs) Name
0 21858061 1 6.106380 pseudo_main
1 3360 870 0.000001 log(2*N)/log(a)
2 2679 870 0.000001 fmod(sum+(double) s/ (double) av,1.0)
3 13135784 2947560 0.000001 k
4 21858043 1 6.106375 a
5 16012 870 0.000005 pow_mod;mul_mod
6 35474 870 0.000011 next_prime
7 13547 6776 0.000001 is_prime
8 9199 870 0.000003 pow_mod
9 21787312 870 0.006996 k2
My timer system records up to 10 timers. The time field is the total time spect. The count is the number of times that bit of code was performed, the mean(Secs) is the mean time spent in the section. The name should be obvious.
Here's the source so you can see what sections I was timing:
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <time.h>
extern "C" {
#include "CDefs.h"
#include "CodeTime.h"
void Idle ( void ) {};
DEBUG;
};
TimerData CodeTimers[NTIMES] = {0};
/* #define HAS_LONG_LONG */
#ifdef HAS_LONG_LONG
#define mul_mod(a,b,m) (( (long long) (a) * (long long) (b) ) % (m))
#else
#define mul_mod(a,b,m) (int)fmod( (double) a * (double) b, m)
#endif
/* returns the inverse of x mod y */
int inv_mod(int x,int y) {
int q,u,v,a,c,t;
u=x;
v=y;
c=1;
a=0;
do {
q=v/u;
t=c;
c=a-q*c;
a=t;
t=u;
u=v-q*u;
v=t;
} while (u!=0);
a=a%y;
if (a<0) a=y+a;
return a;
}
/* returns (a^b) mod m */
int pow_mod(int a,int b,int m)
{
int r,aa;
TESTING(StartTime(8,"pow_m od"););
r=1;
aa=a;
while (1) {
if (b&1) r=mul_mod(r,aa,m);
b=b>>1;
if (b == 0) break;
aa=mul_mod(aa,aa,m);
}
TESTING(StopTime(8););
return r;
}
/* returns true if n is prime */
int is_prime(int n)
{
int r,i;
int rnsr = 1;
TESTING(StartTime(7,"is_pr ime"););
if ((n % 2) == 0) return 0;
r=(int)(sqrt(n));
for(i=3;i<=r&&rnsr==1;i+=2 ) if ((n % i) == 0) rnsr = 0;
TESTING(StopTime(7););
return rnsr;
}
/* returns the prime number immediatly after n */
int next_prime(int n)
{
TESTING(StartTime(6,"next_ prime"););
do {
n++;
} while (!is_prime(n));
TESTING(StopTime(6););
return n;
}
/* finds out nine digits of pi starting from n */
/* will probably be the main function in openmosix capable version*/
int pseudo_main(int n)
{
int av,a,vmax,N,num,den,k,kq,k q2,t,v,s,i ;
double sum;
// double start_time, elap_time;
TESTING(StartTime(0,"pseud o_main");) ;
// start_time = (double) clock ();
N=(int)((n+20)*log(10)/log (2));
sum=0;
TESTING(StartTime(4,"a");) ;
for(a=3;a<=(2*N);a=next_pr ime(a)) {
TESTING(StartTime(1,"log(2 *N)/log(a) "););
vmax=(int)(log(2*N)/log(a) );
av=1;
for(i=0;i<vmax;i++) av=av*a;
TESTING(StopTime(1););
s=0;
num=1;
den=1;
v=0;
kq=1;
kq2=1;
TESTING(StartTime(9,"k2"); );
for(k=1;k<=N;k++) {
TESTING(StartTime(3,"k");) ;
t=k;
if (kq >= a) {
do {
t=t/a;
v--;
} while ((t % a) == 0);
kq=0;
}
kq++;
num=mul_mod(num,t,av);
t=(2*k-1);
if (kq2 >= a) {
if (kq2 == a) {
do {
t=t/a;
v++;
} while ((t % a) == 0);
}
kq2-=a;
}
den=mul_mod(den,t,av);
kq2+=2;
if (v > 0) {
t=inv_mod(den,av);
t=mul_mod(t,num,av);
t=mul_mod(t,k,av);
for(i=v;i<vmax;i++) t=mul_mod(t,a,av);
s+=t;
if (s>=av) s-=av;
}
TESTING(StopTime(3););
}
TESTING(StopTime(9););
TESTING(StartTime(5,"pow_m od;mul_mod "););
t=pow_mod(10,n-1,av);
s=mul_mod(s,t,av);
TESTING(StopTime(5););
TESTING(StartTime(2,"fmod( sum+(doubl e) s/ (double) av,1.0)"););
sum=fmod(sum+(double) s/ (double) av,1.0);
TESTING(StopTime(2););
}
TESTING(StopTime(4););
//elap_time = ((double) clock () - (double) start_time) / CLOCKS_PER_SEC;
TESTING(StopTime(0););
printf("Position: %9d Digits: %09d\n",n,(int)(sum*1e9));
return 0;
}
/* calls up the pseudo_main function to find 9 digits at a time */
int main (int argc,char *argv[])
{
int position=1,counter=1000; //counter is the digit of pi to start from.
// int number;
// double start_time, elap_time;
pseudo_main(counter);
DumpTimes();
//Pause Code
//system ("pause"); // execute M$-DOS' pause command
return 0;
}
Its the TESTING(StartTime... TESTING(StopTime... that records the moments.
The results suggest some pretty obvious results:
of the 21858061 'ticks' spent in the program (pseudo_main timer) 21858043 is spent in the 'a' loop and, of that time, 13135784 ticks are spent in the 'k' loop. Inside this loops, next_prime, is_prime and pow_mod take almost trivial time.
I'd therefore recommend you try to ease the burden of all teh variable in the pseudo_main function. Break pseudo_main down into multiple calls. This allows the compiler to establish a new set of register variables in each call.
My CodeTime.h header looks like:
/*
This header defines some functions that allow you to time certain processes
at the millisecond/microsecond level.
Use it as follows.
1. Somewhere in your code, probably best in a main module, declare the timer
data with:
TimerData CodeTimers[NTIMES] = {0};
2. Functions you wish to time can be surrounded by StartTime and StopTime calls.
E.G.
TESTING(StartTime(4,"Expan dFromFile" ););
length = ExpandFromFile ( &p[i+1], &expanded[j] );
TESTING(StopTime(4););
The name will appear in the log so I use the name of the function I am timing
so I can use the code browser to get there quickly.
3. At any time - I usually do it at program terminate - call DumpTimes().
This will give you a dump of the form:
Timer[0] 570 131 ShowBasics
Timer[1] 170 131 ShowNumbers #
Timer[2] 3110 131 ShowNumbers $
Timer[3] 2420 12196 ExpandParameter
Timer[4] 1380 8782 ExpandFromFile
Timer[5] 0 7 ReadFromFile
Timer[6] 0 0
Timer[7] 0 0
Timer[8] 0 0
Timer[9] 3950 131 ShowBackground
Giving Timer number, total time taken (in ms), call count and the supplied name.
This data is tab delimited so you can copy and paste it into a spreadsheet.
*/
// Timing code to log millisecond times.
#include <sys\TimeB.h>
// Number of timers you can keep.
#define NTIMES 10
#ifdef WIN32
#include "WINDOWS.H"
//#include "WINBASE.H"
#define RecordedTime LONGLONG
#define ElapsedTime LONGLONG
#define fElapsed "%I64d"
#else
// Ansi stuff.
#define RecordedTime struct timeb
#define ElapsedTime long
#define fElapsed "%ld"
#endif
// One timer data structure.
typedef struct
{
RecordedTime startTime;
ElapsedTime elapsed;
long count;
char * name;
} TimerData;
#ifdef CODETIME
// Please allocate one of these somewhere and make it initially empty.
// E.G. TimerData CodeTimers[NTIMES] = {0};
extern TimerData CodeTimers[NTIMES];
// Call this to start a timer.
#ifdef WIN32
#define StartTime(n,fname) \
{\
CodeTimers[n].name = fname;\
CodeTimers[n].count += 1;\
QueryPerformanceCounter((L ARGE_INTEG ER*)&CodeT imers[n].s tartTime); \
}
#else
#define StartTime(n,fname) \
{\
CodeTimers[n].name = fname;\
CodeTimers[n].count += 1;\
ftime(&CodeTimers[n].start Time);\
}
#endif
// Call this to stop the timer.
#ifdef WIN32
#define StopTime(n) \
{\
RecordedTime stopTime;\
QueryPerformanceCounter((L ARGE_INTEG ER*)&stopT ime);\
CodeTimers[n].elapsed += stopTime - CodeTimers[n].startTime;\
}
#else
#define StopTime(n) \
{\
struct timeb stopTime;\
ftime(&stopTime);\
CodeTimers[n].elapsed += ((stopTime.time - CodeTimers[n].startTime.ti me) * 1000L) + ((long)stopTime.millitm - CodeTimers[n].startTime.mi llitm);\
}
#endif
// Call this on exit or whenever you want the times logged.
#define DumpTimes()\
{\
int i;\
RecordedTime freq;\
QueryPerformanceFrequency( (LARGE_INT EGER*)&fre q);\
TFPRINT(("Timer\tTime\tCou nt\tMean(S ecs)\tName "));\
for ( i = 0; i < NTIMES; i++ )\
{\
TFPRINT(("%d\t"fElapsed"\t %ld\t%f\t% s", i, CodeTimers[i].elapsed, CodeTimers[i].count, CodeTimers[i].count > 0 ? (double)CodeTimers[i].elap sed / (double)CodeTimers[i].coun t / freq: 0.0,CodeTimers[i].name == NULL ? "": CodeTimers[i].name ));\
}\
}
#else
// Call this to start a timer.
#define StartTime(n,fname)
// Call this to stop the timer.
#define StopTime(n)
// Call this on exit or whenever you want the times logged.
#define DumpTimes()
#endif
Note that my enhancements have upped its resolution beyiond milliseconds under a WIN32 environment. TFPRINT Is just like printf but with two ((.
Good luck.
Paul
I've been a bit long over this one as I had to reengineer my millisecond timer to use QueryPerformanceFrequency under Windoze. Here's some results:
Timer Time Count Mean(Secs) Name
0 21858061 1 6.106380 pseudo_main
1 3360 870 0.000001 log(2*N)/log(a)
2 2679 870 0.000001 fmod(sum+(double) s/ (double) av,1.0)
3 13135784 2947560 0.000001 k
4 21858043 1 6.106375 a
5 16012 870 0.000005 pow_mod;mul_mod
6 35474 870 0.000011 next_prime
7 13547 6776 0.000001 is_prime
8 9199 870 0.000003 pow_mod
9 21787312 870 0.006996 k2
My timer system records up to 10 timers. The time field is the total time spect. The count is the number of times that bit of code was performed, the mean(Secs) is the mean time spent in the section. The name should be obvious.
Here's the source so you can see what sections I was timing:
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <time.h>
extern "C" {
#include "CDefs.h"
#include "CodeTime.h"
void Idle ( void ) {};
DEBUG;
};
TimerData CodeTimers[NTIMES] = {0};
/* #define HAS_LONG_LONG */
#ifdef HAS_LONG_LONG
#define mul_mod(a,b,m) (( (long long) (a) * (long long) (b) ) % (m))
#else
#define mul_mod(a,b,m) (int)fmod( (double) a * (double) b, m)
#endif
/* returns the inverse of x mod y */
int inv_mod(int x,int y) {
int q,u,v,a,c,t;
u=x;
v=y;
c=1;
a=0;
do {
q=v/u;
t=c;
c=a-q*c;
a=t;
t=u;
u=v-q*u;
v=t;
} while (u!=0);
a=a%y;
if (a<0) a=y+a;
return a;
}
/* returns (a^b) mod m */
int pow_mod(int a,int b,int m)
{
int r,aa;
TESTING(StartTime(8,"pow_m
r=1;
aa=a;
while (1) {
if (b&1) r=mul_mod(r,aa,m);
b=b>>1;
if (b == 0) break;
aa=mul_mod(aa,aa,m);
}
TESTING(StopTime(8););
return r;
}
/* returns true if n is prime */
int is_prime(int n)
{
int r,i;
int rnsr = 1;
TESTING(StartTime(7,"is_pr
if ((n % 2) == 0) return 0;
r=(int)(sqrt(n));
for(i=3;i<=r&&rnsr==1;i+=2
TESTING(StopTime(7););
return rnsr;
}
/* returns the prime number immediatly after n */
int next_prime(int n)
{
TESTING(StartTime(6,"next_
do {
n++;
} while (!is_prime(n));
TESTING(StopTime(6););
return n;
}
/* finds out nine digits of pi starting from n */
/* will probably be the main function in openmosix capable version*/
int pseudo_main(int n)
{
int av,a,vmax,N,num,den,k,kq,k
double sum;
// double start_time, elap_time;
TESTING(StartTime(0,"pseud
// start_time = (double) clock ();
N=(int)((n+20)*log(10)/log
sum=0;
TESTING(StartTime(4,"a");)
for(a=3;a<=(2*N);a=next_pr
TESTING(StartTime(1,"log(2
vmax=(int)(log(2*N)/log(a)
av=1;
for(i=0;i<vmax;i++) av=av*a;
TESTING(StopTime(1););
s=0;
num=1;
den=1;
v=0;
kq=1;
kq2=1;
TESTING(StartTime(9,"k2");
for(k=1;k<=N;k++) {
TESTING(StartTime(3,"k");)
t=k;
if (kq >= a) {
do {
t=t/a;
v--;
} while ((t % a) == 0);
kq=0;
}
kq++;
num=mul_mod(num,t,av);
t=(2*k-1);
if (kq2 >= a) {
if (kq2 == a) {
do {
t=t/a;
v++;
} while ((t % a) == 0);
}
kq2-=a;
}
den=mul_mod(den,t,av);
kq2+=2;
if (v > 0) {
t=inv_mod(den,av);
t=mul_mod(t,num,av);
t=mul_mod(t,k,av);
for(i=v;i<vmax;i++) t=mul_mod(t,a,av);
s+=t;
if (s>=av) s-=av;
}
TESTING(StopTime(3););
}
TESTING(StopTime(9););
TESTING(StartTime(5,"pow_m
t=pow_mod(10,n-1,av);
s=mul_mod(s,t,av);
TESTING(StopTime(5););
TESTING(StartTime(2,"fmod(
sum=fmod(sum+(double) s/ (double) av,1.0);
TESTING(StopTime(2););
}
TESTING(StopTime(4););
//elap_time = ((double) clock () - (double) start_time) / CLOCKS_PER_SEC;
TESTING(StopTime(0););
printf("Position: %9d Digits: %09d\n",n,(int)(sum*1e9));
return 0;
}
/* calls up the pseudo_main function to find 9 digits at a time */
int main (int argc,char *argv[])
{
int position=1,counter=1000; //counter is the digit of pi to start from.
// int number;
// double start_time, elap_time;
pseudo_main(counter);
DumpTimes();
//Pause Code
//system ("pause"); // execute M$-DOS' pause command
return 0;
}
Its the TESTING(StartTime... TESTING(StopTime... that records the moments.
The results suggest some pretty obvious results:
of the 21858061 'ticks' spent in the program (pseudo_main timer) 21858043 is spent in the 'a' loop and, of that time, 13135784 ticks are spent in the 'k' loop. Inside this loops, next_prime, is_prime and pow_mod take almost trivial time.
I'd therefore recommend you try to ease the burden of all teh variable in the pseudo_main function. Break pseudo_main down into multiple calls. This allows the compiler to establish a new set of register variables in each call.
My CodeTime.h header looks like:
/*
This header defines some functions that allow you to time certain processes
at the millisecond/microsecond level.
Use it as follows.
1. Somewhere in your code, probably best in a main module, declare the timer
data with:
TimerData CodeTimers[NTIMES] = {0};
2. Functions you wish to time can be surrounded by StartTime and StopTime calls.
E.G.
TESTING(StartTime(4,"Expan
length = ExpandFromFile ( &p[i+1], &expanded[j] );
TESTING(StopTime(4););
The name will appear in the log so I use the name of the function I am timing
so I can use the code browser to get there quickly.
3. At any time - I usually do it at program terminate - call DumpTimes().
This will give you a dump of the form:
Timer[0] 570 131 ShowBasics
Timer[1] 170 131 ShowNumbers #
Timer[2] 3110 131 ShowNumbers $
Timer[3] 2420 12196 ExpandParameter
Timer[4] 1380 8782 ExpandFromFile
Timer[5] 0 7 ReadFromFile
Timer[6] 0 0
Timer[7] 0 0
Timer[8] 0 0
Timer[9] 3950 131 ShowBackground
Giving Timer number, total time taken (in ms), call count and the supplied name.
This data is tab delimited so you can copy and paste it into a spreadsheet.
*/
// Timing code to log millisecond times.
#include <sys\TimeB.h>
// Number of timers you can keep.
#define NTIMES 10
#ifdef WIN32
#include "WINDOWS.H"
//#include "WINBASE.H"
#define RecordedTime LONGLONG
#define ElapsedTime LONGLONG
#define fElapsed "%I64d"
#else
// Ansi stuff.
#define RecordedTime struct timeb
#define ElapsedTime long
#define fElapsed "%ld"
#endif
// One timer data structure.
typedef struct
{
RecordedTime startTime;
ElapsedTime elapsed;
long count;
char * name;
} TimerData;
#ifdef CODETIME
// Please allocate one of these somewhere and make it initially empty.
// E.G. TimerData CodeTimers[NTIMES] = {0};
extern TimerData CodeTimers[NTIMES];
// Call this to start a timer.
#ifdef WIN32
#define StartTime(n,fname) \
{\
CodeTimers[n].name = fname;\
CodeTimers[n].count += 1;\
QueryPerformanceCounter((L
}
#else
#define StartTime(n,fname) \
{\
CodeTimers[n].name = fname;\
CodeTimers[n].count += 1;\
ftime(&CodeTimers[n].start
}
#endif
// Call this to stop the timer.
#ifdef WIN32
#define StopTime(n) \
{\
RecordedTime stopTime;\
QueryPerformanceCounter((L
CodeTimers[n].elapsed += stopTime - CodeTimers[n].startTime;\
}
#else
#define StopTime(n) \
{\
struct timeb stopTime;\
ftime(&stopTime);\
CodeTimers[n].elapsed += ((stopTime.time - CodeTimers[n].startTime.ti
}
#endif
// Call this on exit or whenever you want the times logged.
#define DumpTimes()\
{\
int i;\
RecordedTime freq;\
QueryPerformanceFrequency(
TFPRINT(("Timer\tTime\tCou
for ( i = 0; i < NTIMES; i++ )\
{\
TFPRINT(("%d\t"fElapsed"\t
}\
}
#else
// Call this to start a timer.
#define StartTime(n,fname)
// Call this to stop the timer.
#define StopTime(n)
// Call this on exit or whenever you want the times logged.
#define DumpTimes()
#endif
Note that my enhancements have upped its resolution beyiond milliseconds under a WIN32 environment. TFPRINT Is just like printf but with two ((.
Good luck.
Paul
ASKER
That is amazing! Do you think there is much room for improvements in the k loops, and this may sound a tad noobie but would you be able to explain how splitting up the pseudo_main() call into smaller parts would help make it faster? Sorry, this is the first time I have needed to optimise my code :) Thanks for any help/advice you can give.
There's a big problem with using QueryPerformanceTimer()-- it has a LOT of overhead per call.
Try calling it twice in sucession and print out the time difference. It's probably around 1500 clock cycles.
So it's not very useful for timing short instruction sequences. You can make it a bit better by subtracting the overhead time from the printouts.
As a better alternative, you might try using the RDTSC instruction, which is at least 40 times faster, but still awfully slow, around 45 clocks.
Try calling it twice in sucession and print out the time difference. It's probably around 1500 clock cycles.
So it's not very useful for timing short instruction sequences. You can make it a bit better by subtracting the overhead time from the printouts.
As a better alternative, you might try using the RDTSC instruction, which is at least 40 times faster, but still awfully slow, around 45 clocks.
Greg,
Good points! I'll investigate these. I only want the figures relative so it QueryPerformanceTimer is nearly what I want. My earlier version that used ftime just wasnt up to it. It was recording millions of calls taking zero time because the calls themselves took less than 1ms.
EvilGuru,
Splitting can make a heap of difference. The aim shuld be to reduce the number of local stack variables. The compiler can then more efficiently make use of registers and instruction cacheing. Remember also that there is stack overhead so dont go too far. I'd suggest you put the 'a' loop in one function and the 'k' loop in another, then see if you've made a difference.
Paul
Good points! I'll investigate these. I only want the figures relative so it QueryPerformanceTimer is nearly what I want. My earlier version that used ftime just wasnt up to it. It was recording millions of calls taking zero time because the calls themselves took less than 1ms.
EvilGuru,
Splitting can make a heap of difference. The aim shuld be to reduce the number of local stack variables. The compiler can then more efficiently make use of registers and instruction cacheing. Remember also that there is stack overhead so dont go too far. I'd suggest you put the 'a' loop in one function and the 'k' loop in another, then see if you've made a difference.
Paul
I suspect fmod().
solution some fast fmod func ?
solution some fast fmod func ?
ASKER
PaulCaswell, you mean something like:
pseudo_main {
int a(vars here);
}
a {
a code here
int k(vars here);
}
Or have I got the wrong end of the stick?
According to the timer fmod takes up very little time, so I while I did think it was that I now dont. But a faster one cant hurt.
pseudo_main {
int a(vars here);
}
a {
a code here
int k(vars here);
}
Or have I got the wrong end of the stick?
According to the timer fmod takes up very little time, so I while I did think it was that I now dont. But a faster one cant hurt.
EvilGuru,
This is the sort of thing I was suggesting. I've only pulled the a_loop code out but I am sure you will understand what I mean. I hope you can pull the k_loop out yourself.
// The 'a' part of the loop.
void a_loop ( int N, int n, int a, double * sum )
{
int av,vmax,num,den,k,kq,kq2,t ,v,s,i;
vmax=(int)(log(2*N)/log(a) );
av=1;
for(i=0;i<vmax;i++) av=av*a;
s=0;
num=1;
den=1;
v=0;
kq=1;
kq2=1;
for(k=1;k<=N;k++) {
t=k;
if (kq >= a) {
do {
t=t/a;
v--;
} while ((t % a) == 0);
kq=0;
}
kq++;
num=mul_mod(num,t,av);
t=(2*k-1);
if (kq2 >= a) {
if (kq2 == a) {
do {
t=t/a;
v++;
} while ((t % a) == 0);
}
kq2-=a;
}
den=mul_mod(den,t,av);
kq2+=2;
if (v > 0) {
t=inv_mod(den,av);
t=mul_mod(t,num,av);
t=mul_mod(t,k,av);
for(i=v;i<vmax;i++) t=mul_mod(t,a,av);
s+=t;
if (s>=av) s-=av;
}
}
t=pow_mod(10,n-1,av);
s=mul_mod(s,t,av);
*sum=fmod(*sum+(double) s/ (double) av,1.0);
}
/* finds out nine digits of pi starting from n */
/* will probably be the main function in openmosix capable version*/
int pseudo_main(int n)
{
int a,N;
double sum;
// double start_time, elap_time;
TESTING(StartTime(0,"pseud o_main");) ;
// start_time = (double) clock ();
N=(int)((n+20)*log(10)/log (2));
sum=0;
TESTING(StartTime(4,"a");) ;
for(a=3;a<=(2*N);a=next_pr ime(a)) {
a_loop ( N, n, a, &sum );
}
TESTING(StopTime(4););
//elap_time = ((double) clock () - (double) start_time) / CLOCKS_PER_SEC;
TESTING(StopTime(0););
printf("Position: %9d Digits: %09d\n",n,(int)(sum*1e9));
return 0;
}
This is the sort of thing I was suggesting. I've only pulled the a_loop code out but I am sure you will understand what I mean. I hope you can pull the k_loop out yourself.
// The 'a' part of the loop.
void a_loop ( int N, int n, int a, double * sum )
{
int av,vmax,num,den,k,kq,kq2,t
vmax=(int)(log(2*N)/log(a)
av=1;
for(i=0;i<vmax;i++) av=av*a;
s=0;
num=1;
den=1;
v=0;
kq=1;
kq2=1;
for(k=1;k<=N;k++) {
t=k;
if (kq >= a) {
do {
t=t/a;
v--;
} while ((t % a) == 0);
kq=0;
}
kq++;
num=mul_mod(num,t,av);
t=(2*k-1);
if (kq2 >= a) {
if (kq2 == a) {
do {
t=t/a;
v++;
} while ((t % a) == 0);
}
kq2-=a;
}
den=mul_mod(den,t,av);
kq2+=2;
if (v > 0) {
t=inv_mod(den,av);
t=mul_mod(t,num,av);
t=mul_mod(t,k,av);
for(i=v;i<vmax;i++) t=mul_mod(t,a,av);
s+=t;
if (s>=av) s-=av;
}
}
t=pow_mod(10,n-1,av);
s=mul_mod(s,t,av);
*sum=fmod(*sum+(double) s/ (double) av,1.0);
}
/* finds out nine digits of pi starting from n */
/* will probably be the main function in openmosix capable version*/
int pseudo_main(int n)
{
int a,N;
double sum;
// double start_time, elap_time;
TESTING(StartTime(0,"pseud
// start_time = (double) clock ();
N=(int)((n+20)*log(10)/log
sum=0;
TESTING(StartTime(4,"a");)
for(a=3;a<=(2*N);a=next_pr
a_loop ( N, n, a, &sum );
}
TESTING(StopTime(4););
//elap_time = ((double) clock () - (double) start_time) / CLOCKS_PER_SEC;
TESTING(StopTime(0););
printf("Position: %9d Digits: %09d\n",n,(int)(sum*1e9));
return 0;
}
ASKER
After a good bit of playing around with it I replaced:
#ifdef HAS_LONG_LONG
#define mul_mod(a,b,m) (( (long long) (a) * (long long) (b) ) % (m))
#else
#define mul_mod(a,b,m) fmod( (double) a * (double) b, m)
#endif
with:
#define mul_mod(a,b,m) (( (__int64) (a) * (__int64) (b) ) % (m))
My time for 10,000 went down from 85 seconds to 42 seconds! How bigger benefit do you think I will get from splitting up the functions and are there any other causes where you think something like __int64 will help?
#ifdef HAS_LONG_LONG
#define mul_mod(a,b,m) (( (long long) (a) * (long long) (b) ) % (m))
#else
#define mul_mod(a,b,m) fmod( (double) a * (double) b, m)
#endif
with:
#define mul_mod(a,b,m) (( (__int64) (a) * (__int64) (b) ) % (m))
My time for 10,000 went down from 85 seconds to 42 seconds! How bigger benefit do you think I will get from splitting up the functions and are there any other causes where you think something like __int64 will help?
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
The program itself will come with a .cfg file where I will have an option saying:
Use __int64 = yes;
However on the machines I have tried it on the speed increase has been quite large, but I will give the option to swtich.
You ahve helped so much! Thanks, answer accepted.
Use __int64 = yes;
However on the machines I have tried it on the speed increase has been quite large, but I will give the option to swtich.
You ahve helped so much! Thanks, answer accepted.
That is a great code...
Here is the output in my computer... :)
Position: 10000 Digits: 856672279 Time: 243.09 seconds
Press any key to continue . . .
I made some changes... Then I got this result...
Position: 10000 Digits: 856672279 Time: 235.57 seconds
Press any key to continue . . .
The change...
__________________________
int next_prime(int n)
{
n = n - 1 + n%2; // Make n a odd number
do {
n+=2; // Here you will jump to only the odd numbers
} while (!is_prime(n));
return n;
}
__________________________
Bye
---
Harish