Solved

# Calculating Pi using BBP forumula - optimisations

Posted on 2005-05-01
560 Views
Hi, currently I am working on a distributed computing project to work out the mathematical constant 'pi'. After taking a look at many examples and formulas for pi I have choson to use a common implementation of the BBP formula. This is a digit extraction formula which is able to compute any digit in Pi without having to know the n-1 digit. However, I have a problem with the script, it is quite slow, each time it is run it returns 9 digits of Pi from the value of n. If n was 1 it would return digits 1-9 . If in my basic test program I set n to be 10,000 it takes around 85 seconds to work out nine digits starting at position 10,000 of pi.

This is a problem for, as when larger values are reached (billions) it could takes weeks of cpu time just to get nine more digits. However, I am quite sure there is room for optimisation in the current formula I use. I a currently not sure which part of the formula is causing the slowdown, as my attempt to add in some debugging code failed. I have also been told by some people that it would be faster if the program used hardware cpu optimisations (sse?), however I am unsure of this. I have a feeling that is it the calculating of the prime numbers that is taking the time but I am still unsure. Here is a link to my current code as I did not want to post it in this topic.
http://www.dcrez.co.uk/create/pi.c
Thank you for any help/advice you can give .
0
Question by:EvilGuru

LVL 37

Expert Comment

Hi EvilGuru,
That is a great code...

Here is the output in my computer... :)

Position:     10000 Digits: 856672279 Time:    243.09 seconds
Press any key to continue . . .

I made some changes... Then I got this result...

Position:     10000 Digits: 856672279 Time:    235.57 seconds
Press any key to continue . . .

The change...
___________________________________________
int next_prime(int n)
{
n = n - 1 + n%2;            // Make n a odd number
do {
} while (!is_prime(n));
return n;
}
___________________________________________

Bye
---
Harish
0

LVL 37

Expert Comment

EvilGuru,

has a better prime checking algorithm... but I don't know how to implement that :(
0

Author Comment

Thanks! That got me down a good few seconds here as well. I am currently not sure what is taking up the time in the program, I have an idea that it is the primes so I thought about 'spoon feeding' the program a list of prime numbers in an array, however I am not totally sure how to do that or if there will be a large gain from it.
0

LVL 37

Expert Comment

Surely prime checking is the delay factor

If you want that... look at these

http://primes.utm.edu/nthprime/algorithm.php
0

LVL 37

Expert Comment

If you want to spoon feed them...
0

LVL 37

Expert Comment

0

Author Comment

Can anyone help implement that amazing C prime finder into the int is_prime function? I have had a few tries but cant seem to find out which bit actually does it. Thanks for all of your help, it is nothing short of amazing.
0

LVL 22

Expert Comment

You might want to look into how "Super pi" does its calculations.  According to its benchmarks file, ti can do ONE MILLION digits of Pi in under 300 seconds, and that's on an ancient 90MHz PC.

0

Author Comment

Currently, using different algorithms I can do around 10 million in about 60 seconds. However, that method is both memory intensive and can not be used in distributed computng (which is what this project is about), so while this method may be slower, it is the only method and so must be optimisted as best as we can make it.
0

LVL 16

Expert Comment

I got:

Position:     10000 Digits: 856672279 Time:    111.81 seconds
Press any key to continue . . .

It is compiled under VC++. I had to change:

#define mul_mod(a,b,m) fmod( (double) a * (double) b, m)

to

#define mul_mod(a,b,m) (int)fmod( (double) a * (double) b, m)

because I was getting loads of warnings but I dont think that was the problem. I am going to use one of my code timing tools on it to get a better idea of where the time is going.

Back in a while.

Paul
0

Author Comment

Whoever manages to integrates a fast prime system (and allows it to work with numbers up to 2 trillion) into the code will probably be the accepted answer, along with mgh_mgharish who found the great information on primes.
0

LVL 22

Expert Comment

Maybe it's just me, but it seems silly to use an algorithm that is TWO MILLION times slower than another algorithm, just because it's distributable.  If it's that slow, it's obvious this is an experiment in distributed processing, not a real attempt to make a fast Pi-finder.   So why spend much time on this aspect?

And it's risky to use (double) arithmetic--- with these algorithms you need EXACT integer results, which is something  floating-point numbers cannot guarantee.

0

Author Comment

The accuracy of the formula is 100%, the use of double gives us 15 digits of accuracy when we only need 9. The slowness come from the fact that to work out large amounts huge amounts of ram would be needed (looking in tb here) as well as a cluster of many cpu's. The idea was not to make it fast, but most other forumulas have a headroom of around 17 billion, the idea of this is to make it faster, in which we have found out what is causing the slowdown and are not hoping for integration. Just take a look at pihex, they managed to find the quadillionth bit of Pi in around two years using a few slow 400mhz pc's.
0

LVL 22

Expert Comment

>the use of double gives us 15 digits of accuracy when we only need 9.

You may be right.

But have you considered what happens along these lines:

15 minus 9 is six.  So you seem to have  5 (plus or minus one) digits of extra accuracy.

But that means that somewhere between one in ten thousand, to one in a million times, the extra digits are going to be all nines, or all zeroes, plus or minus a bit.    Which is going to generate a carry and affect the significant digits.  And the significant digits could be anything up to "999999999" to start with, so the last bit of inaccuracy can totally bollix your result.

Even if type double gives you 15 digits of accuracy, or 1000 digits of accuracy, or  a MILLION digits, it's not enough for doing truly, really, actually, verifiable,  precise integer math.

At least the way I see it.

0

LVL 3

Expert Comment

any way the prime is not eating time..

the suspected code is

for(k=1;k<=N;k++) {

t=k;
if (kq >= a) {
do {
t=t/a;
v--;
} while ((t % a) == 0);
kq=0;
}
kq++;
num=mul_mod(num,t,av);

t=((k*2)-1);
if (kq2 >= a) {
if (kq2 == a) {
do {
t=t/a;
v++;
} while ((t % a) == 0);
}
kq2-=a;
}
den=mul_mod(den,t,av);
kq2+=2;

if (v > 0) {
t=inv_mod(den,av);
t=mul_mod(t,num,av);
t=mul_mod(t,k,av);
for(i=v;i<vmax;i++) t=mul_mod(t,a,av);
s+=t;
if (s>=av) s-=av;
}

}

Try commenting this part of the code..

it excecutes in .01 seconds

Thanxs,
Joju
0

Author Comment

grg99, interesting point. To test it I went to http://www.angio.net/pi/bigpi.cgi and typed in "999999" and at point 762 that way found. From this I set the client up to  753 (+9 = 762, right where the nines start) and it gave me the result "870721134", not rounding up the 4 > 5 even though there was a string of nines. Is one of us seeing this wrong?
0

LVL 16

Expert Comment

EvilGuru,

I've been a bit long over this one as I had to reengineer my millisecond timer to use QueryPerformanceFrequency under Windoze. Here's some results:

Timer      Time      Count      Mean(Secs)      Name
0      21858061      1      6.106380      pseudo_main
1      3360      870      0.000001      log(2*N)/log(a)
2      2679      870      0.000001      fmod(sum+(double) s/ (double) av,1.0)
3      13135784      2947560      0.000001      k
4      21858043      1      6.106375      a
5      16012      870      0.000005      pow_mod;mul_mod
6      35474      870      0.000011      next_prime
7      13547      6776      0.000001      is_prime
8      9199      870      0.000003      pow_mod
9      21787312      870      0.006996      k2

My timer system records up to 10 timers. The time field is the total time spect. The count is the number of times that bit of code was performed, the mean(Secs) is the mean time spent in the section. The name should be obvious.

Here's the source so you can see what sections I was timing:

#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <time.h>

extern "C" {
#include "CDefs.h"
#include "CodeTime.h"
void Idle ( void ) {};
DEBUG;
};

TimerData CodeTimers[NTIMES] = {0};
/* #define HAS_LONG_LONG */

#ifdef HAS_LONG_LONG
#define mul_mod(a,b,m) (( (long long) (a) * (long long) (b) ) % (m))
#else
#define mul_mod(a,b,m) (int)fmod( (double) a * (double) b, m)
#endif

/* returns the inverse of x mod y */

int inv_mod(int x,int y) {
int q,u,v,a,c,t;

u=x;
v=y;
c=1;
a=0;
do {
q=v/u;

t=c;
c=a-q*c;
a=t;

t=u;
u=v-q*u;
v=t;
} while (u!=0);
a=a%y;
if (a<0) a=y+a;
return a;
}

/* returns (a^b) mod m */

int pow_mod(int a,int b,int m)
{
int r,aa;
TESTING(StartTime(8,"pow_mod"););

r=1;
aa=a;
while (1) {
if (b&1) r=mul_mod(r,aa,m);
b=b>>1;
if (b == 0) break;
aa=mul_mod(aa,aa,m);
}
TESTING(StopTime(8););
return r;
}

/* returns true if n is prime */

int is_prime(int n)
{
int r,i;
int rnsr = 1;
TESTING(StartTime(7,"is_prime"););
if ((n % 2) == 0) return 0;

r=(int)(sqrt(n));
for(i=3;i<=r&&rnsr==1;i+=2) if ((n % i) == 0) rnsr = 0;
TESTING(StopTime(7););
return rnsr;
}

/* returns the prime number immediatly after n */

int next_prime(int n)
{
TESTING(StartTime(6,"next_prime"););
do {
n++;
} while (!is_prime(n));
TESTING(StopTime(6););
return n;
}

/* finds out nine digits of pi starting from n */

/* will probably be the main function in openmosix capable version*/

int pseudo_main(int n)
{
int av,a,vmax,N,num,den,k,kq,kq2,t,v,s,i;
double sum;

//      double start_time, elap_time;
TESTING(StartTime(0,"pseudo_main"););
//      start_time = (double) clock ();
N=(int)((n+20)*log(10)/log(2));

sum=0;

TESTING(StartTime(4,"a"););
for(a=3;a<=(2*N);a=next_prime(a)) {

TESTING(StartTime(1,"log(2*N)/log(a)"););
vmax=(int)(log(2*N)/log(a));
av=1;
for(i=0;i<vmax;i++) av=av*a;
TESTING(StopTime(1););

s=0;
num=1;
den=1;
v=0;
kq=1;
kq2=1;

TESTING(StartTime(9,"k2"););
for(k=1;k<=N;k++) {

TESTING(StartTime(3,"k"););
t=k;
if (kq >= a) {
do {
t=t/a;
v--;
} while ((t % a) == 0);
kq=0;
}
kq++;
num=mul_mod(num,t,av);

t=(2*k-1);
if (kq2 >= a) {
if (kq2 == a) {
do {
t=t/a;
v++;
} while ((t % a) == 0);
}
kq2-=a;
}
den=mul_mod(den,t,av);
kq2+=2;

if (v > 0) {
t=inv_mod(den,av);
t=mul_mod(t,num,av);
t=mul_mod(t,k,av);
for(i=v;i<vmax;i++) t=mul_mod(t,a,av);
s+=t;
if (s>=av) s-=av;
}

TESTING(StopTime(3););
}
TESTING(StopTime(9););

TESTING(StartTime(5,"pow_mod;mul_mod"););
t=pow_mod(10,n-1,av);
s=mul_mod(s,t,av);
TESTING(StopTime(5););
TESTING(StartTime(2,"fmod(sum+(double) s/ (double) av,1.0)"););
sum=fmod(sum+(double) s/ (double) av,1.0);
TESTING(StopTime(2););
}
TESTING(StopTime(4););

//elap_time = ((double) clock () - (double) start_time) / CLOCKS_PER_SEC;
TESTING(StopTime(0););

printf("Position: %9d Digits: %09d\n",n,(int)(sum*1e9));

return 0;
}

/* calls up the pseudo_main function to find 9 digits at a time */

int main (int argc,char *argv[])
{
int position=1,counter=1000; //counter is the digit of pi to start from.
//  int number;
//  double start_time, elap_time;

pseudo_main(counter);

DumpTimes();
//Pause Code
//system ("pause"); // execute M\$-DOS' pause command
return 0;

}

Its the TESTING(StartTime... TESTING(StopTime... that records the moments.

The results suggest some pretty obvious results:

of the 21858061 'ticks' spent in the program (pseudo_main timer) 21858043 is spent in the 'a' loop and, of that time, 13135784 ticks are spent in the 'k' loop. Inside this loops, next_prime, is_prime and pow_mod take almost trivial time.

I'd therefore recommend you try to ease the burden of all teh variable in the pseudo_main function. Break pseudo_main down into multiple calls. This allows the compiler to establish a new set of register variables in each call.

/*
This header defines some functions that allow you to time certain processes
at the millisecond/microsecond level.

Use it as follows.

1.      Somewhere in your code, probably best in a main module, declare the timer
data with:
TimerData CodeTimers[NTIMES] = {0};

2.      Functions you wish to time can be surrounded by StartTime and StopTime calls.
E.G.
TESTING(StartTime(4,"ExpandFromFile"););
length = ExpandFromFile ( &p[i+1], &expanded[j] );
TESTING(StopTime(4););

The name will appear in the log so I use the name of the function I am timing
so I can use the code browser to get there quickly.

3.      At any time - I usually do it at program terminate - call DumpTimes().
This will give you a dump of the form:
Timer[0]      570      131      ShowBasics
Timer[1]      170      131      ShowNumbers #
Timer[2]      3110      131      ShowNumbers \$
Timer[3]      2420      12196      ExpandParameter
Timer[4]      1380      8782      ExpandFromFile
Timer[6]      0      0
Timer[7]      0      0
Timer[8]      0      0
Timer[9]      3950      131      ShowBackground
Giving Timer number, total time taken (in ms), call count and the supplied name.
This data is tab delimited so you can copy and paste it into a spreadsheet.

*/

// Timing code to log millisecond times.
#include <sys\TimeB.h>

// Number of timers you can keep.
#define NTIMES      10

#ifdef WIN32
#include "WINDOWS.H"
//#include "WINBASE.H"
#define RecordedTime LONGLONG
#define ElapsedTime LONGLONG
#define fElapsed "%I64d"
#else
// Ansi stuff.
#define RecordedTime struct timeb
#define ElapsedTime long
#define fElapsed "%ld"
#endif

// One timer data structure.
typedef struct
{
RecordedTime startTime;
ElapsedTime elapsed;
long count;
char * name;
} TimerData;

#ifdef CODETIME

// Please allocate one of these somewhere and make it initially empty.
// E.G. TimerData CodeTimers[NTIMES] = {0};
extern TimerData CodeTimers[NTIMES];

// Call this to start a timer.
#ifdef WIN32
#define StartTime(n,fname) \
{\
CodeTimers[n].name = fname;\
CodeTimers[n].count += 1;\
QueryPerformanceCounter((LARGE_INTEGER*)&CodeTimers[n].startTime);\
}
#else
#define StartTime(n,fname) \
{\
CodeTimers[n].name = fname;\
CodeTimers[n].count += 1;\
ftime(&CodeTimers[n].startTime);\
}
#endif

// Call this to stop the timer.
#ifdef WIN32
#define StopTime(n) \
{\
RecordedTime stopTime;\
QueryPerformanceCounter((LARGE_INTEGER*)&stopTime);\
CodeTimers[n].elapsed += stopTime - CodeTimers[n].startTime;\
}
#else
#define StopTime(n) \
{\
struct timeb stopTime;\
ftime(&stopTime);\
CodeTimers[n].elapsed += ((stopTime.time - CodeTimers[n].startTime.time) * 1000L) + ((long)stopTime.millitm - CodeTimers[n].startTime.millitm);\
}
#endif

// Call this on exit or whenever you want the times logged.
#define DumpTimes()\
{\
int i;\
RecordedTime freq;\
QueryPerformanceFrequency((LARGE_INTEGER*)&freq);\
TFPRINT(("Timer\tTime\tCount\tMean(Secs)\tName"));\
for ( i = 0; i < NTIMES; i++ )\
{\
TFPRINT(("%d\t"fElapsed"\t%ld\t%f\t%s", i, CodeTimers[i].elapsed, CodeTimers[i].count, CodeTimers[i].count > 0 ? (double)CodeTimers[i].elapsed / (double)CodeTimers[i].count / freq: 0.0,CodeTimers[i].name == NULL ? "": CodeTimers[i].name ));\
}\
}

#else
// Call this to start a timer.
#define StartTime(n,fname)
// Call this to stop the timer.
#define StopTime(n)
// Call this on exit or whenever you want the times logged.
#define DumpTimes()

#endif

Note that my enhancements have upped its resolution beyiond milliseconds under a WIN32 environment. TFPRINT Is just like printf but with two ((.

Good luck.

Paul
0

Author Comment

That is amazing! Do you think there is much room for improvements in the k loops, and this may sound a tad noobie but would you be able to explain how splitting up the pseudo_main() call into smaller parts would help make it faster? Sorry, this is the first time I have needed to optimise my code :) Thanks for any help/advice you can give.
0

LVL 22

Expert Comment

There's a big problem with using QueryPerformanceTimer()-- it has a LOT of overhead per call.
Try calling it twice in sucession and print out the time difference.  It's probably around 1500 clock cycles.

So it's not very useful for timing short instruction sequences.   You can make it a bit better by subtracting the overhead time from the printouts.

As a better alternative, you might try using the RDTSC instruction, which is at least 40 times faster, but still awfully slow, around 45 clocks.

0

LVL 16

Expert Comment

Greg,

Good points! I'll investigate these. I only want the figures relative so it QueryPerformanceTimer is nearly what I want. My earlier version that used ftime just wasnt up to it. It was recording millions of calls taking zero time because the calls themselves took less than 1ms.

EvilGuru,

Splitting can make a heap of difference. The aim shuld be to reduce the number of local stack variables. The compiler can then more efficiently make use of registers and instruction cacheing. Remember also that there is  stack overhead so dont go too far. I'd suggest you put the 'a' loop in one function and the 'k' loop in another, then see if you've made a difference.

Paul
0

LVL 3

Expert Comment

I suspect fmod().
solution some fast fmod func ?
0

Author Comment

PaulCaswell, you mean something like:

pseudo_main {
int a(vars here);
}
a {
a code here
int k(vars here);
}
Or have I got the wrong end of the stick?

According to the timer fmod takes up very little time, so I while I did think it was that I now dont. But a faster one cant hurt.
0

LVL 16

Expert Comment

EvilGuru,

This is the sort of thing I was suggesting. I've only pulled the a_loop code out but I am sure you will understand what I mean. I hope you can pull the k_loop out yourself.

// The 'a' part of the loop.
void a_loop ( int N, int n, int a, double * sum )
{
int av,vmax,num,den,k,kq,kq2,t,v,s,i;
vmax=(int)(log(2*N)/log(a));
av=1;
for(i=0;i<vmax;i++) av=av*a;

s=0;
num=1;
den=1;
v=0;
kq=1;
kq2=1;

for(k=1;k<=N;k++) {

t=k;
if (kq >= a) {
do {
t=t/a;
v--;
} while ((t % a) == 0);
kq=0;
}
kq++;
num=mul_mod(num,t,av);

t=(2*k-1);
if (kq2 >= a) {
if (kq2 == a) {
do {
t=t/a;
v++;
} while ((t % a) == 0);
}
kq2-=a;
}
den=mul_mod(den,t,av);
kq2+=2;

if (v > 0) {
t=inv_mod(den,av);
t=mul_mod(t,num,av);
t=mul_mod(t,k,av);
for(i=v;i<vmax;i++) t=mul_mod(t,a,av);
s+=t;
if (s>=av) s-=av;
}

}

t=pow_mod(10,n-1,av);
s=mul_mod(s,t,av);
*sum=fmod(*sum+(double) s/ (double) av,1.0);
}

/* finds out nine digits of pi starting from n */

/* will probably be the main function in openmosix capable version*/

int pseudo_main(int n)
{
int a,N;
double sum;

//      double start_time, elap_time;
TESTING(StartTime(0,"pseudo_main"););
//      start_time = (double) clock ();
N=(int)((n+20)*log(10)/log(2));

sum=0;

TESTING(StartTime(4,"a"););
for(a=3;a<=(2*N);a=next_prime(a)) {
a_loop ( N, n, a, &sum );
}
TESTING(StopTime(4););

//elap_time = ((double) clock () - (double) start_time) / CLOCKS_PER_SEC;
TESTING(StopTime(0););

printf("Position: %9d Digits: %09d\n",n,(int)(sum*1e9));

return 0;
}

0

Author Comment

After a good bit of playing around with it I replaced:
#ifdef HAS_LONG_LONG
#define mul_mod(a,b,m) (( (long long) (a) * (long long) (b) ) % (m))
#else
#define mul_mod(a,b,m) fmod( (double) a * (double) b, m)
#endif
with:
#define mul_mod(a,b,m) (( (__int64) (a) * (__int64) (b) ) % (m))

My time for 10,000 went down from 85 seconds to 42 seconds! How bigger benefit do you think I will get from splitting up the functions and are there any other causes where you think something like __int64 will help?
0

LVL 16

Accepted Solution

Awesome! A 50% improvement! You probably wont get too much more.

Function splitting CAN do amazing things. Imagine the difference between 't' stored in ram on the stack and 't' in a register! You are limited to a certain number of registers, if you can break down the code into chunks small enough to ensure that ALL calculations work with register variables then you will see astonishing improvements.

Much of the optimisation will involve trial and error. I notice that:

#define mul_mod(a,b,m) (( (__int64) (a) * (__int64) (b) ) % (m))

involves a multiply and a divide. Have you tried carrying the result as a fraction? I.e. quotient and remainder. You can often get quite good improvements by reducing the number of divisions.

Another idea! I notice:

do {
t=t/a;
v--;
} while ((t % a) == 0);

and

do {
t=t/a;
v++;
} while ((t % a) == 0);

Could this be changed to:

v -= countFactors(&t,a);

and

v += countFactors(&t,a);

You could then write countFactors as:

int countFactors(int * tp, int a)
{
int count = 0;
register int t = *tp;
do {
t/=a;
count++;
} while ((t % a) == 0);
*tp = t;
return count;
}

I bet the compiler would be able to optimise this big-time. It may not help but its the essence of optimising! Either prove where the time is going or tighten up the code.

Paul

P.S. You are hoping to distribute this process! Can you assume that __int64 is as efficient on all machines as it is on yours? Perhaps fmod would be better on some?
0

Author Comment

The program itself will come with a .cfg file where I will have an option saying:
Use __int64 = yes;
However on the machines I have tried it on the speed increase has been quite large, but I will give the option to swtich.
You ahve helped so much! Thanks, answer accepted.
0

## Featured Post

### Suggested Solutions

This tutorial is posted by Aaron Wojnowski, administrator at SDKExpert.net.  To view more iPhone tutorials, visit www.sdkexpert.net. This is a very simple tutorial on finding the user's current location easily. In this tutorial, you will learn hoâ€¦
Windows programmers of the C/C++ variety, how many of you realise that since Window 9x Microsoft has been lying to you about what constitutes Unicode (http://en.wikipedia.org/wiki/Unicode)? They will have you believe that Unicode requires you to useâ€¦
The goal of this video is to provide viewers with basic examples to understand and use nested-loops in the C programming language.
The goal of this video is to provide viewers with basic examples to understand and use while-loops in the C programming language.

#### Need Help in Real-Time?

Connect with top rated Experts

17 Experts available now in Live!