Link to home
Start Free TrialLog in
Avatar of EvilGuru
EvilGuru

asked on

Calculating Pi using BBP forumula - optimisations

Hi, currently I am working on a distributed computing project to work out the mathematical constant 'pi'. After taking a look at many examples and formulas for pi I have choson to use a common implementation of the BBP formula. This is a digit extraction formula which is able to compute any digit in Pi without having to know the n-1 digit. However, I have a problem with the script, it is quite slow, each time it is run it returns 9 digits of Pi from the value of n. If n was 1 it would return digits 1-9 . If in my basic test program I set n to be 10,000 it takes around 85 seconds to work out nine digits starting at position 10,000 of pi.

This is a problem for, as when larger values are reached (billions) it could takes weeks of cpu time just to get nine more digits. However, I am quite sure there is room for optimisation in the current formula I use. I a currently not sure which part of the formula is causing the slowdown, as my attempt to add in some debugging code failed. I have also been told by some people that it would be faster if the program used hardware cpu optimisations (sse?), however I am unsure of this. I have a feeling that is it the calculating of the prime numbers that is taking the time but I am still unsure. Here is a link to my current code as I did not want to post it in this topic.
http://www.dcrez.co.uk/create/pi.c
Thank you for any help/advice you can give .
Avatar of Harisha M G
Harisha M G
Flag of India image

Hi EvilGuru,
    That is a great code...

    Here is the output in my computer... :)

Position:     10000 Digits: 856672279 Time:    243.09 seconds
Press any key to continue . . .

I made some changes... Then I got this result...

Position:     10000 Digits: 856672279 Time:    235.57 seconds
Press any key to continue . . .

The change...
___________________________________________
int next_prime(int n)
{
   n = n - 1 + n%2;            // Make n a odd number
   do {
      n+=2;                    // Here you will jump to only the odd numbers
   } while (!is_prime(n));
   return n;
}
___________________________________________

Bye
---
Harish
EvilGuru,

http://mathworld.wolfram.com/QuadraticSieve.html
has a better prime checking algorithm... but I don't know how to implement that :(
Avatar of EvilGuru
EvilGuru

ASKER

Thanks! That got me down a good few seconds here as well. I am currently not sure what is taking up the time in the program, I have an idea that it is the primes so I thought about 'spoon feeding' the program a list of prime numbers in an array, however I am not totally sure how to do that or if there will be a large gain from it.
If you want to spoon feed them...
http://www.nist.gov/dads/HTML/sieve.html
Can anyone help implement that amazing C prime finder into the int is_prime function? I have had a few tries but cant seem to find out which bit actually does it. Thanks for all of your help, it is nothing short of amazing.
You might want to look into how "Super pi" does its calculations.  According to its benchmarks file, ti can do ONE MILLION digits of Pi in under 300 seconds, and that's on an ancient 90MHz PC.

Currently, using different algorithms I can do around 10 million in about 60 seconds. However, that method is both memory intensive and can not be used in distributed computng (which is what this project is about), so while this method may be slower, it is the only method and so must be optimisted as best as we can make it.
I got:

Position:     10000 Digits: 856672279 Time:    111.81 seconds
Press any key to continue . . .

It is compiled under VC++. I had to change:

#define mul_mod(a,b,m) fmod( (double) a * (double) b, m)

to

#define mul_mod(a,b,m) (int)fmod( (double) a * (double) b, m)

because I was getting loads of warnings but I dont think that was the problem. I am going to use one of my code timing tools on it to get a better idea of where the time is going.

Back in a while.

Paul
Whoever manages to integrates a fast prime system (and allows it to work with numbers up to 2 trillion) into the code will probably be the accepted answer, along with mgh_mgharish who found the great information on primes.
Maybe it's just me, but it seems silly to use an algorithm that is TWO MILLION times slower than another algorithm, just because it's distributable.  If it's that slow, it's obvious this is an experiment in distributed processing, not a real attempt to make a fast Pi-finder.   So why spend much time on this aspect?

And it's risky to use (double) arithmetic--- with these algorithms you need EXACT integer results, which is something  floating-point numbers cannot guarantee.

The accuracy of the formula is 100%, the use of double gives us 15 digits of accuracy when we only need 9. The slowness come from the fact that to work out large amounts huge amounts of ram would be needed (looking in tb here) as well as a cluster of many cpu's. The idea was not to make it fast, but most other forumulas have a headroom of around 17 billion, the idea of this is to make it faster, in which we have found out what is causing the slowdown and are not hoping for integration. Just take a look at pihex, they managed to find the quadillionth bit of Pi in around two years using a few slow 400mhz pc's.
>the use of double gives us 15 digits of accuracy when we only need 9.

You may be right.

But have you considered what happens along these lines:


   15 minus 9 is six.  So you seem to have  5 (plus or minus one) digits of extra accuracy.

But that means that somewhere between one in ten thousand, to one in a million times, the extra digits are going to be all nines, or all zeroes, plus or minus a bit.    Which is going to generate a carry and affect the significant digits.  And the significant digits could be anything up to "999999999" to start with, so the last bit of inaccuracy can totally bollix your result.


Even if type double gives you 15 digits of accuracy, or 1000 digits of accuracy, or  a MILLION digits, it's not enough for doing truly, really, actually, verifiable,  precise integer math.

At least the way I see it.



any way the prime is not eating time..

the suspected code is

   for(k=1;k<=N;k++) {

       t=k;
       if (kq >= a) {
         do {
            t=t/a;
            v--;
         } while ((t % a) == 0);
         kq=0;
      }
      kq++;
      num=mul_mod(num,t,av);

      t=((k*2)-1);
      if (kq2 >= a) {
         if (kq2 == a) {
            do {
               t=t/a;
               v++;
            } while ((t % a) == 0);
         }
         kq2-=a;
      }
     den=mul_mod(den,t,av);
      kq2+=2;

       if (v > 0) {
        t=inv_mod(den,av);
        t=mul_mod(t,num,av);
        t=mul_mod(t,k,av);
        for(i=v;i<vmax;i++) t=mul_mod(t,a,av);
        s+=t;
        if (s>=av) s-=av;
      }

    }


Try commenting this part of the code..

it excecutes in .01 seconds


Thanxs,
Joju
grg99, interesting point. To test it I went to http://www.angio.net/pi/bigpi.cgi and typed in "999999" and at point 762 that way found. From this I set the client up to  753 (+9 = 762, right where the nines start) and it gave me the result "870721134", not rounding up the 4 > 5 even though there was a string of nines. Is one of us seeing this wrong?
EvilGuru,

I've been a bit long over this one as I had to reengineer my millisecond timer to use QueryPerformanceFrequency under Windoze. Here's some results:

Timer      Time      Count      Mean(Secs)      Name
0      21858061      1      6.106380      pseudo_main
1      3360      870      0.000001      log(2*N)/log(a)
2      2679      870      0.000001      fmod(sum+(double) s/ (double) av,1.0)
3      13135784      2947560      0.000001      k
4      21858043      1      6.106375      a
5      16012      870      0.000005      pow_mod;mul_mod
6      35474      870      0.000011      next_prime
7      13547      6776      0.000001      is_prime
8      9199      870      0.000003      pow_mod
9      21787312      870      0.006996      k2

My timer system records up to 10 timers. The time field is the total time spect. The count is the number of times that bit of code was performed, the mean(Secs) is the mean time spent in the section. The name should be obvious.

Here's the source so you can see what sections I was timing:

#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <time.h>

extern "C" {
#include "CDefs.h"
#include "CodeTime.h"
 void Idle ( void ) {};
 DEBUG;
};

TimerData CodeTimers[NTIMES] = {0};
/* #define HAS_LONG_LONG */

#ifdef HAS_LONG_LONG
#define mul_mod(a,b,m) (( (long long) (a) * (long long) (b) ) % (m))
#else
#define mul_mod(a,b,m) (int)fmod( (double) a * (double) b, m)
#endif

/* returns the inverse of x mod y */

int inv_mod(int x,int y) {
      int q,u,v,a,c,t;
      
      u=x;
      v=y;
      c=1;
      a=0;
      do {
            q=v/u;
            
            t=c;
            c=a-q*c;
            a=t;
            
            t=u;
            u=v-q*u;
            v=t;
      } while (u!=0);
      a=a%y;
      if (a<0) a=y+a;
      return a;
}

/* returns (a^b) mod m */

int pow_mod(int a,int b,int m)
{
      int r,aa;
      TESTING(StartTime(8,"pow_mod"););
      
      r=1;
      aa=a;
      while (1) {
            if (b&1) r=mul_mod(r,aa,m);
            b=b>>1;
            if (b == 0) break;
            aa=mul_mod(aa,aa,m);
      }
      TESTING(StopTime(8););
      return r;
}

/* returns true if n is prime */

int is_prime(int n)
{
      int r,i;
      int rnsr = 1;
      TESTING(StartTime(7,"is_prime"););
      if ((n % 2) == 0) return 0;
      
      r=(int)(sqrt(n));
      for(i=3;i<=r&&rnsr==1;i+=2) if ((n % i) == 0) rnsr = 0;
      TESTING(StopTime(7););
      return rnsr;
}

/* returns the prime number immediatly after n */

int next_prime(int n)
{
      TESTING(StartTime(6,"next_prime"););
      do {
            n++;
      } while (!is_prime(n));
      TESTING(StopTime(6););
      return n;
}

/* finds out nine digits of pi starting from n */

/* will probably be the main function in openmosix capable version*/

int pseudo_main(int n)
{
      int av,a,vmax,N,num,den,k,kq,kq2,t,v,s,i;
      double sum;
      
//      double start_time, elap_time;
      TESTING(StartTime(0,"pseudo_main"););
//      start_time = (double) clock ();
      N=(int)((n+20)*log(10)/log(2));
      
      sum=0;
      
      TESTING(StartTime(4,"a"););
      for(a=3;a<=(2*N);a=next_prime(a)) {
            
      TESTING(StartTime(1,"log(2*N)/log(a)"););
            vmax=(int)(log(2*N)/log(a));
            av=1;
            for(i=0;i<vmax;i++) av=av*a;
      TESTING(StopTime(1););
            
            s=0;
            num=1;
            den=1;
            v=0;
            kq=1;
            kq2=1;
            
      TESTING(StartTime(9,"k2"););
            for(k=1;k<=N;k++) {
                  
      TESTING(StartTime(3,"k"););
                  t=k;
                  if (kq >= a) {
                        do {
                              t=t/a;
                              v--;
                        } while ((t % a) == 0);
                        kq=0;
                  }
                  kq++;
                  num=mul_mod(num,t,av);
                  
                  t=(2*k-1);
                  if (kq2 >= a) {
                        if (kq2 == a) {
                              do {
                                    t=t/a;
                                    v++;
                              } while ((t % a) == 0);
                        }
                        kq2-=a;
                  }
                  den=mul_mod(den,t,av);
                  kq2+=2;
                  
                  if (v > 0) {
                        t=inv_mod(den,av);
                        t=mul_mod(t,num,av);
                        t=mul_mod(t,k,av);
                        for(i=v;i<vmax;i++) t=mul_mod(t,a,av);
                        s+=t;
                        if (s>=av) s-=av;
                  }
                  
      TESTING(StopTime(3););
            }
      TESTING(StopTime(9););

      TESTING(StartTime(5,"pow_mod;mul_mod"););
            t=pow_mod(10,n-1,av);
            s=mul_mod(s,t,av);
      TESTING(StopTime(5););
      TESTING(StartTime(2,"fmod(sum+(double) s/ (double) av,1.0)"););
            sum=fmod(sum+(double) s/ (double) av,1.0);
      TESTING(StopTime(2););
      }
      TESTING(StopTime(4););
      
      //elap_time = ((double) clock () - (double) start_time) / CLOCKS_PER_SEC;
      TESTING(StopTime(0););
      
      printf("Position: %9d Digits: %09d\n",n,(int)(sum*1e9));
      
      return 0;
}

/* calls up the pseudo_main function to find 9 digits at a time */

int main (int argc,char *argv[])
{
      int position=1,counter=1000; //counter is the digit of pi to start from.
      //  int number;
      //  double start_time, elap_time;
      
      pseudo_main(counter);
      
      DumpTimes();
      //Pause Code
      //system ("pause"); // execute M$-DOS' pause command
      return 0;
      
}

Its the TESTING(StartTime... TESTING(StopTime... that records the moments.

The results suggest some pretty obvious results:

of the 21858061 'ticks' spent in the program (pseudo_main timer) 21858043 is spent in the 'a' loop and, of that time, 13135784 ticks are spent in the 'k' loop. Inside this loops, next_prime, is_prime and pow_mod take almost trivial time.

I'd therefore recommend you try to ease the burden of all teh variable in the pseudo_main function. Break pseudo_main down into multiple calls. This allows the compiler to establish a new set of register variables in each call.

My CodeTime.h header looks like:

/*
      This header defines some functions that allow you to time certain processes
      at the millisecond/microsecond level.

      Use it as follows.

      1.      Somewhere in your code, probably best in a main module, declare the timer
            data with:
            TimerData CodeTimers[NTIMES] = {0};

      2.      Functions you wish to time can be surrounded by StartTime and StopTime calls.
            E.G.
            TESTING(StartTime(4,"ExpandFromFile"););
            length = ExpandFromFile ( &p[i+1], &expanded[j] );
            TESTING(StopTime(4););

            The name will appear in the log so I use the name of the function I am timing
            so I can use the code browser to get there quickly.

      3.      At any time - I usually do it at program terminate - call DumpTimes().
            This will give you a dump of the form:
            Timer[0]      570      131      ShowBasics
            Timer[1]      170      131      ShowNumbers #
            Timer[2]      3110      131      ShowNumbers $
            Timer[3]      2420      12196      ExpandParameter
            Timer[4]      1380      8782      ExpandFromFile
            Timer[5]      0      7      ReadFromFile
            Timer[6]      0      0      
            Timer[7]      0      0      
            Timer[8]      0      0      
            Timer[9]      3950      131      ShowBackground
            Giving Timer number, total time taken (in ms), call count and the supplied name.
            This data is tab delimited so you can copy and paste it into a spreadsheet.


*/

// Timing code to log millisecond times.
#include <sys\TimeB.h>

// Number of timers you can keep.
#define NTIMES      10

#ifdef WIN32
#include "WINDOWS.H"
//#include "WINBASE.H"
#define RecordedTime LONGLONG
#define ElapsedTime LONGLONG
#define fElapsed "%I64d"
#else
// Ansi stuff.
#define RecordedTime struct timeb
#define ElapsedTime long
#define fElapsed "%ld"
#endif

// One timer data structure.
typedef struct
{
      RecordedTime startTime;
      ElapsedTime elapsed;
      long count;
      char * name;
} TimerData;

#ifdef CODETIME

// Please allocate one of these somewhere and make it initially empty.
// E.G. TimerData CodeTimers[NTIMES] = {0};
extern TimerData CodeTimers[NTIMES];

// Call this to start a timer.
#ifdef WIN32
#define StartTime(n,fname) \
{\
      CodeTimers[n].name = fname;\
      CodeTimers[n].count += 1;\
      QueryPerformanceCounter((LARGE_INTEGER*)&CodeTimers[n].startTime);\
}
#else
#define StartTime(n,fname) \
{\
      CodeTimers[n].name = fname;\
      CodeTimers[n].count += 1;\
      ftime(&CodeTimers[n].startTime);\
}
#endif

// Call this to stop the timer.
#ifdef WIN32
#define StopTime(n) \
{\
      RecordedTime stopTime;\
      QueryPerformanceCounter((LARGE_INTEGER*)&stopTime);\
      CodeTimers[n].elapsed += stopTime - CodeTimers[n].startTime;\
}
#else
#define StopTime(n) \
{\
      struct timeb stopTime;\
      ftime(&stopTime);\
      CodeTimers[n].elapsed += ((stopTime.time - CodeTimers[n].startTime.time) * 1000L) + ((long)stopTime.millitm - CodeTimers[n].startTime.millitm);\
}
#endif

// Call this on exit or whenever you want the times logged.
#define DumpTimes()\
{\
      int i;\
      RecordedTime freq;\
      QueryPerformanceFrequency((LARGE_INTEGER*)&freq);\
      TFPRINT(("Timer\tTime\tCount\tMean(Secs)\tName"));\
      for ( i = 0; i < NTIMES; i++ )\
      {\
      TFPRINT(("%d\t"fElapsed"\t%ld\t%f\t%s", i, CodeTimers[i].elapsed, CodeTimers[i].count, CodeTimers[i].count > 0 ? (double)CodeTimers[i].elapsed / (double)CodeTimers[i].count / freq: 0.0,CodeTimers[i].name == NULL ? "": CodeTimers[i].name ));\
      }\
}

#else
// Call this to start a timer.
#define StartTime(n,fname)
// Call this to stop the timer.
#define StopTime(n)
// Call this on exit or whenever you want the times logged.
#define DumpTimes()

#endif

Note that my enhancements have upped its resolution beyiond milliseconds under a WIN32 environment. TFPRINT Is just like printf but with two ((.

Good luck.

Paul
That is amazing! Do you think there is much room for improvements in the k loops, and this may sound a tad noobie but would you be able to explain how splitting up the pseudo_main() call into smaller parts would help make it faster? Sorry, this is the first time I have needed to optimise my code :) Thanks for any help/advice you can give.
There's a big problem with using QueryPerformanceTimer()-- it has a LOT of overhead per call.    
Try calling it twice in sucession and print out the time difference.  It's probably around 1500 clock cycles.

So it's not very useful for timing short instruction sequences.   You can make it a bit better by subtracting the overhead time from the printouts.

As a better alternative, you might try using the RDTSC instruction, which is at least 40 times faster, but still awfully slow, around 45 clocks.

Greg,

Good points! I'll investigate these. I only want the figures relative so it QueryPerformanceTimer is nearly what I want. My earlier version that used ftime just wasnt up to it. It was recording millions of calls taking zero time because the calls themselves took less than 1ms.

EvilGuru,

Splitting can make a heap of difference. The aim shuld be to reduce the number of local stack variables. The compiler can then more efficiently make use of registers and instruction cacheing. Remember also that there is  stack overhead so dont go too far. I'd suggest you put the 'a' loop in one function and the 'k' loop in another, then see if you've made a difference.

Paul
I suspect fmod().
solution some fast fmod func ?
PaulCaswell, you mean something like:

pseudo_main {
int a(vars here);
}
a {
a code here
int k(vars here);
}
Or have I got the wrong end of the stick?

According to the timer fmod takes up very little time, so I while I did think it was that I now dont. But a faster one cant hurt.
EvilGuru,

This is the sort of thing I was suggesting. I've only pulled the a_loop code out but I am sure you will understand what I mean. I hope you can pull the k_loop out yourself.

// The 'a' part of the loop.
void a_loop ( int N, int n, int a, double * sum )
{
      int av,vmax,num,den,k,kq,kq2,t,v,s,i;
      vmax=(int)(log(2*N)/log(a));
      av=1;
      for(i=0;i<vmax;i++) av=av*a;
      
      s=0;
      num=1;
      den=1;
      v=0;
      kq=1;
      kq2=1;
      
      for(k=1;k<=N;k++) {
            
            t=k;
            if (kq >= a) {
                  do {
                        t=t/a;
                        v--;
                  } while ((t % a) == 0);
                  kq=0;
            }
            kq++;
            num=mul_mod(num,t,av);
            
            t=(2*k-1);
            if (kq2 >= a) {
                  if (kq2 == a) {
                        do {
                              t=t/a;
                              v++;
                        } while ((t % a) == 0);
                  }
                  kq2-=a;
            }
            den=mul_mod(den,t,av);
            kq2+=2;
            
            if (v > 0) {
                  t=inv_mod(den,av);
                  t=mul_mod(t,num,av);
                  t=mul_mod(t,k,av);
                  for(i=v;i<vmax;i++) t=mul_mod(t,a,av);
                  s+=t;
                  if (s>=av) s-=av;
            }
            
      }
      
      t=pow_mod(10,n-1,av);
      s=mul_mod(s,t,av);
      *sum=fmod(*sum+(double) s/ (double) av,1.0);
}

/* finds out nine digits of pi starting from n */

/* will probably be the main function in openmosix capable version*/

int pseudo_main(int n)
{
      int a,N;
      double sum;
      
//      double start_time, elap_time;
      TESTING(StartTime(0,"pseudo_main"););
//      start_time = (double) clock ();
      N=(int)((n+20)*log(10)/log(2));
      
      sum=0;
      
      TESTING(StartTime(4,"a"););
      for(a=3;a<=(2*N);a=next_prime(a)) {
            a_loop ( N, n, a, &sum );
      }
      TESTING(StopTime(4););
      
      //elap_time = ((double) clock () - (double) start_time) / CLOCKS_PER_SEC;
      TESTING(StopTime(0););
      
      printf("Position: %9d Digits: %09d\n",n,(int)(sum*1e9));
      
      return 0;
}

After a good bit of playing around with it I replaced:
#ifdef HAS_LONG_LONG
#define mul_mod(a,b,m) (( (long long) (a) * (long long) (b) ) % (m))
#else
#define mul_mod(a,b,m) fmod( (double) a * (double) b, m)
#endif
with:
#define mul_mod(a,b,m) (( (__int64) (a) * (__int64) (b) ) % (m))

My time for 10,000 went down from 85 seconds to 42 seconds! How bigger benefit do you think I will get from splitting up the functions and are there any other causes where you think something like __int64 will help?
ASKER CERTIFIED SOLUTION
Avatar of PaulCaswell
PaulCaswell
Flag of United Kingdom of Great Britain and Northern Ireland image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
The program itself will come with a .cfg file where I will have an option saying:
Use __int64 = yes;
However on the machines I have tried it on the speed increase has been quite large, but I will give the option to swtich.
You ahve helped so much! Thanks, answer accepted.