Solved

Fast base-62 encode/decode

Posted on 2006-09-06
13
10,610 Views
Last Modified: 2012-06-21
I use base-62 encoding quite a lot as it's URL safe and reasonably obfuscated, however, my current implementation is not very fast and I'm wondering if there's a better route, perhaps using an iterative method, perhaps using pack() and friends. As an example: base62_encode('http://www.php.net') gives 'ScoQhfdZc5xJxw6hwvPQMo3W'.

My current implementation use bc_math to treat strings as numbers in base-256, converts them to base-10, then to base-62 (or the reverse). I do this with these functions which do arbitrary base conversions:

function dec2base($dec, $base, $digits = FALSE) {
      if($base < 2 or $base > 256) {
            die("Invalid Base: .$base\n");
      }
      bcscale(0);
      $value = '';
      if(!$digits) {
            $digits = digits($base);
      }
      while($dec > $base - 1) {
            $rest = bcmod($dec,$base);
            $dec = bcdiv($dec,$base);
            $value = $digits[$rest].$value;
      }
      $value=$digits[intval($dec)].$value;
      return (string) $value;
}

function base2dec($value, $base, $digits = FALSE) {
      if($base < 2 or $base > 256) {
            die("Invalid Base: .$base\n");
      }
      bcscale(0);
      if($base < 37) {
            $value = strtolower($value);
      }
      if(!$digits) {
            $digits = digits($base);
      }
      $size = strlen($value);
      $dec = '0';
      for($loop=0; $loop < $size; $loop++) {
            $element = strpos($digits, $value[$loop]);
            $power = bcpow($base, $size-$loop-1);
            $dec = bcadd($dec, bcmul($element, $power));
      }
      return (string)$dec;
}

function digits($base) {
      if($base < 64) {
            return substr('0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ-_', 0, $base);
      } else {
            return substr("\x0\x1\x2\x3\x4\x5\x6\x7\x8\x9\xa\xb\xc\xd\xe\xf\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !\x22#\x24%&'()*+,-./0123456789:;<=>\x3f@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff", 0, $base);
      }
}

function base62_encode($value) {
      return dec2base(base2dec($value, 256), 62);
}

function base62_decode($value) {
      return dec2base(base2dec($value, 62), 256);
}

So, they all work ok, but it's a bit roundabout. It would be nice to go directly from 256->62, but I can't seem to think that through. There are some base conversion functions in bc_math, but they only go up to base-36. The same goes for the big_int extension in pecl, which despite it's claims, seems slower than bc_math.

I think I need to ask someone who's more of a mathematician than a coder!
0
Comment
Question by:Marcus Bointon
  • 4
  • 4
  • 2
  • +1
13 Comments
 
LVL 25

Author Comment

by:Marcus Bointon
ID: 17467026
Sorry about that long line...
0
 
LVL 25

Author Comment

by:Marcus Bointon
ID: 17467038
BTW, some of these functions were adapted from PHP user notes.
0
 
LVL 29

Expert Comment

by:TeRReF
ID: 17469425
Are the numbers too large to use base_convert() ?
http://php.net/base_convert

0
Free Tool: Subnet Calculator

The subnet calculator helps you design networks by taking an IP address and network mask and returning information such as network, broadcast address, and host range.

One of a set of tools we're offering as a way of saying thank you for being a part of the community.

 
LVL 29

Expert Comment

by:TeRReF
ID: 17469437
Sorry, disregard my comment, I had forgotten it only goes to base-36 (like you already stated in your question :)
0
 
LVL 84

Expert Comment

by:ozo
ID: 17736281
   $dec = 0;
     for($loop=0; $loop < $size; $loop++) {
          $dec = $dec * $base + strpos($digits, $value[$loop]);
     }
0
 
LVL 25

Author Comment

by:Marcus Bointon
ID: 17737008
That's no use - the numbers involved are well outside what's feasible to handle using ints.
For example, the 256-bit number 'http://www.experts-exchange.com/' converts to this in base-10:

47246294140391775270645689091620257283684500560481702162465929546238692125999
0
 
LVL 84

Expert Comment

by:ozo
ID: 17737080
Could you do base 64 encoding instead of base 62?
Would you be willing to waste a litte space in order to gain some speed?
0
 
LVL 84

Expert Comment

by:ozo
ID: 17737138
If you could use 5 base 62 digits to encode 29 bits at a time, it would be feasible to do it with 32 bit integer arithmetic
or if you use 21 base 62 digits to encode 125 bits at a time, you can limit the size of your multiprecision arithmetic
          $dec = bcadd($element, bcmul($dec, $base)); would at least cut out a bcpow($base, $size-$loop-1); operation
0
 
LVL 25

Author Comment

by:Marcus Bointon
ID: 17737203
I would use base-64, but it's not is not safe to use in all of URLs, domain-names, pathnames and email addresses.
Also, all the usual implementations add padding characters.

You second comment is more like the kind of thing I was after! Needs a bit of thinking about.

As it happens, since I originally posted this question my need for this function has somewhat diminished, but I'm still curious for a solution.
0
 
LVL 84

Accepted Solution

by:
ozo earned 200 total points
ID: 17737311
Since 62^n != 256^m unless m ==0 and n == 0, base62 will have padding too
multi-precision arithmetic usually gets slower as the numbers get larger, so if you can do in chunks you may speed things up.
e.g. instead of base 10, if you work in base 10^9, you may be able to speed things by a factor of 9,
and its easy to convert from base 10 to base 10^9 and vice versa
0
 
LVL 22

Assisted Solution

by:NovaDenizen
NovaDenizen earned 200 total points
ID: 17739116
Instead of treating the string as a huge multiprecision number, you could treat it as a series of fixed-precision smaller numbers.  This would cost you about 20% more space but it should be a lot faster to calculate.

ln(62)/ln(2) = 5.954 bits.  You could represent 2 url characters with 3 base-62 characters, 4 url characters with up to 6 (5.37) base-62 characters (no point in doing this because it's the same expansion as for 2 url characters), 8 url characters with up to 11 (10.75) base-62 characters.

0

Featured Post

Free Tool: Port Scanner

Check which ports are open to the outside world. Helps make sure that your firewall rules are working as intended.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Deprecated and Headed for the Dustbin By now, you have probably heard that some PHP features, while convenient, can also cause PHP security problems.  This article discusses one of those, called register_globals.  It is a thing you do not want.  …
Foreword (July, 2015) Since I first wrote this article, years ago, a great many more people have begun using the internet.  They are coming online from every part of the globe, learning, reading, shopping and spending money at an ever-increasing ra…
The viewer will learn how to look for a specific file type in a local or remote server directory using PHP.
This tutorial will teach you the core code needed to finalize the addition of a watermark to your image. The viewer will use a small PHP class to learn and create a watermark.

808 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question