Go Premium for a chance to win a PS4. Enter to Win

x
?
Solved

Fast base-62 encode/decode

Posted on 2006-09-06
13
Medium Priority
?
10,844 Views
Last Modified: 2012-06-21
I use base-62 encoding quite a lot as it's URL safe and reasonably obfuscated, however, my current implementation is not very fast and I'm wondering if there's a better route, perhaps using an iterative method, perhaps using pack() and friends. As an example: base62_encode('http://www.php.net') gives 'ScoQhfdZc5xJxw6hwvPQMo3W'.

My current implementation use bc_math to treat strings as numbers in base-256, converts them to base-10, then to base-62 (or the reverse). I do this with these functions which do arbitrary base conversions:

function dec2base($dec, $base, $digits = FALSE) {
      if($base < 2 or $base > 256) {
            die("Invalid Base: .$base\n");
      }
      bcscale(0);
      $value = '';
      if(!$digits) {
            $digits = digits($base);
      }
      while($dec > $base - 1) {
            $rest = bcmod($dec,$base);
            $dec = bcdiv($dec,$base);
            $value = $digits[$rest].$value;
      }
      $value=$digits[intval($dec)].$value;
      return (string) $value;
}

function base2dec($value, $base, $digits = FALSE) {
      if($base < 2 or $base > 256) {
            die("Invalid Base: .$base\n");
      }
      bcscale(0);
      if($base < 37) {
            $value = strtolower($value);
      }
      if(!$digits) {
            $digits = digits($base);
      }
      $size = strlen($value);
      $dec = '0';
      for($loop=0; $loop < $size; $loop++) {
            $element = strpos($digits, $value[$loop]);
            $power = bcpow($base, $size-$loop-1);
            $dec = bcadd($dec, bcmul($element, $power));
      }
      return (string)$dec;
}

function digits($base) {
      if($base < 64) {
            return substr('0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ-_', 0, $base);
      } else {
            return substr("\x0\x1\x2\x3\x4\x5\x6\x7\x8\x9\xa\xb\xc\xd\xe\xf\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !\x22#\x24%&'()*+,-./0123456789:;<=>\x3f@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff", 0, $base);
      }
}

function base62_encode($value) {
      return dec2base(base2dec($value, 256), 62);
}

function base62_decode($value) {
      return dec2base(base2dec($value, 62), 256);
}

So, they all work ok, but it's a bit roundabout. It would be nice to go directly from 256->62, but I can't seem to think that through. There are some base conversion functions in bc_math, but they only go up to base-36. The same goes for the big_int extension in pecl, which despite it's claims, seems slower than bc_math.

I think I need to ask someone who's more of a mathematician than a coder!
0
Comment
Question by:Marcus Bointon
  • 4
  • 4
  • 2
  • +1
13 Comments
 
LVL 25

Author Comment

by:Marcus Bointon
ID: 17467026
Sorry about that long line...
0
 
LVL 25

Author Comment

by:Marcus Bointon
ID: 17467038
BTW, some of these functions were adapted from PHP user notes.
0
 
LVL 29

Expert Comment

by:TeRReF
ID: 17469425
Are the numbers too large to use base_convert() ?
http://php.net/base_convert

0
VIDEO: THE CONCERTO CLOUD FOR HEALTHCARE

Modern healthcare requires a modern cloud. View this brief video to understand how the Concerto Cloud for Healthcare can help your organization.

 
LVL 29

Expert Comment

by:TeRReF
ID: 17469437
Sorry, disregard my comment, I had forgotten it only goes to base-36 (like you already stated in your question :)
0
 
LVL 85

Expert Comment

by:ozo
ID: 17736281
   $dec = 0;
     for($loop=0; $loop < $size; $loop++) {
          $dec = $dec * $base + strpos($digits, $value[$loop]);
     }
0
 
LVL 25

Author Comment

by:Marcus Bointon
ID: 17737008
That's no use - the numbers involved are well outside what's feasible to handle using ints.
For example, the 256-bit number 'http://www.experts-exchange.com/' converts to this in base-10:

47246294140391775270645689091620257283684500560481702162465929546238692125999
0
 
LVL 85

Expert Comment

by:ozo
ID: 17737080
Could you do base 64 encoding instead of base 62?
Would you be willing to waste a litte space in order to gain some speed?
0
 
LVL 85

Expert Comment

by:ozo
ID: 17737138
If you could use 5 base 62 digits to encode 29 bits at a time, it would be feasible to do it with 32 bit integer arithmetic
or if you use 21 base 62 digits to encode 125 bits at a time, you can limit the size of your multiprecision arithmetic
          $dec = bcadd($element, bcmul($dec, $base)); would at least cut out a bcpow($base, $size-$loop-1); operation
0
 
LVL 25

Author Comment

by:Marcus Bointon
ID: 17737203
I would use base-64, but it's not is not safe to use in all of URLs, domain-names, pathnames and email addresses.
Also, all the usual implementations add padding characters.

You second comment is more like the kind of thing I was after! Needs a bit of thinking about.

As it happens, since I originally posted this question my need for this function has somewhat diminished, but I'm still curious for a solution.
0
 
LVL 85

Accepted Solution

by:
ozo earned 800 total points
ID: 17737311
Since 62^n != 256^m unless m ==0 and n == 0, base62 will have padding too
multi-precision arithmetic usually gets slower as the numbers get larger, so if you can do in chunks you may speed things up.
e.g. instead of base 10, if you work in base 10^9, you may be able to speed things by a factor of 9,
and its easy to convert from base 10 to base 10^9 and vice versa
0
 
LVL 22

Assisted Solution

by:NovaDenizen
NovaDenizen earned 800 total points
ID: 17739116
Instead of treating the string as a huge multiprecision number, you could treat it as a series of fixed-precision smaller numbers.  This would cost you about 20% more space but it should be a lot faster to calculate.

ln(62)/ln(2) = 5.954 bits.  You could represent 2 url characters with 3 base-62 characters, 4 url characters with up to 6 (5.37) base-62 characters (no point in doing this because it's the same expansion as for 2 url characters), 8 url characters with up to 11 (10.75) base-62 characters.

0

Featured Post

Hire Technology Freelancers with Gigs

Work with freelancers specializing in everything from database administration to programming, who have proven themselves as experts in their field. Hire the best, collaborate easily, pay securely, and get projects done right.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Nothing in an HTTP request can be trusted, including HTTP headers and form data.  A form token is a tool that can be used to guard against request forgeries (CSRF).  This article shows an improved approach to form tokens, making it more difficult to…
Originally, this post was published on Monitis Blog, you can check it here . In business circles, we sometimes hear that today is the “age of the customer.” And so it is. Thanks to the enormous advances over the past few years in consumer techno…
Explain concepts important to validation of email addresses with regular expressions. Applies to most languages/tools that uses regular expressions. Consider email address RFCs: Look at HTML5 form input element (with type=email) regex pattern: T…
The viewer will learn how to look for a specific file type in a local or remote server directory using PHP.
Suggested Courses

971 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question