Solved

URGENT: How do I convert my current read line from file function to read unicode?

Posted on 2003-03-03
Medium Priority
531 Views
Last Modified: 2007-12-19
I have a piece of code which reads a line of characters from a file and returns it as a string. The problem I am having is that I know one line in the file contains Unicode characters.

Here is the function:

string GetNextLine(FILE *fin)
{
     string     rv;
    char    buffer[2048];
    bool    line_end = false;  
    char    c;
    char    u;
    char    *bptr = buffer;

     if (feof(fin))     return "";

    memset(buffer,0,2048);
     
    while (!feof(fin) && !line_end)
    {
        c = 0;
        fread(&c,1,1,fin);
        if (c==0x0a)    line_end = true;
        if (c==0x00)    line_end = true;
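        // note: on platforms where plain char is signed, bytes >= 0x80 read here
        // become negative, so the c<31 test below zeroes them out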
        if (c<31)       c=0;
        if (c)          *bptr++ = c;
    }

    *bptr++ = 0;
    *bptr++ = 0;

    rv = buffer;

    return rv;
}

Is there an easy way to make it process unicode? Or am I looking at writing a new function?
Question by:Phoenix_4uk
8 Comments
 
LVL 12

Expert Comment

by:Salte
ID: 8057112
Part of the problem here is that I don't know what you mean by "Unicode".

In practice, very few files are stored as raw Unicode code points. There are many reasons for that, but the fact remains that Unicode is seldom used directly.

What most people refer to as "unicode" is usually UTF-8, UTF-16 or even in some cases UTF-32. Of these the UTF-32 and possibly the UTF-16 can be thought of as 'pure unicode' but in reality they are all encodings that can be used to transfer and store/retrieve data in unicode.

Now, is your program itself operating with Unicode? If it previously worked with char data, I assume the data isn't Unicode at all. In that case your program operates with an 8-bit char type and is, in general, unable to handle Unicode or wide character data at all, and you will have to rewrite it to use wide characters. That is a sizeable task in itself, and in practice it often gets complicated.

Secondly, you need to be able to convert between the encoding stored in the file (some form of Unicode) and the representation your program operates with. That means using some form of encoder/decoder: decode the data you read from the file, and encode the data you write back to it.

ofstream has, I believe, some support for specifying encoding and decoding. I'm not sure how useful it is, though; I have generally run into trouble when trying to read Unicode data from files. The main problem is that different platforms solve the problem in different, often incompatible, ways. For example, Cygwin uses a wchar_t of size 2, i.e. sizeof(wchar_t) == 2. You can store UTF-16 code units in such a type, but you can't directly store characters outside the BMP in a single wchar_t; you have to represent them as a two-unit encoding, i.e. UTF-16 surrogate pairs.

sizeof(wchar_t) == 4 on Linux (at least on the IBM PC), so there you can store UTF-32 characters in a wchar_t. This is also the best way to do it in practice, but it means that wchar_t on one machine is different from wchar_t on another. To make matters worse, the wchar_t on Linux may still end up holding UTF-16 or even UTF-8 values unless you tell it to do otherwise.

L"....some string...";

will, on Linux, in some cases expand to a UTF-8 representation stored in wchar_t elements, so even though the wchar_t type can hold 32 bits, only values in the range 0-255 appear in each element.

So I think you can safely say that the encoding/decoding support in the std::iostream library isn't really very good.

However, since you are using a char type and you suddenly see output like what you showed above, I suspect the data in the file is UTF-8. If you want to translate that to Unicode, you can follow the steps below.

Given the above uncertainty around wchar_t, I find it hard to recommend using wchar_t to store the characters from the file. However, there isn't really much of an alternative:

You could define int as the char type, so that each character is stored in an int, but this has at least four problems attached to it:

1. int has no support in the string functions, so you would have to write your own unilen(), unicmp(), unichr(), and so on: every string function you want to use, you would have to write yourself, because the C library has no support for an 'int as char' type.

2. String literals would be hard to read:

int hello[] = { 'H', 'e', 'l', 'l', 'o', 0 };
is sort of readable, but as soon as you need characters outside pure ASCII it becomes hard to read.

3. At some point you would have to write this back out to a file or display it on screen. That means converting from your int-as-char format to whatever format the file or screen output expects. This is doable and is most likely a simple UTF-32 to UTF-8 conversion.

4. All the wcs support that does exist in your library would go unused. There are functions there that test a character for being a digit or alphabetic and so on, but you cannot use any of those directly without first translating your 'int as char' text to wchar_t and then doing the test on the resulting wchar_t character(s).

So even though wchar_t has its problems it is very possible that it is your best bet. If you want your program to run on both platform types (sizeof(wchar_t) == 2 and sizeof(wchar_t) == 4) you probably have to write two different libraries to handle that though.

The handling in itself is simple enough in this case:

use wifstream instead of ifstream.

when reading characters from the file you can either specify that the file is UTF-8 or you can read and translate manually. The UTF-8 to UTF-32 conversion is simple enough, and if sizeof(wchar_t) == 4 it is very likely that "unicode" is the same as "UTF-32".

If sizeof(wchar_t) == 2 you cannot store 32 bit codes in the wchar_t variable so I would guess that "UTF-16" is what is meant by "unicode".

Translation from UTF-8 to UTF-32 is simple enough and I gave a thorough response to that exact question in a previous posting here.

Translation from UTF-8 to UTF-16 is best done by first translating to UTF-32 (use int or long as the type to hold UTF-32 values if sizeof(wchar_t) == 2). Then, when you have a UTF-32 value, you can translate it into one or two UTF-16 code units, again using the same posting I spoke of above.
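
Going back to the first option above (letting the stream do the decoding): the standard library doesn't give you a portable ready-made UTF-8 facet today, but with a newer compiler a minimal sketch would look roughly like this. Note the hedges: read_wide_line is my own name, std::codecvt_utf8 is a C++11 addition (deprecated again in C++17), and on a platform with a 2-byte wchar_t you would want codecvt_utf8_utf16 instead.

#include <fstream>
#include <locale>
#include <codecvt>   // C++11; deprecated in C++17
#include <string>

// Minimal sketch: let a wide stream decode UTF-8 into wchar_t itself.
std::wstring read_wide_line(const char *path)
{
    std::wifstream fin;
    // Imbue before opening so the facet is in place for all reads.
    fin.imbue(std::locale(fin.getloc(), new std::codecvt_utf8<wchar_t>));
    fin.open(path);

    std::wstring line;
    std::getline(fin, line);
    return line;
}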

UTF-8 to UTF-32 is in short like this:

0x00..0x7f are translated to 0x000000..0x00007f

I.e. if the high bit is 0, the code is the byte itself.

0x80..0xbf are continuation bytes only; they should never appear on their own.

0xc0..0xdf are always followed by ONE and only one byte in the range 0x80..0xbf, the code is:

c is a byte in the range 0xc0..0xdf
d is a byte in the range 0x80..0xbf

code = (static_cast<int>(c & 0x1f) << 6) | (d & 0x3f);
code must be in the range 0x000080..0x0007ff Otherwise it is invalid.

0xe0..0xef are always followed by TWO and exactly two bytes in the range 0x80..0xbf, the code is:

c is a byte in the range 0xe0..0xef
d is the next byte in the range 0x80..0xbf
e is the next byte in the range 0x80..0xbf

code = (static_cast<int>(c & 0x0f) << 12) |
       (static_cast<int>(d & 0x3f) << 6) |
       (e & 0x3f);

code must be in the range 0x000800..0x00d7ff
or 0x00e000..0x00ffff otherwise it is invalid.
The codes 0x00fffe..0x00ffff are also invalid as characters. In particular you might want to detect 0x00fffe, since that indicates that the bytes are in the opposite order: 0x00fffe is invalid, while 0x00feff is a valid code and is used as the "byte order mark" at the beginning of some files.

If the first byte is in the range 0xf0..0xf7, it is followed by exactly THREE bytes in the range 0x80..0xbf; with c, d, e and f being the four bytes in order, the code is as follows:

code = (static_cast<int>(c & 0x07) << 18) |
       (static_cast<int>(d & 0x3f) << 12) |
       (static_cast<int>(e & 0x3f) <<  6) |
       (f & 0x3f);

The code must be in the range 0x010000..0x10ffff or the code is invalid.

If the first byte is in the range 0xf8..0xfb it would specify 4 bytes in the range 0x80..0xbf. However they would all give codes that are too high for UTF-16 and so they are invalid as UTF-32 even though they would be valid UCS-4 codes. The codes would be valid UCS-4 codes in the range 0x00200000..0x03ffffff.

Similarly if the first byte is in the range 0xfc..0xfd it would specify 5 bytes in the range 0x80..0xbf. They too, would be valid UCS-4 codes but invalid UTF-32 codes. The codes would be in the range 0x04000000..0x7fffffff.

A UTF-32 code would be in the range 0x000000..0x00d7ff and 0x00e000..0x10ffff.

Note that proper UTF-8 to UTF-32 would never permit those 5 and 6 byte sequences and the maximum sequence is 4 bytes. You would also never allow a code to specify a value in a range other than the range required. For example you could theoretically express the value 0x00007f as: 0xc1 0xbf but this value is expressed by the single code 0x7f and so that sequence 0xc1 0xbf is an invalid sequence.
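
As a small aside on the byte order mark mentioned above: if you want to check whether the file itself announces its encoding, something like the following minimal sketch works. detect_bom is my own helper name, not a library function, and the UTF-32 BOMs are ignored here.

#include <cstdio>

// Sketch: identify a leading byte order mark. Leaves the stream positioned
// just after the BOM, or back at the start of the file if there is none.
const char *detect_bom(FILE *fin)
{
    unsigned char b[3] = { 0, 0, 0 };
    size_t n = fread(b, 1, 3, fin);

    if (n >= 3 && b[0] == 0xef && b[1] == 0xbb && b[2] == 0xbf)
        return "UTF-8";                       // EF BB BF
    if (n >= 2 && b[0] == 0xff && b[1] == 0xfe) {
        fseek(fin, 2, SEEK_SET);
        return "UTF-16, little endian";       // FF FE
    }
    if (n >= 2 && b[0] == 0xfe && b[1] == 0xff) {
        fseek(fin, 2, SEEK_SET);
        return "UTF-16, big endian";          // FE FF
    }
    fseek(fin, 0, SEEK_SET);                  // no BOM: rewind
    return "no BOM";
}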

Converting UTF-32 to UTF-16 is simple:

If the UTF-32 code is in the range 0x000000..0x00d7ff or in the range 0x00e000..0x00ffff then the code fits inside 16 bits and is used as-is, so 0x003f2d becomes 0x3f2d in UTF-16.

If the UTF-32 code is in the range 0x010000..0x10ffff then you first subtract 0x010000 from the code and get a new value in the range 0x000000..0x0fffff. This value fits inside 20 bits and is split in two: the high 10 bits become the low 10 bits of a value whose high 6 bits are 110110, and the low 10 bits become the low 10 bits of a value whose high 6 bits are 110111:

if (utf32 >= 0x10000) {
   w = utf32 - 0x10000;
   put_utf16(0xd800 | (w >> 10));
   put_utf16(0xdc00 | (w & 0x3ff));
} else {
   put_utf16(utf32);
}

This code assumes that utf32 never has a value like 0xd832 or something like that.

Since a UTF-16 code unit occupies more than one byte, the machine's endianness becomes important. The code above will store the UTF-16 in whatever endianness the machine uses. If you write this to a file, you should probably put a byte order mark first in the file so that the reader knows which endianness you used and can swap the bytes if necessary. This is done simply by:

write_utf16_to_file(0xfeff);

The 0xfeff is the BOM, the byte order mark.
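
write_utf16_to_file isn't spelled out anywhere above, so here is one possible minimal sketch of it. The explicit FILE* parameter is my addition; it simply writes the 16-bit unit in the machine's own byte order, which is exactly why the BOM matters to whoever reads the file back.

#include <cstdio>

// Sketch: write one UTF-16 code unit in native byte order.
// A reader uses the leading BOM (0xfeff) to discover which order that was.
void write_utf16_to_file(FILE *fout, unsigned short unit)
{
    fwrite(&unit, sizeof(unit), 1, fout);
}

// Usage: write_utf16_to_file(fout, 0xfeff);   // BOM first, then the data units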

Alf
 

Author Comment

by:Phoenix_4uk
ID: 8057290
thanks for the prompt response.

The data I first posted in my question didn't display properly. What the actual data I am trying to read looks like is +Z?CI<Here there is a spade symbol>(Y<here is a diamond symbol>S.

I am outputting the data from Active Directory to a file, which I then save as UTF-8. My program reads each line from the file and does stuff with the data. What I want is for it to hold the line exactly as it appears in the file, so that I can manipulate the correct string in my program. At the moment, using the read-line function, it displays: +Z?CI?(Y?S.

I obviously need to get it to process the unicode characters differently in the function but I'm not sure how to do this.
 
LVL 12

Expert Comment

by:Salte
ID: 8057830
My guess is that the string is UTF-8. I'm not sure about your display's encoding, but it would help if you gave us the binary codes for the spade and diamond symbols as well as the other characters.

As I said, UTF-8 encoding is mostly like this:

UTF-8 is a sequence of codes; each code is a sequence of 1-4 bytes, as determined by the first byte of the sequence:

If the first byte is 0xzz in the range 0x00..0x7f, then the UTF-32 code is 0x0000zz; this gives a UTF-32 code in the range 0x000000..0x00007f.

If the first byte is in the range 0x80..0xbf then it is invalid and you should complain. The input is NOT valid UTF-8 in this case.

If the first byte is in the range 0xc0..0xdf then it is of the form 110uuuuu and it is followed by one byte in the range 0x80..0xbf of the form 10vvvvvv. Those two bytes form a UTF-32 code of:

0 0000 0000 0uuu uuvv vvvv

The code must be in the range 0x000080..0x0007ff

If the first byte is in the range 0xe0..0xef then it is followed by two bytes in the range 0x80..0xbf and the three bytes are of the form:

1110uuuu 10vvvvvv 10wwwwww

and the code is:

0 0000 uuuu vvvv vvww wwww

The code must be in the range 0x000800..0x00d7ff
or 0x00e000..0x00ffff otherwise it is not valid UTF-8 for UTF-32.

If the first byte is in the range 0xf0..0xf7 then it is followed by 3 bytes in the range 0x80..0xbf and the four bytes are of the form:

1111 0uuu 10vvvvvv 10wwwwww 10zzzzzz

and the code is:

u uuvv vvvv wwww wwzz zzzz

and the code must be in the range 0x010000..0x10ffff.

If the first byte is in the range 0xf8..0xff it is invalid.


A UTF-32 code in the range 0x000000..0x00d7ff or 0x00e000..0x10ffff can then be translated to UTF-16 as follows:

if (utf32 >= 0x110000)
   throw invalid_code(utf32);
else if (utf32 >= 0x10000) {
   int w = utf32 - 0x10000;
   put_utf16(0xd800 | (w >> 10));
   put_utf16(0xdc00 | (w & 0x3ff));
} else if (utf32 >= 0xd800 && utf32 < 0xe000)
   throw invalid_code(utf32);
else
   put_utf16(utf32);

It is possible you don't have to translate to UTF-16. If your wchar_t type and the wctype functions assume wchar_t is UTF-32, then you skip that part; if they assume wchar_t is UTF-16, then you include it.

In any case you can attempt to read the char string as UTF-8:

int read_utf32()
{
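    // (Assumption: p is a char* positioned in the UTF-8 input buffer, e.g. a
    // file-scope pointer; buffer refill and end-of-input checks are omitted.)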
    int code;
    unsigned char c = static_cast<unsigned char>(*p++);
    unsigned char d, e, f;

    if (c < 0x80)
        return c;
    if (c < 0xc0)
        throw invalid_utf8_code(1, c,0,0,0);
    if (c < 0xe0) { // 110. .... 10......
       // note if this is the last byte in file
       // it is not UTF-8, if it is last in buffer
       // you should refill buffer before getting next
       // byte.
       d = static_cast<unsigned char>(*p++);
       code = (static_cast<int>(c & 0x1f) << 6) | (d & 0x3f);
       if (code < 0x80)
          throw invalid_utf8_code(2, c, d, 0, 0);
       return code;
    }
    if (c < 0xf0) { // 1110.... 10...... 10......
       // here we expect two bytes following
       // refill buffer or return error as appropriate
       // if unable to get those two bytes.
       d = static_cast<unsigned char>(*p++);
       e = static_cast<unsigned char>(*p++);
       code = (static_cast<int>(c & 0x0f) << 12) |
              (static_cast<int>(d & 0x3f) << 6) |
              (e & 0x3f);
       if (code < 0x0800 || (code >= 0xd800 &&
                             code < 0xe000))
          throw invalid_utf8_code(3,c,d,e,0);
       return code;
    }
    if (c < 0xf8) { // 1111 0... 10...... 10...... 10......
       // same as before, we expect three bytes.
       d = static_cast<unsigned char>(*p++);
       e = static_cast<unsigned char>(*p++);
       f = static_cast<unsigned char>(*p++);
       code = (static_cast<int>(c & 0x07) << 18) |
              (static_cast<int>(d & 0x3f) << 12) |
              (static_cast<int>(e & 0x3f) << 6) |
              (f & 0x3f);
       if (code < 0x10000 || code > 0x10ffff)
           throw invalid_utf8_code(4,c,d,e,f);
       return code;
    }
    throw invalid_utf8_code(1,c,0,0,0);
}

This function will read UTF-8 and return UTF-32.

To read the same input and return UTF-16 instead, you do something like this:
int read_utf16()
{
    static int next = 0;
    int utf32 = next;

    if (utf32 != 0) {
       next = 0;
       return utf32;
    }
    utf32 = read_utf32();
    if (utf32 < 0x10000)
       return utf32;
    utf32 -= 0x10000;
    next = 0xdc00 | (utf32 & 0x03ff);
    return 0xd800 | (utf32 >> 10);
}

To go the other way you just reverse the algorithms:

void write_utf32(int utf32);   // forward declaration; defined below

void write_utf16(int utf16)
{
   static int prev = 0;

   if (prev != 0) {
      if (utf16 < 0xdc00 || utf16 >= 0xe000)
         throw invalid_utf16_code(2, prev, utf16);
      write_utf32((((prev & 0x3ff) << 10) | (utf16 & 0x3ff)) + 0x10000);
      prev = 0;
   } else if (utf16 >= 0xd800 && utf16 <= 0xdbff) {
      prev = utf16;
   } else if (utf16 >= 0xdc00 && utf16 <= 0xdfff)
      throw invalid_utf16_code(1, utf16, 0);
   else
      write_utf32(utf16);
}

void write_utf32(int utf32)
{
    if (utf32 < 0x80) {
       write_utf8(utf32);
    } else if (utf32 < 0x800) {
       write_utf8(0xc0 | (utf32 >> 6));
       write_utf8(0x80 | (utf32 & 0x3f));
    } else if (utf32 < 0x10000) {
       write_utf8(0xe0 | (utf32 >> 12));
       write_utf8(0x80 | ((utf32 >> 6) & 0x3f));
       write_utf8(0x80 | (utf32 & 0x3f));
    } else if (utf32 < 0x110000) {
       write_utf8(0xf0 | (utf32 >> 18));
       write_utf8(0x80 | ((utf32 >> 12) & 0x3f));
       write_utf8(0x80 | ((utf32 >> 6) & 0x3f));
       write_utf8(0x80 | (utf32 & 0x3f));
    } else
       throw invalid_utf32_code(utf32);
}

Be aware that these functions are somewhat simplified; they do very little error handling. For example, if you require a byte to be in the buffer and you have reached the end of it, you need to refill the buffer before you can retrieve the next byte. If you get an invalid code somewhere, the code generally just throws some form of exception, and there are cases which are errors but which are not explicitly trapped in the above code.

Ordinary common sense should be the guide in those cases.
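
To tie this back to your original GetNextLine(), a minimal sketch of a wide-string version might look like the following. It assumes the file is UTF-8 and sizeof(wchar_t) == 4 (on a 2-byte wchar_t platform you would add the UTF-16 split shown above), it replaces invalid sequences with U+FFFD instead of throwing, and it leaves out the overlong-sequence checks that read_utf32 above performs.

#include <cstdio>
#include <string>

// Sketch only: read one '\n'-terminated line of UTF-8 from fin as a wide string.
std::wstring GetNextLineW(FILE *fin)
{
    std::wstring rv;
    int c;

    while ((c = fgetc(fin)) != EOF && c != '\n')
    {
        if (c < 0x80) {                              // single byte (ASCII)
            if (c >= 0x20) rv += static_cast<wchar_t>(c);
            continue;
        }

        int need, code;
        if      ((c & 0xe0) == 0xc0) { need = 1; code = c & 0x1f; }
        else if ((c & 0xf0) == 0xe0) { need = 2; code = c & 0x0f; }
        else if ((c & 0xf8) == 0xf0) { need = 3; code = c & 0x07; }
        else { rv += static_cast<wchar_t>(0xfffd); continue; }  // invalid lead byte

        while (need-- > 0) {
            int d = fgetc(fin);
            if (d == EOF || (d & 0xc0) != 0x80) {    // missing continuation byte
                if (d != EOF) ungetc(d, fin);        // put back the stray byte
                code = 0xfffd;
                break;
            }
            code = (code << 6) | (d & 0x3f);
        }
        rv += static_cast<wchar_t>(code);
    }
    return rv;
}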

Alf
 

Author Comment

by:Phoenix_4uk
ID: 8070475
Thanks for the response.

My problem appears to be slightly different to what I originally thought. The problem I had is now fixed: the data I was trying to give it was not in the correct format. Once I got the file into the correct format, the data was read correctly, e.g. BzFZnfvARE2jTGdB62ZDdA==

The problem is with another part of the app, which reads data from Active Directory using the winldap.h function ldap_get_values. This is the call that returns my data in the wrong format.

I think the data returned from reading the file is in base64, but I'm not sure what format is returned by the ldap_get_values call. I need to get it into base64 so that I can do comparisons.

any ideas?
 
LVL 12

Expert Comment

by:Salte
ID: 8070752
Base64 is easy to recognize. It is a standard format primarily used for binary data rather than text; that is, it is a way to transmit binary data as if it were text, so that a medium which normally handles only text can also read, send, forward and otherwise transmit binary data.

E-mail attachments are typically sent in base64 format.

Base64 is very simple: as the name suggests, it uses a base 64 or radix 64 number system. To do this we need 64 digits:

Uppercase letters A-Z 26
Lowercase letters a-z 26
digits            0-9 10
Altogether this is 62, so two more characters are needed; the two characters chosen are + (plus sign) and / (solidus or slash).

In addition the character = is used for padding, how it is used will be explained later. The base64 encoding is defined in RFC 1521 for those who are interested.

The binary data is read 3 bytes at a time to make a group of 24 bits. These 24 bits are then chopped into 4 groups of 6 bits each, and each 6-bit group is one digit in the radix-64 number, translated as follows:

0-25  -> A-Z (A == 0 Z == 25)
26-51 -> a-z (a == 26 z == 51)
52-61 -> 0-9 (0 == 52 9 == 61)
62 -> +
63 -> /

so if the three bytes are 0xfe 0xa7 0x3f in that order then
this becomes the 24 bit quantity of 0x00fea73f

This can be split into four 6-bit quantities:
111111 101010 011100 111111
which are 0x3f 0x2a 0x1c 0x3f, or in decimal 63 42 28 63
or in our base64 system: /qc/
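
To make that arithmetic concrete, here is a minimal sketch (encode_group is my own name, not a library function) that encodes exactly one full 3-byte group the way the example above does:

static const char digits[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Sketch: encode one complete 3-byte group into 4 base64 characters.
void encode_group(unsigned char b0, unsigned char b1, unsigned char b2, char out[5])
{
    unsigned long v = (static_cast<unsigned long>(b0) << 16) |
                      (static_cast<unsigned long>(b1) << 8)  |
                      b2;                        // the 24-bit group
    out[0] = digits[(v >> 18) & 0x3f];
    out[1] = digits[(v >> 12) & 0x3f];
    out[2] = digits[(v >> 6)  & 0x3f];
    out[3] = digits[v & 0x3f];
    out[4] = '\0';
}

// encode_group(0xfe, 0xa7, 0x3f, buf) yields "/qc/", matching the example.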

The = character is used to cover the case that you have fewer than 3 bytes left to read from the binary input stream:

If you have 1 byte then it looks like this:
xxxx xxxx 0000 0000 0000 0000
turned into 6 bit groups you get:
xxxx xx xx 0000 0000 00 00 0000
The first 6-bit group is easy enough: it is whatever character those 6 bits represent. The second group has only 4 possible values; they are all selected from the 64 digits, but only the digits AQgw are possible, so you get a sequence of the form @[AQgw]== at the end.
@ is any of the base64 digits and [AQgw] is one of those characters.

If you have 2 bytes at the end then it looks like this:

xxxx xxxx yyyy yyyy 0000 0000
broken into 6 bit groups you get:

xxxx xx xx yyyy yyyy 00 00 0000

The first and second characters are both taken from the digit set as usual. The third character is as well, but it can only be one of the 16 possibilities AEIMQUYcgkosw048.
The fourth character is =.

Note that base64 does not include \n, \r, space or any such characters. If any of those are encountered in the stream you just skip them; in particular, line breaks can and will occur within a stream of base64 codes.

Of course, you must have some way of knowing that it is the end of the stream even if there is not a full 3-byte group at the end; you will know by the presence of an xx== or xxx= sequence.

Part of the problem is that if the binary data happens to be a multiple of 24 bits, there really is no way of knowing when the base64 encoding is over. This means there must be some protocol or format outside of the base64 encoding itself that marks the end of the data. In a MIME stream this is taken care of by special markers that mark the end of the base64 section, but in general there is no way to know where the end is.

In principle you can just keep feeding the base64 decoder characters, and whenever it has 4 valid input characters it will emit 3 bytes (possibly only 1 or 2 if an xx== or xxx= group was fed in). The decoder simply ignores any \n, space, tab, comma etc. you feed it. The MIME standard even specifies, rather unfortunately I would say, that a sequence of ==== is to be ignored; otherwise that would have been a fine way to detect the end of a base64 stream whose data happened to be a multiple of 24 bits. It cannot be such a mark, since the standard specifically states that such a sequence should be ignored. Why it does that I don't know, it sounds silly to me, but that is what they decided.

Hope this is of help for you to understand the base64 encoding.

By the way, if the data is in base64 it should be easy for you to find the end of the data, since that is simply the end of the text you got from the function. I believe those functions also return the length of the data they returned.

Note that base64 only counts A-Za-z0-9+/ and =. All other characters are ignored when reading base64. It therefore follows that you can freely insert other characters into a base64 stream without altering the contents of the stream. For example, line breaks, spaces etc. can be inserted and the binary data after decoding is unaffected.

Alf
 

Author Comment

by:Phoenix_4uk
ID: 8104189
Hi,

Still not fixed my problem. I am using a base64 encode function to try to format the value I get back from ldap_get_values_len in the winldap.h API. This is the code I am using:

if ((vals=(char **)ldap_get_values_len(ld,e,attrs[i]))!=NULL)
{
  //-- Assume that this is a single value (discard innermost loop)
  printf("ObjectGuid: %s",vals[0]);
  //-- Calculate the size of the buffer we need to hold the
  //-- converted buffer (The leading +1 is purely to force a trailing
  //-- zero character)
  long base64_size = 1+((strlen(vals[0])+2)/3)*4;
  char *base64_buffer = new char[base64_size];

  //-- Clear the buffer ready...
  memset(base64_buffer,0,base64_size);
  Base64Encode(vals[0],base64_buffer);

  printf("Base64 ObjectGuid: %s\n",base64_buffer);
  ad_objectguid = base64_buffer;

  delete [] base64_buffer;
}

The results I get from this are as follows:

ObjectGUID: (a single weird character that will not display on here)

Base64 ObjectGUID: EA==

The result I want is the result that matches what the LDIFDE command gives me for the same object:

ObjectGUID: k/luB9Rp1k+enMp9/MUH5Q==

anyone know why my code isn't working?

The convert function is shown below:

static char* _cpBase64Encoding =
 "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=";


void Base64Encode( char* cpInput, char* cpOutput )
{
 int nIdx[ 4 ];  // will contain the indices of coded letters in
                 // _cpBase64Encoding string; valid values [0..64]; the value
                 // of 64 has special meaning - the padding symbol

 // process the data (3 bytes of input provide 4 bytes of output)
 while ( '\0' != *cpInput )
 {
   nIdx[0] = ((*cpInput) & 0xFC)>>2;
   nIdx[1] = ((*cpInput) & 0x03)<<4;
   cpInput++;
   if ( '\0' != *cpInput )
   {
     nIdx[1] |= ((*cpInput) & 0xF0)>>4;
     nIdx[2]  = ((*cpInput) & 0x0F)<<2;
     cpInput++;
     if ( '\0' != (*cpInput) )
     {
       nIdx[2] |= ((*cpInput) & 0xC0) >> 6;
       nIdx[3]  = (*cpInput) & 0x3F;
       cpInput++;
     }
     else
       nIdx[3] = 64;
   }
   else
   { // refer to padding symbol '='
     nIdx[2] = 64;
     nIdx[3] = 64;
   }

   *(cpOutput+0) = *(_cpBase64Encoding + nIdx[0]);
   *(cpOutput+1) = *(_cpBase64Encoding + nIdx[1]);
   *(cpOutput+2) = *(_cpBase64Encoding + nIdx[2]);
   *(cpOutput+3) = *(_cpBase64Encoding + nIdx[3]);
   cpOutput += 4;
 }
 
 // set this to terminate output string
 *cpOutput = '\0';

 return;
}

 
LVL 12

Accepted Solution

by:
Salte earned 300 total points
ID: 8105393
As far as I can see, you read the data and then convert it to base64... I thought you said the data was in base64 and you needed to decode it.

Decoding is the opposite operation: read 4 characters at a time and produce 3 bytes of output.

int Base64Decode(unsigned char * out, char * in, char * end)
{
   int st = 0;
   int a = 0;
   int len = 0;

   while (in < end) {
      char c = *in++;
      char * p = strchr(_cpBase64Encoding,c);
      if (p == 0)
         continue;
      int v = p - _cpBase64Encoding;
      if (v < 64) {
         a = (a << 6) | v;
         if (++st == 4) { // got 3 bytes.
            out[len++] = static_cast<unsigned char>(a >> 16);
            out[len++] = static_cast<unsigned char>(a >> 8);
            out[len++] = static_cast<unsigned char>(a);
            st = 0;
            a = 0;
         }
      }
   }
   switch (st) {
   case 1: // a single leftover digit (6 bits) is not valid base64 input.
      break;
   case 2: // 12 bits of input ("xx=="): one byte of data, padding bits dropped.
      out[len++] = static_cast<unsigned char>(a >> 4);
      break;
   case 3: // 18 bits of input ("xxx="): two bytes of data, padding bits dropped.
      out[len++] = static_cast<unsigned char>(a >> 10);
      out[len++] = static_cast<unsigned char>(a >> 2);
      break;
   }
   /* st == 0 means 0 bits of data */
   return len;
}
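
For example (a hypothetical snippet, not tested against your data), decoding the LDIFDE-style string you posted should give you the 16 raw GUID bytes to compare against whatever the directory handed back:

char text[] = "k/luB9Rp1k+enMp9/MUH5Q==";
unsigned char guid[18];

int n = Base64Decode(guid, text, text + sizeof(text) - 1);
// n should come out as 16 for an objectGUID; memcmp(guid, raw_bytes, 16)
// is then the comparison.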


Alf
 
LVL 30

Expert Comment

by:Axter
ID: 9374599
Phoenix_4uk,
No comment has been added lately (190 days), so it's time to clean up this TA.
I will leave a recommendation in the Cleanup topic area for this question:

RECOMMENDATION: Award points to Salte http:#8105393

Please leave any comments here within 7 days.

-- Please DO NOT accept this comment as an answer ! --

Thanks,

Axter
EE Cleanup Volunteer
