Solved

UTF-8, string and wstring

Posted on 2014-12-10
Last Modified: 2014-12-12
Hi Experts,

I'm starting a new library and I'd like to use UTF-8.  It seems to be good enough for the majority of the world, so that's what I'll go with.  One of the libraries that my library uses stores UTF-8 in std::string.  Is there a function in the standard library that converts between string and wstring while only ever storing UTF-8?

I'm thinking I'll set up this conversion right when I have to use the string with this other library.  When reading from the other library, I'll convert to wstring.  When writing to it, I'll give it wstring converted to string.

Also, are there any caveats here?

Thanks,
Mike
Question by:thready
 
LVL 34

Assisted Solution

by:sarabande
sarabande earned 500 total points
ID: 40493336
the standard has no direct conversion (as far as I know), but windows can do it by converting to utf-16 first and then to the target code page.

you may use the following function:

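#include <windows.h>
#include <cstring>

// Converts a zero-terminated UTF-8 string to the system ANSI code page,
// going through UTF-16. Returns false if the input is not valid UTF-8
// or if strOut (sizOut bytes, including the terminator) is too small.
bool ConvertUtf8ToAnsi(const char * strIn, char strOut[], int sizOut)
{
    bool bok = false;
    int	 len = (int)strlen(strIn);
    if (sizOut < 1)
        return false;
    if (len == 0)
    {
        // the API treats a zero input length as an error, so handle it here
        strOut[0] = '\0';
        return true;
    }
    // a UTF-8 string of len bytes never needs more than len UTF-16 units
    wchar_t * pwsz = new wchar_t[len+1];

    int newlen = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, strIn, len, pwsz, len+1);
    if (newlen > 0)
    {
        // an explicit input length means no terminator is written,
        // so reserve one byte of strOut and terminate it ourselves
        newlen = WideCharToMultiByte(CP_ACP, 0, pwsz, newlen, strOut, sizOut - 1, "?", NULL);
        if (newlen > 0)
        {
            strOut[newlen] = '\0';
            bok = true;
        }
    }
    delete [] pwsz;

    if (bok == false)
    {
        //DWORD dwError = GetLastError();
        //std::cout << dwError << ", Conversion Utf8 to Ansi failed" << std::endl;
    }

    return bok;
}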



Sara
 
LVL 34

Accepted Solution

by:
sarabande earned 500 total points
ID: 40493342
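sorry, I see that you want the opposite and convert from ansi to utf8. exchange the CP_UTF8 and CP_ACP in the conversion calls, pass NULL instead of "?" to WideCharToMultiByte (with CP_UTF8 the default-character arguments must be NULL), and pass a buffer that is at least three times as big as the input string, since a single ansi byte such as the euro sign can become three utf-8 bytes.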

Sara
 
LVL 1

Author Comment

by:thready
ID: 40495422
Hi Sara, maybe what I'm saying doesn't make sense.  I have a UTF-8 encoded string, given to me by the library I am using, and obviously I cannot access its individual characters without looking at the byte ranges myself.  I don't want to change the encoding; I just want to store this UTF-8 string in a wstring instead, so that I can access individual characters.  First, does this make sense?  And why would it be incorrect to create my wstring like so?

string s =  [some UTF-8 encoded string];
wstring w(s.begin(), s.end());

Thanks again!
Mike
 
LVL 34

Assisted Solution

by:sarabande
sarabande earned 500 total points
ID: 40495666
utf-8 and utf-16 are quite different. utf-16 uses two bytes for most characters, whether it is an ascii character or an Arabic or Chinese letter, and four bytes (a surrogate pair) for characters outside the basic multilingual plane. utf-8 is a multi-byte encoding which uses 1 byte for ascii and 2 to 4 bytes for any other character. so apart from the ascii characters (codes 0 ... 127 decimal) the two encodings share no codes, and any translation from one to the other needs a real conversion, which is not trivial. that is also why copying the bytes of a utf-8 string one by one into a wstring does not give you characters: each byte of a multi-byte sequence ends up in its own wchar_t.
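
to make the point about the byte-wise copy concrete, here is a small portable sketch. the helper names CountCodePoints and NaiveWiden are made up for this illustration, and the counting helper assumes the input is well-formed utf-8:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts code points by skipping UTF-8 continuation bytes (bit pattern 10xxxxxx).
// Assumption: the input is well-formed UTF-8; no validation is done here.
std::size_t CountCodePoints(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char c : utf8)
    {
        if ((c & 0xC0) != 0x80)   // lead byte or plain ASCII
            ++count;
    }
    return count;
}

// The construction from the question: every byte is widened on its own,
// so the result has as many elements as the input has bytes.
std::wstring NaiveWiden(const std::string& s)
{
    return std::wstring(s.begin(), s.end());
}
```

for the utf-8 string "h\xC3\xA9" "llo" (the word "héllo", 6 bytes) CountCodePoints gives 5, while NaiveWiden returns a wstring of 6 elements in which the two bytes of "é" sit in separate wchar_t values, neither of which is the code U+00E9.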

nevertheless there are a lot of libraries available which can do that. one of the oldest is the winapi, where you can call MultiByteToWideChar(CP_UTF8, ...) to do the translation.
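
if a dependency on the winapi is not wanted, the conversion can also be written by hand. the following is only a sketch under some assumptions: it decodes into std::u32string (one char32_t per code point) rather than wstring, because wchar_t is 16 bits on windows and usually 32 bits elsewhere, and the names DecodeUtf8 and EncodeUtf8 are invented for this example:

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Decodes UTF-8 into one char32_t per code point.
// Throws std::invalid_argument on malformed input.
std::u32string DecodeUtf8(const std::string& in)
{
    // smallest code point that legitimately needs 0, 1, 2 or 3 continuation bytes
    static const char32_t minimum[] = { 0, 0x80, 0x800, 0x10000 };
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size())
    {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;   // number of continuation bytes that follow
        if      (lead < 0x80)           { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k)
        {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        // reject overlong forms, surrogates and values beyond U+10FFFF
        if (cp < minimum[extra] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point in UTF-8 input");
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}

// Encodes code points back to UTF-8, e.g. before handing a string to a
// library that expects UTF-8 in std::string.
std::string EncodeUtf8(const std::u32string& in)
{
    std::string out;
    for (char32_t cp : in)
    {
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("code point cannot be encoded");
        if (cp < 0x80)
        {
            out += static_cast<char>(cp);
        }
        else if (cp < 0x800)
        {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else if (cp < 0x10000)
        {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else
        {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

with this, the string would be decoded right after reading it from the other library, indexed per code point in between, and encoded again just before writing it back. note that a code point is still not always what a user sees as one character (combining marks, for example), so this only removes the byte-range bookkeeping.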

Sara
 
LVL 1

Author Closing Comment

by:thready
ID: 40495672
Thank you Sara