Solved

UTF-8, string and wstring

Posted on 2014-12-10
Last Modified: 2014-12-12
Hi Experts,

I'm starting a new library and I'd like to use UTF-8.  It seems to be good enough for the majority of the world, so that's what I'll go with.  One of the libraries that my library uses stores UTF-8 in std::string.  Is there a function in the standard library that converts between string and wstring while only ever storing UTF-8?

I'm thinking I'll set up this conversion right when I have to use the string with this other library.  When reading from the other library, I'll convert to wstring.  When writing to it, I'll give it wstring converted to string.

Also, are there any caveats here?

Thanks,
Mike
Question by:thready
LVL 34

Assisted Solution

by:sarabande
sarabande earned 500 total points
ID: 40493336
the standard has no direct conversion (as far as I know), but Windows does: convert to UTF-16 with MultiByteToWideChar and from there to the target encoding with WideCharToMultiByte.

you may use the following function:

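#include <windows.h>
#include <cstring>

// Converts a null-terminated UTF-8 string to the system ANSI code page (CP_ACP).
// strOut receives a null-terminated result; sizOut is its size in bytes.
// Characters the ANSI code page cannot represent are replaced by '?'.
bool ConvertUtf8ToAnsi(const char * strIn, char strOut[], int sizOut)
{
    if (strIn == NULL || strOut == NULL || sizOut <= 0)
        return false;

    bool bok = false;
    // include the terminating null so that it is converted and copied as well
    int len = (int)strlen(strIn) + 1;
    // UTF-16 never needs more code units than UTF-8 needs bytes
    wchar_t * pwsz = new wchar_t[len];

    int newlen = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, strIn, len, pwsz, len);
    if (newlen > 0)
    {
        // returns 0 if strOut is too small for the converted text plus terminator
        newlen = WideCharToMultiByte(CP_ACP, 0, pwsz, newlen, strOut, sizOut, "?", NULL);
        if (newlen > 0)
        {
            bok = true;
        }
    }
    delete [] pwsz;

    if (bok == false)
    {
        //DWORD dwError = GetLastError();
        //std::cout << dwError << ", Conversion Utf8 to Ansi failed" << std::endl;
    }

    return bok;
}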


Sara
 
LVL 34

Accepted Solution

by:sarabande
sarabande earned 500 total points
ID: 40493342
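sorry, I see that you want the opposite and convert from ANSI to UTF-8. exchange CP_UTF8 and CP_ACP in the conversion calls, pass NULL instead of "?" as the default character (WideCharToMultiByte requires that for CP_UTF8), and pass an output buffer of at least three times the input length plus one byte for the terminator, since a single ANSI byte can become up to three bytes in UTF-8.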
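
To make the buffer-size question concrete, here is a minimal, portable sketch of how one code point is encoded in UTF-8. The function name append_utf8 is hypothetical and not part of the Windows API or the standard library; it only illustrates how many bytes a character needs. For example, the euro sign, which Windows-1252 stores as the single byte 0x80, is U+20AC and takes three bytes in UTF-8.

```cpp
#include <string>

// Hypothetical helper (illustration only): appends the UTF-8 encoding of
// one code point to `out`. Returns the number of bytes written (1 to 4),
// or 0 if `cp` is a surrogate or lies above U+10FFFF.
int append_utf8(std::string& out, char32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   // surrogates are not characters
    if (cp > 0x10FFFF) return 0;                  // beyond the Unicode range

    if (cp < 0x80) {                              // ASCII: 0xxxxxxx
        out += static_cast<char>(cp);
        return 1;
    }
    if (cp < 0x800) {                             // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                           // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
        return 3;
    }
    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
    return 4;
}
```

Calling append_utf8(s, 0x20AC) on an empty string s appends the bytes E2 82 AC and returns 3.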

Sara
 
LVL 1

Author Comment

by:thready
ID: 40495422
Hi Sara, maybe what I'm saying doesn't make sense.  I have a UTF-8 encoded string, given to me by this library I am using, and obviously I cannot access its individual characters without decoding the byte ranges myself.  I don't want to change the encoding; I just want to store this UTF-8 string in a wstring instead, so that I can access individual characters.  Does this make sense?  Why would it be incorrect to create my wstring like so?

string s =  [some UTF-8 encoded string];
wstring w(s.begin(), s.end());

Thanks again!
Mike
 
LVL 34

Assisted Solution

by:sarabande
sarabande earned 500 total points
ID: 40495666
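UTF-8 and UTF-16 are quite different. UTF-16 uses two bytes for most characters, regardless of whether it is an ASCII character or an Arabic or Chinese letter, and four bytes (a surrogate pair) for characters outside the Basic Multilingual Plane. UTF-8 is a multi-byte encoding which uses 1 byte for ASCII and 2 to 4 bytes for any other character. so apart from the ASCII characters (codes 0 ... 127 decimal) the two encodings have no code values in common, and any translation from one to the other needs a real conversion, which is not trivial. copying the bytes one by one into a wstring, as in your example, turns every byte of a multi-byte sequence into a separate wchar_t instead of producing one element per character.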
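
To make the size differences concrete, the following sketch compares the byte-wise copy from the question with the number of characters actually present. The helper names (widen_bytes, count_code_points, count_utf16_units) and the sample string are made up for illustration, and the counting functions assume well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// The byte-wise copy from the question: one wchar_t per byte, no decoding.
std::wstring widen_bytes(const std::string& s)
{
    return std::wstring(s.begin(), s.end());
}

// Number of code points in a well-formed UTF-8 string: every byte that is
// not a continuation byte (10xxxxxx) starts a new code point.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t n = 0;
    for (unsigned char b : utf8)
        if ((b & 0xC0) != 0x80) ++n;
    return n;
}

// Number of UTF-16 code units the same text would need: a 4-byte UTF-8
// sequence (lead byte 11110xxx) becomes a surrogate pair, everything
// else becomes a single unit.
std::size_t count_utf16_units(const std::string& utf8)
{
    std::size_t n = 0;
    for (unsigned char b : utf8) {
        if ((b & 0xC0) == 0x80) continue;          // skip continuation bytes
        n += ((b & 0xF8) == 0xF0) ? 2 : 1;
    }
    return n;
}

// Illustrative sample: "z", U+00E9, U+20AC and U+1F600, taking
// 1 + 2 + 3 + 4 bytes in UTF-8.
const std::string sample = "z\xC3\xA9\xE2\x82\xAC\xF0\x9F\x98\x80";
```

For this sample, widen_bytes(sample).size() is 10 (one element per byte), while count_code_points(sample) is 4 and count_utf16_units(sample) is 5.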

nevertheless there are a lot of libraries available which can do that. one of the oldest is the Windows API, where you can call MultiByteToWideChar(CP_UTF8, ...) to do the translation.
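
If a Windows-only call is not an option, the decoding can also be written by hand. The following is a minimal sketch, not taken from any particular library: decode_utf8 is a hypothetical name, and it targets std::u32string rather than wstring because wchar_t is 16 bits on Windows (so a wstring there still contains surrogate pairs) but 32 bits on typical Linux systems. With char32_t each element is exactly one code point, which is what indexing individual characters needs.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (illustration only): decodes UTF-8 into one char32_t
// per code point. Throws std::invalid_argument on malformed input
// (bad lead or continuation bytes, truncated or overlong sequences,
// surrogates, values above U+10FFFF).
std::u32string decode_utf8(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;      // number of continuation bytes expected
        char32_t lowest;        // smallest value allowed for this length
        if (lead < 0x80)                { cp = lead;        extra = 0; lowest = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; lowest = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; lowest = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; lowest = 0x10000; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (in.size() - i - 1 < extra)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);   // append the 6 payload bits
        }

        if (cp < lowest || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("overlong or out-of-range UTF-8 sequence");

        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

The result can be indexed per character, and converting back to UTF-8 only when handing the text to the other library keeps the storage format unchanged on that side.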

Sara
 
LVL 1

Author Closing Comment

by:thready
ID: 40495672
Thank you Sara