string conversion c++

I am writing small routines to convert a RAD Studio UnicodeString to std::string
and vice versa. I would appreciate thoughts on how robust and elegant
these functions are. Is there anything easy that I am overlooking?


inline std::string UCString2STLString(const UnicodeString& ustr){
	std::string str;
	for (int i=0; i < ustr.Length(); i++)
		str += ustr[i+1];
	return str;
};

inline UnicodeString STLString2UCString(const std::string& str){
	UnicodeString ustr;
	for (int i=0; i < str.length(); i++)
		ustr += str[i];
	return ustr;
};

ol muserTechnology GeneralistAsked:

TomasPCommented:
Maybe I am missing something, but shouldn't this line

str += ustr[i+1];

be

str += lowbyte(ustr[i]);

because you are indexing into an array of shorts when reading the unicode string?

evilrixSenior Software Engineer (Avast)Commented:
Is your intention to convert from wide (wchar_t) to narrow (char) or just to get a C++ string class populated with the contents of UnicodeString? If you just want to populate a C++ string you can do that by using the wide version of the string.

std::wstring ws = myUnicodeString.c_str();

If you actually want to convert from wide to narrow you'll need to use something like wcstombs to perform a conversion.
ol muserTechnology GeneralistAuthor Commented:
I simply need to convert UnicodeString to std::string and vice versa. @TomasP, there does not seem to be a function called lowbyte available in RAD Studio C++. @evilrix, converting to the wide version may not work for me, as I need to pass the strings to a library that uses std::string all over. Any comments about returning a local variable? Am I sending back a copy of the variable? Since when is that safe? At least in C, I would be doomed.
ol muserTechnology GeneralistAuthor Commented:
The implementation I have for the conversion is based on the fact that nothing else seems to work from the available list of functions from both families (correct me if I am wrong).
http://docwiki.embarcadero.com/VCL/XE/en/System.UnicodeString_Functions.
http://www.cplusplus.com/reference/string/string/string/
evilrixSenior Software Engineer (Avast)Commented:
Did you not see my comment about using wcstombs to convert from wide to narrow? Was that not helpful? Or maybe my comment just wasn't clear and so you've misunderstood it? To be clearer: just convert from wide to narrow, store the result in std::string and return it by value.
sarabandeCommented:
to add to above comments:

the UnicodeString::c_str would return a const wchar_t*, which is exactly the input for the wcstombs function evilrix has mentioned. the output of wcstombs is a char*, which fits perfectly as input to a std::string.

   
size_t sz = ustr.Length() + 32;   // some extra bytes for multibyte characters
char * buf = new char[sz];
wcstombs(buf, ustr.c_str(), sz);
std::string str(buf);
delete []buf;

Sara
ol muserTechnology GeneralistAuthor Commented:
@evilrix, I was waiting to try it out. And I did (code below). My implementation is close to what @sarabande has posted. Of the two, I am trying to pick the one that is more efficient and elegant. Without using wcstombs, will I be making more function calls? No comments on returning a local variable?
inline std::string UCString2STLString(const UnicodeString& ustr){
	std::string str;
	for (int i=0; i < ustr.Length(); i++)
		str += ustr[i+1];
	return str;
};

inline std::string U2Sv2 (const UnicodeString& src){
	char* temp;
	temp = new char[src.Length()];
	std::wcstombs(temp, src.c_str(), src.Length());
	std::string dest(temp);
	delete[] temp;
	return dest;
}

sarabandeCommented:
the first shouldn't compile cause the

    str += ustr[i+1];

tries to 'add' a wchar_t to a std::string where there is no operator+= defined.

in the second you should allocate at least src.Length()+1 characters, or the temp wouldn't get a terminating zero char, which can cause problems when assigning it to the std::string dest (you might see garbage characters at the end).

note, i allocated 32 additional characters cause conversion could make more than one char for one wide char if special characters were used. generally you should be generous with allocation cause wcstombs will stop at terminator of wide string.
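
A minimal sketch of the sizing points above, assuming a hosted C++ compiler. WideToNarrow is a hypothetical helper name, not part of RAD Studio or the standard library. It sizes the buffer for the worst case of MB_CUR_MAX bytes per wide character plus one terminating zero, passes the full buffer capacity to wcstombs, and checks for the (size_t)-1 failure return:

```cpp
#include <cstdlib>
#include <cwchar>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: converts a zero-terminated wide string to the
// narrow multibyte encoding of the current C locale (LC_CTYPE).
inline std::string WideToNarrow(const wchar_t* src) {
    const std::size_t len = std::wcslen(src);
    // Worst case: every wide character becomes MB_CUR_MAX bytes,
    // plus one byte for the terminating zero.
    std::vector<char> buf(len * MB_CUR_MAX + 1, '\0');
    // The third argument is the capacity of the destination in bytes.
    const std::size_t written = std::wcstombs(&buf[0], src, buf.size());
    if (written == static_cast<std::size_t>(-1))
        throw std::runtime_error("character not representable in current locale");
    // written excludes the terminating zero.
    return std::string(&buf[0], written);
}
```

With a UnicodeString this would presumably be called as WideToNarrow(ustr.c_str()), since c_str() is described above as returning a const wchar_t*. Which characters convert successfully depends on the active locale.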

Sara
evilrixSenior Software Engineer (Avast)Commented:
// Without using wcstombs I will be making more function calls?
Your original code isn't just inefficient, it is also wrong. It's not correctly converting from wide to narrow, it's just doing a byte for byte copy but that's not enough. To represent a wide character encoding (for example UTF16 encoded Unicode) as narrow it needs to be re-encoded to a narrow character encoding (for example UTF8 or ANSI). The wcstombs converts to ANSI using your current locale (or the value of the LC_CTYPE env variable).

BTW, your code is also a potential memory leak (if it throws between the new and the delete you'll leak). Use a vector to make this neater.

As for returning a local, perfectly safe as long as you return it by value and not pointer or reference since a copy is made on the return stack and passed to the caller.

inline std::string U2Sv2 (const UnicodeString& src){
        std::vector<char> temp(src.Length() * 2); // double length for conversion
        std::wcstombs(&temp[0], src.c_str(), src.Length());
        return &temp[0];
}

ol muserTechnology GeneralistAuthor Commented:
@sarabande, it compiles and works fine...
evilrixSenior Software Engineer (Avast)Commented:
>> the first shouldn't compile cause the
Why not?

This...

 str += ustr[i+1];

is just the same as this...

wchar_t wc = L'a';
char c = wc;

You'll probably get a warning but it will compile fine.
evilrixSenior Software Engineer (Avast)Commented:
In case the reason why they are semantically the same isn't clear, it's because the following operator is being invoked on string.

string& string::operator+= ( char c );
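
To make the narrowing concrete, here is a small sketch that uses std::wstring as a stand-in for UnicodeString (an assumption, so that it compiles outside RAD Studio). NaiveNarrow is a hypothetical name that mirrors the loop in the question:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper mirroring the loop in the question: each wide
// character is narrowed to char before being appended.
inline std::string NaiveNarrow(const std::wstring& ws) {
    std::string out;
    for (std::size_t i = 0; i < ws.size(); ++i)
        out += static_cast<char>(ws[i]);  // explicit cast, same effect as the implicit one
    return out;
}
```

Plain ASCII text such as L"abc" comes through unchanged, but a character like the euro sign (L"\u20AC") collapses to a single byte holding a truncated code, which is not a locale-aware conversion.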
sarabandeCommented:
do you mean the first function UCString2STLString would compile and work?

i don't know much of c++/cli but i have my doubts cause the str += ustr.c_str[i+1]; would drop the first char of ustr in my opinion cause a wchar_t * is 0-based in any case and your copy would start with ustr[1].

you also should see that unicode has 2**16 codes where only code 0 to 127 is equivalent to ascii and only 0 to 255 is equivalent to ansi. so even if it works now for simple strings it would fail if any real unicode character would need to get converted.

also the second function cannot work if you don't allocate at least ustr.Length()+1 characters. the wcstombs would fill the temp only for  ustr.Length() characters. so it is not setting a terminating zero char. so at end of the converted char might be any accidental characters until finally also by accident a zero would occur. you may be lucky if that is already the case with your tests. but on the long run it would fail.

Sara

evilrixSenior Software Engineer (Avast)Commented:
>> do you mean the first function UCString2STLString would compile and work?
No, I just mean it will compile as there is nothing syntactically wrong with it. I've already explained above that it won't actually work and why.

ol muserTechnology GeneralistAuthor Commented:
thanks both for the continued contribution.

they both (code attached) seem to work and produce the output below:

[STL from U2Sv1]unicode string
equals
string from U2Sv2(wcstombs): from unicode
equals

@evilrix, good catch eliminating the new/delete, but I don't get the "// double length for conversion" part..

any concluding thoughts?



inline std::string UCString2STLString(const UnicodeString& ustr){
	std::string str;
	for (int i=0; i < ustr.Length(); i++)
		str += ustr[i+1];
	return str;
};

inline std::string U2Sv2 (const UnicodeString& src){
	char* temp;
	temp = new char[src.Length()];
	std::wcstombs(temp, src.c_str(), src.Length());
	std::string dest(temp);
	delete[] temp;
	return dest;
}



int _tmain(int argc, _TCHAR* argv[])
{
	UnicodeString str1;
	std::string str2="";

	str1 = "unicode string";
	str2 = UCString2STLString(str1);
	std::string str3 = "unicode string";
	cout<<"[STL from U2Sv1]"<<str2;
	if (str2==str3)
		cout<<"\nequals";
	else
		cout<<"\nnot equals";


	UnicodeString src = "from unicode";
	std::string dst = U2Sv2(src), cmp("from unicode");
	cout<<"\nstring from U2Sv2(wcstombs): " <<dst;
	if (dst==cmp)
		cout<<"\nequals";
	else
		cout<<"\nnot equals";



	return 0;
}
//---------------------------------------------------------------------------

evilrixSenior Software Engineer (Avast)Commented:
>> but I don't get the "// double length for conversion" part..
Well, since you appear to be using Windows a wchar_t is going to be 2 bytes and a char 1 byte. Since a wchar_t to char conversion is a wide character to multibyte character conversion it stands to reason that each char in the wide could result in more than one char in the narrow. As a good rule of thumb (to be sure you have enough space) I generally allocate my buffer to be the same ratio bigger as a char is to a wchar_t (in this case x2). That way you should have more than enough buffer and what you don't use is released back to the OS when the vector goes out of scope. This is not a hard and fast rule, just me coding defensively :)

>> any concluding thoughts?
Only, do you now understand why you need to use wcstombs to do this rather than just copying the wide to narrow bytes? It's probably a good idea to ensure you understand this so that in future you'll know what to do and why.
sarabandeCommented:
sorry, evilrix, my last q. regarding the compile was directed to the author, not to you.

olmuser, i checked the docs and you are right regarding the ustr[i+1]:

System.UnicodeString.operator []
operator [] returns the character in the string at the character index value idx. The operator [] assumes a base index of 1.


you are not right that it works generally, cause unicode chars cannot simply be assigned to char unless they are ascii only.

for the second function it only works by accident that is when the memory allocated for the temp char array has zeros (or non-printables) directly behind the temp array.

the 'double length for conversion' was - same as me adding 32 bytes extra - the attempt to reserve storage for the case where a unicode char was translated into a multibyte sequence. but you would need to pass the double-length also to wcstombs cause wcstombs expects the size of the multi-byte buffer as 3rd argument.

last point, passing a local variable as return is valid when the return type neither is a pointer nor a reference but by value. then the compiler creates a temporary for the return.
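
A small sketch of that last point, using hypothetical function names. Returning the local by value is fine; the unsafe variant is left as a comment so that the block still compiles cleanly:

```cpp
#include <string>

// Safe: the local string is returned by value, so the caller receives
// its own object (copied, moved, or constructed in place by the compiler).
inline std::string MakeLabel() {
    std::string local = "from unicode";
    return local;
}

// Unsafe, shown only as a comment: returning a reference or pointer to
// a local leaves the caller with a dangling reference.
// inline const std::string& BadLabel() {
//     std::string local = "from unicode";
//     return local;  // local is destroyed when the function returns
// }
```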

Sara


evilrixSenior Software Engineer (Avast)Commented:
>> sorry, evilrix, my last q. regarding the compile was directed to the author, not to you.
Heh. No worries :)

>> cause unicode-chars cannot simply assigned to char beside they are ascii only
AFAIK, that depends on the encoding format (remember, Unicode is a character set not a character encoding -- Unicode can be encoded in many different ways but the most common are UTF8, 16 and 32) -- this will only work with UTF8, which was designed specifically to be backwards compatible with ASCII,  and that's already a narrow MBCS.
sarabandeCommented:
the wcstombs doesn't make utf8 but it only will convert layer 0 (codes 0 to 255) of utf16 (== microsoft unicode) to ansi charset. as far as i know there is no conversion of chars beyond code 255.

if you want utf8 conversion you would need to use WideCharToMultiByte function where you can specify UTF8 as 'codepage' argument.
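
For illustration only, here is a hand-rolled, portable sketch of what a UTF-16 to UTF-8 conversion does. It is not the WideCharToMultiByte API mentioned above, which would be the usual choice on Windows. It assumes a C++11 compiler (for char16_t and std::u16string), and Utf16ToUtf8 is a hypothetical name. Unpaired surrogates are replaced with U+FFFD:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: encodes UTF-16 code units as UTF-8 bytes.
inline std::string Utf16ToUtf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {  // high surrogate
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                // Combine the surrogate pair into one code point.
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD;  // unpaired high surrogate
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;      // stray low surrogate
        }
        if (cp < 0x80) {                     // 1 byte, identical to ASCII
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {             // 2 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {           // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                             // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The resulting std::string holds UTF-8 bytes regardless of the current locale, so the receiving library has to expect UTF-8 for this to be useful.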

Sara
evilrixSenior Software Engineer (Avast)Commented:
>> the wcstombs doesn't make utf
I know (I wasn't saying it does), I pointed that out above.

"The wcstombs converts to ANSI using your current locale (or the value of the LC_CTYPE env variable)."

I was just alluding to the fact that even if the UTF16 encoded wide is just representing ASCII you still can't just chop off the higher-order byte since there is no direct correlation with ASCII (unlike UTF8).

I think we're singing from the same hymn sheet here ;)

sarabandeCommented:
:)

Sara