ol muser

string conversion c++

I am writing small routines to convert RAD studio UnicodeString to std::string
and vice versa. I would appreciate thoughts on how robust and elegant
these functions are.  Is there anything easy that I am overlooking?


inline std::string UCString2STLString(const UnicodeString& ustr){
	std::string str;
	for (int i=0; i < ustr.Length(); i++)
		str += ustr[i+1];
	return str;
};

inline UnicodeString STLString2UCString(const std::string& str){
	UnicodeString ustr;
	for (int i=0; i < str.length(); i++)
		ustr += str[i];
	return ustr;
};


TomasP

Maybe I am missing something, but shouldn't this line

str += ustr[i+1];

be

str += lowbyte(ustr[i]);

because you are indexing into an array of shorts when reading the Unicode string?


evilrix
Is your intention to convert from wide (wchar_t) to narrow (char) or just to get a C++ string class populated with the contents of UnicodeString? If you just want to populate a C++ string you can do that by using the wide version of the string.

std::wstring ws = myUnicodeString.c_str();

If you actually want to convert from wide to narrow you'll need to use something like wcstombs to perform a conversion.
ol muser

ASKER

I simply need to convert UnicodeString to std::string and vice versa. @TomasP, there does not seem to be a function called lowbyte available in RAD Studio C++. @evilrix, converting to the wide version may not work for me, as I need to pass the strings to a library that uses std::string all over. Any comments about returning a local variable? Am I returning a copy of the variable? Since when is that safe? At least in C, I would be doomed.
The implementation I have for the conversion is based on the fact that nothing else seems to work from the available list of functions from both families (correct me if I am wrong).
http://docwiki.embarcadero.com/VCL/XE/en/System.UnicodeString_Functions.
http://www.cplusplus.com/reference/string/string/string/
Did you not see my comment about using wcstombs to convert from wide to narrow? Was that not helpful? Or maybe my comment just wasn't clear and so you've misunderstood it? To be clearer: just convert from wide to narrow, store the result in std::string and return it by value.
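
A minimal sketch of the approach described above (convert with wcstombs, store the result in a std::string, return it by value). It is not the code from the hidden answers. The helper name narrow is hypothetical, and std::wstring stands in for UnicodeString so the sketch compiles without RAD Studio; with a UnicodeString you would presumably pass src.c_str() and src.Length() instead. The buffer is sized with MB_CUR_MAX, the largest number of bytes one character can take in the current locale.

```cpp
#include <cstddef>
#include <cstdlib>   // std::wcstombs, MB_CUR_MAX
#include <string>
#include <vector>

// Hypothetical helper: std::wstring is a stand-in for UnicodeString.
std::string narrow(const std::wstring& src) {
    // Worst case: every wide character becomes MB_CUR_MAX bytes, plus one
    // byte for the terminating zero.
    std::vector<char> buf(src.size() * MB_CUR_MAX + 1);

    // The third argument is the size of the destination buffer in bytes.
    std::size_t n = std::wcstombs(buf.data(), src.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1)) {
        // A character could not be represented in the current locale.
        return std::string();
    }
    // n excludes the terminator; the string is returned by value.
    return std::string(buf.data(), n);
}
```

The vector releases its memory automatically, so there is no new/delete pair to get wrong, and an unconvertible character shows up as an empty result rather than as garbage.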
SOLUTION
sarabande

@evilrix, I was waiting to try it out. And I did (code below). My implementation is close to what @sarabande has posted. Of the two, I am trying to pick the one that is more efficient and elegant. Without using wcstombs, will I be making more function calls? No comments on returning a local variable?
inline std::string UCString2STLString(const UnicodeString& ustr){
	std::string str;
	for (int i=0; i < ustr.Length(); i++)
		str += ustr[i+1];
	return str;
};

inline std::string U2Sv2 (const UnicodeString& src){
	char* temp;
	temp = new char[src.Length()];
	std::wcstombs(temp, src.c_str(), src.Length());
	std::string dest(temp);
	delete[] temp;
	return dest;
}


the first shouldn't compile cause the

    str += ustr[i+1];

tries to 'add' a wchar_t to a std::string where there is no operator+= defined.

the second you should at least allocate src.Length()+1 characters or the temp wouldn't get a terminating zero char, which can cause problems when assigning it to the std::string dest (you might see garbage characters at the end).

note, i allocated 32 additional characters cause conversion could produce more than one char for one wide char if special characters were used. generally you should be generous with allocation cause wcstombs will stop at the terminator of the wide string.

Sara
ASKER CERTIFIED SOLUTION
@sarabande, it compiles and works fine...
>> the first shouldn't compile cause the
Why not?

This...

 str += ustr[i+1];

is just the same as this...

wchar_t wc = L'a';
char c = wc;

You'll probably get a warning but it will compile fine.
In case the reason why they are semantically the same isn't clear, it's because the following operator is being invoked on string.

string& string::operator+= ( char c );
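
To make the point about operator+=(char) concrete, here is a small sketch of what the narrowing does to a character outside ASCII. The helper name append_narrowed is hypothetical, and the implicit wchar_t to char conversion is written out as a cast so it is visible. The exact value of an out-of-range result was implementation-defined before C++20; common compilers keep the low-order byte.

```cpp
#include <string>

// Hypothetical helper: mimics the loop body `str += ustr[i+1]`, with the
// implicit wchar_t -> char conversion spelled out as a cast.
std::string append_narrowed(std::string s, wchar_t wc) {
    // Exactly one char is appended, whatever the value of wc.
    s += static_cast<char>(wc);
    return s;
}
```

With L'a' the result is "a". With the euro sign (0x20AC) the result is still a single char, typically the byte 0xAC, so the character is silently replaced rather than converted.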
do you mean the first function UCString2STLString would compile and work?

i don't know much of c++/cli but i have my doubts cause the str += ustr.c_str[i+1]; would drop the first char of ustr in my opinion cause a wchar_t * is 0-based in any case and your copy would start with ustr[1].

you also should see that unicode has 2**16 codes where only code 0 to 127 is equivalent to ascii and only 0 to 255 is equivalent to ansi. so even if it works now for simple strings it would fail if any real unicode character would need to get converted.

also the second function cannot work if you don't allocate at least ustr.Length()+1 characters. the wcstombs would fill the temp only for ustr.Length() characters, so it is not setting a terminating zero char. so at the end of the converted chars there might be any accidental characters until finally, also by accident, a zero occurs. you may be lucky if that is already the case with your tests, but in the long run it would fail.

Sara
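
The missing terminator can be shown in isolation. This is only a sketch: the function name wrote_terminator is hypothetical, and the buffer is pre-filled with 'X' so that an unwritten byte is easy to spot. wcstombs writes the terminating zero only if it still fits within the byte count passed as the third argument.

```cpp
#include <cstddef>
#include <cstdlib>   // std::wcstombs
#include <cstring>   // std::memset

// Hypothetical demo: converts L"abc" with a byte limit of n and reports
// whether a terminating zero was written right after the three characters.
bool wrote_terminator(std::size_t n) {
    char buf[8];
    std::memset(buf, 'X', sizeof buf);   // mark every byte as "untouched"
    if (n > sizeof buf) n = sizeof buf;  // never let wcstombs overrun buf
    std::wcstombs(buf, L"abc", n);
    return buf[3] == '\0';
}
```

With a limit of 3 (the equivalent of passing src.Length()) the fourth byte stays untouched, so a std::string built from the buffer reads past the converted text. With a limit of 4 the zero is written.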

>> do you mean the first function UCString2STLString would compile and work?
No, I just mean it will compile as there is nothing syntactically wrong with it. I've already explained above that it won't actually work and why.

thanks both for the continued contribution.

they both (code attached) seem to work and produce the output below:

[STL from U2Sv1]unicode string
equals
string from U2Sv2(wcstombs): from unicode
equals

@evilrix, good catch eliminating the new/delete, but I don't get the "// double length for conversion" part..

any concluding thoughts?



inline std::string UCString2STLString(const UnicodeString& ustr){
	std::string str;
	for (int i=0; i < ustr.Length(); i++)
		str += ustr[i+1];
	return str;
};

inline std::string U2Sv2 (const UnicodeString& src){
	char* temp;
	temp = new char[src.Length()];
	std::wcstombs(temp, src.c_str(), src.Length());
	std::string dest(temp);
	delete[] temp;
	return dest;
}



int _tmain(int argc, _TCHAR* argv[])
{
	UnicodeString str1;
	std::string str2="";

	str1 = "unicode string";
	str2 = UCString2STLString(str1);
	std::string str3 = "unicode string";
	cout<<"[STL from U2Sv1]"<<str2;
	if (str2==str3)
		cout<<"\nequals";
	else
		cout<<"\nnot equals";


	UnicodeString src = "from unicode";
	std::string dst = U2Sv2(src), cmp("from unicode");
	cout<<"\nstring from U2Sv2(wcstombs): " <<dst;
	if (dst==cmp)
		cout<<"\nequals";
	else
		cout<<"\nnot equals";



	return 0;
}
//---------------------------------------------------------------------------


>> but I don't get the "// double length for conversion" part..
Well, since you appear to be using Windows a wchar_t is going to be 2 bytes and a char 1 byte. Since a wchar_t to char conversion is a wide character to multibyte character conversion it stands to reason that each char in the wide could result in more than one char in the narrow. As a good rule of thumb (to be sure you have enough space) I generally allocate my buffer to be the same ratio bigger as a char is to a wchar_t (in this case x2). That way you should have more than enough buffer and what you don't use is released back to the OS when the vector goes out of scope. This is not a hard and fast rule, just me coding defensively :)

>> any concluding thoughts?
Only, do you now understand why you need to use wcstombs to do this rather than just copying the wide to narrow bytes? It's probably a good idea to ensure you understand this so that in future you'll know what to do and why.
sorry, evilrix, my last q. regarding the compile was directed to the author, not to you.

olmuser, i checked the docs and you are right regarding the ustr[i+1]:

System.UnicodeString.operator []
operator [] returns the character in the string at the character index value idx. The operator [] assumes a base index of 1.


you are not right that it works generally, cause unicode chars cannot simply be assigned to char unless they are ascii only.

for the second function, it only works by accident, that is when the memory allocated for the temp char array has zeros (or non-printables) directly behind the temp array.

the 'double length for conversion' was - same as me adding 32 bytes extra - the attempt to reserve storage for the case where a unicode char is translated into a multibyte sequence. but you would need to pass the doubled length also to wcstombs cause wcstombs expects the size of the multi-byte buffer as 3rd argument.

last point, returning a local variable is valid when the return type is neither a pointer nor a reference but a value. then the compiler creates a temporary for the return.

Sara
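
On the question of returning a local variable, a short sketch of the safe case described above. The function name make_greeting is hypothetical. Returning a std::string by value hands the caller its own object (copied, moved, or constructed in place, depending on the compiler), which is different from the C situation of returning a pointer to a local array.

```cpp
#include <string>

// Hypothetical example: the local object is returned by value, so the
// caller never refers to storage that has gone out of scope.
std::string make_greeting(const std::string& name) {
    std::string result = "hello, " + name;  // local variable
    return result;                          // caller gets its own std::string
}

// The dangerous pattern, left as a comment on purpose:
//   const char* broken() { char buf[16] = "oops"; return buf; }
// Here the pointer refers to buf, which no longer exists after the return.
```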


>> sorry, evilrix, my last q. regarding the compile was directed to the author, not to you.
Heh. No worries :)

>> cause unicode chars cannot simply be assigned to char unless they are ascii only
AFAIK, that depends on the encoding format (remember, Unicode is a character set not a character encoding -- Unicode can be encoded in many different ways but the most common are UTF8, 16 and 32) -- this will only work with UTF8, which was designed specifically to be backwards compatible with ASCII,  and that's already a narrow MBCS.
the wcstombs doesn't make utf8; it will only convert layer 0 (codes 0 to 255) of utf16 (== microsoft unicode) to the ansi charset. as far as i know there is no conversion of chars beyond code 255.

if you want utf8 conversion you would need to use the WideCharToMultiByte function, where you can specify UTF8 as the 'codepage' argument.

Sara
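
For readers who want to see what a UTF-16 to UTF-8 conversion produces, here is a portable hand-written sketch. It is a stand-in for illustration, not a call to WideCharToMultiByte (which is Windows-only) and not code from this thread. The function name utf16_to_utf8 is hypothetical, std::u16string stands in for the UTF-16 data held by a UnicodeString, and unpaired surrogates are replaced with U+FFFD as an assumed error policy.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: encodes UTF-16 code units as UTF-8 bytes.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        const bool high = cp >= 0xD800 && cp <= 0xDBFF;
        const bool low  = cp >= 0xDC00 && cp <= 0xDFFF;
        if (high && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine a surrogate pair into one code point above U+FFFF.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (high || low) {
            cp = 0xFFFD;  // unpaired surrogate: use the replacement character
        }
        if (cp < 0x80) {                       // 1 byte, identical to ASCII
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {               // 2 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {             // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                               // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

ASCII input comes out byte for byte unchanged, while the euro sign (U+20AC) becomes the three bytes E2 82 AC. On Windows, WideCharToMultiByte with CP_UTF8 would be the usual way to get the same kind of result.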
>> the wcstombs doesn't make utf
I know (I wasn't saying it does), I pointed that out above.

"The wcstombs converts to ANSI using your current locale (or the value of the LC_CTYPE env variable)."

I was just alluding to the fact that even if the UTF16 encoded wide is just representing ASCII you still can't just chop off the higher-order byte since there is no direct correlation with ASCII (unlike UTF8).

I think we're singing from the same hymn sheet here ;)

:)

Sara