sham_ibmgs

asked on

C++ _tstring variable

Hello

I have some code with two variables (name, value):
name: tname, value: abc
and
name: tcollectorname, value: abc.xyz.com
Both are of datatype _tstring.

Question:
1) What is the purpose of using the _tstring datatype?
2) How do I check whether tname is a substring of tcollectorname?

Regards
Sham

ASKER CERTIFIED SOLUTION
itsmeandnobodyelse (Germany)

(Solution text is only available to Experts Exchange members.)
evilrix
If it's a home-grown type then it shouldn't really start with an underscore, as all names starting with an underscore are reserved! I'm just guessing, but I wonder if it is a std::string that's been typedef'd with a TCHAR to make it work with Microsoft's T-types?

typedef std::basic_string<TCHAR> _tstring;
>>>> but I wonder if it is a std::string that's been typedef'd with a TCHAR to make it work with Microsoft's T-types?

Yes, that was my first thought too. On the other hand, _t is a poor prefix for switching between string and wstring. Why not simply name it tstring?

(But it may be one of the zillion wrong namings we see every day - *sigh*)
sham_ibmgs (ASKER)

Hello

Definition of _tstring is as follows:

namespace std
{
           typedef basic_string<_TCHAR,char_traits<_TCHAR>,allocator<_TCHAR> >  _tstring;
}


and tname and tcollectorname are of type _tstring.

_tstring tname;
_tstring  tcollectorname;

So, what could be the answer to my second question in my initial update?


Regards
Sham
>> Definition of _tstring is as follows:
That is pretty much what I suggested. Although the definition you've posted is unnecessarily verbose, they mean exactly the same thing, except for one important difference: the typedef as you've posted it places _tstring into the std namespace. Given that this is a home-grown type I, personally, would not do that (adding your own names to namespace std is not something the standard allows).

>> How do i compare,  whether tname is a substring of tcollectorname
Use find()
http://www.cplusplus.com/reference/string/string/find.html

#include <tchar.h> // needed for _T() and _TCHAR
#include <string>
#include <iostream>
 
namespace std // <---RX: Seriously, I would not place this type in the std namespace!
{
           typedef basic_string<_TCHAR,char_traits<_TCHAR>,allocator<_TCHAR> >  _tstring;
}
 
int main()
{
	std::_tstring tcollectorname = _T("I am a string");
	std::_tstring tname = _T("am");
 
	if(tcollectorname.find(tname ) != std::_tstring::npos)
	{
		std::cout << "tname is a sub-string" << std::endl;
	}
}

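And just to plug in the actual values from the original question (abc / abc.xyz.com), here is a minimal, self-contained sketch; the typedef is renamed to tstring and kept out of namespace std, as suggested above:

#include <tchar.h>   // _T() and _TCHAR (Windows-specific header)
#include <string>
#include <iostream>
 
// Same idea as the asker's _tstring, but kept out of namespace std
// and without the leading underscore.
typedef std::basic_string<_TCHAR> tstring;
 
int main()
{
	tstring tcollectorname = _T("abc.xyz.com");   // value from the question
	tstring tname          = _T("abc");           // value from the question
 
	// find() returns tstring::npos when tname occurs nowhere in tcollectorname
	if (tcollectorname.find(tname) != tstring::npos)
		std::cout << "tname is a sub-string of tcollectorname" << std::endl;
	else
		std::cout << "tname is NOT a sub-string of tcollectorname" << std::endl;
}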

FYI:

The so-called T-switch is a Microsoft-specific way of switching between the ANSI character set (8-bit characters) and the UNICODE character set (Microsoft: 16-bit, where the lower 8 bits still represent ANSI) by defining - or not defining - the preprocessor macro _UNICODE. In an ideal Microsoft world you would simply take one of your old projects which doesn't know about UNICODE, define _UNICODE in your project settings, rebuild it, and the program could now handle wide characters instead of single-byte ones (sometimes also called 'multi-byte', as the character set may also contain a few 2-, 3- or 4-byte sequences for a character).

Unfortunately, the world is bad and there are a few issues which may spoil that ideal. One is that any text literal must be wrapped in the _T macro - see the _T("I am a string") in the code of rx. If not, it mostly wouldn't compile when UNICODE is switched on, because literals then need an L prefix, e.g. L"I am a string", to make them wide strings. Another issue is that other software providers rarely implement the T-switch for their libraries, so at least at interfaces you can't simply use _T or the MFC CString class (which has the T-switch implemented) but need to provide either ANSI or wide strings independent of the _UNICODE macro setting.

The same applies to the STL (C++ Standard Template Library). There you have a std::string and a std::wstring and you take either the one or the other. That is normally no problem - actually I never had the requirement for an automatic switch myself, and I am deeply convinced that the T-switch was a really bad idea of MS which solves problems we only have because of the T-switch. On the contrary, if for example you want to handle Unicode text files *and* ANSI text files, you are fighting windmills when using the T-switch, because it is either UNICODE or ANSI but never both.
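To make that a bit more concrete, here is a rough sketch of what the T-switch boils down to. This is a simplified illustration with made-up names (tchar_sketch, T_SKETCH), not the real contents of <tchar.h>:

#include <string>
#include <iostream>
 
// Simplified illustration of the T-switch idea (made-up names, NOT the real
// <tchar.h>): the character type and the literal macro both follow _UNICODE.
#ifdef _UNICODE
	typedef wchar_t tchar_sketch;          // wide character (16 bit on Windows)
	#define T_SKETCH(x) L##x               // turns "..." into L"..."
#else
	typedef char tchar_sketch;             // narrow 8-bit character
	#define T_SKETCH(x) x                  // plain "..." literal
#endif
 
// A string type that follows the switch, analogous to the _tstring typedef above
typedef std::basic_string<tchar_sketch> tstring_sketch;
 
int main()
{
	// Compiles as either a narrow or a wide string, depending on _UNICODE
	tstring_sketch s = T_SKETCH("I am a string");
	std::cout << "length: " << s.size() << std::endl;
	return 0;
}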
>> when using the T-switch, because it is either UNICODE or ANSI but never both.
Well, at the risk of being pedantic, it is neither, and I wish Microsoft would stop this propaganda!

Unicode is a character set that can be represented in 8-, 16- or 32-bit encodings (UTF-8/16/32). All the T-switch does is change the compiler's idea of the character type (TCHAR) between 8 and 16 bit. This does NOT make it Unicode; it merely makes it capable of coping with UTF-8 or UTF-16 respectively using a single data type. I find it causes more problems than it solves; you can write perfectly safe/portable Unicode-compliant code without it, and in fact only UTF-8 is going to work cross-platform without any code tweaks because, for example, whereas Windows thinks the wide format is 16 bit, Linux expects it to be 32 bit. Both platforms use 8 bit as their narrow format.

I am also baffled why Microsoft describe UTF-8 as multi-byte (correctly) and yet imply that UTF-16 isn't. In fact, both UTF-8 and UTF-16 are multi-byte, with only UTF-32 being a fixed-size format!

Slightly OT rant ends :)
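If you want to see the data-type side of this on your own machine, here is a tiny sketch that just prints the storage sizes the compiler uses; the wchar_t line is where Windows and Linux typically disagree:

#include <iostream>
 
// Prints raw storage sizes only -- it says nothing about encoding,
// which is exactly the point being made above.
int main()
{
	std::cout << "sizeof(char)    = " << sizeof(char)    << " byte(s)\n";  // 1 everywhere
	std::cout << "sizeof(wchar_t) = " << sizeof(wchar_t) << " byte(s)\n";  // 2 on Windows, 4 on typical Linux
}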
@evil:   "you can write perfectly safe/portable Unicode complaint code without it and in fact only UTF8 is going to work x-platform without any code tweaks "

Say, If we try to run an .exe which takes command line argument on Chinese OS, and if one argument takes Chinese string, then will 8-bit pattern work? because ANSI chanracter set(8bit)  include {A-Z,0-9, and some more characters} not chinese character set.
Don't u think 16-bit character set helps us in this case?
What does UTF stand for?
Regards
Sham
>> will an 8-bit pattern work?
If it is UTF8 encoded and that is what you are expecting it to be, yes.

>> because the ANSI character set (8-bit) includes {A-Z, 0-9, and some more characters}, not the Chinese character set
You seem to be confusing storage types (a char is just an 8-bit storage type; it does not designate any specific encoding) with character encodings. They are not the same thing.

>> Don't you think a 16-bit character set helps us in this case?
No, you can represent Unicode using 8-, 16- and 32-bit data types. It's all a matter of how they are encoded. Unicode is a character set made up of code points. Each character has a unique code point, which is a number that represents it (conventionally handled as a 32-bit value). UTF-8/16/32 are ways to encode these code points so they fit into 8-, 16- and 32-bit data types respectively (there are other UTF formats too, such as UTF-7, but don't worry too much about them as they are specialised formats). The C++ types char and wchar_t are (normally) 8 bit [narrow] and 16 bit [wide] (32 bit on Linux) respectively. They can be used to store the UTF encodings but they DO NOT themselves represent ANSI or Unicode.

>> What does UTF stand for?
Unicode Transformation Format.
http://en.wikipedia.org/wiki/UTF-8
http://en.wikipedia.org/wiki/UTF-16
http://en.wikipedia.org/wiki/UTF-32
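As a concrete illustration of code points vs. encodings (using a Chinese character, since that was the example above): assuming the code point U+4E2D, the same character takes a different number of code units in each UTF form. A small sketch with the byte values written out by hand:

#include <cstdio>
 
// Illustration only: one Unicode code point (U+4E2D) written out in the
// three transformation formats, as plain integer arrays.
int main()
{
	unsigned char  utf8 [] = { 0xE4, 0xB8, 0xAD };   // three 8-bit code units
	unsigned short utf16[] = { 0x4E2D };             // one 16-bit code unit
	unsigned int   utf32[] = { 0x00004E2D };         // one 32-bit code unit
 
	std::printf("UTF-8 : %u code units\n", (unsigned)(sizeof utf8  / sizeof utf8 [0]));
	std::printf("UTF-16: %u code units\n", (unsigned)(sizeof utf16 / sizeof utf16[0]));
	std::printf("UTF-32: %u code units\n", (unsigned)(sizeof utf32 / sizeof utf32[0]));
}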
>>>> They can be used to store the UTF encodings but they DO NOT themselves represent ANSI or Unicode.
So the analysis is correct: it doesn't help for ordinary string handling, where in 95 + x percent of all cases UNICODE encoding plays no role. Since 1981 I have been involved in hundreds of projects (some of them international) and never actually had the need to display UNICODE characters outside of the ANSI character set.

I agree that the web has found a convincing solution by defining UTF-8, UTF-16 and UTF-32. But I don't think we can take it as a general approach for string handling in C/C++.
>> But I don't think we can take it as a general approach for string handling in C/C++.
I wasn't saying we should. I was just trying to make sure the asker understands the difference between Unicode code points, encoding formats and data types. The important thing is that UTF-8 will always work on any platform, and it has the nice side effect (by design, of course) of being 100% backward compatible with plain 7-bit ASCII. Sweet :)
@evil: Does the char datatype map to UTF-8 encoding or ANSI character encoding?
Does the wchar_t datatype map to UTF-32 encoding or something else?

Regards
Sham

@evil: Unicode is a 32-bit representation, so 2^32 characters can be represented. So how does UTF-8 encoding map those characters onto 8-bit units?

Regards
Sham

>> Does the char datatype map to UTF-8 encoding or ANSI character encoding?
It doesn't map to any encoding; re-read what I've posted above. There is no direct link between a data type and an encoding, except that a data type must be of an appropriate size to be able to support a specific encoding. UTF-8 is an 8-bit multi-byte encoding; since a char type is typically a signed 8-bit type in C++, it is a suitable type to support this encoding format.

>> Does the wchar_t datatype map to UTF-32 encoding or something else?
Again, see my comment above. The type wchar_t is different on different platforms. On Windows it is typically 16 bit, so UTF-16 is the natural encoding format to use with this type. On Linux it is 32 bit, hence UTF-32 would be the natural encoding format for this type there. And this is the problem: UTF-16/32 do not port easily between platforms in C++ due to data-type size inconsistencies. A char type, however, is typically 8 bit on all platforms, so UTF-8 is generally more portable if your code needs to work cross-platform. If you are only targeting Windows you are probably better off using UTF-16, since this is the 'native' Unicode encoding format supported by the Win32 API.

>> Unicode is a 32-bit representation, so 2^32 characters can be represented. So how does UTF-8 encoding map those characters onto 8-bit units?

It is a multi-byte encoding format (same as UTF-16). Multiple 8-bit bytes are used to encode the 32-bit code point. The number of bytes is variable, and the number of bytes used to represent a code point is encoded into the format itself. I'm not going to try to explain exactly how this works here, as the format is well documented on many internet reference sites (for example, the Wikipedia links I posted above).
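If it helps to see the mechanics anyway, here is a rough sketch of the UTF-8 packing rules for a single code point. It is deliberately simplified: only a basic range check, no handling of surrogate values or other invalid input:

#include <cstdio>
#include <vector>
 
// Simplified illustration of the UTF-8 packing rules described above:
// the code point's bits are spread over 1-4 bytes, and the first byte
// tells a decoder how many bytes follow.
std::vector<unsigned char> encodeUtf8(unsigned int cp)
{
	std::vector<unsigned char> out;
	if (cp <= 0x7F) {                       // 1 byte: plain ASCII range
		out.push_back(static_cast<unsigned char>(cp));
	} else if (cp <= 0x7FF) {               // 2 bytes
		out.push_back(static_cast<unsigned char>(0xC0 | (cp >> 6)));
		out.push_back(static_cast<unsigned char>(0x80 | (cp & 0x3F)));
	} else if (cp <= 0xFFFF) {              // 3 bytes
		out.push_back(static_cast<unsigned char>(0xE0 | (cp >> 12)));
		out.push_back(static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F)));
		out.push_back(static_cast<unsigned char>(0x80 | (cp & 0x3F)));
	} else if (cp <= 0x10FFFF) {            // 4 bytes: up to the highest valid code point
		out.push_back(static_cast<unsigned char>(0xF0 | (cp >> 18)));
		out.push_back(static_cast<unsigned char>(0x80 | ((cp >> 12) & 0x3F)));
		out.push_back(static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F)));
		out.push_back(static_cast<unsigned char>(0x80 | (cp & 0x3F)));
	}
	return out;
}
 
int main()
{
	// U+4E2D (a Chinese character) should come out as E4 B8 AD
	std::vector<unsigned char> bytes = encodeUtf8(0x4E2D);
	for (std::size_t i = 0; i < bytes.size(); ++i)
		std::printf("%02X ", static_cast<unsigned>(bytes[i]));
	std::printf("\n");
}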
@evil: So, are these UTF encoding formats placed in an 8-bit char datatype?

When I say
char ch = 'a';
in C, assume I did not enable the Unicode macro.

What is stored in ch? UTF-8?

Regards
Sham
>> So, are these UTF encoding formats placed in an 8-bit char datatype?
Yes, but multiple 8-bit values may be required; it depends upon the code point being encoded. For normal ASCII characters only 1 byte is required. For other languages you might need multiple bytes.

>> What is stored in ch? UTF-8?
No, just one standard ANSI (ASCII) lower-case 'a'.

C/C++ has no native concept of Unicode at all, so there is no standard way to define a Unicode character; you have to rely on underlying support from either the OS or a 3rd-party library such as ICU.
http://www.icu-project.org/

NB. The latest version of the standard (due out in 2009) will have native Unicode support.
http://en.wikipedia.org/wiki/C%2B%2B0x#New_string_literals

This section of the MSDN goes into some detail about how Windows handles Unicode.
http://msdn.microsoft.com/en-us/library/ms776440(VS.85).aspx
@evil: When I say
char ch = 'a';
in C, assume I enable the Unicode macro.

What is stored in ch?

Regards
Sham


>> in C, assume I enable the Unicode macro.
It makes no difference: the UNICODE macro controls how the standard Win32 API functions and types work (whether the wide or narrow versions of those functions are used, and whether TCHAR is defined as the wide or narrow form; TCHAR is a Microsoft type and NOT a standard C/C++ type).

You are not doing anything with Unicode here (really!). C/C++ standard types and literals are NOT Unicode-aware (in the current standard; see my note about the new standard). Only if you use an OS-level function or a 3rd-party library (e.g. ICU) will you be able to generate properly Unicode-encoded chars. You can always, of course, hex-encode the code points by hand (in the same way as you can hex-encode anything in C/C++ into a literal string); however, this is not the same as native support and is not really a workable/usable solution.

>> What is stored in ch?
The ANSI (ASCII) value of the letter a. This is all C/C++ knows about. Try it yourself: using the debugger, run this code once with UNICODE defined and once without. Each time you'll see the value of ch is the same, and it'll be the ANSI (ASCII) value of the letter a.
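A minimal version of that experiment: the printed value is 97, the ANSI/ASCII code of 'a', no matter whether UNICODE/_UNICODE was defined for the build:

#include <iostream>
 
int main()
{
	char ch = 'a';
	// Prints 97 regardless of whether the UNICODE / _UNICODE macros were
	// defined for this translation unit -- those macros do not change the
	// meaning of a plain char literal.
	std::cout << static_cast<int>(ch) << std::endl;
}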
Hi TheLearnedOne.

>> itsmeandnobodyelse {http:#23057018}
I'm curious as to why you chose this answer since it just expresses an opinion and provides no answer for the Q asked. Maybe {http:#23002987} would have been a better choice since it answers the original Q, possibly (but not necessarily) along with one of my answers, which provides further information on the TCHAR type and how it is (or isn't) related to Unicode and the various transformation formats.

Alex, do you have a view on this?
>>>> Alex, do you have a view on this?
No, I missed it.

@TheLearnedOne

evilrix has brought light into that thread (maybe more than me). You should give him at least half of the points.

Regards, Alex