• Status: Solved

cdata and \u encoded and unicode and base 64 encoded text utf8 and utf32

I am confused by the following terms:

CDATA, \u-encoded text, Unicode, Base64-encoded text, UTF-8 and UTF-32.

When and in which scenarios do we use them, especially in programming languages? How do they relate to content data such as audio, video and images?
How does JSON treat them differently compared to XML?
Please advise.
Asked by: gudii9
2 Solutions
 
Zoppo commented:
Hi gudii9,

This is not easy to answer briefly, but I'll try to keep it as short as I can. If anything is unclear afterwards, please say so.

1. Unicode: In the old days, texts (= strings) were 7- or 8-bit ASCII, so the maximum number of different characters was 128 or 256. That is not enough to encode the characters of different languages in one text; for some languages it is even too small for the complete character set.

So, to allow characters of any language to be used, a new encoding standard called Unicode was developed (besides other methods such as multi-byte character sets). A complete Unicode font contains a representation for every letter, digit, punctuation mark and other text element of all existing languages, so they can all be used together, even mixed within one text.

In Unicode, every character is assigned a number called a code point, in the range U+0000 to U+10FFFF. A code point needs at most 21 bits, so it always fits into a 32-bit value.
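To make the idea of a character's numeric value concrete, here is a minimal Python sketch (Python is my choice for illustration; the thread itself names no language). It prints the Unicode number of a few sample characters, which I picked arbitrarily, and the highest number Unicode defines.

```python
import sys

# Each character maps to one code point (an integer).
samples = ["A", "é", "€", "😀"]
code_points = {ch: ord(ch) for ch in samples}
for ch, cp in code_points.items():
    print(f"{ch!r}: U+{cp:04X}")

# The highest code point Unicode defines is U+10FFFF.
highest = sys.maxunicode
print(hex(highest), "needs", highest.bit_length(), "bits")
```

chr() goes the other way, so chr(0x20AC) gives back the euro sign.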

To keep it flexible (and to keep the memory overhead small), several transformation formats were defined; the most important are UTF-8, UTF-16 and UTF-32. In UTF-8 and UTF-16 the characters are encoded in 'units' of 8 or 16 bits (= 1 or 2 bytes) in such a way that the most commonly used characters take one single unit, while less common characters take more: up to four units in UTF-8 and two in UTF-16. UTF-32 always uses exactly one 32-bit unit per character.
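As a rough illustration of the different unit sizes, the following Python sketch counts how many bytes a character needs in each of the three formats. The helper name encoded_sizes and the sample characters are my own, not from any library.

```python
def encoded_sizes(ch):
    """Return the number of bytes ch needs in UTF-8, UTF-16 and UTF-32."""
    # The "-le" codec variants are used so that no BOM is prepended.
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return tuple(len(ch.encode(enc)) for enc in encodings)

for ch in ["A", "é", "€", "😀"]:
    print(repr(ch), encoded_sizes(ch))
```

Plain ASCII letters take one byte in UTF-8, while a character outside the 16-bit range takes four bytes in all three formats.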

For writing programs this means: whenever you handle string data from external sources (e.g. XML files, databases, UTF-8/16 text files, ...) you probably have to convert from one format to another. Native Windows programs (e.g. written in C++ with Unicode support) internally use UTF-16 (so-called 'wide-char') strings. When such a program loads an XML file that uses UTF-8 (i.e. the 'xml' PI node contains encoding="UTF-8"), the loaded strings need to be transformed from UTF-8 to UTF-16 before they can be used in the program.
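The conversion step can be sketched in Python as "decode, then encode again". This is only an illustration of the idea, not how a Windows C++ program would do it; the function name utf8_to_utf16 and the sample word are assumptions of mine.

```python
def utf8_to_utf16(data: bytes) -> bytes:
    """Convert UTF-8 encoded bytes to UTF-16 (little endian, no BOM)."""
    text = data.decode("utf-8")      # bytes -> abstract Unicode string
    return text.encode("utf-16-le")  # Unicode string -> UTF-16 bytes

utf8_bytes = "Grüße".encode("utf-8")
utf16_bytes = utf8_to_utf16(utf8_bytes)
print(len(utf8_bytes), len(utf16_bytes))
```

The text is the same in both cases; only the byte representation (and its length) changes.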

For more info about the BOM (byte order mark, described in the next paragraph) take a look at https://en.wikipedia.org/wiki/Byte_order_mark#Byte_order_marks_by_encoding

BOM: Even 'normal' text files (e.g. TXT or JSON) can use Unicode. Generally there are three possibilities:

- Use escape sequences to encode single characters (for example \u00e9 in JSON). An example can be found in the chapter Data portability issues at https://en.wikipedia.org/wiki/JSON
- 'Pure' text files can really store characters as UTF-8/16/32 with the appropriate number of bytes per character. The problem is: the program which reads the file has to 'guess' whether it is one of the UTF encodings or ASCII.
- Text files can be marked with a so-called Byte Order Mark (BOM). This is a defined character (or better said: byte) sequence at the very start of the file. When a program opens a text file it first has to check for the existence of such a BOM. If one is found, the encoding of the file is clear; if not, the program has to guess as mentioned before.

For more info about this you could start at: https://en.wikipedia.org/wiki/Unicode
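The BOM check from the third point above could look roughly like the following Python sketch. The BOM constants come from the standard codecs module; the function name decode_with_bom and the UTF-8 fallback are my own assumptions (a real program might guess differently when no BOM is found).

```python
import codecs

# Order matters: the UTF-32 LE BOM starts with the same two bytes
# as the UTF-16 LE BOM, so the longer BOMs are checked first.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def decode_with_bom(data: bytes, fallback: str = "utf-8") -> str:
    """Decode data using its BOM if present, else the fallback guess."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding)
    return data.decode(fallback)

print(decode_with_bom(codecs.BOM_UTF16_LE + "hé".encode("utf-16-le")))
```

Without a BOM the function simply assumes the fallback, which is exactly the 'guessing' mentioned above.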

2. CDATA: In XML/SGML files the complete text is usually interpreted by the parser. This sometimes makes it difficult to put text into those files if it contains strings (such as < or &) that the parser may interpret in an unexpected way.

To avoid this, such text can be put into a CDATA section (CDATA is short for Character DATA), which starts with <![CDATA[ and ends with ]]>. This 'tells' the parser to treat the content as normal text instead of trying to evaluate it.

For more info take a look at https://en.wikipedia.org/wiki/CDATA
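A small Python sketch of what this means in practice, using the standard xml.etree.ElementTree parser. The element name and the text inside the CDATA section are made up for the example.

```python
import xml.etree.ElementTree as ET

# The CDATA content contains <, & and even something that looks like a tag.
xml_text = "<script><![CDATA[if (a < b && b > c) { print('<done/>'); }]]></script>"

root = ET.fromstring(xml_text)
cdata_text = root.text      # the content arrives as plain text
child_count = len(root)     # nothing inside was parsed as an element

print(cdata_text)
print(child_count)
```

Without the CDATA section the same content would have to be written with &lt; and &amp; to be well-formed.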

3. base64 encoding: This is a method to convert any 8-bit data into text which only contains language- and codepage-independent, printable 7-bit ASCII characters. The method uses 64 different characters, so each character can represent 6 bits. This means there is data overhead: every 3 bytes of input become 4 characters of output, so the resulting string needs about a third more memory. The benefit is that those strings can be used in nearly every text file format to store arbitrary binary data. Besides base64 there are other encodings. One of the most popular is hex encoding, which converts 8-bit data to strings which only contain the characters 0-9 and A-F. Although its memory overhead is larger than with base64 (two characters per byte), hex encoding is quite popular because its implementation is easy and fast.
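To compare the two overheads, here is a short Python sketch using the standard base64 module and the built-in bytes.hex(). The 768 sample bytes are arbitrary test data.

```python
import base64

data = bytes(range(256)) * 3  # 768 bytes of arbitrary binary data

b64_text = base64.b64encode(data).decode("ascii")  # 4 chars per 3 bytes
hex_text = data.hex()                              # 2 chars per byte

print(len(data), len(b64_text), len(hex_text))

# Both encodings are lossless: decoding returns the original bytes.
restored_from_b64 = base64.b64decode(b64_text)
restored_from_hex = bytes.fromhex(hex_text)
```

Note that Python's hex() produces lowercase a-f; bytes.fromhex() accepts both cases.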

4. About XML and JSON: In XML the encoding used can (as described above) be set in the 'xml' PI node (the XML declaration), e.g.:
<?xml version="1.0" encoding="UTF-8"?>


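AFAIK for JSON it's even possible to use UTF-8, UTF-16 and UT-32 (where UTF-8 is the default), but there's no kind of defining code like in XML, instead it simply uses the encoding of the text-file (see declaration of BOM above at 1.). Note that the current JSON specification (RFC 8259) is stricter: JSON exchanged between different systems must be UTF-8, and no BOM should be added to it.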
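Here is a hedged Python sketch of both JSON options, using the standard json module: \u escape sequences (so the file stays pure ASCII) versus storing the characters directly in some UTF encoding. The sample dictionary is made up. That json.loads accepts UTF-8, UTF-16 and UTF-32 encoded bytes is a feature of Python's implementation, not a statement about other parsers.

```python
import json

record = {"city": "Zürich", "price": "5 €"}

# Option 1: non-ASCII characters become \uXXXX escapes (Python's default).
escaped = json.dumps(record)
# Option 2: characters are kept as they are; the file encoding must carry them.
raw = json.dumps(record, ensure_ascii=False)

print(escaped)
print(raw)

# Python's parser detects the UTF variant of byte input by itself.
from_utf8 = json.loads(raw.encode("utf-8"))
from_utf16 = json.loads(raw.encode("utf-16"))
from_utf32 = json.loads(raw.encode("utf-32"))
```

Either way the parsed result is the same data; the escapes are just another spelling of the same characters.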

All these points matter mainly for text files. With audio, video, image or any other binary file formats they only become relevant when such data needs to be embedded into text files like XML or JSON.
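Embedding binary data into a JSON document could be sketched like this in Python. The field names "name" and "data", the helper names and the file name logo.png are purely hypothetical; the payload is just the 8-byte PNG file signature standing in for a real image.

```python
import base64
import json

def embed_binary(name: str, payload: bytes) -> str:
    """Wrap binary data as base64 text inside a JSON document."""
    encoded = base64.b64encode(payload).decode("ascii")
    return json.dumps({"name": name, "data": encoded})

def extract_binary(document: str) -> bytes:
    """Recover the original bytes from the JSON document."""
    return base64.b64decode(json.loads(document)["data"])

png_signature = b"\x89PNG\r\n\x1a\n"  # stand-in for real image bytes
doc = embed_binary("logo.png", png_signature)
print(doc)
```

The same base64 string could equally be placed in an XML element; the text format never sees the raw bytes.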


OK, this was not short, but as I think you can see, it is not a trivial subject. I hope I covered all of your questions; please feel free to ask for further explanations.

Hope this helps,

ZOPPO
 
BigRat commented:
An important point about CDATA is that it is NOT an encoding. It contains the same text characters as the rest of the XML document, so you cannot use characters that are NOT allowed in normal XML. The text enclosed in the CDATA "brackets" can look like XML, but it will not be interpreted as such.
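Both parts of this point can be checked with a small Python sketch (standard xml.etree.ElementTree; the element name v and the helper parses are my own). First, a CDATA section and ordinary escaped text produce the same string. Second, a character that XML 1.0 forbids, here the control character U+0001, is rejected even inside CDATA.

```python
import xml.etree.ElementTree as ET

# CDATA and entity escaping are two spellings of the same text.
with_cdata = ET.fromstring("<v><![CDATA[a < b & c]]></v>")
escaped = ET.fromstring("<v>a &lt; b &amp; c</v>")
same_text = with_cdata.text == escaped.text

def parses(xml_text: str) -> bool:
    """Return True if xml_text is accepted by the parser."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# U+0001 is not a legal XML 1.0 character, with or without CDATA.
control_char_ok = parses("<v><![CDATA[\x01]]></v>")

print(same_text, control_char_ok)
```

So data containing such bytes still has to be encoded (for example with base64) before it goes into XML.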
 
gudii9 (author) commented:
This looks like a deep concept. Are there any good free video tutorials on these concepts? Please advise.
 
Zoppo commented:
Well, as I think you can see from the above, Unicode character encoding is a non-trivial issue. IMO what I wrote should answer many of your questions. If you have problems with some details or particular points, please feel free to ask.

I'm not sure there is one single good video (in fact I don't like video tutorials for such things, because I have my own learning pace, and because there's no cut-and-paste ;o). Unfortunately I can't point you to tutorials other than those everyone can find, e.g. via Google: https://www.google.de/search?q=unicode+video+tutorial

Best regards,

ZOPPO
 
BigRat commented:
Around what "deep concept"?

For what purpose do you want to know this information? The question is far too wide to be answered in detail.
 
gudii9 (author) commented:
"AFAIK for JSON it's even possible to use UTF-8, UTF-16 and UT-32 (where UTF-8 is the default), but there's no kind of defining code like in XML, instead it simply uses the encoding of the text-file (see declaration of BOM above at 1.)."
In JSON, how do we specify the different types like UTF-8/UTF-16/UTF-32/UTF-64 etc.?

Is it UT-32 or UTF-32? Is there any difference?
 
BigRat commented:
Difference? I don't understand all of these abbreviations which aren't standard.

As far as JSON is concerned: if it is the response to an HTTP request, the MIME type (the Content-Type header) will say whether the JSON is ANSI/UTF-8/UTF-16.
If it comes from Node.js it will almost always be UTF-8. Otherwise a file BOM will be present, or else it is mostly UTF-8 on Linux and ANSI on Windows.
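Reading the encoding from the HTTP header might be sketched as follows in Python. It borrows the header parser from the standard email package; the function names and the UTF-8 default are my assumptions, and a real HTTP client library would normally do this step for you.

```python
from email.message import Message

def charset_from_content_type(header_value: str, default: str = "utf-8") -> str:
    """Extract the charset parameter of a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns None if no charset parameter is given.
    return msg.get_content_charset() or default

def decode_body(body: bytes, header_value: str) -> str:
    """Decode a response body according to its Content-Type header."""
    return body.decode(charset_from_content_type(header_value))

body = '{"name": "Zoë"}'.encode("utf-16")
print(decode_body(body, "application/json; charset=UTF-16"))
```

If the header carries no charset parameter, the sketch falls back to UTF-8.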