• Status: Solved

UTF-16 Support in CSS

I need to create a CSS-styled document containing double-byte characters. All the HTML tags will be single-byte; only the data to be displayed will be double-byte. Please tell me which encoding I have to declare at the beginning of the HTML document, and if somebody has an example, do let me know.

Thanks
Asked by: Capslock
 
avner Commented:
<META HTTP-EQUIV="CONTENT-TYPE" CONTENT="TEXT/HTML; CHARSET=UTF-16">
 
Capslock (Author) Commented:
I have already tried this solution, but it didn't work. If somebody has an example of such an HTML document, please post it.
 
dorward Commented:
I believe you need to configure the _server_ to send the correct content type and character set in the HTTP headers - for both HTML and CSS documents.

How you do this depends on the server software you use.
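
A minimal sketch of what "configure the server" amounts to, written in Python purely for illustration: the server picks the Content-Type per file and appends an explicit charset for text resources. The extension table, the SITE_CHARSET value and the CharsetHandler name are assumptions made for this example, not the setup of any particular server.

```python
import os
from http.server import SimpleHTTPRequestHandler

# Assumption for this sketch: every text resource on the site is stored as UTF-16.
SITE_CHARSET = "utf-16"

# Small hypothetical extension table; a real server has a much longer one.
TYPES = {".html": "text/html", ".css": "text/css", ".gif": "image/gif"}


def content_type_for(path, charset=SITE_CHARSET):
    """Return a Content-Type value, adding a charset parameter to text/* types."""
    ext = os.path.splitext(path)[1].lower()
    mime = TYPES.get(ext, "application/octet-stream")
    if mime.startswith("text/"):
        return f"{mime}; charset={charset}"
    return mime


class CharsetHandler(SimpleHTTPRequestHandler):
    """Static file handler that sends the charset in the HTTP header."""

    def guess_type(self, path):
        # SimpleHTTPRequestHandler uses this value for the Content-type header.
        return content_type_for(str(path))
```

Serving a directory with such a handler would send, for example, Content-Type: text/css; charset=utf-16 for every stylesheet. The files themselves must still really be saved in that encoding.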
 
cirtap Commented:
Hi Capslock,

At least Mozilla wants the file to actually BE saved as UTF-16, and the CSS as well, in case you use the content property or the :before/:after selectors.
This happened to me with a file 'declared' as UTF-8 (using <meta>) that was actually saved as ANSI.
This is the first thing you should check, especially if the files are loaded from the local drive for testing.
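
A quick way to check how a file was really saved, as a hedged sketch in Python: look for a byte-order mark at the start and try to decode the bytes with the declared encoding. The helper names sniff_bom and decodes_as are made up for this example. Note that a successful decode is only a hint, not proof, and that a UTF-32 little-endian file would also start with FF FE (ignored here).

```python
import codecs

# Byte-order marks ("magic headers") and the encoding each one announces.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def sniff_bom(data: bytes):
    """Return the encoding announced by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def decodes_as(data: bytes, declared: str) -> bool:
    """Return True if the bytes are at least valid in the declared encoding."""
    try:
        data.decode(declared)
    except UnicodeDecodeError:
        return False
    return True
```

For a file on disk, read it in binary mode (open(path, "rb").read()) and pass the bytes to both helpers. A file declared as UTF-8 but saved as ANSI typically fails decodes_as as soon as it contains an accented character.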

Then check the headers sent from the server with this free online-tool: http://www.delorie.com/web/headers.html

I had to add the encoding to the <link> of the stylesheet as well, i.e. <link type="text/css" charset="UTF-8"> (the parameter is called charset, not encoding). This is basically the same as the meta content-type. To be safe, I also added these metas:
<meta http-equiv="Content-Style-Type" content="text/css; charset=UTF-8" />
<meta http-equiv="Content-Script-Type" content="text/javascript; charset=UTF-8" />
(or UTF-16 in your case)

The meta http-equiv "statements", as well as the type attributes on link, script, a etc., were "invented" to avoid configuration changes on the server, and simply because not every server can be configured to know every single MIME type. Because iso-8859-1 is the default for all files in the western world and works well, they are usually omitted and barely mentioned anywhere.
Note that they do not override a charset sent in the real HTTP response: if the server sends one in its Content-Type header, that one wins, and the values given in the content="" or type="" attributes are only used when the header gives none.

CirTap
 
cirtap Commented:
I just checked a couple of my files: the HTML and CSS are sent without an encoding, so the browser's default applies. PHP pages are different: by default PHP adds a charset to the Content-Type header, "iso-8859-1" in my case.
So the METAs and types should work on the client side.

Good luck
 
Capslock (Author) Commented:
Well, I must explain something. I have a program which works on a certain input file and produces an MHTML file. This program runs on Unix, so there is no question of saving the file as ANSI or UTF-8 from an editor. Now, when I save an HTML file on NT as UTF-8, I get a three-byte header: 0xEF, 0xBB, 0xBF. When I change my program to put this header into the MHTML file after the MHTML header, that is

Content-Type: multipart/related; boundary="==boundary-1"; type="text/html

Text displayed only to non-MIME-compliant mailers
--==boundary-1
Content-Type: text/html; charset=utf-16
Content-Transfer-Encoding: 16bit

I am able to view my MHTML file with correct UTF-8 data in Internet Explorer. I have two questions related to this:

1) My data is in UTF-8 format, so why do I need to specify the charset as UTF-16? If I change this to utf-8 and the encoding to 8bit, it doesn't work.

2) The same file which opens in IE does not open in Netscape.

Please help me.

 
cirtap Commented:
> Please help me
I'm trying to :)

First: I'm wondering about your "MHTML file". I never figured out what this "Microsoft HTML File" is supposed to be, or what it is good for in real-life practice, except that there's a New-File template for it. I guess it appears if MS Office or FrontPad/FrontPage is installed, so Explorer and MSIE may "switch" to the corresponding editor. In MSIE, an HTML file saved from Word (with .html as the extension) will always open with Word, even if you have another default editor for .html files.
This is Windoze specific stuff.

The "magic header" 0xEF,0xBB,0xBF comes from NT Notepad (or Emacs as well) if the file is saved as UTF-8; it is the UTF-8 form of the Unicode byte-order mark. For UTF-16 the header differs and also indicates the byte order in that file: big-endian/little-endian aka Motorola/Intel format.
AFAIK it's "recommended" for Unicode [XML] files to have this header if no encoding is otherwise given.
So when your file is "labelled" as UTF-16 by the magic header BUT does not contain double-byte code, some User-Agents/applications may not recognize the content as what it really is (ASCII or ANSI), and they're not required to. It's like having a GIF file, changing the extension to .TIF and expecting this to also convert the 'data' in the file.
The same applies to double-byte TEXT files. ANSI, UTF-8 and UTF-16 (Unicode) *ARE* different things!
Assume a file with only the 6 characters abcdef:
ANSI: 6 bytes
UTF-8: 6 bytes (9 with the magic header EF BB BF)
UTF-16 little-endian: 14 bytes; magic header FF FE
UTF-16 big-endian: 14 bytes; magic header FE FF
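
The byte counts for the six-character example can be checked with a few lines of Python. This is only an illustration; cp1252 stands in for "ANSI" here, which is an assumption about the Windows code page in use.

```python
import codecs

text = "abcdef"

# Size in bytes of the same six characters in each encoding.
sizes = {
    "ANSI (cp1252)": len(text.encode("cp1252")),
    "UTF-8": len(text.encode("utf-8")),
    "UTF-8 with magic header": len(codecs.BOM_UTF8 + text.encode("utf-8")),
    "UTF-16 little-endian with magic header":
        len(codecs.BOM_UTF16_LE + text.encode("utf-16-le")),
    "UTF-16 big-endian with magic header":
        len(codecs.BOM_UTF16_BE + text.encode("utf-16-be")),
}

for label, size in sizes.items():
    print(f"{label}: {size} bytes")

# A Chinese character such as U+5730 takes 3 bytes in UTF-8 and 2 in UTF-16.
cjk = "\u5730"
cjk_sizes = (len(cjk.encode("utf-8")), len(cjk.encode("utf-16-le")))
```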

Understanding this, something like
> All the HTML tags will be in single byte only the data
> that is to be displayed will be in double bytes.
is not possible in UTF-16. Sure, you can create such files, but it's like having a BMP file with some parts using JPG compression. That won't work either.
In fact, if you save an HTML file as UTF-16, the tags are encoded in double bytes as well. UTF-8 is the one encoding where plain ASCII tags stay single-byte while other characters take two or more bytes, but even then the whole file is UTF-8. There is no way to mix encodings within one file.

Just because MSIE is nice and 'analyzes' the data and finds out the file contains both ANSI and UTF-* although labeled as UTF-8/16 does not mean Netscape and others have to do this as well: they TRUST the header and they are right to do so.
Because MSIE always tries to be so super-smart, it's why many (mail-)viruses work in Windows.

From your last posting I assume you're actually creating an e-mail with an HTML body/content or attachment.
If you DO need double-byte characters in your mail, because there's Asian text or the like in it, your mail-builder script must be able to write the "boundary" headers with the right encoding. First, the opening line
  Content-Type: multipart/related; boundary="==boundary-1"; type="text/html
is missing its closing quote and should read
  Content-Type: multipart/related; boundary="==boundary-1"; type="text/html"
with the encoding itself declared in the header of the HTML part, e.g. Content-Type: text/html; charset=UTF-8,

and the WHOLE FILE must be in that format incl. tags, or conformant applications will ignore it or show stupid things, displaying the second bytes as blocks or weird special characters.

I had to do something similar in PHP lately when creating an XML file in UTF-8. Every single tag needed to be 'converted' to UTF-8 before I could add the UTF-8 data/string.
The XML and XSLT files started with <?xml version="1.0" encoding="UTF-8"?>, so the XSLT parser EXPECTED the file to BE UTF-8 and got confused by certain "invalid characters"; the parser was MSXML from Microsoft.

It *may* happen that, due to the absence of the encoding in this "boundary header", your mail-builder script assumes UTF-16, because it may also "parse" the file, find the Unicode magic header (which you added manually) and add this to the final boundary item.

So what you have to do is make your attachment/body UTF-* in total, not just the data. If it's UTF-8 you can skip the magic header; it is only needed for UTF-16, to determine the byte order.
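
As a hedged sketch of "the whole part in one encoding", here is how such a message could be built with Python's standard email package instead of writing the headers by hand. The build_mhtml name and the sample HTML are assumptions for illustration; the real program runs on Unix and may work quite differently. The whole HTML part, tags included, is encoded as UTF-8 and its charset is declared in the part's own header.

```python
from email import message_from_bytes
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Sample document; \u5730\u7528 are the two Chinese characters from the bill.
html = ('<html><head><title>Bill</title></head>'
        '<body><p class="F0">Chinese: \u5730\u7528</p></body></html>')


def build_mhtml(html_text: str) -> bytes:
    """Build a multipart/related message whose HTML part is entirely UTF-8."""
    # type="text/html" names the media type of the root part (no charset here).
    msg = MIMEMultipart("related", type="text/html")
    msg.preamble = "Text displayed only to non-MIME-compliant mailers"
    # MIMEText encodes tags and data alike and writes charset="utf-8" into
    # the part header; for UTF-8 it picks a base64 transfer encoding itself.
    msg.attach(MIMEText(html_text, "html", "utf-8"))
    return msg.as_bytes()


raw = build_mhtml(html)

# Reading the message back shows what a conformant client would see.
parsed = message_from_bytes(raw)
root = parsed.get_payload()[0]
decoded = root.get_payload(decode=True).decode(root.get_content_charset())
```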

Same will be necessary for an external CSS. As I already said: Netscape WANTS this file to have the same encoding if the CSS adds content to the HTML.

Hope this helps.

CirTap

PS: maybe you provide a URL with the html and css files in an archive (ZIP, TAR, GZ will do) so I can see what data you actually have.
 
Capslock (Author) Commented:
Hi Cirtap,
Thanks for the good response. Here I am attaching the code of the MHTML document.

Content-Type: multipart/related; boundary="==boundary-1"; type="text/html

Text displayed only to non-MIME-compliant mailers
--==boundary-1
Content-Type: text/html; charset=utf-16
Content-Transfer-Encoding: 16bit

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN" "http://www.w3.org/TR/REC-html40/strict.dtd">
<html>
<head>
<title>HTML bill for customer no. </title>
<meta name="customer-id" content=" ">
<style type="text/css">
<!--
BODY {background: #ffffff;color: #000000;}
P.F0 {font-size: 13px;font-family: Arial Unicode MS;color: #000000;}
p
{
position: absolute;
width: auto;
z-index : 1
}
  --></style></head>
<body>
<div style="width:1262">
<p style="top:204;left:24;" class="F0">WORK FIELD</p>
<p style="top:159;left:135;" class="F0">English Name</p>
<p style="top:204;left:147;" class="F0">English Name</p>
<p style="top:121;left:150;" class="F0">English</p>
<p style="top:1103;left:240;" class="F0">End of BILL</p>
<p style="top:122;left:261;" class="F0">English char 1</p>
<p style="top:159;left:304;" class="F0">E</p>
<p style="top:55;left:360;" class="F0">This is a test format to test UTF8</p>
<p style="top:124;left:420;" class="F0">Chinese</p>
<p style="top:159;left:422;" class="F0">地用</p>
<p style="top:205;left:422;" class="F0">地用</p>
<p style="top:125;left:552;" class="F0">Chinese char 1</p>
<p style="top:159;left:612;" class="F0">地</p></div>
<div style="width:1262"></div></body></html>
 
Capslock (Author) Commented:
Hi Cirtap,
I checked today: even if I don't put the magic header in the MHTML file, it opens correctly in IE. The only problem I have now is the MHTML header I use.

1) The MHTML file opens correctly in IE but not in Netscape if I use the header below

Content-Type: multipart/related; boundary="==boundary-1"; type="text/html

Text displayed only to non-MIME-compliant mailers
--==boundary-1
Content-Type: text/html; charset=utf-16
Content-Transfer-Encoding: 16bit

2) The MHTML file opens correctly in Netscape but not in IE if I use the header below

Content-Type: multipart/related; boundary="==boundary-1"; type="text/html

Text displayed only to non-MIME-compliant mailers
--==boundary-1
Content-Type: text/html; charset=utf-8
Content-Transfer-Encoding: 8bit

As you can see, the only difference is the charset and the encoding. Can you help me with this situation?





 
cirtap Commented:
Hi Capslock,

actually there are two issues. The mail format and the content format. The first tells the client what is in the mail and how it's encoded, the second tells the "viewer" how to handle this data (in the given encoding), in your case a HTML file.

Mail format:
I suppose the very first line
 Content-Type: multipart/related; boundary="==boundary-1"; type="text/html
may cause trouble, too but may not necessarily be the main issue (see below).

I checked a couple of mailbox files, and I found this (let's call it the "primary mail header") to be either
 Content-Type: multipart/mixed; boundary="part1_26.blabla.boundary"
or
 Content-Type: multipart/alternative; boundary="part1_26.blabla.boundary"
when there was more than just plain text in it.
(boundary names change of course)

Maybe there 'is' a multipart/related, but I never found a
  type="mime/type"
following the multipart/* entry. This makes sense in a "multipart" file: which part of the message would the initial text/html belong to, if every boundary has its own specific type in its own header?

Followed by two line feeds (\n\n) there's usually the "plaintext" body, which may be left empty or just a single line, often encoded in 7bit, quoted-printable (US-ASCII) or 8bit (iso-####-#) whatever it needs.

Then in every "boundary sub-header" you have the actual type, charset and encoding, each followed by two \n\n and the actual content -- which is what you have, so this part should be fine.

The Content format
At this point the mail-client "knows" what parts the mail is made of and launches a "viewer" for each of the parts. For MSIE or NS this will be the browser in most cases, and as we enter "the magic world of HTML" here, there's ONE thing that may be another cause of your trouble: The DOCTYPE!

I copied your HTML fragment, saved it both as UTF-8 and UTF-16, and loaded it into NS6: the CSS didn't work; everything appeared in a single line. Then I removed the DOCTYPE (which declares this file as HTML 4.0 STRICT!) and all was fine.

Now we come back to the file encoding. One can say a Unicode text file becomes a "binary" file, so its content must follow the "format" given in the "file header" (like a GIF starts with "GIF89a" or the like, followed by the binary data).

The "physical" UTF-16 file contained the magic header I was talking about before, to tell the "HTML viewer" (~ browser) how this file is coded, and the WHOLE FILE was UTF-16! EVERY character was double-byte. And please note: characters and bytes are two different things!
The UTF-16 file looked kinda like this when I loaded it in ANSI mode into my text editor:
< h t m l >
< h e a d >
< t i t l e > B i l l   f o r   C u s t o m e r ... etc. pp.
incl. the CSS part of course!!
You get the picture?

From a certain point of view, Mozilla/Netscape is doing the right job: you have ANSI characters only (except for a few entries) but declare the document to contain double bytes, so every second byte(!) is "sacrificed" to conform with the "file header": UTF-16.

Example:
the character codes for the first two "bytes" in <html>, '<' and 'h', are
  in 8bit: \x3C and \x68 == 60 / 104 == &#60; / &#104;
  interpreted as one little-endian UTF-16 unit they become \x683C == &#26684;
a Chinese (Han) character.
This happens to all 'bytes' in your file including the inline CSS, so the latter becomes invalid: there is no longer a sequence of four 8bit characters "P.F0" (\x50 \x2E \x46 \x30) but only TWO 16bit characters, \x2E50 and \x3046 in the same byte order, and they are not known to be valid CSS rules :-)
A similar thing happens when you say its UTF-8 but the data is still ANSI.
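
The effect is easy to reproduce. A small Python illustration, assuming little-endian byte order for the mislabelled file (the big-endian reading is shown for comparison):

```python
# Six single-byte characters, as an ANSI/ASCII editor would save them.
raw = b"<html>"

# A client that trusts a UTF-16 label pairs the bytes up: 3 characters, not 6.
misread = raw.decode("utf-16-le")
first = misread[0]            # bytes 3C 68 read as one unit: U+683C
first_code = ord(first)       # decimal value of that unit

# The CSS selector "P.F0" collapses into two unrelated characters.
selector_le = b"P.F0".decode("utf-16-le")
selector_be = b"P.F0".decode("utf-16-be")

# Only a file that really was saved as UTF-16 decodes back to the markup.
roundtrip = "<html>".encode("utf-16-le").decode("utf-16-le")

print(len(raw), "bytes read as", len(misread), "characters")
```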

I saved the file in 8bit ANSI and then added the FF FE header manually: very interesting result in both Mozilla and MSIE :-)  If only I could read Chinese...

Now for the DOCTYPE:
All new browsers (version 6 and up) read the DOCTYPE to switch between so-called "quirks mode" and "standards mode". It's typical of MSIE to be "nice" and render the file even if it's NOT what the DOCTYPE declares.
As you declared HTML 4.0 strict, Mozilla assumed your CSS to be correct as well, but it was not: you omitted the units for the inline top, left and width values.
Removing the DOCTYPE makes conformant browsers switch to "quirks mode", and they'll accept lengths without units.
Keeping it requires your HTML and CSS to be 'VALID'.

<div style="width:1262px">
<p style="top:204px;left:24px;" class="F0">WORK FIELD</p>
<p style="top:159px;left:135px;" class="F0">English Name</p>
<p style="top:204px;left:147px;" class="F0">English Name</p>
<p style="top:121px;left:150px;" class="F0">English</p>
<p style="top:1103px;left:240px;" class="F0">End of BILL</p>
...

Netscape/Mozilla, Opera and maybe other "standard conforming" HTML User-Agents are less forgiving.
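
Since the HTML is produced by a program, the missing units could also be added in one pass. A rough sketch in Python, assuming only top, left and width appear without units as in the posted fragment; the add_px helper is made up for this example and is no substitute for emitting valid CSS in the first place.

```python
import re

# Matches a unitless top/left/width value directly before ';', '"' or the end
# of the string. The lookbehind keeps names like z-index or max-width out.
UNITLESS = re.compile(r'(?<![-\w])(top|left|width)\s*:\s*(\d+)(?=\s*(?:;|"|$))')


def add_px(markup: str) -> str:
    """Append 'px' to unitless top, left and width values."""
    return UNITLESS.sub(r"\1:\2px", markup)


fixed = add_px('<p style="top:204;left:24;" class="F0">WORK FIELD</p>')
print(fixed)
```

Values that already carry a unit, and keywords such as width: auto, are left untouched.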

So in conclusion: you declared all of your data to be something specific, but none of it was :-)
 - your Unicode file/data was ANSI
 - your HTML4 strict was HTML3.2 or HTML4 transitional


A final note on this particular piece of data:
it appears to be tabular data, so why not use a TABLE?


Another suggestion:
If you only need a couple of Unicode characters in your file, use numeric character references for them, like &#8220; (this is the decimal value of a left curly quote in Unicode; the hexadecimal form is &#x201C;).
This way you use double-byte characters only where necessary, and reduce the file size at the same time.
UTF-16 will always double the file size, even if there are only ANSI "letters" in it.
If the document is 99% "Latin 1" and ten characters are Unicode, it'll be worth getting their character codes and using entities instead. You may still need CSS to assign a proper Unicode font for them (as you did).
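
Converting the few non-ASCII characters into numeric references can be automated. A minimal Python sketch, using the standard "xmlcharrefreplace" error handler; the to_ascii_html name is just an assumption for this example.

```python
import html


def to_ascii_html(text: str) -> str:
    """Replace every non-ASCII character with a decimal numeric reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# \u5730 is one of the Chinese characters from the posted bill.
line = to_ascii_html('<p class="F0">\u5730</p>')
print(line)

# A browser (or html.unescape) turns the reference back into the character.
restored = html.unescape(line)
```

The resulting file is pure ASCII, so it displays the same whether it is labelled iso-8859-1 or UTF-8.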

Hope this helps

CirTap

PS: I still don't know why you're using this "MHTML" term. You do have an ordinary HTML file; there's no "microsoftific" content (<meta>) or data in it. Or is the "M" standing for MIME?
 
cirtap Commented:
Hi Capslock,

do you need some more info?
Still testing?
Disappointed? =)

Regards,

CirTap
 
COBOLdinosaur Commented:
This question has been classified abandoned. I will make a recommendation to the
moderators on its resolution in a week or two. I appreciate any comments
that would help me to make a recommendation.

<note>
Unless it is clear to me that the question has been answered I will recommend delete.  It is possible that a Grade less than A will be given if no expert makes a case for an A grade. It is assumed that any participant not responding to this request is no longer interested in its final disposition.
</note>

If the user does not know how to close the question, the options are here:
http://www.experts-exchange.com/help/closing.jsp


Cd&