  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 384

Displaying Numeric Character Entities

I’m having a problem displaying numeric character entities such as “—” (em dash, & + #151;) in my Java application. I have noticed that some characters show up correctly and some do not; for example, “}” (right curly brace, & + #125;) shows up just fine. I’m having the same problem with Japanese numeric character entities such as “式” (& + #24335;). The encoding being used here is UTF-8.

I am reading the text that contains these entities from an XML file using the following code:

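    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.xml.sax.InputSource;

    // theFile is defined elsewhere (a File or a path String)
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    dbf.setNamespaceAware(true);
    dbf.setIgnoringComments(true);
    dbf.setIgnoringElementContentWhitespace(true);

    try {
      // Decode the file as UTF-8 and hand the characters to the parser
      FileInputStream fff = new FileInputStream(theFile);
      InputSource inSource = new InputSource(new InputStreamReader(fff, "UTF-8"));

      DocumentBuilder db = dbf.newDocumentBuilder();
      Document doc = db.parse(inSource);
    } catch (Exception e) {
      System.out.println("Exception: " + e);
    }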

Here is a simple example of an XML file:

    <?xml version="1.0" encoding="UTF-8"?>
    <toc name="Course One" file="course1/toc.xml">
    <topic name="Topic &#151; One" file="course1/source/topic1.html"/>
    <topic name="&#24335; (Japanese char)" file="course1/source/topic2.html"/>
    </toc>

When I print out the contents of the name attribute, I get boxes (or a ?) in place of the character entity. So my question is: what can I do to read this file in and see the correct characters in the Java application? Any help that can be offered here would be greatly appreciated.
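One way to narrow this down is to check what the parser actually produced, independent of fonts and console settings. The following is only a minimal sketch: the class name `EntityCheck` and its helper methods are made up for illustration, and the XML is a trimmed-down version of the sample file passed in as a string rather than read from disk. It prints each character of the name attribute as a U+XXXX value.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original application.
class EntityCheck {
    // Parses the XML text and returns the "name" attribute of the first <topic>.
    static String firstTopicName(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            DocumentBuilder db = dbf.newDocumentBuilder();
            Document doc = db.parse(new InputSource(new StringReader(xml)));
            Element topic = (Element) doc.getElementsByTagName("topic").item(0);
            return topic.getAttribute("name");
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Renders each char as U+XXXX so the result does not depend on
    // fonts or on the console encoding.
    static String toCodePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String xml = "<toc name=\"Course One\">"
                   + "<topic name=\"&#24335; (Japanese char)\"/>"
                   + "</toc>";
        System.out.println(toCodePoints(firstTopicName(xml)));
    }
}
```

If the first value printed is U+5F0F, the parser has resolved the entity correctly and the problem lies somewhere on the display side.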

Thanks,

David
vanfleet
Asked:
2 Solutions
 
aozarovCommented:
I think you should pick the encoding that matches your system,
e.g. instead of encoding="UTF-8" use encoding="Cp930" for Japanese.
See http://java.sun.com/j2se/1.3/docs/guide/intl/encoding.doc.html for the encodings Java supports (and pick the one you need).
 
vanfleetAuthor Commented:
Thanks for your comment. In the case of the em dash entity (& #151;) I think the encoding is the problem; I don’t think that it's a valid entity for that encoding (UTF-8). If I use "& #x2014;" in its place, it works.

However, in the case of the Japanese character “& #24335;” it is a valid numeric character entity for UTF-8. So, I don't understand why the application won't display it? We are processing all of our output into UTF-8 because it supports all languages.
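A note on why & #151; and & #x2014; behave differently: a numeric character reference names a Unicode code point, regardless of the file's encoding. Code point 151 is U+0097, a C1 control character, while the em dash is U+2014. The value 151 (0x97) is the em dash's byte in Windows-1252, which is probably where that number comes from, and browsers remap such references for compatibility. The sketch below illustrates the difference; the class name `DashCodePoints` and its helper are made up for this example.

```java
import java.nio.charset.Charset;

// Hypothetical demo class, not part of the original application.
class DashCodePoints {
    // Decodes a single byte in the given charset and returns the
    // resulting Unicode code point.
    static int decodeByte(int b, String charsetName) {
        String s = new String(new byte[] { (byte) b }, Charset.forName(charsetName));
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        // Numeric character references name Unicode code points directly.
        System.out.printf("&#151;   -> U+%04X%n", 151);     // a C1 control character
        System.out.printf("&#x2014; -> U+%04X%n", 0x2014);  // EM DASH

        // Byte 0x97 is an em dash only when read as Windows-1252.
        System.out.printf("0x97 as windows-1252 -> U+%04X%n",
                decodeByte(0x97, "windows-1252"));
        System.out.printf("0x97 as ISO-8859-1   -> U+%04X%n",
                decodeByte(0x97, "ISO-8859-1"));
    }
}
```

This assumes the JRE ships the windows-1252 charset, which is the case for the standard JDK builds I am aware of.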

Thanks,

David

 
JakobACommented:
Are you sure the font you are using has that character? There are not many fonts yet that cover the entire Unicode character set.
 
aozarovCommented:
Adding to JakobA's comment: you can check whether the string actually holds the right value by comparing charAt to 24335 -> '\u5F0F'.
 
vanfleetAuthor Commented:
OK, I did compare the characters with the following line as you suggested:

    System.out.println("name: " + name + " - " + new Character(name.charAt(0)).compareTo(new Character('\u5F0F')));

For the Japanese character it returned a 0 which I think tells me that the two are identical.
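A result of 0 from compareTo does mean the two characters are equal, so the string in memory holds U+5F0F and the substitution must happen on output. Boxes usually point to a missing glyph in the font, while a "?" often means the character was lost when the text was encoded for the console. The sketch below shows only the second effect, using US-ASCII as a stand-in for a console encoding that cannot represent the character; `ConsoleEncodingCheck` and `encodeToHex` are names invented for this example.

```java
import java.nio.charset.Charset;

// Hypothetical demo class, not part of the original application.
class ConsoleEncodingCheck {
    // Encodes the text in the given charset and returns the bytes as hex.
    static String encodeToHex(String text, String charsetName) {
        byte[] bytes = text.getBytes(Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String name = "\u5F0F";
        // UTF-8 can represent the character, so its real bytes come out.
        System.out.println("UTF-8:    " + encodeToHex(name, "UTF-8"));
        // US-ASCII cannot, so the encoder substitutes '?' (0x3F).
        System.out.println("US-ASCII: " + encodeToHex(name, "US-ASCII"));
        // The charset System.out typically uses by default on this machine.
        System.out.println("Default:  " + Charset.defaultCharset());
    }
}
```

If the default charset reported on the affected machine cannot represent Japanese text, that would be consistent with the "?" seen when printing.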

What I don’t understand is that if I look at these characters in IE they are displayed just fine, so my system is finding the correct fonts. Where is the JDK getting its fonts from?

On Monday I will be testing this with a system booted into the Japanese locale. If it works there, I’m not going to worry about this any more.

Thanks.

dv
 
aozarovCommented:
Try including i18n.jar in your classpath when you run those tests.
 
vanfleetAuthor Commented:
I don't see an i18n.jar file in my JDK 1.4.2 or 1.5.0 environments. However, I do see a file called charsets.jar, which is also referenced at the following URL:

http://java.sun.com/j2se/1.5.0/docs/guide/intl/encoding.doc.html

I would think that this file would be automatically added to my classpath along with all the others. I did try manually adding it, but it made no difference.

I did find out this morning that when XP is booted into the Japanese locale the characters do show up correctly. I am still having a problem with localized text from the properties file not showing up correctly, but I think this is beyond the scope of my original question. If I can't solve this problem I'll post a new question for this issue. Unless there are any objections, I will go ahead and close this question and split the points between the two of you who have posted comments.

Thanks for your help.

dv

 
aozarovCommented:
Yes, i18n.jar is called charsets.jar since 1.4, and I think you are right that there is now no need to add it to your classpath.
I have no objection to closing this thread.