Solved

Java read and display html with encoding specified

Posted on 2010-11-24
Last Modified: 2012-06-21
Hi,

I want to read an HTML file that contains Chinese characters, but my code doesn't return the Chinese characters correctly. I don't think the issue is the language of the content; I think it is the method I am using.

My method is:

A) use the function, readFileAsString(String filePath), to read the whole html file into a string;
B) use new String(byte[], "gb2312") function to encode the whole html content string;

My questions are:

1) Should I specify the charset during A or B? What is the difference between using the charset in A and in B?

2) If the whole HTML file is read into byte[], rather than a String variable, will the order of using the "gb2312" charset in A or B be different?

3) I have already used the following code (using inputStream variable) to encode HTML file, and it works. But I don't know why inputStream works but String doesn't.

  public static String convertStreamToString(InputStream is) throws Exception {
        byte[] bytes=new byte[is.available()];
        is.read(bytes);
        String s1 = new String(bytes, "UTF8");
        if(s1.contains("¿¿¿"))
              System.out.println("bingo");
        String s = new String(bytes, "GB2312");
        if(s.contains("¿¿¿"))
              System.out.println("bingo");
        //System.out.println(s);
        return s;
  }

Thanks a lot.
import java.io.BufferedReader;
import java.io.FileReader;

public class TestChineseWithJsoup {


	public static void main(String[] args) throws Exception {

	  	String name = "c:\\iWeb2\\data\\ch02\\96747.html";
	    String htmltext=readFileAsString(name);
	    System.out.println(new String(htmltext.getBytes(),"gb2312"));

	}
	
	public static String readFileAsString(String filePath) throws java.io.IOException{
	        StringBuffer fileData = new StringBuffer();
	        BufferedReader reader = new BufferedReader(
	                new FileReader(filePath));
	        char[] buf = new char[1024];
	        int numRead=0;
	        while((numRead=reader.read(buf)) != -1){
	            String readData = String.valueOf(buf, 0, numRead);
	            fileData.append(readData);
	            buf = new char[1024];
	        }
	        reader.close();
	        return fileData.toString();
	}

}

Question by:wsyy
44 Comments
 
LVL 92

Expert Comment

by:objects
ID: 34210209
looks like your problem is in readFileAsString(), you need to specify the encoding of the file you are reading

              BufferedReader reader = new BufferedReader(
                      new InputStreamReader(new FileInputStream(filePath), encoding));
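
A minimal sketch of the suggestion above, assuming Java 7 or later. The class name ReadWithCharset and the sample text are made up for illustration, and the GB2312 charset is assumed to be available in the JDK in use. The only point is that the charset is handed to InputStreamReader, so decoding no longer depends on the platform default.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

class ReadWithCharset {

    // Decodes the whole file with the given charset instead of the platform default.
    static String readFileAsString(Path file, Charset charset) throws IOException {
        StringBuilder text = new StringBuilder();
        try (InputStream in = Files.newInputStream(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            char[] buf = new char[1024];
            int numRead;
            while ((numRead = reader.read(buf)) != -1) {
                text.append(buf, 0, numRead);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample: write two Chinese characters as GB2312, then read them back.
        Charset gb2312 = Charset.forName("GB2312");
        String original = "<p>\u4e2d\u6587</p>";
        Path tmp = Files.createTempFile("sample", ".html");
        Files.write(tmp, original.getBytes(gb2312));

        String text = readFileAsString(tmp, gb2312);
        System.out.println(text.equals(original));
        Files.delete(tmp);
    }
}
```

The string returned by readFileAsString is already correct Unicode text, so it can be printed directly; no second conversion is needed.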
0
 

Author Comment

by:wsyy
ID: 34210293
objects, I changed the code and it still doesn't work. But thanks for the input.
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

public class TestChineseWithJsoup {


	public static void main(String[] args) throws Exception {

	  	String name = "c:\\iWeb2\\data\\ch02\\96747.html";
	    String htmltext=readFileAsString(name);
	    System.out.println(new String(htmltext.getBytes(),"gb2312"));

	}
	
	public static String readFileAsString(String filePath) throws java.io.IOException{
	        StringBuffer fileData = new StringBuffer();
	        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(filePath), "utf8"));
	        char[] buf = new char[1024];
	        int numRead=0;
	        while((numRead=reader.read(buf)) != -1){
	            String readData = String.valueOf(buf, 0, numRead);
	            fileData.append(readData);
	            buf = new char[1024];
	        }
	        reader.close();
	        return fileData.toString();
	}

}


96747.html
0
 
LVL 92

Expert Comment

by:objects
ID: 34210579
the file is gb2312, not utf8

you can simplify the code to read the file

http://helpdesk.objects.com.au/java/how-do-i-read-a-text-file-line-by-line

or you could just read it into a byte array

http://helpdesk.objects.com.au/java/how-do-i-read-the-contents-of-a-file-into-a-byte-array

and create your string from that byte array (with required encoding)
0
 

Author Comment

by:wsyy
ID: 34210651
No luck even if I change utf8 to gb2312.
0
 
LVL 92

Expert Comment

by:objects
ID: 34210895
>           System.out.println(new String(htmltext.getBytes(),"gb2312"));

did you get rid of that?

And are you using a font that supports Chinese?
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 34211586
The file is actually gb18030, so try
Reader in = new InputStreamReader(new FileInputStream("96747.html"), "gb18030");


0
 
LVL 86

Expert Comment

by:CEHJ
ID: 34211611
Attached below as utf-8:
chin-utf8.html
0
 

Author Comment

by:wsyy
ID: 34213243
CEHJ, when I used your chin-utf8.html as input and changed "gb2312" to "utf8" (two places) the content can be shown correctly.

However, if I used the old html file and changed "utf8" to "gb2312" in my code, it didn't work.

By the way, you suggested that I use

Reader in = new InputStreamReader(new FileInputStream("96747.html"), "gb18030");

I want to know

1)  how you know it is gb18030.
2) why you suggest using Reader in.
0
 

Author Comment

by:wsyy
ID: 34213485
objects,

you commented the below:

>           System.out.println(new String(htmltext.getBytes(),"gb2312"));

did you get rid of that?

And are you using a font that supports chinese


I just tried getting rid of System.out.println(new String(htmltext.getBytes(),"gb2312"));, and replacing it with System.out.println(htmltext);, and it worked.

Now I am more confused (please see code examples for my following two questions)

1) when I use correct charset in the readFileAsString, why it won't work if I specify the correct charset in the System.out.println statement, and why it works if I don't specify the charset?

2) if I didn't specify the correct charset in the readFileAsString function but specified the correct charset in the System.out.println statement, why it didn't work? DOES THE readFileAsString (NO CHARSET SPECIFIED) CORRUPT THE STRING READ FROM A FILE?

Thanks
Code for question 1:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

public class TestChineseWithJsoup {


	public static void main(String[] args) throws Exception {

	  	String name = "c:\\iWeb2\\data\\ch02\\96747.html";
	    String htmltext=readFileAsString(name);
	    //System.out.println(htmltext);
	    System.out.println(new String(htmltext.getBytes(),"gb2312"));

	}
	
	public static String readFileAsString(String filePath) throws java.io.IOException{
	        StringBuffer fileData = new StringBuffer();
	        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(filePath), "gb2312"));
	        char[] buf = new char[1024];
	        int numRead=0;
	        while((numRead=reader.read(buf)) != -1){
	            String readData = String.valueOf(buf, 0, numRead);
	            fileData.append(readData);
	            buf = new char[1024];
	        }
	        reader.close();
	        return fileData.toString();
	}

}

Code for example 2:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileReader;
import java.io.InputStreamReader;

public class TestChineseWithJsoup {


	public static void main(String[] args) throws Exception {

	  	String name = "c:\\iWeb2\\data\\ch02\\96747.html";
	    String htmltext=readFileAsString(name);
	    //System.out.println(htmltext);
	    System.out.println(new String(htmltext.getBytes(),"gb2312"));

	}
	
	public static String readFileAsString(String filePath) throws java.io.IOException{
	        StringBuffer fileData = new StringBuffer();
	        //BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(filePath), "gb2312"));
	        BufferedReader reader = new BufferedReader(new FileReader(filePath));
	        char[] buf = new char[1024];
	        int numRead=0;
	        while((numRead=reader.read(buf)) != -1){
	            String readData = String.valueOf(buf, 0, numRead);
	            fileData.append(readData);
	            buf = new char[1024];
	        }
	        reader.close();
	        return fileData.toString();
	}

}


0
 
LVL 86

Expert Comment

by:CEHJ
ID: 34213697
>>CEHJ, when I used your chin-utf8.html as input and changed "gb2312" to "utf8" (two places) the content can be shown correctly.

Good - that means you have Chinese font support
>>
I want to know

1)  how you know it is gb18030.

2) why you suggest using Reader in.
>>

1) Experimentation: gb2312 doesn't work as well as gb18030

2) Because using a Reader is the correct way of reading streams where a specific encoding needs to be used
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 34213708
>>BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(filePath), "gb2312"));

Change that to the following and you should be fine

BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(filePath), "gb18030"));


0
 

Author Comment

by:wsyy
ID: 34213852
CEHJ, when I changed "utf8" to "gb2312" in the readFileAsString function and printed the returned string without specifying the charset again, it worked. So the gb18030 charset is not the key issue. The key issue is why it didn't work when I specified the charset both in the readFileAsString function and in the following statement:

System.out.println(new String(htmltext.getBytes(),"gb2312"));

where htmltext is the string returned by the readFileAsString function.

Also CEHJ, I hope you may be able to help me out with the following question:

I have to read a file into byte[], and the charset of the file is unknown when I read it. But the charset is pivotal for my analysis later. So my steps are:

1) read the html file into byte[] fileByteArr

 byte[] fileByteArr = readFileAsByte(String filePath); (the readFileAsByte function works and is not the key point here).

2) convert the fileByteArr into a string variable and assign the default charset "utf8"

String fileStr = new String(fileByteArr, "utf8");

3) detect the charset of the html file by using the fileStr variable

String encoding = encodeDetect(fileStr); (the encodeDetect function works and is not the key point here).

4) convert the fileStr into a new string variable using the detected charset

String newFileStr = new String(fileStr.getBytes(),"gb2312"); (assuming gb2312 is the detected, correct charset).

After 4), I print the newFileStr and the Chinese characters are not displayed properly. However, if I use gb2312 rather than utf8 to convert byte[] into string fileStr, and if I print fileStr, the Chinese characters are displayed properly.

Do you know why? More importantly, how can I convert byte[] into string without knowing charset at the beginning, detect the charset using the string variable, and then convert the string variable into a correctly encoded string?

Thanks a lot.


0
 
LVL 92

Expert Comment

by:objects
ID: 34214485
> I just tried getting rid of System.out.println(new String(htmltext.getBytes(),"gb2312"));, and replacing it with System.out.println(htmltext);, and it worked.

great, so that is the solution :)

> Now I am more confused (please see code examples for my following two questions)

> 1) when I use correct charset in the readFileAsString, why it won't work if I specify the correct charset in the System.out.println statement, and why it works if I don't specify the charset?

because that line will treat the string as if it is encoded using the default platform encoding.

> 2) if I didn't specify the correct charset in the readFileAsString function but specified the correct charset in the System.out.println statement, why it didn't work? DOES THE readFileAsString (NO CHARSET SPECIFIED) CORRUPT THE STRING READ FROM A FILE?

because it was being read using the default encoding

Typically when you go between byte array and characters and you don't specify an encoding then the default is used

> how can I convert byte[] into string without knowing charset at the beginning, detect the charset using the string variable, and then convert the string variable into a correctly encoded string?

you can't convert a byte array to a string  without knowing the charset.
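
The two answers above can be reproduced in a small sketch. The class and method names are made up for illustration, and explicit charsets stand in for "the platform default" so the result is the same on every machine (UTF-8 plays the role of the default here, which is an assumption). reDecode mimics new String(htmltext.getBytes(), "gb2312"), and decode mimics reading the file with the wrong charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DoubleDecodeDemo {

    // Mimics new String(s.getBytes(), "x"): encode with one charset, decode with another.
    static String reDecode(String s, Charset encodeWith, Charset decodeWith) {
        return new String(s.getBytes(encodeWith), decodeWith);
    }

    // Mimics reading a file with an assumed charset that may not be the real one.
    static String decode(byte[] fileBytes, Charset assumed) {
        return new String(fileBytes, assumed);
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587"; // two Chinese characters, already correct Unicode

        // Question 1: the string was correct, then it was re-encoded and decoded again.
        // Same charset in both directions is harmless.
        System.out.println(reDecode(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8).equals(text)); // true
        // Different charsets mangle a string that was fine before.
        System.out.println(reDecode(text, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1).equals(text)); // false

        // Question 2: these are the GB2312 bytes of the same two characters.
        byte[] gbBytes = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};
        // Decoding them as UTF-8 yields replacement characters (U+FFFD); the original
        // bytes are gone from the string, so a later getBytes() cannot bring them back.
        String wrong = decode(gbBytes, StandardCharsets.UTF_8);
        System.out.println(wrong.contains("\uFFFD")); // true
    }
}
```

So a wrong charset at read time can indeed damage the text beyond repair, while a correct charset at read time means nothing more needs to be done before printing.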
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 34214798
Are you telling me that generally you don't know the encoding of a file that you need to read? If so, that's not a good state of affairs. Any attempts to read it will boil down to a more or less sophisticated process of guesswork. I suggest you avoid guesswork and  determine the encoding before you read it.

>>>>
>>CEHJ, when I used your chin-utf8.html as input and changed "gb2312" to "utf8" (two places) the content can be shown correctly.

Good - that means you have Chinese font support
>>>>

It also means something else: that the encoding of gb18030 is correct. That is what i used to read it. When i attempted to read it  using gb2312, parts of it were garbage
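
The kind of experimentation described above can be made a little more systematic with a strict decoder that reports errors instead of silently substituting characters. This is only a sketch: the class name and the sample bytes are made up, and the GB2312 and GB18030 charsets are assumed to be available in the JDK in use.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

class StrictDecodeCheck {

    // Returns true if every byte sequence is valid and mappable in the given charset.
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical input; in practice this would be the content of the file.
        byte[] bytes = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};
        for (String name : new String[] {"UTF-8", "GB2312", "GB18030"}) {
            System.out.println(name + ": " + decodesCleanly(bytes, Charset.forName(name)));
        }
    }
}
```

A clean decode only rules candidates out; it does not prove a charset is the right one. ISO-8859-1, for example, accepts any byte sequence, so this check is still guesswork of the kind warned about above.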
0
 

Author Comment

by:wsyy
ID: 34215280
CEHJ,

My situation is that I have to crawl hundreds of thousands of web pages whose charsets I don't know beforehand. So I guess I can read the page sources into a string using a default charset, say utf8, and then detect the charset from the string. If the detected charset is different from utf8, I would like to convert the string to a new string using the newly detected charset.

The challenge seems to be that once the first string is encoded by utf8, it can't be convert to other string with another charset.

I hope this clarifies my situation. I want to know if my approach stated above is possible, and if there is a workaround that lets me convert a string encoded by utf8 to a string encoded by gb2312, for instance.


0
 
LVL 92

Expert Comment

by:objects
ID: 34215303
you can generally get the encoding from the response

> The challenge seems to be that once the first string is encoded by utf8, it can't be convert to other string with another charset.

it can, your problem is that you keep using the default encoding (instead of the actual encoding) as I explained in my earlier comment
0
 
LVL 92

Expert Comment

by:objects
ID: 34215327
> The challenge seems to be that once the first string is encoded by utf8, it can't be convert to other string with another charset.

The string is actually unicode, the encoding is only used when converting it to/from a byte array

eg.

byte[] gb2312 = string.getBytes("gb2312");
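
To expand on that one-liner with a hedged sketch (the class and method names are made up for illustration): a Java String has no charset of its own, so "converting between encodings" always means decoding bytes into a String with the source charset and encoding the String back to bytes with the target charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class TranscodeDemo {

    // Converts bytes from one encoding to another by going through a String.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        String text = new String(input, from); // bytes -> Unicode characters
        return text.getBytes(to);              // Unicode characters -> bytes
    }

    public static void main(String[] args) {
        String text = "\u4e2d"; // one Chinese character
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16BE);

        // The same character has a different byte representation in each encoding.
        System.out.println(utf8.length + " bytes in UTF-8, " + utf16.length + " bytes in UTF-16BE");
        System.out.println(Arrays.equals(
                transcode(utf8, StandardCharsets.UTF_8, StandardCharsets.UTF_16BE), utf16));
    }
}
```

This only works when the "from" charset is the one the bytes were really written in; transcoding cannot repair bytes that were decoded with the wrong charset earlier.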

0
 
LVL 86

Expert Comment

by:CEHJ
ID: 34216583
>>I would like to convert the string to a new string using the newly detected charset.

There's no need to convert to String. All that's required is for the page to be decoded using the correct charset. How you do that depends on the api you're using to crawl. What is it?
0
 

Author Comment

by:wsyy
ID: 34219278
CEHJ, the crawler is nutch.
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 34219366
Well i'd be very surprised if Nutch didn't already read the page using the correct encoding
0
 

Author Comment

by:wsyy
ID: 34220276
Nutch uses windows-1252 as the default charset and then guesses the charset.
0
 
LVL 92

Expert Comment

by:objects
ID: 34220285
how is nutch related to the code you are running?

As I mentioned above the charset can be gotten from the http response

What is your current confusion as I believe I have already answered your question
0
 

Author Comment

by:wsyy
ID: 34220442
objects, can you please provide an example of how http response gets the charset?

I actually have the content of an html page in a byte[] variable. How can I detect the charset in the byte[] variable? Do you have example code which is what I really need?

By the way, Nutch doesn't use http response to get the charset.

Please see the attached code. The following statements are to detect charset:

EncodingDetector detector = new EncodingDetector(conf);
      detector.autoDetectClues(content, true);
      detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");
      String encoding = detector.guessEncoding(content, defaultCharEncoding);






DocumentFragment root;
    try {
      byte[] contentInOctets = content.getContent();
      InputSource input = new InputSource(new ByteArrayInputStream(contentInOctets));

      EncodingDetector detector = new EncodingDetector(conf);
      detector.autoDetectClues(content, true);
      detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");
      String encoding = detector.guessEncoding(content, defaultCharEncoding);

      metadata.set(Metadata.ORIGINAL_CHAR_ENCODING, encoding);
      metadata.set(Metadata.CHAR_ENCODING_FOR_CONVERSION, encoding);

      input.setEncoding(encoding);
      if (LOG.isTraceEnabled()) { LOG.trace("Parsing..."); }
      root = parse(input);
    } catch (IOException e) {
      return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
    } catch (DOMException e) {
      return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
    } catch (SAXException e) {
      return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
    } catch (Exception e) {
      e.printStackTrace(LogUtil.getWarnStream(LOG));
      return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
    }


0
 
LVL 92

Expert Comment

by:objects
ID: 34220460
> I actually have the content of an html page in a byte[] variable. How can I detect the charset in the byte[] variable? Do you have example code which is what I really need?
 
the http response contains more information than the page eg. it also contains http headers

Why do you think you need if nutch has already detected it?

> By the way, Nutch doesn't use http response to get the charset.

It does, one of the techniques the detector use it to check response
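
As an illustration of checking the response, here is a sketch that pulls the charset out of a Content-Type header value such as the one returned by java.net.URLConnection.getContentType(). It is not Nutch's code; the class name, method name and regular expression are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContentTypeCharset {

    // Matches charset=name, with or without double quotes around the name.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in a Content-Type value, or null if there is none.
    static String charsetFromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(charsetFromContentType("text/html; charset=gb2312")); // gb2312
        System.out.println(charsetFromContentType("text/html"));                 // null
    }
}
```

Many servers send no charset parameter, or send a wrong one, so the header is one clue among several rather than a guarantee.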
0
 

Author Comment

by:wsyy
ID: 34220658
>Why do you think you need if nutch has already detected it?

I want to write my own parser for Chinese.

>It does, one of the techniques the detector use it to check response

Where do you see it?
0
 
LVL 92

Expert Comment

by:objects
ID: 34220684
> I want to write my own parser for Chinese.

do you mean to parse the Nutch output? If so then I think you should be checking the meta data stored in nutch (parse_data segment I think)

> Where do you see it?

we've used nutch on a few occasions in the past
0
 

Author Comment

by:wsyy
ID: 34220868
Actually I don't mean to parse the Nutch output, but want to handle HTML parse by myself. The input to my own parser is a byte[] variable contentInOctets (please see below).

byte[] contentInOctets = content.getContent();

The contentInOctets variable contains the contents of the HTML page source.

I want to detect charset myself using this contentInOctets variable. Please help.
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 34221541
That sounds all rather strange, but you can use Content.getContentType
0
 
LVL 92

Expert Comment

by:objects
ID: 34221563
> That sounds all rather strange, but you can use Content.getContentType

have already mentioned that a few times
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 34221643
>>have already mentioned that a few times

Where?
0
 

Author Comment

by:wsyy
ID: 34222302
objects, I output the Content.getContentType, but it is not the charset. The variable detects the mime type which then is used to guess the charset by Nutch.
0
 

Author Comment

by:wsyy
ID: 34222316
I think the broad issue is a circular one. I mean if the crawler doesn't know the charset beforehand, the crawler and the parser can't convert the page source to a correctly encoded string. At the same time, without knowing the correctly encoded string, the crawler and the parser don't know the charset and thus can't convert the string.
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 34222999
If you use org.apache.nutch.parse.html.HtmlParser, you'll find it uses the correct encoding (the one specified in the html headers)
0
 
LVL 92

Expert Comment

by:objects
ID: 34223776
>  I mean if the crawler doesn't know the charset beforehand,

most of the time the crawler does know the correct encoding, nothing you have posted suggests that it doesn't. And the fact that you can display the file correctly using my suggestions suggests nutch is using the correct encoding when saving the file

>  the crawler and the parser don't know the charset and thus can't convert the string.

they do know the charset and store it in the meta data as I explained earlier
0
 
LVL 92

Expert Comment

by:objects
ID: 34223809
>  I output the Content.getContentType, but it is not the charset.

you want getContentEncoding() for that


0
 

Author Comment

by:wsyy
ID: 34223832
CEHJ, I meant I would want to write my own parser, which means I don't use the HtmlParser containing the charset.

objects, I have used Nutch for a while, and the encoding is detected in the HtmlParser.java, and the variable carrying the encoding information is not in the Content.Type variable, but in the content variable which is passed into the getParse(Content content) in the .java file. Further, the encoding is included in the contentInOctets variable, which is defined by:

byte[] contentInOctets = content.getContent();

No more abstract explanation please. I am 100% sure of what I am saying.

I think now my question is: how can I detect the encoding from the contentInOctets variable while keeping contentInOctets intact, so that I can then convert it into a string?

0
 
LVL 92

Expert Comment

by:objects
ID: 34223840
>  and the variable carrying the encoding information is not in the Content.Type variable

never said it was, the content type is the mime type
the encoding is stored in the meta data (as I mentioned earlier)

You can access that meta using content.getMetadata()
0
 

Author Comment

by:wsyy
ID: 34223857
EncodingDetector detector = new EncodingDetector(conf);
      detector.autoDetectClues(content, true);
      detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");
      String encoding = detector.guessEncoding(content, defaultCharEncoding);


      metadata.set(Metadata.ORIGINAL_CHAR_ENCODING, encoding);
      metadata.set(Metadata.CHAR_ENCODING_FOR_CONVERSION, encoding);

      input.setEncoding(encoding);
      if (LOG.isTraceEnabled()) { LOG.trace("Parsing..."); }

objects, the encoding is stored in metadata after encoding detection, which is in the HtmlParser.java file. I want to write my own parser, which means I won't be able to use the code in bold type.
0
 
LVL 92

Accepted Solution

by:
objects earned 250 total points
ID: 34223907
if you're not using nutch then you need to detect it yourself, doing the same thing as nutch is doing. There's no magic method for getting it (or nutch would be using it)
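
A rough sketch of what detecting it yourself could look like for a byte[] such as contentInOctets. This is not Nutch's EncodingDetector, which combines several clues; it only looks for a charset declaration in a meta tag. The class name, the regular expression and the 4096-byte limit are assumptions made for this example, and it will not work for pages in encodings that are not ASCII-compatible, such as UTF-16.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaCharsetSniffer {

    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([a-zA-Z0-9_:.-]+)", Pattern.CASE_INSENSITIVE);

    // Looks for a charset declaration near the start of the page; returns null if none is found.
    static String sniffCharset(byte[] content) {
        int length = Math.min(content.length, 4096);
        // ISO-8859-1 maps every byte to exactly one char, so this provisional
        // decoding cannot fail, and the original byte array is left untouched.
        String head = new String(content, 0, length, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    // Decodes the original bytes with the sniffed charset, or with the fallback.
    static String decode(byte[] content, Charset fallback) {
        String name = sniffCharset(content);
        Charset charset = fallback;
        if (name != null) {
            try {
                charset = Charset.forName(name);
            } catch (IllegalArgumentException e) {
                // Unknown or illegal charset name: keep the fallback.
            }
        }
        return new String(content, charset);
    }

    public static void main(String[] args) {
        // Hypothetical page, encoded as GB2312 (assumed to be available in this JDK).
        String html = "<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=gb2312\"></head><body>\u4e2d\u6587</body></html>";
        byte[] contentInOctets = html.getBytes(Charset.forName("GB2312"));

        System.out.println(sniffCharset(contentInOctets));
        System.out.println(decode(contentInOctets, StandardCharsets.ISO_8859_1).equals(html));
    }
}
```

The point of the design is that detection works on a throwaway string, while the real decoding is always done on the original bytes, which avoids the circular problem of needing a correct string before knowing the charset. As noted elsewhere in this thread, the declared charset can itself be wrong, so a declaration is a hint, not proof.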

not sure what more details you need.
0
 
LVL 92

Expert Comment

by:objects
ID: 34223915
and we're getting off track from the original question which I have already answered.
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 34225021
>>CEHJ, I meant I would want to write my own parser, which means I don't use the HtmlParser containing the charset.

Not sure why you'd want to do that, but there's nothing to stop you from following the same principles as Nutch to do so. Of course, Nutch is guided by the charset in the headers. The problem in this particular case is that the incorrect content type is specified in the file, as i proved at

http:#34211586
http:#34211611

And you acknowledged that proof at http:#34214798
0
 
LVL 86

Assisted Solution

by:CEHJ
CEHJ earned 250 total points
ID: 34225023
That should have been http:#34211611
0
 

Author Closing Comment

by:wsyy
ID: 34233171
Sorry to give a low rating to these solutions, which indeed do not satisfy my needs.
0
 
LVL 92

Expert Comment

by:objects
ID: 34234112
That would be because your question had nothing to do with detecting the charset :)
Your question asked what the problem was with your code which I answered in my first comments.
0
