wsyy

Java read and display html with encoding specified

Hi,

I want to read an HTML file with Chinese characters inside, but my code can't return correct Chinese characters. However, I think the issue is not related to which language is inside, but to the method I am currently using.

My method is:

A) use the function, readFileAsString(String filePath), to read the whole html file into a string;
B) use new String(byte[], "gb2312") function to encode the whole html content string;

My questions are:

1) Should I specify the charset during A or B? What is the difference between using the charset in A and in B?

2) If the whole HTML file is read into byte[], rather than a String variable, will the order of using the "gb2312" charset in A or B be different?

3) I have already used the following code (using inputStream variable) to encode HTML file, and it works. But I don't know why inputStream works but String doesn't.

  public static String convertStreamToString(InputStream is) throws Exception {
        byte[] bytes=new byte[is.available()];
        is.read(bytes);
        String s1 = new String(bytes, "UTF8");
        if(s1.contains("¿¿¿"))
              System.out.println("bingo");
        String s = new String(bytes, "GB2312");
        if(s.contains("¿¿¿"))
              System.out.println("bingo");
        //System.out.println(s);
        return s;
  }

Thanks a lot.
import java.io.BufferedReader;
import java.io.FileReader;

public class TestChineseWithJsoup {


	public static void main(String[] args) throws Exception {

	  	String name = "c:\\iWeb2\\data\\ch02\\96747.html";
	    String htmltext=readFileAsString(name);
	    System.out.println(new String(htmltext.getBytes(),"gb2312"));

	}
	
	public static String readFileAsString(String filePath) throws java.io.IOException{
	        StringBuffer fileData = new StringBuffer();
	        BufferedReader reader = new BufferedReader(
	                new FileReader(filePath));
	        char[] buf = new char[1024];
	        int numRead=0;
	        while((numRead=reader.read(buf)) != -1){
	            String readData = String.valueOf(buf, 0, numRead);
	            fileData.append(readData);
	            buf = new char[1024];
	        }
	        reader.close();
	        return fileData.toString();
	}

}


Mick Barry

Looks like your problem is in readFileAsString(): you need to specify the encoding of the file you are reading.

              BufferedReader reader = new BufferedReader(
                      new InputStreamReader(new FileInputStream(filePath), encoding));
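
A fuller version of that change, as a minimal sketch only. It assumes Java 7 or later (for try-with-resources); the class name `ExplicitCharsetReader` and the `charset` parameter are illustrative and not from the original post.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

class ExplicitCharsetReader {

    // Reads the whole file, decoding its bytes with the charset the caller names.
    static String readFileAsString(String filePath, Charset charset) throws IOException {
        StringBuilder fileData = new StringBuilder();
        // try-with-resources closes the reader even if read() throws
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(filePath), charset))) {
            char[] buf = new char[1024];
            int numRead;
            while ((numRead = reader.read(buf)) != -1) {
                fileData.append(buf, 0, numRead);
            }
        }
        return fileData.toString();
    }
}
```

A call such as `readFileAsString(name, Charset.forName("gb2312"))` returns an ordinary Java String, so it can be printed directly; no second conversion is needed afterwards.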
wsyy

ASKER

objects, I changed the code and it still doesn't work. But thanks for the input.
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

public class TestChineseWithJsoup {


	public static void main(String[] args) throws Exception {

	  	String name = "c:\\iWeb2\\data\\ch02\\96747.html";
	    String htmltext=readFileAsString(name);
	    System.out.println(new String(htmltext.getBytes(),"gb2312"));

	}
	
	public static String readFileAsString(String filePath) throws java.io.IOException{
	        StringBuffer fileData = new StringBuffer();
	        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(filePath), "utf8"));
	        char[] buf = new char[1024];
	        int numRead=0;
	        while((numRead=reader.read(buf)) != -1){
	            String readData = String.valueOf(buf, 0, numRead);
	            fileData.append(readData);
	            buf = new char[1024];
	        }
	        reader.close();
	        return fileData.toString();
	}

}


96747.html
the file is gb2312, not utf8

you can simplify the code to read the file

http://helpdesk.objects.com.au/java/how-do-i-read-a-text-file-line-by-line

or you could just read it into a byte array

http://helpdesk.objects.com.au/java/how-do-i-read-the-contents-of-a-file-into-a-byte-array

and create your string from that byte array (with required encoding)
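
The byte-array route could look roughly like this. It is a sketch, not the code behind the links above: it assumes Java 7+ (`java.nio.file.Files`), and `BytesToString` and `decodeFile` are made-up names.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

class BytesToString {

    // Loads the raw bytes first, then decodes them exactly once with the named charset.
    static String decodeFile(String filePath, String charsetName) throws IOException {
        byte[] raw = Files.readAllBytes(Paths.get(filePath));
        return new String(raw, Charset.forName(charsetName));
    }
}
```

Keeping the `byte[]` around has one practical advantage: if the first charset turns out to be wrong, the same bytes can simply be decoded again with another one.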
wsyy

ASKER

No luck even if I change utf8 to gb2312.
>           System.out.println(new String(htmltext.getBytes(),"gb2312"));

did you get rid of that?

And are you using a font that supports Chinese?
The file is actually gb18030, so try
Reader in = new InputStreamReader(new FileInputStream("96747.html"), "gb18030");


Attached below as utf-8:
chin-utf8.html
wsyy

ASKER

CEHJ, when I used your chin-utf8.html as input and changed "gb2312" to "utf8" (two places) the content can be shown correctly.

However, if I used the old html file and changed "utf8" to "gb2312" in my code, it didn't work.

By the way, you suggested that I use

Reader in = new InputStreamReader(new FileInputStream("96747.html"), "gb18030");

I want to know

1)  how you know it is gb18030.
2) why you suggest using Reader in.
wsyy

ASKER

objects,

you commented the below:

>           System.out.println(new String(htmltext.getBytes(),"gb2312"));

did you get rid of that?

And are you using a font that supports chinese


I just tried getting rid of System.out.println(new String(htmltext.getBytes(),"gb2312"));, and replacing it with System.out.println(htmltext);, and it worked.

Now I am more confused (please see code examples for my following two questions)

1) when I use correct charset in the readFileAsString, why it won't work if I specify the correct charset in the System.out.println statement, and why it works if I don't specify the charset?

2) if I didn't specify the correct charset in the readFileAsString function but specified the correct charset in the System.out.println statement, why it didn't work? DOES THE readFileAsString (NO CHARSET SPECIFIED) CORRUPT THE STRING READ FROM A FILE?

Thanks
Code for question 1:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

public class TestChineseWithJsoup {


	public static void main(String[] args) throws Exception {

	  	String name = "c:\\iWeb2\\data\\ch02\\96747.html";
	    String htmltext=readFileAsString(name);
	    //System.out.println(htmltext);
	    System.out.println(new String(htmltext.getBytes(),"gb2312"));

	}
	
	public static String readFileAsString(String filePath) throws java.io.IOException{
	        StringBuffer fileData = new StringBuffer();
	        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(filePath), "gb2312"));
	        char[] buf = new char[1024];
	        int numRead=0;
	        while((numRead=reader.read(buf)) != -1){
	            String readData = String.valueOf(buf, 0, numRead);
	            fileData.append(readData);
	            buf = new char[1024];
	        }
	        reader.close();
	        return fileData.toString();
	}

}

Code for question 2:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileReader;
import java.io.InputStreamReader;

public class TestChineseWithJsoup {


	public static void main(String[] args) throws Exception {

	  	String name = "c:\\iWeb2\\data\\ch02\\96747.html";
	    String htmltext=readFileAsString(name);
	    //System.out.println(htmltext);
	    System.out.println(new String(htmltext.getBytes(),"gb2312"));

	}
	
	public static String readFileAsString(String filePath) throws java.io.IOException{
	        StringBuffer fileData = new StringBuffer();
	        //BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(filePath), "gb2312"));
	        BufferedReader reader = new BufferedReader(new FileReader(filePath));
	        char[] buf = new char[1024];
	        int numRead=0;
	        while((numRead=reader.read(buf)) != -1){
	            String readData = String.valueOf(buf, 0, numRead);
	            fileData.append(readData);
	            buf = new char[1024];
	        }
	        reader.close();
	        return fileData.toString();
	}

}


>>CEHJ, when I used your chin-utf8.html as input and changed "gb2312" to "utf8" (two places) the content can be shown correctly.

Good - that means you have Chinese font support
>>
I want to know

1)  how you know it is gb18030.

2) why you suggest using Reader in.
>>

1) Experimentation: gb2312 doesn't work as well as gb18030

2) Because using a Reader is the correct way of reading streams where a specific encoding needs to be used
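
One way to make that kind of experimentation repeatable is a strict decoder that reports bad input instead of silently substituting a replacement character. This is only a sketch; `StrictDecodeCheck` and `decodesCleanly` are hypothetical names, and it assumes the JVM ships the GB2312 and GB18030 charsets.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

class StrictDecodeCheck {

    // Returns true only if every byte sequence in raw is valid and mappable in the charset.
    static boolean decodesCleanly(byte[] raw, String charsetName) {
        try {
            Charset.forName(charsetName).newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw));
            return true;
        } catch (CharacterCodingException e) {
            return false; // at least one sequence is not valid in this charset
        }
    }
}
```

A `false` result rules a charset out, but a `true` result does not prove it is the right one: a single-byte charset such as ISO-8859-1 accepts any byte sequence at all.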
>>BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(filePath), "gb2312"));

Change that to the following and you should be fine

BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(filePath), "gb18030"));


wsyy

ASKER

CEHJ, when I changed "utf8" to "gb2312" in the readFileAsString function and printed the converted string without specifying the charset again, it worked. So charset gb18030 is not the key issue. The key issue is why it didn't work when I specified the charset both in the readFileAsString function and in the following statement:

System.out.println(new String(htmltext.getBytes(),"gb2312"));

whereas the htmltext is the returned string of the readFileAsString function.

Also CEHJ, I hope you may be able to help me out with the following question:

I have to read a file into byte[], and the charset of the file is unknown when I read it. But the charset is pivotal for my analysis later. So my approaches are:

1) read the html file into byte[] fileByteArr

 byte[] fileByteArr = readFileAsByte(String filePath); (the readFileAsByte function works and is not the key point here).

2) convert the fileByteArr into a string variable and assign the default charset "utf8"

String fileStr = new String(fileByteArr, "utf8");

3) detect the charset of the html file by using the fileStr variable

String encoding = encodeDetect(fileStr); (the encodeDetect function works and is not the key point here).

4) convert the fileStr into a new string variable using the detected charset

String newFileStr = new String(fileStr.getBytes(), "gb2312"); (assuming gb2312 is the detected, correct charset).

After 4), I print the newFileStr and the Chinese characters are not displayed properly. However, if I use gb2312 rather than utf8 to convert byte[] into string fileStr, and if I print fileStr, the Chinese characters are displayed properly.

Do you know why? More importantly, how can I convert byte[] into string without knowing charset at the beginning, detect the charset using the string variable, and then convert the string variable into a correctly encoded string?

Thanks a lot.


> I just tried getting rid of System.out.println(new String(htmltext.getBytes(),"gb2312"));, and replacing it with System.out.println(htmltext);, and it worked.

great, so that is the solution :)

> Now I am more confused (please see code examples for my following two questions)

> 1) when I use correct charset in the readFileAsString, why it won't work if I specify the correct charset in the System.out.println statement, and why it works if I don't specify the charset?

because htmltext.getBytes() encodes the string using the default platform encoding, and that line then decodes those bytes as if they were gb2312.

> 2) if I didn't specify the correct charset in the readFileAsString function but specified the correct charset in the System.out.println statement, why it didn't work? DOES THE readFileAsString (NO CHARSET SPECIFIED) CORRUPT THE STRING READ FROM A FILE?

because it was being read using the default encoding

Typically when you go between a byte array and characters and you don't specify an encoding, the default is used.
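
To see what `new String(htmltext.getBytes(), "gb2312")` does, the hidden default can be written out explicitly. A small sketch, with the platform default passed in as a parameter so the result does not depend on the machine; `DoubleDecodeDemo` and `reDecode` are illustrative names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DoubleDecodeDemo {

    // Same steps as new String(text.getBytes(), "gb2312"), with the default made visible.
    static String reDecode(String text, Charset platformDefault, Charset target) {
        byte[] encoded = text.getBytes(platformDefault); // what getBytes() does implicitly
        return new String(encoded, target);              // then decoded as a different charset
    }

    public static void main(String[] args) {
        Charset gb2312 = Charset.forName("GB2312");
        String original = "\u4e2d\u6587"; // two Chinese characters, written as escapes

        // Default differs from the target: the text comes back garbled.
        System.out.println(original.equals(reDecode(original, StandardCharsets.UTF_8, gb2312)));
        // Default equals the target: the round trip changes nothing.
        System.out.println(original.equals(reDecode(original, gb2312, gb2312)));
    }
}
```

So the extra conversion is harmless only when the platform default happens to match "gb2312", and in that case it is also pointless. With a default that cannot represent Chinese at all (ISO-8859-1, for example), `getBytes` replaces each character with `?` and the original text is gone for good.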

> how can I convert byte[] into string without knowing charset at the beginning, detect the charset using the string variable, and then convert the string variable into a correctly encoded string?

you can't convert a byte array to a string  without knowing the charset.
Are you telling me that generally you don't know the encoding of a file that you need to read? If so, that's not a good state of affairs. Any attempts to read it will boil down to a more or less sophisticated process of guesswork. I suggest you avoid guesswork and  determine the encoding before you read it.

>>>>
>>CEHJ, when I used your chin-utf8.html as input and changed "gb2312" to "utf8" (two places) the content can be shown correctly.

Good - that means you have Chinese font support
>>>>

It also means something else: that the encoding of gb18030 is correct. That is what I used to read it. When I attempted to read it using gb2312, parts of it were garbage.
wsyy

ASKER

CEHJ,

My situation is that I have to crawl hundreds of thousands of web pages whose charset I don't know beforehand. So I guess I can read the page sources into a string using a default charset, say utf8, and then detect the charset from the string. If the detected charset is different from utf8, I would like to convert the string to a new string using the newly detected charset.

The challenge seems to be that once the first string is encoded by utf8, it can't be converted to another string with another charset.

I hope this will clarify my situation. I want to know if my approach stated above is possible, and if there is another workaround so that I can convert a string encoded by utf8 to a string encoded by gb2312, for instance.


you can generally get the encoding from the response

> The challenge seems to be that once the first string is encoded by utf8, it can't be converted to another string with another charset.

it can, your problem is that you keep using the default encoding (instead of the actual encoding) as I explained in my earlier comment
> The challenge seems to be that once the first string is encoded by utf8, it can't be converted to another string with another charset.

A Java String is always Unicode internally; the encoding is only used when converting it to or from a byte array.

eg.

byte[] gb2312 = string.getBytes("gb2312");
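
Because the charset only matters at the byte boundary, a wrong first guess is best undone by going back to the original bytes rather than by "converting" the string. A hedged sketch of the difference; the class and method names are invented for illustration.

```java
import java.nio.charset.Charset;

class KeepTheBytes {

    // Decode with a first guess, re-encode with that guess, then decode with the detected charset.
    // This only works if the first guess is lossless for these bytes.
    static String decodeTwice(byte[] raw, Charset firstGuess, Charset detected) {
        String guessed = new String(raw, firstGuess);
        return new String(guessed.getBytes(firstGuess), detected);
    }

    // Decode the untouched bytes once, with the detected charset.
    static String decodeOnce(byte[] raw, Charset detected) {
        return new String(raw, detected);
    }
}
```

With gb2312 bytes and UTF-8 as the first guess, the invalid sequences are turned into U+FFFD replacement characters during the first decode, so `decodeTwice` cannot bring the text back. ISO-8859-1 is a special case: it maps each of the 256 byte values to exactly one char, so a detour through it is reversible, which makes it a convenient charset for peeking at markup before the real charset is known.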

>>I would like to convert the string to a new string using the newly detected charset.

There's no need to convert to String. All that's required is for the page to be decoded using the correct charset. How you do that depends on the api you're using to crawl. What is it?
wsyy

ASKER

CEHJ, the crawler is nutch.
Well, I'd be very surprised if Nutch didn't already read the page using the correct encoding.
wsyy

ASKER

Nutch uses windows-1252 as the default charset and then guesses the charset.
how is nutch related to the code you are running?

As I mentioned above the charset can be gotten from the http response

What is your current confusion as I believe I have already answered your question
wsyy

ASKER

objects, can you please provide an example of how http response gets the charset?

I actually have the content of an html page in a byte[] variable. How can I detect the charset in the byte[] variable? Do you have example code which is what I really need?

By the way, Nutch doesn't use http response to get the charset.

Please see the attached code. The following statements are to detect charset:

EncodingDetector detector = new EncodingDetector(conf);
      detector.autoDetectClues(content, true);
      detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");
      String encoding = detector.guessEncoding(content, defaultCharEncoding);






DocumentFragment root;
    try {
      byte[] contentInOctets = content.getContent();
      InputSource input = new InputSource(new ByteArrayInputStream(contentInOctets));

      EncodingDetector detector = new EncodingDetector(conf);
      detector.autoDetectClues(content, true);
      detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");
      String encoding = detector.guessEncoding(content, defaultCharEncoding);

      metadata.set(Metadata.ORIGINAL_CHAR_ENCODING, encoding);
      metadata.set(Metadata.CHAR_ENCODING_FOR_CONVERSION, encoding);

      input.setEncoding(encoding);
      if (LOG.isTraceEnabled()) { LOG.trace("Parsing..."); }
      root = parse(input);
    } catch (IOException e) {
      return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
    } catch (DOMException e) {
      return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
    } catch (SAXException e) {
      return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
    } catch (Exception e) {
      e.printStackTrace(LogUtil.getWarnStream(LOG));
      return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
    }


> I actually have the content of an html page in a byte[] variable. How can I detect the charset in the byte[] variable? Do you have example code which is what I really need?
 
the http response contains more information than the page eg. it also contains http headers
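
For example, a `Content-Type` header value such as `text/html; charset=gb2312` carries the charset as a parameter. Pulling it out could be sketched like this; `ContentTypeCharset` and `charsetOf` are hypothetical names, and the regex is a simplification rather than a full header parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContentTypeCharset {

    // Matches charset=value or charset="value", ignoring case.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in a Content-Type header value, or null if none is given.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }
}
```

When fetching with `java.net.URLConnection`, the header value is available from `getContentType()`. Many servers omit the charset parameter or get it wrong, so a `null` or implausible result still needs a fallback.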

Why do you think you need to detect it if Nutch has already detected it?

> By the way, Nutch doesn't use http response to get the charset.

It does; one of the techniques the detector uses is to check the response.
wsyy

ASKER

>Why do you think you need to detect it if Nutch has already detected it?

I want to write my own parser for Chinese.

>It does; one of the techniques the detector uses is to check the response.

Where do you see it?
> I want to write my own parser for Chinese.

do you mean to parse the Nutch output? If so then I think you should be checking the meta data stored in nutch (parse_data segment I think)

> Where do you see it?

we've used nutch on a few occasions in the past
wsyy

ASKER

Actually I don't mean to parse the Nutch output, but want to handle HTML parse by myself. The input to my own parser is a byte[] variable contentInOctets (please see below).

byte[] contentInOctets = content.getContent();

The contentInOctets variable contains the contents of the HTML page source.

I want to detect charset myself using this contentInOctets variable. Please help.
That sounds all rather strange, but you can use Content.getContentType
> That sounds all rather strange, but you can use Content.getContentType

have already mentioned that a few times
>>have already mentioned that a few times

Where?
wsyy

ASKER

objects, I output the Content.getContentType, but it is not the charset. The variable detects the mime type which then is used to guess the charset by Nutch.
wsyy

ASKER

I think the broad issue is a circular one. I mean that if the crawler doesn't know the charset beforehand, the crawler and the parser can't convert the page source to a correctly encoded string. At the same time, without the correctly encoded string, the crawler and the parser don't know the charset and thus can't convert the string.
If you use org.apache.nutch.parse.html.HtmlParser, you'll find it uses the correct encoding (the one specified in the html headers)
>  I mean if the crawler doesn't know the charset beforehand,

most of the time the crawler does know the correct encoding, nothing you have posted suggests that it doesn't. And the fact that you can display the file correctly using my suggestions suggests nutch is using the correct encoding when saving the file

>  the crawler and the parser don't know the charset and thus can't convert the string.

they do know the charset and store it in the meta data as I explained earlier
>  I output the Content.getContentType, but it is not the charset.

you want getContentEncoding() for that


wsyy

ASKER

CEHJ, I meant I would want to write my own parser, which means I don't use the HtmlParser containing the charset.

objects, I have used Nutch for a while, and the encoding is detected in the HtmlParser.java, and the variable carrying the encoding information is not in the Content.Type variable, but in the content variable which is passed into the getParse(Content content) in the .java file. Further, the encoding is included in the contentInOctets variable, which is defined by:

byte[] contentInOctets = content.getContent();

No more abstract explanation please. I am 100% sure of what I am saying.

I think now my question is: how to detect the encoding from the contentInOctets variable while preserving contentInOctets, so that I can then convert it into a string.

>  and the variable carrying the encoding information is not in the Content.Type variable

never said it was, the content type is the mime type
the encoding is stored in the meta data (as I mentioned earlier)

You can access that meta using content.getMetadata()
wsyy

ASKER

EncodingDetector detector = new EncodingDetector(conf);
      detector.autoDetectClues(content, true);
      detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");
      String encoding = detector.guessEncoding(content, defaultCharEncoding);


      metadata.set(Metadata.ORIGINAL_CHAR_ENCODING, encoding);
      metadata.set(Metadata.CHAR_ENCODING_FOR_CONVERSION, encoding);

      input.setEncoding(encoding);
      if (LOG.isTraceEnabled()) { LOG.trace("Parsing..."); }

objects, the encoding is stored in metadata after encoding detection, which is in the HtmlParser.java file. I want to write my own parser, which means I won't be able to use the code in bold type.
ASKER CERTIFIED SOLUTION
Mick Barry
and we're getting off track from the original question which I have already answered.
>>CEHJ, I meant I would want to write my own parser, which means I don't use the HtmlParser containing the charset.

Not sure why you'd want to do that, but there's nothing to stop you from following the same principles as Nutch to do so. Of course, Nutch is guided by the charset in the headers. The problem in this particular case is that the incorrect content type is specified in the file, as I proved at

http:#34211586
http:#34211611

And you acknowledged that proof at http:#34214798
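
For a hand-written parser that starts from `contentInOctets`, one commonly used principle is to look for a `<meta ... charset=...>` declaration in the first bytes of the page. The sketch below is not Nutch's implementation; `MetaCharsetSniffer`, the 2048-byte window and the regex are all assumptions made for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaCharsetSniffer {

    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9_\\-]+)", Pattern.CASE_INSENSITIVE);

    // Looks for a charset declaration near the start of the page; returns fallback if none is usable.
    static Charset sniff(byte[] contentInOctets, Charset fallback) {
        int len = Math.min(contentInOctets.length, 2048);
        // ISO-8859-1 maps every byte to one char, so ASCII markup stays readable and nothing fails.
        String head = new String(contentInOctets, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // illegal or unsupported charset name: fall through to the fallback
            }
        }
        return fallback;
    }

    // Decodes the untouched bytes once, with whatever charset was sniffed.
    static String decode(byte[] contentInOctets, Charset fallback) {
        return new String(contentInOctets, sniff(contentInOctets, fallback));
    }
}
```

As this thread shows, the declared charset can itself be wrong: the page declares gb2312 while gb18030 decodes it better. Since GB18030 was designed as a superset of GB2312, treating a declared gb2312 as GB18030 may be a reasonable tweak, but that is a judgment call rather than a rule.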
wsyy

ASKER

Sorry to give a low rating to these solutions, which do not satisfy my needs.
That would be because your question had nothing to do with detecting the charset :)
Your question asked what the problem was with your code which I answered in my first comments.