Java: how to read the full page source of a large web page

wsyy asked:

Hi,

I am using Nutch to crawl some web pages, some of which are big .html files. The http plugin used in Nutch can't read the full page source of such big .html files.

I attach two code snippets down below. In both cases, the content variable (byte[] content) holds the page source of the fetched web page.

It turns out that both approaches return content of almost the same length, so I guess there is a maximum length of content that can be returned.

However, how can I overcome this content limit?

Thanks
this.url = url;
    GetMethod get = new GetMethod(url.toString());
    get.setFollowRedirects(followRedirects);
    get.setDoAuthentication(true);
    if (datum.getModifiedTime() > 0) {
      get.setRequestHeader("If-Modified-Since",
          HttpDateFormat.toString(datum.getModifiedTime()));
    }

    // Set HTTP parameters
    HttpMethodParams params = get.getParams();
    if (http.getUseHttp11()) {
      params.setVersion(HttpVersion.HTTP_1_1);
    } else {
      params.setVersion(HttpVersion.HTTP_1_0);
    }
    params.makeLenient();
    params.setContentCharset("UTF-8");
    params.setCookiePolicy(CookiePolicy.BROWSER_COMPATIBILITY);
    params.setBooleanParameter(HttpMethodParams.SINGLE_COOKIE_HEADER, true);
    // XXX (ab) not sure about this... the default is to retry 3 times; if
    // XXX the request body was sent the method is not retried, so there is
    // XXX little danger in retrying...
    // params.setParameter(HttpMethodParams.RETRY_HANDLER, null);
    try {
      code = Http.getClient().executeMethod(get);

      Header[] heads = get.getResponseHeaders();

      for (int i = 0; i < heads.length; i++) {
        headers.set(heads[i].getName(), heads[i].getValue());
      }
      
      // Limit download size
      int contentLength = Integer.MAX_VALUE;
      String contentLengthString = headers.get(Response.CONTENT_LENGTH);
      if (contentLengthString != null) {
        try {
          contentLength = Integer.parseInt(contentLengthString.trim());
        } catch (NumberFormatException ex) {
          throw new HttpException("bad content length: " +
              contentLengthString);
        }
      }
      if (http.getMaxContent() >= 0 &&
          contentLength > http.getMaxContent()) {
        contentLength = http.getMaxContent();
      }

      // always read content. Sometimes content is useful to find a cause
      // for error.
      InputStream in = get.getResponseBodyAsStream();
      try {
        byte[] buffer = new byte[HttpBase.BUFFER_SIZE];
        //byte[] buffer = new byte[contentLength];
        int bufferFilled = 0;
        int totalRead = 0;
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((bufferFilled = in.read(buffer, 0, buffer.length)) != -1
            && totalRead < contentLength) {
          totalRead += bufferFilled;
          out.write(buffer, 0, bufferFilled);
        }

        content = out.toByteArray();
        
      } catch (Exception e) {
        if (code == 200) throw new IOException(e.toString());
        // for codes other than 200 OK, we are fine with empty content
      } finally {
        if (in != null) {
          in.close();
        }
        get.abort();
      }

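A note on where the truncation happens in the snippet above: the read loop stops once totalRead reaches contentLength, and contentLength has already been clamped to http.getMaxContent() a few lines earlier. Just to illustrate the difference, an uncapped read would look something like the sketch below (StreamUtil and readAll are made-up names, not part of Nutch):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

class StreamUtil {
  // Sketch only: read an entire stream into a byte[] with no size cap,
  // unlike the loop above, where totalRead is limited by contentLength.
  static byte[] readAll(InputStream in) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buffer = new byte[4096];
    int n;
    while ((n = in.read(buffer, 0, buffer.length)) != -1) {
      out.write(buffer, 0, n);
    }
    return out.toByteArray();
  }
}

The second snippet, below, bypasses HttpClient and writes the GET request to a socket directly.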

try {
      socket = new Socket();                    // create the socket
      socket.setSoTimeout(http.getTimeout());


      // connect
      String sockHost = http.useProxy() ? http.getProxyHost() : host;
      int sockPort = http.useProxy() ? http.getProxyPort() : port;
      InetSocketAddress sockAddr= new InetSocketAddress(sockHost, sockPort);
      socket.connect(sockAddr, http.getTimeout());

      // make request
      OutputStream req = socket.getOutputStream();

      StringBuffer reqStr = new StringBuffer("GET ");
      if (http.useProxy()) {
      	reqStr.append(url.getProtocol()+"://"+host+portString+path);
      } else {
      	reqStr.append(path);
      }

      reqStr.append(" HTTP/1.0\r\n");

      reqStr.append("Host: ");
      reqStr.append(host);
      reqStr.append(portString);
      reqStr.append("\r\n");

      reqStr.append("Accept-Encoding: x-gzip, gzip, deflate\r\n");

      String userAgent = http.getUserAgent();
      if ((userAgent == null) || (userAgent.length() == 0)) {
        if (Http.LOG.isFatalEnabled()) { Http.LOG.fatal("User-agent is not set!"); }
      } else {
        reqStr.append("User-Agent: ");
        reqStr.append(userAgent);
        reqStr.append("\r\n");
      }
      
      reqStr.append("Accept-Language: ");
      reqStr.append(this.http.getAcceptLanguage());
      reqStr.append("\r\n");

      if (datum.getModifiedTime() > 0) {
        reqStr.append("If-Modified-Since: " + HttpDateFormat.toString(datum.getModifiedTime()));
        reqStr.append("\r\n");
      }
      reqStr.append("\r\n");
      
      byte[] reqBytes= reqStr.toString().getBytes();

      req.write(reqBytes);
      req.flush();
        
      PushbackInputStream in =                  // process response
        new PushbackInputStream(
          new BufferedInputStream(socket.getInputStream(), Http.BUFFER_SIZE), 
          Http.BUFFER_SIZE) ;

      StringBuffer line = new StringBuffer();

      boolean haveSeenNonContinueStatus= false;
      while (!haveSeenNonContinueStatus) {
        // parse status code line
        this.code = parseStatusLine(in, line); 
        // parse headers
        parseHeaders(in, line);
        haveSeenNonContinueStatus= code != 100; // 100 is "Continue"
      }

      readPlainContent(in);

      String contentEncoding = getHeader(Response.CONTENT_ENCODING);
      if ("gzip".equals(contentEncoding) || "x-gzip".equals(contentEncoding)) {
        content = http.processGzipEncoded(content, url);
      } else if ("deflate".equals(contentEncoding)) {
       content = http.processDeflateEncoded(content, url);
      } else {
        if (Http.LOG.isTraceEnabled()) {
          Http.LOG.trace("fetched " + content.length + " bytes from " + url);
        }
      }

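Since this request advertises Accept-Encoding: x-gzip, gzip, deflate, whatever reads the body also has to inflate it, which is what http.processGzipEncoded / processDeflateEncoded do in the snippet (and, as far as I recall, Nutch's gzip helper applies the content limit to the decompressed bytes as well). A stand-alone equivalent for the gzip case, with a made-up class name and no size cap, might look like:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;

class GzipUtil {
  // Sketch only: inflate a gzip-compressed response body in full,
  // with no upper bound on the decompressed size.
  static byte[] gunzip(byte[] compressed) throws IOException {
    GZIPInputStream gzin =
        new GZIPInputStream(new ByteArrayInputStream(compressed));
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buffer = new byte[4096];
    int n;
    while ((n = gzin.read(buffer, 0, buffer.length)) != -1) {
      out.write(buffer, 0, n);
    }
    gzin.close();
    return out.toByteArray();
  }
}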

for_yan (Awarded 2011)

Commented:
I'm sorry, I've probably missed something: do you mean that this code fails to read any long page, or just some particular long page? I didn't spot a specific URL in the question.
If this affects all long pages, is there an approximate maximum size that you are getting back?
Is it different every time, or the same?
Can we try it with some page where you see the problem?

Author

Commented:
For example, the code can't get the full source of a long page like http://www.dianping.com/shop/511825.

The total number of characters is 100,156. I got 62,980 one time and 62,974 another time; I checked a few more times and the results stayed about the same.

Author

Commented:
I also used the code to try this URL, http://www.dianping.com/shop/511825/review_all. The result was the same: only part of the page source was returned.

for_yan (Awarded 2011)

Commented:
I see, thanks.
Maybe you can post a compilable piece, so that we could try it ourselves?

Author

Commented:
What compilable piece do you mean?

The code is from Nutch, and there are quite a lot of dependent pieces there, so I'm afraid I can't give you a compilable piece. The two .java files are attached, though: HttpResponse.java HttpResponse.java
for_yan (Awarded 2011)

Commented:
Sure, I understand, it is not always possible to extract something
we can try on our own. That was just a question.
But even to look sometimes better to see bigger piece of code.Thanks.

Author

Commented:
for_yan, thanks a lot for your kind help.

Do you think it is because the server refuses to send out all of the content since I am using Nutch to crawl the site?

I just tried the following code, which works! Jsoup is an HTML parser that also provides an HTTP fetch function.
Document doc = Jsoup.connect("http://www.dianping.com/shop/511825")
    .data("query", "Java")
    .userAgent("Mozilla")
    .cookie("auth", "token")
    .timeout(3000)
    .post();

System.out.println(doc.html());

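A small aside on the Jsoup snippet: it issues a POST, whereas for simply fetching a page Jsoup's .get() is the more usual call, and if your Jsoup version has Connection.maxBodySize(int), passing 0 lifts Jsoup's own default cap on how much of the body it reads. A sketch:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Plain GET of the same page; maxBodySize(0) removes Jsoup's default
// limit on how many bytes of the response body it will read.
Document doc = Jsoup.connect("http://www.dianping.com/shop/511825")
    .userAgent("Mozilla")
    .timeout(3000)
    .maxBodySize(0)
    .get();
System.out.println(doc.html().length());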

for_yan (Awarded 2011)

Commented:
Yes, I was just thinking about suggesting to you to try to read it with some
standard way like HttpUnit.
Sorry, I don't know. Don't have experience with these crawlers.

Maybe you can make a list of all such very big pages and make for them special
treatment, when you'll first grab them by some other tool and then feed them to
your crawler separately form your location.
I understand that would be a big disruption.

Author

Commented:
What are the standard ways? Could you please provide some examples? I don't want to use Jsoup, which already wraps up a lot of the details.
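
For reference, one framework-free "standard way" is the JDK's own java.net.HttpURLConnection, which imposes no crawler-side size limit. A minimal sketch (the class name FetchPage is made up for illustration):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class FetchPage {
  public static void main(String[] args) throws IOException {
    URL url = new URL("http://www.dianping.com/shop/511825");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestProperty("User-Agent", "Mozilla");
    conn.setConnectTimeout(10000);
    conn.setReadTimeout(10000);

    // Read the whole body; nothing here caps the number of bytes.
    InputStream in = conn.getInputStream();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buffer = new byte[4096];
    int n;
    while ((n = in.read(buffer, 0, buffer.length)) != -1) {
      out.write(buffer, 0, n);
    }
    in.close();
    conn.disconnect();

    System.out.println("read " + out.size() + " bytes");
  }
}
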
Top Expert 2016

Commented:
>>The total number of the characters is 100,156

That's actually small compared to some of the question pages on this site ;)

You say you're using Nutch, but AFAICR you've implemented your own parser, haven't you?

Author

Commented:
Yes, I have my own parser, but it takes its input from the content that Nutch's default http protocol plugin fetches.
Mick Barry (objects), Java Developer
Top Expert 2010
Commented:
edit nutch-site.xml, and add

    <property>
      <name>http.content.limit</name>
      <value>-1</value>
    </property>

or use a positive value if you want to set a specific limit.

Author

Commented:
objects, should be -1 or "-1"?

The original is:

<property>
  <name>http.content.limit</name>
  <value>65536</value>
</property>
Mick Barry, Java Developer
Top Expert 2010

Commented:
just -1 should be fine
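
For context, this lines up with the check in the first snippet above: as far as I can tell, http.getMaxContent() is backed by the http.content.limit property (65536 in the original configuration, which roughly matches the ~63,000 characters that were coming back, the difference presumably being bytes versus characters), and the cap is only applied when that value is non-negative:

// From the first snippet: with http.content.limit set to -1,
// getMaxContent() returns -1, this condition is never true, and
// contentLength stays Integer.MAX_VALUE, so the read loop is
// effectively unbounded.
if (http.getMaxContent() >= 0 &&
    contentLength > http.getMaxContent()) {
  contentLength = http.getMaxContent();
}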

Author

Commented:
Fabulous, objects. Also, thanks to the others for their input.
