wsyy
asked:

Java: how to read all the page source of a large web page

Hi,

I am using Nutch to crawl some web pages, some of which are large .html files. The http plugin used in Nutch can't read the full page source of such large .html files.

I attach two code snippets below. In either case, the content variable (byte[] content) holds the page source of the web page.

It turns out that both approaches return content of almost the same length. So I suspect there is a maximum length of content that can be returned.

However, how can I overcome this content limit?

Thanks
this.url = url;
    GetMethod get = new GetMethod(url.toString());
    get.setFollowRedirects(followRedirects);
    get.setDoAuthentication(true);
    if (datum.getModifiedTime() > 0) {
      get.setRequestHeader("If-Modified-Since",
          HttpDateFormat.toString(datum.getModifiedTime()));
    }

    // Set HTTP parameters
    HttpMethodParams params = get.getParams();
    if (http.getUseHttp11()) {
      params.setVersion(HttpVersion.HTTP_1_1);
    } else {
      params.setVersion(HttpVersion.HTTP_1_0);
    }
    params.makeLenient();
    params.setContentCharset("UTF-8");
    params.setCookiePolicy(CookiePolicy.BROWSER_COMPATIBILITY);
    params.setBooleanParameter(HttpMethodParams.SINGLE_COOKIE_HEADER, true);
    // XXX (ab) not sure about this... the default is to retry 3 times; if
    // XXX the request body was sent the method is not retried, so there is
    // XXX little danger in retrying...
    // params.setParameter(HttpMethodParams.RETRY_HANDLER, null);
    try {
      code = Http.getClient().executeMethod(get);

      Header[] heads = get.getResponseHeaders();

      for (int i = 0; i < heads.length; i++) {
        headers.set(heads[i].getName(), heads[i].getValue());
      }
      
      // Limit download size
      int contentLength = Integer.MAX_VALUE;
      String contentLengthString = headers.get(Response.CONTENT_LENGTH);
      if (contentLengthString != null) {
        try {
          contentLength = Integer.parseInt(contentLengthString.trim());
        } catch (NumberFormatException ex) {
          throw new HttpException("bad content length: " +
              contentLengthString);
        }
      }
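      // NOTE: this cap is what cuts off large pages -- http.getMaxContent()
      // presumably reflects the http.content.limit property (65536 bytes by default).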
      if (http.getMaxContent() >= 0 &&
          contentLength > http.getMaxContent()) {
        contentLength = http.getMaxContent();
      }

      // always read content. Sometimes content is useful to find a cause
      // for error.
      InputStream in = get.getResponseBodyAsStream();
      try {
        byte[] buffer = new byte[HttpBase.BUFFER_SIZE];
        //byte[] buffer = new byte[contentLength];
        int bufferFilled = 0;
        int totalRead = 0;
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((bufferFilled = in.read(buffer, 0, buffer.length)) != -1
            && totalRead < contentLength) {
          totalRead += bufferFilled;
          out.write(buffer, 0, bufferFilled);
        }

        content = out.toByteArray();
        
      } catch (Exception e) {
        if (code == 200) throw new IOException(e.toString());
        // for codes other than 200 OK, we are fine with empty content
      } finally {
        if (in != null) {
          in.close();
        }
        get.abort();
      }

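The "Limit download size" section above is where large pages get cut short: contentLength is clamped to http.getMaxContent(), and the read loop stops writing once totalRead reaches that value. A minimal sketch of that pattern in isolation (readBounded and the 4096-byte buffer are illustrative, not Nutch code):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class BoundedRead {

    // Copies at most maxContent bytes from the stream into a byte array.
    // A negative maxContent means "no limit".
    static byte[] readBounded(InputStream in, int maxContent) throws IOException {
        int limit = (maxContent < 0) ? Integer.MAX_VALUE : maxContent;
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int totalRead = 0;
        int n;
        // Never request more than the remaining budget, so the result
        // cannot exceed the cap.
        while (totalRead < limit
            && (n = in.read(buffer, 0, Math.min(buffer.length, limit - totalRead))) != -1) {
            out.write(buffer, 0, n);
            totalRead += n;
        }
        return out.toByteArray();
    }
}

With a cap of 65,536 bytes (the http.content.limit default that comes up later in this thread), the result can never exceed 64 KB no matter how large the page is; a negative cap drains the whole stream.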

try {
      socket = new Socket();                    // create the socket
      socket.setSoTimeout(http.getTimeout());


      // connect
      String sockHost = http.useProxy() ? http.getProxyHost() : host;
      int sockPort = http.useProxy() ? http.getProxyPort() : port;
      InetSocketAddress sockAddr= new InetSocketAddress(sockHost, sockPort);
      socket.connect(sockAddr, http.getTimeout());

      // make request
      OutputStream req = socket.getOutputStream();

      StringBuffer reqStr = new StringBuffer("GET ");
      if (http.useProxy()) {
        reqStr.append(url.getProtocol() + "://" + host + portString + path);
      } else {
        reqStr.append(path);
      }

      reqStr.append(" HTTP/1.0\r\n");

      reqStr.append("Host: ");
      reqStr.append(host);
      reqStr.append(portString);
      reqStr.append("\r\n");

      reqStr.append("Accept-Encoding: x-gzip, gzip, deflate\r\n");

      String userAgent = http.getUserAgent();
      if ((userAgent == null) || (userAgent.length() == 0)) {
        if (Http.LOG.isFatalEnabled()) { Http.LOG.fatal("User-agent is not set!"); }
      } else {
        reqStr.append("User-Agent: ");
        reqStr.append(userAgent);
        reqStr.append("\r\n");
      }
      
      reqStr.append("Accept-Language: ");
      reqStr.append(this.http.getAcceptLanguage());
      reqStr.append("\r\n");

      if (datum.getModifiedTime() > 0) {
        reqStr.append("If-Modified-Since: " + HttpDateFormat.toString(datum.getModifiedTime()));
        reqStr.append("\r\n");
      }
      reqStr.append("\r\n");
      
      byte[] reqBytes= reqStr.toString().getBytes();

      req.write(reqBytes);
      req.flush();
        
      PushbackInputStream in =                  // process response
        new PushbackInputStream(
          new BufferedInputStream(socket.getInputStream(), Http.BUFFER_SIZE), 
          Http.BUFFER_SIZE) ;

      StringBuffer line = new StringBuffer();

      boolean haveSeenNonContinueStatus= false;
      while (!haveSeenNonContinueStatus) {
        // parse status code line
        this.code = parseStatusLine(in, line); 
        // parse headers
        parseHeaders(in, line);
        haveSeenNonContinueStatus= code != 100; // 100 is "Continue"
      }

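      // readPlainContent() presumably applies the same http.getMaxContent()
      // cap while copying the body, so large pages are truncated here as well.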
      readPlainContent(in);

      String contentEncoding = getHeader(Response.CONTENT_ENCODING);
      if ("gzip".equals(contentEncoding) || "x-gzip".equals(contentEncoding)) {
        content = http.processGzipEncoded(content, url);
      } else if ("deflate".equals(contentEncoding)) {
        content = http.processDeflateEncoded(content, url);
      } else {
        if (Http.LOG.isTraceEnabled()) {
          Http.LOG.trace("fetched " + content.length + " bytes from " + url);
        }
      }

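For comparison, a standalone sketch that fetches the same URL with plain java.net.HttpURLConnection and no size cap can help confirm whether the server really sends the full page (the URL and the "Mozilla" user-agent string are just placeholders):

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class FullFetch {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.dianping.com/shop/511825");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("User-Agent", "Mozilla");
        conn.setConnectTimeout(10000);
        conn.setReadTimeout(10000);

        // Read the whole body with no upper bound on its length.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = conn.getInputStream()) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
        System.out.println("fetched " + out.size() + " bytes");
    }
}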

for_yan

I'm sorry, I have probably missed something: do you mean that this code does not read any long page, or some particular long page? I didn't spot a particular URL there.
If this is about all long pages, is there an approximate maximum size that you are getting?
Is it different every time, or the same?
Can we try it with some page where you see the problem?
wsyy

ASKER
For example, the code can't get the full source of long pages such as http://www.dianping.com/shop/511825

The total number of characters is 100,156. I got 62,980 one time and 62,974 another time. I checked a few more times and the results were about the same.
wsyy

ASKER
I also used the code to try the URL http://www.dianping.com/shop/511825/review_all and the result was the same: only part of the page source was returned.
for_yan

I see, thanks.
Maybe you can post some compilable piece of code, so that we can try it ourselves?
wsyy

ASKER
What compilable piece do you mean?

The code is from Nutch, and there are quite a lot of dependent pieces there. I am afraid I can't give you a compilable piece. The two .java files (both named HttpResponse.java) are attached, though.
for_yan

Sure, I understand, it is not always possible to extract something
we can try on our own. That was just a question.
Even just to look at it, though, it is sometimes better to see a bigger piece of code. Thanks.
wsyy

ASKER
for_yan, thanks a lot for your kind help.

Do you think it is because the server refuses to send out all of the content since I am using Nutch to crawl the site?

I just tried the following code, which works! Jsoup is an HTML parser which also provides HTTP fetching.
Document doc = Jsoup.connect("http://www.dianping.com/shop/511825")
    .data("query", "Java")
    .userAgent("Mozilla")
    .cookie("auth", "token")
    .timeout(3000)
    .post();

System.out.println(doc.html());
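Note that Jsoup's Connection also applies a maximum body size by default, so very large pages can still be cut off. A minimal sketch, assuming a reasonably recent Jsoup version, where maxBodySize(0) disables the cap and get() is the usual call for simply fetching a page:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupFullFetch {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://www.dianping.com/shop/511825")
                .userAgent("Mozilla")
                .timeout(10000)
                .maxBodySize(0)   // 0 = no limit on the downloaded body
                .get();
        System.out.println(doc.html().length());
    }
}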

for_yan

Yes, I was just thinking about suggesting that you try reading it with some
standard tool like HttpUnit.
Sorry, I don't know; I don't have experience with these crawlers.

Maybe you can make a list of all such very big pages and give them special
treatment: first grab them with some other tool and then feed them to
your crawler separately from your location.
I understand that would be a big disruption.
wsyy

ASKER
What are the standard ways? Could you please provide some examples? I don't want to use Jsoup, which already wraps a lot of the details.
CEHJ

>>The total number of the characters is 100,156

That's actually small compared to some of the question pages on this site ;)

You say you're using Nutch, but AFAICR you've implemented your own parser, haven't you?
wsyy

ASKER
Yes, I have my own parser, but it takes its input from the content fetched by Nutch's default protocol-http plugin.
ASKER CERTIFIED SOLUTION
Mick Barry

The content is being truncated by Nutch's http.content.limit setting, which defaults to 65536 bytes. Increase it, or set it to -1 to remove the limit entirely.
wsyy

ASKER
objects, should be -1 or "-1"?

The original is:

<property>
  <name>http.content.limit</name>
  <value>65536</value>
</property>
Mick Barry

just -1 should be fine
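
For reference, the adjusted property (typically placed in conf/nutch-site.xml so it overrides the default) would look like:

<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>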
wsyy

ASKER
Fabulous, objects. Also, thanks to the others for their input.