Java: how to read the full page source of a large web page
Hi,
I am using Nutch to crawl some web pages, some of which are big .html files. The HTTP plugin used in Nutch can't read the whole page source of such big .html files.
I attach two code snippets below; in either case, the content variable (byte[] content) holds the page source of the web page.
It turns out that both ways return almost the same content length, so I suspect there is a maximum length of content that can be returned.
However, how can I overcome this content limit?
Thanks
this.url = url;
GetMethod get = new GetMethod(url.toString());
get.setFollowRedirects(followRedirects);
get.setDoAuthentication(true);
if (datum.getModifiedTime() > 0) {
  get.setRequestHeader("If-Modified-Since",
      HttpDateFormat.toString(datum.getModifiedTime()));
}

// Set HTTP parameters
HttpMethodParams params = get.getParams();
if (http.getUseHttp11()) {
  params.setVersion(HttpVersion.HTTP_1_1);
} else {
  params.setVersion(HttpVersion.HTTP_1_0);
}
params.makeLenient();
params.setContentCharset("UTF-8");
params.setCookiePolicy(CookiePolicy.BROWSER_COMPATIBILITY);
params.setBooleanParameter(HttpMethodParams.SINGLE_COOKIE_HEADER, true);
// XXX (ab) not sure about this... the default is to retry 3 times; if
// XXX the request body was sent the method is not retried, so there is
// XXX little danger in retrying...
// params.setParameter(HttpMethodParams.RETRY_HANDLER, null);

try {
  code = Http.getClient().executeMethod(get);

  Header[] heads = get.getResponseHeaders();
  for (int i = 0; i < heads.length; i++) {
    headers.set(heads[i].getName(), heads[i].getValue());
  }

  // Limit download size
  int contentLength = Integer.MAX_VALUE;
  String contentLengthString = headers.get(Response.CONTENT_LENGTH);
  if (contentLengthString != null) {
    try {
      contentLength = Integer.parseInt(contentLengthString.trim());
    } catch (NumberFormatException ex) {
      throw new HttpException("bad content length: " + contentLengthString);
    }
  }
  if (http.getMaxContent() >= 0 && contentLength > http.getMaxContent()) {
    contentLength = http.getMaxContent();
  }

  // always read content. Sometimes content is useful to find a cause
  // for error.
  InputStream in = get.getResponseBodyAsStream();
  try {
    byte[] buffer = new byte[HttpBase.BUFFER_SIZE];
    // byte[] buffer = new byte[contentLength];
    int bufferFilled = 0;
    int totalRead = 0;
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    while ((bufferFilled = in.read(buffer, 0, buffer.length)) != -1
        && totalRead < contentLength) {
      totalRead += bufferFilled;
      out.write(buffer, 0, bufferFilled);
    }
    content = out.toByteArray();
  } catch (Exception e) {
    if (code == 200) throw new IOException(e.toString());
    // for codes other than 200 OK, we are fine with empty content
  } finally {
    if (in != null) {
      in.close();
    }
    get.abort();
  }
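For comparison, here is a minimal standalone fetch written against the same commons-httpclient 3.x API that the plugin uses, but reading the response stream to end-of-stream with no contentLength/maxContent cap. It is only a sketch for checking whether the server actually sends the whole page; the URL and buffer size are placeholders.

import java.io.ByteArrayOutputStream;
import java.io.InputStream;

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;

public class FullPageFetch {
  public static void main(String[] args) throws Exception {
    // Placeholder URL -- substitute the large page that gets truncated.
    String url = "http://example.com/large-page.html";

    HttpClient client = new HttpClient();
    GetMethod get = new GetMethod(url);
    get.setFollowRedirects(true);
    try {
      int code = client.executeMethod(get);

      // Read to end-of-stream with no upper bound, unlike the Nutch loop
      // above, which stops once totalRead reaches contentLength.
      InputStream in = get.getResponseBodyAsStream();
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      if (in != null) {
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer, 0, buffer.length)) != -1) {
          out.write(buffer, 0, n);
        }
        in.close();
      }
      System.out.println("HTTP " + code + ", bytes read: " + out.size());
    } finally {
      get.releaseConnection();
    }
  }
}

If this standalone version returns the full page, the truncation is happening in the Nutch code path rather than on the server side.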
I'm sorry, I probably missed something: do you mean that this code does not read any long page, or some particular long page? I didn't spot a particular URL there.
If this is about all long pages, is there an approximate maximum size that you are getting?
Is it different every time, or the same?
Can we try it with some page where you see the problem?
The total number of characters is 100,156. I got 62,980 one time and 62,974 another time, and when I checked a few more times the results stayed the same.
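62,980 and 62,974 are both just under 64 KB, which points at the download cap in the loop above (http.getMaxContent()) rather than at the HTTP library. In stock Nutch that cap comes from the http.content.limit property, whose default is 65536; raising it, or setting it to -1 for no limit, in conf/nutch-site.xml should let the whole page through. A rough sketch for checking the effective value, assuming the Hadoop Configuration API that Nutch 1.x uses and that nutch-default.xml/nutch-site.xml are on the classpath:

import org.apache.hadoop.conf.Configuration;

public class ContentLimitCheck {
  public static void main(String[] args) {
    // Assumes nutch-default.xml and nutch-site.xml are on the classpath.
    Configuration conf = new Configuration();
    conf.addResource("nutch-default.xml");
    conf.addResource("nutch-site.xml");

    // Property name and 64 kB fallback mirror the stock Nutch configuration;
    // a value of -1 disables the (contentLength > http.getMaxContent()) cap.
    int maxContent = conf.getInt("http.content.limit", 64 * 1024);
    System.out.println("http.content.limit = " + maxContent);
  }
}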
I see, thanks.
Maybe you can post some compilable piece, so that we could try it ourselves?
wsyy
ASKER
What compilable piece do you mean?
The code is from Nutch, and there are quite a lot of dependent pieces, so I am afraid I can't give you a compilable piece. The two .java files are attached, though: HttpResponse.java.
for_yan
Sure, I understand; it is not always possible to extract something we can try on our own. That was just a question.
But even just to look at, it sometimes helps to see a bigger piece of code. Thanks.
Yes, I was just thinking of suggesting that you try to read it in some standard way, for example with HttpUnit.
Sorry, I don't know. Don't have experience with these crawlers.
Maybe you can make a list of all such very big pages and give them special treatment: first grab them with some other tool and then feed them to your crawler separately from your location.
I understand that would be a big disruption.
wsyy
ASKER
What are the standard ways? Could you please provide some examples? I don't want to use Jsoup, which already wraps a lot of the details.
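One "standard" option that does not wrap the details the way Jsoup does is plain java.net.HttpURLConnection from the JDK. A minimal sketch for comparing raw page sizes; the URL is a placeholder:

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class RawFetch {
  public static void main(String[] args) throws Exception {
    // Placeholder URL -- replace with the page that comes back truncated.
    URL url = new URL("http://example.com/large-page.html");

    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setInstanceFollowRedirects(true);
    conn.setRequestProperty("User-Agent", "size-check");

    // Read the body to end-of-stream; no explicit length limit is applied.
    InputStream in = conn.getInputStream();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buffer = new byte[8192];
    int n;
    while ((n = in.read(buffer)) != -1) {
      out.write(buffer, 0, n);
    }
    in.close();

    System.out.println("HTTP " + conn.getResponseCode()
        + ", bytes read: " + out.size());
    conn.disconnect();
  }
}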