Java Grab Webpage from Wiki Edit

Posted on 2007-10-16
Medium Priority
Last Modified: 2013-11-24

I'd like to grab the above page, but the middle section that holds the wiki text isn't coming through; I only get the content above and below it. A straight curl on the URL works, but my Java code doesn't. Anyone know what I might be missing?


    URL url = new URL(urlLocation);
    URLConnection connection = url.openConnection();
    connection.setDoInput(true);
    connection.setDoOutput(true);
    connection.setRequestProperty("Accept-Language", "en-us");
    connection.setRequestProperty("Accept", "*/*");
    connection.setRequestProperty("Connection", "Keep-Alive");
    connection.setRequestProperty("Cache-Control", "no-cache");
    connection.connect();
    // The constructor arguments below were cut off in the post;
    // the connection's own streams are assumed.
    OutputStreamWriter out = new OutputStreamWriter(
            connection.getOutputStream());
    BufferedReader in = new BufferedReader(
            new InputStreamReader(
                    connection.getInputStream()));
    String decodedString;
    StringBuffer entireSite = new StringBuffer();
    while ((decodedString = in.readLine()) != null) {
        entireSite.append(decodedString);
    }
    in.close();
    System.out.println("WikiPage: " + entireSite);

Question by:ecuguru
LVL 86

Expert Comment

ID: 20088528
Try getting rid of the OutputStream stuff - it's not being used anyway
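
For reference, a minimal sketch of what the request could look like with the output side removed. With an http URL, asking the connection for its output stream makes HttpURLConnection send a POST instead of a GET, which may be why the page differs from what curl returns. The class and method names (PlainGet, fetch) are made up for illustration, and UTF-8 is assumed as the page encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

public class PlainGet {

    // Hypothetical helper: read a URL without touching the output side,
    // so an http connection stays a plain GET.
    public static String fetch(String address) throws IOException {
        URLConnection connection = new URL(address).openConnection();
        connection.setRequestProperty("Accept-Language", "en-us");
        // Deliberately no setDoOutput(true) and no getOutputStream().
        StringBuilder page = new StringBuilder();
        // UTF-8 is an assumption; a page may declare another encoding.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                page.append(line).append('\n'); // keep the line breaks
            }
        }
        return page.toString();
    }
}
```

Calling PlainGet.fetch(urlLocation) would then return the page as one String with its line breaks intact.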
LVL 86

Accepted Solution

CEHJ earned 1200 total points
ID: 20088645
This approach will be more efficient and works fine for me with your URL

import java.io.InputStreamReader;
import java.net.URL;

public static StringBuilder getWiki(String address) throws Exception {
      StringBuilder sb = new StringBuilder();
      final int BUF_SIZE = 1 << 10 << 4; // 16Ki chars
      int charsRead = -1;
      char[] buffer = new char[BUF_SIZE];
      URL url = new URL(address);
      InputStreamReader in = new InputStreamReader(url.openStream());
      try {
            while ((charsRead = in.read(buffer)) > -1) {
                  sb.append(buffer, 0, charsRead);
            }
      } finally {
            in.close();
      }
      return sb;
}

Author Comment

ID: 20089237
CEHJ, I really liked your second answer. It maintained the formatting of the source page, rather than shoving it all into one string, which lets me parse it more easily.

But it doesn't hold up for international characters:

And it's converting < and > into &lt; and &gt;

Is there a way to use your code, and still maintain special character formatting?

LVL 92

Assisted Solution

objects earned 800 total points
ID: 20090379
Try the following; you'll also find it a bit faster.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

public class Wiki {

      /**
       * @param args
       * @throws IOException
       */
      public static void main(String[] args) throws IOException {
            URL url = new URL("http://en.wikipedia.org/w/index.php?title=McDonald%27s&action=edit");
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            InputStream in = url.openStream();
            byte[] buf = new byte[2048];
            int n = 0;
            // Copy the raw bytes first; decode only once at the end
            while ((n=in.read(buf))>=0) {
                  out.write(buf, 0, n);
            }
            in.close();
            // s holds the whole page, decoded as UTF-8
            String s = out.toString("UTF-8");
      }
}

LVL 86

Expert Comment

ID: 20091298
No conversions should be taking place. Can you give me an example of a URL where you think that's happening?
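
If the &lt; and &gt; are already present in the page source (an edit form has to escape the wiki text it places inside its textarea), then the reading code is not converting anything and the entities can be decoded after the download. A minimal sketch under that assumption, covering only a handful of common entities; the class and method names are made up for illustration.

```java
public class Unescape {

    // Decode a few common HTML entities. &amp; is handled last so that
    // "&amp;lt;" becomes "&lt;" rather than "<".
    public static String basicEntities(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&");
    }
}
```

For anything beyond these few entities, an HTML parser is the safer route.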
LVL 86

Expert Comment

ID: 20091336
As for international characters, you can't be certain of reproducing them from their binary form unless you know the encoding of the original, although you can cast a wider net by changing the appropriate line to

InputStreamReader in = new InputStreamReader(url.openStream(), "UTF-8");
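
To illustrate why the charset argument matters, here is a small sketch that decodes the same bytes two ways. The class name, the readAll helper and the sample string are made up for illustration; a real page may declare a different encoding in its Content-Type header.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {

    // Read a stream to the end, decoding it with the given charset.
    public static String readAll(InputStream stream, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[1 << 14];
        try (Reader in = new InputStreamReader(stream, charset)) {
            int charsRead;
            while ((charsRead = in.read(buffer)) > -1) {
                sb.append(buffer, 0, charsRead);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String original = "Z\u00fcrich"; // "Zürich", written with an escape
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        String right = readAll(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);
        String wrong = readAll(new ByteArrayInputStream(utf8), StandardCharsets.ISO_8859_1);
        System.out.println(right.equals(original)); // true
        System.out.println(wrong.equals(original)); // false: the two bytes of the umlaut decode as two characters
    }
}
```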
LVL 92

Expert Comment

ID: 20091832
And with the code I posted above, if you don't actually need the page as a String, just keep it as a byte array for processing.
In fact, you may be able to use the stream directly, e.g. if you're writing it to a file, then replace ByteArrayOutputStream with a FileOutputStream.

Let me know if you have any questions or problems with the code :)
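
A rough sketch of the file variant described above, for illustration only. The class name and both helper methods are hypothetical; the copy loop is the same one used in the Wiki class, with the destination swapped for a FileOutputStream.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class StreamToFile {

    // Copy everything from in to out without building a String.
    // Returns the number of bytes copied.
    public static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[2048];
        long total = 0;
        int n;
        while ((n = in.read(buf)) >= 0) {
            out.write(buf, 0, n);
            total += n;
        }
        return total;
    }

    // Hypothetical helper: save a stream to the named file.
    public static long save(InputStream in, String fileName) throws IOException {
        try (OutputStream out = new FileOutputStream(fileName)) {
            return copy(in, out);
        }
    }
}
```

With this in place, something like StreamToFile.save(url.openStream(), "page.html") would write the raw bytes to disk, so no character encoding has to be guessed at download time.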
LVL 86

Expert Comment

ID: 20091992
Depending on how you're doing your parsing, it may be better to load the page into a document object model first, in which case you could use the NekoHTML parser; for something more high-level and simple, you could try HttpUnit.
LVL 86

Expert Comment

ID: 20099169


Question has a verified solution.
