Java Grab Webpage from Wiki Edit

http://en.wikipedia.org/w/index.php?title=McDonald%27s&action=edit

I'd like to grab the above page, but the middle section that contains the wiki text isn't coming through; I only get the content above and below it. A plain curl request for the same URL works, but my Java code doesn't. Does anyone know what I might be missing?

///Code

    URL url = new URL(urlLocation);
    try {
        URLConnection connection = url.openConnection();
        connection.setUseCaches(false);
        connection.setDoInput(true);
        connection.setDoOutput(true);
        connection.setRequestProperty("Accept-Language", "en-us");
        connection.setRequestProperty("Accept", "*/*");
        connection.setRequestProperty("Connection", "Keep-Alive");
        connection.setRequestProperty("Cache-Control", "no-cache");
        connection.connect();
        OutputStreamWriter out = new OutputStreamWriter(
                connection.getOutputStream());
        out.close();

        BufferedReader in = new BufferedReader(
                new InputStreamReader(
                        connection.getInputStream()));
        String decodedString;
        StringBuffer entireSite = new StringBuffer();
        while ((decodedString = in.readLine()) != null) {
            entireSite.append(decodedString);
        }
        System.out.println("WikiPage: " + entireSite);
    } catch (IOException e) {
        e.printStackTrace();
    }

CEHJ

Try getting rid of the OutputStream stuff - it's not being used anyway
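
A minimal sketch of what a GET-only version might look like, for illustration. With `HttpURLConnection`, opening the output stream after `setDoOutput(true)` turns the request into a POST, which may be why the edit form came back without its text. The class name `WikiPageFetcher` and the helper `readAll` are made up here, and reading the page as UTF-8 is an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WikiPageFetcher {

    // Reads the whole stream as text, appending one '\n' after every line
    // so the line structure of the page is kept.
    public static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder page = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                page.append(line).append('\n');
            }
        }
        return page.toString();
    }

    // Plain GET: no setDoOutput(true) and no getOutputStream().
    public static String fetch(String urlLocation) throws IOException {
        URLConnection connection = new URL(urlLocation).openConnection();
        connection.setUseCaches(false);
        connection.setRequestProperty("Accept-Language", "en-us");
        // UTF-8 is an assumption about the page, not something checked here.
        return readAll(connection.getInputStream(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(fetch(
                "http://en.wikipedia.org/w/index.php?title=McDonald%27s&action=edit"));
    }
}
```

Keeping the reading logic in `readAll` means it can be tried on any `InputStream`, not only a network connection.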

ASKER CERTIFIED SOLUTION

CEHJ

This solution is only available to members.

ecuguru

ASKER

CEHJ, I really liked your second answer. It keeps the formatting of the source page rather than shoving it all into one string, which makes it easier for me to parse.

But it doesn't hold up for international characters:
http://en.wikipedia.org/wiki/Arabic_language

And it's converting < and > into &lt; and &gt;

Is there a way to use your code, and still maintain special character formatting?

SOLUTION

Mick Barry

This solution is only available to members.

CEHJ

No conversions should be taking place. Can you give me an example of a URL where you think that's happening?
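
One possible explanation, offered as an assumption rather than something confirmed in this thread: the edit page places the wiki text inside an HTML form field, and HTML source has to escape `<` and `&` there, so the entities would already be in the bytes the server sends. If that is the case, a small unescape step after fetching would restore the original characters. The class and method names below are hypothetical, and only four common entities are handled.

```java
class EntityUnescape {

    // Reverses the escaping of a few common HTML entities.
    // "&amp;" is handled last so that "&amp;lt;" becomes "&lt;" and not "<".
    public static String unescapeBasic(String html) {
        return html.replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        // prints: <ref>Fast food & drink</ref>
        System.out.println(unescapeBasic("&lt;ref&gt;Fast food &amp; drink&lt;/ref&gt;"));
    }
}
```

An HTML parser would do this decoding for every entity, so this hand-rolled version is only worth using when the text is being pulled out with plain string handling.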

CEHJ

As for international characters, you can't really be certain of reproducing them correctly from their binary form unless you know the encoding of the original, although you can cast a wider net by changing the appropriate line to

InputStreamReader in = new InputStreamReader(url.openStream(), "UTF-8");
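
Rather than hard-coding the encoding, one option is to take it from the response's Content-Type header and fall back to UTF-8 when nothing usable is declared. This is a sketch under that assumption; `CharsetSniffer` and `charsetFrom` are hypothetical names, and a page that declares its encoding only in a `<meta>` tag would not be detected this way.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetSniffer {

    // Picks the charset named in a Content-Type value such as
    // "text/html; charset=ISO-8859-1"; falls back to UTF-8 otherwise.
    public static Charset charsetFrom(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                    String name = p.substring(8).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        // Unknown or malformed charset name: use the default below.
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        // Arabic sample text written with escapes so the source file stays ASCII.
        byte[] bytes = "\u0639\u0631\u0628\u064A".getBytes(StandardCharsets.UTF_8);
        Charset charset = charsetFrom("text/html; charset=UTF-8");
        System.out.println(new String(bytes, charset));
    }
}
```

With a `URLConnection`, the result could be passed straight to the reader, for example `new InputStreamReader(connection.getInputStream(), CharsetSniffer.charsetFrom(connection.getContentType()))`.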

Mick Barry

And with the code I posted above, if you don't actually need the page as a String, just keep it as a byte array for processing.
In fact, you may be able to use the stream directly, e.g. if you're writing it to a file, then replace ByteArrayOutputStream with a FileOutputStream.

Let me know if you have any questions or problems with the code :)
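
The code referred to above is not visible in this thread, so the following is only a sketch of the general idea under that caveat: copy the raw bytes from the input stream to whatever `OutputStream` is convenient, without decoding them to characters at all. `StreamCopy` and `copy` are hypothetical names.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

class StreamCopy {

    // Copies every byte from in to out and returns the number of bytes copied.
    // No character decoding happens, so the page's encoding is left untouched.
    public static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    // Usage: java StreamCopy <url> <output file>
    public static void main(String[] args) throws IOException {
        URL url = new URL(args[0]);
        try (InputStream in = url.openStream();
             OutputStream out = new FileOutputStream(args[1])) {
            System.out.println("Copied " + copy(in, out) + " bytes");
        }
    }
}
```

Passing a `ByteArrayOutputStream` instead of the `FileOutputStream` keeps the page in memory as a byte array, which can be decoded later once the encoding is known.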

CEHJ

Depending on how you're doing your parsing, it may be better to load the page into a document object model and parse that. In that case you could use the NekoHTML parser or, for something simpler and more high-level, you could try HttpUnit.

CEHJ

:-)