Solved

Parsing HTML

Posted on 2004-04-20
Last Modified: 2010-03-31
Hi

I have HTML files on my sites which contain download links in a specific format, e.g.

<p align="center"><font color="#FFFF00"><strong>FS2004 Aircraft</strong><br>
        </font><font color="#87A9DC">Download: </font> <a href="http://www.@@@@.com/cgi-bin/download.pl?url=uploads04/apr/247hal.zip" target="main2">
        <u>247hal.zip</u></a><font
        color="#87A9DC"> (1936 KB)</font><img
        src="images2/uploads04/apr/247hal.jpg"
        align="right" hspace="0" width="189" height="73"></p>
        <p><font color="#FFFF00"><strong>Author:</strong></font>
        <font color="#87A9DC">Heather Sherman</font><br>
        <font color="#FFFF00"><strong>Date:</strong></font> <font
        color="#87A9DC">2004-04-19</font><br>
        <font color="#87A9DC" size="2" face="Times New Roman">FS2004 Boeing
        247D, Heather Aviation Ltd.<br>
        These are textures ***ONLY*** and are applied to any of Dee Waldron's
        models, this one in particular being the WhiteYellow Boeing 247,
        filename: boeing247yellowwhite.zip (which contains the entire model).
        This is an FS2002 model but this texture package also contains an
        updated aircraft.cfg file making it compatible for FS2004.</font></p>
        <hr color="#87A9DC">

This pattern is repeated for the rest of the downloads on the page. How would I use Java to get the URL link, the picture, and the rest of the details on the page? For example:

Output using System.out.println:
FS2004 Aircraft
247hal.zip
1936 KB
Heather Sherman
FS2004 Boeing
247D, Heather Aviation Ltd.<br>
These are textures ***ONLY*** and are applied to any of Dee Waldron's
models, this one in particular being the WhiteYellow Boeing 247,
filename: boeing247yellowwhite.zip (which contains the entire model).
This is an FS2002 model but this texture package also contains an
updated aircraft.cfg file making it compatible for FS2004.

------------
This is being done so I can turn the output into an SQL INSERT query.

Thanks in advance for any help
Akbar
Question by:akzah
17 Comments
 
LVL 92

Expert Comment

by:objects
ID: 10872598
Use HTMLEditorKit.
 
LVL 92

Accepted Solution

by:
objects earned 400 total points
ID: 10872608
 

Author Comment

by:akzah
ID: 10872802
Hi

Thanks for the link, but I don't want to get the links, just the data; the output contains no links.

Akbar
 
LVL 92

Expert Comment

by:objects
ID: 10872843
You can use similar techniques to extract whatever data you require.
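
To make that concrete, here is a minimal sketch of pulling text (rather than links) out of HTML with the parser that ships with Swing. It is an illustration only, not the code referenced in the accepted solution. The class name TextExtractor and the helper valueAfter are hypothetical names chosen for this example, and the sketch assumes the label text ("Author:") appears as its own text run, as it does in the sample markup from the question.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class for this example.
class TextExtractor {

    // Collects every non-blank text run the parser reports, in document order.
    static List<String> extractText(Reader html) throws IOException {
        final List<String> runs = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                String text = new String(data).trim();
                if (!text.isEmpty()) {
                    runs.add(text);
                }
            }
        };
        // true = ignore charset directives; the Reader already yields characters.
        new ParserDelegator().parse(html, callback, true);
        return runs;
    }

    // Returns the text run that follows the given label, or null if there is none.
    static String valueAfter(List<String> runs, String label) {
        int i = runs.indexOf(label);
        return (i >= 0 && i + 1 < runs.size()) ? runs.get(i + 1) : null;
    }

    public static void main(String[] args) throws IOException {
        String sample = "<p><font color=\"#FFFF00\"><strong>Author:</strong></font> "
                + "<font color=\"#87A9DC\">Heather Sherman</font><br></p>";
        List<String> runs = extractText(new StringReader(sample));
        System.out.println(valueAfter(runs, "Author:"));
    }
}
```

The same callback could be pointed at an InputStreamReader wrapped around a URL's input stream instead of a StringReader. Overriding handleStartTag in the callback would additionally give access to attributes such as href or src if the link and picture are wanted.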
 

Author Comment

by:akzah
ID: 10872907
Can you explain how I can change that code to get the author's name by itself?

Also, the code doesn't run; it doesn't seem to recognise HTMLDocument doc = new HTMLDocument(), even when I import javax.swing.*;

Akbar
 
LVL 92

Expert Comment

by:objects
ID: 10872933
import javax.swing.text.html.*;
 

Author Comment

by:akzah
ID: 10873319
The Java code still doesn't compile. I have added the following imports:
import javax.swing.text.html.*;
import java.io.*;
import java.net.*;
import javax.swing.text.EditorKit;

and it has trouble on } catch (BadLocationException e) {.

I just want to get specific data back from the HTML, not all of it.

Thanks

Akbar
 
LVL 92

Expert Comment

by:objects
ID: 10873349
What are the errors exactly?
You need to add that method to your class.
 

Author Comment

by:akzah
ID: 10873405
The error from BlueJ says "cannot resolve symbol" on that line.

If I fetched the HTML file, saved its contents to a text file, and then read it back as a single string, would I be able to split it up, taking each part from "<p align="center">" to "<hr color="#87A9DC">"?

Thanks again

Akbar
 
LVL 92

Expert Comment

by:objects
ID: 10873424
import javax.swing.text.*;
 

Author Comment

by:akzah
ID: 10873472
Well, it compiles, though whatever URL I try, it returns an empty String.

I am totally lost on which way to go about it. There must be a way to search a string and return certain parts of it?

Akbar
 
LVL 92

Expert Comment

by:objects
ID: 10873497
You can parse the string manually if you like, but it's a lot simpler to use an existing parser, isn't it?
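
For comparison, the manual route might look like the sketch below. It assumes the whole page is already in a String and that every download entry starts with <p align="center"> and ends at <hr color="#87A9DC">, as in the sample from the question. The class and method names are hypothetical, and the regex-based tag stripper is deliberately crude: it does not decode entities such as &amp; and would be confused by a ">" inside an attribute value.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for this example.
class RecordSplitter {
    // Delimiters taken from the sample markup in the question.
    static final String START = "<p align=\"center\">";
    static final String END = "<hr color=\"#87A9DC\">";

    // Returns the raw HTML of each entry found between START and END.
    static List<String> splitRecords(String page) {
        List<String> records = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = page.indexOf(START, from);
            if (start < 0) {
                break;
            }
            int end = page.indexOf(END, start);
            if (end < 0) {
                break;
            }
            records.add(page.substring(start, end));
            from = end + END.length();
        }
        return records;
    }

    // Replaces every tag with a space, then collapses runs of whitespace.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String page = START + "<strong>FS2004 Aircraft</strong><br>" + END
                + START + "<strong>Another entry</strong>" + END;
        for (String record : splitRecords(page)) {
            System.out.println(stripTags(record));
        }
    }
}
```

This works only as long as the page layout never changes, which is the main argument for using a real parser instead.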
 
LVL 23

Assisted Solution

by:rama_krishna580
rama_krishna580 earned 400 total points
ID: 10873984
Hi, you can try either of the options below:

1. http://java.sun.com/products/jfc/tsc/articles/bookmarks/
2. You can look at this code:

// Nuti Rama Krishna - Compiled on 2000-02-21

import java.io.*;
import java.net.*;
import java.util.*;
import javax.swing.text.*;
import javax.swing.text.html.*;

public class HTMLDocumentLoader {
      public HTMLDocument loadDocument(HTMLDocument doc,
                                                      URL url, String charSet)
                                                      throws IOException {
            doc.putProperty(Document.StreamDescriptionProperty, url);

            /*
             * This loop allows the document read to be retried if
             * the character encoding changes during processing.
             */
            InputStream in = null;
            boolean ignoreCharSet = false;

            for (;;) {
                  try {
                        // Remove any document content
                        doc.remove(0, doc.getLength());
                        
                        URLConnection urlc = url.openConnection();
                        in = urlc.getInputStream();
                        Reader reader = (charSet == null) ?
                                                new InputStreamReader(in) :
                                                new InputStreamReader(in, charSet);

                        HTMLEditorKit.Parser parser = getParser();
                        HTMLEditorKit.ParserCallback htmlReader = getParserCallback(doc);
                        parser.parse(reader, htmlReader, ignoreCharSet);
                        htmlReader.flush();
                        
                        // All done
                        break;
                  } catch (BadLocationException ex) {
                        // Should not happen - throw an IOException
                        throw new IOException(ex.getMessage());
                  } catch (ChangedCharSetException e) {
                        // The character set has changed - restart
                        charSet = getNewCharSet(e);

                        // Prevent recursion by suppressing further exceptions
                        ignoreCharSet = true;

                        // Close original input stream
                        in.close();

                        // Continue the loop to read with the correct encoding
                  }
            }

            return doc;
      }

      public HTMLDocument loadDocument(URL url, String charSet)
                                                      throws IOException {
            return loadDocument((HTMLDocument)kit.createDefaultDocument(), url, charSet);
      }

      public HTMLDocument loadDocument(URL url) throws IOException {
            return loadDocument(url, null);
      }
      
      // Methods that allow customization of the parser and the callback
      public synchronized HTMLEditorKit.Parser getParser() {
            if (parser == null) {
                  // ParserDelegator is the JDK's standard HTML parser. Create it
                  // directly instead of reflectively, so that a failure is not
                  // silently swallowed and a null parser returned.
                  parser = new javax.swing.text.html.parser.ParserDelegator();
            }
            return parser;
      }

      public synchronized HTMLEditorKit.ParserCallback getParserCallback(
                                          HTMLDocument doc) {
            return doc.getReader(0);
      }

      protected String getNewCharSet(ChangedCharSetException e) {
            String spec = e.getCharSetSpec();
            if (e.keyEqualsCharSet()) {
                  // The event contains the new CharSet
                  return spec;
            }
            
            // The event contains the content type
            // plus ";" plus qualifiers which may
            // contain a "charset" directive. First
            // remove the content type.
            int index = spec.indexOf(";");
            if (index != -1) {
                  spec = spec.substring(index + 1);
            }
            
            // Force the string to lower case
            spec = spec.toLowerCase();

            StringTokenizer st = new StringTokenizer(spec, " \t=", true);
            boolean foundCharSet = false;
            boolean foundEquals = false;
            while (st.hasMoreTokens()) {
                  String token = st.nextToken();
                  if (token.equals(" ") || token.equals("\t")) {
                        continue;
                  }
                  if (foundCharSet == false && 
                              foundEquals == false &&
                              token.equals("charset")) {
                        foundCharSet = true;
                        continue;
                  } else if (foundEquals == false && 
                        token.equals("=")) {
                        foundEquals = true;
                        continue;
                  } else if (foundEquals == true &&
                        foundCharSet == true) {
                        return token;
                  }

                  // Not recognized
                  foundCharSet = false;
                  foundEquals = false;
            }

            // No charset found - return a guess
            return "8859_1";
      }
      
      protected static HTMLEditorKit kit;      
      protected static HTMLEditorKit.Parser parser;

      static {
            kit = new HTMLEditorKit();
      }
}
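
Once a document has been loaded, its plain text can be read back with getText. The sketch below is independent of the HTMLDocumentLoader class above and assumes the HTML is already available as a String. The class name DocumentTextDemo is hypothetical. It sets the "IgnoreCharsetDirective" document property so that a charset declaration inside the HTML does not trigger a ChangedCharSetException part-way through the read.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical helper class for this example.
class DocumentTextDemo {

    // Parses the HTML into an HTMLDocument and returns its text content.
    static String plainText(String html) throws IOException, BadLocationException {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        // The input is already decoded, so ignore any charset directive in the markup.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        kit.read(new StringReader(html), doc, 0);
        return doc.getText(0, doc.getLength());
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body><p>Author: Heather Sherman</p></body></html>";
        System.out.println(plainText(html).trim());
    }
}
```

The returned text has all markup removed, so picking out individual fields such as the author still needs a further step, for example searching for the "Author:" label.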

Best of luck....
R.K.
 

Author Comment

by:akzah
ID: 10877426
Thanks for your help, guys, though both ways seem too complicated for me. I have put up another post using the string method. I will close this question soon, but am leaving it open for now in case I get further replies.

Akbar