Write an IE system in Java

Posted on 2004-11-29
Last Modified: 2010-04-06
How would I write an information extraction system in Java? I'm not sure what the best way to go about it would be, and I'm wondering which techniques work best. The system will get its information from websites found by searching the net and download it (probably storing it in a database). Would that be done using a web crawler, or some other way?
Any sample code for reference would be very helpful.
Question by: shamshanaghy

    Accepted Solution

    You need to connect to a URL, parse the returned HTML for the links present in the page, and then add those links to a list of pages to visit.

    Here's a method that gets the content of a URL as a string:

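    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // Returns the body of pURL as a String, or an error message if the request fails.
    public static String getURL( final String pURL ){
        StringBuilder sb = new StringBuilder();
        try {
            URL u = new URL( pURL );
            HttpURLConnection huc = (HttpURLConnection) u.openConnection();
            huc.setRequestMethod( "GET" );
            try (BufferedReader br = new BufferedReader(new InputStreamReader(huc.getInputStream()))) {
                int b;
                // read one character at a time until the end of the stream
                while ((b = br.read()) != -1) {
                    sb.append( (char) b );
                }
            } finally {
                // disconnect HttpURLConnection
                huc.disconnect();
            }
        } catch (IOException e){
            return "Unable to open connection: " + e.getMessage();
        }
        return sb.toString();
    }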

    It is then just a question of parsing this String for instances of <a href="xxx" ...> and adding each link found to the database as a new link to visit.
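
    As a rough sketch of that parsing step, the class below pulls href values out of <a> tags with a regular expression and resolves relative links against the page's own URL. The names LinkExtractor and extractLinks are made up for illustration, and a regex is only an approximation: it will miss unquoted attributes and can be fooled by unusual markup, so a real HTML parser would be more robust.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class LinkExtractor {
    // Matches href="..." or href='...' inside an <a ...> tag, ignoring case.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE);

    /** Returns the absolute http/https links found in html, in document order. */
    static List<String> extractLinks(String html, String pageUrl) {
        List<String> links = new ArrayList<>();
        URI base = URI.create(pageUrl);
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String href = m.group(2).trim();
            try {
                // Resolve relative links such as "/y" against the page URL.
                URI resolved = base.resolve(href);
                String scheme = resolved.getScheme();
                // Skip mailto:, javascript: and other non-web links.
                if ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme)) {
                    links.add(resolved.toString());
                }
            } catch (IllegalArgumentException e) {
                // href is not a valid URI (e.g. contains spaces); ignore it.
            }
        }
        return links;
    }
}
```

    You would call it with the String returned by getURL and the URL that was fetched, for example LinkExtractor.extractLinks(page, "http://example.com/index.html").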

    What you could do is crawl a limited number of websites, store the page String returned by the above method in the database, and then search the stored pages for a word.
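
    To make that idea concrete, here is a minimal sketch of a bounded crawl with a word search over the stored pages. Several things in it are assumptions for illustration only: the class name MiniCrawler, the in-memory Map standing in for the database, and the fetch parameter, which is any function from URL to page content (the getURL method above would fit). The link pattern is deliberately simplified to absolute links in double-quoted href attributes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not part of the original answer.
class MiniCrawler {
    // Simplified: only absolute http(s) links in double-quoted href attributes.
    private static final Pattern LINK =
            Pattern.compile("href=\"(https?://[^\"]+)\"", Pattern.CASE_INSENSITIVE);

    // Stand-in for the database table: URL -> page content.
    private final Map<String, String> pages = new LinkedHashMap<>();

    /** Breadth-first crawl starting at seed, storing at most maxPages pages. */
    void crawl(String seed, int maxPages, Function<String, String> fetch) {
        Deque<String> toVisit = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        toVisit.add(seed);
        seen.add(seed);
        while (!toVisit.isEmpty() && pages.size() < maxPages) {
            String url = toVisit.poll();
            String html = fetch.apply(url);
            if (html == null) {
                continue; // nothing came back for this URL
            }
            pages.put(url, html);
            Matcher m = LINK.matcher(html);
            while (m.find()) {
                String link = m.group(1);
                if (seen.add(link)) { // queue each link only once
                    toVisit.add(link);
                }
            }
        }
    }

    /** Returns the URLs of stored pages containing word (case-insensitive). */
    List<String> search(String word) {
        List<String> hits = new ArrayList<>();
        String needle = word.toLowerCase();
        for (Map.Entry<String, String> e : pages.entrySet()) {
            if (e.getValue().toLowerCase().contains(needle)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }
}
```

    The page limit keeps the crawl from running away, and the seen set stops pages that link to each other from being fetched repeatedly. In a real system the Map would be replaced by database inserts and the search by a query.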

    Any questions?
