

write an IE system in java

Posted on 2004-11-29
Last Modified: 2010-04-06
How would I write an information extraction system in Java? I'm not sure what the best way to go about it would be, and I'm wondering what techniques work best. The system will get the information from searched websites on the net and download it (probably storing it in a database). Would that be done using a web crawler, or how?
Any sample code for reference would be very helpful.
Question by: shamshanaghy

Accepted Solution

bloodredsun earned 2000 total points
ID: 12907992
You need to connect to a URL, parse the HTML for the links present in the page, and then add those links to a list of pages to visit.

Here's a method that gets the content of a URL as a string:

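import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public static String getURL( final String pURL ){
    StringBuffer sb = new StringBuffer();
    try {
        URL u = new URL( pURL );
        HttpURLConnection huc = (HttpURLConnection) u.openConnection();
        huc.setRequestMethod( "GET" );
        BufferedReader br = new BufferedReader(new InputStreamReader(huc.getInputStream()));

        // read the response one character at a time and collect it
        int b = 0;
        while((b = br.read()) != -1) {
            sb.append( (char) b );
        }
        br.close();
        // disconnect HttpURLConnection
        huc.disconnect();
    }
    catch (IOException e){
        return "Unable to open connection: " + e.getMessage();
    }
    return sb.toString();
}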

It is then just a question of parsing this String for instances of <a href="xxx" ...> and adding the link targets to a database of new links to visit.
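
As a rough sketch of that link-parsing step, assuming a regular expression is good enough for reasonably well-formed pages (a proper HTML parser would cope better with messy markup), something like the following would pull out the href values. The class name LinkExtractor and the method extractLinks are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractor {
    // Matches href="..." or href='...' inside an <a ...> tag, ignoring case.
    // Group 1 is the quote character, group 2 is the link target.
    private static final Pattern HREF = Pattern.compile(
        "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    // Returns every link target found in the page, in document order.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<String>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        String page = "<p><a href=\"http://example.com/a\">A</a> <A HREF='/b'>B</A></p>";
        System.out.println(extractLinks(page));
    }
}
```

Note that relative links such as /b come back exactly as written, so they would still need resolving against the page's own URL before being fetched.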

What you could do is crawl a limited number of websites, store the page String returned from the above method in the database, and then search the database for a word.
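
To make the limited crawl and word search a bit more concrete, here is a minimal sketch under some loud assumptions: an in-memory Map stands in for the database, the page fetcher and link finder are passed in as functions (so getURL and whatever link parser you write can be plugged in), and links are assumed to be absolute URLs already. MiniCrawler, crawl and search are hypothetical names, not part of any library.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.function.Function;

class MiniCrawler {
    // url -> page text; a stand-in for the database table (assumption).
    private final Map<String, String> store = new LinkedHashMap<>();

    // Breadth-first crawl from startUrl that stops once maxPages pages are stored.
    void crawl(String startUrl, int maxPages,
               Function<String, String> fetcher,
               Function<String, List<String>> linkFinder) {
        Queue<String> toVisit = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        toVisit.add(startUrl);
        seen.add(startUrl);
        while (!toVisit.isEmpty() && store.size() < maxPages) {
            String url = toVisit.remove();
            String page = fetcher.apply(url);
            if (page == null) continue;            // nothing fetched, skip it
            store.put(url, page);
            for (String link : linkFinder.apply(page)) {
                if (seen.add(link)) {              // true only for unseen links
                    toVisit.add(link);
                }
            }
        }
    }

    // Returns the URLs of stored pages containing the word, ignoring case.
    List<String> search(String word) {
        List<String> hits = new ArrayList<>();
        String needle = word.toLowerCase();
        for (Map.Entry<String, String> e : store.entrySet()) {
            if (e.getValue().toLowerCase().contains(needle)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Fake two-page "web" so the sketch runs without a network connection.
        Map<String, String> pages = new HashMap<>();
        pages.put("http://example.com/", "Java tutorials -> http://example.com/b");
        pages.put("http://example.com/b", "Cooking recipes");

        MiniCrawler c = new MiniCrawler();
        c.crawl("http://example.com/", 10,
                pages::get,
                page -> page.contains("->")
                        ? Arrays.asList(page.substring(page.indexOf("->") + 3))
                        : Collections.<String>emptyList());
        System.out.println(c.search("java"));
    }
}
```

With a real database you would replace the Map with an insert per page and the search method with a query, but the visit list and the page limit would work the same way.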

Any questions?
