  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 171

Efficient and Quick way to extract links

Hi,

Is there an efficient way to extract links from HTML pages? I have to extract links from hundreds of web pages.

Currently I am doing it as shown in http://javaalmanac.com/egs/javax.swing.text.html/GetLinks.html

Code would be highly appreciated.
Asked by: sumantedla
1 Solution
 
aozarov Commented:
Another way to do it is with HttpUnit: http://httpunit.sourceforge.net/doc/cookbook.html
I can't tell you which one performs better, but if HTMLEditorKit doesn't perform well enough you might want to give it a try and compare.
You can also do it with string matching (search for href); see: http://moguntia.ucd.ie/programming/webcrawler/
Code is provided, and the class that extracts the links is: http://moguntia.ucd.ie/programming/webcrawler/src/ie/moguntia/webcrawler/SaveURL.java
 
aozarov Commented:
The benefit of searching for the href pattern is that you don't need to parse and analyze the whole HTML document (which probably takes time). But if you do it yourself you need to be aware of many factors, such as links inside an HTML comment, JavaScript-based links, ...
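
As a rough illustration of the href search, here is a minimal sketch. The names HrefScanner, stripComments and extractLinks are made up for this example and are not taken from the linked crawler. It assumes quoted attribute values, removes HTML comments first so that links inside them are not reported, and makes no attempt to handle JavaScript-based links. It will also pick up href attributes on tags other than <a>, such as <link>.

```java
import java.util.ArrayList;
import java.util.List;

class HrefScanner {

    /** Removes <!-- ... --> blocks so links inside comments are not reported. */
    static String stripComments(String html) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (true) {
            int start = html.indexOf("<!--", pos);
            if (start < 0) {
                out.append(html, pos, html.length());
                return out.toString();
            }
            out.append(html, pos, start);
            int end = html.indexOf("-->", start + 4);
            if (end < 0) {
                return out.toString(); // unterminated comment: drop the rest
            }
            pos = end + 3;
        }
    }

    /** Case-insensitive indexOf that never changes string offsets. */
    static int indexOfIgnoreCase(String s, String needle, int from) {
        for (int i = from; i + needle.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    /** Returns the values of quoted href attributes, in document order. */
    static List<String> extractLinks(String html) {
        String text = stripComments(html);
        List<String> links = new ArrayList<>();
        int pos = 0;
        while ((pos = indexOfIgnoreCase(text, "href", pos)) >= 0) {
            int i = pos + 4;
            pos = i; // whatever happens next, resume scanning after this "href"
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) i++;
            if (i >= text.length() || text.charAt(i) != '=') continue;
            i++;
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) i++;
            if (i >= text.length()) break;
            char quote = text.charAt(i);
            if (quote != '"' && quote != '\'') continue; // unquoted values are skipped in this sketch
            int end = text.indexOf(quote, i + 1);
            if (end < 0) break;
            links.add(text.substring(i + 1, end));
            pos = end + 1;
        }
        return links;
    }
}
```

Calling HrefScanner.extractLinks(html) on a page held in a String returns the raw attribute values; relative links would still need to be resolved against the page URL.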
 
Naeemg Commented:
Yes, search your file for every string that contains "http://", then find its length, and finally extract the whole URL, and so on.
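
A minimal sketch of that "http://" scan, under some assumptions of my own: the class and method names (HttpUrlScanner, findUrls, isDelimiter) are hypothetical, and a URL is taken to end at the first whitespace, quote or angle bracket. Relative links and https:// links are not found by this approach.

```java
import java.util.ArrayList;
import java.util.List;

class HttpUrlScanner {

    private static final String PREFIX = "http://";

    /** Returns every substring that starts with "http://" and runs up to the next delimiter. */
    static List<String> findUrls(String text) {
        List<String> urls = new ArrayList<>();
        int start = 0;
        while ((start = text.indexOf(PREFIX, start)) >= 0) {
            int end = start + PREFIX.length();
            // Advance to the end of the URL; the delimiter set is an assumption of this sketch.
            while (end < text.length() && !isDelimiter(text.charAt(end))) {
                end++;
            }
            urls.add(text.substring(start, end));
            start = end;
        }
        return urls;
    }

    /** Characters assumed to terminate a URL inside HTML or plain text. */
    static boolean isDelimiter(char c) {
        return Character.isWhitespace(c) || c == '"' || c == '\'' || c == '<' || c == '>';
    }
}
```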
 
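bloodredsun Commented:
Rather than using indexOf, would it not be better to use a capturing regular expression? You could use a pattern along the lines of "\\<a href=\"Q_([\\p{ASCII}]*?)</a>", as in this very basic example...

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// This variant captures only the double-quoted href value of each <a> tag.
// Pattern.MULTILINE is not needed because the pattern contains no ^ or $.
Pattern pattern = Pattern.compile( "<a\\s[^>]*?href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE );
Matcher matcher = pattern.matcher( pHTML ); // where pHTML is your HTML
while ( matcher.find() ) {
    String strCapturedLink = matcher.group(1); // group(0) is the whole match, group(1) the captured URL
    //etc
}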

I sometimes use something similar to keep a watch on sites to see whether they have been updated.
 
sumantedla (Author) Commented:
Can I do anything to minimise the time spent blocking on each request for a web page?
 
aozarov Commented:
You can use asynchronous (non-blocking) I/O with Selectors, using NIO (provided since Java 1.4): http://www.javaalmanac.com/cgi-bin/search/find.pl?words=nio
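
To make the Selector idea concrete, here is a rough sketch that fetches several pages on a single thread, so no request waits for another to finish. Everything in it is an assumption for illustration: the names NioFetcher, Fetch and fetchAll are hypothetical, it speaks plain HTTP/1.0 only (no https, no redirects), it returns the raw response including headers, and it has no timeout, so a server that never answers would stall the loop. Host name resolution in InetSocketAddress is still blocking.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class NioFetcher {

    /** Per-connection state: the pending request bytes and the response received so far. */
    private static final class Fetch {
        final ByteBuffer request;
        final ByteArrayOutputStream response = new ByteArrayOutputStream();

        Fetch(URI uri) {
            String raw = uri.getRawPath();
            String path = (raw == null || raw.isEmpty()) ? "/" : raw;
            String req = "GET " + path + " HTTP/1.0\r\nHost: " + uri.getHost()
                    + "\r\nConnection: close\r\n\r\n";
            request = ByteBuffer.wrap(req.getBytes(StandardCharsets.US_ASCII));
        }
    }

    /** Fetches all URIs concurrently; returns raw responses in input order (empty on failure). */
    static List<String> fetchAll(List<URI> uris) throws IOException {
        List<Fetch> fetches = new ArrayList<>();
        try (Selector selector = Selector.open()) {
            for (URI uri : uris) {
                Fetch fetch = new Fetch(uri);
                fetches.add(fetch);
                SocketChannel ch = SocketChannel.open();
                ch.configureBlocking(false);
                SelectionKey key = ch.register(selector, 0, fetch);
                int port = uri.getPort() < 0 ? 80 : uri.getPort();
                // connect() may complete immediately, e.g. for local addresses
                boolean connected = ch.connect(new InetSocketAddress(uri.getHost(), port));
                key.interestOps(connected ? SelectionKey.OP_WRITE : SelectionKey.OP_CONNECT);
            }
            int open = uris.size();
            ByteBuffer buf = ByteBuffer.allocate(8192);
            while (open > 0) {
                selector.select(); // blocks until at least one channel is ready
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    SocketChannel ch = (SocketChannel) key.channel();
                    Fetch fetch = (Fetch) key.attachment();
                    try {
                        if (key.isConnectable() && ch.finishConnect()) {
                            key.interestOps(SelectionKey.OP_WRITE);
                        } else if (key.isWritable()) {
                            ch.write(fetch.request); // may be a partial write
                            if (!fetch.request.hasRemaining()) {
                                key.interestOps(SelectionKey.OP_READ);
                            }
                        } else if (key.isReadable()) {
                            buf.clear();
                            int n = ch.read(buf);
                            if (n < 0) { // server closed the connection: response complete
                                ch.close();
                                open--;
                            } else {
                                fetch.response.write(buf.array(), 0, n);
                            }
                        }
                    } catch (IOException e) {
                        ch.close(); // give up on this page, keep whatever was read
                        open--;
                    }
                }
            }
        }
        List<String> pages = new ArrayList<>();
        for (Fetch fetch : fetches) {
            pages.add(new String(fetch.response.toByteArray(), StandardCharsets.ISO_8859_1));
        }
        return pages;
    }
}
```

The returned strings can then be passed to whichever link extraction method you choose. A pool of worker threads doing ordinary blocking reads is a simpler alternative that may be fast enough for a few hundred pages.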
