Parsing an HTML document for A HREF tags in Java

Posted on 2003-02-18
Medium Priority
Last Modified: 2009-07-29
How do we parse an HTML document to extract only the hyperlinks (A HREF tags) in Java? Example code would be appreciated.
Thanks in advance.
Question by:java_fan

Expert Comment

ID: 7979858
If the HTML document is well formed, you can treat it as an XML document. Then use a stylesheet to extract the 'A' tags.
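A rough sketch of that idea, using an XPath query from the JDK's built-in XML APIs in place of a full stylesheet (a swap made here to keep the example short). The class name XhtmlLinks and method name hrefs are made up for illustration. It assumes the page really is well-formed XML, uses lowercase tag and attribute names, and declares no default namespace and no external DTD; ordinary real-world HTML often fails these assumptions and will be rejected by the parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: not from the thread, names are illustrative only.
class XhtmlLinks {

    // Returns the href attribute of every <a> element in a well-formed document.
    static List<String> hrefs(String wellFormedHtml) {
        try {
            // Parse the markup as XML; this throws if the page is not well formed.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedHtml)));

            // XML is case-sensitive, so this matches <a href=...> but not <A HREF=...>.
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//a/@href", doc, XPathConstants.NODESET);

            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getNodeValue());
            }
            return result;
        } catch (Exception e) {
            // Assumption: callers treat unparsable input as a bad argument.
            throw new IllegalArgumentException("Document is not well-formed XML", e);
        }
    }
}
```

Calling XhtmlLinks.hrefs(page) on a page that meets the assumptions above gives back the link targets as plain strings.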
Hope this helps

Accepted Solution

bworm3002 earned 150 total points
ID: 7979871
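1. Parse the link:
// Note: indexOf returns -1 when the page has no "<a href=", so check for that before using index1.
int index1 = theWebpage.indexOf("<a href=") + 8;  // skip past "<a href=" (8 characters)
int index2 = theWebpage.indexOf(">", index1);     // end of the opening tag
String theLink = theWebpage.substring(index1, index2);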

You may want to check whether the page uses capital letters instead of lowercase ones, i.e. <A HREF= rather than <a href=.

2. Some links may be enclosed in single or double quotes, and you need to remove them:
if (theLink.length() >= 2 && (theLink.startsWith("\"") || theLink.startsWith("'"))) {
  // drop the opening and closing quote
  theLink = theLink.substring(1, theLink.length()-1);
}

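Then you should be able to extract a link from a webpage. You can loop through the webpage and extract all the links you want. This is a quick and dirty way to extract a link; it assumes href is the first attribute of the tag and that nothing else follows it before the closing ">".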
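The loop could look roughly like this. It is only a sketch of the indexOf approach above, not a real HTML parser: the class name LinkExtractor and the helper indexOfIgnoreCase are invented for the example, and it still assumes the tag is written as "<a href=" with a single space and with href as the first attribute. It does match capital letters, strips either kind of quote, and drops any attributes that follow the link.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class: names are illustrative, not from the answer above.
class LinkExtractor {

    // Case-insensitive indexOf, so "<A HREF=" is found as well as "<a href=".
    static int indexOfIgnoreCase(String text, String needle, int from) {
        for (int i = from; i <= text.length() - needle.length(); i++) {
            if (text.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    // Returns every href value found in the page, in the order they appear.
    static List<String> extractLinks(String theWebpage) {
        List<String> links = new ArrayList<>();
        String marker = "<a href=";
        int pos = 0;
        while (true) {
            int start = indexOfIgnoreCase(theWebpage, marker, pos);
            if (start < 0) {
                break;                       // no more links
            }
            int index1 = start + marker.length();
            int index2 = theWebpage.indexOf('>', index1);
            if (index2 < 0) {
                break;                       // unterminated tag, give up
            }
            String theLink = theWebpage.substring(index1, index2).trim();
            if (!theLink.isEmpty()
                    && (theLink.charAt(0) == '"' || theLink.charAt(0) == '\'')) {
                // Quoted value: keep the text up to the matching closing quote.
                int close = theLink.indexOf(theLink.charAt(0), 1);
                theLink = close > 0 ? theLink.substring(1, close) : theLink.substring(1);
            } else {
                // Unquoted value: it ends at the first space, if any.
                int space = theLink.indexOf(' ');
                if (space >= 0) {
                    theLink = theLink.substring(0, space);
                }
            }
            links.add(theLink);
            pos = index2 + 1;                // continue after this tag
        }
        return links;
    }
}
```

A quoted link that itself contains ">" would still be cut short, so for anything beyond a quick script a proper HTML parser is the safer choice.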
