Parsing an HTML document for A HREF tags in Java

Posted on 2003-02-18
Medium Priority
Last Modified: 2009-07-29
How do we parse an HTML document to extract only the hyperlinks (A HREF tags) in Java? Example code would be appreciated.
Thanks in advance.
Question by: java_fan

Expert Comment

ID: 7979858
If the HTML document is well formed (for example, XHTML), you can treat it as an XML document. Then use an XSLT stylesheet to extract the 'A' tags.
Hope this helps.
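A minimal sketch of that stylesheet idea, assuming the page really is well-formed XML and uses no default XHTML namespace (with a namespace declared, the `//a` match below would need a prefix). The class name `XhtmlLinks` and the method `extractHrefs` are made up for illustration; the calls themselves are the standard `javax.xml.transform` API shipped with the JDK. Note that a page with a DOCTYPE may cause the parser to try to fetch the DTD, which this sketch does not handle.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper class; the names are illustrative, not from the thread.
class XhtmlLinks {
    // Stylesheet that prints the href of every <a> element, one per line.
    private static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/'>"
      + "<xsl:for-each select='//a[@href]'>"
      + "<xsl:value-of select='@href'/><xsl:text>&#10;</xsl:text>"
      + "</xsl:for-each>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    static List<String> extractHrefs(String xhtml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xhtml)),
                                  new StreamResult(out));
            // Split on any line break, since the serializer may emit the
            // platform line separator instead of a bare "\n".
            List<String> hrefs = new ArrayList<>();
            for (String line : out.toString().split("\\R")) {
                if (!line.isEmpty()) {
                    hrefs.add(line);
                }
            }
            return hrefs;
        } catch (TransformerException e) {
            // Thrown when the input (or the stylesheet) is not well-formed XML.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><a href=\"a.html\">A</a> "
                    + "<a href='b.html'>B</a></body></html>";
        System.out.println(extractHrefs(page));
    }
}
```

Ordinary HTML from the web is often not well formed (unclosed tags, unquoted attributes), and this approach will throw on such input rather than return partial results.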

Accepted Solution

bworm3002 earned 150 total points
ID: 7979871
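1. Parse the link (theWebpage is a String holding the page's HTML):
// indexOf returns -1 when "<a href=" is not found, so check for that in real code
int index1 = theWebpage.indexOf("<a href=") + 8;  // 8 = length of "<a href="
int index2 = theWebpage.indexOf(">", index1);      // end of the opening tag
String theLink = theWebpage.substring(index1, index2);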

You may also want to check for capital letters instead of small letters, e.g. (<A HREF=), since HTML tag and attribute names are not case-sensitive.

2. Some links may be enclosed in single or double quotes; you need to remove them:
if (theLink.length() > 1 && (theLink.startsWith("\"") || theLink.startsWith("'"))) {
  theLink = theLink.substring(1, theLink.length() - 1);  // drop first and last character
}

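That gives you one link from the page. To extract all of them, loop through the page, starting each search where the previous one ended (indexOf takes a start position as its second argument). This is a quick and dirty way to extract a link: it misses links where href is not the first attribute, and it keeps any attributes that follow the URL inside the tag.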
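The steps above can be put into a loop roughly like this. It is only a sketch of the same indexOf approach, not a real HTML parser: `LinkExtractor`, `extractLinks`, `indexOfIgnoreCase` and `cleanValue` are names made up for this example. It still assumes the tag is written as `<a href=` with a single space and href as the first attribute, and that the URL contains no `>` character.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; the names are illustrative, not from the thread.
class LinkExtractor {
    private static final String MARKER = "<a href=";

    // Case-insensitive indexOf, so that <A HREF= is found as well.
    static int indexOfIgnoreCase(String text, String needle, int from) {
        for (int i = from; i <= text.length() - needle.length(); i++) {
            if (text.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    // Strips surrounding quotes and anything after the attribute value.
    static String cleanValue(String raw) {
        String v = raw.trim();
        if (v.isEmpty()) {
            return v;
        }
        char q = v.charAt(0);
        if (q == '"' || q == '\'') {
            int close = v.indexOf(q, 1);   // matching closing quote
            return close == -1 ? v.substring(1) : v.substring(1, close);
        }
        // Unquoted value: it ends at the first whitespace character.
        for (int i = 0; i < v.length(); i++) {
            if (Character.isWhitespace(v.charAt(i))) {
                return v.substring(0, i);
            }
        }
        return v;
    }

    static List<String> extractLinks(String theWebpage) {
        List<String> links = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = indexOfIgnoreCase(theWebpage, MARKER, from);
            if (start == -1) {
                break;                     // no more links
            }
            int index1 = start + MARKER.length();
            int index2 = theWebpage.indexOf('>', index1);
            if (index2 == -1) {
                break;                     // unterminated tag, stop here
            }
            links.add(cleanValue(theWebpage.substring(index1, index2)));
            from = index2 + 1;             // continue after this tag
        }
        return links;
    }

    public static void main(String[] args) {
        String page = "<a href=\"one.html\">1</a> <A HREF='two.html' target=_blank>2</A>";
        System.out.println(extractLinks(page));
    }
}
```

Cutting the value at the closing quote (or at the first whitespace for unquoted values) avoids carrying trailing attributes such as target into the link. For anything beyond a quick script, a proper HTML parser is the safer choice.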

Question has a verified solution.
