Efficient and Quick way to extract links

Posted on 2005-04-06
Last Modified: 2010-03-31

Is there an efficient way to extract links from HTML pages? I have to extract links from hundreds of web pages.

Now I am doing it as in

Code would be highly appreciated.
Question by: sumantedla

    Expert Comment

    Another way to do it is with HttpUnit:
    I can't tell you which one performs better, but if HTMLEditorKit doesn't perform well enough you might want to give it a try and compare.
    You can also do it with string matching (searching for href); see:
    Code is provided, and the class that handles extracting the links is:

    Expert Comment

    The benefit of searching for the href pattern is that you don't need to parse and analyze the whole HTML document (which probably takes time). But if you do it yourself, you need to be aware of many factors, such as links inside an HTML comment, JavaScript-based links, ...
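
    A minimal sketch of the href-scanning idea described above, for comparison. The class and method names (HrefScanner, extractHrefs) are made up for illustration, and it assumes the attribute is written as href= with a quoted value. It skips matches inside HTML comments but does nothing special for JavaScript links; those come back as-is (e.g. "javascript:...") and would need filtering by the caller.

```java
import java.util.ArrayList;
import java.util.List;

public class HrefScanner {

    /** Case-insensitive indexOf, so HREF= and href= are both found. */
    private static int indexOfIgnoreCase(String s, String needle, int from) {
        for (int i = from; i <= s.length() - needle.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    /** Returns quoted href values, ignoring anything inside <!-- ... --> comments. */
    public static List<String> extractHrefs(String html) {
        List<String> links = new ArrayList<String>();
        int pos = 0;
        while (true) {
            int href = indexOfIgnoreCase(html, "href=", pos);
            if (href < 0) {
                break;
            }
            // Is the nearest preceding "<!--" still open at this position?
            int commentStart = html.lastIndexOf("<!--", href);
            if (commentStart >= 0) {
                int commentEnd = html.indexOf("-->", commentStart);
                if (commentEnd < 0) {
                    break;                      // unterminated comment: the rest is commented out
                }
                if (commentEnd > href) {
                    pos = commentEnd + 3;       // jump past the comment
                    continue;
                }
            }
            int valueStart = href + "href=".length();
            if (valueStart >= html.length()) {
                break;
            }
            char quote = html.charAt(valueStart);
            if (quote != '"' && quote != '\'') {
                pos = valueStart;               // unquoted values are skipped in this sketch
                continue;
            }
            int valueEnd = html.indexOf(quote, valueStart + 1);
            if (valueEnd < 0) {
                break;
            }
            links.add(html.substring(valueStart + 1, valueEnd));
            pos = valueEnd + 1;
        }
        return links;
    }
}
```

    Because it never builds a document tree, this makes a single pass over the text, but it will also pick up href attributes on tags other than <a> (for example <link>).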

    Expert Comment

    Yes, search your file for every string that contains "http://", then get its length, and finally extract the whole URL, and so on.

    Expert Comment

    Rather than using indexOf, would it not be better to use a capturing regular expression? You could use a pattern along the lines of "\\<a href=\"Q_([\\p{ASCII}]*?)</a>", as in this very basic example:

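    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    Pattern pattern = Pattern.compile(  "\\<a href=\"([\\p{ASCII}]*?)</a>", Pattern.MULTILINE );
    Matcher matcher = pattern.matcher( pHTML ); // where pHTML is your HTML
    while ( matcher.find() ) {
        // group( 1 ) holds everything between href=" and </a>: the URL, the rest of the tag and the link text
        String strCapturedHTML = matcher.group( 1 );
    }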

    I sometimes use something similar to keep a watch on sites to see whether they have been updated.
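
    If only the URL is wanted rather than everything up to </a>, the same capturing approach can be narrowed to the attribute value. This is just a sketch: the class name RegexLinkExtractor and the exact pattern are my own assumptions, not taken from the example above, and a regex is not a full HTML parser (it will not skip commented-out links, for instance).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexLinkExtractor {

    // Matches <a ... href="value"> or <a ... href='value'>.
    // Group 1 is the quote character, group 2 is the URL on its own.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the href values of all anchor tags found in the given HTML. */
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<String>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(2));
        }
        return links;
    }
}
```

    Compiling the Pattern once as a constant matters when the same expression is run over hundreds of pages, since Pattern.compile is comparatively expensive.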

    Author Comment

    Can I do anything to minimise the time spent blocking on each web page request?

    Accepted Solution

    You can use asynchronous (non-blocking) I/O with Selectors,
    using java.nio (provided since Java 1.4).
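
    A rough sketch of what that could look like: one thread opens a non-blocking SocketChannel per page and lets a single Selector report which connections are ready, so no request waits on another. The class name NioFetcher and the method fetchAll are hypothetical. It assumes plain http:// URLs, sends a bare HTTP/1.0 GET so that the server closes the connection when the page is complete, and returns the raw response (headers included). It is written with generics and StandardCharsets, so it needs a newer JDK than 1.4.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class NioFetcher {

    /** Per-connection state, attached to each SelectionKey. */
    private static final class Request {
        final String url;
        final ByteBuffer out;                                          // request bytes still to send
        final ByteArrayOutputStream in = new ByteArrayOutputStream(); // response received so far
        Request(String url, ByteBuffer out) { this.url = url; this.out = out; }
    }

    /** Fetches all URLs on the calling thread; returns raw responses keyed by URL. */
    public static Map<String, String> fetchAll(List<String> urls) throws IOException {
        Map<String, String> results = new HashMap<String, String>();
        Selector selector = Selector.open();
        int pending = 0;
        try {
            for (String url : urls) {
                URI uri = URI.create(url);
                int port = uri.getPort() < 0 ? 80 : uri.getPort();
                String rawPath = uri.getRawPath();
                String path = (rawPath == null || rawPath.isEmpty()) ? "/" : rawPath;
                byte[] request = ("GET " + path + " HTTP/1.0\r\nHost: " + uri.getHost() + "\r\n\r\n")
                        .getBytes(StandardCharsets.ISO_8859_1);
                SocketChannel channel = SocketChannel.open();
                channel.configureBlocking(false);
                // Note: the DNS lookup done by InetSocketAddress still blocks.
                boolean connected = channel.connect(new InetSocketAddress(uri.getHost(), port));
                channel.register(selector,
                        connected ? SelectionKey.OP_WRITE : SelectionKey.OP_CONNECT,
                        new Request(url, ByteBuffer.wrap(request)));
                pending++;
            }
            ByteBuffer buffer = ByteBuffer.allocate(8192);
            while (pending > 0) {
                selector.select();                       // sleeps until some channel is ready
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    SocketChannel channel = (SocketChannel) key.channel();
                    Request req = (Request) key.attachment();
                    try {
                        if (key.isConnectable()) {
                            if (channel.finishConnect()) {
                                key.interestOps(SelectionKey.OP_WRITE);
                            }
                        } else if (key.isWritable()) {
                            channel.write(req.out);
                            if (!req.out.hasRemaining()) {
                                key.interestOps(SelectionKey.OP_READ);
                            }
                        } else if (key.isReadable()) {
                            buffer.clear();
                            int n = channel.read(buffer);
                            if (n < 0) {                 // server closed: response is complete
                                results.put(req.url,
                                        new String(req.in.toByteArray(), StandardCharsets.ISO_8859_1));
                                channel.close();
                                pending--;
                            } else {
                                req.in.write(buffer.array(), 0, n);
                            }
                        }
                    } catch (IOException e) {
                        channel.close();                 // failed requests are simply dropped here
                        pending--;
                    }
                }
            }
        } finally {
            for (SelectionKey key : selector.keys()) {
                key.channel().close();
            }
            selector.close();
        }
        return results;
    }
}
```

    Calling NioFetcher.fetchAll with a list of page URLs returns a map from each URL to its raw response text, which can then be passed to whichever link-extraction method you settle on. Requests that fail are missing from the map.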
