Solved

javax html parser help ?

Posted on 2013-06-15
Last Modified: 2013-11-19
Hi,

What would Java code look like that does the following:

From a URL input, return a list of all the anchor links in the page.

So, I'd like a list of all the possible destinations.

Can I also retrieve a list of the email addresses contained within a page? How?

Thanks
Question by:beavoid

Author Comment

by:beavoid
ID: 39250865
I found this page on sending Yahoo mail from code, and it works:

http://stackoverflow.com/questions/11356237/send-mail-from-yahoo-id-to-other-email-ids-using-javamail-api

but I still need code to extract a list of all links within a specified URL's page?

I'm sure it needs serious regular expression handling. Is there a class or method that returns a URL[] type object list?

Thanks
 

Author Comment

by:beavoid
ID: 39250889
Some links now don't start with the anchor tag <a
but, I see, with

<link

which appears to allow many options.
Will the code you give to find a list of links work with that too?

Thanks
 
LVL 110

Expert Comment

by:Ray Paseur
ID: 39251279
Please post the test data set and show us the outputs you need.  If you don't have a test data set yet, please create one so we can look at this problem in the form of the SSCCE.  I would envision that you could put up a web page with the sort of links you want to extract, and you could give us the expected output from the extraction process.  Armed with that you can probably get a tested and working code example from Experts-Exchange.

If you're interested, this article shows some of the kind of thought processes that go into answering a question like this one.
http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/A_7830-A-Quick-Tour-of-Test-Driven-Development.html

The process might be started with a code outline like this one.
http://www.laprbass.com/RAY_extract_urls.php

<?php // RAY_extract_urls.php
error_reporting(E_ALL);
echo '<pre>';

// EXTRACT URLS FROM A WEB PAGE, USING ANCHOR, LINK, AND BASE TAGS

// A WEB PAGE PROVIDES OUR TEST DATA
$url = 'http://tmxtech.com/';

// READ THE TEST DATA
$htm = file_get_contents($url);

// VISUALIZE THE TEST DATA
echo htmlentities($htm);

// CREATE A REGULAR EXPRESSION TO EXTRACT THE VALUES IN href= ATTRIBUTES
$rgx
= '#'          // A REGEX DELIMITER
. '('          // CAPTURE GROUP
. 'href="'     // A STARTER STRING TO LOOK FOR
. ')'          // END CAPTURE GROUP
. '(.*?)'      // A GROUP OF CHARACTERS, UNGREEDY
. '"'          // ENDING STRING TO LOOK FOR
. '#'          // END REGEX DELIMITER
. 'i'          // CASE-INSENSITIVE
;

// EXTRACT AND SHOW THE WORK PRODUCT
preg_match_all($rgx, $htm, $mat);
print_r($mat[2]);
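
Since the question asks for Java, here is a minimal sketch of the same href regex idea using only the standard library. The class name HrefExtractor, the method extractLinks, and the example.com base URL are made up for illustration. Like the PHP outline, it only sees double-quoted href values, so it is a starting point rather than a real HTML parser.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: mirrors the PHP pattern above, not a full HTML parser.
public class HrefExtractor {
    // href="..." with a reluctant match, case-insensitive (same idea as the PHP regex).
    private static final Pattern HREF =
            Pattern.compile("href=\"(.*?)\"", Pattern.CASE_INSENSITIVE);

    /** Returns every double-quoted href value in the HTML, resolved against baseUrl. */
    public static List<String> extractLinks(String html, String baseUrl) {
        URI base = URI.create(baseUrl);
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String raw = m.group(1).trim();
            try {
                // Turn relative links into absolute destinations.
                links.add(base.resolve(raw).toString());
            } catch (IllegalArgumentException e) {
                // Keep values that are not valid URIs unchanged.
                links.add(raw);
            }
        }
        return links;
    }

    public static void main(String[] args) {
        // Assumed sample markup; replace with the page source you downloaded.
        String html = "<link HREF=\"/style.css\"><a href=\"page.html\">x</a>";
        for (String link : extractLinks(html, "http://example.com/dir/")) {
            System.out.println(link);
        }
    }
}
```

Because the pattern looks for href rather than a tag name, it picks up both <a> and <link> tags in one pass.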


Thanks, ~Ray

Author Comment

by:beavoid
ID: 39251960
Here is a good API I found, called jsoup, that works for extracting links:

http://jsoup.org/cookbook/extracting-data/example-list-links

It works, getting both types of links in the page: <a> and <link>

It's been a *long* time since I've seen a regular expression table :) Good to have APIs
 
LVL 110

Expert Comment

by:Ray Paseur
ID: 39253087
Both the anchor tag and link tag use the href attribute to identify the URL.

Can we get the test data and the expected output, please?
 

Author Comment

by:beavoid
ID: 39255059
Ray, jsoup is perfect, thanks, but I found some compile problems, described at the end of my other question here

Thanks
 
LVL 110

Accepted Solution

by:
Ray Paseur earned 500 total points
ID: 39256209
"jsoup is perfect"
Great!  Then maybe you're the one programmer ever who did not need any test data.  Good luck with your project, ~Ray
 

Author Comment

by:beavoid
ID: 39256791
Okay, Sorry

I didn't have any test data, because diving recursively through links in HTML on the internet is a huge test data set.

I think my other concern is the regular-expression-like selector system jsoup uses for finding tags.

I'd like to find email addresses in a page.

It works like this: it selects elements in Document objects.

Elements links = doc.select("a[href]");
Elements media = doc.select("[src]");
Elements imports = doc.select("link[href]");

What would the selector for emails look like? Nothing I try works.
An email address link looks like this:

<a href="mailto:youremailaddress">Email Me</a>

so it must resemble

Elements emails = doc.select("a[href]");

How would I add the mailto requirement?

On a page at jamescomp.com, it would return james@jamescomp.com

Thanks
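
One way to approach the mailto requirement: jsoup's selector syntax documents an attribute-prefix form, [attr^=valPrefix], so something like doc.select("a[href^=mailto:]") looks like the natural fit, though that is untested here. The sketch below shows the same filter with only the standard library, so it runs without jsoup on the classpath. The class name MailtoExtractor and the method extractEmails are hypothetical names for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls addresses out of mailto: links with a regex.
public class MailtoExtractor {
    // Matches href="mailto:ADDRESS" or href='mailto:ADDRESS', case-insensitive.
    // The capture stops at the closing quote or at a "?subject=..." query part.
    private static final Pattern MAILTO = Pattern.compile(
            "href\\s*=\\s*[\"']mailto:([^\"'?]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the distinct addresses found in mailto links, in page order. */
    public static Set<String> extractEmails(String html) {
        Set<String> emails = new LinkedHashSet<>();
        Matcher m = MAILTO.matcher(html);
        while (m.find()) {
            emails.add(m.group(1).trim());
        }
        return emails;
    }

    public static void main(String[] args) {
        // Assumed sample markup based on the example in this post.
        String html = "<a href=\"mailto:james@jamescomp.com\">Email Me</a>"
                + "<a href=\"page.html\">Not an email</a>";
        for (String email : extractEmails(html)) {
            System.out.println(email);
        }
    }
}
```

Addresses that appear only as plain text on the page, not inside a mailto link, would need a separate pattern.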
 

Author Closing Comment

by:beavoid
ID: 39260206
Hi Ray
Sorry to tick you off. You have helped me with so much, and I am grateful.
Most of my programming is personal project / game programming, and so it's mostly obvious if it works. I am going to start again, fresh.
Thanks
 
LVL 110

Expert Comment

by:Ray Paseur
ID: 39261298
No worries -- I just find that a wide swath of highly constrained and predictable test data (along with expected outcomes) makes the path to success so much shorter!  Best of luck with it, and thanks for the points, ~Ray