Solved

javax HTML parser help?

Posted on 2013-06-15
Last Modified: 2013-11-19
Hi,

What would Java code look like that, given a URL as input, returns a list of all the anchor links in the page?

In other words, I'd like a list of all the possible destinations.

Can I also retrieve a list of the email addresses contained within a page? How?

Thanks
Question by:beavoid

Author Comment

by:beavoid
ID: 39250865
I found this page on sending Yahoo mail from code, and it works:

http://stackoverflow.com/questions/11356237/send-mail-from-yahoo-id-to-other-email-ids-using-javamail-api

But I still need code to extract a list of all links within the page at a specified URL.

I'm sure it needs serious regular expression handling. Is there a class or method that returns the links as a URL[]-type list?

Thanks

Author Comment

by:beavoid
ID: 39250889
Some links now don't start with the anchor tag <a. Instead, I see they use

<link

which appears to allow many options. Will the code you give for finding a list of links work with that as well?

Thanks

Expert Comment

by:Ray Paseur
ID: 39251279
Please post the test data set and show us the outputs you need.  If you don't have a test data set yet, please create one so we can look at this problem in the form of the SSCCE.  I would envision that you could put up a web page with the sort of links you want to extract, and you could give us the expected output from the extraction process.  Armed with that you can probably get a tested and working code example from Experts-Exchange.

If you're interested, this article shows the kind of thought process that goes into answering a question like this one.
http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/A_7830-A-Quick-Tour-of-Test-Driven-Development.html

The process might be started with a code outline like this one.
http://www.laprbass.com/RAY_extract_urls.php

<?php // RAY_extract_urls.php
error_reporting(E_ALL);
echo '<pre>';

// EXTRACT URLS FROM A WEB PAGE, USING ANCHOR, LINK, AND BASE TAGS

// A WEB PAGE PROVIDES OUR TEST DATA
$url = 'http://tmxtech.com/';

// READ THE TEST DATA
$htm = file_get_contents($url);

// VISUALIZE THE TEST DATA
echo htmlentities($htm);

// CREATE A REGULAR EXPRESSION TO EXTRACT THE VALUES IN href= ATTRIBUTES
$rgx
= '#'          // A REGEX DELIMITER
. '('          // CAPTURE GROUP
. 'href="'     // A STARTER STRING TO LOOK FOR
. ')'          // END CAPTURE GROUP
. '(.*?)'      // A GROUP OF CHARACTERS, UNGREEDY
. '"'          // ENDING STRING TO LOOK FOR
. '#'          // END REGEX DELIMITER
. 'i'          // CASE-INSENSITIVE
;

// EXTRACT AND SHOW THE WORK PRODUCT
preg_match_all($rgx, $htm, $mat);
print_r($mat[2]);
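
Since the question asks for Java, the same href pattern can be sketched with java.util.regex. This is only a minimal sketch under the same assumption as the PHP outline above (attribute values wrapped in double quotes); the class name HrefExtractor and the method extractHrefs are hypothetical names chosen for illustration, and the HTML string in main is made-up sample input rather than real test data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: a Java counterpart of the PHP href regex above.
class HrefExtractor {
    // Case-insensitive match for href="..."; group 1 is the attribute value (ungreedy).
    private static final Pattern HREF =
            Pattern.compile("href=\"(.*?)\"", Pattern.CASE_INSENSITIVE);

    // Returns every href value found in the HTML text, in document order.
    static List<String> extractHrefs(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Made-up sample input covering both <link> and <a> tags.
        String html = "<link href=\"style.css\"><a HREF=\"http://example.com/\">x</a>";
        System.out.println(extractHrefs(html));
    }
}
```

Like the PHP version, this picks up href values from any tag (anchor, link, or base) and does not resolve relative URLs or handle single-quoted attributes.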


Thanks, ~Ray

Author Comment

by:beavoid
ID: 39251960
Here is a good API I found, called jsoup, that works for extracting links:

http://jsoup.org/cookbook/extracting-data/example-list-links

It works, getting both types of links in the page: <a> and <link>.

It's been a *long* time since I've seen a regular expression table :) Good to have APIs.

Expert Comment

by:Ray Paseur
ID: 39253087
Both the anchor tag and link tag use the href attribute to identify the URL.

Can we get the test data and the expected output, please?

Author Comment

by:beavoid
ID: 39255059
Ray, jsoup is perfect, thanks, but I ran into some compile problems, described at the end of my other question here.

Thanks

Accepted Solution

by:Ray Paseur (earned 500 total points)
ID: 39256209
"jsoup is perfect"
Great!  Then maybe you're the one programmer ever who did not need any test data.  Good luck with your project, ~Ray

Author Comment

by:beavoid
ID: 39256791
Okay, Sorry

I didn't have any test data, because diving recursively through links in HTML pages on the internet amounts to one huge test data set.

I think my other concern is the regular-expression-like selector syntax jsoup uses for finding tags.

I'd like to find email addresses in a page.

jsoup works by selecting elements in Document objects, like this:

        Elements links = doc.select("a[href]");
        Elements media = doc.select("[src]");
        Elements imports = doc.select("link[href]");

What would the selector for emails look like? Nothing I try works.
An email address link looks like this:

<a href="mailto:youremailaddress">Email Me</a>

so it must resemble

 Elements emails = doc.select("a[href]");

How would I add the mailto requirement?

For a page on jamescomp.com, it would return james@jamescomp.com.
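
One way to approach the mailto requirement: jsoup's selector syntax includes an attribute-prefix form, [attr^=valPrefix], so doc.select("a[href^=mailto:]") would be the natural selector to try, with the address read from each element's href attribute. That selector has not been run against any page in this thread, so treat it as a suggestion. The sketch below shows the same idea with only the standard library, so it compiles without jsoup; MailtoExtractor and extractEmails are hypothetical names, and the regex assumes the address ends at the closing quote or at a '?' that starts parameters such as ?subject=.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects addresses from href="mailto:..." attributes.
class MailtoExtractor {
    // Matches href="mailto:ADDRESS" with single or double quotes, case-insensitively.
    // Group 1 stops at the closing quote or at '?' (start of ?subject=... parameters).
    private static final Pattern MAILTO = Pattern.compile(
            "href\\s*=\\s*[\"']mailto:([^\"'?]+)", Pattern.CASE_INSENSITIVE);

    // Returns the distinct addresses in the order they first appear.
    static Set<String> extractEmails(String html) {
        Set<String> emails = new LinkedHashSet<>();
        Matcher m = MAILTO.matcher(html);
        while (m.find()) {
            emails.add(m.group(1).trim());
        }
        return emails;
    }

    public static void main(String[] args) {
        String html = "<a href=\"mailto:james@jamescomp.com\">Email Me</a>";
        System.out.println(extractEmails(html)); // prints [james@jamescomp.com]
    }
}
```

This only finds addresses published as mailto links. Addresses that appear as plain text in the page would need a separate pattern run over the text content.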

Thanks

Author Closing Comment

by:beavoid
ID: 39260206
Hi Ray
Sorry to tick you off. You have helped me with so much, and I am grateful.
Most of my programming is personal project or game programming, so it's usually obvious whether it works. I am going to start again, fresh.
Thanks

Expert Comment

by:Ray Paseur
ID: 39261298
No worries -- I just find that a wide swath of highly constrained and predictable test data (along with expected outcomes) makes the path to success so much shorter!  Best of luck with it, and thanks for the points, ~Ray