?
Solved

javax html parser help ?

Posted on 2013-06-15
10
Medium Priority
?
380 Views
Last Modified: 2013-11-19
Hi,

What would Javax code look like to :

from a URL input, return a list of all the anchor links in the page

So, I'd like a list of all the possible destinations.

Can I also retrieve a list of emails contained within a page? How?

Thanks
0
Comment
Question by:beavoid
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 6
  • 4
10 Comments
 

Author Comment

by:beavoid
ID: 39250865
I found this page to send Yahoo mail from code, it works :

http://stackoverflow.com/questions/11356237/send-mail-from-yahoo-id-to-other-email-ids-using-javamail-api

but I still need code to extract a list of all links within a specified URL's page?

I'm sure it needs serious regular expression handling. Is there  class / method to return a URL[] type object list?

Thanks
0
 

Author Comment

by:beavoid
ID: 39250889
some links now don't start with anchor <a
but,
I see they now have

<link

which appears to allow many options,
will the code you give, to find a list of links, , work with that?

Thanks
0
 
LVL 111

Expert Comment

by:Ray Paseur
ID: 39251279
Please post the test data set and show us the outputs you need.  If you don't have a test data set yet, please create one so we can look at this problem in the form of the SSCCE.  I would envision that you could put up a web page with the sort of links you want to extract, and you could give us the expected output from the extraction process.  Armed with that you can probably get a tested and working code example from Experts-Exchange.

If you're interested, this article shows some of the kind of thought processes that go into answering a question like this one.
http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/A_7830-A-Quick-Tour-of-Test-Driven-Development.html

The process might be started with a code outline like this one.
http://www.laprbass.com/RAY_extract_urls.php

<?php // RAY_extract_urls.php
error_reporting(E_ALL);
echo '<pre>';

// EXTRACT URLS FROM A WEB PAGE, USING ANCHOR, LINK, AND BASE TAGS

// A WEB PAGE PROVIDES OUR TEST DATA
$url = 'http://tmxtech.com/';

// READ THE TEST DATA
$htm = file_get_contents($url);

// VISUALIZE THE TEST DATA
echo htmlentities($htm);

// CREATE A REGULAR EXPRESSION TO EXTRACT THE VALUES IN href= ATTRIBUTES
$rgx
= '#'          // A REGEX DELIMITER
. '('          // CAPTURE GROUP
. 'href="'     // A STARTER STRING TO LOOK FOR
. ')'          // END CAPTURE GROUP
. '(.*?)'      // A GROUP OF CHARACTERS, UNGREEDY
. '"'          // ENDING STRING TO LOOK FOR
. '#'          // END REGEX DELIMITER
. 'i'          // CASE-INSENSITIVE
;

// EXTRACT AND SHOW THE WORK PRODUCT
preg_match_all($rgx, $htm, $mat);
print_r($mat[2]);

Open in new window

Thanks, ~Ray
0
What does it mean to be "Always On"?

Is your cloud always on? With an Always On cloud you won't have to worry about downtime for maintenance or software application code updates, ensuring that your bottom line isn't affected.

 

Author Comment

by:beavoid
ID: 39251960
Here is a good API that I found called jsoup, that works for extracting links

http://jsoup.org/cookbook/extracting-data/example-list-links

it works, getting links in the page, both types :   <a>  and <link>

Been a *long* time since i've seen a regular expression table :) Good to have API's
0
 
LVL 111

Expert Comment

by:Ray Paseur
ID: 39253087
Both the anchor tag and link tag use the href attribute to identify the URL.

Can we get the test data and the expected output, please?
0
 

Author Comment

by:beavoid
ID: 39255059
Ray, jsoup is perfect, thanks, but I found some compile problems, specified at end of my
other question here

Thanks
0
 
LVL 111

Accepted Solution

by:
Ray Paseur earned 2000 total points
ID: 39256209
jsoup is perfect
Great!  Then maybe you're the one programmer ever who did not need any test data.  Good luck with your project, ~Ray
0
 

Author Comment

by:beavoid
ID: 39256791
Okay, Sorry

I didn't have any test data, because diving recursively through links in html on the internet is a huge test data.

I think my other concern is the regular expression type system jsoup uses for finding tags.

I'd like to find email addresses in a page.

It works like this : it is selecting text in Document Objects.

 Elements links = doc.select("a[href]");
        Elements media = doc.select("[src]");
        Elements imports = doc.select("link[href]");

What would emails look like? nothing I try works
An email  address looks like this :

<a href="mailto:youremailaddress">Email Me</a>

so it must resemble

 Elements emails = doc.select("a[href]");

How would I add the mailto requirement?

On a page jamescomp.com, it would return   james@jamescomp.com

Thanks
0
 

Author Closing Comment

by:beavoid
ID: 39260206
Hi Ray
Sorry to tick you off. You have helped me on so much, and I am grateful.
Most of my programming is personal project / game programming, and so it's mostly obvious if it works. I am going to start again, fresh.
Thanks
0
 
LVL 111

Expert Comment

by:Ray Paseur
ID: 39261298
No worries -- I just find that a wide swath of highly constrained and predictable test data (along with expected outcomes) makes the path to success so much shorter!  Best of luck with it, and thanks for the points, ~Ray
0

Featured Post

Get MySQL database support online, now!

At Percona’s web store you can order your MySQL database support needs in minutes. No hassles, no fuss, just pick and click. Pay online with a credit card.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

When it comes to security, close monitoring is a must. According to WhiteHat Security annual report, a substantial number of all web applications are vulnerable always. Monitis offers a new product - fully-featured Website security monitoring and pr…
Originally, this post was published on Monitis Blog, you can check it here . In business circles, we sometimes hear that today is the “age of the customer.” And so it is. Thanks to the enormous advances over the past few years in consumer techno…
This tutorial will teach you the core code needed to finalize the addition of a watermark to your image. The viewer will use a small PHP class to learn and create a watermark.
Any person in technology especially those working for big companies should at least know about the basics of web accessibility. Believe it or not there are even laws in place that require businesses to provide such means for the disabled and aging p…
Suggested Courses

764 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question