Solved

Java HTML parser help?

Posted on 2013-06-15
Last Modified: 2013-11-19
Hi,

What would Java code look like to do the following:

from a URL input, return a list of all the anchor links in the page

So, I'd like a list of all the possible destinations.

Can I also retrieve a list of emails contained within a page? How?

Thanks
Question by:beavoid

Author Comment

by:beavoid
ID: 39250865
I found this page on sending Yahoo mail from code, and it works:

http://stackoverflow.com/questions/11356237/send-mail-from-yahoo-id-to-other-email-ids-using-javamail-api

But I still need code to extract a list of all links within a specified URL's page.

I'm sure it needs serious regular expression handling. Is there a class or method that returns a URL[]-type list?

Thanks
 

Author Comment

by:beavoid
ID: 39250889
Some links now don't start with the anchor tag <a. Instead, I see they use

<link

which appears to allow many options. Will the code you give for finding a list of links work with that too?

Thanks
 
LVL 109

Expert Comment

by:Ray Paseur
ID: 39251279
Please post the test data set and show us the outputs you need.  If you don't have a test data set yet, please create one so we can look at this problem in the form of the SSCCE.  I would envision that you could put up a web page with the sort of links you want to extract, and you could give us the expected output from the extraction process.  Armed with that you can probably get a tested and working code example from Experts-Exchange.

If you're interested, this article shows the kind of thought process that goes into answering a question like this one.
http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/A_7830-A-Quick-Tour-of-Test-Driven-Development.html

The process might be started with a code outline like this one.
http://www.laprbass.com/RAY_extract_urls.php

<?php // RAY_extract_urls.php
error_reporting(E_ALL);
echo '<pre>';

// EXTRACT URLS FROM A WEB PAGE, USING ANCHOR, LINK, AND BASE TAGS

// A WEB PAGE PROVIDES OUR TEST DATA
$url = 'http://tmxtech.com/';

// READ THE TEST DATA
$htm = file_get_contents($url);

// VISUALIZE THE TEST DATA
echo htmlentities($htm);

// CREATE A REGULAR EXPRESSION TO EXTRACT THE VALUES IN href= ATTRIBUTES
$rgx
= '#'          // A REGEX DELIMITER
. '('          // CAPTURE GROUP
. 'href="'     // A STARTER STRING TO LOOK FOR
. ')'          // END CAPTURE GROUP
. '(.*?)'      // A GROUP OF CHARACTERS, UNGREEDY
. '"'          // ENDING STRING TO LOOK FOR
. '#'          // END REGEX DELIMITER
. 'i'          // CASE-INSENSITIVE
;

// EXTRACT AND SHOW THE WORK PRODUCT
preg_match_all($rgx, $htm, $mat);
print_r($mat[2]);


Thanks, ~Ray
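Since the question asks for Java, the PHP outline above can be sketched with java.util.regex using the same href pattern. This is only a rough equivalent under assumptions: the names HrefExtractor and extractHrefs and the sample HTML are made up for illustration, the page is assumed to be already downloaded into a String, and, like the PHP version, it only sees double-quoted href values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class HrefExtractor {
    // Same idea as the PHP pattern: capture whatever sits between
    // href=" and the next double quote, ungreedy, case-insensitive.
    private static final Pattern HREF =
            Pattern.compile("href=\"(.*?)\"", Pattern.CASE_INSENSITIVE);

    // Returns every double-quoted href value found in the HTML, in page order.
    static List<String> extractHrefs(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Made-up sample markup standing in for a downloaded page.
        String html = "<link rel=\"stylesheet\" href=\"/site.css\">"
                + "<a HREF=\"http://example.com/\">Home</a>"
                + "<a href=\"mailto:james@jamescomp.com\">Email Me</a>";
        for (String url : extractHrefs(html)) {
            System.out.println(url);
        }
    }
}
```

Because the pattern keys on href= rather than on the tag name, it picks up anchor, link, and base tags alike. A real HTML parser is likely to cope better with messy markup such as single-quoted or unquoted attributes.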

Author Comment

by:beavoid
ID: 39251960
Here is a good API I found, called jsoup, that works for extracting links:

http://jsoup.org/cookbook/extracting-data/example-list-links

It works, getting both types of links in the page: <a> and <link>

It's been a *long* time since I've seen a regular expression table :) Good to have APIs
 
LVL 109

Expert Comment

by:Ray Paseur
ID: 39253087
Both the anchor tag and link tag use the href attribute to identify the URL.

Can we get the test data and the expected output, please?
 

Author Comment

by:beavoid
ID: 39255059
Ray, jsoup is perfect, thanks, but I ran into some compile problems, described at the end of my other question here

Thanks
 
LVL 109

Accepted Solution

by:
Ray Paseur earned 500 total points
ID: 39256209
"jsoup is perfect"
Great!  Then maybe you're the one programmer ever who did not need any test data.  Good luck with your project, ~Ray
 

Author Comment

by:beavoid
ID: 39256791
Okay, sorry.

I didn't have any test data, because diving recursively through links in HTML on the internet amounts to a huge test data set.

I think my other concern is the regular-expression-like selector syntax jsoup uses for finding tags.

I'd like to find email addresses in a page.

It works like this: it selects elements in Document objects.

        Elements links = doc.select("a[href]");
        Elements media = doc.select("[src]");
        Elements imports = doc.select("link[href]");

What would the selector for emails look like? Nothing I try works.
An email address link looks like this:

<a href="mailto:youremailaddress">Email Me</a>

so it must resemble

        Elements emails = doc.select("a[href]");

How would I add the mailto requirement?

On the page jamescomp.com, it would return james@jamescomp.com

Thanks
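One way to narrow the href values down to email addresses is to keep only those that start with mailto:. A minimal sketch with plain java.util.regex follows; MailtoExtractor, extractEmails, and the sample HTML are hypothetical names and data, and the page is assumed to be in a String already. On the jsoup side, its selector syntax documents an attribute-prefix form [attr^=valPrefix], so a selector along the lines of a[href^=mailto:] should express the same requirement, though that is not tested here.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup or any other library.
final class MailtoExtractor {
    // Capture the address after href="mailto: and stop at the closing
    // quote or at a '?' (which would start parameters such as ?subject=).
    private static final Pattern MAILTO =
            Pattern.compile("href=\"mailto:([^\"?]+)", Pattern.CASE_INSENSITIVE);

    // Returns the distinct mailto addresses in the HTML, in first-seen order.
    static List<String> extractEmails(String html) {
        Set<String> seen = new LinkedHashSet<>();
        Matcher m = MAILTO.matcher(html);
        while (m.find()) {
            seen.add(m.group(1).trim());
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // Made-up sample markup; the second link repeats the same address.
        String html = "<a href=\"mailto:james@jamescomp.com\">Email Me</a>"
                + "<a href=\"MAILTO:james@jamescomp.com?subject=Hi\">Again</a>";
        System.out.println(extractEmails(html)); // prints [james@jamescomp.com]
    }
}
```

This only finds addresses published as mailto links. Addresses that appear as plain text in the page would need a separate pattern run over the page text.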
 

Author Closing Comment

by:beavoid
ID: 39260206
Hi Ray
Sorry to tick you off. You have helped me with so much, and I am grateful.
Most of my programming is personal project / game programming, so it's mostly obvious whether it works. I am going to start again, fresh.
Thanks
 
LVL 109

Expert Comment

by:Ray Paseur
ID: 39261298
No worries -- I just find that a wide swath of highly constrained and predictable test data (along with expected outcomes) makes the path to success so much shorter!  Best of luck with it, and thanks for the points, ~Ray