Solved

javax HTML parser help?

Posted on 2013-06-15
Last Modified: 2013-11-19
Hi,

What would Java (javax) code look like to do the following:

From a URL input, return a list of all the anchor links in the page.

So, I'd like a list of all the possible destinations.

Can I also retrieve a list of the email addresses contained within a page? How?

Thanks
Question by:beavoid

Author Comment

by:beavoid
ID: 39250865
I found this page on sending Yahoo mail from code, and it works:

http://stackoverflow.com/questions/11356237/send-mail-from-yahoo-id-to-other-email-ids-using-javamail-api

But I still need code to extract a list of all the links within a specified URL's page.

I'm sure it needs serious regular expression handling. Is there a class or method that returns the links as a list, such as a URL[]?

Thanks
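
On the URL[] part of the question: whatever does the extraction, the raw href values still have to be resolved against the page address before they are usable destinations. Here is a minimal sketch using only java.net.URI, assuming the href strings have already been pulled out of the page. The names LinkResolver and resolveAll are hypothetical, and the example.com addresses are placeholder data.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LinkResolver {
    /** Resolves raw href values against the page URL; values that are not valid URIs are skipped. */
    public static List<URI> resolveAll(String pageUrl, List<String> hrefs) {
        URI base = URI.create(pageUrl);
        List<URI> out = new ArrayList<>();
        for (String href : hrefs) {
            try {
                // Relative hrefs become absolute; absolute hrefs are returned unchanged.
                out.add(base.resolve(href.trim()));
            } catch (IllegalArgumentException e) {
                // Not a syntactically valid URI (for example, it contains a space): skip it.
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> hrefs = Arrays.asList("/about.html", "contact.html", "http://example.org/x");
        for (URI u : resolveAll("http://example.com/docs/index.html", hrefs)) {
            System.out.println(u);
        }
    }
}
```

If a java.net.URL is really needed, URI.toURL() can convert each absolute result.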
 

Author Comment

by:beavoid
ID: 39250889
Some links don't start with the anchor tag <a, but I see they now use

<link

which appears to allow many options. Will the code you give for finding a list of links work with that as well?

Thanks
 
LVL 111

Expert Comment

by:Ray Paseur
ID: 39251279
Please post the test data set and show us the outputs you need.  If you don't have a test data set yet, please create one so we can look at this problem in the form of the SSCCE.  I would envision that you could put up a web page with the sort of links you want to extract, and you could give us the expected output from the extraction process.  Armed with that you can probably get a tested and working code example from Experts-Exchange.

If you're interested, this article shows some of the kind of thought processes that go into answering a question like this one.
http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/A_7830-A-Quick-Tour-of-Test-Driven-Development.html

The process might be started with a code outline like this one.
http://www.laprbass.com/RAY_extract_urls.php

<?php // RAY_extract_urls.php
error_reporting(E_ALL);
echo '<pre>';

// EXTRACT URLS FROM A WEB PAGE, USING ANCHOR, LINK, AND BASE TAGS

// A WEB PAGE PROVIDES OUR TEST DATA
$url = 'http://tmxtech.com/';

// READ THE TEST DATA
$htm = file_get_contents($url);

// VISUALIZE THE TEST DATA
echo htmlentities($htm);

// CREATE A REGULAR EXPRESSION TO EXTRACT THE VALUES IN href= ATTRIBUTES
$rgx
= '#'          // A REGEX DELIMITER
. '('          // CAPTURE GROUP
. 'href="'     // A STARTER STRING TO LOOK FOR
. ')'          // END CAPTURE GROUP
. '(.*?)'      // A GROUP OF CHARACTERS, UNGREEDY
. '"'          // ENDING STRING TO LOOK FOR
. '#'          // END REGEX DELIMITER
. 'i'          // CASE-INSENSITIVE
;

// EXTRACT AND SHOW THE WORK PRODUCT
preg_match_all($rgx, $htm, $mat);
print_r($mat[2]);


Thanks, ~Ray
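
Since the question asks about Java, the same href pattern can be sketched with java.util.regex. This is a rough port of the PHP outline above, not a tested crawler: HrefExtractor and extract are hypothetical names, fetching the page is left out, and like the PHP version it only sees double-quoted href values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefExtractor {
    // href=" followed by a lazy run of characters up to the next double quote, case-insensitive.
    private static final Pattern HREF =
            Pattern.compile("href=\"(.*?)\"", Pattern.CASE_INSENSITIVE);

    /** Returns every double-quoted href value found in the HTML, in document order. */
    public static List<String> extract(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            out.add(m.group(1)); // group 1 is the text between the quotes
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<a href=\"a.html\">A</a><link HREF=\"style.css\">";
        System.out.println(extract(html)); // prints [a.html, style.css]
    }
}
```

Because the pattern keys on the attribute rather than the tag, it picks up anchor, link, and base tags alike.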

Author Comment

by:beavoid
ID: 39251960
Here is a good API I found, called jsoup, that works for extracting links:

http://jsoup.org/cookbook/extracting-data/example-list-links

It works, getting both types of links in the page: <a> and <link>

It's been a *long* time since I've seen a regular expression table :) Good to have APIs
 
LVL 111

Expert Comment

by:Ray Paseur
ID: 39253087
Both the anchor tag and link tag use the href attribute to identify the URL.

Can we get the test data and the expected output, please?
 

Author Comment

by:beavoid
ID: 39255059
Ray, jsoup is perfect, thanks, but I ran into some compile problems, described at the end of my
other question here

Thanks
 
LVL 111

Accepted Solution

by:
Ray Paseur earned 2000 total points
ID: 39256209
"jsoup is perfect"
Great!  Then maybe you're the one programmer ever who did not need any test data.  Good luck with your project, ~Ray
 

Author Comment

by:beavoid
ID: 39256791
Okay, sorry.

I didn't have any test data, because diving recursively through links in HTML on the internet amounts to one huge test data set.

I think my other concern is the regular-expression-like selector syntax jsoup uses for finding tags.

I'd like to find email addresses in a page.

It works like this: it selects elements in Document objects.

Elements links = doc.select("a[href]");
Elements media = doc.select("[src]");
Elements imports = doc.select("link[href]");

What would the selector for emails look like? Nothing I try works.
An email address looks like this:

<a href="mailto:youremailaddress">Email Me</a>

so it must resemble

Elements emails = doc.select("a[href]");

How would I add the mailto requirement?

On a page at jamescomp.com, it would return james@jamescomp.com

Thanks
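
For the mailto requirement, jsoup's selector syntax includes an attribute-prefix form, [attr^=prefix], so doc.select("a[href^=mailto:]") should narrow the anchors to email links, with the "mailto:" prefix then stripped from each attr("href") value. That has not been run against a real page here. As a dependency-free illustration of the same idea, here is a minimal sketch using java.util.regex; MailtoExtractor and extract are hypothetical names, and it assumes double-quoted href values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MailtoExtractor {
    // href="mailto: followed by the address, stopping at the closing quote or a ?subject=... query.
    private static final Pattern MAILTO =
            Pattern.compile("href=\"mailto:([^\"?]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the address part of every double-quoted mailto: link in the HTML. */
    public static List<String> extract(String html) {
        List<String> emails = new ArrayList<>();
        Matcher m = MAILTO.matcher(html);
        while (m.find()) {
            emails.add(m.group(1).trim());
        }
        return emails;
    }

    public static void main(String[] args) {
        String html = "<a href=\"mailto:james@jamescomp.com\">Email Me</a>"
                + "<a href=\"about.html\">About</a>";
        System.out.println(extract(html)); // prints [james@jamescomp.com]
    }
}
```

Addresses that appear only as plain text, outside a mailto: link, are not found by either approach.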
 

Author Closing Comment

by:beavoid
ID: 39260206
Hi Ray
Sorry to tick you off. You have helped me with so much, and I am grateful.
Most of my programming is personal project and game programming, so it's mostly obvious whether it works. I am going to start again, fresh.
Thanks
 
LVL 111

Expert Comment

by:Ray Paseur
ID: 39261298
No worries -- I just find that a wide swath of highly constrained and predictable test data (along with expected outcomes) makes the path to success so much shorter!  Best of luck with it, and thanks for the points, ~Ray

Question has a verified solution.
