• Status: Solved
  • Priority: Medium
  • Security: Public

Finding emails and links in an HTML or PHP file

Hi

I'm experimenting with code that pulls all links and email addresses out of a page's HTML.
I'm using the jsoup API, which works wonderfully.
I have part of it working (listing links),
but I don't know what selector or regular expression syntax jsoup uses for email addresses.
Maybe it is something on the Document object?
(The page source is held in a jsoup Document object.)

        //place source in a Document Object

        Document doc = Jsoup.connect(url).get();

        Elements links = doc.select("a[href]");
        Elements media = doc.select("[src]");
        Elements imports = doc.select("link[href]");

What would the line that collects the email addresses be?
Elements emails = ?
Thanks
Asked by: beavoid
1 Solution
 
haloexpertsexchange Commented:
Check for mailto: links?
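
To make that suggestion a bit more concrete, here is a minimal sketch that uses only java.util.regex rather than any jsoup-specific syntax. The class name EmailFinder, the method findEmails, and the deliberately simple address pattern are all assumptions made up for illustration; the pattern will not cover every valid address.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup.
final class EmailFinder {
    // Simplified pattern (an assumption): local part, '@', domain, dot, 2+ letter suffix.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // Returns each distinct address found in the text, in order of first appearance.
    // Works on raw HTML too: the "mailto:" prefix and any "?subject=..." suffix
    // fall outside the pattern, so only the address itself is captured.
    static List<String> findEmails(String text) {
        Set<String> unique = new LinkedHashSet<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            unique.add(m.group());
        }
        return new ArrayList<>(unique);
    }
}
```

With the Document from the question, the input could be doc.html(). Alternatively, doc.select("a[href^=mailto:]") should return just the anchors whose href starts with mailto:, though that would miss addresses that appear only as plain text.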
 
beavoid (Author) Commented:
Thanks for all this advice

I found an API called jsoup that handles all of these issues, probably by using regular expressions to find text within a page.
It finds links perfectly. I attached a zip with the main files I use, including my ListLinks.java, which is where the problem code is.

I am making an HTML diver that starts at one page and recursively follows every link from that page down to the pages beneath it. It works fine, but it claims that some of the URLs it finds are invalid, which seems silly. I can't see the exact line that is the problem, or how to avoid the problem pages. I already skip certain extensions like .swf and .png.
You can comment out whichever URL you'd like to begin with in main().
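
For anyone reading later, here is a rough sketch of the kind of pre-check that could run before each link is fetched. It is not the code from the attached ListLinks.java; the class name LinkFilter, the method shouldVisit, and the visited set are hypothetical names chosen for illustration, and it uses only java.net.URI from the standard library.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for a recursive link crawler.
final class LinkFilter {
    // Extensions the crawler skips (the two named in the post).
    private static final Set<String> SKIPPED_EXTENSIONS =
            new HashSet<>(Arrays.asList(".swf", ".png"));

    // Returns true only for a well-formed http/https URL that has not been seen
    // before and whose path does not end in a skipped extension.
    static boolean shouldVisit(String url, Set<String> visited) {
        if (url == null || url.trim().isEmpty()) {
            return false;
        }
        URI uri;
        try {
            uri = new URI(url.trim());
        } catch (URISyntaxException e) {
            return false; // malformed link, e.g. one containing a raw space
        }
        String scheme = uri.getScheme();
        if (scheme == null
                || !(scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"))) {
            return false; // mailto:, javascript:, relative links, and so on
        }
        String path = uri.getPath() == null ? "" : uri.getPath().toLowerCase(Locale.ROOT);
        for (String ext : SKIPPED_EXTENSIONS) {
            if (path.endsWith(ext)) {
                return false;
            }
        }
        // Set.add returns false when the URL was already recorded.
        return visited.add(uri.toString());
    }
}
```

If the links come from jsoup, reading them with link.attr("abs:href") instead of link.attr("href") should give absolute URLs, which may remove some of the "invalid URL" complaints caused by relative links. That is a guess about the cause, since the failing line is not shown here.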

I can't attach JAR files, so you'll need to search for it. The file is
jsoup-1.7.2.jar

My Java file is below.

Thanks
ListLinks.java