Solved

extracting data from HTML email

Posted on 2008-06-09
Last Modified: 2008-07-02
Hi,

I'm trying to extract HTML "blocks" from an HTML e-mail so I can add the data to a DB.

I've already written a script that successfully polls my IMAP account for the message by subject and reads the body into $body.

The data I'm looking for is always enclosed in <p style="width:800px;"> </p> tags, but there are other <p></p> tags in the HTML. Inside each <p style="width:800px;"> </p> block, I want to extract the title, the link of the title and the body as variables so I can write them to the DB.

So, a typical email body would look something like this:
<html><head></head><body><div>
<p><b>Welcome</b>
<p style="width:800px;">
<a href="somelink">title</a>
<font color="red">Some stuff</font>here with lots of <b>normal</b> html tags inbetween
</p>
<p style="width:800px;">
<a href="somelink2">title2</a>
<font color="red">Some more stuff</font>here with lots of <b>normal</b> html tags inbetween
</p>
<p style="width:800px;">
<a href="somelink3">title3</a>
<font color="red">Some more stuff</font>here with lots of <b>normal</b> html tags inbetween
</p>
...
...
...
</div></body></html>

Can someone help me with a script that can iteratively go through all the <p></p> blocks so I can get the link, title and body of each block read into a var?




Question by:psimation
5 Comments
 
LVL 48

Expert Comment

by:hernst42
ID: 21745787
If the HTML is well formed, you can use http://www.php.net/manual/en/domdocument.loadhtml.php to parse it and then convert it to a SimpleXML object
http://www.php.net/manual/en/function.simplexml-import-dom.php
and then use xpath to search those nodes
http://www.php.net/manual/en/function.simplexml-element-xpath.php
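
A minimal sketch of how those three steps could fit together, assuming the e-mail body is already in $body as in the question. The sample markup is modelled on the question, extractBlocks() is a made-up helper name, and "body" is taken here to mean the plain text of the block minus the title link; adjust that to whatever the DB needs.

```php
<?php
// Sample body modelled on the e-mail in the question.
$body = '<html><head></head><body><div>
<p><b>Welcome</b></p>
<p style="width:800px;">
<a href="somelink">title</a>
<font color="red">Some stuff</font>here with lots of <b>normal</b> html tags inbetween
</p>
<p style="width:800px;">
<a href="somelink2">title2</a>
<font color="red">Some more stuff</font>here with lots of <b>normal</b> html tags inbetween
</p>
</div></body></html>';

// extractBlocks() is a hypothetical helper, not a PHP built-in.
function extractBlocks($html)
{
    $dom = new DOMDocument();
    // Collect libxml warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xml = simplexml_import_dom($dom);
    $blocks = array();

    // Only paragraphs with exactly this style attribute are matched.
    foreach ($xml->xpath('//p[@style="width:800px;"]') as $p) {
        $a = $p->a[0]; // first <a> child, or null if there is none
        if ($a === null) {
            continue;
        }

        // Gather the text of every child node except the title link.
        $text = '';
        $skippedTitle = false;
        foreach (dom_import_simplexml($p)->childNodes as $child) {
            if (!$skippedTitle && $child->nodeName === 'a') {
                $skippedTitle = true;
                continue;
            }
            $text .= $child->textContent;
        }

        $blocks[] = array(
            'link'  => (string) $a['href'],
            'title' => trim((string) $a),
            'body'  => trim(preg_replace('/\s+/', ' ', $text)),
        );
    }

    return $blocks;
}

$blocks = extractBlocks($body);
print_r($blocks);
```

Each entry in $blocks then holds the link, title and body of one block, ready to be written to the DB inside a loop.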
 
LVL 17

Author Comment

by:psimation
ID: 21746059
Thanks Hernst42

There will most probably be some elements I will have to remove manually to make it "well-formed HTML", but in the event one cannot "clean" it up, is there another (easy) way that you know of?

 
LVL 48

Expert Comment

by:hernst42
ID: 21746079
The other option is to write your own parser, but that parser won't be able to handle wrongly nested tags either. Maybe you need to run http://www.php.net/tidy first to make the HTML well formed.
So I don't know of an easier solution.
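
A rough sketch of that clean-up step, assuming the optional tidy extension is installed. cleanHtml() is a made-up helper name, and the config options shown are just one plausible choice, not a recommendation.

```php
<?php
// cleanHtml() is a hypothetical helper; it relies on the optional tidy extension.
function cleanHtml($html)
{
    if (!function_exists('tidy_repair_string')) {
        // tidy is not available: hand the markup back untouched.
        return $html;
    }

    $config = array(
        'output-xhtml' => true, // emit well-formed XHTML
        'wrap'         => 0,    // do not re-wrap long lines
    );

    $clean = tidy_repair_string($html, $config, 'utf8');

    // Fall back to the original markup if the repair did not produce a string.
    return is_string($clean) ? $clean : $html;
}

// Deliberately sloppy input: unclosed <p> and <b> tags.
$dirty = '<p><b>Welcome<p style="width:800px;"><a href="somelink">title</a>';
$clean = cleanHtml($dirty);
echo $clean;
```

The cleaned string can then be passed to DOMDocument::loadHTML() in place of the raw e-mail body.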
 
LVL 17

Author Comment

by:psimation
ID: 21746123
Thx, I'll play with that tomorrow - one more thing:

How does this method handle the "attributes" of tags like <a href="link">xxx</a>?

From the example, it looks like it can only "extract" content "between" tags, and not the "attributes" - so how would I get the link ( which is theoretically an attribute of the <a> tag )?
 
LVL 48

Accepted Solution

by:
hernst42 earned 500 total points
ID: 21754450
You can convert it back to a DOMDocument and then output that with all nested tags and attributes.
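
As an illustration only, assuming a single block like the ones in the question: the href can be read straight from the SimpleXML element with array syntax, and dom_import_simplexml() gives back a DOM node that can be serialised with its nested tags and attributes intact. The sample markup and variable names are made up for this sketch.

```php
<?php
// One block modelled on the e-mail in the question.
$html = '<html><body><p style="width:800px;"><a href="somelink">title</a> '
      . '<font color="red">Some stuff</font>here with <b>normal</b> tags</p></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xml = simplexml_import_dom($dom);

$matches = $xml->xpath('//p[@style="width:800px;"]');
$p = $matches[0];

// Attributes are available on SimpleXML elements via array syntax.
$link = (string) $p->a['href'];

// Convert the SimpleXML element back to a DOM node and serialise it,
// keeping all nested tags and their attributes.
$node = dom_import_simplexml($p);
$markup = $node->ownerDocument->saveXML($node);

echo $link, "\n", $markup, "\n";
```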
