Solved

Extracting data from HTML email

Posted on 2008-06-09
Last Modified: 2008-07-02
Hi,

I'm trying to "extract" HTML "blocks" from an HTML e-mail so I can add the data into a DB.

I've already managed to write a script that successfully polls my IMAP account for the message by subject and then reads the body into $body.

The data I'm looking for is always "encapsulated" between <p style="width:800px;"> </p> tags, BUT there are other <p></p> tags in the HTML. Inside each <p style="width:800px;"> </p> block, I want to extract the title, the link of the title and the body as vars so I can write them to the DB.

So, a typical email body would look something like this:
<html><head></head><body><div>
<p><b>Welcome</b>
<p style="width:800px;">
<a href="somelink">title</a>
<font color="red">Some stuff</font>here with lots of <b>normal</b> html tags inbetween
</p>
<p style="width:800px;">
<a href="somelink2">title2</a>
<font color="red">Some more stuff</font>here with lots of <b>normal</b> html tags inbetween
</p>
<p style="width:800px;">
<a href="somelink3">title3</a>
<font color="red">Some more stuff</font>here with lots of <b>normal</b> html tags inbetween
</p>
...
...
...
</div></body></html>

Can someone help me with a script that can iteratively go through all the <p></p> blocks so I can get the link, title and body of each block read into a var?




Question by:psimation
5 Comments
 
Expert Comment

by:hernst42
ID: 21745787
If it's well-formed HTML, you can use http://www.php.net/manual/en/domdocument.loadhtml.php to parse the HTML and then convert it to a SimpleXML object:
http://www.php.net/manual/en/function.simplexml-import-dom.php
Then use XPath to search for those nodes:
http://www.php.net/manual/en/function.simplexml-element-xpath.php
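
A minimal sketch of that approach, not a definitive implementation. It assumes the body looks like the sample in the question (shortened here to two blocks), that each block's first <a> holds the title, and that suppressing libxml's parser warnings is acceptable. The $blocks array is an illustrative name, not something from the original script.

```php
<?php
// Sample body modeled on the e-mail in the question (assumed structure).
$body = <<<HTML
<html><head></head><body><div>
<p><b>Welcome</b>
<p style="width:800px;">
<a href="somelink">title</a>
<font color="red">Some stuff</font>here with lots of <b>normal</b> html tags inbetween
</p>
<p style="width:800px;">
<a href="somelink2">title2</a>
<font color="red">Some more stuff</font>here with lots of <b>normal</b> html tags inbetween
</p>
</div></body></html>
HTML;

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($body);
libxml_clear_errors();

// Convert the DOM tree to SimpleXML and select only the styled <p> blocks.
$xml = simplexml_import_dom($dom);
$blocks = [];
foreach ($xml->xpath('//p[@style="width:800px;"]') as $p) {
    // Casting an element to string yields its own text content.
    $blocks[] = ['title' => trim((string) $p->a[0])];
}

print_r($blocks);
```

The XPath expression matches the style attribute exactly, so the plain <p> tags elsewhere in the message are skipped. If the mail client ever changes the spacing inside the attribute, the expression would need to be loosened.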
Author Comment

by:psimation
ID: 21746059
Thanks Hernst42

There will most probably be some elements I will have to remove manually to make it "well-formed HTML", but in the event that one cannot "clean" it up, is there another (easy) way that you know of?

Expert Comment

by:hernst42
ID: 21746079
The other option is to write your own parser, but such a parser would not be able to handle incorrectly nested tags either. You may need to run http://www.php.net/tidy first to make the HTML well formed.
So I don't know of an easier solution.
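
A rough sketch of the Tidy step, assuming the tidy extension may or may not be installed. The repair_html() helper and the sample $broken string are hypothetical names made up for illustration; the helper simply returns its input unchanged when Tidy is missing.

```php
<?php
// Hypothetical helper: repair markup with Tidy when the extension is loaded,
// otherwise hand the input back untouched.
function repair_html(string $html): string
{
    if (!function_exists('tidy_repair_string')) {
        return $html; // Tidy not available, nothing is repaired
    }
    $config = [
        'output-xhtml' => true, // emit well-formed XHTML
        'wrap'         => 0,    // do not re-wrap long lines
    ];
    $fixed = tidy_repair_string($html, $config, 'utf8');
    return $fixed === false ? $html : $fixed;
}

// Deliberately sloppy markup: unclosed <p> tags, as in the sample e-mail.
$broken = '<div><p><b>Welcome</b><p style="width:800px;"><a href="somelink">title</a></div>';
$clean = repair_html($broken);

echo $clean, "\n";
```

The cleaned string can then be passed to the DOM parser in place of the raw body.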
Author Comment

by:psimation
ID: 21746123
Thx, I'll play with that tomorrow - one more thing:

How does this method handle the "attributes" of tags like <a href="link">xxx</a>

From the example, it looks like it can only "extract" content "between" tags, and not the "attributes" - so how would I get the link ( which is theoretically an attribute of the <a> tag )?
Accepted Solution

by:hernst42 (earned 500 total points)
ID: 21754450
You can convert it back to a DOMDocument node and then output that with all nested tags and attributes.
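
A small sketch of what that could look like, under the assumption that the first <a> in each block is the title link. The sample $body is a shortened, made-up version of the e-mail, and the variable names are illustrative. SimpleXML also exposes attributes directly through array syntax, which is shown alongside the conversion back to DOM.

```php
<?php
// Shortened sample body (assumed structure, one block only).
$body = '<html><body><div><p style="width:800px;">'
      . '<a href="somelink">title</a>'
      . '<font color="red">Some stuff</font> here with <b>normal</b> tags'
      . '</p></div></body></html>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($body);
libxml_clear_errors();

$xml = simplexml_import_dom($dom);
$p = $xml->xpath('//p[@style="width:800px;"]')[0];

// Attributes are read with array syntax on a SimpleXML element.
$href  = (string) $p->a[0]['href'];
$title = (string) $p->a[0];

// Convert the block back to a DOM node and serialize its children,
// keeping nested tags and attributes but skipping the title link.
$node = dom_import_simplexml($p);
$inner = '';
$skippedTitle = false;
foreach ($node->childNodes as $child) {
    if (!$skippedTitle && $child->nodeName === 'a') {
        $skippedTitle = true;
        continue;
    }
    $inner .= $node->ownerDocument->saveHTML($child);
}

echo $href, "\n", $title, "\n", $inner, "\n";
```

Here $href, $title and $inner correspond to the link, title and body the question asks for, ready to be written to the DB.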