psimation
extracting data from HTML email
Hi,
I'm trying to extract HTML "blocks" from an HTML e-mail so I can add the data to a DB.
I've already written a script that polls my IMAP account for the message by subject and reads the body into $body.
The data I'm looking for is always enclosed in <p style="width:800px;"> </p> tags, but there are other <p></p> tags in the HTML as well. Inside each <p style="width:800px;"> </p> block, I want to extract the title, the title's link and the body as variables so I can write them to the DB.
So, a typical email body would look something like this:
<html><head></head><body><div>
<p><b>Welcome</b>
<p style="width:800px;">
<a href="somelink">title</a>
<font color="red">Some stuff</font>here with lots of <b>normal</b> html tags inbetween
</p>
<p style="width:800px;">
<a href="somelink2">title2</a>
<font color="red">Some more stuff</font>here with lots of <b>normal</b> html tags inbetween
</p>
<p style="width:800px;">
<a href="somelink3">title3</a>
<font color="red">Some more stuff</font>here with lots of <b>normal</b> html tags inbetween
</p>
...
...
...
</div></body></html>
Can someone help me with a script that can iteratively go through all the <p></p> blocks so I can get the link, title and body of each block read into a var?
ASKER
Thanks Hernst42
There will most probably be some elements I will have to remove manually to make it "well-formed HTML", but in the event one cannot clean it up, is there another (easy) way that you know of?
The other option is to write your own parser, but that parser would not be able to handle incorrectly nested tags either. You may need to run Tidy (http://www.php.net/tidy) first to make the HTML well formed.
So I don't know of an easier solution.
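As a rough illustration of the Tidy suggestion, here is a minimal sketch. It assumes the PHP tidy extension is installed. The helper name tidyHtml and the choice to fall back to the unmodified input are hypothetical, not something taken from this thread.

```php
<?php
// Hypothetical helper: the name and the fallback behaviour are illustrative only.
function tidyHtml(string $html): string
{
    // Assumption: if the tidy extension is missing, hand the input back untouched.
    if (!function_exists('tidy_repair_string')) {
        return $html;
    }

    $config = [
        'output-xhtml' => true, // ask Tidy for well-formed XHTML
        'wrap'         => 0,    // do not re-wrap long lines
    ];

    // tidy_repair_string() returns the repaired markup, or false on failure.
    $clean = tidy_repair_string($html, $config, 'utf8');

    return $clean === false ? $html : $clean;
}
```

The result could then be passed on to whatever parser is used next, in place of the raw $body.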
ASKER
Thx, I'll play with that tomorrow - one more thing:
How does this method handle the "attributes" of tags like <a href="link">xxx</a>
From the example, it looks like it can only "extract" content "between" tags, and not the "attributes" - so how would I get the link ( which is theoretically an attribute of the <a> tag )?
ASKER CERTIFIED SOLUTION
http://www.php.net/manual/en/function.simplexml-import-dom.php
and then use xpath to search those nodes
http://www.php.net/manual/en/function.simplexml-element-xpath.php
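To make the two links above concrete, here is a minimal sketch of how they might be combined for the email layout in the question. It is not the hidden accepted answer. The function name extractBlocks and the returned keys link, title and body are hypothetical, and it assumes every block of interest carries exactly style="width:800px;" and contains one <a> element. Character encoding of the email is not handled here.

```php
<?php
// Hypothetical helper: the function name and array keys are illustrative only.
function extractBlocks(string $body): array
{
    if (trim($body) === '') {
        return [];
    }

    // Email HTML is rarely well formed, so collect libxml's parse warnings
    // internally instead of letting them surface as PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($body);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xml = simplexml_import_dom($dom);
    if (!$xml) {
        return [];
    }

    $blocks = [];
    // Only the <p> elements with the exact style attribute from the question.
    foreach ($xml->xpath('//p[@style="width:800px;"]') ?: [] as $p) {
        $links = $p->xpath('.//a[@href]');
        if (!$links) {
            continue; // block without a link: skip it
        }
        $a = $links[0];

        // Attributes are read with array syntax, element text with a string cast.
        $link  = (string) $a['href'];
        $title = trim((string) $a);

        // Remove the link from the block so the remaining text is the body.
        $anchor = dom_import_simplexml($a);
        $anchor->parentNode->removeChild($anchor);
        $text = dom_import_simplexml($p)->textContent;

        $blocks[] = [
            'link'  => $link,
            'title' => $title,
            'body'  => trim(preg_replace('/\s+/', ' ', $text)),
        ];
    }

    return $blocks;
}
```

Calling extractBlocks($body) on the fetched message would give one array entry per block, which could then be written to the DB. The body here is plain text with the inner tags stripped. If the inner markup needs to be kept, DOMDocument::saveHTML() on the block's node would be one way to get it.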