Solved

Need to convert scraped HTML to XML nodes, without tags

Posted on 2010-01-09
Last Modified: 2012-05-08
Hi,

I'm grabbing HTML from a forum, with output that looks like the first part of the attached snippet.  The second part of the snippet is an example of what I'm trying to get back from the first part.

I'm trying to:
- strip HTML tags, except <BR>, using strip_tags($markup, "<BR>");
- convert single and double quotes to HTML entities using htmlspecialchars($markup, ENT_QUOTES);
- convert line breaks to <BR>, using both nl2br and preg_replace (to condense multiple consecutive line breaks to one or two <BR>)
The above seems to be working OK, but the problem is in the next step:

I need to convert it into XML nodes.  Each "code" block needs to be in a <code> node (the code blocks are marked by both normal code tags and the <!-- php buffer start/end --> comment elements), and everything between, before, or after a code block should go in a <para> node.

So basically everything up to the start of the first code block (<!-- php buffer start -->) goes into a paragraph node, then open a code node and put everything in that until we hit the end of the code block (<!-- php buffer end -->), then open a new paragraph node and put everything else until it hits another new code block, etc.
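
The alternation described above can be sketched by splitting on either marker, so that the pieces come back in the order text, code, text, code, and so on. This is only a sketch: split_segments() is a hypothetical name, the type declarations assume PHP 7 or later, and the markers are assumed to be balanced and never nested.

```php
<?php
// Hypothetical helper: split scraped markup into alternating para/code segments.
// Assumes every "start" marker has a matching "end" marker and none are nested.
function split_segments(string $html): array
{
    // Splitting on either marker yields text, code, text, code, ... in order.
    $parts = preg_split('/<!--\s*php buffer (?:start|end)\s*-->/i', $html);

    $segments = [];
    foreach ($parts as $i => $text) {
        // Even indexes fall outside the code blocks, odd indexes inside them.
        $segments[] = [
            'type' => ($i % 2 === 0) ? 'para' : 'code',
            'text' => $text,
        ];
    }
    return $segments;
}
```

Because the markers are consumed by the split itself, each segment can be passed through strip_tags() afterwards without losing track of where the code blocks were.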

All nodes should use CDATA wrappers.
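
A minimal sketch of the CDATA wrapping, assuming PHP 7 or later. cdata_node() is a hypothetical helper name; the only subtlety it handles is a literal "]]>" inside the text, which would otherwise end the CDATA section early.

```php
<?php
// Hypothetical helper: wrap text in a named XML node using a CDATA section.
function cdata_node(string $name, string $text): string
{
    // "]]>" would close the section early, so split it across two sections.
    $safe = str_replace(']]>', ']]]]><![CDATA[>', $text);

    return "<{$name}><![CDATA[{$safe}]]></{$name}>";
}
```

Note that an XML parser does not expand entity references inside a CDATA section, so a consumer of this output would see strings such as &quot; literally and would need to decode them itself.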

I've been trying mostly regexp functions like preg_replace and preg_match, and I've gotten close a couple of times, but every time I think I'm making progress, some other bit falls apart.  For example, if I call strip_tags in the wrong spot, I lose the php buffer comments, so I can't determine where code blocks start and finish.  My regexps are also commonly failing to match properly.
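
One way around the ordering problem described above is to do the cleanup only after the markup has been cut at the php buffer comments, so strip_tags() can no longer destroy the markers. The sketch below assumes that split has already happened and cleans a single fragment; clean_fragment() is a hypothetical name, and it swaps htmlspecialchars() for a plain str_replace() on the quotes, because htmlspecialchars() would also escape the <BR> tags being kept and double-encode entities such as &lt; that are already in the scraped text.

```php
<?php
// Hypothetical helper: clean one already-split fragment (paragraph or code).
// Assumes the "php buffer" markers were consumed by an earlier split step.
function clean_fragment(string $fragment): string
{
    // Turn each run of raw line breaks into a single <BR>.
    $out = preg_replace('/(?:\r\n|\r|\n)+/', '<BR>', $fragment);

    // Drop every tag except <BR>; this also removes any remaining comments.
    $out = strip_tags($out, '<br>');

    // Encode quotes only, leaving the kept <BR> tags and existing entities alone.
    $out = str_replace(['"', "'"], ['&quot;', '&#039;'], $out);

    // Condense three or more consecutive <BR> tags down to two.
    return preg_replace('/(?:<BR>\s*){2,}<BR>/i', '<BR><BR>', $out);
}
```

The quote replacement runs after strip_tags() on purpose: the attribute quotes inside the tags are gone by then, so only quotes in the visible text get encoded.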

I've spent more time on this than I'd care to admit.  If anyone could help me get this wrapped up, I'd really appreciate it.

TYIA




////////////////////////////////
// the html i'm starting with //
////////////////////////////////

first, you need to give Flash the new file name. You can use a standard querystring or Flash vars, and php to echo it out:<BR>
<DIV style="MARGIN: 5px 20px 20px">
<DIV class=smallfont style="MARGIN-BOTTOM: 2px">PHP Code:</DIV>
<DIV class=alt2 dir=ltr style="BORDER-RIGHT: 1px inset; PADDING-RIGHT: 6px; BORDER-TOP: 1px inset; PADDING-LEFT: 6px; PADDING-BOTTOM: 6px; MARGIN: 0px; OVERFLOW: auto; BORDER-LEFT: 1px inset; WIDTH: 640px; PADDING-TOP: 6px; BORDER-BOTTOM: 1px inset; HEIGHT: 66px; TEXT-ALIGN: left"><CODE style="WHITE-SPACE: nowrap"><!-- php buffer start --><CODE><SPAN style="COLOR: #000000">&lt;param name="movie" value="Test_Editor.swf?filename=<SPAN style="COLOR: #0000bb">&lt;?php </SPAN><SPAN style="COLOR: #007700">echo </SPAN><SPAN style="COLOR: #0000bb">$newname</SPAN><SPAN style="COLOR: #007700">; </SPAN><SPAN style="COLOR: #0000bb">?&gt;</SPAN>" /&gt;<BR>...<BR>&lt;object type="application/x-shockwave-flash" data="Test_Editor.swf?filename=<SPAN style="COLOR: #0000bb">&lt;?php </SPAN><SPAN style="COLOR: #007700">echo </SPAN><SPAN style="COLOR: #0000bb">$newname</SPAN><SPAN style="COLOR: #007700">; </SPAN><SPAN style="COLOR: #0000bb">?&gt;</SPAN>" width="964" height="510"&gt;</SPAN> </CODE><!-- php buffer end --></CODE></DIV></DIV>as you can see, we set a variable named "filename" to the $newname php variable.<BR><BR>now, it'll be availabe in flash in the loaderInfo.parameters object, which will now have a property named "filename" that will be the path from $newname: 
<DIV style="MARGIN: 5px 20px 20px">
<DIV class=smallfont style="MARGIN-BOTTOM: 2px">PHP Code:</DIV>
<DIV class=alt2 dir=ltr style="BORDER-RIGHT: 1px inset; PADDING-RIGHT: 6px; BORDER-TOP: 1px inset; PADDING-LEFT: 6px; PADDING-BOTTOM: 6px; MARGIN: 0px; OVERFLOW: auto; BORDER-LEFT: 1px inset; WIDTH: 640px; PADDING-TOP: 6px; BORDER-BOTTOM: 1px inset; HEIGHT: 34px; TEXT-ALIGN: left"><CODE style="WHITE-SPACE: nowrap"><!-- php buffer start --><CODE><SPAN style="COLOR: #000000"><SPAN style="COLOR: #0000bb"></SPAN><SPAN style="COLOR: #007700">var </SPAN><SPAN style="COLOR: #0000bb">imageFilePath</SPAN><SPAN style="COLOR: #007700">:</SPAN><SPAN style="COLOR: #0000bb">String </SPAN><SPAN style="COLOR: #007700">= </SPAN><SPAN style="COLOR: #0000bb">root</SPAN><SPAN style="COLOR: #007700">.</SPAN><SPAN style="COLOR: #0000bb">loaderInfo</SPAN><SPAN style="COLOR: #007700">.</SPAN><SPAN style="COLOR: #0000bb">parameters</SPAN><SPAN style="COLOR: #007700">.</SPAN><SPAN style="COLOR: #0000bb">filename</SPAN><SPAN style="COLOR: #007700">; <BR></SPAN><SPAN style="COLOR: #0000bb"></SPAN></SPAN></CODE><!-- php buffer end --></CODE></DIV></DIV>then load it into flash using a Loader object: 
<DIV style="MARGIN: 5px 20px 20px">
<DIV class=smallfont style="MARGIN-BOTTOM: 2px">PHP Code:</DIV>
<DIV class=alt2 dir=ltr style="BORDER-RIGHT: 1px inset; PADDING-RIGHT: 6px; BORDER-TOP: 1px inset; PADDING-LEFT: 6px; PADDING-BOTTOM: 6px; MARGIN: 0px; OVERFLOW: auto; BORDER-LEFT: 1px inset; WIDTH: 640px; PADDING-TOP: 6px; BORDER-BOTTOM: 1px inset; HEIGHT: 82px; TEXT-ALIGN: left"><CODE style="WHITE-SPACE: nowrap"><!-- php buffer start --><CODE><SPAN style="COLOR: #000000"><SPAN style="COLOR: #0000bb"></SPAN><SPAN style="COLOR: #007700">var </SPAN><SPAN style="COLOR: #0000bb">loader</SPAN><SPAN style="COLOR: #007700">:</SPAN><SPAN style="COLOR: #0000bb">Loader </SPAN><SPAN style="COLOR: #007700">= new </SPAN><SPAN style="COLOR: #0000bb">Loader</SPAN><SPAN style="COLOR: #007700">();<BR>var </SPAN><SPAN style="COLOR: #0000bb">request</SPAN><SPAN style="COLOR: #007700">:</SPAN><SPAN style="COLOR: #0000bb">URLRequest </SPAN><SPAN style="COLOR: #007700">= new </SPAN><SPAN style="COLOR: #0000bb">URLRequest</SPAN><SPAN style="COLOR: #007700">(</SPAN><SPAN style="COLOR: #0000bb">imageFilePath</SPAN><SPAN style="COLOR: #007700">);<BR></SPAN><SPAN style="COLOR: #0000bb">loader</SPAN><SPAN style="COLOR: #007700">.</SPAN><SPAN style="COLOR: #0000bb">load</SPAN><SPAN style="COLOR: #007700">(</SPAN><SPAN style="COLOR: #0000bb">request</SPAN><SPAN style="COLOR: #007700">);<BR></SPAN><SPAN style="COLOR: #0000bb">addChild</SPAN><SPAN style="COLOR: #007700">(</SPAN><SPAN style="COLOR: #0000bb">loader</SPAN><SPAN style="COLOR: #007700">); <BR></SPAN><SPAN style="COLOR: #0000bb"></SPAN></SPAN></CODE><!-- php buffer end --></CODE></DIV></DIV>

//////////////////////////////////
// the return i'm trying to get //
//////////////////////////////////

<para><![CDATA[first, you need to give Flash the new file name. You can use a standard querystring or Flash vars, and php to echo it out:<BR><BR><BR>PHP Code:<BR>]]></para>
<code><![CDATA[&lt;param name=&quot;movie&quot; value=&quot;Test_Editor.swf?filename=&lt;?php echo $newname; ?&gt;&quot; /&gt;<BR>...<BR>&lt;object type=&quot;application/x-shockwave-flash&quot; data=&quot;Test_Editor.swf?filename=&lt;?php echo $newname; ?&gt;&quot; width=&quot;964&quot; height=&quot;510&quot;&gt; ]]></code>
<para><![CDATA[as you can see, we set a variable named &quot;filename&quot; to the $newname php variable.<BR><BR>now, it&#039;ll be availabe in flash in the loaderInfo.parameters object, which will now have a property named &quot;filename&quot; that will be the path from $newname:<BR><BR>PHP Code:<BR>]]></para>
<code><![CDATA[var imageFilePath:String = root.loaderInfo.parameters.filename; <BR>]]></code>
<para><![CDATA[then load it into flash using a Loader object: <BR><BR>PHP Code:<BR>]]></para>
<code><![CDATA[var loader:Loader = new Loader();<BR>var request:URLRequest = new URLRequest(imageFilePath);<BR>loader.load(request);<BR>addChild(loader); <BR>]]></code>

Question by:moagrius
4 Comments
 
LVL 111

Expert Comment

by:Ray Paseur
ID: 26276054
Can you please post a URL that contains the HTML you start with?  Thanks, ~Ray
 
LVL 19

Author Comment

by:moagrius
ID: 26276165
it's actually in a local database.  the first part of that snippet would actually be the entire contents of the row's "Content" field.

i've got it working to a large degree, but it's pretty ham-handed, and i ended up using an arbitrary string as a delimiter to preserve the separation of the code block while stripping tags, which i fear is inconsistent at best.

i'll post the less-than-elegant code and maybe you can suggest a better approach?

i appreciate the response.

(what precedes this is just a query to the database, where $row is the result of a mysql_fetch_assoc call...  $xml is just a string that will eventually be echoed out - i can post the whole script if you think that'd be helpful, but little besides what's attached is relevant.)
// open the node
$xml .= "\n\t\t<content>";

// strip slashes
$markup = stripslashes($row["Content"]);

// encode single quotes as entities
$markup = preg_replace("/\'/", "&apos;", $markup);

// encode double quotes as entities
$markup = preg_replace("/\"/", "&quot;", $markup);

// change line breaks to BR tags
$markup = preg_replace("/\n+/", "<BR>", $markup);

// replace open comment
$markup = preg_replace("/<!-- php buffer start -->/iU", "STARTCODEBLOCK", $markup);

// replace close comment
$markup = preg_replace("/<!-- php buffer end -->/iU", "ENDCODEBLOCK", $markup);

// strip tags now that commented php code is safe
$markup = strip_tags($markup, "<br>");

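// separate into nodes; the lookahead lets $3 capture the text after each
// code block (a bare trailing (.*?) always matches the empty string, which
// leaves anything after the last code block outside of any node)
if(strpos($markup, "STARTCODEBLOCK") !== false){
	$markup = preg_replace("/(.*?)STARTCODEBLOCK(.*?)ENDCODEBLOCK(.*?)(?=STARTCODEBLOCK|\\z)/s", "\n\t\t\t<para><![CDATA[$1]]></para>\n\t\t\t<code><![CDATA[$2]]></code>\n\t\t\t<para><![CDATA[$3]]></para>", $markup);
} else {
	$markup = "\n\t\t\t<para><![CDATA[" . $markup . "]]></para>";
}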

// get rid of empty nodes
$markup = str_replace("\n\t\t\t<para><![CDATA[]]></para>", "", $markup);

// add the nodes to the output
$xml .= $markup;

// close the node
$xml .= "\n\t\t</content>";

 
LVL 111

Accepted Solution

by:
Ray Paseur earned 2000 total points
ID: 26277662
"ham-handed" - yes, but don't let that cause you any discomfort.  All "scrape" scripts share that trait.

If you're still having trouble with the REGEX, let me suggest another "ham-handed" way of dealing with these sorts of things.  You can use explode() to break strings apart into arrays, based on a character string, like this:

$str = 'ABCBD';
$arr = explode('B', $str); // ARRAY HAS A, C, D

When scraping, I often find that I can use that strategy to isolate the important pieces of data, then the individual operations can be performed on each piece.  This makes it easier to use str_replace() instead of preg_replace, and since you're now working with smaller and more predictable strings of data, you have less risk of munging something else in the string with an errant REGEX.

I realize that is more of a strategy than an answer, but hopefully it guides in the right direction.  If I have some time later today, I'll try to plow into the actual code.  Best of luck with the project, ~Ray
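
The explode() strategy above might look something like the following for this particular markup. It is a rough sketch rather than tested production code: explode_blocks() is a hypothetical name, the marker strings are copied from the HTML in the question, the type declarations assume PHP 7 or later, and the markers are assumed to be balanced.

```php
<?php
// Hypothetical helper: returns [['para', text], ['code', text], ...] in order.
function explode_blocks(string $html): array
{
    // Marker strings as they appear in the scraped HTML.
    $start = '<!-- php buffer start -->';
    $end   = '<!-- php buffer end -->';

    $pieces = explode($start, $html);

    // Everything before the first start marker is plain paragraph text.
    $out = [['para', array_shift($pieces)]];

    foreach ($pieces as $piece) {
        // Each remaining piece is "code, end marker, text"; the limit of 2
        // keeps the trailing text in one piece.
        $halves = explode($end, $piece, 2);
        $out[] = ['code', $halves[0]];
        $out[] = ['para', $halves[1] ?? ''];
    }

    // Discard segments that are empty once tags and whitespace are removed.
    return array_values(array_filter($out, function (array $seg): bool {
        return trim(strip_tags($seg[1])) !== '';
    }));
}
```

Each returned piece is small and predictable, so the tag stripping and quote replacement can then be done per piece with strip_tags() and str_replace() instead of one large regular expression.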
 
LVL 19

Author Comment

by:moagrius
ID: 26279382
that'd be great - thanks for your time so far.