Need to convert scraped HTML to XML nodes, without tags

Hi,

I'm grabbing HTML from a forum, with output that looks like the first part of the attached snippet.  The second part of the snippet is an example of what I'm trying to get back from the first part.

I'm trying to achieve:
- strip HTML tags, except <BR>, using strip_tags($markup, "<BR>");
- convert single and double quotes to HTML entities using htmlspecialchars($markup, ENT_QUOTES);
- convert line breaks to <BR>, using both nl2br and preg_replace (to condense multiple consecutive line breaks to one or two <BR>)
The above seems to be working OK, but the problem is in the next step:

I need to convert it into XML nodes.  Each "code" block needs to be in a <code> node (the code blocks are marked by both normal code tags as well as the <!-- php buffer start/end --> comment elements), and everything between, before or after a code block should go in a <paragraph> node.

So basically everything up to the start of the first code block (<!-- php buffer start -->) goes into a paragraph node, then open a code node and put everything in that until we hit the end of the code block (<!-- php buffer end -->), then open a new paragraph node and put everything else until it hits another new code block, etc.

All nodes should use CDATA wrappers.
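
One way to sketch the split described above is to let preg_split() cut the markup on the buffer comments and only strip tags from each piece afterwards. This is a minimal sketch, not a tested solution for every post: the function names cdata() and markup_to_nodes() are made up for illustration, and the <para>/<code> node names are taken from the expected output in the snippet. It also assumes every start comment has a matching end comment.

```php
<?php
// Hypothetical helper: wrap text in a CDATA section, splitting any literal "]]>"
// so the text cannot terminate the section early.
function cdata($text)
{
    return '<![CDATA[' . str_replace(']]>', ']]]]><![CDATA[>', $text) . ']]>';
}

// Hypothetical helper: turn scraped markup into <para>/<code> nodes.
function markup_to_nodes($markup)
{
    // Split on the buffer comments. The capture group keeps the contents of
    // each code block, which land at the odd indexes of the result.
    $parts = preg_split(
        '/<!-- php buffer start -->(.*?)<!-- php buffer end -->/is',
        $markup,
        -1,
        PREG_SPLIT_DELIM_CAPTURE
    );

    $nodes = array();
    foreach ($parts as $i => $part) {
        // Tags are stripped per piece, after the split, so losing the
        // comments to strip_tags() no longer matters.
        $text = trim(strip_tags($part, '<br>'));
        if ($text === '') {
            continue; // skip empty nodes
        }
        $tag = ($i % 2 === 1) ? 'code' : 'para';
        $nodes[] = '<' . $tag . '>' . cdata($text) . '</' . $tag . '>';
    }
    return implode("\n", $nodes);
}
```

Quote conversion and line-break handling are left out here on purpose; they could be applied to $text inside the loop, after strip_tags() has run.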

I've been trying mostly regexp functions like preg_replace and preg_match, and I've gotten close a couple of times, but every time I think I'm making progress, some other bit falls apart.  For example, if I call strip_tags in the wrong spot, I lose the php buffer comments, so I can't determine where code blocks start and finish.  My regexps also often fail to match properly.
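
The ordering problem with strip_tags can be shown in a few lines: strip_tags() removes HTML comments along with the tags, so the buffer comments have to be protected before it runs. The sample string and the [[CODE_START]] / [[CODE_END]] placeholders below are made up for illustration; any placeholder that cannot occur in the real content would do.

```php
<?php
$markup = '<CODE><!-- php buffer start --><SPAN>echo $x;</SPAN><!-- php buffer end --></CODE>';

// Stripping first removes the comments along with the tags,
// so the code block boundaries are gone.
$stripped_first = strip_tags($markup, '<br>');

// Swapping the comments for plain-text placeholders first lets the
// boundaries survive strip_tags().
$protected = str_ireplace(
    array('<!-- php buffer start -->', '<!-- php buffer end -->'),
    array('[[CODE_START]]', '[[CODE_END]]'),
    $markup
);
$stripped_after = strip_tags($protected, '<br>');
```

After this runs, $stripped_first holds only the code text, while $stripped_after still carries both placeholders around it.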

I've spent more time on this than I'd care to admit.  If anyone could help me get this wrapped up, I'd really appreciate it.

TYIA




////////////////////////////////
// the html i'm starting with //
////////////////////////////////

first, you need to give Flash the new file name. You can use a standard querystring or Flash vars, and php to echo it out:<BR>
<DIV style="MARGIN: 5px 20px 20px">
<DIV class=smallfont style="MARGIN-BOTTOM: 2px">PHP Code:</DIV>
<DIV class=alt2 dir=ltr style="BORDER-RIGHT: 1px inset; PADDING-RIGHT: 6px; BORDER-TOP: 1px inset; PADDING-LEFT: 6px; PADDING-BOTTOM: 6px; MARGIN: 0px; OVERFLOW: auto; BORDER-LEFT: 1px inset; WIDTH: 640px; PADDING-TOP: 6px; BORDER-BOTTOM: 1px inset; HEIGHT: 66px; TEXT-ALIGN: left"><CODE style="WHITE-SPACE: nowrap"><!-- php buffer start --><CODE><SPAN style="COLOR: #000000">&lt;param name="movie" value="Test_Editor.swf?filename=<SPAN style="COLOR: #0000bb">&lt;?php </SPAN><SPAN style="COLOR: #007700">echo </SPAN><SPAN style="COLOR: #0000bb">$newname</SPAN><SPAN style="COLOR: #007700">; </SPAN><SPAN style="COLOR: #0000bb">?&gt;</SPAN>" /&gt;<BR>...<BR>&lt;object type="application/x-shockwave-flash" data="Test_Editor.swf?filename=<SPAN style="COLOR: #0000bb">&lt;?php </SPAN><SPAN style="COLOR: #007700">echo </SPAN><SPAN style="COLOR: #0000bb">$newname</SPAN><SPAN style="COLOR: #007700">; </SPAN><SPAN style="COLOR: #0000bb">?&gt;</SPAN>" width="964" height="510"&gt;</SPAN> </CODE><!-- php buffer end --></CODE></DIV></DIV>as you can see, we set a variable named "filename" to the $newname php variable.<BR><BR>now, it'll be availabe in flash in the loaderInfo.parameters object, which will now have a property named "filename" that will be the path from $newname: 
<DIV style="MARGIN: 5px 20px 20px">
<DIV class=smallfont style="MARGIN-BOTTOM: 2px">PHP Code:</DIV>
<DIV class=alt2 dir=ltr style="BORDER-RIGHT: 1px inset; PADDING-RIGHT: 6px; BORDER-TOP: 1px inset; PADDING-LEFT: 6px; PADDING-BOTTOM: 6px; MARGIN: 0px; OVERFLOW: auto; BORDER-LEFT: 1px inset; WIDTH: 640px; PADDING-TOP: 6px; BORDER-BOTTOM: 1px inset; HEIGHT: 34px; TEXT-ALIGN: left"><CODE style="WHITE-SPACE: nowrap"><!-- php buffer start --><CODE><SPAN style="COLOR: #000000"><SPAN style="COLOR: #0000bb"></SPAN><SPAN style="COLOR: #007700">var </SPAN><SPAN style="COLOR: #0000bb">imageFilePath</SPAN><SPAN style="COLOR: #007700">:</SPAN><SPAN style="COLOR: #0000bb">String </SPAN><SPAN style="COLOR: #007700">= </SPAN><SPAN style="COLOR: #0000bb">root</SPAN><SPAN style="COLOR: #007700">.</SPAN><SPAN style="COLOR: #0000bb">loaderInfo</SPAN><SPAN style="COLOR: #007700">.</SPAN><SPAN style="COLOR: #0000bb">parameters</SPAN><SPAN style="COLOR: #007700">.</SPAN><SPAN style="COLOR: #0000bb">filename</SPAN><SPAN style="COLOR: #007700">; <BR></SPAN><SPAN style="COLOR: #0000bb"></SPAN></SPAN></CODE><!-- php buffer end --></CODE></DIV></DIV>then load it into flash using a Loader object: 
<DIV style="MARGIN: 5px 20px 20px">
<DIV class=smallfont style="MARGIN-BOTTOM: 2px">PHP Code:</DIV>
<DIV class=alt2 dir=ltr style="BORDER-RIGHT: 1px inset; PADDING-RIGHT: 6px; BORDER-TOP: 1px inset; PADDING-LEFT: 6px; PADDING-BOTTOM: 6px; MARGIN: 0px; OVERFLOW: auto; BORDER-LEFT: 1px inset; WIDTH: 640px; PADDING-TOP: 6px; BORDER-BOTTOM: 1px inset; HEIGHT: 82px; TEXT-ALIGN: left"><CODE style="WHITE-SPACE: nowrap"><!-- php buffer start --><CODE><SPAN style="COLOR: #000000"><SPAN style="COLOR: #0000bb"></SPAN><SPAN style="COLOR: #007700">var </SPAN><SPAN style="COLOR: #0000bb">loader</SPAN><SPAN style="COLOR: #007700">:</SPAN><SPAN style="COLOR: #0000bb">Loader </SPAN><SPAN style="COLOR: #007700">= new </SPAN><SPAN style="COLOR: #0000bb">Loader</SPAN><SPAN style="COLOR: #007700">();<BR>var </SPAN><SPAN style="COLOR: #0000bb">request</SPAN><SPAN style="COLOR: #007700">:</SPAN><SPAN style="COLOR: #0000bb">URLRequest </SPAN><SPAN style="COLOR: #007700">= new </SPAN><SPAN style="COLOR: #0000bb">URLRequest</SPAN><SPAN style="COLOR: #007700">(</SPAN><SPAN style="COLOR: #0000bb">imageFilePath</SPAN><SPAN style="COLOR: #007700">);<BR></SPAN><SPAN style="COLOR: #0000bb">loader</SPAN><SPAN style="COLOR: #007700">.</SPAN><SPAN style="COLOR: #0000bb">load</SPAN><SPAN style="COLOR: #007700">(</SPAN><SPAN style="COLOR: #0000bb">request</SPAN><SPAN style="COLOR: #007700">);<BR></SPAN><SPAN style="COLOR: #0000bb">addChild</SPAN><SPAN style="COLOR: #007700">(</SPAN><SPAN style="COLOR: #0000bb">loader</SPAN><SPAN style="COLOR: #007700">); <BR></SPAN><SPAN style="COLOR: #0000bb"></SPAN></SPAN></CODE><!-- php buffer end --></CODE></DIV></DIV>

//////////////////////////////////
// the return i'm trying to get //
//////////////////////////////////

<para><![CDATA[first, you need to give Flash the new file name. You can use a standard querystring or Flash vars, and php to echo it out:><BR><BR><BR>PHP Code:<BR>]]></para> 
<code><![CDATA[&lt;param name=&quot;movie&quot; value=&quot;Test_Editor.swf?filename=&lt;?php echo $newname; ?&gt;&quot; /&gt;><BR>...<BR>&lt;object type=&quot;application/x-shockwave-flash&quot; data=&quot;Test_Editor.swf?filename=&lt;?php echo $newname; ?&gt;&quot; width=&quot;964&quot; height=&quot;510&quot;&gt; ]]></code> 
<para><![CDATA[as you can see, we set a variable named &quot;filename&quot; to the $newname php variable.<BR><BR>now, it&amp;ll be availabe in flash in the loaderInfo.parameters object, which will now have a property named &quot;filename&quot; that will be the path from $newname:<BR><BR>PHP Code:<BR>]]></para>
<code><![CDATA[var imageFilePath:String = root.loaderInfo.parameters.filename; <BR>]]></code>
<para><![CDATA[then load it into flash using a Loader object: <BR><BR>PHP Code:<BR>]]></para>
<code><![CDATA[var loader:Loader = new Loader();<BR>var request:URLRequest = new URLRequest(imageFilePath);<BR>loader.load(request);<BR>addChild(loader); <BR>]]></code>

Asked by: moagrius
 
Ray Paseur Commented:
Can you please post a URL that contains the HTML you start with?  Thanks, ~Ray
 
moagrius (Author) Commented:
It's actually in a local database.  The first part of that snippet would be the entire contents of the row's "Content" field.

I've got it working to a large degree, but it's pretty ham-handed, and I ended up using an arbitrary string as a delimiter to preserve the separation of the code block while stripping tags, which I fear is inconsistent at best.

I'll post the less-than-elegant code and maybe you can suggest a better approach?

I appreciate the response.

(What precedes this is just a query to the database, where $row is the result of a mysql_fetch_assoc call.  $xml is just a string that will eventually be echoed out.  I can post the whole script if you think that'd be helpful, but little besides what's attached is relevant.)
// open the node
$xml .= "\n\t\t<content>";

// strip slashes
$markup = stripslashes($row["Content"]);

// convert single and double quotes to entities (plain str_replace, no regex needed)
$markup = str_replace(array("'", "\""), array("&apos;", "&quot;"), $markup);

// change each run of line breaks (\r\n, \r or \n) to a single BR tag
$markup = preg_replace("/(?:\r\n|\r|\n)+/", "<BR>", $markup);

// swap the buffer comments for placeholder strings so strip_tags() does not remove them
$markup = str_ireplace("<!-- php buffer start -->", "STARTCODEBLOCK", $markup);
$markup = str_ireplace("<!-- php buffer end -->", "ENDCODEBLOCK", $markup);

// strip tags now that the code block boundaries are safe
$markup = strip_tags($markup, "<br>");

// separate into nodes
if(strpos($markup, "STARTCODEBLOCK") !== false){
	// the third group now runs up to the next STARTCODEBLOCK or the end of the string;
	// a bare trailing (.*?) always matches nothing, which left the text after the
	// last code block outside any node
	$markup = preg_replace("/(.*?)STARTCODEBLOCK(.*?)ENDCODEBLOCK(.*?)(?=STARTCODEBLOCK|\\z)/s", "\n\t\t\t<para><![CDATA[$1]]></para>\n\t\t\t<code><![CDATA[$2]]></code>\n\t\t\t<para><![CDATA[$3]]></para>", $markup);
} else {
	$markup = "\n\t\t\t<para><![CDATA[" . $markup . "]]></para>";
}

// get rid of empty nodes
$markup = str_replace("\n\t\t\t<para><![CDATA[]]></para>", "", $markup);

// add the nodes to the output
$xml .= $markup;

// close the node
$xml .= "\n\t\t</content>";

 
Ray Paseur Commented:
"ham-handed" - yes, but don't let that cause you any discomfort.  All "scrape" scripts share that trait.

If you're still having trouble with the REGEX, let me suggest another "ham-handed" way of dealing with these sorts of things.  You can use explode() to break strings apart into arrays, based on a character string, like this:

$str = 'ABCBD';
$arr = explode('B', $str); // ARRAY HAS A, C, D

When scraping, I often find that I can use that strategy to isolate the important pieces of data, then the individual operations can be performed on each piece.  This makes it easier to use str_replace() instead of preg_replace, and since you're now working with smaller and more predictable strings of data, you have less risk of munging something else in the string with an errant REGEX.
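
Applied to the markup in the question, the explode() strategy might look something like the sketch below. The function name explode_to_nodes() is hypothetical, the <para>/<code> node names come from the asker's expected output, and the sketch assumes the buffer comments always appear with exactly this spelling and case, since explode() is case-sensitive.

```php
<?php
// Hypothetical helper built on the explode() strategy.
function explode_to_nodes($markup)
{
    $nodes = array();
    // Every chunk after the first one begins inside a code block.
    $chunks = explode('<!-- php buffer start -->', $markup);
    foreach ($chunks as $i => $chunk) {
        if ($i === 0) {
            $pieces = array('para' => $chunk);
        } else {
            // Limit of 2: code before the end marker, ordinary text after it.
            $halves = explode('<!-- php buffer end -->', $chunk, 2);
            $pieces = array(
                'code' => $halves[0],
                'para' => isset($halves[1]) ? $halves[1] : '',
            );
        }
        foreach ($pieces as $tag => $text) {
            // Each piece is small and predictable, so plain string
            // functions are enough from here on.
            $text = trim(strip_tags($text, '<br>'));
            if ($text !== '') {
                $nodes[] = '<' . $tag . '><![CDATA[' . $text . ']]></' . $tag . '>';
            }
        }
    }
    return $nodes;
}
```

The function returns an array of node strings, so implode("\n", explode_to_nodes($markup)) would give a block that can be appended to the output.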

I realize that is more of a strategy than an answer, but hopefully it guides you in the right direction.  If I have some time later today, I'll try to plow into the actual code.  Best of luck with the project, ~Ray
 
moagrius (Author) Commented:
that'd be great - thanks for your time so far.