Solved

Need to clean up this html. Regex needed... urgent

Posted on 2004-04-23
Last Modified: 2012-06-27
Hi All,

I urgently need to solve this. I know it's not difficult, but I have never worked with regex.

In the simplified HTML below, I need to get rid of all <script>...</script> blocks and all <meta> tags. I do need to keep this JavaScript function, though:
<script language="javascript" type="text/javascript">
<!--
function OnLoadReport()
{
var pageHits = null;
var rep = new Report(1, 4, pageHits, false, docMapIds);
if (parent != self) parent.OnLoadReport(rep);
}
//-->
</script>

Note that the arguments passed to Report() can vary. Here is what I have so far:
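      public string removeJavaCode(string oldStr) {
          // Without RegexOptions.Singleline, '.' does not match '\n',
          // so multi-line script blocks only match once newlines are stripped.
          string pattern = @"<script[^>]*>.*?</script[^>]*>";
          string newStr  = Regex.Replace(oldStr, pattern, "");
          return newStr;
      }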

If I first strip the newlines with oldStr.Replace("\n", ""), this removes all the JavaScript. I need to keep that one piece, though.
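
As a minimal sketch of the same removal without stripping newlines first: the standard RegexOptions.Singleline flag lets '.' match '\n', so a block that spans several lines is matched as-is. The class and method names here are hypothetical and not from the post, and this version still removes every script block, including the one that should stay.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ScriptStripper
{
    // Hypothetical helper: removes every <script>...</script> block.
    public static string RemoveAllScripts(string html)
    {
        // Singleline lets '.' match '\n', so blocks spanning several lines
        // are matched without removing newlines from the input first.
        const string pattern = @"<script[^>]*>.*?</script[^>]*>";
        return Regex.Replace(html, pattern, "",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        string html =
            "<p>a</p>\n" +
            "<script type=\"text/javascript\">\nvar x = 1;\n</script>\n" +
            "<p>b</p>";
        // The two paragraphs remain; the script block is gone.
        Console.WriteLine(RemoveAllScripts(html));
    }
}
```

Because the input keeps its line breaks, the rest of the HTML keeps its original layout.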

Thanks for any help.

Pureo

/////////////////////////////////////////////////
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>
</title>
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
<META http-equiv="Content-Style-Type" content="text/css">
<META http-equiv="Content-Script-Type" content="text/javascript">


<style type="text/css">//here being a stylesheet</style>

<script language="javascript" type="text/javascript">
<!--
//-->
</script><script language="javascript" type="text/javascript" src="?rs:Command=Get&amp;rc:GetImage=8.00.743.00Report.js">

</script><script language="javascript" type="text/javascript">
<!--
function OnLoadReport()
{
var pageHits = null;
var rep = new Report(1, 4, pageHits, false, docMapIds);
if (parent != self) parent.OnLoadReport(rep);
}
//-->
</script>

<script language="javascript" type="text/javascript" src="ReportViewer.js"></script></head>
<body onload="javascript:OnLoadReport();" style="OVERFLOW: hidden; BORDER: 0px; MARGIN: 0px; PADDING: 0px">
<div id="oReportDiv" onresize="javascript:OnResizeDiv()" style="OVERFLOW: auto; WIDTH: 100%; HEIGHT: 100%">
//here being html code.
<script language="javascript" type="text/javascript">
<!--
var docMapIds = [];
//-->
</script>

</div>
</body>
</html>
///////////////////////////////////////////////////////
Question by:pureo
9 Comments
 

Author Comment

by:pureo
ID: 10908125
Hi,

I have those links too. I need a regex for my specific problem, not a link on how to clean up HTML.

Thanks,
Pureo
 
LVL 10

Expert Comment

by:eternal_21
ID: 10909776
Just to clarify,

You want to remove all SCRIPT and META tags, except:

  1. SCRIPT tags that contain the specified function (OnLoadReport), or
  2. SCRIPT tags that contain any JavaScript functions, or
  3. SCRIPT tags that contain functions in any language?

Author Comment

by:pureo
ID: 10909794
Hello,

Please ignore this line in my first post: <script language="javascript" type="text/javascript" src="ReportViewer.js"></script>. It is not in the source when it enters the function. So the HTML I need to modify is the same as in my first post, minus that line in the head section. Sorry about that.

Thanks a lot. After the modifications, the result should look like this:


/////////////////////////////////////////////////
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>
</title>

<style type="text/css">//here being a stylesheet</style>

<script language="javascript" type="text/javascript">
<!--
function OnLoadReport()
{
var pageHits = null;
var rep = new Report(1, 4, pageHits, false, docMapIds);
if (parent != self) parent.OnLoadReport(rep);
}
//-->
</script>

</head>
<body onload="javascript:OnLoadReport();" style="OVERFLOW: hidden; BORDER: 0px; MARGIN: 0px; PADDING: 0px">
<div id="oReportDiv" onresize="javascript:OnResizeDiv()" style="OVERFLOW: auto; WIDTH: 100%; HEIGHT: 100%">
//here being html code.

</div>
</body>
</html>
///////////////////////////////////////////////////////
 
LVL 10

Expert Comment

by:eternal_21
ID: 10909814
What about this part:

<script language="javascript" type="text/javascript">
<!--
//-->
</script>
<script language="javascript" type="text/javascript" src="?rs:Command=Get&amp;rc:GetImage=8.00.743.00Report.js">
</script>

Is that in the source code as well?
 

Author Comment

by:pureo
ID: 10909823
Yes, that part is in the source code.

Thanks.
Pureo
 
LVL 10

Accepted Solution

by:
eternal_21 earned 500 total points
ID: 10909871
The following function:

  // Requires: using System.Text.RegularExpressions;
  public static string ParseHtml(string sourceString)
  {
    string newString;

    // metaPattern matches any <META ...> tag and an optional trailing line break.
    const string metaPattern = @"<META[^>]*>(\r)?\n?";
    Regex metaRegex;
    metaRegex = new Regex(metaPattern, RegexOptions.Singleline|RegexOptions.IgnoreCase);
    newString = metaRegex.Replace(sourceString, "");

    // javascriptPattern matches any <SCRIPT> block that does not have a '{' or a '}'.
    const string javascriptPattern = @"<SCRIPT[^>]*>[^{}]*?</SCRIPT>(\r)?\n?";
    Regex javascriptRegex;
    javascriptRegex = new Regex(javascriptPattern, RegexOptions.Singleline|RegexOptions.IgnoreCase);
    newString = javascriptRegex.Replace(newString, "");

    return newString;
  }

produced this output:

### OUTPUT ###

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>
</title>

<style type="text/css">//here being a stylesheet</style>

<script language="javascript" type="text/javascript">
<!--
function OnLoadReport()
{
var pageHits = null;
var rep = new Report(1, 4, pageHits, false, docMapIds);
if (parent != self) parent.OnLoadReport(rep);
}
//-->
</script>

</head>
<body onload="javascript:OnLoadReport();" style="OVERFLOW: hidden; BORDER: 0px; MARGIN: 0px; PADDING: 0px">
<div id="oReportDiv" onresize="javascript:OnResizeDiv()" style="OVERFLOW: auto; WIDTH: 100%; HEIGHT: 100%">

</div>
</body>
</html>

###

Based on this source code:

### SOURCE CODE ###

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>
</title>
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
<META http-equiv="Content-Style-Type" content="text/css">
<META http-equiv="Content-Script-Type" content="text/javascript">

<style type="text/css">//here being a stylesheet</style>

<script language="javascript" type="text/javascript">
<!--
//-->
</script><script language="javascript" type="text/javascript" src="?rs:Command=Get&amp;rc:GetImage=8.00.743.00Report.js">

</script><script language="javascript" type="text/javascript">
<!--
function OnLoadReport()
{
var pageHits = null;
var rep = new Report(1, 4, pageHits, false, docMapIds);
if (parent != self) parent.OnLoadReport(rep);
}
//-->
</script>

<script language="javascript" type="text/javascript" src="ReportViewer.js"></script></head>
<body onload="javascript:OnLoadReport();" style="OVERFLOW: hidden; BORDER: 0px; MARGIN: 0px; PADDING: 0px">
<div id="oReportDiv" onresize="javascript:OnResizeDiv()" style="OVERFLOW: auto; WIDTH: 100%; HEIGHT: 100%">
<script language="javascript" type="text/javascript">
<!--
var docMapIds = [];
//-->
</script>

</div>
</body>
</html>

###
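
One thing to keep in mind about the brace test above: it keeps any script block that contains a '{' or a '}', whatever that block defines. If only the block that declares OnLoadReport should survive (option 1 in the earlier clarification), a sketch along these lines might work. It assumes a C# version with lambda support (3.0 or later); ReportHtmlCleaner, KeepOnlyOnLoadReport and the marker pattern are hypothetical names, not part of the accepted answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReportHtmlCleaner
{
    // Matches any <META ...> tag plus one optional trailing line break.
    private static readonly Regex MetaTag = new Regex(
        @"<meta[^>]*>\r?\n?", RegexOptions.IgnoreCase);

    // Matches one whole <script>...</script> block; Singleline lets '.' span lines.
    private static readonly Regex ScriptBlock = new Regex(
        @"<script[^>]*>.*?</script\s*>\r?\n?",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Assumption: the block to keep is identified by this function declaration.
    private static readonly Regex KeepMarker = new Regex(
        @"function\s+OnLoadReport\s*\(");

    public static string KeepOnlyOnLoadReport(string html)
    {
        string withoutMeta = MetaTag.Replace(html, "");
        // Keep a script block only if it declares OnLoadReport; drop every other block.
        return ScriptBlock.Replace(withoutMeta,
            m => KeepMarker.IsMatch(m.Value) ? m.Value : "");
    }

    public static void Main()
    {
        string html =
            "<META http-equiv=\"Content-Type\" content=\"text/html\">\n" +
            "<script src=\"Report.js\">\n</script><script>\n" +
            "function OnLoadReport()\n{\nvar rep = null;\n}\n</script>\n" +
            "<script>\nvar docMapIds = [];\n</script>\n";
        // Only the block that declares OnLoadReport is kept.
        Console.WriteLine(KeepOnlyOnLoadReport(html));
    }
}
```

The match evaluator decides per block, so a script that happens to contain braces but not the OnLoadReport declaration is removed as well.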
 
LVL 10

Expert Comment

by:eternal_21
ID: 10909872
Is that what you are looking for?
 

Author Comment

by:pureo
ID: 10909897
Nice, thanks a lot!

Pureo