Solved

Need to clean up this html. Regex needed... urgent

Posted on 2004-04-23
9
436 Views
Last Modified: 2012-06-27
Hi All,

I am in an extreme need to solve this. I know it's not defficult, but I never worked with regex.

In a simple version of html below, I need to get rid of all <sctipt>tags</script>, all <meta> tags. I need to leave this javascript function though:
<script language="javascript" type="text/javascript">
<!--
function OnLoadReport()
{
var pageHits = null;
var rep = new Report(1, 4, pageHits, false, docMapIds);
if (parent != self) parent.OnLoadReport(rep);
}
//-->
</script>

Note that the arguments can be different of the function Report(). Here is what I have so far...(a)
      public string removeJavaCode(string oldStr) {

                string pattern = @"<script[^>]*>.*?</script[^>]*>";
                string newStr  = Regex.Replace(oldStr,pattern,"");
                return oldStr;//newStr;
      }      

This, if I pass in string.Replace("\n","") removes all javascript. I need to leave that piece though.

Thanks for any help.

Puero

/////////////////////////////////////////////////
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>
</title>
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
<META http-equiv="Content-Style-Type" content="text/css">
<META http-equiv="Content-Script-Type" content="text/javascript">


<style type="text/css">//here being a stylesheet</style>

<script language="javascript" type="text/javascript">
<!--
//-->
</script><script language="javascript" type="text/javascript" src="?rs:Command=Get&amp;rc:GetImage=8.00.743.00Report.js">

</script><script language="javascript" type="text/javascript">
<!--
function OnLoadReport()
{
var pageHits = null;
var rep = new Report(1, 4, pageHits, false, docMapIds);
if (parent != self) parent.OnLoadReport(rep);
}
//-->
</script>

<script language="javascript" type="text/javascript" src="ReportViewer.js"></script></head>
<body onload="javascript:OnLoadReport();" style="OVERFLOW: hidden; BORDER: 0px; MARGIN: 0px; PADDING: 0px">
<div id="oReportDiv" onresize="javascript:OnResizeDiv()" style="OVERFLOW: auto; WIDTH: 100%; HEIGHT: 100%">
//here being html code.
<script language="javascript" type="text/javascript">
<!--
var docMapIds = [];
//-->
</script>

</div>
</body>
</html>
///////////////////////////////////////////////////////
0
Comment
Question by:pureo
  • 4
  • 4
9 Comments
 
LVL 23

Expert Comment

by:rama_krishna580
ID: 10907011
0
 

Author Comment

by:pureo
ID: 10908125
Hi,

I have those links too. I need regex expression for my problem and not link how to clean up html.

Thanks,
Puero
0
 
LVL 10

Expert Comment

by:eternal_21
ID: 10909776
Just to clarify,

You want to remove all SCRIPT and META tags, except:

  1. SCRIPT tags that have the function specified (OnLoadReport), or
  2. SCRIPT tags that have any javascript functions,
  3. SCRIPT tags with any language functions ?
0
 

Author Comment

by:pureo
ID: 10909794
Hello,

please in my first posting, don't mind this line: <script language="javascript" type="text/javascript" src="ReportViewer.js"></script>, that one is not in the source before entering the function. So the html I need to modify is the same as I posted in my first post, except this line in the head section. Sorry about that.

Thanks a lot, this is how it should look after modifications:
the result should look like this:


/////////////////////////////////////////////////
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>
</title>

<style type="text/css">//here being a stylesheet</style>

<script language="javascript" type="text/javascript">
<!--
function OnLoadReport()
{
var pageHits = null;
var rep = new Report(1, 4, pageHits, false, docMapIds);
if (parent != self) parent.OnLoadReport(rep);
}
//-->
</script>

</head>
<body onload="javascript:OnLoadReport();" style="OVERFLOW: hidden; BORDER: 0px; MARGIN: 0px; PADDING: 0px">
<div id="oReportDiv" onresize="javascript:OnResizeDiv()" style="OVERFLOW: auto; WIDTH: 100%; HEIGHT: 100%">
//here being html code.

</div>
</body>
</html>
///////////////////////////////////////////////////////
0
Is Your Active Directory as Secure as You Think?

More than 75% of all records are compromised because of the loss or theft of a privileged credential. Experts have been exploring Active Directory infrastructure to identify key threats and establish best practices for keeping data safe. Attend this month’s webinar to learn more.

 
LVL 10

Expert Comment

by:eternal_21
ID: 10909814
What about this part:

<script language="javascript" type="text/javascript">
<!--
//-->
</script>
<script language="javascript" type="text/javascript" src="?rs:Command=Get&amp;rc:GetImage=8.00.743.00Report.js">
</script>

Is that in the source code as well?
0
 

Author Comment

by:pureo
ID: 10909823
Yes, that part is in the source code.

Thanks.
Pureo
0
 
LVL 10

Accepted Solution

by:
eternal_21 earned 500 total points
ID: 10909871
The following function:

  public static string ParseHtml(string sourceString)
  {
    string newString;

    // javascriptPattern matches any <META ...> tags
    const string metaPattern = @"<META[^>]*>(\r)?\n?";
    Regex metaRegex;
    metaRegex = new Regex(metaPattern, RegexOptions.Singleline|RegexOptions.IgnoreCase);
    newString = metaRegex.Replace(sourceString, "");

    // javascriptPattern matches any <SCRIPT> block that does not have a '{' or a '}'.
    const string javascriptPattern = @"<SCRIPT[^>]*>[^{}]*?</SCRIPT>(\r)?\n?";
    Regex javascriptRegex;
    javascriptRegex = new Regex(javascriptPattern, RegexOptions.Singleline|RegexOptions.IgnoreCase);
    newString = javascriptRegex.Replace(newString, "");

    return newString;
  }

Produced the output:

### OUTPUT ###

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>
</title>

<style type="text/css">//here being a stylesheet</style>

<script language="javascript" type="text/javascript">
<!--
function OnLoadReport()
{
var pageHits = null;
var rep = new Report(1, 4, pageHits, false, docMapIds);
if (parent != self) parent.OnLoadReport(rep);
}
//-->
</script>

</head>
<body onload="javascript:OnLoadReport();" style="OVERFLOW: hidden; BORDER: 0px; MARGIN: 0px; PADDING: 0px">
<div id="oReportDiv" onresize="javascript:OnResizeDiv()" style="OVERFLOW: auto; WIDTH: 100%; HEIGHT: 100%">

</div>
</body>
</html>

###

Based on this source code:

### SOURCE CODE ###

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>
</title>
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
<META http-equiv="Content-Style-Type" content="text/css">
<META http-equiv="Content-Script-Type" content="text/javascript">

<style type="text/css">//here being a stylesheet</style>

<script language="javascript" type="text/javascript">
<!--
//-->
</script><script language="javascript" type="text/javascript" src="?rs:Command=Get&amp;rc:GetImage=8.00.743.00Report.js">

</script><script language="javascript" type="text/javascript">
<!--
function OnLoadReport()
{
var pageHits = null;
var rep = new Report(1, 4, pageHits, false, docMapIds);
if (parent != self) parent.OnLoadReport(rep);
}
//-->
</script>

<script language="javascript" type="text/javascript" src="ReportViewer.js"></script></head>
<body onload="javascript:OnLoadReport();" style="OVERFLOW: hidden; BORDER: 0px; MARGIN: 0px; PADDING: 0px">
<div id="oReportDiv" onresize="javascript:OnResizeDiv()" style="OVERFLOW: auto; WIDTH: 100%; HEIGHT: 100%">
<script language="javascript" type="text/javascript">
<!--
var docMapIds = [];
//-->
</script>

</div>
</body>
</html>

###
0
 
LVL 10

Expert Comment

by:eternal_21
ID: 10909872
Is that what you are looking for?
0
 

Author Comment

by:pureo
ID: 10909897
Nice, thanks a lot!

Pureo
0

Featured Post

Is Your Active Directory as Secure as You Think?

More than 75% of all records are compromised because of the loss or theft of a privileged credential. Experts have been exploring Active Directory infrastructure to identify key threats and establish best practices for keeping data safe. Attend this month’s webinar to learn more.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Title # Comments Views Activity
Events in static methods 3 50
This code tracks birthdays 3 61
rest webservice call over https via c# 6 28
designing in object programming 12 52
Introduction                                                 Was the var keyword really only brought out to shorten your syntax? Or have the VB language guys got their way in C#? What type of variable is it? All will be revealed.   Also called…
In order to hide the "ugly" records selectors (triangles) in the rowheaders, here are some suggestions. Microsoft doesn't have a direct method/property to do it. You can only hide the rowheader column. First solution, the easy way The first sol…
This Micro Tutorial will teach you how to censor certain areas of your screen. The example in this video will show a little boy's face being blurred. This will be demonstrated using Adobe Premiere Pro CS6.
Windows 10 is mostly good. However the one thing that annoys me is how many clicks you have to do to dial a VPN connection. You have to go to settings from the start menu, (2 clicks), Network and Internet (1 click), Click VPN (another click) then fi…

863 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

25 Experts available now in Live!

Get 1:1 Help Now