Solved

Need to clean up this html. Regex needed... urgent

Posted on 2004-04-23
Last Modified: 2012-06-27
Hi All,

I urgently need to solve this. I know it's not difficult, but I have never worked with regex.

In the simplified HTML below, I need to get rid of all <script>...</script> blocks and all <meta> tags. I need to keep this JavaScript function, though:
<script language="javascript" type="text/javascript">
<!--
function OnLoadReport()
{
var pageHits = null;
var rep = new Report(1, 4, pageHits, false, docMapIds);
if (parent != self) parent.OnLoadReport(rep);
}
//-->
</script>

Note that the arguments passed to Report() can differ. Here is what I have so far:
      public string removeJavaCode(string oldStr) {

                // Matches a whole <script> block; by default '.' does not match '\n'.
                string pattern = @"<script[^>]*>.*?</script[^>]*>";
                string newStr  = Regex.Replace(oldStr,pattern,"");
                return newStr;
      }

If I first strip the newlines from the input with string.Replace("\n",""), this removes all JavaScript. I need to keep that one function, though.
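
The newline workaround described above is needed because `.` does not match `\n` by default. One possible alternative, sketched here as an assumption rather than as part of the original post, is to pass `RegexOptions.Singleline` so the pattern spans lines without altering the input. The names `ScriptStripper` and `RemoveAllScripts` are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ScriptStripper // hypothetical name, for illustration only
{
    // Removes every <script>...</script> block, including multi-line ones.
    public static string RemoveAllScripts(string html)
    {
        // Singleline makes '.' match '\n'; IgnoreCase also catches <SCRIPT>.
        const string pattern = @"<script[^>]*>.*?</script[^>]*>";
        return Regex.Replace(html, pattern, "",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        string html = "<p>a</p>\n<script>\nvar x = 1;\n</script>\n<p>b</p>";
        Console.WriteLine(RemoveAllScripts(html));
    }
}
```

This still removes every script block, so it only addresses the newline issue, not the requirement to keep one function.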

Thanks for any help.

Puero

/////////////////////////////////////////////////
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>
</title>
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
<META http-equiv="Content-Style-Type" content="text/css">
<META http-equiv="Content-Script-Type" content="text/javascript">


<style type="text/css">//here being a stylesheet</style>

<script language="javascript" type="text/javascript">
<!--
//-->
</script><script language="javascript" type="text/javascript" src="?rs:Command=Get&amp;rc:GetImage=8.00.743.00Report.js">

</script><script language="javascript" type="text/javascript">
<!--
function OnLoadReport()
{
var pageHits = null;
var rep = new Report(1, 4, pageHits, false, docMapIds);
if (parent != self) parent.OnLoadReport(rep);
}
//-->
</script>

<script language="javascript" type="text/javascript" src="ReportViewer.js"></script></head>
<body onload="javascript:OnLoadReport();" style="OVERFLOW: hidden; BORDER: 0px; MARGIN: 0px; PADDING: 0px">
<div id="oReportDiv" onresize="javascript:OnResizeDiv()" style="OVERFLOW: auto; WIDTH: 100%; HEIGHT: 100%">
//here being html code.
<script language="javascript" type="text/javascript">
<!--
var docMapIds = [];
//-->
</script>

</div>
</body>
</html>
///////////////////////////////////////////////////////
Question by:pureo
9 Comments
 
LVL 23

Expert Comment

by:rama_krishna580
ID: 10907011
 

Author Comment

by:pureo
ID: 10908125
Hi,

I have those links too. I need a regex for my specific problem, not links on how to clean up HTML.

Thanks,
Puero
 
LVL 10

Expert Comment

by:eternal_21
ID: 10909776
Just to clarify,

You want to remove all SCRIPT and META tags, except:

  1. SCRIPT tags that have the function specified (OnLoadReport), or
  2. SCRIPT tags that have any javascript functions, or
  3. SCRIPT tags that have functions in any language?
 

Author Comment

by:pureo
ID: 10909794
Hello,

In my first post, please ignore this line: <script language="javascript" type="text/javascript" src="ReportViewer.js"></script>. It is not in the source before it enters the function. So the HTML I need to modify is the same as in my first post, minus that line in the head section. Sorry about that.

Thanks a lot. This is how the result should look after the modifications:


/////////////////////////////////////////////////
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>
</title>

<style type="text/css">//here being a stylesheet</style>

<script language="javascript" type="text/javascript">
<!--
function OnLoadReport()
{
var pageHits = null;
var rep = new Report(1, 4, pageHits, false, docMapIds);
if (parent != self) parent.OnLoadReport(rep);
}
//-->
</script>

</head>
<body onload="javascript:OnLoadReport();" style="OVERFLOW: hidden; BORDER: 0px; MARGIN: 0px; PADDING: 0px">
<div id="oReportDiv" onresize="javascript:OnResizeDiv()" style="OVERFLOW: auto; WIDTH: 100%; HEIGHT: 100%">
//here being html code.

</div>
</body>
</html>
///////////////////////////////////////////////////////
 
LVL 10

Expert Comment

by:eternal_21
ID: 10909814
What about this part:

<script language="javascript" type="text/javascript">
<!--
//-->
</script>
<script language="javascript" type="text/javascript" src="?rs:Command=Get&amp;rc:GetImage=8.00.743.00Report.js">
</script>

Is that in the source code as well?
 

Author Comment

by:pureo
ID: 10909823
Yes, that part is in the source code.

Thanks.
Pureo
 
LVL 10

Accepted Solution

by:
eternal_21 earned 500 total points
ID: 10909871
The following function:

  // Requires: using System.Text.RegularExpressions;
  public static string ParseHtml(string sourceString)
  {
    string newString;

    // metaPattern matches any <META ...> tag and the line break that follows it.
    const string metaPattern = @"<META[^>]*>(\r)?\n?";
    Regex metaRegex;
    metaRegex = new Regex(metaPattern, RegexOptions.Singleline|RegexOptions.IgnoreCase);
    newString = metaRegex.Replace(sourceString, "");

    // javascriptPattern matches any <SCRIPT> block that does not have a '{' or a '}',
    // so blocks that define functions are left in place.
    const string javascriptPattern = @"<SCRIPT[^>]*>[^{}]*?</SCRIPT>(\r)?\n?";
    Regex javascriptRegex;
    javascriptRegex = new Regex(javascriptPattern, RegexOptions.Singleline|RegexOptions.IgnoreCase);
    newString = javascriptRegex.Replace(newString, "");

    return newString;
  }

Produced the output:

### OUTPUT ###

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>
</title>

<style type="text/css">//here being a stylesheet</style>

<script language="javascript" type="text/javascript">
<!--
function OnLoadReport()
{
var pageHits = null;
var rep = new Report(1, 4, pageHits, false, docMapIds);
if (parent != self) parent.OnLoadReport(rep);
}
//-->
</script>

</head>
<body onload="javascript:OnLoadReport();" style="OVERFLOW: hidden; BORDER: 0px; MARGIN: 0px; PADDING: 0px">
<div id="oReportDiv" onresize="javascript:OnResizeDiv()" style="OVERFLOW: auto; WIDTH: 100%; HEIGHT: 100%">

</div>
</body>
</html>

###

Based on this source code:

### SOURCE CODE ###

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>
</title>
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
<META http-equiv="Content-Style-Type" content="text/css">
<META http-equiv="Content-Script-Type" content="text/javascript">

<style type="text/css">//here being a stylesheet</style>

<script language="javascript" type="text/javascript">
<!--
//-->
</script><script language="javascript" type="text/javascript" src="?rs:Command=Get&amp;rc:GetImage=8.00.743.00Report.js">

</script><script language="javascript" type="text/javascript">
<!--
function OnLoadReport()
{
var pageHits = null;
var rep = new Report(1, 4, pageHits, false, docMapIds);
if (parent != self) parent.OnLoadReport(rep);
}
//-->
</script>

<script language="javascript" type="text/javascript" src="ReportViewer.js"></script></head>
<body onload="javascript:OnLoadReport();" style="OVERFLOW: hidden; BORDER: 0px; MARGIN: 0px; PADDING: 0px">
<div id="oReportDiv" onresize="javascript:OnResizeDiv()" style="OVERFLOW: auto; WIDTH: 100%; HEIGHT: 100%">
<script language="javascript" type="text/javascript">
<!--
var docMapIds = [];
//-->
</script>

</div>
</body>
</html>

###
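
The brace test in the function above keeps any script block that contains a '{' or a '}'. If only the block that defines OnLoadReport should survive (option 1 in the clarification earlier in the thread), a stricter variant might look like the sketch below. It is an assumption, not the accepted code: `HtmlCleaner`, `KeepOnlyFunction`, and the `functionName` parameter are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlCleaner // hypothetical name, not from the thread
{
    // Removes META tags and every SCRIPT block except those that
    // define a function called functionName.
    public static string KeepOnlyFunction(string html, string functionName)
    {
        const RegexOptions options =
            RegexOptions.Singleline | RegexOptions.IgnoreCase;

        // Same idea as the META pattern in the accepted solution.
        string result = Regex.Replace(html, @"<META[^>]*>\r?\n?", "", options);

        // Matches "function <name>(" inside a script body (case-sensitive).
        Regex keep = new Regex(
            @"\bfunction\s+" + Regex.Escape(functionName) + @"\s*\(");

        // The evaluator decides per block: return it unchanged or drop it.
        return Regex.Replace(
            result,
            @"<SCRIPT[^>]*>.*?</SCRIPT>\r?\n?",
            m => keep.IsMatch(m.Value) ? m.Value : "",
            options);
    }

    public static void Main()
    {
        string html =
            "<META name=\"x\">\n" +
            "<script>var docMapIds = [];</script>\n" +
            "<script>function OnLoadReport() { }</script>\n";
        Console.Write(KeepOnlyFunction(html, "OnLoadReport"));
    }
}
```

The trade-off is that this version depends on the function name staying the same, while the brace test depends on the unwanted blocks never containing braces.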
 
LVL 10

Expert Comment

by:eternal_21
ID: 10909872
Is that what you are looking for?
 

Author Comment

by:pureo
ID: 10909897
Nice, thanks a lot!

Pureo