Solved

Need to clean up this html. Regex needed... urgent

Posted on 2004-04-23
Last Modified: 2012-06-27
Hi All,

I urgently need to solve this. I know it's not difficult, but I have never worked with regex.

In the simplified HTML below, I need to remove all <script>...</script> blocks and all <meta> tags. However, I need to keep this JavaScript function:
<script language="javascript" type="text/javascript">
<!--
function OnLoadReport()
{
var pageHits = null;
var rep = new Report(1, 4, pageHits, false, docMapIds);
if (parent != self) parent.OnLoadReport(rep);
}
//-->
</script>

Note that the arguments passed to Report() can differ. Here is what I have so far:

      public string removeJavaCode(string oldStr) {

            string pattern = @"<script[^>]*>.*?</script[^>]*>";
            string newStr  = Regex.Replace(oldStr, pattern, "");
            return newStr;
      }

If I first strip the line breaks with oldStr.Replace("\n", ""), this removes all the JavaScript. I need to keep that one block, though.
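
The newline stripping is only needed because '.' does not match '\n' by default. A minimal sketch of the same remove-everything pattern with RegexOptions.Singleline, which makes that step unnecessary (the class and method names here are illustrative, not from the original code):

```csharp
using System;
using System.Text.RegularExpressions;

public static class ScriptStripper
{
    // Removes every <script>...</script> block, including multi-line ones.
    public static string RemoveAllScripts(string html)
    {
        const string pattern = @"<script[^>]*>.*?</script[^>]*>";
        // Singleline: '.' also matches '\n'. IgnoreCase: <SCRIPT> matches too.
        return Regex.Replace(html, pattern, "",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        string html = "<p>a</p><script>\nvar x = 1;\n</script><p>b</p>";
        Console.WriteLine(RemoveAllScripts(html)); // prints <p>a</p><p>b</p>
    }
}
```

This still removes every script block; it does not yet keep the OnLoadReport one.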

Thanks for any help.

Puero

/////////////////////////////////////////////////
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>
</title>
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
<META http-equiv="Content-Style-Type" content="text/css">
<META http-equiv="Content-Script-Type" content="text/javascript">


<style type="text/css">//here being a stylesheet</style>

<script language="javascript" type="text/javascript">
<!--
//-->
</script><script language="javascript" type="text/javascript" src="?rs:Command=Get&amp;rc:GetImage=8.00.743.00Report.js">

</script><script language="javascript" type="text/javascript">
<!--
function OnLoadReport()
{
var pageHits = null;
var rep = new Report(1, 4, pageHits, false, docMapIds);
if (parent != self) parent.OnLoadReport(rep);
}
//-->
</script>

<script language="javascript" type="text/javascript" src="ReportViewer.js"></script></head>
<body onload="javascript:OnLoadReport();" style="OVERFLOW: hidden; BORDER: 0px; MARGIN: 0px; PADDING: 0px">
<div id="oReportDiv" onresize="javascript:OnResizeDiv()" style="OVERFLOW: auto; WIDTH: 100%; HEIGHT: 100%">
//here being html code.
<script language="javascript" type="text/javascript">
<!--
var docMapIds = [];
//-->
</script>

</div>
</body>
</html>
///////////////////////////////////////////////////////
Question by:pureo
 

Author Comment

by:pureo
ID: 10908125
Hi,

I have those links too. I need a regex for my specific problem, not links on how to clean up HTML.

Thanks,
Puero
 
LVL 10

Expert Comment

by:eternal_21
ID: 10909776
Just to clarify,

You want to remove all SCRIPT and META tags, except:

  1. SCRIPT tags that contain the specified function (OnLoadReport), or
  2. SCRIPT tags that contain any JavaScript function, or
  3. SCRIPT tags that contain a function in any language?

Author Comment

by:pureo
ID: 10909794
Hello,

Please ignore this line from my first post: <script language="javascript" type="text/javascript" src="ReportViewer.js"></script>. It is not in the source when it enters the function. Otherwise, the HTML I need to modify is exactly what I posted, minus that line in the head section. Sorry about that.

Thanks a lot. After the modifications, the result should look like this:


/////////////////////////////////////////////////
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>
</title>

<style type="text/css">//here being a stylesheet</style>

<script language="javascript" type="text/javascript">
<!--
function OnLoadReport()
{
var pageHits = null;
var rep = new Report(1, 4, pageHits, false, docMapIds);
if (parent != self) parent.OnLoadReport(rep);
}
//-->
</script>

</head>
<body onload="javascript:OnLoadReport();" style="OVERFLOW: hidden; BORDER: 0px; MARGIN: 0px; PADDING: 0px">
<div id="oReportDiv" onresize="javascript:OnResizeDiv()" style="OVERFLOW: auto; WIDTH: 100%; HEIGHT: 100%">
//here being html code.

</div>
</body>
</html>
///////////////////////////////////////////////////////
 
LVL 10

Expert Comment

by:eternal_21
ID: 10909814
What about this part:

<script language="javascript" type="text/javascript">
<!--
//-->
</script>
<script language="javascript" type="text/javascript" src="?rs:Command=Get&amp;rc:GetImage=8.00.743.00Report.js">
</script>

Is that in the source code as well?
 

Author Comment

by:pureo
ID: 10909823
Yes, that part is in the source code.

Thanks.
Pureo
 
LVL 10

Accepted Solution

by:
eternal_21 earned 2000 total points
ID: 10909871
The following function (it needs "using System.Text.RegularExpressions;" at the top of the file):

  public static string ParseHtml(string sourceString)
  {
    string newString;

    // metaPattern matches any <META ...> tag, plus the line break that follows it.
    const string metaPattern = @"<META[^>]*>(\r)?\n?";
    Regex metaRegex = new Regex(metaPattern, RegexOptions.Singleline|RegexOptions.IgnoreCase);
    newString = metaRegex.Replace(sourceString, "");

    // javascriptPattern matches any <SCRIPT> block whose body contains no '{' or '}',
    // so blocks that define a function (such as OnLoadReport) are left in place.
    const string javascriptPattern = @"<SCRIPT[^>]*>[^{}]*?</SCRIPT>(\r)?\n?";
    Regex javascriptRegex = new Regex(javascriptPattern, RegexOptions.Singleline|RegexOptions.IgnoreCase);
    newString = javascriptRegex.Replace(newString, "");

    return newString;
  }

Produced the output:

### OUTPUT ###

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>
</title>

<style type="text/css">//here being a stylesheet</style>

<script language="javascript" type="text/javascript">
<!--
function OnLoadReport()
{
var pageHits = null;
var rep = new Report(1, 4, pageHits, false, docMapIds);
if (parent != self) parent.OnLoadReport(rep);
}
//-->
</script>

</head>
<body onload="javascript:OnLoadReport();" style="OVERFLOW: hidden; BORDER: 0px; MARGIN: 0px; PADDING: 0px">
<div id="oReportDiv" onresize="javascript:OnResizeDiv()" style="OVERFLOW: auto; WIDTH: 100%; HEIGHT: 100%">

</div>
</body>
</html>

###

Based on this source code:

### SOURCE CODE ###

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>
</title>
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
<META http-equiv="Content-Style-Type" content="text/css">
<META http-equiv="Content-Script-Type" content="text/javascript">

<style type="text/css">//here being a stylesheet</style>

<script language="javascript" type="text/javascript">
<!--
//-->
</script><script language="javascript" type="text/javascript" src="?rs:Command=Get&amp;rc:GetImage=8.00.743.00Report.js">

</script><script language="javascript" type="text/javascript">
<!--
function OnLoadReport()
{
var pageHits = null;
var rep = new Report(1, 4, pageHits, false, docMapIds);
if (parent != self) parent.OnLoadReport(rep);
}
//-->
</script>

<script language="javascript" type="text/javascript" src="ReportViewer.js"></script></head>
<body onload="javascript:OnLoadReport();" style="OVERFLOW: hidden; BORDER: 0px; MARGIN: 0px; PADDING: 0px">
<div id="oReportDiv" onresize="javascript:OnResizeDiv()" style="OVERFLOW: auto; WIDTH: 100%; HEIGHT: 100%">
<script language="javascript" type="text/javascript">
<!--
var docMapIds = [];
//-->
</script>

</div>
</body>
</html>

###
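
Note that the brace test above keeps any SCRIPT block containing '{' or '}', not only the one with OnLoadReport. If only that block should survive, one possible variant is to match every SCRIPT block and decide per match with a MatchEvaluator. This is a rough sketch, not the accepted code: the names ReportHtmlCleaner, Clean and keepFunction are made up for illustration. Because it keys on the function name, it does not depend on the arguments passed to Report().

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReportHtmlCleaner
{
    // Hypothetical helper: strips <meta> tags and every <script> block
    // except those whose body defines the function named keepFunction.
    public static string Clean(string html, string keepFunction)
    {
        const RegexOptions options = RegexOptions.Singleline | RegexOptions.IgnoreCase;

        // Remove each <meta ...> tag together with the line break after it.
        string withoutMeta = Regex.Replace(html, @"<meta[^>]*>\r?\n?", "", options);

        // Case-sensitive on purpose: JavaScript identifiers are case-sensitive.
        Regex keep = new Regex(@"\bfunction\s+" + Regex.Escape(keepFunction) + @"\s*\(");

        // Visit each script block; keep it only if it defines keepFunction.
        return Regex.Replace(
            withoutMeta,
            @"<script[^>]*>.*?</script\s*>\r?\n?",
            m => keep.IsMatch(m.Value) ? m.Value : "",
            options);
    }

    public static void Main()
    {
        string html =
            "<META http-equiv=\"Content-Type\" content=\"text/html\">\n" +
            "<script src=\"ReportViewer.js\"></script>\n" +
            "<script>\nfunction OnLoadReport()\n{\n}\n</script>\n" +
            "<script>\nvar docMapIds = [];\n</script>\n";

        // Only the block that defines OnLoadReport remains.
        Console.WriteLine(Clean(html, "OnLoadReport"));
    }
}
```

The lazy ".*?" stops at the first closing tag, so each block is judged on its own. Like any regex over HTML, this assumes script bodies never contain the text "</script>" inside a string or comment.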
 
LVL 10

Expert Comment

by:eternal_21
ID: 10909872
Is that what you are looking for?
 

Author Comment

by:pureo
ID: 10909897
Nice, thanks a lot!

Pureo