• Status: Solved

HTML Post Parsing with VB

OK, I am really blinded by VB's strange regex and need some help post-processing HTML docs before indexing them. I can do what I want in C in minutes, but I like to keep everything in VB, so here is what I need.

I need to read HTML, resolve the unresolved (relative) URLs, and ignore certain scripts within it, such as openWindow commands: <SCRIPT>....openWindow.....</SCRIPT>.

Hope someone has some code or knows where I can get a parser MOD to do this.

Asked by: James
 
Benjy commented:
listening
 
gencross commented:
Do you want to catch web pages before the browser processes them, scrub them, and then send them on to the browser?
 
James (author) commented:
gencross, sure, whatever works to clean them up...
 
Richie_Simonetti (IT Operations) commented:
If you already have the code in C, maybe we could translate it...
 
PaulHews commented:
Look up the InStr (and InStrRev) function in the docs. It returns the position of a substring within a larger string. Once you have the start position and the length (usually obtained by subtracting two positions), extract the substring with the Mid$ function.

Since you want to clean these substrings out of the code, you can then use the Replace function to replace them with empty strings.
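
The InStr / Mid$ / Replace steps described above could look roughly like the following. This is only a sketch, assuming VB6 (Replace is not available in earlier versions). StripScripts and its parameter names are made up for illustration, and the search for "<script" is a plain substring match, so it does not try to handle every form of markup.

```vb
' Sketch only: removes each <script>...</script> block that contains
' a given keyword. StripScripts is a hypothetical helper name.
Public Function StripScripts(ByVal html As String, ByVal keyword As String) As String
    Dim startPos As Long
    Dim endPos As Long
    Dim block As String

    ' vbTextCompare makes the search case-insensitive (<SCRIPT> or <script>)
    startPos = InStr(1, html, "<script", vbTextCompare)
    Do While startPos > 0
        endPos = InStr(startPos, html, "</script>", vbTextCompare)
        If endPos = 0 Then Exit Do   ' unterminated block: leave the rest alone

        ' Length = end of the closing tag minus start of the opening tag
        block = Mid$(html, startPos, endPos + Len("</script>") - startPos)

        If InStr(1, block, keyword, vbTextCompare) > 0 Then
            ' Unwanted block: replace it with an empty string.
            ' The text shifts left, so the next search resumes at startPos.
            html = Replace(html, block, "")
        Else
            ' Harmless block: keep it and continue searching after it
            startPos = startPos + Len(block)
        End If

        startPos = InStr(startPos, html, "<script", vbTextCompare)
    Loop

    StripScripts = html
End Function
```

For example, StripScripts("<p>a</p><SCRIPT>openWindow('x')</SCRIPT><p>b</p>", "openWindow") returns "<p>a</p><p>b</p>", while script blocks that do not mention openWindow are left in place.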

 
James (author) commented:
Well, translating the regex to VB is a good suggestion, I reckon. In C/Perl you can quickly resolve the URLs in HTML by simply doing something like this:

# Read the file contents into a string
open (IN, "microsoft.htm") or die "Cannot open microsoft.htm: $!";
while (<IN>){
$html .= $_; # .= appends each line of the file
}
close IN;

# Use this base URL
$BASE_URL = "http://microsoft.com";

# Do a case-insensitive global search/replace.
# This resolves href="/, src="/, or any other attribute value that starts with "/
$html =~ s!"/!"$BASE_URL/!gi;
print $html;
--------cut---------

Ya can repeat the substitution for other known forms of relative URLs in HTML, such as values with no double quotes or no leading slash. It is all very easy and requires little code. Ya can also do the same to remove certain scripts, because you can combine matching with substitution. The way I always do that is to first strip all line breaks to make one big continuous line, then look between <script> and </script>, and if I get a match on something undesirable I replace it with nothing to remove it.

How would ya do the same as above in VB?
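
One possible VB6 translation of the Perl above, as a sketch rather than a tested solution. ReadFile, ResolveRootUrls and RemoveScripts are hypothetical names. The script removal assumes the VBScript.RegExp object (version 5.5, which ships with the Windows Script runtime) is installed, and that the keyword contains no regex metacharacters.

```vb
' Sketch only; all helper names here are made up for illustration.

' Read a whole file into one string (the Perl while/<IN> loop)
Public Function ReadFile(ByVal path As String) As String
    Dim f As Integer
    f = FreeFile
    Open path For Binary Access Read As #f
    ReadFile = Space$(LOF(f))   ' size the buffer to the file length
    Get #f, , ReadFile          ' fill it in one read
    Close #f
End Function

' Same effect as  s!"/!"$BASE_URL/!g : every "/ becomes "<baseUrl>/
Public Function ResolveRootUrls(ByVal html As String, ByVal baseUrl As String) As String
    ' """/" is the two-character string  "/  and """" is a single quote mark
    ResolveRootUrls = Replace(html, """/", """" & baseUrl & "/")
End Function

' Remove every <script> block whose body mentions the keyword
Public Function RemoveScripts(ByVal html As String, ByVal keyword As String) As String
    Dim re As Object
    Set re = CreateObject("VBScript.RegExp")
    re.Global = True        ' like Perl's /g
    re.IgnoreCase = True    ' like Perl's /i
    ' [\s\S] also matches line breaks, so they need not be stripped first.
    ' (?:(?!</script>)[\s\S])*? keeps the match inside a single script block.
    re.Pattern = "<script\b[^>]*>(?:(?!</script>)[\s\S])*?" & keyword & _
                 "[\s\S]*?</script>"
    RemoveScripts = re.Replace(html, "")
End Function

Public Sub Main()
    Dim html As String
    html = ReadFile("microsoft.htm")
    html = ResolveRootUrls(html, "http://microsoft.com")
    html = RemoveScripts(html, "openWindow")
    Debug.Print html
End Sub
```

Like the Perl one-liner, ResolveRootUrls rewrites every occurrence of "/ without checking context, so a value that already starts with "// would be changed too. Further Replace calls would be needed for unquoted attribute values or URLs with no leading slash.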




 
DanRollins commented:
Hi ohmeohmy,
It appears that you have forgotten this question. I will ask Community Support to close it unless you finalize it within 7 days. I will ask a Community Support Moderator to:

    Refund points and save as a 0-pt PAQ.

ohmeohmy, Please DO NOT accept this comment as an answer.
EXPERTS: Post a comment if you are certain that an expert deserves credit.  Explain why.
==========
DanRollins -- EE database cleanup volunteer
 
Computer101 commented:
Points refunded and placed in PAQ

Computer101
E-E Admin