Solved

HTML Post Parsing with VB

Posted on 2002-06-21
Last Modified: 2010-05-02
OK, I am really blinded by VB's strange regex and need some HELP with post-processing HTML docs before indexing them. I can do what I want in C in minutes, but I'd like to keep everything in VB... so here's what I need:

I need to read HTML, resolve the unresolved (relative) URLs, and ignore certain scripts within it, such as openWindow commands: <SCRIPT>....openWindow.....</SCRIPT>.

Hope someone has some code or knows where I can get a parser MOD to do this.

Question by:ohmeohmy
10 Comments
 
LVL 1

Expert Comment

by:Benjy
ID: 7098274
listening
0
 
LVL 4

Expert Comment

by:gencross
ID: 7098506
You want to catch web pages before they are processed by the browser, scrub them, and then send them on to the browser?
0
 

Author Comment

by:ohmeohmy
ID: 7098615
gencross, Sure, whatever works to clean them up...
0
 
LVL 16

Expert Comment

by:Richie_Simonetti
ID: 7099111
If you already have the code in C maybe we could translate it...
0
 
LVL 38

Expert Comment

by:PaulHews
ID: 7099247
Look up the InStr (and InStrRev) function in the docs. It returns the position of a substring within a larger string. Once you have the position and length (usually from subtracting two positions), extract the substring using the Mid$ function.

Since you want to clean these substrings out of the HTML, you can then use the Replace function to replace them with empty strings.
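
The find/extract/replace steps above can be sketched as follows. This is only a minimal illustration written in Python, not VB; the function name `remove_blocks_containing` and the sample tags are hypothetical. Each `find` call plays the role of InStr, each slice the role of Mid$, and dropping the slice corresponds to Replace with an empty string.

```python
def remove_blocks_containing(html, open_tag, close_tag, marker):
    """Remove every open_tag...close_tag block whose text contains marker.

    Matching ignores case. Assumes lowercasing does not change the string
    length (true for plain ASCII HTML), so positions found in the lowercase
    copy can be used on the original string.
    """
    lower = html.lower()
    open_tag, close_tag, marker = open_tag.lower(), close_tag.lower(), marker.lower()
    pieces = []
    pos = 0
    while True:
        start = lower.find(open_tag, pos)      # position of the next opening tag
        if start == -1:
            break
        end = lower.find(close_tag, start)     # position of its closing tag
        if end == -1:
            break                              # unterminated block: leave the rest alone
        end += len(close_tag)
        block = lower[start:end]               # the extracted substring
        if marker in block:
            pieces.append(html[pos:start])     # keep the text before it, drop the block
        else:
            pieces.append(html[pos:end])       # keep the block untouched
        pos = end
    pieces.append(html[pos:])                  # whatever follows the last block
    return "".join(pieces)
```

Called as `remove_blocks_containing(page, "<script", "</script>", "openWindow")`, it would leave other script blocks in place and remove only those that mention openWindow.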

0
 
LVL 9

Expert Comment

by:GivenRandy
ID: 7099364
0
 

Author Comment

by:ohmeohmy
ID: 7100090
Well, translating the regex to VB is a good suggestion, I reckon. In C/Perl you can quickly resolve HTML by simply doing something like this:

# Let's put a file's contents into a string
open(IN, "microsoft.htm") or die "Cannot open microsoft.htm: $!";
while (<IN>) {
    $html .= $_;   # .= appends each line of the file
}
close IN;

# Use this base URL
$BASE_URL = "http://microsoft.com";

# Do a case-insensitive global search/replace
# This resolves href="/ or src="/ or whatever else
$html =~ s!"/!"$BASE_URL/!gi;
print $html;
--------cut---------

Ya can repeat the substitution for other known forms of relative URLs in HTML, like ones lacking double quotes or a leading slash and whatnot. All very easy, and it requires little code. Ya can also do the same for removing certain scripts, because you can combine matching with substitution. How I always do that: first strip all line breaks to make one big continuous line, then look between <script> and </script>, and if I get a match on something undesirable I replace it with nothing to remove it.

How would ya do the same above in VB??
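
For comparison, the two regex steps described above (prefixing root-relative URLs with a base URL, and dropping script blocks that match something undesirable) might look roughly like this. It is a hedged sketch in Python rather than VB, and the function names are hypothetical. Two deliberate differences from the Perl one-liner, both assumptions of this sketch: URLs starting with "//" are left alone, and DOTALL matching stands in for stripping the line breaks first.

```python
import re

BASE_URL = "http://microsoft.com"   # same base URL as the Perl snippet

def resolve_root_relative(html, base_url=BASE_URL):
    """Turn "/path into "http://host/path, skipping "//host/path."""
    # A function replacement avoids any escape handling in base_url.
    return re.sub(r'"/(?!/)', lambda m: '"' + base_url + '/', html)

# Non-greedy match of one whole script block, across line breaks.
SCRIPT_BLOCK = re.compile(r'<script\b.*?</script\s*>', re.IGNORECASE | re.DOTALL)

def strip_scripts_matching(html, unwanted="openWindow"):
    """Remove only the script blocks that contain the unwanted text."""
    def drop_if_unwanted(match):
        block = match.group(0)
        return "" if unwanted.lower() in block.lower() else block
    return SCRIPT_BLOCK.sub(drop_if_unwanted, html)
```

A typical use would be `resolve_root_relative(strip_scripts_matching(html))`, so that unwanted scripts are gone before the URLs are rewritten.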




0
 

LVL 49

Expert Comment

by:DanRollins
ID: 7923733
Hi ohmeohmy,
It appears that you have forgotten this question. I will ask Community Support to close it unless you finalize it within 7 days. I will ask a Community Support Moderator to:

    Refund points and save as a 0-pt PAQ.

ohmeohmy, Please DO NOT accept this comment as an answer.
EXPERTS: Post a comment if you are certain that an expert deserves credit.  Explain why.
==========
DanRollins -- EE database cleanup volunteer
0
 
LVL 1

Accepted Solution

by:
Computer101 earned 0 total points
ID: 7929877
Points refunded and placed in PAQ

Computer101
E-E Admin
0
