Parse HTML source and quotation marks

ok... I have an app that I put together with Visual Studio .NET.
It parses a web page and outputs data to a file.

One section is giving me a hard time.
I want to output the street address, which is not displayed on the website. It is used as a variable to open a mapping program and is contained in the original source code in a form like so:
<input type="hidden" name="address1" value="123 Main St.">

I want to pull the 123 Main St. out without pulling the quotation marks.

        strHTML = ie.document.body.innerhtml
        position1 = InStr(strHTML, "address1")
        position2 = InStr(position1, strHTML, ";") + 1
        ticket = Mid(strHTML, position2, InStr(position2, strHTML, ">") - position2)

also... the phrase address1 shows up twice earlier in the source... I do not want to use those occurrences. How can I get it to start at the third instance of address1?
Asked by: chad

PePi Commented:
<<also... the phrase address1 shows up twice before in the source... >>
You can create a counter that tracks how many times the phrase address1 shows up. On the 3rd occurrence, do your parsing.

strHTML = Replace(ie.document.body.innerhtml, """", "")
ticket = Mid(strHTML ,InStrRev(strHTML , "value=") + 6, Len(strHTML ) - (InStrRev(strHTML , "value=") + 8))


HTH!!



chad (Author) Commented:
Thanks PePi for the response.
I like 'Replace'; I have not seen that before.

What does each section of this line do? I am new to VB.
ticket = Mid(strHTML ,InStrRev(strHTML , "value=") + 6, Len(strHTML ) - (InStrRev(strHTML , "value=") + 8))

thanks

PePi Commented:
The ticket = .... line parses the value of address1. It just uses fewer variables: you would not need position1 and position2. It also uses the string "value=" as the reference point for parsing. If I remember my HTML correctly, "value=" can appear before the other attributes, right? Searching for it just ensures that you are parsing the right data.
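
To show what each part of that one-liner does, here is a minimal VB.NET sketch that splits it into named steps. The function name ParseLastValue and the sample input are assumptions for illustration; the Replace, InStrRev, Len and Mid calls are the same ones used in the line above.

```vbnet
Imports System
Imports Microsoft.VisualBasic

Module MidBreakdownSketch
    ' Hypothetical helper: the one-liner, broken into its individual steps.
    Function ParseLastValue(rawHtml As String) As String
        ' Step 1: Replace removes every double-quote character.
        Dim strHTML As String = Replace(rawHtml, """", "")

        ' Step 2: InStrRev searches from the end and returns the 1-based
        ' position of the last "value=" (0 if it is not found).
        Dim valuePos As Integer = InStrRev(strHTML, "value=")

        ' Step 3: skip the 6 characters of "value=" to reach the data.
        Dim startPos As Integer = valuePos + 6

        ' Step 4: the length is measured against the length of the WHOLE
        ' string, so the result runs from startPos to three characters
        ' before the end of strHTML, whatever lies in between.
        Dim length As Integer = Len(strHTML) - (valuePos + 8)

        ' Step 5: Mid returns "length" characters starting at startPos.
        Return Mid(strHTML, startPos, length)
    End Function

    Sub Main()
        ' Assumed sample: a single line of HTML, as in the question.
        Console.WriteLine(ParseLastValue("<input type=""hidden"" name=""address1"" value=""123 Main St."">"))
    End Sub
End Module
```

Because step 4 depends on the total string length, the result is only sensible when strHTML holds a single tag, not a whole page.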

Do you think creating a counter and keeping track of the number of occurrences of address1 will work?
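
The counter idea could be sketched in VB.NET roughly as follows. The helper name IndexOfNth and the sample string are assumptions for illustration, not part of the original code; the search uses String.IndexOf, so positions are zero-based rather than one-based as with InStr.

```vbnet
Imports System

Module NthOccurrenceSketch
    ' Hypothetical helper: returns the zero-based index of the nth occurrence
    ' of token in source, or -1 when there are fewer than n occurrences.
    Function IndexOfNth(source As String, token As String, n As Integer) As Integer
        Dim index As Integer = -1
        For count As Integer = 1 To n
            ' Resume the search one character past the previous hit.
            index = source.IndexOf(token, index + 1, StringComparison.OrdinalIgnoreCase)
            If index < 0 Then Return -1
        Next
        Return index
    End Function

    Sub Main()
        ' Assumed sample: address1 appears twice before the hidden input.
        Dim html As String = "address1 ... address1 ... <input type=""hidden"" name=""address1"" value=""123 Main St."">"
        Dim third As Integer = IndexOfNth(html, "address1", 3)
        Console.WriteLine(third)
    End Sub
End Module
```

Once the third position is known, the parsing can start from there instead of from the beginning of the page.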

connah0047 Commented:
Or you could use regular expressions which is the "right" way to do it. :)

Project -> References
Add reference to "Microsoft VBScript Regular Expressions 5.5"

Then this code would print the data you want to the debug window. The code below assumes the HTML you are parsing is contained in a variable called "strHTML". It ALSO assumes that you have multiple address fields that you want to parse out. Having just one will work, too.

'---BEGIN CODE---'
    Dim RegEx As New RegExp
    Dim matMatch As Match
    Dim matMatches As MatchCollection
   
    RegEx.Global = True
    RegEx.IgnoreCase = True
    RegEx.Pattern = "<input type=""hidden"" name=""address[0-9]+"" value=""(.*?)"">"
   
    Set matMatches = RegEx.Execute(strHTML)
   
    For Each matMatch In matMatches
        Debug.Print matMatch.SubMatches(0)
    Next
'---END CODE---'

That may look like a lot of typing just to get the address, but first, it's much more efficient. Second, if you're parsing HTML, you probably want to be doing everything else like this too. The speed and ease of regular expression parsing compared to string manipulation in VB is tremendous. The above code will print out the value of EACH hidden input in the HTML document whose name is the word "address" followed by one or more digits. So if you had:

<input type="hidden" name="address1" value="Joe Blow">
<input type="hidden" name="address2" value="123 Main St.">
<input type="hidden" name="address3" value="Chucktown, SC 29401">

The debug output would be:

Joe Blow
123 Main St.
Chucktown, SC 29401

Hope this helps! Let me know if you would like anything here explained in more detail.
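
Since the question mentions Visual Studio .NET, the same pattern could also be run through the .NET regex classes instead of the VBScript library. This is only a sketch: the function name ExtractAddresses is an assumption, and it relies on the input tags being written exactly in the attribute order shown above.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module RegexAddressSketch
    ' Hypothetical helper: returns the value of every hidden input whose
    ' name is "address" followed by one or more digits.
    Function ExtractAddresses(html As String) As List(Of String)
        Dim pattern As String = "<input type=""hidden"" name=""address[0-9]+"" value=""(.*?)"">"
        Dim results As New List(Of String)
        For Each m As Match In Regex.Matches(html, pattern, RegexOptions.IgnoreCase)
            ' Groups(1) is the first capture group, like SubMatches(0) in VB6.
            results.Add(m.Groups(1).Value)
        Next
        Return results
    End Function

    Sub Main()
        Dim html As String = "<input type=""hidden"" name=""address1"" value=""Joe Blow"">" & Environment.NewLine & _
                             "<input type=""hidden"" name=""address2"" value=""123 Main St."">"
        For Each address As String In ExtractAddresses(html)
            Console.WriteLine(address)
        Next
    End Sub
End Module
```

Regex.Matches always returns every match, so there is no counterpart to the Global property here.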

PePi Commented:
What happens when the pattern isn't standardized? For example, one line may have:

<input type="hidden" name="address1" value="123 Main St."> and another may have
<input type="hidden" value="123 Main St." name="address1"> or
<input type= value="123 Main St." "hidden" name="address1"> ???


I guess you have to change the pattern???
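
One way to avoid depending on attribute order is to match in two passes: first find each input tag, then read the name and value attributes separately. The VB.NET sketch below is a rough illustration under stated assumptions: the function name ExtractInputValue is hypothetical, the value is assumed to be double-quoted, and neither attribute value is assumed to contain a ">" character or the text "name=". It is not a general HTML parser.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module AttributeOrderSketch
    ' Hypothetical helper: returns the value attribute of the first <input>
    ' tag whose name attribute equals fieldName, or Nothing if none is found.
    Function ExtractInputValue(html As String, fieldName As String) As String
        ' Pass 1: every <input ...> tag, regardless of attribute order.
        For Each tag As Match In Regex.Matches(html, "<input\b[^>]*>", RegexOptions.IgnoreCase)
            ' Pass 2: read the name attribute (quoted or unquoted).
            Dim nameMatch As Match = Regex.Match(tag.Value, "\bname\s*=\s*""?([^""\s>]+)", RegexOptions.IgnoreCase)
            If nameMatch.Success AndAlso _
               String.Equals(nameMatch.Groups(1).Value, fieldName, StringComparison.OrdinalIgnoreCase) Then
                ' Read the double-quoted value attribute of the same tag.
                Dim valueMatch As Match = Regex.Match(tag.Value, "\bvalue\s*=\s*""([^""]*)""", RegexOptions.IgnoreCase)
                If valueMatch.Success Then Return valueMatch.Groups(1).Value
            End If
        Next
        Return Nothing
    End Function

    Sub Main()
        Console.WriteLine(ExtractInputValue("<input type=""hidden"" value=""123 Main St."" name=""address1"">", "address1"))
    End Sub
End Module
```

With this approach the pattern itself does not need to change when the attributes are reordered, as long as the assumptions above hold.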


connah0047 Commented:
Kabaam, you could use Pepi's method if you wanted to keep it short and sweet and simply get the job done. But (no offense, Pepi) the code could be a little more accurate. Again, assuming your HTML is in a variable called strHTML:

Debug.Print Mid(strHTML, InStr(1, strHTML, "name=""address1"" value=""") + 23, Len(strHTML) - InStr(1, strHTML, "name=""address1"" value=""") - 24)

That code assures you get the correct data. Pepi's example returns this:

"123 Main St

instead of this:

123 Main St.

connah0047 Commented:
> what happens when the pattern doesn't have a standard? example, one line may have :

You may raise that question any time you are parsing something. If something isn't standard, it cannot be parsed. If you can parse it, it is somehow standardized.

connah0047 Commented:
Btw, Pepi, my apologies for my comment a moment ago concerning your code. Kabaam, that code of his works fine. I jumped the gun. Sorry, Pepi!

PePi Commented:
<<Kabaam, you could use Pepi's method if you wanted to keep it short and sweet and simply get the job done. But (no offense, Pepi) the code could be a little more accurate. Again, assuming your HTML is in a variable called strHTML:>>
None taken ;)

<<That code assures you get the correct data. Pepi's example returns this:>>

"123 Main St

instead of this:

123 Main St.

This will never happen, because the Replace call strips the quotation marks before the parsing is done.

PePi Commented:
<<Btw, Pepi, my apologies for my comment a moment ago concerning your code. Kabaam, that code of his works fine. I jumped the gun. Sorry, Pepi!>>

No problem, we are all here to share, and I don't mind people criticizing my solutions. I learn from it as well.

chad (Author) Commented:
ok.. thanks folks..
I have started with PePi's first code suggestion and am having problems with it.
If I include it as is... I get the correct starting point of the value, but then it includes the remaining source code.
It also includes some of the text BEFORE the address1.

for example..

<input type="hidden" name="address1" value="123 Main St.">
<more html>

will output

123 Main St. name=address1><more html>

any ideas
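
A likely cause of the trailing text is that the length passed to Mid is computed from Len(strHTML), which is the length of the whole page rather than of the one tag. A hedged VB.NET sketch of the alternative is to stop at the closing quotation mark instead. The helper name ValueAfter and the marker string are assumptions for illustration, and the sketch assumes the quotation marks have NOT been stripped with Replace beforehand.

```vbnet
Imports System

Module BoundedExtractSketch
    ' Hypothetical helper: returns the text between marker and the next
    ' double quote, or Nothing when either one is missing.
    Function ValueAfter(html As String, marker As String) As String
        Dim markerPos As Integer = html.IndexOf(marker, StringComparison.OrdinalIgnoreCase)
        If markerPos < 0 Then Return Nothing

        ' The data starts right after the marker (zero-based index).
        Dim startPos As Integer = markerPos + marker.Length

        ' The data ends at the next double quote, not at the end of the page.
        Dim endPos As Integer = html.IndexOf(""""c, startPos)
        If endPos < 0 Then Return Nothing

        Return html.Substring(startPos, endPos - startPos)
    End Function

    Sub Main()
        Dim html As String = "<input type=""hidden"" name=""address1"" value=""123 Main St."">" & _
                             Environment.NewLine & "<more html>"
        Console.WriteLine(ValueAfter(html, "name=""address1"" value="""))
    End Sub
End Module
```

If the text returned by innerhtml orders or quotes the attributes differently from the original page source, the marker string would need to be adjusted to match what is actually in strHTML.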

connah0047 Commented:
kabaam: Do you have more than one address field per page that you are trying to parse out? Also, is the address line always formatted the same way every time?

chad (Author) Commented:
for now, I am testing it with only one address field per page and yes it will be formatted the same always

chad (Author) Commented:
and connah0047
when I try to use the debug.print line you gave me I get this
'Print' is not a member of 'System.Diagnostics.Debug'.
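
For what it's worth, the error suggests that this version of the .NET Framework has no Debug.Print method. A minimal sketch of the usual substitute, assuming all that is needed is a line of text in the debugger's output window, is Debug.WriteLine from System.Diagnostics. The variable strAddress is just a placeholder value here.

```vbnet
Imports System
Imports System.Diagnostics

Module DebugOutputSketch
    Sub Main()
        ' Placeholder value standing in for the parsed address.
        Dim strAddress As String = "123 Main St."

        ' Debug.WriteLine writes to the debug listeners (the Output window
        ' when running under the debugger) in debug builds.
        Debug.WriteLine(strAddress)

        ' Console.WriteLine is an alternative for console applications.
        Console.WriteLine(strAddress)
    End Sub
End Module
```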

connah0047 Commented:
Oh crap. You're using .NET. Never mind then. :) I don't know .NET and shiver at the thought of learning it. I thought you were using VB5 or VB6.

PePi Commented:
I also assumed that strHTML contains 1 line of HTML code at a time.

PePi Commented:
eeek.... VB.NET!!!!

connah0047 Commented:
Ok, we're going to get this one way or another. Try this kabaam:

'---BEGIN CODE---'
    Dim lngStartPos As Long
    Dim lngStopPos As Long
    Dim strAddress As String
   
    lngStartPos = InStr(1, strHTML, " name=""address1"" value=""") + 24
    lngStopPos = InStr(lngStartPos, strHTML, """>")
   
    strAddress = Mid(strHTML, lngStartPos, lngStopPos - lngStartPos)
'---END CODE---'

strAddress now contains the data you want. Again assuming the HTML is in strHTML.

connah0047 Commented:
And it should work in .NET. You know, when .NET came out, I bought it. I had no idea how different it was from VB6... I didn't know I was going to have to learn an entirely new language. I have never had time to learn it, so I didn't. I will take VB6 to my grave! :)

chad (Author) Commented:
using that I get this

Overload resolution failed because no accessible 'InStr' can be called without a narrowing conversion:
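
This error is most likely caused by the Long variables. In VB.NET, Long is a 64-bit type, while InStr and Mid take and return Integer, so passing a Long as the start position would need a narrowing conversion. A sketch of the same logic with Integer variables follows; the function name ExtractAddress1 and the not-found checks are additions for illustration.

```vbnet
Imports System
Imports Microsoft.VisualBasic

Module InStrIntegerSketch
    ' Hypothetical helper: the InStr/Mid logic with Integer positions.
    Function ExtractAddress1(strHTML As String) As String
        Dim marker As String = " name=""address1"" value="""

        ' Integer, not Long: InStr returns a 1-based Integer (0 = not found).
        Dim startPos As Integer = InStr(1, strHTML, marker)
        If startPos = 0 Then Return Nothing

        ' Move past the marker to the first character of the address.
        startPos += Len(marker)

        ' The address ends where the closing quote and bracket begin.
        Dim stopPos As Integer = InStr(startPos, strHTML, """>")
        If stopPos = 0 Then Return Nothing

        Return Mid(strHTML, startPos, stopPos - startPos)
    End Function

    Sub Main()
        Console.WriteLine(ExtractAddress1("<input type=""hidden"" name=""address1"" value=""123 Main St."">"))
    End Sub
End Module
```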


connah0047 Commented:
Ok, following are instructions for uninstalling VB.NET and installing VB6... :) :)
Sorry man, I don't think I can help you. Pepi, it's all up to you now! :)

connah0047 Commented:
Kabaam, you should try posting your question in the VB.NET forum. The folks over there could help better than here.

PePi Commented:
lol... my VB.NET skills are not up there yet. I've had no time to practice, since VB6 is the language of choice at my place of work. I dunno why but....

chad (Author) Commented:
thanks folks... I did learn a bit here with this question