Want to protect your cyber security and still get fast solutions? Ask a secure question today.Go Premium

x
  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 502
  • Last Modified:

vba regex on webpage

Hi,

is it possible to do a regex on a webpage using vba?
As I need to copy a certain header of a webpage and paste it in excel.

I know in vb.net you can use a streamreader, but how is this done in vba excel?
0
Mutsop
Asked:
Mutsop
  • 8
  • 7
  • 5
2 Solutions
 
Arno KosterCommented:
VBA stands for Visual Basic for Applications, and is mainly used within the microsoft office suite.
In a webpage you can use javascript and vbscript, but VBA in general cannot be used.

On the other hand, you could use VBA code inside of an excel document to read a webpage and copy the header from it.
Could you clarify on how you want to perform the copying and pasting ?

eg.
 - have a button on a webpage which performs the functionality when clicked (eg. by using PHP)
 - have an excel sheet with a button which performs the functionality when clicked (eg. by using VBA)
 - etc
0
 
MutsopAuthor Commented:
I think you undertsood me wrong,

I have an excel file with a form, a number needs to be inputted in a textbox.
After submitting the application needs to copy a header (ex h1) from a website with url
www.example.com/tva.php?number=textbox.value

so excel doesnt need to open a browser, it just needs to be able to copy the header of that specific url.
0
 
Arno KosterCommented:
So you are looking for something like :

user fills in a number into the textbox
user submits the form (eg. by clicking on a button)
VBA macro is started which
  - generates the URL of the website from which the header is to be copied
  - loads the website into memory (this can be visible or hidden for the user)
  - copies the header from the website
  - possibly pasts the header to a certain cell

This indeed can be done using VBA and will not be that hard to achieve. The most difficult step will be finding the exact location of the specific header in the webpage.
0
Industry Leaders: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 
Robert SchuttSoftware EngineerCommented:
@akoster: do you want to dive into this? I have a sheet lying around, am putting a form in for the complete example as I'm typing this. Actually uses a xmlHttp object to retrieve the html and a regexp to find the h1 tag(s). Can post it if you wouldn't feel lobsided (is that the correct word? not a native english speaker ;-)...
0
 
Arno KosterCommented:
Sure, no problem.
When you have posted the excel workbook, i'll insert the necessary code.

On the other hand, you could also try to use this code fragment and tweak it to fit your needs :
Sub import()
Dim ie As InternetExplorer

    Set ie = New InternetExplorer
    ie.navigate "[url]"
    
    Do While ie.Busy
        Application.Wait DateAdd("s", 1, Now)
    Loop
    
    For Each element In ie.document.getelementsbytagname("H1")
        If element.classname = "[specific header class for the H1 object you are looking for]" Then
            If element.Children.Length > 0 Then
                    result = element.FirstChild.innertext
                    exit for
            End If
        End If
    Next element
    
    Worksheets("[name of worksheet in which result is placed]").Range("[address of cell in which result is placed") = result

    ie.Quit
    Set ie = Nothing

End Sub

Open in new window

0
 
Arno KosterCommented:
Keep in mind though that this is a starting point taken from an existing application and needs some tweaking to your situation but also some checks to make it more robust.
0
 
Robert SchuttSoftware EngineerCommented:
funny, quite a different approach. If only for reference, I'll post my take on it:

 readpage.xls

First time, close the form and adjust the url in the code for UserForm1 (or copy the other functions to your own workbook).
0
 
Arno KosterCommented:
Well, i can see pro's and cons in both approaches.
I think it would be best if author tries one or if possible both approaches...



readpage-AK.xls
0
 
MutsopAuthor Commented:
Hi,
thanks a million for helping me out with this :)

Now I tried both of your ideas and actually like both.

@akoster:
The positive part about your code is that I can select a specific tag. Which can be useful in another project I'm working on. Well I think, it would be all of the childs of a specific parent in an online xml file.

Downside is, as the webpage I need to retrieve info from has several tags needed (also certain <p> tags) I can't really use em all.

@robert_schutt:
I love the use of regex! problem thought was I couldn't find a the name immediately of the reference. Apparently it was microsoft specific and found it in your xls file.
With your code I could just copy paste my several regexes from my vb.net project :)

Thanks again to both!

(i'll accept this project this afternoon, in case you want to comment on this)
0
 
Arno KosterCommented:
Mutsop,

good to see that our combined efforts helped you out !

if the webpage uses different styles / classes for the tags and you are only interested in a specific class name, you could use this to filter them, eg. by using this code


    For Each element In ie.document.getelementsbytagname("p")
        If element.classname = "classname" Then
           [...]
        end if
    next element

Open in new window


next, if a span or paragraph contains a number of tags, you can also parse these by using

 
   For Each element In ie.document.getelementsbytagname("p")
        If element.classname = "classname" Then
            If element.Children.Length > 0 Then
                msgbox "First element : " & element.FirstChild.innertext
                msgbox "Third element : " & element.children(2).innertext
            end if
        end if
    next element

Open in new window


0
 
MutsopAuthor Commented:
k so I managed to get the company name using both of your methods.

But now I need to get the adres: for example following url shows what I'm talking about:
http://www.societe.com/societe/lomatrans-location-manutention-transports-306159021.html

But to get the adres ("Siege Social") I'm kind of in a struggle using any of your methods.
The regex in vb.net I use:
<td.*?>\s*Siège social\s*</td>\s*<td>\s*<a.*?>(.*?)<br/?>(.*?)</a>

Open in new window

that works, but not when using it in vba. (check code snippet)

@akoster: as you can see, using your method here would almost be impossible as there are no class name indications.

<tbody><tr align="left" valign="top">
																			<td style="color:#777777">
																				Activité
																			</td>
																			<td colspan="2">
																				Transports routiers de fret de proximité<br>4941B
																			</td>
																		</tr>
																		
																		
																		<tr align="left" valign="top">
																			<td style="color:#777777">
																			       Catégorie
																			</td>
																			<td colspan="2">
																				Transports
																			</td>
																		</tr>
																		
					

                   			           
																		<tr align="left" valign="top">
																			<td style="color:#777777">
																				Siège social
																			</td>
																			<td>
																				<a href="http://maps.google.fr/maps?f=q&amp;hl=fr&amp;geocode=&amp;q=%20Quartier%20de%20la%20Grande%20Bastide%20CHATEAUNEUF%20LES%20MARTIGUES" target="_blank" class="lienFleche"> Quartier de la Grande Bastide <br>13220 CHATEAUNEUF LES MARTIGUES</a>
																			</td>
				 <td valign="middle" align="left"><a href="http://maps.google.fr/maps?f=q&amp;hl=fr&amp;geocode=&amp;q=%20Quartier%20de%20la%20Grande%20Bastide%20CHATEAUNEUF%20LES%20MARTIGUES" target="_blank"><img border="0" src="http://img1.societe.com/imgz/locator.png" alt="google map" title="plan"></a></td>
																		</tr>
																		
																		
																		<tr align="left" valign="top">
																			<td style="color:#777777">
																				Forme juridique
																			</td>
																			<td colspan="2">
																				société par actions simplifiée
																			</td>
																		</tr>
																		
																	</tbody>

Open in new window

0
 
Robert SchuttSoftware EngineerCommented:
It works for me in VBA.

I tried it out by adding another piece of code in CommandButton1_Click, just before the UserForm1.Hide()

 re.Pattern = "<td.*?>\s*Siège social\s*</td>\s*<td>\s*<a.*?>(.*?)<br/?>(.*?)</a>"
     Set Matches = re.Execute(t)
    For Each Match In Matches
        ActiveCell.Value = Match.Submatches(0)
        ActiveCell.Offset(0, 1).Select
        ActiveCell.Value = Match.Submatches(1)
        ActiveCell.Offset(1, -1).Select
    Next

Open in new window


It could have something to do with setting re.Global en MultiLine.
0
 
Arno KosterCommented:
you can try this :
    Dim ie As InternetExplorer

    Set ie = New InternetExplorer
    ie.navigate "http://www.societe.com/societe/lomatrans-location-manutention-transports-306159021.html"
    
    Do While ie.Busy
        Application.Wait DateAdd("s", 1, Now)
    Loop
    
    Sheet1.Range("A1").Select
    '-- parse company name
    For Each element In ie.Document.getelementsbytagname("H1")
        ActiveCell.Value = element.innertext
        ActiveCell.Offset(1).Select
    Next element
    
    '-- parse address
    For Each element In ie.Document.getelementsbytagname("TD")
        If element.Style.Color = "#666" Then
            '-- table element with color #666 contains the div we are looking for
            ActiveCell.Value = element.FirstChild.innertext
            ActiveCell.Offset(1).Select
        End If
    Next element
    
    ie.Quit
    Set ie = Nothing
    UserForm1.Hide

Open in new window

0
 
MutsopAuthor Commented:
@Robert_Schutt:

weird, this is what I have:
Private Sub CommandButtonBtw_Click()
    Dim rw As Long
    rw = ActiveSheet.Range("D" & ActiveSheet.Rows.Count).End(xlUp).row
    
    ActiveSheet.Cells(rw + 1, "D").Value = UCase(TextBoxBtw.Value)
    
    ActiveSheet.Cells(rw + 1, "E").Select
    Dim t
    t = getHtml("http://www.societe.com/cgi-bin/mainsrch?champ=" & CStr(tva))
    Dim re1
    Set re1 = New RegExp
    re1.Global = True
    re1.MultiLine = False ' True
    re1.Pattern = "<h1[^>]*>(.*?)</h1>"
    Dim Matches1, Match1
    Set Matches1 = re1.Execute(t)
    For Each Match1 In Matches1
        ActiveCell.Value = Match1.Submatches(0)
        ActiveCell.Offset(0, 1).Select
    Next

    Dim re2
    Set re2 = New RegExp
    re2.Global = True
    re2.MultiLine = False ' True
    re2.Pattern = "<td.*?>\s*Siège social\s*</td>\s*<td>\s*<a.*?>(.*?)<br/?>(.*?)</a>"
    Dim Matches2, Match2
    Set Matches2 = re2.Execute(t)
    For Each Match2 In Matches2
        ActiveCell.Value = Match.Submatches(0)
        ActiveCell.Offset(0, 1).Select
        ActiveCell.Value = Match.Submatches(1)
        ActiveCell.Offset(1, -1).Select
    Next
    frmBtwInvoegen.Hide
    'GetInfo
    Me.Hide
End Sub

Open in new window

And the second part doesnt seem to work. I tried all possible combinations with the global and MultiLine setting.

@akoster: I'll try that out :)
0
 
Robert SchuttSoftware EngineerCommented:
Hi, I think I spotted the problem: on lines 30 and 32 you're using Match instead of Match2.
0
 
MutsopAuthor Commented:
@akoster:

seems to work... although I have a minor problem.
1. the street and city needs to be splitted in 2 different cells (next to each other)
2. "Afficher le Numéro de téléphone " is written after the street. that seems to be a link but it inherits the same color of td (#666).

So I thought writing :
If element.Style.Color = "#666" And element.classname <> "tellink" Then
            '-- table element with color #666 contains the div we are looking for
            ActiveCell.Value = element.FirstChild.innertext
            ActiveCell.Offset(0, 1).Select
        End If

Open in new window


see "And element.classname <> "tellink""... problem though is that its a firstChild.innertext.
0
 
MutsopAuthor Commented:
@robert_schutt: Sorry I copied the wrong code (I accidently pressed undo before copying the code).
But I did use Match2 and still didn't work.

I tried global false and true and MultiLine too. None work.

Although the first one works (header) not the addresses.
0
 
Arno KosterCommented:
Assuming that you'd rather not have the 'afficher le numero de telephone' included, try this :
    For Each Element In ie.Document.getelementsbytagname("TD")
        If Element.Style.Color = "#666" Then
            '-- table element with color #666 contains the div we are looking for
            Address = Split(Element.FirstChild.innertext, vbCrLf)(0)
            '-- split street and city
            For pos = Len(Address) To 1 Step -1
                If IsNumeric(Mid(Address, pos, 1)) Then
                    '-- everything up to this position is considered street
                    ActiveCell.Value = Left(Address, pos)
                    ActiveCell.Offset(0, 1).Select
                    '-- everything after this position is considered city
                    ActiveCell.Value = Mid(Address, pos + 1)
                    ActiveCell.Offset(1, -1).Select
                    '-- address found, stop splitting
                    Exit For
                End If
            Next pos
        End If
    Next Element

Open in new window

0
 
Robert SchuttSoftware EngineerCommented:
I think there's a problem with the character set of the retrieved page. If I change the pattern, it does work.

    re2.Pattern = "<td.*?>\s*\S+ social\s*</td>\s*<td>\s*<a.*?>(.*?)<br/?>(.*?)</a>"

Open in new window

0
 
MutsopAuthor Commented:
Both work perfectly :)

Thanks alot for your help guys.
I both gave you points as I can actually use both on different projects.
0

Featured Post

Concerto's Cloud Advisory Services

Want to avoid the missteps to gaining all the benefits of the cloud? Learn more about the different assessment options from our Cloud Advisory team.

  • 8
  • 7
  • 5
Tackle projects and never again get stuck behind a technical roadblock.
Join Now