Solved

vba regex on webpage

Posted on 2011-09-06
20
485 Views
Last Modified: 2012-05-12
Hi,

is it possible to do a regex on a webpage using vba?
As I need to copy a certain header of a webpage and paste it in excel.

I know in vb.net you can use a streamreader, but how is this done in vba excel?
0
Comment
Question by:Mutsop
  • 8
  • 7
  • 5
20 Comments
 
LVL 19

Expert Comment

by:akoster
ID: 36488614
VBA stands for Visual Basic for Applications, and is mainly used within the microsoft office suite.
In a webpage you can use javascript and vbscript, but VBA in general cannot be used.

On the other hand, you could use VBA code inside of an excel document to read a webpage and copy the header from it.
Could you clarify on how you want to perform the copying and pasting ?

eg.
 - have a button on a webpage which performs the functionality when clicked (eg. by using PHP)
 - have an excel sheet with a button which performs the functionality when clicked (eg. by using VBA)
 - etc
0
 
LVL 3

Author Comment

by:Mutsop
ID: 36488693
I think you undertsood me wrong,

I have an excel file with a form, a number needs to be inputted in a textbox.
After submitting the application needs to copy a header (ex h1) from a website with url
www.example.com/tva.php?number=textbox.value

so excel doesnt need to open a browser, it just needs to be able to copy the header of that specific url.
0
 
LVL 19

Expert Comment

by:akoster
ID: 36488864
So you are looking for something like :

user fills in a number into the textbox
user submits the form (eg. by clicking on a button)
VBA macro is started which
  - generates the URL of the website from which the header is to be copied
  - loads the website into memory (this can be visible or hidden for the user)
  - copies the header from the website
  - possibly pasts the header to a certain cell

This indeed can be done using VBA and will not be that hard to achieve. The most difficult step will be finding the exact location of the specific header in the webpage.
0
 
LVL 35

Expert Comment

by:Robert Schutt
ID: 36488888
@akoster: do you want to dive into this? I have a sheet lying around, am putting a form in for the complete example as I'm typing this. Actually uses a xmlHttp object to retrieve the html and a regexp to find the h1 tag(s). Can post it if you wouldn't feel lobsided (is that the correct word? not a native english speaker ;-)...
0
 
LVL 19

Expert Comment

by:akoster
ID: 36488977
Sure, no problem.
When you have posted the excel workbook, i'll insert the necessary code.

On the other hand, you could also try to use this code fragment and tweak it to fit your needs :
Sub import()
Dim ie As InternetExplorer

    Set ie = New InternetExplorer
    ie.navigate "[url]"
    
    Do While ie.Busy
        Application.Wait DateAdd("s", 1, Now)
    Loop
    
    For Each element In ie.document.getelementsbytagname("H1")
        If element.classname = "[specific header class for the H1 object you are looking for]" Then
            If element.Children.Length > 0 Then
                    result = element.FirstChild.innertext
                    exit for
            End If
        End If
    Next element
    
    Worksheets("[name of worksheet in which result is placed]").Range("[address of cell in which result is placed") = result

    ie.Quit
    Set ie = Nothing

End Sub

Open in new window

0
 
LVL 19

Expert Comment

by:akoster
ID: 36488990
Keep in mind though that this is a starting point taken from an existing application and needs some tweaking to your situation but also some checks to make it more robust.
0
 
LVL 35

Expert Comment

by:Robert Schutt
ID: 36489033
funny, quite a different approach. If only for reference, I'll post my take on it:

 readpage.xls

First time, close the form and adjust the url in the code for UserForm1 (or copy the other functions to your own workbook).
0
 
LVL 19

Expert Comment

by:akoster
ID: 36489489
Well, i can see pro's and cons in both approaches.
I think it would be best if author tries one or if possible both approaches...



readpage-AK.xls
0
 
LVL 3

Author Comment

by:Mutsop
ID: 36493850
Hi,
thanks a million for helping me out with this :)

Now I tried both of your ideas and actually like both.

@akoster:
The positive part about your code is that I can select a specific tag. Which can be useful in another project I'm working on. Well I think, it would be all of the childs of a specific parent in an online xml file.

Downside is, as the webpage I need to retrieve info from has several tags needed (also certain <p> tags) I can't really use em all.

@robert_schutt:
I love the use of regex! problem thought was I couldn't find a the name immediately of the reference. Apparently it was microsoft specific and found it in your xls file.
With your code I could just copy paste my several regexes from my vb.net project :)

Thanks again to both!

(i'll accept this project this afternoon, in case you want to comment on this)
0
 
LVL 19

Expert Comment

by:akoster
ID: 36493981
Mutsop,

good to see that our combined efforts helped you out !

if the webpage uses different styles / classes for the tags and you are only interested in a specific class name, you could use this to filter them, eg. by using this code


    For Each element In ie.document.getelementsbytagname("p")
        If element.classname = "classname" Then
           [...]
        end if
    next element

Open in new window


next, if a span or paragraph contains a number of tags, you can also parse these by using

 
   For Each element In ie.document.getelementsbytagname("p")
        If element.classname = "classname" Then
            If element.Children.Length > 0 Then
                msgbox "First element : " & element.FirstChild.innertext
                msgbox "Third element : " & element.children(2).innertext
            end if
        end if
    next element

Open in new window


0
How to run any project with ease

Manage projects of all sizes how you want. Great for personal to-do lists, project milestones, team priorities and launch plans.
- Combine task lists, docs, spreadsheets, and chat in one
- View and edit from mobile/offline
- Cut down on emails

 
LVL 3

Author Comment

by:Mutsop
ID: 36494182
k so I managed to get the company name using both of your methods.

But now I need to get the adres: for example following url shows what I'm talking about:
http://www.societe.com/societe/lomatrans-location-manutention-transports-306159021.html

But to get the adres ("Siege Social") I'm kind of in a struggle using any of your methods.
The regex in vb.net I use:
<td.*?>\s*Siège social\s*</td>\s*<td>\s*<a.*?>(.*?)<br/?>(.*?)</a>

Open in new window

that works, but not when using it in vba. (check code snippet)

@akoster: as you can see, using your method here would almost be impossible as there are no class name indications.

<tbody><tr align="left" valign="top">
																			<td style="color:#777777">
																				Activité
																			</td>
																			<td colspan="2">
																				Transports routiers de fret de proximité<br>4941B
																			</td>
																		</tr>
																		
																		
																		<tr align="left" valign="top">
																			<td style="color:#777777">
																			       Catégorie
																			</td>
																			<td colspan="2">
																				Transports
																			</td>
																		</tr>
																		
					

                   			           
																		<tr align="left" valign="top">
																			<td style="color:#777777">
																				Siège social
																			</td>
																			<td>
																				<a href="http://maps.google.fr/maps?f=q&amp;hl=fr&amp;geocode=&amp;q=%20Quartier%20de%20la%20Grande%20Bastide%20CHATEAUNEUF%20LES%20MARTIGUES" target="_blank" class="lienFleche"> Quartier de la Grande Bastide <br>13220 CHATEAUNEUF LES MARTIGUES</a>
																			</td>
				 <td valign="middle" align="left"><a href="http://maps.google.fr/maps?f=q&amp;hl=fr&amp;geocode=&amp;q=%20Quartier%20de%20la%20Grande%20Bastide%20CHATEAUNEUF%20LES%20MARTIGUES" target="_blank"><img border="0" src="http://img1.societe.com/imgz/locator.png" alt="google map" title="plan"></a></td>
																		</tr>
																		
																		
																		<tr align="left" valign="top">
																			<td style="color:#777777">
																				Forme juridique
																			</td>
																			<td colspan="2">
																				société par actions simplifiée
																			</td>
																		</tr>
																		
																	</tbody>

Open in new window

0
 
LVL 35

Expert Comment

by:Robert Schutt
ID: 36494261
It works for me in VBA.

I tried it out by adding another piece of code in CommandButton1_Click, just before the UserForm1.Hide()

 re.Pattern = "<td.*?>\s*Siège social\s*</td>\s*<td>\s*<a.*?>(.*?)<br/?>(.*?)</a>"
     Set Matches = re.Execute(t)
    For Each Match In Matches
        ActiveCell.Value = Match.Submatches(0)
        ActiveCell.Offset(0, 1).Select
        ActiveCell.Value = Match.Submatches(1)
        ActiveCell.Offset(1, -1).Select
    Next

Open in new window


It could have something to do with setting re.Global en MultiLine.
0
 
LVL 19

Expert Comment

by:akoster
ID: 36494465
you can try this :
    Dim ie As InternetExplorer

    Set ie = New InternetExplorer
    ie.navigate "http://www.societe.com/societe/lomatrans-location-manutention-transports-306159021.html"
    
    Do While ie.Busy
        Application.Wait DateAdd("s", 1, Now)
    Loop
    
    Sheet1.Range("A1").Select
    '-- parse company name
    For Each element In ie.Document.getelementsbytagname("H1")
        ActiveCell.Value = element.innertext
        ActiveCell.Offset(1).Select
    Next element
    
    '-- parse address
    For Each element In ie.Document.getelementsbytagname("TD")
        If element.Style.Color = "#666" Then
            '-- table element with color #666 contains the div we are looking for
            ActiveCell.Value = element.FirstChild.innertext
            ActiveCell.Offset(1).Select
        End If
    Next element
    
    ie.Quit
    Set ie = Nothing
    UserForm1.Hide

Open in new window

0
 
LVL 3

Author Comment

by:Mutsop
ID: 36494663
@Robert_Schutt:

weird, this is what I have:
Private Sub CommandButtonBtw_Click()
    Dim rw As Long
    rw = ActiveSheet.Range("D" & ActiveSheet.Rows.Count).End(xlUp).row
    
    ActiveSheet.Cells(rw + 1, "D").Value = UCase(TextBoxBtw.Value)
    
    ActiveSheet.Cells(rw + 1, "E").Select
    Dim t
    t = getHtml("http://www.societe.com/cgi-bin/mainsrch?champ=" & CStr(tva))
    Dim re1
    Set re1 = New RegExp
    re1.Global = True
    re1.MultiLine = False ' True
    re1.Pattern = "<h1[^>]*>(.*?)</h1>"
    Dim Matches1, Match1
    Set Matches1 = re1.Execute(t)
    For Each Match1 In Matches1
        ActiveCell.Value = Match1.Submatches(0)
        ActiveCell.Offset(0, 1).Select
    Next

    Dim re2
    Set re2 = New RegExp
    re2.Global = True
    re2.MultiLine = False ' True
    re2.Pattern = "<td.*?>\s*Siège social\s*</td>\s*<td>\s*<a.*?>(.*?)<br/?>(.*?)</a>"
    Dim Matches2, Match2
    Set Matches2 = re2.Execute(t)
    For Each Match2 In Matches2
        ActiveCell.Value = Match.Submatches(0)
        ActiveCell.Offset(0, 1).Select
        ActiveCell.Value = Match.Submatches(1)
        ActiveCell.Offset(1, -1).Select
    Next
    frmBtwInvoegen.Hide
    'GetInfo
    Me.Hide
End Sub

Open in new window

And the second part doesnt seem to work. I tried all possible combinations with the global and MultiLine setting.

@akoster: I'll try that out :)
0
 
LVL 35

Expert Comment

by:Robert Schutt
ID: 36494728
Hi, I think I spotted the problem: on lines 30 and 32 you're using Match instead of Match2.
0
 
LVL 3

Author Comment

by:Mutsop
ID: 36494743
@akoster:

seems to work... although I have a minor problem.
1. the street and city needs to be splitted in 2 different cells (next to each other)
2. "Afficher le Numéro de téléphone " is written after the street. that seems to be a link but it inherits the same color of td (#666).

So I thought writing :
If element.Style.Color = "#666" And element.classname <> "tellink" Then
            '-- table element with color #666 contains the div we are looking for
            ActiveCell.Value = element.FirstChild.innertext
            ActiveCell.Offset(0, 1).Select
        End If

Open in new window


see "And element.classname <> "tellink""... problem though is that its a firstChild.innertext.
0
 
LVL 3

Author Comment

by:Mutsop
ID: 36494762
@robert_schutt: Sorry I copied the wrong code (I accidently pressed undo before copying the code).
But I did use Match2 and still didn't work.

I tried global false and true and MultiLine too. None work.

Although the first one works (header) not the addresses.
0
 
LVL 19

Assisted Solution

by:akoster
akoster earned 250 total points
ID: 36494819
Assuming that you'd rather not have the 'afficher le numero de telephone' included, try this :
    For Each Element In ie.Document.getelementsbytagname("TD")
        If Element.Style.Color = "#666" Then
            '-- table element with color #666 contains the div we are looking for
            Address = Split(Element.FirstChild.innertext, vbCrLf)(0)
            '-- split street and city
            For pos = Len(Address) To 1 Step -1
                If IsNumeric(Mid(Address, pos, 1)) Then
                    '-- everything up to this position is considered street
                    ActiveCell.Value = Left(Address, pos)
                    ActiveCell.Offset(0, 1).Select
                    '-- everything after this position is considered city
                    ActiveCell.Value = Mid(Address, pos + 1)
                    ActiveCell.Offset(1, -1).Select
                    '-- address found, stop splitting
                    Exit For
                End If
            Next pos
        End If
    Next Element

Open in new window

0
 
LVL 35

Accepted Solution

by:
Robert Schutt earned 250 total points
ID: 36494948
I think there's a problem with the character set of the retrieved page. If I change the pattern, it does work.

    re2.Pattern = "<td.*?>\s*\S+ social\s*</td>\s*<td>\s*<a.*?>(.*?)<br/?>(.*?)</a>"

Open in new window

0
 
LVL 3

Author Closing Comment

by:Mutsop
ID: 36495093
Both work perfectly :)

Thanks alot for your help guys.
I both gave you points as I can actually use both on different projects.
0

Featured Post

6 Surprising Benefits of Threat Intelligence

All sorts of threat intelligence is available on the web. Intelligence you can learn from, and use to anticipate and prepare for future attacks.

Join & Write a Comment

This script will sweep a range of IP addresses (class c only, 255.255.255.0) and report to a log the version of office installed. What it does: 1.)      Creates log file in the directory the script is run from (if it doesn't already exist) 2.)      Sweep…
This article will guide you to convert a grid from a picture into Excel format using Microsoft OneNote and no other 3rd party application.
This Micro Tutorial will demonstrate in Microsoft Excel how to add style and sexy appeal to horizontal bar charts.
This Micro Tutorial will demonstrate how to create pivot charts out of a data set. I also added a drop-down menu which allows to choose from different categories in the data set and the chart will automatically update.

747 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

9 Experts available now in Live!

Get 1:1 Help Now