?
Solved

Parsing / Outputing certain HTML source file data into Excel in VBA

Posted on 2010-01-10
33
Medium Priority
?
1,281 Views
Last Modified: 2013-11-27
I have some URLs (listed in a text file), For each URL listed, I want a particular block of HTML data as shown in attached file (highlighted) to be retrived from the source HTML file of the URL and parsed it to the following format and stored in an excel file in the order as retrived.

One URL: http://www.actfind.com/categories/wholesale-Laptop-Notebook/?page=2&sort=featured

Attached document shows the HTML source for the above URL. I have highlighted in YELLOW the block of HTML to retain after truncating the other things out.
From this block of HTML code, i need the following to be stored into excel sheet in the respective columns.
1. img src     eg: 'http://www.actfind.com/product_images/o/pcnb1015__15534_thumb.jpg'
2. product name
3. price

Separate excel sheets for each URL in the list please. VBA code is preferred for ease of my understanding please.
Thanks
HTML-data.doc
0
Comment
Question by:Raheman M. Abdul
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 17
  • 14
  • 2
33 Comments
 
LVL 6

Expert Comment

by:TomSchreiner
ID: 26280271
It's difficult to code for a textfile unless I know the format.  Please post your text file...  Thx
0
 
LVL 19

Author Comment

by:Raheman M. Abdul
ID: 26280284
Sorry to attach. thanks
sourceURLs.txt
0
 
LVL 6

Expert Comment

by:TomSchreiner
ID: 26280287
No problem.  :)
0
Microsoft Certification Exam 74-409

Veeam® is happy to provide the Microsoft community with a study guide prepared by MVP and MCT, Orin Thomas. This guide will take you through each of the exam objectives, helping you to prepare for and pass the examination.

 
LVL 6

Expert Comment

by:TomSchreiner
ID: 26280374
Mara.  Please select choice A or B for what you are refering to as the product name...
A.
Lenovo ThinkPad W500 Notebook- Core 2 Duo P9500 2.53GHz - 15.4" WSXGA+2GB RAM - 320GB HDD [PCNB1012]
B.
Lenovo ThinkPad W500 Notebook
 
The name and description are a combined string.  I can split it using the hyphen, but I cannot guarantee that there will always be a hyphen.
0
 
LVL 6

Expert Comment

by:TomSchreiner
ID: 26280399
Double check this output please:
--------------------------------------------------
http://www.actfind.com/product_images/o/pcnb1015__15534_thumb.jpg
Lenovo ThinkPad T400
1,423.99
--------------------------------------------------
http://www.actfind.com/product_images/c/pcnb1014__76758_thumb.jpg
Lenovo ThinkPad T500
1,723.99
--------------------------------------------------
http://www.actfind.com/product_images/t/pcnb1022__64508_thumb.jpg
Lenovo ThinkPad T61P 6459
1,521.97
--------------------------------------------------
http://www.actfind.com/product_images/v/pcnb1013__56417_thumb.jpg
Lenovo ThinkPad W500
2,009.89
--------------------------------------------------
http://www.actfind.com/product_images/y/pcnb1012__14259_thumb.jpg
Lenovo ThinkPad W500
1,993.79
--------------------------------------------------
http://www.actfind.com/product_images/z/pcnb1011__39374_thumb.jpg
Lenovo ThinkPad X Series X200 7458 NoteBook
1,151.79
--------------------------------------------------
http://www.actfind.com/product_images/v/pcnb1010__98812_thumb.jpg
Lenovo ThinkPad X Series X300 NoteBook Intel Core 2 Duo L7100 LV(1.2GHz)
1,947.79
--------------------------------------------------
http://www.actfind.com/product_images/c/pcnb1020__32129_thumb.jpg
Lenovo ThinkPad X200
1,139.97
--------------------------------------------------
http://www.actfind.com/product_images/z/pcnb1021__36738_thumb.jpg
Lenovo Thinkpad X61T Tablet PC
1,536.97
--------------------------------------------------
http://www.actfind.com/product_images/k/pcnb1089_1__13863_thumb.jpg
New Sealed HP FM906UT 10.1 inch
361.90
--------------------------------------------------
http://www.actfind.com/product_images/u/pcnb1097_1__83571_thumb.jpg
New Sealed SONY VGN NS305DS 15.4 inch
735.90
--------------------------------------------------
http://www.actfind.com/product_images/w/pcnb1094_1__16237_thumb.jpg
New Sealed SONY VGNCS230JW 14.1 inch
1,109.90
--------------------------------------------------
http://www.actfind.com/product_images/n/pcnb1096_1__84434_thumb.jpg
New Sealed SONY VGNFW460JB 16.4 inch
1,109.90
--------------------------------------------------
http://www.actfind.com/product_images/s/pcnb1095_1__18995_thumb.jpg
New Sealed Toshiba A505-S6970 16.0 inch
913.00
--------------------------------------------------
http://www.actfind.com/product_images/i/pcnb1103_1__50444_thumb.jpg
New Sealed TOSHIBA L300 - EZ1521 15.4 inch
522.50
--------------------------------------------------
http://www.actfind.com/product_images/c/pcnb1028__38771_thumb.jpg
SONY VGN-CS17H/W - 14.1 inch
1,051.89
--------------------------------------------------
http://www.actfind.com/product_images/u/pcnb1026__71701_thumb.jpg
SONY VGN-CS23H/B
899.99
--------------------------------------------------
http://www.actfind.com/product_images/f/pcnb1027__58992_thumb.jpg
SONY VGN-CS23H/S
899.99
--------------------------------------------------
http://www.actfind.com/product_images/u/pcnb1023__87641_thumb.jpg
SONY VGN-SR38/B
1,610.89
--------------------------------------------------
http://www.actfind.com/product_images/j/pcnb1024__76292_thumb.jpg
SONY VGN-SR38/J
1,619.99
0
 
LVL 19

Author Comment

by:Raheman M. Abdul
ID: 26280466
Looks good Tom.

Option A please.

Is it possible for you to automatically iterate to next page until last page found instead of writing all possible URLs in a text file? for all the pages
0
 
LVL 6

Expert Comment

by:TomSchreiner
ID: 26280505
"Is it possible for you to automatically iterate to next page until last page found instead of writing all possible URLs in a text file? for all the pages"
Yes.  If that is the case, is there any longer a need for the text file in this solution?
0
 
LVL 6

Expert Comment

by:TomSchreiner
ID: 26280600
I'm calling it a night.  This code is not finished if it must read through a text file.  If so, I'll edit it.
 

Option Explicit

'References
'Miscrosoft WINHTTP Services
'Microsoft HTML Library
Sub RunThis()
    Dim Req As WinHttpRequest, PageNum As Integer, URL As String, HasData As Boolean, RetData()
    Dim RowsCnt As Long
    Set Req = New WinHttpRequest
    
    HasData = True
    Do Until Not HasData
        PageNum = PageNum + 1
        URL = "http://www.actfind.com/categories/wholesale-Laptop-Notebook/?page=" & CStr(PageNum) & "&sort=featured"
        RetData = Request(URL, Req, HasData)
        
        If HasData Then
            With Worksheets
                With .Add(, .Item(.Count))
                    .Cells(1) = "Image Source"
                    .Cells(2) = "Item Name"
                    .Cells(3) = "Price"
                    If UBound(RetData, 2) < 2 Then
                        ReDim Preserve RetData(1 To UBound(RetData, 1), 1 To 2)
                    End If
                    RetData = Application.Transpose(RetData)
                    .Cells(2, 1).Resize(UBound(RetData, 1), UBound(RetData, 2)) = RetData
                    With .Columns("A:C")
                        .Item(3).NumberFormat = "$#,##0.00"
                        .EntireColumn.AutoFit
                    End With
                End With
            End With
        Else
            Set Req = Nothing
            Exit Sub
        End If
    Loop
End Sub

Function Request(URL As String, Req As WinHttpRequest, HasData As Boolean) As Variant()
    Dim RT As String, Doc1 As New HTMLDocument, Doc As New HTMLDocument, x As Integer, RetData()
    Dim frm As HTMLFormElement, ul As HTMLUListElement, li As HTMLLIElement, DescrTxt As String
    
    Req.Open "GET", URL
    Req.Send
    RT = Req.ResponseText
    
    Set Doc = New HTMLDocument
    Doc.Clear
    CallByName Doc, "Write", VbMethod, RT
    
    Set frm = Doc.getElementById("frmCompare")
    HasData = (Not frm Is Nothing)
    
    If HasData Then
        Set ul = frm.getElementsByTagName("ul")(0)
        ReDim RetData(1 To 3, 1 To 1)
        
        For Each li In ul.childNodes
            x = x + 1
            ReDim Preserve RetData(1 To 3, 1 To x)
            RetData(1, x) = li.getElementsByTagName("IMG")(0).src
            DescrTxt = Trim(li.getElementsByTagName("STRONG")(0).innerText)
            RetData(2, x) = Trim(li.getElementsByTagName("STRONG")(0).innerText)
            RetData(3, x) = Trim(Split(li.getElementsByTagName("EM")(0).innerText, "$")(1))
        Next
        HasData = Not (x = 0)
    End If
    
    Request = RetData
End Function

Open in new window

0
 
LVL 19

Author Comment

by:Raheman M. Abdul
ID: 26281802
this is ok for one category
in text file i will write the urls of other categories so text file is a must please
can u write in steps to follow since i got errors

compile error: user-defined type not defined
0
 
LVL 45

Expert Comment

by:patrickab
ID: 26282940
marahman3001,

Perhaps the attached file will do some of what you want.

Patrick
marahman3001-01.xls
0
 
LVL 19

Author Comment

by:Raheman M. Abdul
ID: 26283148
Patrickab:
  Some product names missing in the result just the prices.
Image locations did not appear in the result.
The result should be exactly as commented by TomSchreiner

thanks
0
 
LVL 45

Expert Comment

by:patrickab
ID: 26283172
marahman3001 - OK, understood. - Patrick
0
 
LVL 6

Expert Comment

by:TomSchreiner
ID: 26283934
Mara.  The error may be missing references.  Try the attached.

HTMLGrab.xls
0
 
LVL 6

Expert Comment

by:TomSchreiner
ID: 26283952
BTW.  This only uses a single connection.  If you have many items to download, you can use more connections.  You don't have that option with MS's web query.
0
 
LVL 6

Expert Comment

by:TomSchreiner
ID: 26284049
Mara.  I just saw your note about incorporating text files.  That's easy but I'm not going to have time until later today.  If someone else would not mind finishing this up I would be much obliged.  :)
 
FileToOpen = GetOpenFileName
Loop through file
     my current code
End Loop
Or something like that - Thanks
0
 
LVL 19

Author Comment

by:Raheman M. Abdul
ID: 26284127
The last file you send resulted in just 20 rows and got
Run time error '9'
Subscript out of range
0
 
LVL 6

Expert Comment

by:TomSchreiner
ID: 26284150
I cannot reproduce the error.  What line is throwing it?
0
 
LVL 6

Expert Comment

by:TomSchreiner
ID: 26284222
I have to go.  If someone else does not solve your problem, I'll work on this later.  What version of Excel are you using?
0
 
LVL 19

Author Comment

by:Raheman M. Abdul
ID: 26284321
Sorry Tom, thanks for your time and effort.
Error in the following
            RetData(3, x) = Trim(Split(li.getElementsByTagName("EM")(0).innerText, "$")(1))

Excel 2007
0
 
LVL 19

Author Comment

by:Raheman M. Abdul
ID: 26284329
Line 66
0
 
LVL 6

Expert Comment

by:TomSchreiner
ID: 26285611
Hi Mara.  My code is attempting to return a value instead of simply a string.  If the $ happens to be missing, then that error would occur.  I did not account for that.  The following code will still attemp to return a numeric value but if it fails, it will still return whatever the website is listing as the price.  See the attached file.
Have a nice day...
Tom

Function Request(URL As String, Req As WinHttpRequest, HasData As Boolean) As Variant()
    Dim RT As String, Doc1 As New HTMLDocument, Doc As New HTMLDocument, x As Integer, RetData()
    Dim frm As HTMLFormElement, ul As HTMLUListElement, li As HTMLLIElement, PriceText As String
    
    Req.Open "GET", URL
    Req.Send
    RT = Req.ResponseText
    
    Set Doc = New HTMLDocument
    Doc.Clear
    CallByName Doc, "Write", VbMethod, RT
    
    Set frm = Doc.getElementById("frmCompare")
    HasData = (Not frm Is Nothing)
    
    If HasData Then
        Set ul = frm.getElementsByTagName("ul")(0)
        ReDim RetData(1 To 3, 1 To 1)
        
        For Each li In ul.childNodes
            x = x + 1
            ReDim Preserve RetData(1 To 3, 1 To x)
            RetData(1, x) = li.getElementsByTagName("IMG")(0).src
            RetData(2, x) = Trim(li.getElementsByTagName("STRONG")(0).innerText)
            PriceText = Trim(li.getElementsByTagName("EM")(0).innerText)
            If InStr(PriceText, "$") <> 0 Then
                RetData(3, x) = Split(PriceText, "$")(1)
            Else
                RetData(3, x) = PriceText
            End If
        Next
        HasData = Not (x = 0)
    End If
    
    Request = RetData
End Function

Open in new window

HTMLGrab2.xls
0
 
LVL 19

Author Comment

by:Raheman M. Abdul
ID: 26285861
Good work Tom
The only thing is the prices do not match and some prices do not exist. Look into this please.
Thanks again Tom
0
 
LVL 19

Author Comment

by:Raheman M. Abdul
ID: 26286062
Yes got it Tom
the currency is set by default to US $ so its giving that price.

if we run the          http://www.actfind.com/categories/wholesale-Laptop-Notebook/?setCurrencyId=3
it will be solved and the prices will be displayed in GBP (£)

I have added the following to the code before calling the actual pages.

CurrencySetToGBP_URL = "http://www.actfind.com/categories/wholesale-Laptop-Notebook/?setCurrencyId=3"
        RetData = Request(CurrencySetToGBP_URL, Req, HasData)

Is this effective code or any better one , suggest please.

This resolves my question and happy to give you max points.
0
 
LVL 6

Expert Comment

by:TomSchreiner
ID: 26286196
Your solution if fine.  If you are not concerned with returning your prices as values, but a string only, then simply replace snippet one with snippet two.  This should account for any currency.
You are probably aware that any change in the structure of the web page will likely break our procedure.  If so, just post back to this thread if that is possible.  In most cases, a minor change in the code will compensate for any HTML alterations.  Also, if you are downloading large amounts of data, you can use multiple, asynchronous connections.  I would not use any more than 8 at a time to any single host lest your IP is flagged for suspicion of DOS attacks.

Snippet One:
            
            PriceText = Trim(li.getElementsByTagName("EM")(0).innerText)
            If InStr(PriceText, "$") <> 0 Then
                RetData(3, x) = Split(PriceText, "$")(1)
            Else
                RetData(3, x) = PriceText
            End If

Snippet Two:

            RetData(3, x) = Trim(li.getElementsByTagName("EM")(0).innerText)

Open in new window

0
 
LVL 6

Expert Comment

by:TomSchreiner
ID: 26286232
Oh!  What about the text file?  I have yet to incoporate that.  Do you still need it?  It's no big deal to add it.
0
 
LVL 19

Author Comment

by:Raheman M. Abdul
ID: 26286248
Many thanks Tom for all your effort. Bye
0
 
LVL 19

Author Comment

by:Raheman M. Abdul
ID: 26286256
Yes please
0
 
LVL 6

Accepted Solution

by:
TomSchreiner earned 2000 total points
ID: 26287460
Prompts user for text file and reads urls line by line until the end of the file.

Option Explicit

'References
'Miscrosoft WINHTTP Services
'Microsoft HTML Library
'Microsoft Scripting Runtime
Sub RunThis()
    Dim Req As WinHttpRequest, PageNum As Integer, URL As String, HasData As Boolean, RetData()
    Dim RowsCnt As Long, TxtFileName As String, fso As New FileSystemObject, ts As TextStream
    
    TxtFileName = Application.GetOpenFilename("Text Files (*.txt), *.txt")
    If TxtFileName = "False" Then Exit Sub
    
    Set ts = fso.GetFile(TxtFileName).OpenAsTextStream
    
    Set Req = New WinHttpRequest
    
    Do Until ts.AtEndOfStream
        URL = ts.ReadLine
        HasData = True
        Do Until Not HasData
            PageNum = PageNum + 1
            URL = Left(URL, InStr(URL, "?page=") + 5) & CStr(PageNum) & Right(URL, Len(URL) - InStr(InStr(URL, "?page="), URL, "&") + 1)
            RetData = Request(URL, Req, HasData)
            
            If HasData Then
                With Worksheets
                    With .Add(, .Item(.Count))
                        .Name = "Page " & CStr(PageNum)
                        .Cells(1) = "Image Source"
                        .Cells(2) = "Item Name"
                        .Cells(3) = "Price"
                        If UBound(RetData, 2) < 2 Then
                            ReDim Preserve RetData(1 To UBound(RetData, 1), 1 To 2)
                        End If
                        RetData = Application.Transpose(RetData)
                        .Cells(2, 1).Resize(UBound(RetData, 1), UBound(RetData, 2)) = RetData
                        With .Columns("A:C")
                            .Item(3).NumberFormat = "$#,##0.00"
                            .EntireColumn.AutoFit
                        End With
                    End With
                End With
            Else
                Set Req = Nothing
                Exit Sub
            End If
        Loop
    Loop
    ts.Close
End Sub

Function Request(URL As String, Req As WinHttpRequest, HasData As Boolean) As Variant()
    Dim RT As String, Doc1 As New HTMLDocument, Doc As New HTMLDocument, x As Integer, RetData()
    Dim frm As HTMLFormElement, ul As HTMLUListElement, li As HTMLLIElement
    
    Req.Open "GET", URL
    Req.Send
    RT = Req.ResponseText
    
    Set Doc = New HTMLDocument
    Doc.Clear
    CallByName Doc, "Write", VbMethod, RT
    
    Set frm = Doc.getElementById("frmCompare")
    HasData = (Not frm Is Nothing)
    
    If HasData Then
        Set ul = frm.getElementsByTagName("ul")(0)
        ReDim RetData(1 To 3, 1 To 1)
        
        For Each li In ul.childNodes
            x = x + 1
            ReDim Preserve RetData(1 To 3, 1 To x)
            RetData(1, x) = li.getElementsByTagName("IMG")(0).src
            RetData(2, x) = Trim(li.getElementsByTagName("STRONG")(0).innerText)
            RetData(3, x) = Trim(li.getElementsByTagName("EM")(0).innerText)
        Next
        HasData = Not (x = 0)
    End If
    
    Request = RetData
End Function

Open in new window

HTMLGrab3.xls
0
 
LVL 19

Author Comment

by:Raheman M. Abdul
ID: 26288378
It is not reading the second line, just first line and stops.
Anyway this is very good and working excellent.
0
 
LVL 19

Author Closing Comment

by:Raheman M. Abdul
ID: 31675317
Excellent coding standard displayed and well done in a very less time. Great
0
 
LVL 6

Expert Comment

by:TomSchreiner
ID: 26288645
"It is not reading the second line, just first line and stops."
The textfile?
0
 
LVL 6

Expert Comment

by:TomSchreiner
ID: 26288730
Hi Mara...
Replace the "RunThis" procedure with this code.  

Sub RunThis()
    Dim Req As WinHttpRequest, PageNum As Integer, URL As String, HasData As Boolean, RetData()
    Dim RowsCnt As Long, TxtFileName As String, fso As New FileSystemObject, ts As TextStream
    
    TxtFileName = Application.GetOpenFilename("Text Files (*.txt), *.txt")
    If TxtFileName = "False" Then Exit Sub
    
    Set ts = fso.GetFile(TxtFileName).OpenAsTextStream
    
    Set Req = New WinHttpRequest
    
    Do Until ts.AtEndOfStream
        URL = ts.ReadLine
        PageNum = 0
        HasData = True
        Do Until Not HasData
            PageNum = PageNum + 1
            URL = Left(URL, InStr(URL, "?page=") + 5) & CStr(PageNum) & Right(URL, Len(URL) - InStr(InStr(URL, "?page="), URL, "&") + 1)
            RetData = Request(URL, Req, HasData)
            
            If HasData Then
                With Worksheets
                    With .Add(, .Item(.Count))
                        .Cells(1) = "Image Source"
                        .Cells(2) = "Item Name"
                        .Cells(3) = "Price"
                        If UBound(RetData, 2) < 2 Then
                            ReDim Preserve RetData(1 To UBound(RetData, 1), 1 To 2)
                        End If
                        RetData = Application.Transpose(RetData)
                        .Cells(2, 1).Resize(UBound(RetData, 1), UBound(RetData, 2)) = RetData
                        With .Columns("A:C")
                            .Item(3).NumberFormat = "$#,##0.00"
                            .EntireColumn.AutoFit
                        End With
                    End With
                End With
            Else
                Exit Do
            End If
        Loop
    Loop
    ts.Close
    Set Req = Nothing
End Sub

Open in new window

0
 
LVL 19

Author Comment

by:Raheman M. Abdul
ID: 26289825
Everything Good and working fine. thanks again Tom. Bye
0

Featured Post

Free Tool: Path Explorer

An intuitive utility to help find the CSS path to UI elements on a webpage. These paths are used frequently in a variety of front-end development and QA automation tasks.

One of a set of tools we're offering as a way of saying thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

This article describes two methods for creating a combo box that can be used to add new items to the row source -- one for simple lookup tables, and one for a more complex row source where the new item needs data for several fields.
This article shows how to get a list of available printers for display in a drop-down list, and then to use the selected printer to print an Access report or a Word document filled with Access data, using different syntax as needed for working with …
This Micro Tutorial demonstrates using Microsoft Excel pivot tables, how to reverse engineer competitors' marketing strategies through backlinks.
This Micro Tutorial will demonstrate how to use longer labels with horizontal bar charts instead of the vertical column chart.

777 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question