Link to home
Start Free TrialLog in
Avatar of Raheman M. Abdul
Raheman M. AbdulFlag for United Kingdom of Great Britain and Northern Ireland

asked on

Parsing / Outputing certain HTML source file data into Excel in VBA

I have some URLs (listed in a text file), For each URL listed, I want a particular block of HTML data as shown in attached file (highlighted) to be retrived from the source HTML file of the URL and parsed it to the following format and stored in an excel file in the order as retrived.

One URL: http://www.actfind.com/categories/wholesale-Laptop-Notebook/?page=2&sort=featured

Attached document shows the HTML source for the above URL. I have highlighted in YELLOW the block of HTML to retain after truncating the other things out.
From this block of HTML code, i need the following to be stored into excel sheet in the respective columns.
1. img src     eg: 'http://www.actfind.com/product_images/o/pcnb1015__15534_thumb.jpg'
2. product name
3. price

Separate excel sheets for each URL in the list please. VBA code is preferred for ease of my understanding please.
Thanks
HTML-data.doc
Avatar of TomSchreiner
TomSchreiner

It's difficult to code for a textfile unless I know the format.  Please post your text file...  Thx
Avatar of Raheman M. Abdul

ASKER

Sorry to attach. thanks
sourceURLs.txt
No problem.  :)
Mara.  Please select choice A or B for what you are refering to as the product name...
A.
Lenovo ThinkPad W500 Notebook- Core 2 Duo P9500 2.53GHz - 15.4" WSXGA+2GB RAM - 320GB HDD [PCNB1012]
B.
Lenovo ThinkPad W500 Notebook
 
The name and description are a combined string.  I can split it using the hyphen, but I cannot guarantee that there will always be a hyphen.
Double check this output please:
--------------------------------------------------
http://www.actfind.com/product_images/o/pcnb1015__15534_thumb.jpg
Lenovo ThinkPad T400
1,423.99
--------------------------------------------------
http://www.actfind.com/product_images/c/pcnb1014__76758_thumb.jpg
Lenovo ThinkPad T500
1,723.99
--------------------------------------------------
http://www.actfind.com/product_images/t/pcnb1022__64508_thumb.jpg
Lenovo ThinkPad T61P 6459
1,521.97
--------------------------------------------------
http://www.actfind.com/product_images/v/pcnb1013__56417_thumb.jpg
Lenovo ThinkPad W500
2,009.89
--------------------------------------------------
http://www.actfind.com/product_images/y/pcnb1012__14259_thumb.jpg
Lenovo ThinkPad W500
1,993.79
--------------------------------------------------
http://www.actfind.com/product_images/z/pcnb1011__39374_thumb.jpg
Lenovo ThinkPad X Series X200 7458 NoteBook
1,151.79
--------------------------------------------------
http://www.actfind.com/product_images/v/pcnb1010__98812_thumb.jpg
Lenovo ThinkPad X Series X300 NoteBook Intel Core 2 Duo L7100 LV(1.2GHz)
1,947.79
--------------------------------------------------
http://www.actfind.com/product_images/c/pcnb1020__32129_thumb.jpg
Lenovo ThinkPad X200
1,139.97
--------------------------------------------------
http://www.actfind.com/product_images/z/pcnb1021__36738_thumb.jpg
Lenovo Thinkpad X61T Tablet PC
1,536.97
--------------------------------------------------
http://www.actfind.com/product_images/k/pcnb1089_1__13863_thumb.jpg
New Sealed HP FM906UT 10.1 inch
361.90
--------------------------------------------------
http://www.actfind.com/product_images/u/pcnb1097_1__83571_thumb.jpg
New Sealed SONY VGN NS305DS 15.4 inch
735.90
--------------------------------------------------
http://www.actfind.com/product_images/w/pcnb1094_1__16237_thumb.jpg
New Sealed SONY VGNCS230JW 14.1 inch
1,109.90
--------------------------------------------------
http://www.actfind.com/product_images/n/pcnb1096_1__84434_thumb.jpg
New Sealed SONY VGNFW460JB 16.4 inch
1,109.90
--------------------------------------------------
http://www.actfind.com/product_images/s/pcnb1095_1__18995_thumb.jpg
New Sealed Toshiba A505-S6970 16.0 inch
913.00
--------------------------------------------------
http://www.actfind.com/product_images/i/pcnb1103_1__50444_thumb.jpg
New Sealed TOSHIBA L300 - EZ1521 15.4 inch
522.50
--------------------------------------------------
http://www.actfind.com/product_images/c/pcnb1028__38771_thumb.jpg
SONY VGN-CS17H/W - 14.1 inch
1,051.89
--------------------------------------------------
http://www.actfind.com/product_images/u/pcnb1026__71701_thumb.jpg
SONY VGN-CS23H/B
899.99
--------------------------------------------------
http://www.actfind.com/product_images/f/pcnb1027__58992_thumb.jpg
SONY VGN-CS23H/S
899.99
--------------------------------------------------
http://www.actfind.com/product_images/u/pcnb1023__87641_thumb.jpg
SONY VGN-SR38/B
1,610.89
--------------------------------------------------
http://www.actfind.com/product_images/j/pcnb1024__76292_thumb.jpg
SONY VGN-SR38/J
1,619.99
Looks good Tom.

Option A please.

Is it possible for you to automatically iterate to next page until last page found instead of writing all possible URLs in a text file? for all the pages
"Is it possible for you to automatically iterate to next page until last page found instead of writing all possible URLs in a text file? for all the pages"
Yes.  If that is the case, is there any longer a need for the text file in this solution?
I'm calling it a night.  This code is not finished if it must read through a text file.  If so, I'll edit it.
 

Option Explicit

'References
'Miscrosoft WINHTTP Services
'Microsoft HTML Library
Sub RunThis()
    Dim Req As WinHttpRequest, PageNum As Integer, URL As String, HasData As Boolean, RetData()
    Dim RowsCnt As Long
    Set Req = New WinHttpRequest
    
    HasData = True
    Do Until Not HasData
        PageNum = PageNum + 1
        URL = "http://www.actfind.com/categories/wholesale-Laptop-Notebook/?page=" & CStr(PageNum) & "&sort=featured"
        RetData = Request(URL, Req, HasData)
        
        If HasData Then
            With Worksheets
                With .Add(, .Item(.Count))
                    .Cells(1) = "Image Source"
                    .Cells(2) = "Item Name"
                    .Cells(3) = "Price"
                    If UBound(RetData, 2) < 2 Then
                        ReDim Preserve RetData(1 To UBound(RetData, 1), 1 To 2)
                    End If
                    RetData = Application.Transpose(RetData)
                    .Cells(2, 1).Resize(UBound(RetData, 1), UBound(RetData, 2)) = RetData
                    With .Columns("A:C")
                        .Item(3).NumberFormat = "$#,##0.00"
                        .EntireColumn.AutoFit
                    End With
                End With
            End With
        Else
            Set Req = Nothing
            Exit Sub
        End If
    Loop
End Sub

Function Request(URL As String, Req As WinHttpRequest, HasData As Boolean) As Variant()
    Dim RT As String, Doc1 As New HTMLDocument, Doc As New HTMLDocument, x As Integer, RetData()
    Dim frm As HTMLFormElement, ul As HTMLUListElement, li As HTMLLIElement, DescrTxt As String
    
    Req.Open "GET", URL
    Req.Send
    RT = Req.ResponseText
    
    Set Doc = New HTMLDocument
    Doc.Clear
    CallByName Doc, "Write", VbMethod, RT
    
    Set frm = Doc.getElementById("frmCompare")
    HasData = (Not frm Is Nothing)
    
    If HasData Then
        Set ul = frm.getElementsByTagName("ul")(0)
        ReDim RetData(1 To 3, 1 To 1)
        
        For Each li In ul.childNodes
            x = x + 1
            ReDim Preserve RetData(1 To 3, 1 To x)
            RetData(1, x) = li.getElementsByTagName("IMG")(0).src
            DescrTxt = Trim(li.getElementsByTagName("STRONG")(0).innerText)
            RetData(2, x) = Trim(li.getElementsByTagName("STRONG")(0).innerText)
            RetData(3, x) = Trim(Split(li.getElementsByTagName("EM")(0).innerText, "$")(1))
        Next
        HasData = Not (x = 0)
    End If
    
    Request = RetData
End Function

Open in new window

this is ok for one category
in text file i will write the urls of other categories so text file is a must please
can u write in steps to follow since i got errors

compile error: user-defined type not defined
marahman3001,

Perhaps the attached file will do some of what you want.

Patrick
marahman3001-01.xls
Patrickab:
  Some product names missing in the result just the prices.
Image locations did not appear in the result.
The result should be exactly as commented by TomSchreiner

thanks
marahman3001 - OK, understood. - Patrick
Mara.  The error may be missing references.  Try the attached.

HTMLGrab.xls
BTW.  This only uses a single connection.  If you have many items to download, you can use more connections.  You don't have that option with MS's web query.
Mara.  I just saw your note about incorporating text files.  That's easy but I'm not going to have time until later today.  If someone else would not mind finishing this up I would be much obliged.  :)
 
FileToOpen = GetOpenFileName
Loop through file
     my current code
End Loop
Or something like that - Thanks
The last file you send resulted in just 20 rows and got
Run time error '9'
Subscript out of range
I cannot reproduce the error.  What line is throwing it?
I have to go.  If someone else does not solve your problem, I'll work on this later.  What version of Excel are you using?
Sorry Tom, thanks for your time and effort.
Error in the following
            RetData(3, x) = Trim(Split(li.getElementsByTagName("EM")(0).innerText, "$")(1))

Excel 2007
Line 66
Hi Mara.  My code is attempting to return a value instead of simply a string.  If the $ happens to be missing, then that error would occur.  I did not account for that.  The following code will still attemp to return a numeric value but if it fails, it will still return whatever the website is listing as the price.  See the attached file.
Have a nice day...
Tom

Function Request(URL As String, Req As WinHttpRequest, HasData As Boolean) As Variant()
    Dim RT As String, Doc1 As New HTMLDocument, Doc As New HTMLDocument, x As Integer, RetData()
    Dim frm As HTMLFormElement, ul As HTMLUListElement, li As HTMLLIElement, PriceText As String
    
    Req.Open "GET", URL
    Req.Send
    RT = Req.ResponseText
    
    Set Doc = New HTMLDocument
    Doc.Clear
    CallByName Doc, "Write", VbMethod, RT
    
    Set frm = Doc.getElementById("frmCompare")
    HasData = (Not frm Is Nothing)
    
    If HasData Then
        Set ul = frm.getElementsByTagName("ul")(0)
        ReDim RetData(1 To 3, 1 To 1)
        
        For Each li In ul.childNodes
            x = x + 1
            ReDim Preserve RetData(1 To 3, 1 To x)
            RetData(1, x) = li.getElementsByTagName("IMG")(0).src
            RetData(2, x) = Trim(li.getElementsByTagName("STRONG")(0).innerText)
            PriceText = Trim(li.getElementsByTagName("EM")(0).innerText)
            If InStr(PriceText, "$") <> 0 Then
                RetData(3, x) = Split(PriceText, "$")(1)
            Else
                RetData(3, x) = PriceText
            End If
        Next
        HasData = Not (x = 0)
    End If
    
    Request = RetData
End Function

Open in new window

HTMLGrab2.xls
Good work Tom
The only thing is the prices do not match and some prices do not exist. Look into this please.
Thanks again Tom
Yes got it Tom
the currency is set by default to US $ so its giving that price.

if we run the          http://www.actfind.com/categories/wholesale-Laptop-Notebook/?setCurrencyId=3
it will be solved and the prices will be displayed in GBP (£)

I have added the following to the code before calling the actual pages.

CurrencySetToGBP_URL = "http://www.actfind.com/categories/wholesale-Laptop-Notebook/?setCurrencyId=3"
        RetData = Request(CurrencySetToGBP_URL, Req, HasData)

Is this effective code or any better one , suggest please.

This resolves my question and happy to give you max points.
Your solution if fine.  If you are not concerned with returning your prices as values, but a string only, then simply replace snippet one with snippet two.  This should account for any currency.
You are probably aware that any change in the structure of the web page will likely break our procedure.  If so, just post back to this thread if that is possible.  In most cases, a minor change in the code will compensate for any HTML alterations.  Also, if you are downloading large amounts of data, you can use multiple, asynchronous connections.  I would not use any more than 8 at a time to any single host lest your IP is flagged for suspicion of DOS attacks.

Snippet One:
            
            PriceText = Trim(li.getElementsByTagName("EM")(0).innerText)
            If InStr(PriceText, "$") <> 0 Then
                RetData(3, x) = Split(PriceText, "$")(1)
            Else
                RetData(3, x) = PriceText
            End If

Snippet Two:

            RetData(3, x) = Trim(li.getElementsByTagName("EM")(0).innerText)

Open in new window

Oh!  What about the text file?  I have yet to incoporate that.  Do you still need it?  It's no big deal to add it.
Many thanks Tom for all your effort. Bye
Yes please
ASKER CERTIFIED SOLUTION
Avatar of TomSchreiner
TomSchreiner

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
It is not reading the second line, just first line and stops.
Anyway this is very good and working excellent.
Excellent coding standard displayed and well done in a very less time. Great
"It is not reading the second line, just first line and stops."
The textfile?
Hi Mara...
Replace the "RunThis" procedure with this code.  

Sub RunThis()
    Dim Req As WinHttpRequest, PageNum As Integer, URL As String, HasData As Boolean, RetData()
    Dim RowsCnt As Long, TxtFileName As String, fso As New FileSystemObject, ts As TextStream
    
    TxtFileName = Application.GetOpenFilename("Text Files (*.txt), *.txt")
    If TxtFileName = "False" Then Exit Sub
    
    Set ts = fso.GetFile(TxtFileName).OpenAsTextStream
    
    Set Req = New WinHttpRequest
    
    Do Until ts.AtEndOfStream
        URL = ts.ReadLine
        PageNum = 0
        HasData = True
        Do Until Not HasData
            PageNum = PageNum + 1
            URL = Left(URL, InStr(URL, "?page=") + 5) & CStr(PageNum) & Right(URL, Len(URL) - InStr(InStr(URL, "?page="), URL, "&") + 1)
            RetData = Request(URL, Req, HasData)
            
            If HasData Then
                With Worksheets
                    With .Add(, .Item(.Count))
                        .Cells(1) = "Image Source"
                        .Cells(2) = "Item Name"
                        .Cells(3) = "Price"
                        If UBound(RetData, 2) < 2 Then
                            ReDim Preserve RetData(1 To UBound(RetData, 1), 1 To 2)
                        End If
                        RetData = Application.Transpose(RetData)
                        .Cells(2, 1).Resize(UBound(RetData, 1), UBound(RetData, 2)) = RetData
                        With .Columns("A:C")
                            .Item(3).NumberFormat = "$#,##0.00"
                            .EntireColumn.AutoFit
                        End With
                    End With
                End With
            Else
                Exit Do
            End If
        Loop
    Loop
    ts.Close
    Set Req = Nothing
End Sub

Open in new window

Everything Good and working fine. thanks again Tom. Bye