Raheman M. Abdul
asked on
Parsing / Outputing certain HTML source file data into Excel in VBA
I have some URLs (listed in a text file), For each URL listed, I want a particular block of HTML data as shown in attached file (highlighted) to be retrived from the source HTML file of the URL and parsed it to the following format and stored in an excel file in the order as retrived.
One URL: http://www.actfind.com/categories/wholesale-Laptop-Notebook/?page=2&sort=featured
Attached document shows the HTML source for the above URL. I have highlighted in YELLOW the block of HTML to retain after truncating the other things out.
From this block of HTML code, i need the following to be stored into excel sheet in the respective columns.
1. img src eg: 'http://www.actfind.com/product_images/o/pcnb1015__15534_thumb.jpg'
2. product name
3. price
Separate excel sheets for each URL in the list please. VBA code is preferred for ease of my understanding please.
Thanks
HTML-data.doc
One URL: http://www.actfind.com/categories/wholesale-Laptop-Notebook/?page=2&sort=featured
Attached document shows the HTML source for the above URL. I have highlighted in YELLOW the block of HTML to retain after truncating the other things out.
From this block of HTML code, i need the following to be stored into excel sheet in the respective columns.
1. img src eg: 'http://www.actfind.com/product_images/o/pcnb1015__15534_thumb.jpg'
2. product name
3. price
Separate excel sheets for each URL in the list please. VBA code is preferred for ease of my understanding please.
Thanks
HTML-data.doc
It's difficult to code for a textfile unless I know the format. Please post your text file... Thx
ASKER
Sorry to attach. thanks
sourceURLs.txt
sourceURLs.txt
No problem. :)
Mara. Please select choice A or B for what you are refering to as the product name...
A.
Lenovo ThinkPad W500 Notebook- Core 2 Duo P9500 2.53GHz - 15.4" WSXGA+2GB RAM - 320GB HDD [PCNB1012]
B.
Lenovo ThinkPad W500 Notebook
The name and description are a combined string. I can split it using the hyphen, but I cannot guarantee that there will always be a hyphen.
A.
Lenovo ThinkPad W500 Notebook- Core 2 Duo P9500 2.53GHz - 15.4" WSXGA+2GB RAM - 320GB HDD [PCNB1012]
B.
Lenovo ThinkPad W500 Notebook
The name and description are a combined string. I can split it using the hyphen, but I cannot guarantee that there will always be a hyphen.
Double check this output please:
-------------------------- ---------- ---------- ----
http://www.actfind.com/product_images/o/pcnb1015__15534_thumb.jpg
Lenovo ThinkPad T400
1,423.99
-------------------------- ---------- ---------- ----
http://www.actfind.com/product_images/c/pcnb1014__76758_thumb.jpg
Lenovo ThinkPad T500
1,723.99
-------------------------- ---------- ---------- ----
http://www.actfind.com/product_images/t/pcnb1022__64508_thumb.jpg
Lenovo ThinkPad T61P 6459
1,521.97
-------------------------- ---------- ---------- ----
http://www.actfind.com/product_images/v/pcnb1013__56417_thumb.jpg
Lenovo ThinkPad W500
2,009.89
-------------------------- ---------- ---------- ----
http://www.actfind.com/product_images/y/pcnb1012__14259_thumb.jpg
Lenovo ThinkPad W500
1,993.79
-------------------------- ---------- ---------- ----
http://www.actfind.com/product_images/z/pcnb1011__39374_thumb.jpg
Lenovo ThinkPad X Series X200 7458 NoteBook
1,151.79
-------------------------- ---------- ---------- ----
http://www.actfind.com/product_images/v/pcnb1010__98812_thumb.jpg
Lenovo ThinkPad X Series X300 NoteBook Intel Core 2 Duo L7100 LV(1.2GHz)
1,947.79
-------------------------- ---------- ---------- ----
http://www.actfind.com/product_images/c/pcnb1020__32129_thumb.jpg
Lenovo ThinkPad X200
1,139.97
-------------------------- ---------- ---------- ----
http://www.actfind.com/product_images/z/pcnb1021__36738_thumb.jpg
Lenovo Thinkpad X61T Tablet PC
1,536.97
-------------------------- ---------- ---------- ----
http://www.actfind.com/product_images/k/pcnb1089_1__13863_thumb.jpg
New Sealed HP FM906UT 10.1 inch
361.90
-------------------------- ---------- ---------- ----
http://www.actfind.com/product_images/u/pcnb1097_1__83571_thumb.jpg
New Sealed SONY VGN NS305DS 15.4 inch
735.90
-------------------------- ---------- ---------- ----
http://www.actfind.com/product_images/w/pcnb1094_1__16237_thumb.jpg
New Sealed SONY VGNCS230JW 14.1 inch
1,109.90
-------------------------- ---------- ---------- ----
http://www.actfind.com/product_images/n/pcnb1096_1__84434_thumb.jpg
New Sealed SONY VGNFW460JB 16.4 inch
1,109.90
-------------------------- ---------- ---------- ----
http://www.actfind.com/product_images/s/pcnb1095_1__18995_thumb.jpg
New Sealed Toshiba A505-S6970 16.0 inch
913.00
-------------------------- ---------- ---------- ----
http://www.actfind.com/product_images/i/pcnb1103_1__50444_thumb.jpg
New Sealed TOSHIBA L300 - EZ1521 15.4 inch
522.50
-------------------------- ---------- ---------- ----
http://www.actfind.com/product_images/c/pcnb1028__38771_thumb.jpg
SONY VGN-CS17H/W - 14.1 inch
1,051.89
-------------------------- ---------- ---------- ----
http://www.actfind.com/product_images/u/pcnb1026__71701_thumb.jpg
SONY VGN-CS23H/B
899.99
-------------------------- ---------- ---------- ----
http://www.actfind.com/product_images/f/pcnb1027__58992_thumb.jpg
SONY VGN-CS23H/S
899.99
-------------------------- ---------- ---------- ----
http://www.actfind.com/product_images/u/pcnb1023__87641_thumb.jpg
SONY VGN-SR38/B
1,610.89
-------------------------- ---------- ---------- ----
http://www.actfind.com/product_images/j/pcnb1024__76292_thumb.jpg
SONY VGN-SR38/J
1,619.99
--------------------------
http://www.actfind.com/product_images/o/pcnb1015__15534_thumb.jpg
Lenovo ThinkPad T400
1,423.99
--------------------------
http://www.actfind.com/product_images/c/pcnb1014__76758_thumb.jpg
Lenovo ThinkPad T500
1,723.99
--------------------------
http://www.actfind.com/product_images/t/pcnb1022__64508_thumb.jpg
Lenovo ThinkPad T61P 6459
1,521.97
--------------------------
http://www.actfind.com/product_images/v/pcnb1013__56417_thumb.jpg
Lenovo ThinkPad W500
2,009.89
--------------------------
http://www.actfind.com/product_images/y/pcnb1012__14259_thumb.jpg
Lenovo ThinkPad W500
1,993.79
--------------------------
http://www.actfind.com/product_images/z/pcnb1011__39374_thumb.jpg
Lenovo ThinkPad X Series X200 7458 NoteBook
1,151.79
--------------------------
http://www.actfind.com/product_images/v/pcnb1010__98812_thumb.jpg
Lenovo ThinkPad X Series X300 NoteBook Intel Core 2 Duo L7100 LV(1.2GHz)
1,947.79
--------------------------
http://www.actfind.com/product_images/c/pcnb1020__32129_thumb.jpg
Lenovo ThinkPad X200
1,139.97
--------------------------
http://www.actfind.com/product_images/z/pcnb1021__36738_thumb.jpg
Lenovo Thinkpad X61T Tablet PC
1,536.97
--------------------------
http://www.actfind.com/product_images/k/pcnb1089_1__13863_thumb.jpg
New Sealed HP FM906UT 10.1 inch
361.90
--------------------------
http://www.actfind.com/product_images/u/pcnb1097_1__83571_thumb.jpg
New Sealed SONY VGN NS305DS 15.4 inch
735.90
--------------------------
http://www.actfind.com/product_images/w/pcnb1094_1__16237_thumb.jpg
New Sealed SONY VGNCS230JW 14.1 inch
1,109.90
--------------------------
http://www.actfind.com/product_images/n/pcnb1096_1__84434_thumb.jpg
New Sealed SONY VGNFW460JB 16.4 inch
1,109.90
--------------------------
http://www.actfind.com/product_images/s/pcnb1095_1__18995_thumb.jpg
New Sealed Toshiba A505-S6970 16.0 inch
913.00
--------------------------
http://www.actfind.com/product_images/i/pcnb1103_1__50444_thumb.jpg
New Sealed TOSHIBA L300 - EZ1521 15.4 inch
522.50
--------------------------
http://www.actfind.com/product_images/c/pcnb1028__38771_thumb.jpg
SONY VGN-CS17H/W - 14.1 inch
1,051.89
--------------------------
http://www.actfind.com/product_images/u/pcnb1026__71701_thumb.jpg
SONY VGN-CS23H/B
899.99
--------------------------
http://www.actfind.com/product_images/f/pcnb1027__58992_thumb.jpg
SONY VGN-CS23H/S
899.99
--------------------------
http://www.actfind.com/product_images/u/pcnb1023__87641_thumb.jpg
SONY VGN-SR38/B
1,610.89
--------------------------
http://www.actfind.com/product_images/j/pcnb1024__76292_thumb.jpg
SONY VGN-SR38/J
1,619.99
ASKER
Looks good Tom.
Option A please.
Is it possible for you to automatically iterate to next page until last page found instead of writing all possible URLs in a text file? for all the pages
Option A please.
Is it possible for you to automatically iterate to next page until last page found instead of writing all possible URLs in a text file? for all the pages
"Is it possible for you to automatically iterate to next page until last page found instead of writing all possible URLs in a text file? for all the pages"
Yes. If that is the case, is there any longer a need for the text file in this solution?
Yes. If that is the case, is there any longer a need for the text file in this solution?
I'm calling it a night. This code is not finished if it must read through a text file. If so, I'll edit it.
Option Explicit
'References
'Miscrosoft WINHTTP Services
'Microsoft HTML Library
Sub RunThis()
Dim Req As WinHttpRequest, PageNum As Integer, URL As String, HasData As Boolean, RetData()
Dim RowsCnt As Long
Set Req = New WinHttpRequest
HasData = True
Do Until Not HasData
PageNum = PageNum + 1
URL = "http://www.actfind.com/categories/wholesale-Laptop-Notebook/?page=" & CStr(PageNum) & "&sort=featured"
RetData = Request(URL, Req, HasData)
If HasData Then
With Worksheets
With .Add(, .Item(.Count))
.Cells(1) = "Image Source"
.Cells(2) = "Item Name"
.Cells(3) = "Price"
If UBound(RetData, 2) < 2 Then
ReDim Preserve RetData(1 To UBound(RetData, 1), 1 To 2)
End If
RetData = Application.Transpose(RetData)
.Cells(2, 1).Resize(UBound(RetData, 1), UBound(RetData, 2)) = RetData
With .Columns("A:C")
.Item(3).NumberFormat = "$#,##0.00"
.EntireColumn.AutoFit
End With
End With
End With
Else
Set Req = Nothing
Exit Sub
End If
Loop
End Sub
Function Request(URL As String, Req As WinHttpRequest, HasData As Boolean) As Variant()
Dim RT As String, Doc1 As New HTMLDocument, Doc As New HTMLDocument, x As Integer, RetData()
Dim frm As HTMLFormElement, ul As HTMLUListElement, li As HTMLLIElement, DescrTxt As String
Req.Open "GET", URL
Req.Send
RT = Req.ResponseText
Set Doc = New HTMLDocument
Doc.Clear
CallByName Doc, "Write", VbMethod, RT
Set frm = Doc.getElementById("frmCompare")
HasData = (Not frm Is Nothing)
If HasData Then
Set ul = frm.getElementsByTagName("ul")(0)
ReDim RetData(1 To 3, 1 To 1)
For Each li In ul.childNodes
x = x + 1
ReDim Preserve RetData(1 To 3, 1 To x)
RetData(1, x) = li.getElementsByTagName("IMG")(0).src
DescrTxt = Trim(li.getElementsByTagName("STRONG")(0).innerText)
RetData(2, x) = Trim(li.getElementsByTagName("STRONG")(0).innerText)
RetData(3, x) = Trim(Split(li.getElementsByTagName("EM")(0).innerText, "$")(1))
Next
HasData = Not (x = 0)
End If
Request = RetData
End Function
ASKER
this is ok for one category
in text file i will write the urls of other categories so text file is a must please
can u write in steps to follow since i got errors
compile error: user-defined type not defined
in text file i will write the urls of other categories so text file is a must please
can u write in steps to follow since i got errors
compile error: user-defined type not defined
ASKER
Patrickab:
Some product names missing in the result just the prices.
Image locations did not appear in the result.
The result should be exactly as commented by TomSchreiner
thanks
Some product names missing in the result just the prices.
Image locations did not appear in the result.
The result should be exactly as commented by TomSchreiner
thanks
marahman3001 - OK, understood. - Patrick
BTW. This only uses a single connection. If you have many items to download, you can use more connections. You don't have that option with MS's web query.
Mara. I just saw your note about incorporating text files. That's easy but I'm not going to have time until later today. If someone else would not mind finishing this up I would be much obliged. :)
FileToOpen = GetOpenFileName
Loop through file
my current code
End Loop
Or something like that - Thanks
FileToOpen = GetOpenFileName
Loop through file
my current code
End Loop
Or something like that - Thanks
ASKER
The last file you send resulted in just 20 rows and got
Run time error '9'
Subscript out of range
Run time error '9'
Subscript out of range
I cannot reproduce the error. What line is throwing it?
I have to go. If someone else does not solve your problem, I'll work on this later. What version of Excel are you using?
ASKER
Sorry Tom, thanks for your time and effort.
Error in the following
RetData(3, x) = Trim(Split(li.getElementsB yTagName(" EM")(0).in nerText, "$")(1))
Excel 2007
Error in the following
RetData(3, x) = Trim(Split(li.getElementsB
Excel 2007
ASKER
Line 66
Hi Mara. My code is attempting to return a value instead of simply a string. If the $ happens to be missing, then that error would occur. I did not account for that. The following code will still attemp to return a numeric value but if it fails, it will still return whatever the website is listing as the price. See the attached file.
Have a nice day...
Tom
Have a nice day...
Tom
Function Request(URL As String, Req As WinHttpRequest, HasData As Boolean) As Variant()
Dim RT As String, Doc1 As New HTMLDocument, Doc As New HTMLDocument, x As Integer, RetData()
Dim frm As HTMLFormElement, ul As HTMLUListElement, li As HTMLLIElement, PriceText As String
Req.Open "GET", URL
Req.Send
RT = Req.ResponseText
Set Doc = New HTMLDocument
Doc.Clear
CallByName Doc, "Write", VbMethod, RT
Set frm = Doc.getElementById("frmCompare")
HasData = (Not frm Is Nothing)
If HasData Then
Set ul = frm.getElementsByTagName("ul")(0)
ReDim RetData(1 To 3, 1 To 1)
For Each li In ul.childNodes
x = x + 1
ReDim Preserve RetData(1 To 3, 1 To x)
RetData(1, x) = li.getElementsByTagName("IMG")(0).src
RetData(2, x) = Trim(li.getElementsByTagName("STRONG")(0).innerText)
PriceText = Trim(li.getElementsByTagName("EM")(0).innerText)
If InStr(PriceText, "$") <> 0 Then
RetData(3, x) = Split(PriceText, "$")(1)
Else
RetData(3, x) = PriceText
End If
Next
HasData = Not (x = 0)
End If
Request = RetData
End Function
HTMLGrab2.xls
ASKER
Good work Tom
The only thing is the prices do not match and some prices do not exist. Look into this please.
Thanks again Tom
The only thing is the prices do not match and some prices do not exist. Look into this please.
Thanks again Tom
ASKER
Yes got it Tom
the currency is set by default to US $ so its giving that price.
if we run the http://www.actfind.com/categories/wholesale-Laptop-Notebook/?setCurrencyId=3
it will be solved and the prices will be displayed in GBP (£)
I have added the following to the code before calling the actual pages.
CurrencySetToGBP_URL = "http://www.actfind.com/categories/wholesale-Laptop-Notebook/?setCurrencyId=3"
RetData = Request(CurrencySetToGBP_U RL, Req, HasData)
Is this effective code or any better one , suggest please.
This resolves my question and happy to give you max points.
the currency is set by default to US $ so its giving that price.
if we run the http://www.actfind.com/categories/wholesale-Laptop-Notebook/?setCurrencyId=3
it will be solved and the prices will be displayed in GBP (£)
I have added the following to the code before calling the actual pages.
CurrencySetToGBP_URL = "http://www.actfind.com/categories/wholesale-Laptop-Notebook/?setCurrencyId=3"
RetData = Request(CurrencySetToGBP_U
Is this effective code or any better one , suggest please.
This resolves my question and happy to give you max points.
Your solution if fine. If you are not concerned with returning your prices as values, but a string only, then simply replace snippet one with snippet two. This should account for any currency.
You are probably aware that any change in the structure of the web page will likely break our procedure. If so, just post back to this thread if that is possible. In most cases, a minor change in the code will compensate for any HTML alterations. Also, if you are downloading large amounts of data, you can use multiple, asynchronous connections. I would not use any more than 8 at a time to any single host lest your IP is flagged for suspicion of DOS attacks.
You are probably aware that any change in the structure of the web page will likely break our procedure. If so, just post back to this thread if that is possible. In most cases, a minor change in the code will compensate for any HTML alterations. Also, if you are downloading large amounts of data, you can use multiple, asynchronous connections. I would not use any more than 8 at a time to any single host lest your IP is flagged for suspicion of DOS attacks.
Snippet One:
PriceText = Trim(li.getElementsByTagName("EM")(0).innerText)
If InStr(PriceText, "$") <> 0 Then
RetData(3, x) = Split(PriceText, "$")(1)
Else
RetData(3, x) = PriceText
End If
Snippet Two:
RetData(3, x) = Trim(li.getElementsByTagName("EM")(0).innerText)
Oh! What about the text file? I have yet to incoporate that. Do you still need it? It's no big deal to add it.
ASKER
Many thanks Tom for all your effort. Bye
ASKER
Yes please
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
It is not reading the second line, just first line and stops.
Anyway this is very good and working excellent.
Anyway this is very good and working excellent.
ASKER
Excellent coding standard displayed and well done in a very less time. Great
"It is not reading the second line, just first line and stops."
The textfile?
The textfile?
Hi Mara...
Replace the "RunThis" procedure with this code.
Replace the "RunThis" procedure with this code.
Sub RunThis()
Dim Req As WinHttpRequest, PageNum As Integer, URL As String, HasData As Boolean, RetData()
Dim RowsCnt As Long, TxtFileName As String, fso As New FileSystemObject, ts As TextStream
TxtFileName = Application.GetOpenFilename("Text Files (*.txt), *.txt")
If TxtFileName = "False" Then Exit Sub
Set ts = fso.GetFile(TxtFileName).OpenAsTextStream
Set Req = New WinHttpRequest
Do Until ts.AtEndOfStream
URL = ts.ReadLine
PageNum = 0
HasData = True
Do Until Not HasData
PageNum = PageNum + 1
URL = Left(URL, InStr(URL, "?page=") + 5) & CStr(PageNum) & Right(URL, Len(URL) - InStr(InStr(URL, "?page="), URL, "&") + 1)
RetData = Request(URL, Req, HasData)
If HasData Then
With Worksheets
With .Add(, .Item(.Count))
.Cells(1) = "Image Source"
.Cells(2) = "Item Name"
.Cells(3) = "Price"
If UBound(RetData, 2) < 2 Then
ReDim Preserve RetData(1 To UBound(RetData, 1), 1 To 2)
End If
RetData = Application.Transpose(RetData)
.Cells(2, 1).Resize(UBound(RetData, 1), UBound(RetData, 2)) = RetData
With .Columns("A:C")
.Item(3).NumberFormat = "$#,##0.00"
.EntireColumn.AutoFit
End With
End With
End With
Else
Exit Do
End If
Loop
Loop
ts.Close
Set Req = Nothing
End Sub
ASKER
Everything Good and working fine. thanks again Tom. Bye