Extract data from a specific website

Hi there,

I urgently need a VBScript that will extract data from the following factual website.

https://aDomain.com/data/X

X is a consective number from 0 to 200000

DATA TO BE EXTRACTED ARE THE FOLLOWINGS (For example, CompanyName1 is the data value, not a variable):
-CompanyName1
-CompanyName2
-CompanyName2City
-CompanyName2State
-CompanyName2Zip
-CompanyName2Country

FROM:
<div id="companyName_id" class="box">CompanyName1</div>
<label>Name:</label><div class="box">CompanyName2</div>
<label>City:</label><div class="box">CompanyName2City</div>
<label>State:</label><div class="box">CompanyName2State</div>
<label>Zip:</label><div class="box">CompanyName2Zip</div>
<label>Country:</label><div class="box">CompanyName2Country</div>

The loop is to be executed at randum times between .5 and 2 seconds.

The output I need is:
"CompanyName1","CompanyName2","CompanyName2City","CompanyName2State","CompanyName2Zip","CompanyName2Country"

The output file is to be in the same folder where the script is located, and with the name "data.csv".

Thanks for your help and your support.

Cheers,
Rene
LVL 10
ReneGeAsked:
Who is Participating?

[Product update] Infrastructure Analysis Tool is now available with Business Accounts.Learn More

x
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

Chris Raisin(Retired Analyst/Programmer)Commented:
I am looking into it...stand by

Cheers
Chris (Australia)
Gustav BrockCIOCommented:
You can modify this VBA code:
Private Declare Function URLDownloadToFile Lib "urlmon" Alias "URLDownloadToFileA" ( _
    ByVal pCaller As Long, _
    ByVal szURL As String, _
    ByVal szFileName As String, _
    ByVal dwReserved As Long, _
    ByVal lpfnCB As Long) _
    As Long


Public Function DownloadFile( _
    ByVal strURL As String, _
    ByVal strLocalFilename As String) _
    As Long
  
' Download file or page with public access from the web.
' 2004-12-17. Cactus Data ApS, CPH.

' Usage, download a file:
' lngRet = DownloadFile("http://www.databaseadvisors.com/Graphics/conf2002/2002ConferencePicsbySmolin/images/dba02smolin27.jpg", "c:\happybassett.jpg")
'
' Usage, download a page:
' lngRet = DownloadFile("http://www.databaseadvisors.com/conf2002/conf200202.asp", "c:\dbaconference.htm")

' Returns 0 if success, error code if not.
' Error codes:
' -2146697210 "file not found".
' -2146697211 "domain not found".

' Limitation.
' Does not check if local file was created successfully.

    Dim lngRetVal As Long
      
    lngRetVal = URLDownloadToFile(0, strURL & vbNullChar, strLocalFilename & vbNullChar, 0, 0)
    
    DownloadFile = lngRetVal
  
End Function

Open in new window

It should be quite easy to adopt for VBscript.

/gustav
Chris Raisin(Retired Analyst/Programmer)Commented:
Are the pages ending in "html" or "htm" on the website?
Also, is the data to be appended to the text file each time?
Determine the Perfect Price for Your IT Services

Do you wonder if your IT business is truly profitable or if you should raise your prices? Learn how to calculate your overhead burden with our free interactive tool and use it to determine the right price for your IT services. Download your free eBook now!

aikimarkCommented:
I get an SSL error message when I try to look at that domain.
Chris Raisin(Retired Analyst/Programmer)Commented:
Forget my earlier question....I am wrting code now and see what you are after.

Stand by....

Cheers Chris
ReneGeAuthor Commented:
Hi guys,

Thanks for helping.

By the way, the website I used in my example is a fake one.

Cheers
Gustav BrockCIOCommented:
That's what I thought with "a domain". But to clarify, you could use the normal "example.com".

/gustav
aikimarkCommented:
@ReneGe

What is the real URL?
Chris Raisin(Retired Analyst/Programmer)Commented:
Almost finished...just testing.

Cheers
Chris
Chris Raisin(Retired Analyst/Programmer)Commented:
Why is the loop run at random times (between 0.5 and 2 seconds?
(If for processing purposes by external processes on your PC, that can be handles by giving way to the processes using "DoEvents" call.)

If you have 200000 files, 0.5 seconds would not be enough time to process them I would think (or even 2 seconds!)

Why not just have it run continuously?
If it is to run continuously, we will need to put in a way of stopping the process (via keyboard input say by a button displayed on a small form).

Also, after going through loop once, surely you do not want the same data output appended to the ever growing output file....I gather you would want to write to the file only data that is not already in it.
This would mean scanning the output file prior to writing to it
to ensure no "doubling up"
If a record is to be deleted that couldnot be handled by this process
(unless there was a "Delete" glag of some sort placed in a html file,
and an "edit" wouldsimply produce a new record in the output.

I gather all "adds" and "edits" will need to be done by processing the output file rather than ging via the input files.

Please clarify...It is working but I need to know exactlyhow the porocessing of the output file is handled.

Also,(and this is important) is there only one record per html file or could there be multiple records?

Cheers
Chris (Australia - 4,53am and heavens.....am I tired! :-)
ReneGeAuthor Commented:
Hi Chris,

Thanks for all you help.

The delay between each execution is not that important.  This delay is more to be applied between each loop.

It is indeed favorable that the script runs continously, between 0 to 200000.  At 200000, it would stop executing automatically.  I will not need to stop it while it is runnung.  However, it would be nice that I see what number it is at.

All of the data needs to be output to the csv file.  However, you have a point.  That csv file will be huge.  Making a different csv file for every "companyName_id" should resolve this, where the file name would be the data contained in "companyName_id".  So therefore, double-ups are a GOOD thing for me.

No delete flag required.  However, I dont need the data to be output if the page does not contain the data.  For example: IF "%companyName%" == "" Dont export data and go to the next loop.

 If a record is to be deleted that couldnot be handled by this process
 (unless there was a "Delete" glag of some sort placed in a html file,
 and an "edit" wouldsimply produce a new record in the output.

I gather all "adds" and "edits" will need to be done by processing the output file rather than ging via the input files.
I dont understand what you mean here ;(

Please clarify...It is working but I need to know exactly how the porocessing of the output file is handled.
I'm not sure I understand what you are asking here.  The script loads the page, reads the data and exports it to the CSV file.

Also,(and this is important) is there only one record per html file or could there be multiple records?
Good question, there is indeed only one record per HTML pages.

 Cheers
Chris (Australia - 4,53am and heavens.....am I tired! :-)
Go to bed ;)
ReneGeAuthor Commented:
Hi Gustav Brock,

Thanks for sharing.  does it do what I need or it needs to be tailored?

Cheers,
Rene
Chris Raisin(Retired Analyst/Programmer)Commented:
I have the code working, but it is taking AGES to loop 20000 while allowing the PC to handle processes externally.
It currently works on 2 dummy files I set up at website "www.ecbc.net.au/data/" called "0.html" and "1.html"
(See code below)

The "if then" code could perhaps be handled differently, but it is easy to read and amend if required.

We could create an array of webpages by looping 20000 times (if that can be thought of as quick) and collecting
the names of the URLs that actually exist, then process the "If then" testing on each existing web page, but it will effectively be the same as testing if each page exists (one by one) then processing the existing ones. The way the code is at the moment, it saves having to visit the "site" file twice.

I have not put any 0.5 or 2 second timer in yet until I clarify exactly want you envisage the process will be.

I have put in "DoEvents" in the loop so the program does not "hog" all the processing when it runs.
Without that, your PC is locked within the loop and will not free up the PC until 20,000 is reached
(which takes ages).

It would be great if the loop cold be cut short when an input file is not found. This assumes all files are numered consecutively and none will be missing.  If that was the case it would stop as soon as a URL was invalid (not found).
As it is, it now loops 20000 times (even though there are only 2 files) and runs for far too long.
Unfortunately there is no quick way to test if a url exists or not - you have to attempt to connect to the URL first to see whether it exists (and connection attempts, whether successful or not,  slow things down)

I am now off to bed for a bit after about 5 hours work on this (which was interrupted by a visit to a scheduled 2 hour online seminar on MediaShout run at 3.00am (Melbourne time) in Los Angeles)....yawn  :-)


Chris (Australia 6.02am

Option Explicit

'In VBA editor you must include references to the following (under "Tools/References" in the VBA editor):
' 1. Visual Basic for Applications
' 2. Microsoft Excel 14.0 Object Library (or otehr Office library for the application which is running VBA)
' 3. Microsoft Internet Controls
' 4. Microsoft HTML Object Library

Public Sub PickUpData()
  Dim ie As Object
  Dim nLine As Integer
  Dim strLine As String
  Dim oDoc As HTMLDocument
  Dim strDivs() As String
  Dim strEle() As String
  Dim strSiteBase As String
  Dim strPage As String
  Dim nPage As Long
  Dim nDiv As Integer
  Dim nFile As Integer
  Dim StrFileOut As String
  Dim strUrl As String
  Dim strOutPut(6) As String
  Dim x As Integer
  
  Const CompanyName1 = 1
  Const CompanyName2 = 2
  Const CompanyName2City = 3
  Const CompanyName2State = 4
  Const CompanyName2Zip = 5
  Const CompanyName2Country = 6
  
  Set ie = New InternetExplorerMedium
  
  ie.Visible = False
  StrFileOut = "c:\data.csv"
  
  'strSiteBase = "https://aDomain.com/data/"
  
  'the following is a testing site that will work! (It only has file 0and 1 at the moment but more can be put there
  'upon request.
  strSiteBase = "http://www.ecbc.net.au/data/"
  nFile = FreeFile
  Open StrFileOut For Append As #nFile
  For nPage = 0 To 20000
    DoEvents 'to allow normal processes outside this program to run.
    strUrl = strSiteBase & CStr(nPage) & ".html"
    If URLExists(strUrl) Then
      ie.Navigate strUrl
      While ie.Busy
        DoEvents
      Wend
      If ie.LocationName = strUrl Then
        Set oDoc = ie.Document
        strPage = oDoc.documentElement.innerHTML
        strDivs = Split(strPage, "</div>")
        For nDiv = 0 To UBound(strDivs)
          strLine = LCase(strDivs(nDiv))
          If InStr(strLine, "companyname_id") > 0 Then
            strEle = Split(strLine, ">")
            strEle(UBound(strEle)) = Replace(strEle(UBound(strEle)), "</div", "")
            strOutPut(CompanyName1) = strEle(UBound(strEle))
          Else
            If InStr(strLine, "<label>name:</label>") > 0 Then
              strEle = Split(strLine, ">")
              strEle(UBound(strEle)) = Replace(strEle(UBound(strEle)), "</div", "")
              strOutPut(CompanyName2) = strEle(UBound(strEle))
            Else
              If InStr(strLine, "<label>city:</label>") > 0 Then
                strEle = Split(strLine, ">")
                strEle(UBound(strEle)) = Replace(strEle(UBound(strEle)), "</div", "")
                strOutPut(CompanyName2City) = strEle(UBound(strEle))
              Else
                If InStr(strLine, "<label>state:</label>") > 0 Then
                  strEle = Split(strLine, ">")
                  strEle(UBound(strEle)) = Replace(strEle(UBound(strEle)), "</div", "")
                  strOutPut(CompanyName2State) = strEle(UBound(strEle))
                Else
                  If InStr(strLine, "<label>zip:</label>") > 0 Then
                    strEle = Split(strLine, ">")
                    strEle(UBound(strEle)) = Replace(strEle(UBound(strEle)), "</div", "")
                    strOutPut(CompanyName2Zip) = strEle(UBound(strEle))
                  Else
                    If InStr(strLine, "<label>country:</label>") > 0 Then
                      strEle = Split(strLine, ">")
                      strEle(UBound(strEle)) = Replace(strEle(UBound(strEle)), "</div", "")
                      strOutPut(CompanyName2Country) = strEle(UBound(strEle))
                    End If
                  End If
                End If
              End If
            End If
          End If
        Next
        strLine = ""
        For nLine = 1 To 6
          strLine = strLine & strOutPut(nLine) & IIf(nLine < 6, ",", "")
        Next
        Print #nFile, strLine
      End If
    End If
NextFile:
  Next
  Close #nFile
  MsgBox "This will not be in final code" & vbCrLf & "Just here to let you know the process is finished!"
End Sub

Function URLExists(url As String) As Boolean
    Dim Request As Object
    Dim ff As Integer
    Dim rc As Variant
    On Error GoTo EndNow 'if error occurs, then site will be considered un-reachable at the moment
                         '(even if it does exist)
    Set Request = CreateObject("WinHttp.WinHttpRequest.5.1")
    With Request
      .Open "GET", url, False
      .send
      URLExists = InStr(LCase(.responseText), "unavailable link") = 0
    End With
    Set Request = Nothing
EndNow:
End Function

Open in new window

Chris Raisin(Retired Analyst/Programmer)Commented:
I just saw your few observations but I am too tired to fix the code just now...will get about 6 hours sleep before resuming. I guess having a differnt output file for each different company name (companyname1)

I gather by your notes that this is a "once off" procedure to gather all your data (in which case my mention of "edits" and "deletes" does not apply. I will amend the code when I get up to send to different output files, and put a few more dummy files on the "test" site with the campanyname1 appearing more than once.

I will need to tweak the code, then to look open the output file that applies to companyname1, rather than just
"data.csv"

I will also put in place some indiocator so you can see how far through the process it is.

Stand by.....
(I suyppose it is almost your bed time now!) :-)

Cheers
Chris  (Australia)....6.50am (snore)
ReneGeAuthor Commented:
Thanks Chris!

Go to bed!!

;)
Chris Raisin(Retired Analyst/Programmer)Commented:
I can't help myse;lf (I'm addicted).

I just changed the code a bit for the multiple outputs. Will look at the "progess" indicator later in the day (LOL)

I will definately go off now (at 65 years of age, these late nights are getting to me.......)

Cheers
Chris (Australia)....6.30am

Option Explicit

'In VBA editor you must include references to the following (under "Tools/References" in the VBA editor):
' 1. Visual Basic for Applications
' 2. Microsoft Excel 14.0 Object Library (or otehr Office library for the application which is running VBA)
' 3. Microsoft Internet Controls
' 4. Microsoft HTML Object Library

Public Sub PickUpData()
  Dim ie As Object
  Dim nLine As Integer
  Dim strLine As String
  Dim oDoc As HTMLDocument
  Dim strDivs() As String
  Dim strEle() As String
  Dim strSiteBase As String
  Dim strPage As String
  Dim nPage As Long
  Dim nDiv As Integer
  Dim nFile As Integer
  Dim StrFileOut As String
  Dim strUrl As String
  Dim strOutPut(6) As String
  Dim x As Integer
  
  Const CompanyName1 = 1
  Const CompanyName2 = 2
  Const CompanyName2City = 3
  Const CompanyName2State = 4
  Const CompanyName2Zip = 5
  Const CompanyName2Country = 6
  
  Set ie = New InternetExplorerMedium
  
  ie.Visible = False
  'StrFileOut = "c:\data.csv"
  
   
  'the following is a testing site that will work! (It only has file 0and 1 at the moment but more can be put there
  'upon request.
  'strSiteBase = "https://aDomain.com/data/"
  strSiteBase = "http://www.ecbc.net.au/data/"
  For nPage = 0 To 20000
    DoEvents 'to allow normal processes outside this program to run.
    strUrl = strSiteBase & CStr(nPage) & ".html"
    If URLExists(strUrl) Then
      ie.Navigate strUrl
      While ie.Busy
        DoEvents
      Wend
      If ie.LocationName = strUrl Then
        Set oDoc = ie.Document
        strPage = oDoc.documentElement.innerHTML
        strDivs = Split(strPage, "</div>")
        For nDiv = 0 To UBound(strDivs)
          strLine = LCase(strDivs(nDiv))
          If InStr(strLine, "companyname_id") > 0 Then
            strEle = Split(strLine, ">")
            strEle(UBound(strEle)) = Replace(strEle(UBound(strEle)), "</div", "")
            strOutPut(CompanyName1) = strEle(UBound(strEle))
          Else
            If InStr(strLine, "<label>name:</label>") > 0 Then
              strEle = Split(strLine, ">")
              strEle(UBound(strEle)) = Replace(strEle(UBound(strEle)), "</div", "")
              strOutPut(CompanyName2) = strEle(UBound(strEle))
            Else
              If InStr(strLine, "<label>city:</label>") > 0 Then
                strEle = Split(strLine, ">")
                strEle(UBound(strEle)) = Replace(strEle(UBound(strEle)), "</div", "")
                strOutPut(CompanyName2City) = strEle(UBound(strEle))
              Else
                If InStr(strLine, "<label>state:</label>") > 0 Then
                  strEle = Split(strLine, ">")
                  strEle(UBound(strEle)) = Replace(strEle(UBound(strEle)), "</div", "")
                  strOutPut(CompanyName2State) = strEle(UBound(strEle))
                Else
                  If InStr(strLine, "<label>zip:</label>") > 0 Then
                    strEle = Split(strLine, ">")
                    strEle(UBound(strEle)) = Replace(strEle(UBound(strEle)), "</div", "")
                    strOutPut(CompanyName2Zip) = strEle(UBound(strEle))
                  Else
                    If InStr(strLine, "<label>country:</label>") > 0 Then
                      strEle = Split(strLine, ">")
                      strEle(UBound(strEle)) = Replace(strEle(UBound(strEle)), "</div", "")
                      strOutPut(CompanyName2Country) = strEle(UBound(strEle))
                    End If
                  End If
                End If
              End If
            End If
          End If
        Next
        strLine = ""
        For nLine = 1 To 6
          strLine = strLine & strOutPut(nLine) & IIf(nLine < 6, ",", "")
        Next
        StrFileOut = Application.Path & "\" & strOutPut(CompanyName1) & ".csv"
        nFile = FreeFile
        Open StrFileOut For Append As #nFile 'if file does not exist it will be created, else it will append
        Print #nFile, strLine
        Close #nFile
      End If
    End If
NextFile:
  Next
  MsgBox "This will not be in final code" & vbCrLf & "Just here to let you know the process is finished!"
End Sub

Function URLExists(url As String) As Boolean
    Dim Request As Object
    Dim ff As Integer
    Dim rc As Variant
    On Error GoTo EndNow 'if error occurs, then site will be considered un-reachable at the moment
                         '(even if it does exist)
    Set Request = CreateObject("WinHttp.WinHttpRequest.5.1")
    With Request
      .Open "GET", url, False
      .send
      URLExists = InStr(LCase(.responseText), "unavailable link") = 0
    End With
    Set Request = Nothing
EndNow:
End Function

Open in new window

ReneGeAuthor Commented:
Hi aikimark,

I am sorry but I cannot disclose the web site address.

I know its not very practical but I'm not authorized to.  Also, in order to access it, a login is required.

Oops, I forgot to mention this!  I hope it will not be a problem.

Cheers
Chris Raisin(Retired Analyst/Programmer)Commented:
For testing purposes I am using my own website and I have created a subfolder on the site that contains just two test files (at the moment).

I run a Bridge club called ECBC and so I am taking advantahge of being able to post data to it as I wish.

The files are: www.ecbc.net.au/data/o.html  and
                       www.ecbc.net.au/data/1.html

The main website is www.ecbc.net.au and you can see my name there as owner/manager of the bridge club (privately owned)....I hope that is alright  with the Zone advisor :-)

Cheers
Chris
ReneGeAuthor Commented:
@aikimark:
-Following Chris' answer, I think I understand that there may be a concern.
-Please be assured that the data collected by this script is owned by my client.  We just dont have it in our database.

@Chris:
-I just realized that in "www.ecbc.net.au/data/o.html", we find the letter "o" not the number "0"

Cheers,
Rene
Chris Raisin(Retired Analyst/Programmer)Commented:
OK...I looked at it, and  filedefinately ends in "0.html" (i.e. the number zero). I just mistyped the link I gave you (sorry...sleepy eyes)

(Yes I am still awake)  :-)

Chris.....7.40am
Chris Raisin(Retired Analyst/Programmer)Commented:
OK...some changes to the code to incorporate a (sort of) front end, which is just asmall form advising progress and allowing you to interrupt the procedure (interrupting will stop reading the files and close down the macro).
I also have made provision so that data is not duplicated in the output files (wouldhappen with multiple runs).

I have placed all the code in an Excel spreadsheet (attached)

The bulk of the actual processing code is contained in the UserForm (just double click on the form itself within the VBA editor in Excel to see the code). You will need to ensure you have "Developer" activated in your Excel workbooks menu and then simply click on the "VBA Editor" (or right mouse click on the macro showing under "macros" and select "Edit").

If you have any trouble doing this I can put together a video tutorial if you wish. (I would incorprate a walkthrough in the code as well if you like).

My biggest concern is the amount of time it will take to process the 20000 files. If this is a once only event then there is no real problem, but if you intend to have this as a repeating process, be aware it will take a while.

Give it a test and let me know how it goes.
Do you want me to put a lot more data into the "test" area (www.ecbc.net.au/data")?

For the sake of others reading this feed, I also display the now amended code below as well as
a few graphics showing the panel which displays (UserForm1).

'The properties of UserForm1 is as follows:
The properties of UserForm1 when it is created by the VBA runtime

The "Progress Indicator" (for want of a better name) looks like the following:

   1. On startup (centered in the PC screen, but you can move it around to get it out of the way)
        Panel when program (macro) first starts
   2. During processing (title changes to show file being processed)
       During processing, file names are displayed in title line
   3. At end of processing, panel will re-appear......at centre of screen...I think  :-)....and report total number of files
      processed
      Panel displayed at end of processing
Cheers
Chris

Code in Module1 (the start of the macro)
Option Explicit   'always have this command active to pick up undeclared variables and avoid errors

'In VBA editor you must include references to the following (under "Tools/References"):
' 1. Visual Basic for Applications
' 2. Microsoft Excel 14.0 Object Library (or other Office library for the application which is running VBA)
' 3. Microsoft Internet Controls
' 4. Microsoft HTML Object Library

'Make sure you change the folder designations for strPathOut in the code to your desired destination
'within the code contained in the form UserForm1 (double click on the form module within the VBA editor)

Public Sub PickUpData()  
   'This is the macro that will be visible in your VBA environment (e.g. Excel)
    UserForm1.Show           'Wow! Just one line!.....well not really :-)
End Sub

Open in new window


Code held internally in the UserForm1 object
Option Explicit
Private Sub cmdActivate_Click()
   If cmdActivate.Caption = "Start" Then
     cmdActivate.Caption = "STOP"
     StartPickup
   Else
     cmdActivate.Caption = "Start"
   End If
End Sub
Private Sub cmdClose_Click()
   Unload Me
   End
End Sub

Private Sub StartPickup()
  Dim ie As Object
  Dim nLine As Integer
  Dim strLine As String
  Dim oDoc As HTMLDocument
  Dim strDivs() As String
  Dim strEle() As String
  Dim strSiteBase As String
  Dim strPage As String
  Dim nPage As Long
  Dim nDiv As Integer
  Dim nFile As Integer
  Dim strPathOut As String
  Dim StrFileOut As String
  Dim strUrl As String
  Dim strOutPut(6) As String
  Dim nProcessed As Integer
  Dim x As Integer
  Dim bPrintNew As Boolean
  Dim strReadIn As String
  
  Const CompanyName1 = 1
  Const CompanyName2 = 2
  Const CompanyName2City = 3
  Const CompanyName2State = 4
  Const CompanyName2Zip = 5
  Const CompanyName2Country = 6
  
  Set ie = New InternetExplorerMedium
  
  ie.Visible = False
     
  'the following is a testing site that will work! (It only has a few files at the moment but more can be put there
  'upon request. Each file must have a name format like: ######.html, where ###### is a numeric ranging from 0 to 
  '20000 (can be single digit, two digits etc. up to 6 digits
  
  'Amend to show the site where you want to search
  'Your processed files will also be stored in the folder where your macro resides.
  
  'strSiteBase = "https://aDomain.com/data/"
  strSiteBase = "http://www.ecbc.net.au/data/"
  
  'Change this to be the folder where you want the "csv" files stored
  'It is best NOT to be in the same folder where this macro resides!
  strPathOut = "H:\########\"
  If Dir(strPathOut, vbDirectory) = "" Then
     strPathOut = Application.Path  'the folder where thismacro resides!
  End If
  For nPage = 0 To 20000
    If UserForm1.cmdActivate.Caption = "Start" Or UserForm1.Visible = False Then
      'user has clicked button to stop process or has closed the box.
      Exit For
    End If
    DoEvents 'to allow normal processes outside this program to run.
    strUrl = strSiteBase & CStr(nPage) & ".html"
    UserForm1.Caption = "Processing: " & strUrl
    If URLExists(strUrl) Then
      ie.Navigate strUrl
      'while the browser is loading we let the operating system have some say in running other things
      While ie.Busy
        DoEvents
      Wend
      If ie.LocationName = strUrl Then
        Set oDoc = ie.Document
        strPage = oDoc.documentElement.innerHTML
        strDivs = Split(strPage, "</div>")
        For nDiv = 0 To UBound(strDivs)
          strLine = LCase(strDivs(nDiv)) 'convert all to lowercase in case the text case changes in the html files being scanned.
          If InStr(strLine, "companyname_id") > 0 Then
            strEle = Split(strLine, ">")
            strEle(UBound(strEle)) = Replace(strEle(UBound(strEle)), "</div", "")
            strOutPut(CompanyName1) = strEle(UBound(strEle))
          Else
            If InStr(strLine, "<label>name:</label>") > 0 Then
              strEle = Split(strLine, ">")
              strEle(UBound(strEle)) = Replace(strEle(UBound(strEle)), "</div", "")
              strOutPut(CompanyName2) = strEle(UBound(strEle))
            Else
              If InStr(strLine, "<label>city:</label>") > 0 Then
                strEle = Split(strLine, ">")
                strEle(UBound(strEle)) = Replace(strEle(UBound(strEle)), "</div", "")
                strOutPut(CompanyName2City) = strEle(UBound(strEle))
              Else
                If InStr(strLine, "<label>state:</label>") > 0 Then
                  strEle = Split(strLine, ">")
                  strEle(UBound(strEle)) = Replace(strEle(UBound(strEle)), "</div", "")
                  strOutPut(CompanyName2State) = strEle(UBound(strEle))
                Else
                  If InStr(strLine, "<label>zip:</label>") > 0 Then
                    strEle = Split(strLine, ">")
                    strEle(UBound(strEle)) = Replace(strEle(UBound(strEle)), "</div", "")
                    strOutPut(CompanyName2Zip) = strEle(UBound(strEle))
                  Else
                    If InStr(strLine, "<label>country:</label>") > 0 Then
                      strEle = Split(strLine, ">")
                      strEle(UBound(strEle)) = Replace(strEle(UBound(strEle)), "</div", "")
                      strOutPut(CompanyName2Country) = strEle(UBound(strEle))
                    End If
                  End If
                End If
              End If
            End If
          End If
        Next
        strLine = ""
        For nLine = 1 To 6
          strLine = strLine & strOutPut(nLine) & IIf(nLine < 6, ",", "")
        Next
        StrFileOut = strPathOut & "\" & strOutPut(CompanyName1) & ".csv"
        nFile = FreeFile
        If Dir(StrFileOut) <> "" Then
          'first see whether the line of data already exists in the file (we dont want duplicates)
          bPrintNew = False
          Open StrFileOut For Input As #nFile
          Do While Not EOF(nFile)
            DoEvents 'in case we have a big file we need to let the system outside the macro have a go!
            Line Input #nFile, strReadIn
            If LCase(strReadIn) = LCase(strLine) Then
               bPrintNew = False
               Exit Do
            End If
          Loop
        End If
        If bPrintNew Then
          nFile = FreeFile
          Open StrFileOut For Append As #nFile 'if file does not exist it will be created, else it will append
          Print #nFile, strLine
          Close #nFile
        End If
        nProcessed = nProcessed + 1  'keeping tab of number of files processed.
      End If
    End If
NextFile:
  Next
  If UserForm1.Visible Then
    UserForm1.Caption = "Total files processed: " & CStr(nProcessed)
    UserForm1.Show 'to bring it to the forefront
  Else
    End   'to make sure macro ends if user closed the userform by clicking on the "X" instead of using
          'STOP or Close buttons (naughty, naughty!) :-).
  End If
End Sub

Function URLExists(url As String) As Boolean
    Dim Request As Object
    On Error Resume Next 'if error occurs, then site will be considered un-reachable at the moment
                         '(even if it does exist)
    Set Request = CreateObject("WinHttp.WinHttpRequest.5.1")
    With Request
      .Open "GET", url, False
      .send
      If Err.Number = 0 Then
        URLExists = InStr(LCase(.responseText), "unavailable link") = 0
      End If
    End With
EndNow:
    Set Request = Nothing
End Function

Open in new window


File of Excel workbook containing all code and the UserForm
PickupData.xlsm

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
Gustav BrockCIOCommented:
I'm a bit confused. You asked for VBscript code, but now it is VBA and Excel. That makes a huge difference.
Could you clarify, please?

Also, I don't understand the delay. Wouldn't you just download the files at maximum speed?

Using URLDownloadToFile is in general 4-5 times faster than any other method, but it cannot do a login except, of course, if your account or a token somehow can be included in the URL.

/gustav
Chris Raisin(Retired Analyst/Programmer)Commented:
The original question asked for data to be extracted from website pages, not downloaded and then extracted (which actually would take longer). The web pages may not exist, which would add to problems in file transfer and retries by the downloader (upon failure) would add to time.

I overlooked the fact that VB script was requested, but since the output is in comma seperated values (.CSV) format, I assumed the originator could handle VBA. It can be converted to an executable of course in VB by using VBA references, therby avoiding the VBA or VBScript interface completely.

I assume that if VBScript MUST be used for some reason then the originator of the question should have no trouble in moving it into VBScript. (I never use it myself) :-)

Cheers
Chris (Australia)
aikimarkCommented:
@Chris
the originator of the question should have no trouble in moving it into VBScript. (I never use it myself)
1. That's quite an assumption of skill level.
2. If ReneGe did not know how to convert your VBA solution into (the required) VBScript, could you help him if you've never used VBScript?
Chris Raisin(Retired Analyst/Programmer)Commented:
I did not say I have never used it, I just say I dont use it now. Gustuv Brock made the same assumption with his response which supplied VBA code and stated "it should be quite easy to adopt for VBscript."

I will leave it up to Rene if he want to take me up on my offer to walk through the code for him.

Cheers
Chris
ReneGeAuthor Commented:
I would have prefered VBscript because I just started Learning it, and I know how to run it.  However VBA would be ok because the main focus here is to have it do it's thing.

You have spent much effort in helping me here, so if you just show me how to execute it, that would be great.

For the sake of Learning, I would open another question to convert the script to VBScript.

Thanks and cheers
Chris Raisin(Retired Analyst/Programmer)Commented:
OK...I will put together a video for you.
Will that be OK (provided you can understand my Australian accent)

Chris
NorieAnalyst Assistant Commented:
Just curious,  but how can any code/script be written/work without knowing the actual URL of interest?

I've done quite a bit of thus stuff myself in the past and one of the main things I've learned is there is no 'one size fits all' code.
NorieAnalyst Assistant Commented:
*thus = this.
ReneGeAuthor Commented:
Hi Chris,

I ran it as per your instructions.

Here is what i get.  See attachment.

Also, the data is not extracted and not either exported in the Excel spread sheet nor output to a csv file.

companyName_id field name is not displayed.

Am I missing something?

Thanks and cheers,
Rene
EE1.jpg
ReneGeAuthor Commented:
Hi Norie,

I understand your concern.  The fact the Chris created test pages on his website is good enough for me.

Anyway, my client's website is secured so he would not be able to test it there anyway.

Thanks and cheers,
Rene
ReneGeAuthor Commented:
Hi aikimark,

Thanks for your concerns.  As long it works, I'm fine.  Once this question is resolved, I'll create a new question to convert this script to VBscript just for the sake of Learning.

Thanks and cheers.
Chris Raisin(Retired Analyst/Programmer)Commented:
The browser should not be showing. I suspect that you are missing a reference in your VB Editor.

Can you hand five while Iput together this video for you?

Cheers
Chris
ReneGeAuthor Commented:
Hi Chris,

What does "hand five" mean?

Thanks and cheers,
Rene
aikimarkCommented:
@ReneGe

I assume he meant to type "hang five" - hang on about five time units (seconds, minutes, hours, days, etc.)
ReneGeAuthor Commented:
Makes sense :)
Chris Raisin(Retired Analyst/Programmer)Commented:
I meant to say "Hang Five"
It's an old surfing phrase (from the 1960's) which was when you came in on your baord on one foot
(hanging on with 5 toes I guess)!  (Well, that is my interpretation anyway....LOL)

It actually is slang in Australia for waiting for a small time  (like 5 minutes), although I better make it a bit longer, since it is now 10 minutes to 1 in the morning and I have to start teaching later today at 9.30am  I also teaching another  class in the afternoon at 1.30pm so I am not really free again until 4.00pm. (In about 15 hours). I have to sleep in that time too, so I hope the wait until then is OK. The video will probably run for about 5 minutes or so. It will be just showing how to run a macro within an Office VBA environment (Excel in this case).

If you want a detailed walkthrough the code for a learning experience I could do that, but it will take a bit more preparation and time.

Cheers
Chris
ReneGeAuthor Commented:
Hi Chris,

No worries.  Takes all the time you need!

Maybe I can spare you the video!

To run the script:
-I opened your xlsm file
-Disabled security features
-Enabled Developer Ribbon
-Click on Macros (Alt F8)
-Selected the macro, the click Run

Was there something else/more to do?

Thanks and cheers,
Rene
Chris Raisin(Retired Analyst/Programmer)Commented:
You must make sure you have added the references under "Tools/References" in the VBE menu
Read the code in module 1 to see what you need to add. You will need to include Microsoft Frms 2.0 Object library as well (I forgot to include that in the code because we were not using a form at that stage).

I suspect you already have the form reference added since a form has appeared.

I must be off nows orry...two nights without sleep would be too much.

Cheers
Chris
ReneGeAuthor Commented:
Hi Chris,

Microsoft Frms 2.0 Object library was indeed included.

Please find attached the selected references.

Sleep well mate :)

Cheers,
Rene
Screenshot-2015-03-14-17.45.21.png
Chris Raisin(Retired Analyst/Programmer)Commented:
Earlier you said "companyName_id field name is not displayed."

It is displayed: It is (in the sample caode) simply called "Company0".
Your html code does not have a Label tag before CompanyName_id.

It just does not have a label (as indicated by your data definition).

I will put a few more items up with more realistic data items.
(Why is your browser showing in the screen capture? It should not appear when you run the macro).
ReneGeAuthor Commented:
Oh I think I should feel stupid right now.
I'm currently in bed.
I'll review tomorrow Sunday and update you.
No need to add more data.

Cheers
ReneGeAuthor Commented:
Hi Chris,

Sorry for my very late follow-up.

I tried a few times and I am still am unable to generate the csv file, containing the required data.

I value all the time, effort, dedication and heart you invested in this project.

If you feel that we are almost there, please let me know.  We'll continue from here.

Or else, I will not mind creating a new question and be more precise in the requirements.

Don't worry, I'll be honored to give you all the credits for your contributions and efforts.

What do you think?

Thanks for everything :)

Cheers,
Rene
Chris Raisin(Retired Analyst/Programmer)Commented:
Let's stick with it since it is generating at my end.

Where are you having errors (if any) or explain exactly what you are doing, with what output (if any).
I think we are almost there :-)

Cheers
Chris
ReneGeAuthor Commented:
Hi Chris,

Sorry for taking so long to reply.

I should come back to this question in a few days.

Cheerz
ReneGeAuthor Commented:
Hi Chris,

I will close this question and give you the points.

I'll create a new one when ready to conclude it and refer to this one.

I'll let you know through this one when.

Thanks and cheers,
Rene
Chris Raisin(Retired Analyst/Programmer)Commented:
OK, thanks Rene

Cheers
Chris
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
VB Script

From novice to tech pro — start learning today.