  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 135
  • Last Modified:

Scrape Data from a list of hyperlinks

Hello Everyone,
I have the following routine.  The purpose is to scrape data from a list of web pages.  The links to these pages are listed in a column in the worksheet.  The code loops through each cell containing a URL, imports data from the associated web page to the worksheet, deletes some unwanted information, transposes the remaining data from rows to columns, and moves on to the next cell.  

The problem is the code keeps hanging and I'm not sure why.  The web links are valid.  I can manually import the data from the website, but the code seems to go slower and slower until it finally grinds to a halt.

In the example file there are about 150 rows already done.  To test the code, select cell B150 and run the ScrapeCompanyInfo macro.  It will continue down the list until it gets stuck again.

I'm not experienced with querying web data using Excel.  Maybe there's a better way to do it.  Any help would be appreciated.

Thanks,
Kyle

Sub ScrapeCompanyInfo()
Dim rng As Range, rFind As Range, rEnd As Range
Dim xConnect As Object
Set rng = Selection
Do Until rng = ""
    With ActiveSheet.QueryTables.Add(Connection:="URL;" & rng, _
        Destination:=rng.Offset(0, 1))
        .Name = Cells(rng.Row, 1)
        .FieldNames = True
        .RowNumbers = False
        .FillAdjacentFormulas = False
        .PreserveFormatting = True
        .RefreshOnFileOpen = False
        .BackgroundQuery = True
        .RefreshStyle = xlInsertDeleteCells
        .SavePassword = False
        .SaveData = True
        .AdjustColumnWidth = True
        .RefreshPeriod = 0
        .WebSelectionType = xlEntirePage
        .WebFormatting = xlWebFormattingNone
        .WebPreFormattedTextToColumns = True
        .WebConsecutiveDelimitersAsOne = True
        .WebSingleBlockTextImport = False
        .WebDisableDateRecognition = False
        .WebDisableRedirections = False
        .Refresh BackgroundQuery:=False
    End With
    
    Set rFind = rng.Offset(0, 1).EntireColumn.Find(Cells(rng.Row, 1), , xlValues, xlWhole)
    Range(rng.Offset(0, 1), rFind.Offset(-1)).Delete xlShiftUp
    Set rEnd = Cells(Rows.Count, rng.Offset(0, 1).Column).End(xlUp)
    Range(rng.Offset(12, 1), rEnd).ClearContents
    Set rEnd = Cells(Rows.Count, rng.Offset(0, 1).Column).End(xlUp)
    With Range(rng.Offset(0, 1), rEnd)
        .Copy
        .Resize(1).Offset(0, 1).PasteSpecial Transpose:=True
        .ClearContents
    End With
    Set rng = rng.Offset(1)
    For Each xConnect In ActiveWorkbook.Connections
        If xConnect.Name <> "ThisWorkbookDataModel" Then xConnect.Delete
    Next xConnect
    ThisWorkbook.Save
Loop
End Sub


Equipment-Directory.xlsm
kgerb
Asked:
1 Solution
 
Pierre Cornelius Commented:
I have made some changes to your code and added some comments. The biggest change: you were creating a new connection during each loop, whereas my code creates just one and reuses it to fetch the data. Also, you were saving during each loop; rather, save once at the end, or even leave saving up to the user.

Also note that it will run until the first blank cell is found, and you have some blank rows in your spreadsheet, so it won't run to the end. But I guess you know this already.

Try it out and let me know how it goes.
Sub ScrapeCompanyInfo()
    Dim rng As Range, rFind As Range, rEnd As Range
    Dim xConnect As Object
    Set rng = Selection
    
    Dim UrlFetcher As QueryTable
    Set UrlFetcher = ActiveSheet.QueryTables.Add(Connection:="URL;", Destination:=rng.Offset(0, 1))
    
    With UrlFetcher
        .Name = "scraper"
        .FieldNames = True
        .RowNumbers = False
        .FillAdjacentFormulas = False
        .PreserveFormatting = True
        .RefreshOnFileOpen = False
        .BackgroundQuery = False 'True - I chagned this to false, we don't want asynchronous queries
        .RefreshStyle = xlInsertDeleteCells
        .SavePassword = False
        .SaveData = False 'True - I changed this to false, we merely fetch and extract what we need so no need to save with spreadsheet
        .AdjustColumnWidth = False 'True - no need for this, we merely fetch and extract what we need so no need to adjust
        .RefreshPeriod = 0
        .WebSelectionType = xlEntirePage
        .WebFormatting = xlWebFormattingNone
        .WebPreFormattedTextToColumns = True
        .WebConsecutiveDelimitersAsOne = True
        .WebSingleBlockTextImport = False
        .WebDisableDateRecognition = False
        .WebDisableRedirections = False
        '.Refresh BackgroundQuery:=False - this will happen in the loop...
    End With
    
    Do Until rng = ""
        With UrlFetcher
            .Connection = "URL;" & rng.Value
            .Destination = rng.Offset(0, 1)
            If .Refresh Then
                Set rFind = rng.Offset(0, 1).EntireColumn.Find(Cells(rng.Row, 1), , xlValues, xlWhole)
                Range(rng.Offset(0, 1), rFind.Offset(-1)).Delete xlShiftUp
                Set rEnd = Cells(Rows.Count, rng.Offset(0, 1).Column).End(xlUp)
                Range(rng.Offset(12, 1), rEnd).ClearContents
                Set rEnd = Cells(Rows.Count, rng.Offset(0, 1).Column).End(xlUp)
                With Range(rng.Offset(0, 1), rEnd)
                    .Copy
                    .Resize(1).Offset(0, 1).PasteSpecial Transpose:=True
                    .ClearContents
                End With
                
            End If
            Set rng = rng.Offset(1)
        End With
        
        'For Each xConnect In ActiveWorkbook.Connections
        '    If xConnect.Name <> "ThisWorkbookDataModel" Then xConnect.Delete
        'Next xConnect
        'ThisWorkbook.Save Don't save in the loop, save after looping or leave it up to the user...
    Loop
    For Each xConnect In ActiveWorkbook.Connections
        If xConnect.Name <> "ThisWorkbookDataModel" Then xConnect.Delete
    Next xConnect
End Sub
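
If you do want the loop to continue past the blank rows mentioned above rather than stopping at the first empty cell, one option is to walk down to the last used row of the URL column and simply skip blanks. This is only a sketch of that change, not part of the posted solution; the processing step is elided:

```vb
Sub LoopPastBlanks()
    ' Sketch: iterate to the last used row in the URL column, skipping blank
    ' cells, instead of using Do Until rng = "" (which stops at the first blank).
    Dim rng As Range, lastRow As Long
    Set rng = Selection
    lastRow = Cells(Rows.Count, rng.Column).End(xlUp).Row
    Do While rng.Row <= lastRow
        If Len(rng.Value) > 0 Then
            ' ... fetch and process this URL as in the code above ...
        End If
        Set rng = rng.Offset(1)
    Loop
End Sub
```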


kgerb, Chief Engineer (Author) Commented:
Thanks Pierre,
It seems to be working better with your modifications.  It's still super slow though.  I'm going to try a different approach but you answered my question so thank you!

regards,
Kyle
Pierre Cornelius Commented:
You're welcome.

The slow part is connecting to the web and loading the info. You can see this when stepping through the code in debug mode. The call to .Refresh takes some time...
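
For anyone who lands here looking for the "different approach" mentioned above: a common alternative to QueryTables is to fetch the page text with a synchronous XMLHTTP request and parse out only the fields you need, which avoids Excel importing and formatting the entire page. This is a minimal sketch, not the accepted solution; the URL argument and any parsing are up to you:

```vb
Sub FetchPageText(ByVal url As String)
    ' Sketch: fetch a page with MSXML2 instead of a QueryTable.
    ' The response arrives as raw HTML text, so the caller must extract
    ' the wanted fields itself rather than letting Excel import the page.
    Dim http As Object
    Set http = CreateObject("MSXML2.XMLHTTP.6.0")
    http.Open "GET", url, False   ' False = synchronous, like Refresh BackgroundQuery:=False
    http.send
    If http.Status = 200 Then
        Debug.Print Left$(http.responseText, 200) ' preview the first 200 chars of HTML
    End If
End Sub
```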