Solved

Scrape Data from a list of hyperlinks

Posted on 2016-11-10
3
29 Views
Last Modified: 2016-11-11
Hello Everyone,
I have the following routine.  The purpose is to scrape data from a list of web pages.  The links to these pages are listed in a column in the worksheet.  The code loops through each cell containing a URL, imports data from the associated web page to the worksheet, deletes some unwanted information, transposes the remaining data from rows to columns, and moves on to the next cell.  

The problem is the code keeps hanging and I'm not sure why.  The web links are valid.  I can manually import the data from the website but the code seems the go slower and slower until it finally grinds to a halt.

In the example file there are about 150 rows already done.  To test the code, select cell B150 and run the ScrapeCompanyInfo macro.  It will continue down the list until it gets stuck again.

I'm not experienced with querying web data using Excel.  Maybe there's better way to do it.  Any help would be appreciated.  

Thanks,
Kyle

Sub ScrapeCompanyInfo()
Dim rng As Range, rFind As Range, rEnd As Range
Dim xConnect As Object
Set rng = Selection
Do Until rng = ""
    With ActiveSheet.QueryTables.Add(Connection:="URL;" & rng, _
        Destination:=rng.Offset(0, 1))
        .Name = Cells(rng.Row, 1)
        .FieldNames = True
        .RowNumbers = False
        .FillAdjacentFormulas = False
        .PreserveFormatting = True
        .RefreshOnFileOpen = False
        .BackgroundQuery = True
        .RefreshStyle = xlInsertDeleteCells
        .SavePassword = False
        .SaveData = True
        .AdjustColumnWidth = True
        .RefreshPeriod = 0
        .WebSelectionType = xlEntirePage
        .WebFormatting = xlWebFormattingNone
        .WebPreFormattedTextToColumns = True
        .WebConsecutiveDelimitersAsOne = True
        .WebSingleBlockTextImport = False
        .WebDisableDateRecognition = False
        .WebDisableRedirections = False
        .Refresh BackgroundQuery:=False
    End With
    
    Set rFind = rng.Offset(0, 1).EntireColumn.Find(Cells(rng.Row, 1), , xlValues, xlWhole)
    Range(rng.Offset(0, 1), rFind.Offset(-1)).Delete xlShiftUp
    Set rEnd = Cells(Rows.Count, rng.Offset(0, 1).Column).End(xlUp)
    Range(rng.Offset(12, 1), rEnd).ClearContents
    Set rEnd = Cells(Rows.Count, rng.Offset(0, 1).Column).End(xlUp)
    With Range(rng.Offset(0, 1), rEnd)
        .Copy
        .Resize(1).Offset(0, 1).PasteSpecial Transpose:=True
        .ClearContents
    End With
    Set rng = rng.Offset(1)
    For Each xConnect In ActiveWorkbook.Connections
        If xConnect.Name <> "ThisWorkbookDataModel" Then xConnect.Delete
    Next xConnect
    ThisWorkbook.Save
Loop
End Sub

Open in new window

Equipment-Directory.xlsm
0
Comment
Question by:kgerb
  • 2
3 Comments
 
LVL 14

Accepted Solution

by:
Pierre Cornelius earned 500 total points
ID: 41883530
I have made some changes to your code and added some comments. The biggest change is that you were creating new connections during each loop whereas my code just creates one and reuses it to fetch the data. Also you were saving during each loop, rather just save at the end or even leave saving up to the user.

Also note that it will run until the first blank cell found and you have some blank rows in your spreadsheet so it won't run to the end. But I guess you know this already.

Try it out and let me know how it goes.
Sub ScrapeCompanyInfo()
    Dim rng As Range, rFind As Range, rEnd As Range
    Dim xConnect As Object
    Set rng = Selection
    
    Dim UrlFetcher As QueryTable
    Set UrlFetcher = ActiveSheet.QueryTables.Add(Connection:="URL;", Destination:=rng.Offset(0, 1))
    
    With UrlFetcher
        .Name = "scraper"
        .FieldNames = True
        .RowNumbers = False
        .FillAdjacentFormulas = False
        .PreserveFormatting = True
        .RefreshOnFileOpen = False
        .BackgroundQuery = False 'True - I chagned this to false, we don't want asynchronous queries
        .RefreshStyle = xlInsertDeleteCells
        .SavePassword = False
        .SaveData = False 'True - I changed this to false, we merely fetch and extract what we need so no need to save with spreadsheet
        .AdjustColumnWidth = False 'True - no need for this, we merely fetch and extract what we need so no need to adjust
        .RefreshPeriod = 0
        .WebSelectionType = xlEntirePage
        .WebFormatting = xlWebFormattingNone
        .WebPreFormattedTextToColumns = True
        .WebConsecutiveDelimitersAsOne = True
        .WebSingleBlockTextImport = False
        .WebDisableDateRecognition = False
        .WebDisableRedirections = False
        '.Refresh BackgroundQuery:=False This will happen in the loop...
    End With
    
    Do Until rng = ""
        With UrlFetcher
            .Connection = "URL;" & rng.Value
            .Destination = rng.Offset(0, 1)
            If .Refresh Then
                Set rFind = rng.Offset(0, 1).EntireColumn.Find(Cells(rng.Row, 1), , xlValues, xlWhole)
                Range(rng.Offset(0, 1), rFind.Offset(-1)).Delete xlShiftUp
                Set rEnd = Cells(Rows.Count, rng.Offset(0, 1).Column).End(xlUp)
                Range(rng.Offset(12, 1), rEnd).ClearContents
                Set rEnd = Cells(Rows.Count, rng.Offset(0, 1).Column).End(xlUp)
                With Range(rng.Offset(0, 1), rEnd)
                    .Copy
                    .Resize(1).Offset(0, 1).PasteSpecial Transpose:=True
                    .ClearContents
                End With
                
            End If
            Set rng = rng.Offset(1)
        End With
        
        'For Each xConnect In ActiveWorkbook.Connections
        '    If xConnect.Name <> "ThisWorkbookDataModel" Then xConnect.Delete
        'Next xConnect
        'ThisWorkbook.Save Don't save in the loop, save after looping or leave it up to the user...
    Loop
    For Each xConnect In ActiveWorkbook.Connections
        If xConnect.Name <> "ThisWorkbookDataModel" Then xConnect.Delete
    Next xConnect
End Sub

Open in new window

0
 
LVL 12

Author Comment

by:kgerb
ID: 41883775
Thanks Pierre,
It seems to be working better with your modifications.  It's still super slow though.  I'm going to try a different approach but you answered my question so thank you!

regards,
Kyle
0
 
LVL 14

Expert Comment

by:Pierre Cornelius
ID: 41883777
You're welcome.

The slow part is connecting to the web and loading the info. You can see this when stepping through the code in debug mode. The call to .Refresh takes some time...
0

Featured Post

Enabling OSINT in Activity Based Intelligence

Activity based intelligence (ABI) requires access to all available sources of data. Recorded Future allows analysts to observe structured data on the open, deep, and dark web.

Join & Write a Comment

What is a Form List Box? (skip if you know this) The forms List Box is the alternative to the ActiveX list box. If you are using excel 2007, you first make sure you have a developer tab (click the Orb)->"Excel Options"->Popular->"Show Developer tab…
INDEX and MATCH can be used to great effect to replace HLOOKUP and VLOOKUP as it does not have the limitation of needing the data to be sorted so that the reference value is in the first column or row. It also has the ability to perform a bi-directi…
Graphs within dashboards are meant to be dynamic, representing data from a period of time that will change each time the dashboard is updated with new data. Rather than update each graph to point to a different set within a static set of data, t…
This Micro Tutorial demonstrates in Microsoft Excel how to consolidate your marketing data by creating an interactive charts using form controls. This creates cool drop-downs for viewers of your chart to choose from.

762 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

20 Experts available now in Live!

Get 1:1 Help Now