Solved

web scraping with Excel VBA

Posted on 2011-10-17
Last Modified: 2012-05-12
Hi

I've just started experimenting with scraping sites using Excel VBA and was wondering whether there are any limitations to this method. For example, are there some types of dynamic pages that can't be scraped?

Many thanks
Question by:kenabbott
 
LVL 35

Expert Comment

by:Norie
ID: 36979904
I think it really depends on the page and how you are going about it.

One factor is how dynamic the page is: is it refreshed every minute, every 10 seconds, etc.?

There's also a connection factor, which affects how long it takes just to navigate to the page, how long the page takes to refresh, etc.
0
 

Author Comment

by:kenabbott
ID: 36979961
Does it make a difference if the page is using GET or POST?
0
 
LVL 35

Expert Comment

by:Norie
ID: 36980009
I suppose GET might be quicker, since it usually only involves retrieving data, whereas POST is normally also used for other things, e.g. updating data.
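
For a scraper, the practical difference is mostly in how the request is built. The following is a minimal sketch, assuming the `MSXML2.XMLHTTP.6.0` COM object is available on the machine; the URLs and the `q=morocco` parameter are placeholders, not a real site's interface. With GET the parameters travel in the URL, and with POST they travel in the request body.

```vba
Option Explicit

' Sketch only: the URLs and form field below are hypothetical placeholders.
Sub GetVersusPostSketch()
    Dim http As Object
    Set http = CreateObject("MSXML2.XMLHTTP.6.0")

    ' GET: the query is part of the URL itself.
    http.Open "GET", "https://www.example.com/search?q=morocco", False
    http.send
    Debug.Print "GET status:", http.Status, "length:", Len(http.responseText)

    ' POST: the same query is sent in the request body, so the
    ' Content-Type header must describe how the body is encoded.
    http.Open "POST", "https://www.example.com/search", False
    http.setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
    http.send "q=morocco"
    Debug.Print "POST status:", http.Status, "length:", Len(http.responseText)

    Set http = Nothing
End Sub
```

A page that is reached by POST cannot be fetched just by pasting its address, so the scraper has to reproduce the form fields the site expects.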
0

 
LVL 11

Expert Comment

by:ScriptAddict
ID: 36980715
I've found that some sites that use frames, especially those with Java-controlled frames, can be extremely difficult to scrape.

0
 
LVL 35

Expert Comment

by:Norie
ID: 36980738
Frames aren't too difficult if you can isolate the frame's target document.

Sometimes, if you do that, you can work with that document on its own.
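
One way to isolate a frame's target document is to read the frame's source URL from the outer page and then load that URL directly. The sketch below is an assumption about how that might look, not a tested recipe for any particular site; `FrameSourceUrls` is a hypothetical helper name, and `doc` is assumed to be the document object of a page already loaded in an Internet Explorer automation object.

```vba
Option Explicit

' Hypothetical helper: collects the source URLs of all <frame> and
' <iframe> elements in an already-loaded HTML document.
Function FrameSourceUrls(ByVal doc As Object) As Collection
    Dim result As New Collection
    Dim tagKind As Variant
    Dim frameEls As Object
    Dim src As String
    Dim i As Long

    For Each tagKind In Array("frame", "iframe")
        Set frameEls = doc.getElementsByTagName(CStr(tagKind))
        For i = 0 To frameEls.Length - 1
            src = frameEls.Item(i).src
            ' Frames filled in by script may have no source URL; skip them.
            If Len(src) > 0 Then result.Add src
        Next i
    Next tagKind

    Set FrameSourceUrls = result
End Function
```

Each URL in the returned collection can then be navigated to on its own, so the frame's content is scraped as an ordinary page. Frames whose content is generated entirely by script will not be reachable this way.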
0
 
LVL 11

Expert Comment

by:ScriptAddict
ID: 36980970
Next time I'm pulling my hair out I'm definitely posting here then ;)
0
 
LVL 19

Expert Comment

by:Arno Koster
ID: 36985482
It would probably be best to post here just before pulling your hair out ;-)
0
 
LVL 19

Expert Comment

by:Arno Koster
ID: 36985484
kenabbott, do you have a particular site that you have problems with?
0
 

Author Comment

by:kenabbott
ID: 36985492
No, it's just a general question. I'm just interested in the limitations of scraping.
0
 
LVL 35

Expert Comment

by:Norie
ID: 36985575
It really is down to the site, but it also depends on how much you want to persevere.

Another thing is how 'perfect' you want the 'final product' to be.

You can probably grab the raw data more quickly than you can pick out detailed information.

Once you've got that, you can manipulate it to get what you need.

All of that applies to scraping in general; a dynamic site/page brings a whole load of other considerations.
0
 

Author Comment

by:kenabbott
ID: 36985586
Actually, to be more specific, I am interested in sites with product data, e.g. scraping Amazon for details of all travel books on Morocco. (I should add that I don't intend to do that; I'm just using it as an example.)
0
 
LVL 19

Accepted Solution

by:
Arno Koster earned 2000 total points
ID: 36985680
The most common way of scraping is to open an Internet Explorer object and let it download a page. Then you can process the contents and continue to the next page.
Most of the time the Internet Explorer object will be waiting for pages to load, as the processing itself is rather quick.
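
As a rough illustration of that load-then-wait pattern, here is a minimal sketch. It assumes a machine where the `InternetExplorer.Application` COM object is still available, uses a placeholder URL, and adds a 30-second timeout that is my own choice rather than anything required.

```vba
Option Explicit

' Sketch only: late-bound Internet Explorer automation from Excel VBA.
Sub LoadPageSketch()
    Dim ie As Object
    Dim startTime As Single

    Set ie = CreateObject("InternetExplorer.Application")
    ie.Visible = False
    ie.navigate "https://www.example.com/"   ' placeholder URL

    ' Wait until the page has loaded (readyState 4 = complete).
    ' Timer resets at midnight, so this simple timeout is approximate.
    startTime = Timer
    Do While ie.Busy Or ie.readyState <> 4
        DoEvents
        If Timer - startTime > 30 Then Exit Do
    Loop

    ' The loaded document is now available for processing.
    Debug.Print ie.document.Title

    ie.Quit
    Set ie = Nothing
End Sub
```

The waiting loop is where most of the run time goes, which is why the processing step rarely needs optimising.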

Three design questions will have to be answered:

 - which pages to scrape (do you have a list, or are you planning on recursively following all hyperlinks in a scraped website?)
 - which information to save (how can you identify the information you are looking for, e.g. after an H1 heading, inside a paragraph with a specific style, etc.)
 - where to save the information (e.g. in an Excel worksheet, a text file, the debug prompt, etc.)

I guess you will most likely be limited by either imagination or the time needed to scrape websites.
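
To make the three questions concrete, here is a minimal sketch under stated assumptions: the pages to scrape are a list of URLs in column A of the active sheet (from row 2 down), the information to save is the text of the first H1 heading, and the result goes into column B of the same row. The sheet layout and the choice of H1 are illustrative assumptions only, and the sketch assumes the `InternetExplorer.Application` COM object is available.

```vba
Option Explicit

' Sketch only. Hypothetical layout: URLs in A2 downwards, results in column B.
Sub ScrapeHeadingsSketch()
    Dim ie As Object
    Dim headings As Object
    Dim ws As Worksheet
    Dim r As Long
    Dim lastRow As Long

    Set ws = ActiveSheet
    lastRow = ws.Cells(ws.Rows.Count, "A").End(xlUp).Row
    Set ie = CreateObject("InternetExplorer.Application")

    For r = 2 To lastRow
        ' Which pages: the next URL from the list.
        ie.navigate CStr(ws.Cells(r, "A").Value)
        Do While ie.Busy Or ie.readyState <> 4
            DoEvents
        Loop

        ' Which information: the first <h1> element, if there is one.
        Set headings = ie.document.getElementsByTagName("h1")

        ' Where to save: the cell next to the URL.
        If headings.Length > 0 Then
            ws.Cells(r, "B").Value = headings.Item(0).innerText
        Else
            ws.Cells(r, "B").Value = "(no h1 found)"
        End If
    Next r

    ie.Quit
    Set ie = Nothing
End Sub
```

Swapping the list for recursive link-following, or the worksheet for a text file, changes only one of the three marked steps.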
0
 
LVL 35

Expert Comment

by:Norie
ID: 36985692
There are millions of sites with product data, all different.

The most important thing in this is the site/page; that's what you've got to work with, after all. :)
0
