Web scraping with Excel VBA

Hi

I've just started experimenting with scraping sites using Excel VBA and was wondering whether there are any limitations to this method. For example, are there some types of dynamic pages that can't be scraped?

Many thanks
kenabbott asked:
Arno Koster commented:
The most common way of scraping is to create an Internet Explorer object and let it download a page. You can then process the contents and continue to the next page.
Most of the run time is normally spent waiting for Internet Explorer to load the pages, as the processing itself is rather quick.
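
As a rough illustration of the approach described above, here is a minimal sketch. It assumes a Windows machine where the Internet Explorer COM object can still be created; the function name FetchPageText is made up for this example.

```vba
Option Explicit

' Hypothetical helper: loads a page in a hidden Internet Explorer
' instance and returns the visible text of the document body.
Public Function FetchPageText(ByVal url As String) As String
    Dim ie As Object

    ' Late binding, so no library reference has to be set in the VBA editor.
    Set ie = CreateObject("InternetExplorer.Application")
    ie.Visible = False
    ie.navigate url

    ' Most of the run time is spent in this loop, waiting for the page.
    ' readyState 4 corresponds to READYSTATE_COMPLETE.
    Do While ie.Busy Or ie.readyState <> 4
        DoEvents
    Loop

    FetchPageText = ie.document.body.innerText

    ie.Quit
    Set ie = Nothing
End Function
```

A call such as Debug.Print FetchPageText("https://example.com") would print the page text to the Immediate window. A real scraper would probably also want a timeout in the wait loop.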

Three design questions will have to be answered:

 - which pages to scrape (do you have a list, or are you planning on recursively following all the hyperlinks in a scraped website?)
 - which information to save (how can you identify the information you are looking for, e.g. after an H1 heading, inside a paragraph with a specific style, etc.)
 - where to save the information (e.g. in an Excel worksheet, in a text file, in the debug output, etc.)
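
One possible set of answers to these three questions is sketched below. Everything specific in it is an assumption made for illustration: the URLs are taken from column A of the active sheet (starting in row 2), the information wanted is the first H1 heading of each page, and the result is saved in column B.

```vba
Option Explicit

' Hypothetical layout: URLs in column A (from row 2) of the active sheet;
' the first H1 heading of each page is written next to it in column B.
Public Sub ScrapeFirstHeadings()
    Dim ie As Object
    Dim headings As Object
    Dim ws As Worksheet
    Dim r As Long
    Dim lastRow As Long

    Set ws = ActiveSheet
    lastRow = ws.Cells(ws.Rows.Count, "A").End(xlUp).Row

    Set ie = CreateObject("InternetExplorer.Application")
    ie.Visible = False

    For r = 2 To lastRow
        ' Which page: the next URL from the list.
        ie.navigate CStr(ws.Cells(r, "A").Value)
        Do While ie.Busy Or ie.readyState <> 4   ' 4 = READYSTATE_COMPLETE
            DoEvents
        Loop

        ' Which information: the text of the first H1 element, if any.
        Set headings = ie.document.getElementsByTagName("h1")

        ' Where to save it: column B of the same row.
        If headings.Length > 0 Then
            ws.Cells(r, "B").Value = headings.Item(0).innerText
        Else
            ws.Cells(r, "B").Value = "(no H1 found)"
        End If
    Next r

    ie.Quit
    Set ie = Nothing
End Sub
```

Recursively following hyperlinks instead of using a fixed list would need extra bookkeeping (a queue of URLs and a record of pages already visited), which this sketch leaves out.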

I guess you will most likely be limited by either your imagination or the time needed to scrape the websites.
 
Norie (VBA Expert) commented:
I think it really depends on the page and how you are going about it.

One factor would be how dynamic the page is: is it refreshed every minute, every 10 seconds, etc.?

There's also a connection factor, which could affect the time it takes just to navigate to the page, how long the page takes to refresh, etc.
 
kenabbott (author) commented:
Does it make a difference whether the page uses GET or POST?
 
Norie (VBA Expert) commented:
I suppose GET might be quicker, since it usually only involves retrieving data, whereas POST is normally also used for other things, e.g. updating data.
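
From the scraper's side, the practical difference is mostly where the parameters go: in the URL's query string for GET, in the request body for POST. The sketch below shows both using the MSXML2.XMLHTTP object rather than Internet Explorer. The function names are made up for this example, and it assumes the target page accepts ordinary URL-encoded form data.

```vba
Option Explicit

' Hypothetical helper: GET request. Any parameters are part of the URL,
' e.g. "https://example.com/search?q=morocco".
Public Function HttpGet(ByVal url As String) As String
    Dim http As Object
    Set http = CreateObject("MSXML2.XMLHTTP.6.0")
    http.Open "GET", url, False          ' False = synchronous call
    http.send
    HttpGet = http.responseText
End Function

' Hypothetical helper: POST request. Parameters travel in the request
' body, e.g. formData = "q=morocco&category=travel".
Public Function HttpPost(ByVal url As String, ByVal formData As String) As String
    Dim http As Object
    Set http = CreateObject("MSXML2.XMLHTTP.6.0")
    http.Open "POST", url, False
    http.setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
    http.send formData
    HttpPost = http.responseText
End Function
```

Both functions return the raw HTML as a string, so nothing that the page would build with script after loading is included.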
 
ScriptAddict commented:
I've found that some sites that use frames, especially Java-controlled frames, can be extremely difficult to scrape.

 
Norie (VBA Expert) commented:
Frames aren't too difficult if you can isolate the frame's target document.

Sometimes, if you do that, you can work with that document on its own.
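
A minimal sketch of that idea, assuming the outer page has already finished loading in an Internet Explorer object and that the frame's address is present in its src attribute (the function name FirstFrameUrl is made up for this example):

```vba
Option Explicit

' Hypothetical helper: returns the address of the first frame or iframe
' in the loaded page, or an empty string if the page has none.
Public Function FirstFrameUrl(ByVal ie As Object) As String
    Dim frameElements As Object

    Set frameElements = ie.document.getElementsByTagName("iframe")
    If frameElements.Length = 0 Then
        ' Fall back to old-style frameset pages.
        Set frameElements = ie.document.getElementsByTagName("frame")
    End If

    If frameElements.Length > 0 Then
        FirstFrameUrl = frameElements.Item(0).src
    End If
End Function
```

If the returned address is not empty, calling ie.navigate with it loads the frame's document by itself, so it can be scraped like any ordinary page. This won't help where the frame's content is filled in by script rather than loaded from its own address.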
 
ScriptAddict commented:
Next time I'm pulling my hair out I'm definitely posting here then ;)
 
Arno Koster commented:
It would probably be best to post here just before pulling your hair out ;-)
 
Arno Koster commented:
kenabbott, do you have a particular site that you are having problems with?
 
kenabbott (author) commented:
No, it's just a general question. I'm just interested in the limitations of scraping.
 
Norie (VBA Expert) commented:
It really is down to the site but it also depends on how much you want to persevere.

Another thing is how 'perfect' you want the 'final product' to be.

You could probably grab the raw data more quickly than you could pick out detailed information.

Once you've got that, you can manipulate it to get what you need.

All of that applies to the general case; a dynamic site/page brings a whole load of other considerations.
 
kenabbott (author) commented:
Actually, to be more specific, I am interested in sites with product data, e.g. scraping Amazon for details of all travel books on Morocco. (I should add that I don't intend to do that; I'm just using it as an example.)
 
Norie (VBA Expert) commented:
There are millions of sites with product data, all different.

The most important thing in all this is the site/page; that's what you've got to work with, after all. :)
Question has a verified solution.
