Recommendations for web scraping based on certain criteria?

Posted on 2011-10-29
Last Modified: 2013-11-18
Well, experts, I have a bit of a challenge ...

I'm unfamiliar with what is available in the data collection/web scraping arena (either free or chargeable). Bottom line, this is the type of information I need to get:

Photography-related websites or blogs (NOT photographers) which meet certain SEO criteria (a high volume of visitors would be a good example). I know there are numerous ways to gauge traffic (Alexa rank, backlinks, etc.). The ideal information, although I have no idea how it could be obtained, would be the number of visitors (monthly, annually, etc.). The other critical piece of information is an e-mail address by which I could contact each website or blog (typically found on most websites under one or more of the following categories: support, information, contact, etc.).

The ultimate goal is to assemble a list of at least several hundred websites that meet the criteria (I would hope several thousand is more realistic). I guess the minimum criteria would be: URL, brief website description, some indication of traffic rank, and e-mail address. The other criteria are harder to define for purposes of this post, but since I'm just trying to get a handle on this whole web scraping/data collection area, I don't want to muddy the waters with difficult-to-understand selection criteria.

I've done numerous searches, none of which turned up anything close to what I need. My hope is that someone at EE is aware of an online or standalone software package which could supply most or all of what I need. I suppose purchasing an e-mail list is another option; however, I have never done that either, so I don't know where to start.

I do not program, so any solution involving that would not work in my case. I also don't have the time or money to have custom programs developed to accomplish this (unless my idea of what it would require is much more than what it actually would take).

I can't help but believe that somewhere, someone has developed this type of software tool, but I have no idea where to even start looking. Any suggestions or guidance would be greatly appreciated. Thank you.

If anyone has a suggestion as to a better zone to identify, please let me know because I don't understand what half of the zones mean anyway.
Question by: photoman11
    LVL 26

    Accepted Solution

    You should be able to get everything except an email address from traffic information.  Companies that sell this sort of data (e.g. comScore) should have tools that let you search for companies in specific spaces - like photographic sites.  You should be able to use their tools to identify the top photographic sites on the web, with a link etc.

    Collecting more than that by automatic screen scraping would be a challenge.  Most web sites specifically work to block screen scraping of their email addresses.  There are a number of ways to block this and any major site will adopt one of them in order to stop spamming of their support/contact aliases.  (This is why so many sites use a form for contacting them rather than a visible email address).  So I suspect you'll be out of luck there.  Somebody may sell a database with that sort of contact info but now you're really surfing in the dark underbelly of the web.  Be VERY CAREFUL if you start down that road.  E.g. if you buy a list of email addresses using a credit card you should expect that card to immediately be resold and used fraudulently.
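
    To make the scraping problem concrete, here is a minimal sketch of what an automated tool can actually see on a contact page, using only the Python standard library.  The names ContactScanner and scan_contact_page are made up for this illustration and are not taken from any product.  It assumes you already have the page's HTML as a string, and it only finds addresses that appear as plain text or in a mailto: link.

```python
import re
from html.parser import HTMLParser
from urllib.parse import unquote, urlsplit

# Deliberately simple pattern; real-world addresses can be more varied.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class ContactScanner(HTMLParser):
    """Hypothetical helper: collects visible e-mail addresses and notes forms."""

    def __init__(self):
        super().__init__()
        self.emails = set()
        self.has_form = False

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            # A form usually means the site hides its address behind it.
            self.has_form = True
        elif tag == "a":
            href = dict(attrs).get("href") or ""
            if href.lower().startswith("mailto:"):
                # Drop any "?subject=..." part and decode %-escapes.
                address = unquote(urlsplit(href).path)
                if EMAIL_RE.fullmatch(address):
                    self.emails.add(address.lower())

    def handle_data(self, data):
        # Addresses written out as plain text in the page body.
        self.emails.update(m.lower() for m in EMAIL_RE.findall(data))


def scan_contact_page(html):
    """Return (sorted list of addresses found, whether a form was seen)."""
    scanner = ContactScanner()
    scanner.feed(html)
    scanner.close()
    return sorted(scanner.emails), scanner.has_form
```

    A page that only offers a contact form, or that writes its address as "info [at] example [dot] com", comes back with an empty list, which is exactly the situation described above.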


    Author Comment


    I think I understand what you're saying. However, I'm not sure how to get everything except the e-mail addresses. I looked at the comScore site and I couldn't find any category or product which correlated with what I am looking for. Do you know of any online or downloadable software products which do this?

    LVL 26

    Expert Comment

    I'm not aware of any specific products that do this - but it seems odd that sites which aggregate and sell traffic data (like comScore) wouldn't provide these sorts of search tools.  Seems like an obvious need if you're looking to identify traffic levels or competitors in a particular industry sector.  Did you try contacting them to make sure they can't meet this need?


    Author Comment


    I have not contacted them yet. I based my conclusion on going through their website and reviewing their products/services. But I will contact them to find out for sure. Thanks again.

    Author Comment

    Unfortunately, I was right about them. However, I did find somebody through oDesk who has experience working with the Firefox add-on SEOquake, which will provide most of the information I need… I think.
