• Status: Solved
  • Priority: Medium
  • Security: Public

Web Content Scraper with Graphic Interface

I've been researching tools and services that will allow me to gather publicly posted data and store it in a database.  I'm trying to do this so I can index the data and create tools to search the data more efficiently.  

The problem is that all the tools I find require a fairly technical user to set up the "scraping" or extraction.  I am looking for a tool that business users can configure to extract the data.  I'm currently meeting with several software-as-a-service providers to see what they can offer, but I'd like to be able to run this process internally.

If I used a service to collect this data, they would have a copy and might be able to compete with my business. The perfect solution would have three key features.

1. Be able to extract data from HTML, PDF, XLS, and image data (OCR functionality).
2. Be easy enough to configure that it wouldn't require a programmer or equally talented user to set up the extract configuration.
3. Output directly to SQL, or to a common intermediate file that could be imported into SQL via an SSIS package.

Hopefully this solution already exists; if not, then SaaS will have to be the way to go.  I look forward to any help you can provide.
2 Solutions
Michel Plungjan (IT Expert) commented:
Hi Shannon,

I think this is a request for software development, no?
If so, you may not get much help at EE, where we mostly answer specific questions about existing code rather than write (biggish) software from scratch.
Nenad Rajsic commented:
Just a thought.

Rather than contacting SaaS companies and developing things from scratch, why not contact one of the developers who already build content scrapers and ask them to build something for you? It should be easy for them and cheap for you.
James Murrell (Product Specialist) commented:
Unsure, but a while back someone recommended http://www.pixieware.com/ for a project like this.
Shannon_Lowder (Author) commented:
mplungjan -- I was wanting to make sure something hadn't already been built before I commission new work.

vukovarcan -- I hadn't considered that.  I'll definitely consider it now.

cs97jjm3 -- I'll check that site out now.

Sorry for the delay in my reply.
Shannon_Lowder (Author) commented:
Both are good suggestions.  I'll approach a few open source developers and see what they would think of extending their products.  I've also sent out a request for more information from pixieware.  It sounds like you still have to "code" a scrape configuration file.  I may be mistaken though.  Thank you all for contributing!
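
For readers wondering what a "coded" scrape configuration might look like in practice, here is a minimal sketch using only the Python standard library. It is not how pixieware or any other product mentioned in this thread works. The class names (`product`, `product-name`, `product-price`) and the function `scrape_to_csv` are hypothetical, chosen only for illustration. The idea is that a small mapping from output column to HTML class drives the extraction, and the result is written as CSV, one possible intermediate file that an SSIS package could import.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical scrape configuration: output column -> HTML class holding its value.
SCRAPE_CONFIG = {"name": "product-name", "price": "product-price"}
ROW_CLASS = "product"  # hypothetical class that marks the start of one record


class ConfigScraper(HTMLParser):
    """Collects one dict per record, driven entirely by the configuration."""

    def __init__(self, config, row_class):
        super().__init__()
        self.class_to_column = {cls: col for col, cls in config.items()}
        self.row_class = row_class
        self.rows = []
        self.current_column = None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.row_class in classes:
            self.rows.append({})  # a new record begins
        for cls in classes:
            if cls in self.class_to_column and self.rows:
                self.current_column = self.class_to_column[cls]

    def handle_endtag(self, tag):
        self.current_column = None  # stop capturing text at any closing tag

    def handle_data(self, data):
        if self.current_column and data.strip():
            self.rows[-1][self.current_column] = data.strip()


def scrape_to_csv(html, config=SCRAPE_CONFIG, row_class=ROW_CLASS):
    """Return CSV text with one line per record found in the HTML."""
    scraper = ConfigScraper(config, row_class)
    scraper.feed(html)
    scraper.close()
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(config), lineterminator="\n")
    writer.writeheader()
    writer.writerows(scraper.rows)
    return out.getvalue()
```

Even in this small form, someone has to know which HTML classes hold which values, which is the part a graphical point-and-click tool would need to hide from business users. PDF, XLS, and OCR sources would each need a separate extractor and are not covered by this sketch.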
Question has a verified solution.
