I've been researching tools and services that will allow me to gather publicly posted data and store it in a database. I'm trying to do this so I can index the data and create tools to search the data more efficiently.
The problem is that all the tools I find require a fairly technical user to set up the "scraping" or extraction. I am looking for a tool that business users can configure to extract the data. I'm currently meeting with several software-as-a-service (SaaS) providers to see what they can offer, but I'd like to be able to run this process internally.
If I used a service to collect this data, they would have a copy and could use it to compete with my business. The perfect solution would have three key features:
1. Be able to extract data from HTML, PDF, XLS, and image data (OCR functionality).
2. Be easy enough to configure that it wouldn't require a programmer or equally talented user to set up the extract configuration.
3. Output directly to SQL or to a common intermediate file that could be imported into SQL via an SSIS package.
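To make feature 3 concrete, here is a minimal sketch of the kind of intermediate file I have in mind, assuming the source is a simple HTML table and the target is a CSV flat file. The names `TableRows`, `html_table_to_csv`, and `SAMPLE_HTML` are hypothetical and the sample data is made up; this uses only the Python standard library and is an illustration of the output format, not the tool I'm asking for.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of every <tr> in an HTML fragment (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row currently being read, or None
        self._cell = None  # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_csv(html):
    """Return the table rows found in `html` as CSV text."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


# Made-up sample input, for illustration only.
SAMPLE_HTML = """
<table>
  <tr><th>name</th><th>city</th></tr>
  <tr><td>Acme Supply</td><td>Dayton, OH</td></tr>
</table>
"""

if __name__ == "__main__":
    print(html_table_to_csv(SAMPLE_HTML), end="")
```

A file like this could then be picked up by an SSIS flat file source. The hard part, and the reason for this question, is letting a business user define the extraction without writing anything like the above.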
Hopefully a solution like this already exists. If not, SaaS will have to be the way to go. I look forward to any help you can provide.