Collecting data from the web

Hi, anyone have any suggestions for collecting data from the web? I want to collect layman and/or professional discussions on illnesses that people are having. There are lots of discussion groups that have this kind of information and there are probably lots of other sites as well. I want to, for starters, try to collect all the text in a discussion group thread and consider that one item, then collect all the text in the next thread and consider that as item 2 and continue this to build up a large database of say 10,000 or 20,000 items/threads.
I will need some way to collect the information and some way to store it, with each thread/item being a row or case in the database.
Does anyone have any recommendations about how to do this or where to start?

Thanks very much!
onyourmark Asked:
Kshitij Ahuja, Technology Developer, Commented:
Frankly, I have never used Google Groups, so I have no idea about that. But if they provide a feed for the latest topics, it should not take much more to provide the archives as well. I think if you talk to coders, they will find a way to get your archives into RSS and ultimately into an Excel file.
 
Kshitij Ahuja, Technology Developer, Commented:
If I understand your requirement correctly, you want to aggregate the content from various threads and create your own knowledge base.

1. Search for threads that provide RSS feeds.
2. Write a script in PHP, ASP, or whatever dynamic language you prefer that creates a new thread on your site from each item in those feeds. That way you can have your own database of useful threads from different sites.
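
A minimal sketch of those two steps, written in Python rather than PHP and using only the standard library. It assumes the feeds are plain RSS 2.0, and that one feed item corresponds to one thread; in practice the description field often holds only the first post or a summary, so the full thread text may need a separate fetch. The table name `threads`, the function names, and the sample feed are all hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET


def parse_rss_items(rss_xml):
    """Return a list of (link, title, text) tuples from an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        title = (item.findtext("title") or "").strip()
        text = (item.findtext("description") or "").strip()
        if link:  # the link is used as the unique key, so skip items without one
            items.append((link, title, text))
    return items


def store_threads(conn, source, items):
    """Insert one row per thread; rows whose link is already stored are skipped."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS threads ("
        "link TEXT PRIMARY KEY, source TEXT, title TEXT, body TEXT)"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO threads (link, source, title, body) "
        "VALUES (?, ?, ?, ?)",
        [(link, source, title, text) for link, title, text in items],
    )
    conn.commit()


# Hypothetical sample feed; a real one would be downloaded first,
# for example with urllib.request.urlopen(feed_url).read().
SAMPLE_FEED = """<rss version="2.0"><channel><title>Example health forum</title>
<item><title>Migraine triggers</title><link>http://example.com/t/1</link>
<description>Anyone else get migraines after too little sleep?</description></item>
<item><title>Knee pain when running</title><link>http://example.com/t/2</link>
<description>My knee hurts after about 5 km.</description></item>
</channel></rss>"""

conn = sqlite3.connect(":memory:")  # pass a file path instead to keep the data
store_threads(conn, "example-forum", parse_rss_items(SAMPLE_FEED))
# Polling the same feed again adds nothing, because the links are already stored.
store_threads(conn, "example-forum", parse_rss_items(SAMPLE_FEED))
thread_count = conn.execute("SELECT COUNT(*) FROM threads").fetchone()[0]
print(thread_count)
```

Keying each row on the link means the script can be run on a schedule, so the database keeps growing even though a feed only ever lists its most recent items.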

However, taking feeds and presenting them to your users might well be copyright infringement. If you only plan to use the data for your personal use, there should be no problem at all.

And if you want an easier solution, specifically for personal use, use an RSS aggregator such as Google Reader and subscribe to the various threads. However, you will not be able to create your own database that way.


 
onyourmark (Author) Commented:
Hi, thanks for the lead. Actually, what I want is to do data mining on the database that I construct. Do you know if Google Groups has RSS feeds, and would I be able to collect older postings or only new postings (in other words, does an RSS feed pick up only the latest posts, or can it go back to old ones as well)?
Can I get these feeds into something like a spreadsheet with one row per thread (Excel 2007 has greatly increased its row and column limits; it used to allow only 256 columns, but now I think it is 16,384 columns and about a million rows)?
Thanks
 
Kshitij Ahuja, Technology Developer, Commented:
>>Do you know if Google Groups has RSS feeds, and would I be able to collect older postings or only new postings (in other words, does an RSS feed pick up only the latest posts, or can it go back to old ones as well)?

This should make it clear:
http://groups.google.com/support/bin/answer.py?answer=46384&query=feed&topic=&type=

>>Can I get these feeds into something like a spreadsheet with one row per thread (Excel 2007 has greatly increased its row and column limits; it used to allow only 256 columns, but now I think it is 16,384 columns and about a million rows)?

I am not much of an Excel guy, but I do know PHP. With that, I know you can fetch the feeds into a PHP page, and once they are there you can export the data to Excel very easily. You can have this coded by any PHP coder for a small fee; there are many sites for this. However, if you wish to do it yourself, you can visit the PHP section and ask the experts about it.
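
As a rough illustration of the export step, here is a sketch in Python (not PHP) that writes one thread per row to a CSV file, which Excel can open directly. The column names and sample rows are hypothetical and stand in for items already fetched from the feeds.

```python
import csv
import io


def threads_to_csv(rows, out):
    """Write (source, title, body) tuples as CSV, one thread per row."""
    writer = csv.writer(out)
    writer.writerow(["source", "title", "body"])
    for source, title, body in rows:
        # Collapse line breaks and runs of spaces so each thread stays on one line.
        writer.writerow([source, title, " ".join(body.split())])


# Hypothetical rows standing in for fetched feed items.
rows = [
    ("example-forum", "Migraine triggers", "Too little sleep?\nOr caffeine, maybe."),
    ("example-forum", "Knee pain", 'Hurts after 5 km, "runner\'s knee"?'),
]

buffer = io.StringIO()
threads_to_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)

# To save a file for Excel instead of printing:
# with open("threads.csv", "w", newline="", encoding="utf-8-sig") as f:
#     threads_to_csv(rows, f)
```

The `utf-8-sig` encoding adds a byte order mark, which helps Excel detect UTF-8 text. Note that an Excel cell holds at most 32,767 characters, so very long threads may be better kept in a database and only summarized in the spreadsheet.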

Thanks
 
onyourmark (Author) Commented:
Hi, and thanks AGAIN! Thanks for the tip about loading the RSS into a web page and then exporting it to Excel. That sounds like it would work.
I also looked at the page you pointed me to, "Can I have an RSS feed for my group activity?"
I checked a group and clicked on "View all available feeds (RSS and Atom)". It had a link for "50 new topics". I guess that is a lot, but do you know if it is possible to go back further than that?
Question has a verified solution.
