Collecting medical symptoms from Google Groups via RSS

Hi, I have two tasks I am trying to accomplish. Both involve what I guess is referred to as scraping.
First, I want to collect data from Google Groups related to illnesses. There are many groups where people discuss their illnesses, and I want to build a database consisting of the words in these groups. By a database I only mean a spreadsheet in which the columns are the words (every word appearing in a thread would be a column heading) and each row is a separate thread (or posting). I was told that using RSS would be a good idea because all Google Groups have RSS feeds.
Can anyone give me a roadmap for how to go about doing this?
Thanks so much.
onyourmark asked:
 
gemdeals395 commented:
That's cURL (http://us3.php.net/curl), used from PHP. When you drop that into your PHP page, you basically get all the HTML from the page in the $result variable in that example. With cURL you can POST variables to a page, gather the results, and do just about anything a user with a browser could. For the cookie jar, just drop a blank file in the web directory of your site or in a folder like /cookies/ and make the file readable and writable. Then, if the server tries to set cookies, cURL will save them there and pass them back to the server. Use CURLOPT_FOLLOWLOCATION to tell cURL whether to follow redirects or just return the response from the URL you requested.

For a group of words, you could put them in an array and loop through it using the matching method of your choice. Maybe something like:

foreach ($keywords as $keyword) {
    // strpos(haystack, needle) returns false when $keyword is not in $result
    if (strpos($result, $keyword) !== false) {
        echo "Found: $keyword\n";
    }
}

Then just adapt what happens on a match to what you need :)
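
If you want counts rather than a yes/no match, here is a rough sketch of the same loop using substr_count(). The sample text, the keyword list and the helper name count_keywords() are made up for illustration; in practice $result would hold whatever curl_exec() returned.

```php
<?php
// Hypothetical sample data standing in for the fetched page.
$result   = "I have had a headache and a fever. The headache is worse at night.";
$keywords = ["headache", "fever", "nausea"];

// Count case-insensitive occurrences of each keyword in the text.
function count_keywords(string $text, array $keywords): array
{
    $haystack = strtolower($text);
    $counts = [];
    foreach ($keywords as $keyword) {
        $counts[$keyword] = substr_count($haystack, strtolower($keyword));
    }
    return $counts;
}

$counts = count_keywords($result, $keywords);
foreach ($counts as $keyword => $n) {
    echo "$keyword: $n\n";
}
```

Note that substr_count() matches substrings, so "fever" would also be counted inside "feverish". If that matters, a preg_match_all() with word boundaries is probably the better choice.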
 
gemdeals395 commented:
Just use cURL to connect to the URL you want and get the response into a variable. Then you can search it for any keywords or information you want.

    $url = "http://www.url_to_parse.com";
    $cookie_jar = "/path/to/cookie.txt";                // must be readable and writable
    $ch = curl_init($url);                              // also sets CURLOPT_URL
    curl_setopt($ch, CURLOPT_HEADER, 1);                // include response headers in $result (0 = body only)
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);        // follow redirects
    curl_setopt($ch, CURLOPT_VERBOSE, 1);               // write debug information to STDERR
    curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_jar);   // cookies are saved here on close
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_jar);  // cookies are read from here
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);        // return the response instead of printing it
    curl_setopt($ch, CURLOPT_REFERER, "http://www.yoursite.com/index.php");
    $result = curl_exec($ch);
    curl_close($ch);

Now the response from the URL you requested is stored in $result. Just use preg_match, strpos or any other method to search the variable for the information you want. Hope that helps :)
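
For the original goal (one row per post, one column per word), here is a minimal sketch, assuming the feed is ordinary RSS 2.0 with <item> elements that carry a <title> and a <description>. The sample XML and the helper names word_counts() and feed_to_rows() are invented for illustration; a real feed would first be fetched into a variable with cURL, with CURLOPT_HEADER set to 0 so the HTTP headers are not mixed into the XML.

```php
<?php
// Invented sample feed; a real one would be downloaded first.
$rss = <<<XML
<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example group</title>
  <item><title>Migraine</title><description>Headache and nausea again</description></item>
  <item><title>Flu</title><description>Fever, headache, fever</description></item>
</channel></rss>
XML;

// Split text into lowercase words and count each one.
function word_counts(string $text): array
{
    $words = str_word_count(strtolower(strip_tags($text)), 1);
    return array_count_values($words);
}

// Build one array of word counts per <item> in the feed.
function feed_to_rows(string $xml): array
{
    $feed = simplexml_load_string($xml);
    if ($feed === false) {
        return [];
    }
    $rows = [];
    foreach ($feed->channel->item as $item) {
        $rows[] = word_counts($item->title . ' ' . $item->description);
    }
    return $rows;
}

$rows = feed_to_rows($rss);

// The column headings are all distinct words seen in any post.
$columns = [];
foreach ($rows as $row) {
    $columns = array_merge($columns, array_keys($row));
}
$columns = array_values(array_unique($columns));
sort($columns);

// Print CSV: a header line, then one line per post (0 where a word is absent).
// Words from str_word_count() contain only letters, apostrophes and hyphens,
// so no CSV quoting is needed here.
echo implode(',', $columns) . "\n";
foreach ($rows as $row) {
    $line = [];
    foreach ($columns as $word) {
        $line[] = $row[$word] ?? 0;
    }
    echo implode(',', $line) . "\n";
}
```

The printed output can be redirected to a .csv file and opened in a spreadsheet. With many posts the number of columns grows quickly, so you may want to drop very common words before building the header.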
 
onyourmark (author) commented:
Hi. THANKS!

Is that Python? Also, suppose I am not looking for any particular word but want to collect the entire feed into a database word for word. Could you say how to modify the code for that, or would I just skip preg_match, strpos and the other search methods?
Thanks again.
 
onyourmark (author) commented:
Thanks!