Solved

Scraping data from another website's HTML using PHP

Posted on 2010-08-25
Last Modified: 2013-11-15
Hi There,

I'm trying to scrape some data from an HTML table on the website of a local radio station. They have a list of recently played songs and I'd like to do some analytics on that data.

The page I'm trying to retrieve the data from is available here:

http://www.channel103.com/music/index.php?qty=100

Fortunately, the table is generated automatically and the number of songs it displays is taken from the qty value in the URL, so I have a potentially limitless dataset to work with (I've specified 100 songs as an example).

I'd eventually like to end up with the data from that table in an array or a MySQL database (I want the Time Played, Song and Artist information for every entry). However, I'm unsure how to go about getting that information (I'm new to PHP programming, but I understand most core programming concepts at least to a basic level).

I've played around with regular expressions and have managed to write a script that lists the currently playing song and artist. However, I've come to a standstill and can't work out where to go next. I've had a look around on the net and here on EE, and XPath seems to be a common route for similar problems, but I'm struggling to get to grips with it.

Here is the PHP Code I've written so far (massively confused by the output I'm getting!):


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

<head>
	<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
	<title>Tom's 103 Analysis</title>
	<link href="style.css" rel="stylesheet" type="text/css" />
</head>

<body>

<?php 

/* 	Author: 	Tom Hacquoil
	Date: 		25th August 2010      */


/* PART 1: Get currently playing song and artist. */

	# Put the contents of the source of the destination website into a 'content' variable.
	$content = file_get_contents('http://www.channel103.com/music/index.php?qty=50');
	
	# Using regular expressions, scan the page and, every time a match occurs, put the data into the 'data' array.
	preg_match('#<div><span>now playing &ndash; </span><a href="http://www.channel103.com/music/index.php">(.*)</a><span>(.*)</span></div>#', $content, $data);
	
	# Assign the contents of the 'data' array to two variables, song and artist.
	$song = $data[1];
	$artist = $data[2];
	
	# Print the content of those variables.
	echo "<strong>Song:</strong> $song - <strong>Artist:</strong> $artist\n";
	
	echo "<br /><br />";
	
	
/* PART 2: Get a list of all recently played songs. */

	# Put the contents of the source of the destination website into a 'content' variable.
	$content = file_get_contents('http://www.channel103.com/music/index.php?qty=20333');
	
	# Using regular expressions, scan the page and, every time a match occurs, put the data into the 'data' array.
	preg_match('#<tr class="tabletextRow1"><td>(.*)</td>#', $content, $data);
	
	# Print first entity of the array (for testing).
	echo $data[1];
	
	echo "<br /><br /><br />";
	
	# Print the entire array. (For testing).
	print_r($data);
		

?>

</body>

</html>

Question by:TomHacquoil
5 Comments
 

Expert Comment

by:HackneyCab
ID: 33523726
Ask the site whether they offer an RSS/Atom feed of the data (which would give you valid XML). If they don't, then it could be tricky to extract what you want without lots of custom regular expressions.

Also, make sure you have permission to use the data for the purpose you describe. If you take their data without permission they could get your page de-listed from search engines. Usually if you contact the site master and assure them that your page will link to their site, they give you the green light. (And make sure to keep that written confirmation.)
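
If no feed is available, one alternative to custom regular expressions is the XPath route mentioned in the question, using PHP's DOM extension. The following is only a minimal sketch: the sample markup, the tabletextRow class prefix and the Time / Artist / Song column order are assumptions based on the code in this thread, and extractSongs is a hypothetical helper name, not something from the original posts.

```php
<?php
// Sample markup standing in for the real page; the class names and the
// Time / Artist / Song column order are assumptions taken from this thread.
$html = <<<EOT
<table>
<tr class="tabletextRow1"><td>10:15</td><td>Artist A</td><td>Song A</td></tr>
<tr class="tabletextRow2"><td>10:11</td><td>Artist B</td><td>Song B</td></tr>
</table>
EOT;

// Hypothetical helper: returns one associative array per table row.
function extractSongs(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $songs = [];
    // Select every row whose class attribute begins with "tabletextRow".
    foreach ($xpath->query('//tr[starts-with(@class, "tabletextRow")]') as $row) {
        $cells = $xpath->query('./td', $row);
        if ($cells->length < 3) {
            continue; // skip rows without the three expected cells
        }
        $songs[] = [
            'time'   => trim($cells->item(0)->textContent),
            'artist' => trim($cells->item(1)->textContent),
            'song'   => trim($cells->item(2)->textContent),
        ];
    }
    return $songs;
}

print_r(extractSongs($html));
```

Against the live page, the $html string would presumably come from file_get_contents, as in the question. The appeal of this approach is that it does not depend on line breaks or attribute spacing in the source.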

Author Comment

by:TomHacquoil
ID: 33528900
Hi HackneyCab,

Thanks for your useful advice so far. Unfortunately they don't offer any sort of RSS/Atom feed of the data, so I'll have to extract the specific data I want some other way, which as you say is what is causing me the problem.

In terms of permission to use the data, it is purely for personal use. I'm not going to make my analysis publicly available or use the data elsewhere on the web; it's purely a task 'just to see if I can do it' whilst I try to get to grips with programming. The only reason I'm doing it in PHP is because that is the language I am currently most familiar with.

Could you provide some guidance as to what my next step would be (knowing that no RSS/Atom feed is available)? Do you suggest I use another language to achieve this task (Python, perhaps)?

Thanks!

Accepted Solution

by:Chris Harte
ID: 33530343
Tom,
I am not an expert on regex, but you should be using preg_match_all, which collects every match on the page, whereas preg_match stops after the first one. The attached code prints the artist and song title for each row. As it stands, the result array is laid out as [0] the full match (time, artist and song), [1] time, [2] artist, [3] song; if you adjust the regex you can extract only the data you want.

You could even simplify the regex and use substr on the entries of the first array ([0]) to extract the info you want.

(I reduced the number of extracted items to 10 so I would not get a bonkers amount of information)
<?php 


/* PART 2: Get a list of all recently played songs. */

	# Put the contents of the source of the destination website into a 'content' variable.
	$content = file_get_contents('http://www.channel103.com/music/index.php?qty=10');
	
	$pattern = '#<tr class="tabletextRow.">\r\n<td>(.*)</td>\r\n<td>(.*)</td>\r\n<td>(.*)#';
	
	preg_match_all ($pattern, $content, $data);
	
	//var_dump($data);
	
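	// Loop over the matches actually found rather than a fixed count,
	// so the loop never reads past the end of the array.
	$rows = count($data[0]);
	for ($i = 0; $i < $rows; $i++)
	{
	    echo "<br /><br />". $data[2][$i].' '.$data[3][$i];
	}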
	
?>	
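
As a variation on the code above, preg_match_all can also group its results per row by passing the PREG_SET_ORDER flag, which may be easier to loop over. This is a sketch under assumptions: the sample string imitates the layout implied by the pattern above (one cell per line, rows classed tabletextRow1 or tabletextRow2, columns in Time, Artist, Song order), and the real page may differ.

```php
<?php
// Sample markup standing in for the real page (assumed layout, see above).
$content = "<tr class=\"tabletextRow1\">\r\n<td>10:15</td>\r\n<td>Artist A</td>\r\n<td>Song A</td>\r\n</tr>\r\n"
         . "<tr class=\"tabletextRow2\">\r\n<td>10:11</td>\r\n<td>Artist B</td>\r\n<td>Song B</td>\r\n</tr>\r\n";

// \s* tolerates any line-ending style; (.*?) stops at the first closing </td>.
$pattern = '#<tr class="tabletextRow\d">\s*<td>(.*?)</td>\s*<td>(.*?)</td>\s*<td>(.*?)</td>#s';

// PREG_SET_ORDER groups the captures per row instead of per column.
preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);

$songs = [];
foreach ($matches as $m) {
    // $m[0] is the whole matched row; $m[1]..$m[3] are the three cells.
    $songs[] = [
        'time'   => trim($m[1]),
        'artist' => trim($m[2]),
        'song'   => trim($m[3]),
    ];
}

print_r($songs);
```

Each element of $songs is then a small associative array, which is a convenient shape for later analysis or for inserting into a database.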


Assisted Solution

by:HagayMandel
ID: 33531163
I've done most of the job.

The output of this code is a table built as you requested.
I've labelled each cell with the type of data it holds (time, artist or song), so you'll be able to refer to it.
The rest is a plain SQL insert.

Code:

<?php

/* PART 1: Get currently playing song and artist. */

      # Put the contents of the source of the destination website into a 'content' variable.
      $content = file_get_contents('http://www.channel103.com/music/index.php?qty=100');
      
      # Using Regular Expressions, scan the file and everytime a match occurs, put data into the 'data' array.
      preg_match('#<div><span>now playing &ndash; </span><a href="http://www.channel103.com/music/index.php">(.*)</a><span>(.*)</span></div>#', $content, $data);
      
      # Assign the contents of the 'data' array to two variables, song and artist.
      $song = $data[1];
      $artist = $data[2];
      
      # Print the content of those variables.
      echo "<i>Now playing:</i> <strong>Song:</strong> $song - <strong>Artist:</strong> $artist\n";
      
      echo "<br /><br />";
      
      
/* PART 2: Get a list of all recently played songs. */

      # Put the contents of the source of the destination website into a 'content' variable.
      //$content = file_get_contents('http://www.channel103.com/music/index.php?qty=200033');
      
      function seekContent($start, $end, $string) {
        preg_match_all('/' . preg_quote($start, '/') . '([^\.)]+)'. preg_quote($end, '/').'/i', $string, $m);
        return $m[1];
            }
   
    // str_replace takes a literal string, not a regex, so no # delimiters or escaped quotes.
    $content=str_replace('<td align="right"/>','',$content);
    $start = '<tr class="tabletextRow1">';
    $end = '</tr>';
      
    $out[] = seekContent($start, $end, $content);
            
      $aa=explode('<td>',$out[0][0]);
      
      // Build the table
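      print '<table width="80%" border="1"><tr><td>Time</td><td>Artist</td><td>Song</td></tr>';
      // Step through the cells three at a time, stopping when they run out.
      $cells = count($aa);
      for ($i=1; $i+2<$cells; $i=$i+3) {
            print '<tr>';
            // strip_tags drops the leftover </td> and <tr> markup from each piece.
            // class is used instead of id because an id must be unique on a page.
            print '<td class="time">'.trim(strip_tags($aa[$i])).'</td>';
            print '<td class="artist">'.trim(strip_tags($aa[$i+1])).'</td>';
            print '<td class="song">'.trim(strip_tags($aa[$i+2])).'</td>';
            print '</tr>';
      }
      print '</table>';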
            
?>
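
For the SQL insert step mentioned above, a prepared statement through PDO is one reasonable way to store the rows. The sketch below is hedged throughout: the played_songs table, its column names, the saveSongs helper and the shape of the $songs array are all hypothetical, and an in-memory SQLite database (which needs the pdo_sqlite extension) stands in for MySQL so the example runs without a server.

```php
<?php
// Hypothetical rows in the shape the scraping step is assumed to produce.
$songs = [
    ['time' => '10:15', 'artist' => 'Artist A', 'song' => 'Song A'],
    ['time' => '10:11', 'artist' => 'Artist B', 'song' => 'Song B'],
];

// Hypothetical helper: inserts each row and returns how many were saved.
function saveSongs(PDO $pdo, array $songs): int
{
    // A prepared statement keeps scraped text from being read as SQL.
    $stmt = $pdo->prepare(
        'INSERT INTO played_songs (played_at, artist, song) VALUES (:time, :artist, :song)'
    );
    $saved = 0;
    foreach ($songs as $row) {
        $stmt->execute([
            ':time'   => $row['time'],
            ':artist' => $row['artist'],
            ':song'   => $row['song'],
        ]);
        $saved++;
    }
    return $saved;
}

// In-memory SQLite keeps the sketch runnable without a server; for MySQL the
// DSN would look like 'mysql:host=localhost;dbname=radio' plus credentials
// (host and database name here are placeholders).
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE played_songs (played_at TEXT, artist TEXT, song TEXT)');

echo saveSongs($pdo, $songs), " rows saved\n";
```

Because the same statement is prepared once and executed per row, artist or song names containing quotes should not break the query.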

Author Closing Comment

by:TomHacquoil
ID: 33531732
Both MunterMan and HagayMandel produced incredibly useful responses that were easy to understand and achieved the desired result. Thanks a lot guys.