TomHacquoil

Scraping data from another website's HTML using PHP

Hi There,

I'm trying to scrape some data from an HTML table on the website of a local radio station. They have a recently played songs list and I'd like to do some analytics on that data.

The page I'm trying to retrieve the data from is available here:

http://www.channel103.com/music/index.php?qty=100

Fortunately, the table is generated automatically and the number of songs it displays is taken from the qty value in the URL, so I have a potentially limitless dataset to work with (I've specified 100 songs as an example).

I'd eventually like to end up with the data from that table in an array or a MySQL database (I want the Time Played, Song and Artist information for every entry). However, I'm unsure how to go about getting that information (I'm new to PHP programming, but I understand most core programming concepts at least to a basic level).

I've played around with regular expressions and have managed to write a script that lists the currently playing song and artist. However, I've come to a standstill now and can't work out where to go next. I've had a look around on the net and here on EE, and XPath seems to be a common route for similar problems, but I'm struggling to get to grips with it.

Here is the PHP Code I've written so far (massively confused by the output I'm getting!):


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

<head>
	<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
	<title>Tom's 103 Analysis</title>
	<link href="style.css" rel="stylesheet" type="text/css" />
</head>

<body>

<?php 

/* 	Author: 	Tom Hacquoil
	Date: 		25th August 2010      */


/* PART 1: Get currently playing song and artist. */

	# Put the contents of the source of the destination website into a 'content' variable.
	$content = file_get_contents('http://www.channel103.com/music/index.php?qty=50');
	
	# Using Regular Expressions, scan the file and every time a match occurs, put data into the 'data' array.
	preg_match('#<div><span>now playing &ndash; </span><a href="http://www.channel103.com/music/index.php">(.*)</a><span>(.*)</span></div>#', $content, $data);
	
	# Assign the contents of the 'data' array to two variables, song and artist.
	$song = $data[1];
	$artist = $data[2];
	
	# Print the content of those variables.
	echo "<strong>Song:</strong> $song - <strong>Artist:</strong> $artist\n";
	
	echo "<br /><br />";
	
	
/* PART 2: Get a list of all recently played songs. */

	# Put the contents of the source of the destination website into a 'content' variable.
	$content = file_get_contents('http://www.channel103.com/music/index.php?qty=20333');
	
	# Using Regular Expressions, scan the file and every time a match occurs, put data into the 'data' array.
	preg_match('#<tr class="tabletextRow1"><td>(.*)</td>#', $content, $data);
	
	# Print first entity of the array (for testing).
	echo $data[1];
	
	echo "<br /><br /><br />";
	
	# Print the entire array. (For testing).
	print_r($data);
		

?>

</body>

</html>

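One likely cause of the confusing output in Part 2 of the script: preg_match() stops after the first match, and (.*) is greedy, so it runs on to the last </td> on the same line. A minimal sketch of the usual fix, using preg_match_all() with lazy groups. The sample markup is hypothetical: only the tabletextRow1 class appears in the script above, while the tabletextRow2 class and the three-cell order (time, song, artist) are assumptions about the real page.

```php
<?php
// Hypothetical sample markup; the real page's row and cell layout is an assumption.
$content = <<<HTML
<table>
<tr class="tabletextRow1"><td>10:05</td><td>Song A</td><td>Artist A</td></tr>
<tr class="tabletextRow2"><td>10:01</td><td>Song B</td><td>Artist B</td></tr>
</table>
HTML;

// preg_match() returns only the first match; preg_match_all() collects every match.
// (.*?) is lazy, so each group stops at the nearest </td> rather than the last one on the line.
$pattern = '#<tr class="tabletextRow\d+"><td>(.*?)</td><td>(.*?)</td><td>(.*?)</td></tr>#';
preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);

// With PREG_SET_ORDER, each $m holds one row: $m[1], $m[2], $m[3] are the three cells.
$rows = [];
foreach ($matches as $m) {
    $rows[] = ['time' => $m[1], 'song' => $m[2], 'artist' => $m[3]];
}

print_r($rows);
```

Against the live page, $content would come from file_get_contents() as in the script above, and the pattern would need adjusting to whatever the table markup actually looks like.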

HackneyCab

Ask the site whether they offer an RSS/Atom feed (and thus a valid XML version) of the data. If they don't, it could be tricky to extract what you want without lots of custom regular expressions.

Also, make sure you have permission to use the data for the purpose you describe. If you take their data without permission they could get your page de-listed from search engines. Usually if you contact the site master and assure them that your page will link to their site, they give you the green light. (And make sure to keep that written confirmation.)
TomHacquoil

ASKER

Hi HackneyCab,

Thanks for your useful advice so far. Unfortunately they don't offer any sort of RSS/Atom feed of the data, so I'll have to extract the specific data I want some other way, which as you say is what is causing me the problem.

In terms of permission to use the data, it is purely for personal use. I'm not going to make my analysis publicly available or use the data elsewhere on the web; it's a task 'just to see if I can do it' whilst I try to get to grips with programming. The only reason I'm doing it in PHP is because that is the language I am currently most familiar with.

Could you provide some guidance as to what my next step would be (knowing that no RSS/Atom feed is available)? Do you suggest I use another language to achieve this task (Python, perhaps)?

Thanks!
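For reference, PHP can handle this without switching languages: its bundled DOM extension can parse the page and query it with XPath instead of hand-written regular expressions. A minimal sketch, not necessarily the approach the accepted answers took. The sample markup is hypothetical; the tabletextRow2 class and the cell order (time, song, artist) are assumptions about the real page.

```php
<?php
// Hypothetical sample markup standing in for the downloaded page.
$content = <<<HTML
<html><body><table>
<tr class="tabletextRow1"><td>10:05</td><td>Song A</td><td>Artist A</td></tr>
<tr class="tabletextRow2"><td>10:01</td><td>Song B</td><td>Artist B</td></tr>
</table></body></html>
HTML;

// Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($content);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$songs = [];

// Select every table row whose class starts with "tabletextRow" (assumed naming).
foreach ($xpath->query('//tr[starts-with(@class, "tabletextRow")]') as $tr) {
    // Query the row's own <td> children, relative to the current row.
    $cells = $xpath->query('td', $tr);
    if ($cells->length < 3) {
        continue; // skip rows that do not have the expected three cells
    }
    $songs[] = [
        'time'   => trim($cells->item(0)->textContent),
        'song'   => trim($cells->item(1)->textContent),
        'artist' => trim($cells->item(2)->textContent),
    ];
}

print_r($songs);
```

The resulting $songs array has one entry per row, which could then be looped over to build INSERT statements for a MySQL table.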
ASKER CERTIFIED SOLUTION
Chris Harte
Both MunterMan and HagayMandel produced incredibly useful responses that were easy to understand and achieved the desired result. Thanks a lot guys.