Solved

Scraping data from another website's HTML using PHP

Posted on 2010-08-25
Last Modified: 2013-11-15
Hi There,

I'm trying to scrape some data from an HTML table on the website of a local radio station. They have a recently played songs list and I'd like to do some analytics on that data.

The page I'm trying to retrieve the data from is available here:

http://www.channel103.com/music/index.php?qty=100

Fortunately, the table is generated automatically and the number of songs it displays is taken from the qty value in the URL, so I have a potentially limitless dataset to work with (although I've specified 100 songs as an example).

I'd eventually like to end up with the data from that table in an array or a MySQL database (I want the Time Played, Song and Artist information for every entry). However, I'm unsure how to go about getting that information (I'm new to PHP programming, but I understand most core programming concepts at least to a basic level).

I've played around with regular expressions and have managed to write a script that lists the currently playing song and artist. However, I've come to a standstill now and can't work out where to go next. I've had a look around on the net and here on EE, and XPath seems to be a common route for similar problems, but I'm struggling to get to grips with it.

Here is the PHP Code I've written so far (massively confused by the output I'm getting!):


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

<head>
	<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
	<title>Tom's 103 Analysis</title>
	<link href="style.css" rel="stylesheet" type="text/css" />
</head>

<body>

<?php 

/* 	Author: 	Tom Hacquoil
	Date: 		25th August 2010      */


/* PART 1: Get currently playing song and artist. */

	# Put the contents of the source of the destination website into a 'content' variable.
	$content = file_get_contents('http://www.channel103.com/music/index.php?qty=50');
	
	# Using Regular Expressions, scan the file and everytime a match occurs, put data into the 'data' array.
	preg_match('#<div><span>now playing &ndash; </span><a href="http://www.channel103.com/music/index.php">(.*)</a><span>(.*)</span></div>#', $content, $data);
	
	# Assign the contents of the 'data' array to two variables, song and artist.
	$song = $data[1];
	$artist = $data[2];
	
	# Print the content of those variables.
	echo "<strong>Song:</strong> $song - <strong>Artist:</strong> $artist\n";
	
	echo "<br /><br />";
	
	
/* PART 2: Get a list of all recently played songs. */

	# Put the contents of the source of the destination website into a 'content' variable.
	$content = file_get_contents('http://www.channel103.com/music/index.php?qty=20333');
	
	# Using Regular Expressions, scan the file and everytime a match occurs, put data into the 'data' array.
	preg_match('#<tr class="tabletextRow1"><td>(.*)</td>#', $content, $data);
	
	# Print first entity of the array (for testing).
	echo $data[1];
	
	echo "<br /><br /><br />";
	
	# Print the entire array. (For testing).
	print_r($data);
		

?>

</body>

</html>

Question by:TomHacquoil
5 Comments
 
LVL 16

Expert Comment

by:HackneyCab
Ask the site whether they offer an RSS/Atom feed (and thus a valid XML version) of the data. If they don't, then it could be tricky to extract what you want without lots of custom regular expressions.

Also, make sure you have permission to use the data for the purpose you describe. If you take their data without permission they could get your page de-listed from search engines. Usually if you contact the site master and assure them that your page will link to their site, they give you the green light. (And make sure to keep that written confirmation.)
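
If no feed is available, one alternative to custom regular expressions is the XPath route mentioned in the question. The following is only a minimal sketch: it assumes PHP's DOM extension is enabled and that each song row is a <tr> whose class starts with "tabletextRow" and holds three <td> cells in the order time, artist, song. The sample markup and the function name extractSongs are made up for illustration and should be checked against the real page source.

```php
<?php
// Hypothetical sample markup: the real page's structure is an assumption
// based on the "tabletextRow" class names used in this thread.
$html = <<<HTML
<table>
<tr class="tabletextRow1"><td>14:05</td><td>Example Artist A</td><td>Example Song A</td></tr>
<tr class="tabletextRow2"><td>14:01</td><td>Example Artist B</td><td>Example Song B</td></tr>
</table>
HTML;

/**
 * Parse the HTML and return one array per song row.
 * extractSongs is a hypothetical helper name, not part of any library.
 */
function extractSongs(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid XML, so collect parser warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $songs = [];

    // Select every row whose class attribute starts with "tabletextRow".
    foreach ($xpath->query('//tr[starts-with(@class, "tabletextRow")]') as $row) {
        $cells = $xpath->query('./td', $row);
        if ($cells->length < 3) {
            continue; // skip rows that do not have all three columns
        }
        $songs[] = [
            'time'   => trim($cells->item(0)->textContent),
            'artist' => trim($cells->item(1)->textContent),
            'song'   => trim($cells->item(2)->textContent),
        ];
    }

    return $songs;
}

print_r(extractSongs($html));
```

To try it against the live page, the sample string could be swapped for the result of file_get_contents() on the URL from the question, provided the markup matches the assumptions above.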

Author Comment

by:TomHacquoil
Hi HackneyCab,

Thanks for your useful advice so far. Unfortunately they don't offer any sort of RSS/Atom feed of the data, so I'll have to extract the specific data I want some other way, which as you say is what is causing me the problem.

In terms of permission to use the data, it is purely for personal use. I'm not going to make my analysis publicly available or use the data elsewhere on the web; it's purely a task 'just to see if I can do it' whilst I try to get to grips with programming. The only reason I'm doing it via PHP is because that is the language I am currently most familiar with.

Could you provide some guidance as to what my next step would be (knowing that no RSS/Atom feed is available)? Do you suggest I use another language to achieve this task (Python?)

Thanks!
LVL 16

Accepted Solution

by:Chris Harte
Chris Harte earned 250 total points
Tom,
I am not an expert on regex, but you should be using preg_match_all, which collects every match into the array rather than stopping at the first one as preg_match does. The attached code will print out the artist and song title. I am sure that if you manipulate the regex you can extract only the data you want; as it stands, the array is [0] time, artist and song (the full match), [1] time, [2] artist, [3] song.

You could even reduce this regex and use substr on the first array to extract the info you want.

(I reduced the number of extracted items to 10 so I would not get a bonkers amount of information)
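<?php

/* PART 2: Get a list of all recently played songs. */

	# Put the contents of the source of the destination website into a 'content' variable.
	$content = file_get_contents('http://www.channel103.com/music/index.php?qty=10');

	# Each table row spans several lines: one <td> each for time, artist and song.
	$pattern = '#<tr class="tabletextRow.">\r\n<td>(.*)</td>\r\n<td>(.*)</td>\r\n<td>(.*)#';

	# $data[0] holds the full matches, $data[1] the times, $data[2] the artists, $data[3] the songs.
	preg_match_all($pattern, $content, $data);

	//var_dump($data);

	# Loop over however many rows actually matched, rather than a hard-coded count.
	$rows = count($data[0]);
	for ($i = 0; $i < $rows; $i++)
	{
	    echo "<br /><br />". $data[2][$i].' '.$data[3][$i];
	}

?>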
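
Since the goal in the question is an array of Time Played, Artist and Song entries, a variation on the regex above may be useful. This is a sketch under assumptions: the sample rows are invented, the CRLF line layout simply mirrors the pattern above, and the PREG_SET_ORDER flag is used so that preg_match_all groups its results per row instead of per capture group.

```php
<?php
// Hypothetical sample rows; the line layout mirrors the pattern used in
// this thread but is an assumption about the real page.
$content = "<tr class=\"tabletextRow1\">\r\n<td>14:05</td>\r\n<td>Example Artist A</td>\r\n<td>Example Song A</td>\r\n</tr>\r\n"
         . "<tr class=\"tabletextRow2\">\r\n<td>14:01</td>\r\n<td>Example Artist B</td>\r\n<td>Example Song B</td>\r\n</tr>\r\n";

// Non-greedy groups, each closed by its </td>, so no markup leaks into a capture.
$pattern = '#<tr class="tabletextRow\d">\r\n<td>(.*?)</td>\r\n<td>(.*?)</td>\r\n<td>(.*?)</td>#';

// With PREG_SET_ORDER, $matches[0] is the first row, $matches[1] the second, and so on.
preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);

$songs = [];
foreach ($matches as $m) {
    // $m[0] is the full match; $m[1] to $m[3] are the three captured cells.
    $songs[] = ['time' => $m[1], 'artist' => $m[2], 'song' => $m[3]];
}

print_r($songs);
```

Each element of $songs is then one self-contained entry, which tends to be easier to loop over or store than three parallel arrays.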

LVL 16

Assisted Solution

by:HagayMandel
HagayMandel earned 250 total points
I've done most of the job:

The output of this code is a table built as you requested.
I've added an id attribute to each cell for its type of data (time, artist, song), so you'll be able to refer to it.
The rest is a plain SQL insert.

Code:

<?php

/* PART 1: Get currently playing song and artist. */

      # Put the contents of the source of the destination website into a 'content' variable.
      $content = file_get_contents('http://www.channel103.com/music/index.php?qty=100');
      
      # Using Regular Expressions, scan the file and everytime a match occurs, put data into the 'data' array.
      preg_match('#<div><span>now playing &ndash; </span><a href="http://www.channel103.com/music/index.php">(.*)</a><span>(.*)</span></div>#', $content, $data);
      
      # Assign the contents of the 'data' array to two variables, song and artist.
      $song = $data[1];
      $artist = $data[2];
      
      # Print the content of those variables.
      echo "<i>Now playing:</i> <strong>Song:</strong> $song - <strong>Artist:</strong> $artist\n";
      
      echo "<br /><br />";
      
      
/* PART 2: Get a list of all recently played songs. */

      # Put the contents of the source of the destination website into a 'content' variable.
      //$content = file_get_contents('http://www.channel103.com/music/index.php?qty=200033');
      
      function seekContent($start, $end, $string) {
        preg_match_all('/' . preg_quote($start, '/') . '([^\.)]+)'. preg_quote($end, '/').'/i', $string, $m);
        return $m[1];
            }
   
    # str_replace takes a literal string, not a regex, so no delimiters or escapes are needed.
    $content=str_replace('<td align="right"/>','',$content);
    $start = '<tr class="tabletextRow1">';
    $end = '</tr>';
      
    $out[] = seekContent($start, $end, $content);
            
      $aa=explode('<td>',$out[0][0]);
      
      // Build the table
      print '<table width="80%" border="1" colspan="3"><tr><td>Time</td><td>Artist</td><td>Song</td></tr>';
      for ($i=1; $i<=100;$i=$i+3) {
            print '<tr>';
            $ii=$i+1;
            $iii=$i+2;
            print '<td id="time">'.$aa[$i].'</td>';
            print '<td id="artist">'.$aa[$ii].'</td>';
            print '<td id="song">'.$aa[$iii].'</td>';
            print '</tr>';
      }
      print '</table>';
            
?>
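
For the SQL insert step, one possible approach is a PDO prepared statement. This is only a sketch under assumptions: the table name plays, its columns played_at, artist and song, and the helper name saveSongs are all hypothetical, and each row is assumed to be an array with the keys time, artist and song.

```php
<?php
/**
 * Insert scraped rows into a table and return how many rows were written.
 * The table "plays" and its columns are hypothetical; adjust them to
 * your own schema.
 */
function saveSongs(PDO $pdo, array $songs): int
{
    // Prepare once, execute once per row.
    $stmt = $pdo->prepare(
        'INSERT INTO plays (played_at, artist, song) VALUES (:time, :artist, :song)'
    );

    $inserted = 0;
    foreach ($songs as $row) {
        // Bound parameters keep quotes in song titles from breaking the query.
        $stmt->execute([
            ':time'   => $row['time'],
            ':artist' => $row['artist'],
            ':song'   => $row['song'],
        ]);
        $inserted += $stmt->rowCount();
    }

    return $inserted;
}

// Usage sketch (host, database name and credentials are placeholders):
// $pdo = new PDO('mysql:host=localhost;dbname=radio;charset=utf8mb4', 'user', 'password');
// $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
// echo saveSongs($pdo, $songs) . " rows inserted";
```

Because the page lists recently played songs, repeated runs will see the same entries again; a unique key on the time column is one way to guard against duplicates, depending on the schema you choose.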

Author Closing Comment

by:TomHacquoil
Both MunterMan and HagayMandel produced incredibly useful responses that were easy to understand and achieved the desired result. Thanks a lot guys.