Solved

How do I structure this PregMatch script to get all results from the page?

Posted on 2011-09-15
Last Modified: 2012-05-12

How do I correctly structure this script?   Each product has a single title, description, and image, but can have up to 9 variations.  

How do I include the title, description, and image with each variation, and have each variation submitted to the database individually?  (I considered using a loop, but I do not see a good way to count the number of variations.)

The current script saves only the last variation.

See the attached screenshot for the layout I am working with.

Thanks for your help!







 $url_3 = $produrl3;   //each product url from DB Table
	curl_setopt($ch, CURLOPT_URL,$url_3); // set url to go fetch
	$buffer = curl_exec($ch); // run the process
	

	
	
						
						
// TITLE						
preg_match('%<span class="productPage_prodName">(.*?)<\/span>%',$buffer,$matches31);

$prodname=$matches31[1];

echo $prodname."<BR>";



// IMAGE
preg_match('%<a href=".*?"><img src="productImage\/BIG\/(.*?)" width="225" height="225" alt=".*?" border="0"><\/a>%',$buffer,$matches32);

$image=$matches32[1];

echo $image."<BR>";




// DESCRIPTION
preg_match('%<span class="productPage_longText">\s*(.*?)\s*<\/span>%s',$buffer,$matches33);

$description=$matches33[1];

echo $description."<BR>";



//  VARIATIONS  START HERE  -  Up To 9 Variations per product.


// SKU OPTIONS
preg_match_all('%<td align="left" valign="top" width="125">\s*<input type="text" name=".*?" style="background:.*?;" readonly="readonly" class="productPage_itemid" value="(.*?)" \/><\/td>%',$buffer,$matches34,PREG_SET_ORDER);
								

foreach ($matches34 as $val34) {

$sku=$val34[1];

echo $sku."<br />";


}



// TITLE
preg_match_all('%<td align="left" valign="top" width="250">\s*<span class="productPage_tableText">(.*?)<\/span><\/td>\s*<\/tr>%',$buffer,$matches35,PREG_SET_ORDER);
								

foreach ($matches35 as $val35) {

$title=$val35[1];

echo $title."<br />";



}




// Weight
preg_match_all('%<tr>\s*<td align="center">\s*<span class="productPage_tableText">Weight:<\/span><\/td>\s*<\/tr>\s*<tr>\s*<td align="center">\s*<span class="productPage_weight">(.*?)lbs\.<\/span><\/td>\s*<\/tr>%',$buffer,$matches36,PREG_SET_ORDER);
								

foreach ($matches36 as $val36) {

$weight=$val36[1];

echo $weight."<br />";



}


// MSRP
preg_match_all('%<tr>\s*<td align="center">\s*<span class="productPage_tableText">List<\/span><\/td>\s*<\/tr>\s*<tr>\s*<td align="center">\s*<span class="productPage_tableText">Price:<\/span><\/td>\s*<\/tr>\s*<tr>\s*<td align="center">\s*<span class="productPage_listPrice">\$(.*?)<\/span><\/td>\s*<\/tr>%s',$buffer,$matches37,PREG_SET_ORDER);
								

foreach ($matches37 as $val37) {

$MSRP=$val37[1];

echo $MSRP."<br />";



}




// COST
preg_match_all('%<tr>\s*<td align="center">\s*<span class="productPage_tableText">Dealer<\/span><\/td>\s*<\/tr>\s*<tr>\s*<td align="center">\s*<span class="productPage_tableText">Price:<\/span><\/td>\s*<\/tr>\s*<tr>\s*<td align="center">\s*<span class="productPage_ourPrice">\$(.*?)<\/span><\/td>\s*<\/tr>%s',$buffer,$matches38,PREG_SET_ORDER);
								

foreach ($matches38 as $val38) {

$COST=$val38[1];

echo $COST."<br />";



}

mysql_query("INSERT INTO `products` (prodname, image, sku, title, weight, MSRP, COST, description) VALUES ('".mysql_real_escape_string($prodname)."','".mysql_real_escape_string($image)."','".mysql_real_escape_string($sku)."','".mysql_real_escape_string($title)."','".mysql_real_escape_string($weight)."','".mysql_real_escape_string($MSRP)."','".mysql_real_escape_string($COST)."','".mysql_real_escape_string($description)."')");


screenshot of distributor website
Question by:rlb1

Expert Comment

by:Ray Paseur
Typically you would not need to count the number of variations -- you can use foreach() to iterate over an indeterminate number of elements in an array.
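To illustrate the point about foreach(), here is a minimal sketch. The sample markup is made up for the example (the real page will differ), and the regex is only one plausible way to isolate the SKU values.

```php
<?php
// Hypothetical sample markup; the real page's HTML will differ.
$html = '<input class="productPage_itemid" value="BSC100-03" />'
      . '<input class="productPage_itemid" value="BSC100-06" />'
      . '<input class="productPage_itemid" value="BSC100-10" />';

// PREG_SET_ORDER returns one array element per match, however many there are.
preg_match_all('%class="productPage_itemid" value="(.*?)"%i', $html, $matches, PREG_SET_ORDER);

$skus = array();
foreach ($matches as $match) {
    // $match[1] holds the first capture group: the SKU value.
    $skus[] = $match[1];
}

// No counter was needed; count() reports the total after the fact.
echo count($skus) . " variations found\n";
```

The loop body runs once per match, so the same code handles one variation or nine.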

Here is how to get a good answer to your question.  Show us the input data.  Not a picture or abstraction, the ACTUAL input data you want to use.  Post a link to the web page you're trying to scrape, or copy the HTML and post it here in the code snippet.  Give us the ENTIRE page, not a fragment.

Once we have that we can show you a way to tease out the data elements you want to put into the query.  Sometimes it is easier to use explode() instead of wrestling with the awkward syntax of regex.  But in any case, once we see the HTML in clear text we can show you how to parse it.
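As a rough idea of what the explode() approach can look like, here is a sketch that assumes each value sits between a fixed opening marker and the next closing tag. The fragment and the marker string are hypothetical.

```php
<?php
// Hypothetical fragment; the real markup will vary.
$html = '<td><span class="productPage_weight">1.5 lbs.</span></td>'
      . '<td><span class="productPage_weight">2.25 lbs.</span></td>';

// Split the page on the text that precedes every value we want.
$marker = '<span class="productPage_weight">';
$pieces = explode($marker, $html);
array_shift($pieces); // discard everything before the first marker

$weights = array();
foreach ($pieces as $piece) {
    // The text before the first closing tag is the value.
    $parts = explode('</span>', $piece, 2);
    $weights[] = trim(str_replace('lbs.', '', $parts[0]));
}

print_r($weights);
```

This only works when the marker text is identical for every occurrence; if the attributes vary, a regex or a DOM parser is likely the better tool.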

Author Comment

by:rlb1

Accepted Solution

by:
Ray Paseur earned 500 total points
Well, this is as far as I could get in a short period of time.
http://www.laprbass.com/RAY_temp_rlb1.php

Some comments are in order.

1. Ask the author of the web page if they publish an API.  If they do, use it so you do not have to write this scraping script.
2. As you will see, the data is highly variable and not well formatted.  Simple REGEX will not get what you need -- it's a bigger programming project than you may think!
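3. Regular expressions have a set of characters called "metacharacters" that must be escaped when you want to match them literally, unless they are in character classes.  The / slash is not a metacharacter and never needs to be escaped unless it has been used as the regex delimiter.  The wickets used in HTML < > are on the list of special characters that preg_quote() escapes, although PCRE also matches them literally when they are unescaped, so escaping them is optional.  The regex delimiter, in this case %, must always be escaped when it appears inside the pattern.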
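A small sketch of the delimiter point in item 3, using a made-up image tag. It shows the same match written three ways; the file name and markup are assumptions for the example.

```php
<?php
// Hypothetical image tag for demonstration.
$html = '<img src="productImage/BIG/cable.jpg">';

// With % as the delimiter, the slash is an ordinary character.
preg_match('%src="productImage/BIG/(.*?)"%', $html, $a);

// With / as the delimiter, every literal slash has to be escaped.
preg_match('/src="productImage\/BIG\/(.*?)"/', $html, $b);

// preg_quote() escapes special characters in a literal string;
// its second argument names the delimiter so that it gets escaped too.
$literal = preg_quote('productImage/BIG/', '%');
preg_match('%src="' . $literal . '(.*?)"%', $html, $c);

echo $a[1] . "\n" . $b[1] . "\n" . $c[1] . "\n";
```

Picking a delimiter that does not occur in the HTML you are matching, such as % or #, keeps the patterns easier to read.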

Sorry I cannot do more, but I really think the API is your best approach.  Otherwise you've got a lot of programming ahead of you.  And you run the risk that all "scraper scripts" run, namely, that the publisher of the web page will make a small change and suddenly your script will fail without warning.  An API is usually a versioned piece of software and it usually defines a formal interface to acquire the data, so unpredictable changes are extremely rare.
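If you do go the scraping route, one way to reduce the "fails without warning" risk is to make a missing match an error instead of an empty value. The helper below is hypothetical (the name extract_required() is mine, not part of PHP), and the sample HTML is made up.

```php
<?php
// Hypothetical helper: return the first capture group, or throw if the
// pattern no longer matches (for example after a page redesign).
function extract_required($regex, $html, $label)
{
    if (preg_match($regex, $html, $m) !== 1) {
        throw new RuntimeException("No match for $label; the page layout may have changed");
    }
    return trim($m[1]);
}

// Made-up sample that contains a title but no description.
$html = '<span class="productPage_prodName">XLR Cable</span>';

$title = extract_required('%<span class="productPage_prodName">(.*?)</span>%i', $html, 'TITLE');

try {
    extract_required('%<span class="productPage_longText">(.*?)</span>%si', $html, 'DESCR');
    $failed = false;
} catch (RuntimeException $e) {
    // The script stops here instead of inserting a blank description.
    $failed = true;
    echo $e->getMessage() . "\n";
}
```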

Best of luck with the project, ~Ray
<?php // RAY_temp_rlb1.php
error_reporting(E_ALL);
echo "<pre>";

// THE PAGE WE WANT TO SCRAPE
$url = 'http://www.tecnec.com/Product.asp?baseItem=BSC100XXJ&cat=CABLES&subcat=AUDIOCAB&prodClass=AXLRXLR&mfg=&search=0&off=';

// READ THE PAGE
$htm = file_get_contents($url);
// echo htmlentities($htm);

// THE REGULAR EXPRESSIONS TO ISOLATE THE THINGS WE WANT
$regexs = array
( 'TITLE' => '%\<span class="productPage_prodName"\>(.*?)\</span\>%i'
, 'IMAGE' => '%\<a href=".*?"\>\<img src="productImage/BIG/(.*?)" width="225" height="225" alt=".*?" border="0"\></a\>%i'
, 'DESCR' => '%\<span class="productPage_longText"\>(.*?)\</span\>%si'
, 'SKU'   => '%class="productPage_itemid" value="(.*?)"%i'
, 'TTEXT' => '%class="productPage_tableText"\>(.*?)\</span\>%i'
)
;

// COLLECT EVERY MATCH FOR EACH PATTERN, KEYED BY NAME
$new = array();
foreach ($regexs as $name => $regex)
{
    preg_match_all($regex, $htm, $mat);
    $new[$name] = $mat[1];
}

print_r($new);
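To get from parallel arrays like these to one database row per variation, something along the following lines may work. It is a sketch under several assumptions: that you have added regexes for weight, MSRP, and cost; that every per-variation array comes back with the same number of elements (the script above does not guarantee that on this page); and that you connect with PDO, since the mysql_* functions used in the question were removed in PHP 7. The sample values and the function names build_rows() and insert_rows() are hypothetical.

```php
<?php
// Hypothetical scraped values: one shared product, two variations.
$product = array(
    'prodname'    => 'XLR Cable',
    'image'       => 'cable.jpg',
    'description' => 'Balanced audio cable',
);
$columns = array(
    'sku'    => array('BSC100-03', 'BSC100-06'),
    'title'  => array('3 ft', '6 ft'),
    'weight' => array('0.5', '0.8'),
    'MSRP'   => array('12.00', '15.00'),
    'COST'   => array('7.20', '9.00'),
);

// Merge the shared product fields into one row per variation.
function build_rows(array $product, array $columns)
{
    // Every column must have the same length, or the rows would misalign.
    $counts = array_unique(array_map('count', $columns));
    if (count($counts) !== 1) {
        throw new RuntimeException('Variation columns have different lengths');
    }
    $total = reset($counts);
    $rows  = array();
    for ($i = 0; $i < $total; $i++) {
        $row = $product;
        foreach ($columns as $name => $values) {
            $row[$name] = $values[$i];
        }
        $rows[] = $row;
    }
    return $rows;
}

// Assumes an open PDO connection and the products table from the question.
// The statement is prepared once and executed once per variation.
function insert_rows(PDO $pdo, array $rows)
{
    $stmt = $pdo->prepare(
        'INSERT INTO products (prodname, image, sku, title, weight, MSRP, COST, description)
         VALUES (:prodname, :image, :sku, :title, :weight, :MSRP, :COST, :description)'
    );
    foreach ($rows as $row) {
        $stmt->execute($row);
    }
}

$rows = build_rows($product, $columns);
print_r($rows);
```

The key change from the original script is that the INSERT runs inside the loop, once per variation, instead of once after all the loops have finished and only the last values remain in the variables.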


Author Closing Comment

by:rlb1
Thanks!