Solved

How do I structure this PregMatch script to get all results from the page?

Posted on 2011-09-15
Last Modified: 2012-05-12

How do I correctly structure this script?   Each product has a single title, description, and image, but can have up to 9 variations.  

How do I include the title, description, image with each variation, and have each variation submitted to the database individually?  (I considered doing a loop but there is not a good way to count the number of variations...)

The current script is showing the last variation as the only result.  

See the attached screenshot for the layout I am working with.

Thanks for your help!







$url_3 = $produrl3;                      // each product url from DB table
curl_setopt($ch, CURLOPT_URL, $url_3);   // set url to go fetch
$buffer = curl_exec($ch);                // run the process

// TITLE						
preg_match('%<span class="productPage_prodName">(.*?)<\/span>%',$buffer,$matches31);

$prodname=$matches31[1];

echo $prodname."<BR>";



// IMAGE
preg_match('%<a href=".*?"><img src="productImage\/BIG\/(.*?)" width="225" height="225" alt=".*?" border="0"><\/a>%',$buffer,$matches32);

$image=$matches32[1];

echo $image."<BR>";




// DESCRIPTION
preg_match('%<span class="productPage_longText">\s*(.*?)\s*<\/span>%s',$buffer,$matches33);

$description=$matches33[1];

echo $description."<BR>";



//  VARIATIONS  START HERE  -  Up To 9 Variations per product.


// SKU OPTIONS
preg_match_all('%<td align="left" valign="top" width="125">\s*<input type="text" name=".*?" style="background:.*?;" readonly="readonly" class="productPage_itemid" value="(.*?)" \/><\/td>%',$buffer,$matches34,PREG_SET_ORDER);
								

foreach ($matches34 as $val34) {

$sku=$val34[1];

echo $sku."<br />";


}



// TITLE
preg_match_all('%<td align="left" valign="top" width="250">\s*<span class="productPage_tableText">(.*?)<\/span><\/td>\s*<\/tr>%',$buffer,$matches35,PREG_SET_ORDER);
								

foreach ($matches35 as $val35) {

$title=$val35[1];

echo $title."<br />";



}




// Weight
preg_match_all('%<tr>\s*<td align="center">\s*<span class="productPage_tableText">Weight:<\/span><\/td>\s*<\/tr>\s*<tr>\s*<td align="center">\s*<span class="productPage_weight">(.*?)lbs\.<\/span><\/td>\s*<\/tr>%',$buffer,$matches36,PREG_SET_ORDER);
								

foreach ($matches36 as $val36) {

$weight=$val36[1];

echo $weight."<br />";



}


// MSRP
preg_match_all('%<tr>\s*<td align="center">\s*<span class="productPage_tableText">List<\/span><\/td>\s*<\/tr>\s*<tr>\s*<td align="center">\s*<span class="productPage_tableText">Price:<\/span><\/td>\s*<\/tr>\s*<tr>\s*<td align="center">\s*<span class="productPage_listPrice">\$(.*?)<\/span><\/td>\s*<\/tr>%s',$buffer,$matches37,PREG_SET_ORDER);
								

foreach ($matches37 as $val37) {

$MSRP=$val37[1];

echo $MSRP."<br />";



}




// COST
preg_match_all('%<tr>\s*<td align="center">\s*<span class="productPage_tableText">Dealer<\/span><\/td>\s*<\/tr>\s*<tr>\s*<td align="center">\s*<span class="productPage_tableText">Price:<\/span><\/td>\s*<\/tr>\s*<tr>\s*<td align="center">\s*<span class="productPage_ourPrice">\$(.*?)<\/span><\/td>\s*<\/tr>%s',$buffer,$matches38,PREG_SET_ORDER);
								

foreach ($matches38 as $val38) {

$COST=$val38[1];

echo $COST."<br />";



}

mysql_query("INSERT INTO `products` (prodname, image, sku, title, weight, MSRP, COST, description) VALUES ('".mysql_real_escape_string($prodname)."','".mysql_real_escape_string($image)."','".mysql_real_escape_string($sku)."','".mysql_real_escape_string($title)."','".mysql_real_escape_string($weight)."','".mysql_real_escape_string($MSRP)."','".mysql_real_escape_string($COST)."','".mysql_real_escape_string($description)."')");

screenshot of distributor website
Question by:rlb1
4 Comments
 

Expert Comment

by:Ray Paseur
ID: 36548747
Typically you would not need to count the number of variations -- you can use foreach() to iterate over an indeterminate number of elements in an array.
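
To illustrate the point about foreach(), here is a minimal sketch. The HTML fragment is a made-up stand-in for the real page (an assumption, not the actual markup), using the productPage_itemid class from the question. The loop never needs to know how many variations there are.

```php
<?php
// Hypothetical sample: a page fragment with an unknown number of SKU inputs.
$html = '<input class="productPage_itemid" value="A-1" />'
      . '<input class="productPage_itemid" value="A-2" />'
      . '<input class="productPage_itemid" value="A-3" />';

// PREG_SET_ORDER gives one array entry per match, as in the question's script.
preg_match_all('%class="productPage_itemid" value="(.*?)"%', $html, $matches, PREG_SET_ORDER);

$skus = array();
foreach ($matches as $match) {
    // Runs once per match, whether the page has 1 variation or 9.
    $skus[] = $match[1];
}
echo count($skus), " variations found\n";
```

If a count is needed afterwards, count($matches) gives it without any manual bookkeeping.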

Here is how to get a good answer to your question.  Show us the input data.  Not a picture or abstraction, the ACTUAL input data you want to use.  Post a link to the web page you're trying to scrape, or copy the HTML and post it here in the code snippet.  Give us the ENTIRE page, not a fragment.

Once we have that we can show you a way to tease out the data elements you want to put into the query.  Sometimes it is easier to use explode() instead of wrestling with the awkward syntax of regex.  But in any case, once we see the HTML in clear text we can show you how to parse it.
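
As a rough sketch of the explode() approach mentioned above: split the page on the literal text that precedes each value, then cut each piece at the closing quote. The sample HTML is hypothetical and much simpler than a real page, so treat this as an illustration of the idea only.

```php
<?php
// Hypothetical fragment in the shape of the SKU inputs from the question.
$html = '<input class="productPage_itemid" value="ABC-1" />'
      . '<input class="productPage_itemid" value="ABC-2" />';

// Split on the text that precedes each value.
$chunks = explode('class="productPage_itemid" value="', $html);

// The first chunk is everything before the first marker, so discard it.
array_shift($chunks);

$skus = array();
foreach ($chunks as $chunk) {
    // Each chunk now starts with the value; cut it at the closing quote.
    $parts = explode('"', $chunk, 2);
    $skus[] = $parts[0];
}
print_r($skus);
```

This only works while the marker text appears exactly as written, so it is no more robust against markup changes than a regex would be.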

Accepted Solution

by:
Ray Paseur earned 2000 total points
ID: 36549605
Well, this is as far as I could get in a short period of time.
http://www.laprbass.com/RAY_temp_rlb1.php

Some comments are in order.

1. Ask the author of the web page if they publish an API.  If they do, use it so you do not have to write this scraping script.
2. As you will see, the data is highly variable and not well formatted.  Simple REGEX will not get what you need -- it's a bigger programming project than you may think!
3. Regular expressions have a set of characters called "metacharacters" that must be escaped when you want to match them literally (most lose their special meaning inside character classes).  The / slash is not a metacharacter and never needs to be escaped unless it has been used as the regex delimiter.  The wickets used in HTML < > are not metacharacters in PCRE either; escaping them, as the script below does, is harmless but not required.  The regex delimiter, in this case %, must always be escaped if it appears inside the pattern.
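
A small sketch of the delimiter point in item 3, using a made-up one-line HTML sample (an assumption for illustration). It shows the same pattern with % and with / as the delimiter, plus preg_quote(), which escapes metacharacters and the chosen delimiter in literal text.

```php
<?php
// Hypothetical sample markup using a class name from the question.
$html = '<span class="productPage_prodName">XLR Cable</span>';

// With % as the delimiter, the slash in </span> needs no escaping.
preg_match('%<span class="productPage_prodName">(.*?)</span>%', $html, $a);

// With / as the delimiter, the same slash must be escaped.
preg_match('/<span class="productPage_prodName">(.*?)<\/span>/', $html, $b);

// preg_quote() escapes metacharacters in literal text; the second
// argument names the delimiter so that it is escaped as well.
$literal = preg_quote('$19.95 (50% off)', '%');
preg_match('%' . $literal . '%', 'Now $19.95 (50% off) today', $c);

var_dump($a[1], $b[1], $c[0]);
```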

Sorry I cannot do more, but I really think the API is your best approach.  Otherwise you've got a lot of programming ahead of you.  And you run the risk that all "scraper scripts" run, namely, that the publisher of the web page will make a small change and suddenly your script will fail without warning.  An API is usually a versioned piece of software and it usually defines a formal interface to acquire the data, so unpredictable changes are extremely rare.
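
If you do end up scraping, one way to reduce the "fails without warning" risk is to check every match result and stop loudly when a pattern no longer matches. The sketch below assumes a hypothetical helper named require_match() (not part of PHP or of the script in this thread) and a made-up HTML sample.

```php
<?php
// Hypothetical helper: return the first capture group, or throw if the
// pattern did not match (for example because the page layout changed).
function require_match($pattern, $html, $label)
{
    $result = preg_match($pattern, $html, $m);
    if ($result === false) {
        throw new RuntimeException("Regex error while looking for $label");
    }
    if ($result === 0) {
        throw new RuntimeException("No match for $label; the page markup may have changed");
    }
    return $m[1];
}

// Hypothetical sample markup.
$html = '<span class="productPage_prodName">XLR Cable</span>';

// Normal case: the pattern matches and the capture is returned.
$name = require_match('%<span class="productPage_prodName">(.*?)</span>%', $html, 'TITLE');

// Simulated layout change: the class name no longer exists on the page.
try {
    require_match('%<span class="renamedClass">(.*?)</span>%', $html, 'TITLE');
    $failed = false;
} catch (RuntimeException $e) {
    $failed = true;
    echo $e->getMessage(), "\n";
}
```

This does not make the scraper robust, but it turns a silent empty row in the database into an error you can see and act on.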

Best of luck with the project, ~Ray
<?php // RAY_temp_rlb1.php
error_reporting(E_ALL);
echo "<pre>";

// THE PAGE WE WANT TO SCRAPE
$url = 'http://www.tecnec.com/Product.asp?baseItem=BSC100XXJ&cat=CABLES&subcat=AUDIOCAB&prodClass=AXLRXLR&mfg=&search=0&off=';

// READ THE PAGE
$htm = file_get_contents($url);
// echo htmlentities($htm);

// THE REGULAR EXPRESSIONS TO ISOLATE THE THINGS WE WANT
$regexs = array
( 'TITLE' => '%\<span class="productPage_prodName"\>(.*?)\</span\>%i'
, 'IMAGE' => '%\<a href=".*?"\>\<img src="productImage/BIG/(.*?)" width="225" height="225" alt=".*?" border="0"\></a\>%i'
, 'DESCR' => '%\<span class="productPage_longText"\>(.*?)\</span\>%si'
, 'SKU'   => '%class="productPage_itemid" value="(.*?)"%i'
, 'TTEXT' => '%class="productPage_tableText"\>(.*?)\</span\>%i'
)
;

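// COLLECT EVERY MATCH FOR EACH PATTERN, KEYED BY FIELD NAME
$new = array();
foreach ($regexs as $name => $regex)
{
    // $mat[1] HOLDS ALL CAPTURES FOR THIS PATTERN, IN PAGE ORDER
    preg_match_all($regex, $htm, $mat);
    $new[$name] = $mat[1];
}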

print_r($new);
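
For the original question (one database row per variation, each carrying the shared product fields), a possible next step is to pair the per-variation arrays by position. This is a sketch under assumptions: the HTML is a made-up stand-in with two variations, only the SKU and weight fields are shown, and it relies on every field appearing exactly once per variation, which the real page may not guarantee.

```php
<?php
// Hypothetical stand-in for the scraped page: one product, two variations.
$htm = '<span class="productPage_prodName">XLR Cable</span>'
     . '<input class="productPage_itemid" value="BSC-3" />'
     . '<span class="productPage_weight">0.5lbs.</span>'
     . '<input class="productPage_itemid" value="BSC-6" />'
     . '<span class="productPage_weight">0.8lbs.</span>';

// Product-level field: matched once.
preg_match('%<span class="productPage_prodName">(.*?)</span>%i', $htm, $m);
$prodname = $m[1];

// Variation-level fields: one array entry per variation, in page order.
preg_match_all('%class="productPage_itemid" value="(.*?)"%i', $htm, $sku);
preg_match_all('%<span class="productPage_weight">(.*?)lbs\.</span>%i', $htm, $weight);

// Pairing by index is only valid if every field matched the same number of times.
if (count($sku[1]) !== count($weight[1])) {
    throw new RuntimeException('Field counts differ; cannot pair variations by position');
}

$rows = array();
foreach ($sku[1] as $i => $value) {
    $rows[] = array(
        'prodname' => $prodname,              // repeated on every row
        'sku'      => $value,
        'weight'   => trim($weight[1][$i]),
    );
    // The INSERT would go here, inside the loop, so it runs once per variation.
}
print_r($rows);
```

The script in the question inserts only the last variation because its single INSERT runs after all the loops have finished, when $sku, $title and the other variables hold the values from the final iteration.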


Author Closing Comment

by:rlb1
ID: 36549695
Thanks!