Solved

How do I structure this preg_match script to get all results from the page?

Posted on 2011-09-15
Last Modified: 2012-05-12

How do I correctly structure this script? Each product has a single title, description, and image, but can have up to 9 variations.

How do I include the title, description, image with each variation, and have each variation submitted to the database individually?  (I considered doing a loop but there is not a good way to count the number of variations...)

The current script shows only the last variation as the result.

See the attached screenshot for the layout I am working with.

Thanks for your help!







// Fetch each product page (the URL comes from a DB table)
$url_3 = $produrl3;
curl_setopt($ch, CURLOPT_URL, $url_3); // set the URL to fetch
$buffer = curl_exec($ch); // run the request

// TITLE						
preg_match('%<span class="productPage_prodName">(.*?)<\/span>%',$buffer,$matches31);

$prodname=$matches31[1];

echo $prodname."<BR>";



// IMAGE
preg_match('%<a href=".*?"><img src="productImage\/BIG\/(.*?)" width="225" height="225" alt=".*?" border="0"><\/a>%',$buffer,$matches32);

$image=$matches32[1];

echo $image."<BR>";




// DESCRIPTION
preg_match('%<span class="productPage_longText">\s*(.*?)\s*<\/span>%s',$buffer,$matches33);

$description=$matches33[1];

echo $description."<BR>";



//  VARIATIONS  START HERE  -  Up To 9 Variations per product.


// SKU OPTIONS
preg_match_all('%<td align="left" valign="top" width="125">\s*<input type="text" name=".*?" style="background:.*?;" readonly="readonly" class="productPage_itemid" value="(.*?)" \/><\/td>%',$buffer,$matches34,PREG_SET_ORDER);
								

foreach ($matches34 as $val34) {

$sku=$val34[1];

echo $sku."<br />";


}



// TITLE
preg_match_all('%<td align="left" valign="top" width="250">\s*<span class="productPage_tableText">(.*?)<\/span><\/td>\s*<\/tr>%',$buffer,$matches35,PREG_SET_ORDER);
								

foreach ($matches35 as $val35) {

$title=$val35[1];

echo $title."<br />";



}




// Weight
preg_match_all('%<tr>\s*<td align="center">\s*<span class="productPage_tableText">Weight:<\/span><\/td>\s*<\/tr>\s*<tr>\s*<td align="center">\s*<span class="productPage_weight">(.*?)lbs\.<\/span><\/td>\s*<\/tr>%',$buffer,$matches36,PREG_SET_ORDER);
								

foreach ($matches36 as $val36) {

$weight=$val36[1];

echo $weight."<br />";



}


// MSRP
preg_match_all('%<tr>\s*<td align="center">\s*<span class="productPage_tableText">List<\/span><\/td>\s*<\/tr>\s*<tr>\s*<td align="center">\s*<span class="productPage_tableText">Price:<\/span><\/td>\s*<\/tr>\s*<tr>\s*<td align="center">\s*<span class="productPage_listPrice">\$(.*?)<\/span><\/td>\s*<\/tr>%s',$buffer,$matches37,PREG_SET_ORDER);
								

foreach ($matches37 as $val37) {

$MSRP=$val37[1];

echo $MSRP."<br />";



}




// COST
preg_match_all('%<tr>\s*<td align="center">\s*<span class="productPage_tableText">Dealer<\/span><\/td>\s*<\/tr>\s*<tr>\s*<td align="center">\s*<span class="productPage_tableText">Price:<\/span><\/td>\s*<\/tr>\s*<tr>\s*<td align="center">\s*<span class="productPage_ourPrice">\$(.*?)<\/span><\/td>\s*<\/tr>%s',$buffer,$matches38,PREG_SET_ORDER);
								

foreach ($matches38 as $val38) {

$COST=$val38[1];

echo $COST."<br />";



}

mysql_query("INSERT INTO `products` (prodname, image, sku, title, weight, MSRP, COST, description) VALUES ('".mysql_real_escape_string($prodname)."','".mysql_real_escape_string($image)."','".mysql_real_escape_string($sku)."','".mysql_real_escape_string($title)."','".mysql_real_escape_string($weight)."','".mysql_real_escape_string($MSRP)."','".mysql_real_escape_string($COST)."','".mysql_real_escape_string($description)."')");


screenshot of distributor website
Question by:rlb1

Expert Comment

by:Ray Paseur
ID: 36548747
Typically you would not need to count the number of variations -- you can use foreach() to iterate over an indeterminate number of elements in an array.
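
As a minimal sketch of that idea (the HTML fragment and SKU values here are made up for illustration, not taken from the real page), preg_match_all() returns however many matches exist and foreach() walks them without a count:

```php
<?php
// Hypothetical HTML fragment standing in for a fetched product page.
$html = '<input class="productPage_itemid" value="SKU-A" />'
      . '<input class="productPage_itemid" value="SKU-B" />'
      . '<input class="productPage_itemid" value="SKU-C" />';

// preg_match_all() collects every match; no count is needed up front.
preg_match_all('%class="productPage_itemid" value="(.*?)"%i', $html, $matches);

$skus = array();
foreach ($matches[1] as $sku) {
    // Each pass handles one variation, whether there is 1 or 9.
    $skus[] = $sku;
}

print_r($skus);
```

The same loop works unchanged if a product has a single variation or the full nine.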

Here is how to get a good answer to your question.  Show us the input data.  Not a picture or abstraction, the ACTUAL input data you want to use.  Post a link to the web page you're trying to scrape, or copy the HTML and post it here in the code snippet.  Give us the ENTIRE page, not a fragment.

Once we have that we can show you a way to tease out the data elements you want to put into the query.  Sometimes it is easier to use explode() instead of wrestling with the awkward syntax of regex.  But in any case, once we see the HTML in clear text we can show you how to parse it.
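
For illustration only, here is roughly what the explode() approach could look like. The fragment is a made-up sample, and the real page markup may well differ:

```php
<?php
// Hypothetical fragment; the real page markup may differ.
$html = '<span class="productPage_weight">1.5 lbs.</span>'
      . '<span class="productPage_weight">2.25 lbs.</span>';

// Split on the opening tag; element 0 is whatever preceded the first marker.
$chunks = explode('<span class="productPage_weight">', $html);
array_shift($chunks);

$weights = array();
foreach ($chunks as $chunk) {
    // Keep only the text before the closing tag, then drop the unit.
    $parts = explode('</span>', $chunk, 2);
    $weights[] = trim(str_replace('lbs.', '', $parts[0]));
}

print_r($weights);
```

This trades regex syntax for plain string splitting, which tends to be easier to read when the surrounding markup is a fixed string.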

Accepted Solution

by:
Ray Paseur earned 500 total points
ID: 36549605
Well, this is as far as I could get in a short period of time.
http://www.laprbass.com/RAY_temp_rlb1.php

Some comments are in order.

1. Ask the author of the web page if they publish an API.  If they do, use it so you do not have to write this scraping script.
2. As you will see, the data is highly variable and not well formatted.  Simple REGEX will not get what you need -- it's a bigger programming project than you may think!
3. Regular expressions have a set of characters called "metacharacters" that must be escaped unless they are in character classes.  The / slash is not a metacharacter and never needs to be escaped unless it has been used as the regex delimiter.  The angle brackets used in HTML < > are not metacharacters on their own in PCRE; escaping them, as the patterns below do, is harmless but not required.  The regex delimiter, in this case %, must always be escaped when it appears inside the pattern.
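
To illustrate the delimiter point in item 3, here is a small sketch. The HTML string and file names are invented for the example; preg_match() and preg_quote() are the standard PHP functions:

```php
<?php
// Made-up sample markup for the example.
$html = '<a href="x"><img src="productImage/BIG/cable.jpg"></a> 100% copper';

// With % as the delimiter, / needs no escaping and < > are matched literally.
preg_match('%<img src="productImage/BIG/(.*?)">%', $html, $m);
$image = $m[1];

// A literal % inside a %-delimited pattern must be escaped.
$hasPercent = preg_match('%100\% copper%', $html);

// preg_quote() escapes the special characters in a literal string,
// plus the delimiter passed as its second argument.
$literal = 'productImage/BIG/(new).jpg';
$quoted  = preg_quote($literal, '%');
$matchesItself = preg_match('%' . $quoted . '%', $literal);

var_dump($image, $hasPercent, $matchesItself);
```

Using preg_quote() is usually safer than escaping by hand when part of the pattern comes from data rather than from the script itself.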

Sorry I cannot do more, but I really think the API is your best approach.  Otherwise you've got a lot of programming ahead of you.  And you run the risk that all "scraper scripts" run, namely, that the publisher of the web page will make a small change and suddenly your script will fail without warning.  An API is usually a versioned piece of software and it usually defines a formal interface to acquire the data, so unpredictable changes are extremely rare.

best of luck with the project, ~Ray
<?php // RAY_temp_rlb1.php
error_reporting(E_ALL);
echo "<pre>";

// THE PAGE WE WANT TO SCRAPE
$url = 'http://www.tecnec.com/Product.asp?baseItem=BSC100XXJ&cat=CABLES&subcat=AUDIOCAB&prodClass=AXLRXLR&mfg=&search=0&off=';

// READ THE PAGE
$htm = file_get_contents($url);
// echo htmlentities($htm);

// THE REGULAR EXPRESSIONS TO ISOLATE THE THINGS WE WANT
$regexs = array
( 'TITLE' => '%\<span class="productPage_prodName"\>(.*?)\</span\>%i'
, 'IMAGE' => '%\<a href=".*?"\>\<img src="productImage/BIG/(.*?)" width="225" height="225" alt=".*?" border="0"\></a\>%i'
, 'DESCR' => '%\<span class="productPage_longText"\>(.*?)\</span\>%si'
, 'SKU'   => '%class="productPage_itemid" value="(.*?)"%i'
, 'TTEXT' => '%class="productPage_tableText"\>(.*?)\</span\>%i'
)
;

// COLLECT EVERY MATCH FOR EACH EXPRESSION, KEYED BY NAME
$new = array();
foreach ($regexs as $name => $regex)
{
    preg_match_all($regex, $htm, $mat);
    $new[$name] = $mat[1];
}

print_r($new);
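
Going one step further toward the original question, a rough sketch of pairing the shared fields with each variation might look like the following. The helper names buildVariationRows() and insertRows() are hypothetical, the sample values are invented, and the PDO insert is an assumption swapped in for the mysql_query() call in the question (it is defined but not called here, since it needs a live connection):

```php
<?php
// Hypothetical helper: merge shared product fields into one row per variation.
function buildVariationRows(array $shared, array $columns)
{
    // Use the shortest column so a missing value never shifts data between variations.
    $count = min(array_map('count', $columns));
    $rows = array();
    for ($i = 0; $i < $count; $i++) {
        $row = $shared;
        foreach ($columns as $name => $values) {
            $row[$name] = trim($values[$i]);
        }
        $rows[] = $row;
    }
    return $rows;
}

// Hypothetical insert step: one INSERT per variation, executed inside the loop.
function insertRows(PDO $pdo, array $rows)
{
    $stmt = $pdo->prepare(
        'INSERT INTO products (prodname, image, description, sku, title, weight, MSRP, COST)
         VALUES (:prodname, :image, :description, :sku, :title, :weight, :MSRP, :COST)'
    );
    foreach ($rows as $row) {
        $stmt->execute($row);
    }
}

// Sample values standing in for the preg_match / preg_match_all results.
$shared  = array('prodname' => 'XLR Cable', 'image' => 'cable.jpg', 'description' => 'Balanced audio cable');
$columns = array(
    'sku'    => array('BSC-3', 'BSC-6'),
    'title'  => array('3 ft', '6 ft'),
    'weight' => array('0.5', '0.9'),
    'MSRP'   => array('12.00', '15.00'),
    'COST'   => array('8.00', '10.00'),
);

$rows = buildVariationRows($shared, $columns);
print_r($rows);
```

The key design point is that the insert runs once per row rather than once after all the loops have finished, which is why the original script only ever stored the last variation. If the columns come back with different lengths, the patterns probably need attention before the rows can be trusted.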

 

Author Closing Comment

by:rlb1
ID: 36549695
Thanks!