Solved

How do I structure this preg_match script to get all results from the page?

Posted on 2011-09-15
Last Modified: 2012-05-12

How do I correctly structure this script?   Each product has a single title, description, and image, but can have up to 9 variations.  

How do I include the title, description, image with each variation, and have each variation submitted to the database individually?  (I considered doing a loop but there is not a good way to count the number of variations...)

The current script is showing the last variation as the only result.  

See the attached screenshot for the layout I am working with.

Thanks for your help!







 $url_3 = $produrl3;   //each product url from DB Table
	curl_setopt($ch, CURLOPT_URL,$url_3); // set url to go fetch
	$buffer = curl_exec($ch); // run the process
	

	
	
						
						
// TITLE						
preg_match('%<span class="productPage_prodName">(.*?)<\/span>%',$buffer,$matches31);

$prodname=$matches31[1];

echo $prodname."<BR>";



// IMAGE
preg_match('%<a href=".*?"><img src="productImage\/BIG\/(.*?)" width="225" height="225" alt=".*?" border="0"><\/a>%',$buffer,$matches32);

$image=$matches32[1];

echo $image."<BR>";




// DESCRIPTION
preg_match('%<span class="productPage_longText">\s*(.*?)\s*<\/span>%s',$buffer,$matches33);

$description=$matches33[1];

echo $description."<BR>";



//  VARIATIONS  START HERE  -  Up To 9 Variations per product.


// SKU OPTIONS
preg_match_all('%<td align="left" valign="top" width="125">\s*<input type="text" name=".*?" style="background:.*?;" readonly="readonly" class="productPage_itemid" value="(.*?)" \/><\/td>%',$buffer,$matches34,PREG_SET_ORDER);
								

foreach ($matches34 as $val34) {

$sku=$val34[1];

echo $sku."<br />";


}



// TITLE
preg_match_all('%<td align="left" valign="top" width="250">\s*<span class="productPage_tableText">(.*?)<\/span><\/td>\s*<\/tr>%',$buffer,$matches35,PREG_SET_ORDER);
								

foreach ($matches35 as $val35) {

$title=$val35[1];

echo $title."<br />";



}




// Weight
preg_match_all('%<tr>\s*<td align="center">\s*<span class="productPage_tableText">Weight:<\/span><\/td>\s*<\/tr>\s*<tr>\s*<td align="center">\s*<span class="productPage_weight">(.*?)lbs\.<\/span><\/td>\s*<\/tr>%',$buffer,$matches36,PREG_SET_ORDER);
								

foreach ($matches36 as $val36) {

$weight=$val36[1];

echo $weight."<br />";



}


// MSRP
preg_match_all('%<tr>\s*<td align="center">\s*<span class="productPage_tableText">List<\/span><\/td>\s*<\/tr>\s*<tr>\s*<td align="center">\s*<span class="productPage_tableText">Price:<\/span><\/td>\s*<\/tr>\s*<tr>\s*<td align="center">\s*<span class="productPage_listPrice">\$(.*?)<\/span><\/td>\s*<\/tr>%s',$buffer,$matches37,PREG_SET_ORDER);
								

foreach ($matches37 as $val37) {

$MSRP=$val37[1];

echo $MSRP."<br />";



}




// COST
preg_match_all('%<tr>\s*<td align="center">\s*<span class="productPage_tableText">Dealer<\/span><\/td>\s*<\/tr>\s*<tr>\s*<td align="center">\s*<span class="productPage_tableText">Price:<\/span><\/td>\s*<\/tr>\s*<tr>\s*<td align="center">\s*<span class="productPage_ourPrice">\$(.*?)<\/span><\/td>\s*<\/tr>%s',$buffer,$matches38,PREG_SET_ORDER);
								

foreach ($matches38 as $val38) {

$COST=$val38[1];

echo $COST."<br />";



}

mysql_query("INSERT INTO `products` (prodname, image, sku, title, weight, MSRP, COST, description) VALUES ('".mysql_real_escape_string($prodname)."','".mysql_real_escape_string($image)."','".mysql_real_escape_string($sku)."','".mysql_real_escape_string($title)."','".mysql_real_escape_string($weight)."','".mysql_real_escape_string($MSRP)."','".mysql_real_escape_string($COST)."','".mysql_real_escape_string($description)."')");


screenshot of distributor website
Question by:rlb1

Expert Comment

by:Ray Paseur
ID: 36548747
Typically you would not need to count the number of variations -- you can use foreach() to iterate over an indeterminate number of elements in an array.
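
As a rough illustration of that point, the sketch below collects the per-variation values with preg_match_all() and then walks them with foreach(), copying the single product name into every row. The sample HTML is a made-up, simplified stand-in for the real page, and the shortened patterns and the $rows array are assumptions for illustration only, not the original script.

```php
<?php
// Hypothetical, simplified stand-in for the fetched page ($buffer in the question).
$buffer = <<<HTML
<span class="productPage_prodName">XLR Cable</span>
<input class="productPage_itemid" value="BSC-3" />
<span class="productPage_weight">0.5 lbs.</span>
<input class="productPage_itemid" value="BSC-6" />
<span class="productPage_weight">0.8 lbs.</span>
HTML;

// One value per product.
preg_match('%<span class="productPage_prodName">(.*?)</span>%', $buffer, $m);
$prodname = isset($m[1]) ? $m[1] : '';

// One value per variation; $skus[1] and $weights[1] are parallel arrays.
preg_match_all('%class="productPage_itemid" value="(.*?)"%', $buffer, $skus);
preg_match_all('%<span class="productPage_weight">\s*(.*?)\s*lbs\.</span>%', $buffer, $weights);

// Build one row per variation, however many were found.
$rows = array();
foreach ($skus[1] as $i => $sku) {
    $rows[] = array(
        'prodname' => $prodname,
        'sku'      => $sku,
        'weight'   => isset($weights[1][$i]) ? $weights[1][$i] : null,
    );
}

print_r($rows);
```

Each element of $rows could then be written to the database inside the same loop, so every variation gets its own INSERT instead of only the last one. This assumes the fields appear in the same order and the same number of times for every variation, which real pages do not always guarantee.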

Here is how to get a good answer to your question.  Show us the input data.  Not a picture or abstraction, the ACTUAL input data you want to use.  Post a link to the web page you're trying to scrape, or copy the HTML and post it here in the code snippet.  Give us the ENTIRE page, not a fragment.

Once we have that we can show you a way to tease out the data elements you want to put into the query.  Sometimes it is easier to use explode() instead of wrestling with the awkward syntax of regex.  But in any case, once we see the HTML in clear text we can show you how to parse it.
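
To give an idea of the explode() approach mentioned above, here is a minimal sketch. The HTML fragment is invented for the example (only the class name comes from the question), so treat it as an assumption about the page rather than a tested parser.

```php
<?php
// Hypothetical fragment; the class name is taken from the question's patterns.
$html = '<td><span class="productPage_listPrice">$12.50</span></td>';

// Split on the text just before the value, then on the text just after it.
$parts = explode('class="productPage_listPrice">', $html);

$price = null;
if (count($parts) > 1) {
    $tail  = explode('</span>', $parts[1]);
    $price = ltrim($tail[0], '$');   // drop the leading dollar sign
}

echo $price, PHP_EOL;
```

If the marker appears several times, $parts holds one extra element per occurrence, so the same idea extends to multiple variations by looping over $parts from index 1 onward.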

Accepted Solution

by: Ray Paseur (earned 2000 total points)
ID: 36549605
Well, this is as far as I could get in a short period of time.
http://www.laprbass.com/RAY_temp_rlb1.php

Some comments are in order.

1. Ask the author of the web page if they publish an API.  If they do, use it so you do not have to write this scraping script.
2. As you will see, the data is highly variable and not well formatted.  Simple REGEX will not get what you need -- it's a bigger programming project than you may think!
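3. Regular expressions have a set of characters called "metacharacters" that must be escaped to match literally; most of them lose their special meaning inside character classes.  The / slash is not a metacharacter and only needs to be escaped when it is used as the regex delimiter.  The angle brackets used in HTML, < and >, are not metacharacters by themselves (they are special only inside constructs such as named groups and lookbehinds), so escaping them is harmless but optional.  The regex delimiter, in this case %, must be escaped wherever it appears inside the pattern.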
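
A small sketch of the delimiter point in item 3: the same pattern written with % and with / as the delimiter, plus preg_quote() for literal text that contains the delimiter. The sample strings are invented for the example.

```php
<?php
$html = '<a href="x.html"><img src="productImage/BIG/cable.jpg"></a>';

// With % as the delimiter, the slash is an ordinary character.
preg_match('%src="productImage/BIG/(.*?)"%', $html, $a);

// With / as the delimiter, every literal slash has to be escaped.
preg_match('/src="productImage\/BIG\/(.*?)"/', $html, $b);

// preg_quote() escapes metacharacters in literal text; the second argument
// names the delimiter so that it gets escaped as well.
$literal = preg_quote('50% off (today)', '%');
$matched = preg_match('%' . $literal . '%', 'Now 50% off (today) only');

var_dump($a[1] === $b[1], $matched);
```

Picking a delimiter that does not occur in the pattern, as the % here does for HTML paths, keeps the expression easier to read.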

Sorry I cannot do more, but I really think the API is your best approach.  Otherwise you've got a lot of programming ahead of you.  And you run the risk that all "scraper scripts" run, namely, that the publisher of the web page will make a small change and suddenly your script will fail without warning.  An API is usually a versioned piece of software and it usually defines a formal interface to acquire the data, so unpredictable changes are extremely rare.
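
If scraping is the only option, one way to soften the "fails without warning" risk is to make a missing match an error instead of a silent blank. The helper below is a hypothetical sketch (the name extractOrFail and the sample HTML are assumptions for illustration), not part of the script on the linked page.

```php
<?php
// Hypothetical helper: returns the first capture group or throws, so a
// changed page layout is noticed immediately instead of storing blanks.
function extractOrFail($pattern, $html, $label)
{
    $result = preg_match($pattern, $html, $m);
    if ($result !== 1 || !isset($m[1])) {
        throw new RuntimeException("No match for $label; has the page layout changed?");
    }
    return $m[1];
}

$html = '<span class="productPage_prodName">XLR Cable</span>';

// This pattern matches the sample, so the title is printed.
echo extractOrFail('%<span class="productPage_prodName">(.*?)</span>%', $html, 'TITLE'), PHP_EOL;

// This pattern does not match the sample, so the helper throws.
try {
    extractOrFail('%<span class="productPage_listPrice">(.*?)</span>%', $html, 'MSRP');
} catch (RuntimeException $e) {
    echo $e->getMessage(), PHP_EOL;
}
```

A check like this does not prevent breakage, but it turns a quiet data problem into one that shows up in logs the first time the page changes.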

Best of luck with the project, ~Ray
<?php // RAY_temp_rlb1.php
error_reporting(E_ALL);
echo "<pre>";

// THE PAGE WE WANT TO SCRAPE
$url = 'http://www.tecnec.com/Product.asp?baseItem=BSC100XXJ&cat=CABLES&subcat=AUDIOCAB&prodClass=AXLRXLR&mfg=&search=0&off=';

// READ THE PAGE
$htm = file_get_contents($url);
// echo htmlentities($htm);

// THE REGULAR EXPRESSIONS TO ISOLATE THE THINGS WE WANT
$regexs = array
( 'TITLE' => '%\<span class="productPage_prodName"\>(.*?)\</span\>%i'
, 'IMAGE' => '%\<a href=".*?"\>\<img src="productImage/BIG/(.*?)" width="225" height="225" alt=".*?" border="0"\></a\>%i'
, 'DESCR' => '%\<span class="productPage_longText"\>(.*?)\</span\>%si'
, 'SKU'   => '%class="productPage_itemid" value="(.*?)"%i'
, 'TTEXT' => '%class="productPage_tableText"\>(.*?)\</span\>%i'
)
;

// COLLECT EVERY MATCH FOR EACH PATTERN, KEYED BY FIELD NAME
$new = array();
foreach ($regexs as $name => $regex)
{
    preg_match_all($regex, $htm, $mat);
    $new[$name] = $mat[1];
}

print_r($new);

 

Author Closing Comment

by:rlb1
ID: 36549695
Thanks!