Solved

How do I structure this preg_match script to get all results from the page?

Posted on 2011-09-15
Last Modified: 2012-05-12

How do I correctly structure this script? Each product has a single title, description, and image, but can have up to 9 variations.

How do I include the title, description, and image with each variation, and have each variation submitted to the database individually? (I considered using a loop, but I have no good way to count the number of variations.)

The current script shows only the last variation as the result.

See the attached screenshot for the layout I am working with.

Thanks for your help!







$url_3 = $produrl3;   // each product url from DB Table
curl_setopt($ch, CURLOPT_URL, $url_3); // set url to go fetch
$buffer = curl_exec($ch); // run the process

// TITLE						
preg_match('%<span class="productPage_prodName">(.*?)<\/span>%',$buffer,$matches31);

$prodname=$matches31[1];

echo $prodname."<BR>";



// IMAGE
preg_match('%<a href=".*?"><img src="productImage\/BIG\/(.*?)" width="225" height="225" alt=".*?" border="0"><\/a>%',$buffer,$matches32);

$image=$matches32[1];

echo $image."<BR>";




// DESCRIPTION
preg_match('%<span class="productPage_longText">\s*(.*?)\s*<\/span>%s',$buffer,$matches33);

$description=$matches33[1];

echo $description."<BR>";



//  VARIATIONS  START HERE  -  Up To 9 Variations per product.


// SKU OPTIONS
preg_match_all('%<td align="left" valign="top" width="125">\s*<input type="text" name=".*?" style="background:.*?;" readonly="readonly" class="productPage_itemid" value="(.*?)" \/><\/td>%',$buffer,$matches34,PREG_SET_ORDER);
								

foreach ($matches34 as $val34) {

$sku=$val34[1];

echo $sku."<br />";


}



// TITLE
preg_match_all('%<td align="left" valign="top" width="250">\s*<span class="productPage_tableText">(.*?)<\/span><\/td>\s*<\/tr>%',$buffer,$matches35,PREG_SET_ORDER);
								

foreach ($matches35 as $val35) {

$title=$val35[1];

echo $title."<br />";



}




// Weight
preg_match_all('%<tr>\s*<td align="center">\s*<span class="productPage_tableText">Weight:<\/span><\/td>\s*<\/tr>\s*<tr>\s*<td align="center">\s*<span class="productPage_weight">(.*?)lbs\.<\/span><\/td>\s*<\/tr>%',$buffer,$matches36,PREG_SET_ORDER);
								

foreach ($matches36 as $val36) {

$weight=$val36[1];

echo $weight."<br />";



}


// MSRP
preg_match_all('%<tr>\s*<td align="center">\s*<span class="productPage_tableText">List<\/span><\/td>\s*<\/tr>\s*<tr>\s*<td align="center">\s*<span class="productPage_tableText">Price:<\/span><\/td>\s*<\/tr>\s*<tr>\s*<td align="center">\s*<span class="productPage_listPrice">\$(.*?)<\/span><\/td>\s*<\/tr>%s',$buffer,$matches37,PREG_SET_ORDER);
								

foreach ($matches37 as $val37) {

$MSRP=$val37[1];

echo $MSRP."<br />";



}




// COST
preg_match_all('%<tr>\s*<td align="center">\s*<span class="productPage_tableText">Dealer<\/span><\/td>\s*<\/tr>\s*<tr>\s*<td align="center">\s*<span class="productPage_tableText">Price:<\/span><\/td>\s*<\/tr>\s*<tr>\s*<td align="center">\s*<span class="productPage_ourPrice">\$(.*?)<\/span><\/td>\s*<\/tr>%s',$buffer,$matches38,PREG_SET_ORDER);
								

foreach ($matches38 as $val38) {

$COST=$val38[1];

echo $COST."<br />";



}

mysql_query("INSERT INTO `products` (prodname, image, sku, title, weight, MSRP, COST, description) VALUES ('".mysql_real_escape_string($prodname)."','".mysql_real_escape_string($image)."','".mysql_real_escape_string($sku)."','".mysql_real_escape_string($title)."','".mysql_real_escape_string($weight)."','".mysql_real_escape_string($MSRP)."','".mysql_real_escape_string($COST)."','".mysql_real_escape_string($description)."')");


screenshot of distributor website
Question by:rlb1
Expert Comment

by:Ray Paseur
ID: 36548747
Typically you would not need to count the number of variations -- you can use foreach() to iterate over an indeterminate number of elements in an array.
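As a minimal sketch of that point: preg_match_all() returns however many matches the page contains, and foreach() walks them without any count being known in advance. The HTML fragment and SKU values here are made up for illustration; only the pattern is borrowed from this thread.

```php
<?php
// Hypothetical fragment standing in for the real page; the actual markup may differ.
$html = '<input class="productPage_itemid" value="SKU-A" />'
      . '<input class="productPage_itemid" value="SKU-B" />'
      . '<input class="productPage_itemid" value="SKU-C" />';

// Collect every SKU; the number of variations does not need to be known up front.
preg_match_all('%class="productPage_itemid" value="(.*?)"%i', $html, $matches);

$skus = array();
foreach ($matches[1] as $sku) {
    $skus[] = $sku;           // one iteration per variation found
    echo $sku . PHP_EOL;
}
```

The same loop works unchanged whether the page lists one variation or nine.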

Here is how to get a good answer to your question.  Show us the input data.  Not a picture or abstraction, the ACTUAL input data you want to use.  Post a link to the web page you're trying to scrape, or copy the HTML and post it here in the code snippet.  Give us the ENTIRE page, not a fragment.

Once we have that we can show you a way to tease out the data elements you want to put into the query.  Sometimes it is easier to use explode() instead of wrestling with the awkward syntax of regex.  But in any case, once we see the HTML in clear text we can show you how to parse it.
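To illustrate the explode() alternative, here is a rough sketch that pulls the text between two literal markers. The helper name text_between() and the sample HTML are assumptions made for this example, not part of the original script.

```php
<?php
// Hypothetical helper: return the text between two literal markers, or null if either is missing.
function text_between($haystack, $start, $end)
{
    // Split once on the opening marker; everything after it is in $parts[1].
    $parts = explode($start, $haystack, 2);
    if (count($parts) < 2) {
        return null;
    }
    // Split once on the closing marker; the wanted text is in $rest[0].
    $rest = explode($end, $parts[1], 2);
    if (count($rest) < 2) {
        return null;
    }
    return $rest[0];
}

// Made-up sample markup for illustration only.
$html = '<p><span class="productPage_prodName">Sample Cable</span></p>';
$name = text_between($html, '<span class="productPage_prodName">', '</span>');
echo $name . PHP_EOL;
```

This only suits values that appear once with fixed surrounding text; repeated items such as the variations are probably still easier with preg_match_all().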

Accepted Solution

by:
Ray Paseur earned 2000 total points
ID: 36549605
Well, this is as far as I could get in a short period of time.
http://www.laprbass.com/RAY_temp_rlb1.php

Some comments are in order.

1. Ask the author of the web page if they publish an API.  If they do, use it so you do not have to write this scraping script.
2. As you will see, the data is highly variable and not well formatted.  Simple REGEX will not get what you need -- it's a bigger programming project than you may think!
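3. Regular expressions have a set of characters called "metacharacters" that must be escaped to match literally (most of them lose their special meaning inside character classes).  The / slash is not a metacharacter and never needs to be escaped unless it has been used as the regex delimiter.  The angle brackets used in HTML, < and >, are not metacharacters on their own in PCRE, so the backslashes in front of them in the patterns below are optional and harmless.  The regex delimiter, in this case %, must be escaped whenever it appears literally inside the pattern.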
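A small sketch of the delimiter rule in point 3, using % as the delimiter. The "discount" markup is invented for the example; preg_quote() with the delimiter passed as its second argument is the standard PHP way to escape a literal string for use in a pattern.

```php
<?php
// Made-up markup containing both a slash and a percent sign.
$html = '<span class="discount">Save 15% today</span>';

// With % as the delimiter, the slash in </span> needs no backslash.
$found = preg_match('%<span class="discount">(.*?)</span>%', $html, $m);

// A literal % inside the pattern must be escaped, because % is the delimiter here.
// preg_quote() adds the backslash when told which delimiter is in use.
$literal = preg_quote('15%', '%');
$hasPercent = preg_match('%' . $literal . '%', $html);

echo $m[1] . PHP_EOL;
echo $literal . PHP_EOL;
```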

Sorry I cannot do more, but I really think the API is your best approach.  Otherwise you've got a lot of programming ahead of you.  And you run the risk that all "scraper scripts" run, namely, that the publisher of the web page will make a small change and suddenly your script will fail without warning.  An API is usually a versioned piece of software and it usually defines a formal interface to acquire the data, so unpredictable changes are extremely rare.
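One way to soften the "fails without warning" risk is to make the script complain as soon as a pattern stops matching. The sketch below assumes a hypothetical helper named require_matches(); the sample markup is invented, and real error handling would depend on how the script is run.

```php
<?php
// Hypothetical helper: run a pattern and throw if it finds nothing,
// so a markup change is noticed instead of silently storing empty values.
function require_matches($regex, $html, $label)
{
    $count = preg_match_all($regex, $html, $mat);
    if ($count === false || $count === 0) {
        throw new RuntimeException("No matches for $label; the page layout may have changed.");
    }
    return $mat[1];
}

// Made-up markup standing in for the fetched page.
$html = '<input class="productPage_itemid" value="SKU-A" />';

try {
    $skus = require_matches('%class="productPage_itemid" value="(.*?)"%i', $html, 'SKU');
    echo implode(', ', $skus) . PHP_EOL;
} catch (RuntimeException $e) {
    // Log or alert here rather than writing incomplete rows to the database.
    echo $e->getMessage() . PHP_EOL;
}
```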

Best of luck with the project, ~Ray
<?php // RAY_temp_rlb1.php
error_reporting(E_ALL);
echo "<pre>";

// THE PAGE WE WANT TO SCRAPE
$url = 'http://www.tecnec.com/Product.asp?baseItem=BSC100XXJ&cat=CABLES&subcat=AUDIOCAB&prodClass=AXLRXLR&mfg=&search=0&off=';

// READ THE PAGE
$htm = file_get_contents($url);
// echo htmlentities($htm);

// THE REGULAR EXPRESSIONS TO ISOLATE THE THINGS WE WANT
$regexs = array
( 'TITLE' => '%\<span class="productPage_prodName"\>(.*?)\</span\>%i'
, 'IMAGE' => '%\<a href=".*?"\>\<img src="productImage/BIG/(.*?)" width="225" height="225" alt=".*?" border="0"\></a\>%i'
, 'DESCR' => '%\<span class="productPage_longText"\>(.*?)\</span\>%si'
, 'SKU'   => '%class="productPage_itemid" value="(.*?)"%i'
, 'TTEXT' => '%class="productPage_tableText"\>(.*?)\</span\>%i'
)
;

// COLLECT EVERY MATCH FOR EACH PATTERN, KEYED BY NAME
$new = array();
foreach ($regexs as $name => $regex)
{
    preg_match_all($regex, $htm, $mat);
    $new[$name] = $mat[1];
}

print_r($new);
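Returning to the original question (one database row per variation, each carrying the shared title, image, and description), here is a rough sketch under some loud assumptions: the per-variation arrays line up by index, the column names come from the question's INSERT statement, and PDO prepared statements are swapped in for the mysql_* functions, which were removed in PHP 7.0. The helper names build_rows() and insert_rows() and all sample values are hypothetical.

```php
<?php
// Hypothetical helper: merge product-level fields into one row per variation.
// Assumes every variation array is a 0-indexed list with matching positions.
function build_rows(array $product, array $variations)
{
    $counts = array_map('count', $variations);
    if (count(array_unique($counts)) > 1) {
        throw new InvalidArgumentException('Variation arrays differ in length.');
    }
    $n = $counts ? reset($counts) : 0;

    $rows = array();
    for ($i = 0; $i < $n; $i++) {
        $row = $product;                      // shared title, image, description
        foreach ($variations as $field => $values) {
            $row[$field] = $values[$i];       // this variation's own value
        }
        $rows[] = $row;
    }
    return $rows;
}

// Hypothetical helper: one INSERT per row. Table and column names are taken
// from the question; $pdo is an already-connected PDO instance.
function insert_rows(PDO $pdo, array $rows)
{
    $stmt = $pdo->prepare(
        'INSERT INTO products (prodname, image, description, sku, title, weight, MSRP, COST)
         VALUES (:prodname, :image, :description, :sku, :title, :weight, :MSRP, :COST)'
    );
    foreach ($rows as $row) {
        $stmt->execute($row);
    }
}

// Made-up sample data in the shape the scraping step would produce.
$product = array(
    'prodname'    => 'Sample Cable',
    'image'       => 'sample.jpg',
    'description' => 'Example description',
);
$variations = array(
    'sku'    => array('A-1', 'A-2'),
    'title'  => array('3 ft', '6 ft'),
    'weight' => array('0.5', '0.9'),
    'MSRP'   => array('10.00', '14.00'),
    'COST'   => array('6.00', '8.50'),
);

$rows = build_rows($product, $variations);
print_r($rows);
```

The length check matters because, as noted above, the page data is highly variable: if one pattern misses a variation, the arrays no longer line up and it is safer to stop than to pair a SKU with the wrong price.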


Author Closing Comment

by:rlb1
ID: 36549695
Thanks!