rlb1

How do I adjust this preg match statement (use multiple times on same url)

Experts,

I modified the preg_match statement from the previous question to use with a URL.  It works with one preg_match, but when I use two preg_match statements on the same URL, the first result appears twice. (I realize there is probably a better way, but I am just getting into preg_match / regex coding.)

The result of the second preg match should be:

 <div class="active_content blacksm">
 <br />
 &bull; Warranty: Lifetime<br />
 &bull; Color: Black<br />
 &bull; Length: 0.5m<br />
 &bull; Mfr: Cables To Go<br />
 &bull; Weight: 0.590lbs<br />
 </div>


Thanks for your help!!!

<?php
$url = file_get_contents('http://www.cproducts.com/product.asp?cat_id=2030&sku=40294');
preg_match('%<div style="float:left; clear:both;">(.*?)</div>%s',$url,$extA);  // this one works
preg_match('%<div class="active_content blacksm">(.*?)</div>%s',$url,$extB);   // this one repeats the same as the first, however it is different

echo $extA[1];
echo "<BR>";
echo $extB[1];
?>


ASKER CERTIFIED SOLUTION
Ray Paseur
rlb1

ASKER

Ray,
Thanks!  Here is the previous code:  (I just took the code you provided me with and adjusted it a little.)

<?php // RAY_temp_rlb1.php
error_reporting(E_ALL);

$strtest = '<a title="some title here" href="some url here">';
preg_match('%<a title=[\"]some title here[\"] href=([^`]*?)[\"]>%',$strtest,$extA);
// You should get the result: some url here.
echo $extA[1];
 
Thanks for your help!!
Randy
This REGEX string says this:

1. Find a string starting with href=
2. Followed by a quote OR apostrophe OR any character, and capture this into a group
3. Followed by any number of any characters, and capture this into a group
4. Followed by a quote OR apostrophe OR (escaped) right wicket and capture this into a group
5. End the REGEX; the trailing i makes the match case-insensitive.

Please post back if you have any questions.  Best, ~Ray
<?php // RAY_temp_rlb1.php
error_reporting(E_ALL);
echo "<pre>" . PHP_EOL;

// TEST DATA
$arr = array
( '<a title="some title here" href="some url here">THING</a>'
, '<a title="some title here" href=urli>'
, "<a title='some title here' href='some url '>"
)
;

// A REGULAR EXPRESSION
$rgx = <<<REGEX
#href=("|'|.)(.+?)("|'|\>)#i
REGEX;

// GET THE HREF= STRING - WHETHER BOUNDED BY QUOTES OR NOT
foreach ($arr as $str)
{
    $new = preg_match($rgx, $str, $ext);

    // THESE ARE EQUAL IF THE URL WAS WRAPPED WITH A QUOTE OR APOSTROPHE DELIMITER
    if ($ext[1] == $ext[3])
    {
        $url = $ext[2];
    }
    else
    {
        $url = $ext[1] . $ext[2];
    }
    var_dump($url);
}



ASKER

Ray,
I am a little lost.  I have worked on this for a few hours and I cannot figure this out.  Regex is tough!! and I am still getting my hands around arrays.

I am trying to obtain the data within these tags from a URL
 
$url = file_get_contents('http://www.cproducts.com/product.asp?cat_id=2030&sku=40294');


%<div style="float:left; clear:both;">(.*?)</div>%s         (Description)

%<div class="active_content blacksm"><br />(.*?)</div>%s      (Specs)

I am also trying to get only the "40294a.jpg" out of this line.

<img src="/product-images/40294/50/40294a.jpg" style="border:solid 1px #D6D6D6; border-collapse:separate;" />


If you can give me some assistance coding this, I can better figure this out...  Thank You!!!
Get this book and work through the examples.  It will not make you a pro, but it is very readable and has great examples.  It will give you some foundation in PHP and all of your questions will be easier to frame when you post them here at EE.
http://www.sitepoint.com/books/phpmysql4/

I'll take a look at that URL in a moment...
Prints: 40294a.jpg
<?php // RAY_temp_rlb1.php
error_reporting(E_ALL);
echo "<pre>" . PHP_EOL;

// STATED GOAL: I am also trying to get only the "40294a.jpg" out of this line.

// TEST DATA
$tag = <<<TAG
<img src="/product-images/40294/50/40294a.jpg" style="border:solid 1px #D6D6D6; border-collapse:separate;" />
TAG;

// A REGULAR EXPRESSION
$rgx = <<<REGEX
#src=("|'|.)(.+?)("|'| |\>)#i
REGEX;

// GET THE STRING - WHETHER BOUNDED BY QUOTES OR NOT
$new = preg_match($rgx, $tag, $ext);

// THESE ARE EQUAL IF THE URL WAS WRAPPED WITH A QUOTE OR APOSTROPHE DELIMITER
if ($ext[1] == $ext[3])
{
    $url = $ext[2];
}
else
{
    $url = $ext[1] . $ext[2];
}

// ACTIVATE THIS TO SEE THE ISOLATED URL (FILE PATH)
// var_dump($url);

// GET THE FILE NAME FROM THE FILE PATH
// end() TAKES ITS ARGUMENT BY REFERENCE, SO STORE THE ARRAY IN A VARIABLE FIRST
$parts = explode('/', $url);
$fnm = end($parts);
echo $fnm;


When I tried the URL posted above, I got this output:

CREATIVE PRODUCTS

We're sorry, but we were unable to locate the file you requested.

Let's try this instead.  Visit the page you want us to scrape data from.  Use "View source" and copy the HTML.  Post that in the code snippet, and we can work with the posted data.

Regarding this, "Regex is tough!!" -- Yep.  It's a language made up almost entirely from punctuation, and it creates rules that interact in complex ways.  There are entire books about regular expressions.  It easily forms a semester of an engineering curriculum, so don't be surprised if it takes a while to master.  Most software developers never master regular expressions, and many who think they have mastered regex publish expressions that are full of holes and errors.  For example, the regex I posted above says this:

#href=("|'|.)(.+?)("|'|\>)#i

It is wrong because the third group (terminator) should also contain an "or" condition for the blank, like this:

#href=("|'|.)(.+?)("|'| |\>)#i

That is because a blank would terminate a URL: a URL has to be URL-encoded, and the encoding turns any blanks into %20 (or into plus signs in a query string).
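To see the difference the blank makes, here is a minimal sketch that runs both expressions against an unquoted href followed by a second attribute. The test tag and the helper name extract_href() are made up for illustration; the two patterns are the ones shown above.

```php
<?php
// Hypothetical test tag: an unquoted href followed by another attribute.
$tag = '<a href=page.html title=demo>';

$old = '#href=("|\'|.)(.+?)("|\'|\>)#i';   // terminator: quote, apostrophe, or >
$new = '#href=("|\'|.)(.+?)("|\'| |\>)#i'; // terminator also accepts a blank

// Hypothetical helper: returns the URL, or null when the pattern does not match.
function extract_href($rgx, $tag)
{
    if (!preg_match($rgx, $tag, $ext)) {
        return null;
    }
    // Equal delimiters mean the URL was quoted; otherwise group 1 is its first character.
    return ($ext[1] == $ext[3]) ? $ext[2] : $ext[1] . $ext[2];
}

$without_blank = extract_href($old, $tag);
$with_blank    = extract_href($new, $tag);

var_dump($without_blank); // runs on into the next attribute
var_dump($with_blank);    // stops at the blank

// How PHP's own encoders treat a blank.
var_dump(rawurlencode('my page.html')); // my%20page.html
var_dump(urlencode('my page.html'));    // my+page.html
```

Without the blank in the third group, the lazy match keeps going until it reaches the closing wicket, so the title attribute is swallowed into the URL.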

I often find that I can use some combination of strpos(), substr(), and explode() to get the strings I want, and I can get those right faster than I can write the regular expressions.  There is no extra credit for using regex.  The reward is working code, gotten as fast and accurately as possible.  Just a thought...
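As a rough sketch of that approach, assuming markup shaped like the snippets quoted in this thread: the sample HTML below is invented so the script runs offline, and between() is a hypothetical helper, not a built-in PHP function.

```php
<?php
// Hypothetical sample markup, modeled on the snippets posted in this thread.
$html = <<<HTML
<div class="active_content blacksm">
<br />
&bull; Warranty: Lifetime<br />
&bull; Color: Black<br />
</div>
<img src="/product-images/40294/50/40294a.jpg" style="border:solid 1px #D6D6D6;" />
HTML;

// Hypothetical helper: text between two markers, or null if either is missing.
function between($haystack, $open, $close)
{
    $start = strpos($haystack, $open);
    if ($start === false) {
        return null;
    }
    $start += strlen($open);
    $end = strpos($haystack, $close, $start);
    if ($end === false) {
        return null;
    }
    return substr($haystack, $start, $end - $start);
}

$specs = between($html, '<div class="active_content blacksm">', '</div>');
$path  = between($html, '<img src="', '"');

// Take the last path segment as the file name.
$file = null;
if ($path !== null) {
    $parts = explode('/', $path);
    $file  = end($parts);
}

echo trim($specs) . PHP_EOL;
echo $file . PHP_EOL;
```

This only works when the opening marker appears exactly as typed, so it is no more tolerant of markup changes than a regex, but each step is easy to check with var_dump().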

ASKER

Ray,  
What am I missing here on this array?   I have attempted several things here and cannot get it to work.

Thanks for your help!

Randy


<?php
$url = file_get_contents('http://www.cablestogo.com/product.asp?cat_id=2030&sku=40294');
/*
//preg_match('%<div style="float:left; clear:both;">(.*?)</div>%s','%<div style="float:left; clear:both;">(.*?)</div>%s',$url,$extA);
preg_match('%<div style="float:left; clear:both;">(.*?)</div>%s',$url,$extA);

//echo $extA[0];
echo $extA[1];
echo $extA[2];
*/

preg_match("/(%<div style="float:left; clear:both;">(.*?)</div>%s|%<div style="float:left; clear:both;">(.*?)</div>%s)/i", $line, $url);
echo $line[1];
echo $line[2];


?>



ASKER

Ray, Got it to work with hours of manipulation!!  Thanks for your help!!
<?php
$data=file_get_contents('http://www.c.com/product.asp?cat_id=2030&sku=40294');
preg_match('%<div style="float:left; clear:both;">
(.*?)</div>%s',$data,$matches);

$a=$matches[1];
echo $a."<br />";


preg_match('%<div class="active_content blacksm">
	             <br />
(.*?)</div>%s',$data,$matches2);

$b=$matches2[1];
echo $b."<br />";

?>
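
The patterns above work because the line breaks and indentation typed into them happen to match the page exactly. A possible variation, offered as a sketch rather than a tested replacement, uses \s* to absorb whatever whitespace follows the tags and checks the return value of preg_match() before reading the match. The sample HTML (including the description text) is invented so the script runs without fetching the page.

```php
<?php
// Hypothetical sample standing in for the downloaded page.
$data = <<<HTML
<div style="float:left; clear:both;">
Short description here.</div>
<div class="active_content blacksm">
	             <br />
&bull; Warranty: Lifetime<br />
</div>
HTML;

// \s* absorbs any run of spaces, tabs, and newlines around the captured text.
$rgxA = '%<div style="float:left; clear:both;">\s*(.*?)\s*</div>%s';
$rgxB = '%<div class="active_content blacksm">\s*<br />\s*(.*?)\s*</div>%s';

// preg_match() returns 1 on a match, so test it before using the captured group.
$description = preg_match($rgxA, $data, $matches)  ? $matches[1]  : null;
$specs       = preg_match($rgxB, $data, $matches2) ? $matches2[1] : null;

var_dump($description);
var_dump($specs);
```

If the site changes its indentation, these patterns should still match, whereas the literal line breaks in the version above would stop matching.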



ASKER

Thank you Ray!!
Congratulations!