preg_match_all problem in PHP

GVNPublic123 asked:
Hello,

What I want to do is get some usernames from a specific page. Here is the syntax of my $content:

<a href="/user/DaveBlender" onmousedown="yt.analytics.trackEvent('ChannelPage', 'subscriptions_text_link', '15Gigs - DaveBlender')" title="DaveBlender">
.......code in between.....
<a href="/user/electricdreamdesigns" onmousedown="yt.analytics.trackEvent('ChannelPage', 'subscriptions_text_link', '15Gigs - electricdreamdesigns')" title="electricdreamdesigns">
.....etc

So what I would like to do is fetch the usernames themselves, e.g. DaveBlender and electricdreamdesigns, and store them in an array.

Does anyone know what pattern I should use with preg_match_all for this?

Thanks a ton!

Commented:
You should be able to use negative and positive assertions.  Something like this:
// I can't quite remember how... sorry, I'll look it up but try this it might work
$needle = "/title=\"([^\"]+)\"/";

$haystack = "<a href=\"/user/DaveBlender\" onmousedown=\"yt.analytics.trackEvent('ChannelPage', 'subscriptions_text_link', '15Gigs - DaveBlender')\" title=\"DaveBlender\">";

preg_match_all($needle, $haystack, $matches);

var_dump($matches);


Author

Commented:
I don't think that will do it; there are too many title= attributes in the code, and I need 100% correct results.

Commented:
I don't understand.  Do you only want to get DaveBlender and electricdreamdesigns?  If so, we can get only these by doing this:
// I can't quite remember how... sorry, I'll look it up but try this it might work
$needle = "/title=(\"DaveBlender\"|\"electricdreamdesigns\")/";

$haystack = "<a href=\"/user/DaveBlender\" onmousedown=\"yt.analytics.trackEvent('ChannelPage', 'subscriptions_text_link', '15Gigs - DaveBlender')\" title=\"DaveBlender\">";

preg_match_all($needle, $haystack, $matches);

var_dump($matches);


Author

Commented:
No, I only want to get data out of links of those types.

There are about 100 links on the website, and I only want those containing usernames.
So a more complex pattern is required. See the source:
view-source:http://www.youtube.com/profile?user=15Gigs&view=friends

Commented:
I'm not the best with regular expressions, but something like this might work.  When you say, "get data out of links," what data do you mean?  If you mean the URL, try this:

By the way, YouTube uses the same user name in the URL and in the title attribute, so prefixing the title value with /user/ also gives a valid URL.
// here's how you might extract the URL via regex
$needle = "/<a href=\"([^\"]+)\"/";

$haystack = "<a href=\"/user/DaveBlender\" onmousedown=\"yt.analytics.trackEvent('ChannelPage', 'subscriptions_text_link', '15Gigs - DaveBlender')\" title=\"DaveBlender\">";

preg_match_all($needle, $haystack, $matches);

var_dump($matches);

// the first captured URL is in capture group 1
echo $matches[1][0]; // should echo something like /user/DaveBlender


Author

Commented:
What I tried to say is that there are many <a href> tags on the page, and some of them are not related to users. Those will give false results, which will then junk my app.

What I need is to make sure that only links containing users are processed, so maybe some kind of regex pattern that is unique and captures only this kind of link.
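
One way to keep unrelated links out of the results might be to anchor the pattern on the parts that only the user links share: the /user/ prefix in href and the trackEvent('ChannelPage' call in onmousedown. The following is a minimal sketch, assuming the markup looks like the sample in the question; the function name extract_usernames and the sample $content (including the /about link) are made up for illustration.

```php
<?php
// Hypothetical helper: returns the user names found in channel links.
// Assumes each user link has href="/user/NAME" followed by an onmousedown
// attribute that calls yt.analytics.trackEvent('ChannelPage', ...).
function extract_usernames(string $content): array
{
    // "~" is the delimiter, so the slashes in /user/ need no escaping.
    $pattern = '~<a\s+href="/user/([^"/]+)"\s+onmousedown="yt\.analytics\.trackEvent\(\'ChannelPage\'~';

    preg_match_all($pattern, $content, $matches);

    // $matches[1] holds the text captured by the first group for every match.
    return $matches[1];
}

// Made-up sample input: two user links and one unrelated link.
$content = '<a href="/user/DaveBlender" onmousedown="yt.analytics.trackEvent(\'ChannelPage\', \'subscriptions_text_link\', \'15Gigs - DaveBlender\')" title="DaveBlender">'
    . '<a href="/about" title="About">'
    . '<a href="/user/electricdreamdesigns" onmousedown="yt.analytics.trackEvent(\'ChannelPage\', \'subscriptions_text_link\', \'15Gigs - electricdreamdesigns\')" title="electricdreamdesigns">';

print_r(extract_usernames($content));
```

Because the username is captured directly, no second pass over the title attribute is needed. If the attribute order or spacing on the real page differs from the sample, the pattern would have to be loosened accordingly.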

Commented:
There are two ways to do this.  One is to use a more complex regex, which is tried below.  The other is to use two simple regexes, also tried below.

The first regex uses a positive lookahead assertion.
// more complex regex
$needle = "/<a href=(\"\w+\") onmousedown=(".*") (?=title=("\w+\"))/";

$haystack = "<a href=\"/user/DaveBlender\" onmousedown=\"yt.analytics.trackEvent('ChannelPage', 'subscriptions_text_link', '15Gigs - DaveBlender')\" title=\"DaveBlender\">";

preg_match_all($needle, $haystack, $matches);

var_dump($matches);

// using two simple regex
$needle = "/<a href=(\"\w+\")/";

$haystack = "<a href=\"/user/DaveBlender\" onmousedown=\"yt.analytics.trackEvent('ChannelPage', 'subscriptions_text_link', '15Gigs - DaveBlender')\" title=\"DaveBlender\">";

preg_match_all($needle, $haystack, $matches);

foreach ($matches as $match) {

    if ($preg_match("/title=(\"\w+\")/", $match) {

        $array[] = $match;

    }

}

var_dump($array);


Commented:
Fixed a syntax error.  Let me know how it works, thanks.
// more complex regex
$needle = "/<a href=(\"\w+\") onmousedown=(".*") (?=title=("\w+\"))/";

$haystack = "<a href=\"/user/DaveBlender\" onmousedown=\"yt.analytics.trackEvent('ChannelPage', 'subscriptions_text_link', '15Gigs - DaveBlender')\" title=\"DaveBlender\">";

preg_match_all($needle, $haystack, $matches);

var_dump($matches);

// using two simple regex
$needle = "/<a href=(\"\w+\")/";

$haystack = "<a href=\"/user/DaveBlender\" onmousedown=\"yt.analytics.trackEvent('ChannelPage', 'subscriptions_text_link', '15Gigs - DaveBlender')\" title=\"DaveBlender\">";

preg_match_all($needle, $haystack, $matches);

foreach ($matches as $match) {

    if (preg_match("/title=(\"\w+\")/", $match) {

        $array[] = $match;

    }

}

var_dump($array);


Author

Commented:
I get null results for complex regex, and error "preg_match() expects parameter 2 to be string, array given" for simple regex.
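
The "array given" error most likely comes from how preg_match_all fills $matches: by default it is an array of arrays, where $matches[0] lists the full matches and $matches[1] lists the first capture group, so looping over $matches itself hands an array to preg_match. A minimal sketch of the two-step approach, assuming a simplified haystack (the sample links and the $users variable are illustrative only):

```php
<?php
// Simplified, made-up haystack with two links.
$haystack = '<a href="/user/DaveBlender" title="DaveBlender">'
    . '<a href="/user/electricdreamdesigns" title="electricdreamdesigns">';

// Step 1: match each whole opening <a ...> tag.
preg_match_all('/<a\s[^>]*>/', $haystack, $matches);

$users = [];

// Step 2: loop over $matches[0] (the full-match strings), not $matches.
foreach ($matches[0] as $tag) {
    // Keep the title value only when the tag has a title attribute.
    if (preg_match('/title="(\w+)"/', $tag, $m)) {
        $users[] = $m[1];
    }
}

var_dump($users);
```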

Commented:
Sorry, there were some more syntax errors.  Try this, please.
// more complex regex
$needle = "/<a href=(\"\w+") onmousedown=(".*") (?=title=("\w+"))/";

$haystack = "<a href=\"/user/DaveBlender\" onmousedown=\"yt.analytics.trackEvent('ChannelPage', 'subscriptions_text_link', '15Gigs - DaveBlender')\" title=\"DaveBlender\">";

preg_match_all($needle, $haystack, $matches);

var_dump($matches);

// using two simple regex
$needle = "/<a href=(\"\w+\")/";

$haystack = "<a href=\"/user/DaveBlender\" onmousedown=\"yt.analytics.trackEvent('ChannelPage', 'subscriptions_text_link', '15Gigs - DaveBlender')\" title=\"DaveBlender\">";

preg_match_all($needle, $haystack, $matches);

foreach ($matches as $match) {

    if (preg_match("/title=(\"\w+\")/", $match)) {

        $array[] = $match;

    }

}

var_dump($array);


Commented:
I looked it over, again, and I think this code is better.  Let me know what happens.
// more complex regex
$needle = "/<a href=\"(\w+)\" onmousedown=(".*") (?=title="(\w+)\")/";

$haystack = "<a href=\"/user/DaveBlender\" onmousedown=\"yt.analytics.trackEvent('ChannelPage', 'subscriptions_text_link', '15Gigs - DaveBlender')\" title=\"DaveBlender\">";

preg_match_all($needle, $haystack, $matches);

var_dump($matches);

// using two simple regex
$needle = "/<a href=(\"\w+\")/";

$haystack = "<a href=\"/user/DaveBlender\" onmousedown=\"yt.analytics.trackEvent('ChannelPage', 'subscriptions_text_link', '15Gigs - DaveBlender')\" title=\"DaveBlender\">";

preg_match_all($needle, $haystack, $matches);

foreach ($matches as $match) {

    if (preg_match("/title=(\"\w+\")/", $match)) {

        $array[] = $match;

    }

}

var_dump($array);


Author

Commented:
Syntax error on $needle in the complex regex.

"preg_match() expects parameter 2 to be string, array given" error in the simple regex.


Nothing is fixed from the previous code.

Commented:
Sorry for the confusion.  This regex should work better and I added the positive lookahead:

<\s*a\s+[^>]*href\s*=\s*[\"']?([^\"' >]+)[\"' >](?= onmousedown=\".*\" title=\".*\")

The regex itself works.  I tested it online.  We just have to make it work with PHP.  Try the code below, please.
// make regex to capture URL when a title exists
$needle = "<\s*a\s+[^>]*href\s*=\s*[\"']?([^\"' >]+)[\"' >](?= onmousedown=\".*\" title=\".*\")";

// make haystack
$haystack = "<a href=\"/user/DaveBlender\" onmousedown=\"yt.analytics.trackEvent('ChannelPage', 'subscriptions_text_link', '15Gigs - DaveBlender')\" title=\"DaveBlender\">";

// run regex
preg_match_all($needle, $haystack, $matches);

// output matches
print_r($matches);


Commented:
Further updated regex to support more or less whitespace:
<\s*a\s+[^>]*href\s*=\s*[\"']?([^\"' >]+)[\"' >](?=\s*onmousedown=\".*\"\s*title=\".*\")

Commented:
Don't mean to flood you with options, but the following regex will only look for the title attribute; it won't require that there be an onmousedown attribute:
<\s*a\s+[^>]*href\s*=\s*[\"']?([^\"' >]+)[\"' >](?=\s*.*\s*title=\".*\")
// make regex to capture URL when a title exists
$needle = "<\s*a\s+[^>]*href\s*=\s*[\"']?([^\"' >]+)[\"' >](?=\s*.*\s*title=\".*\")";

// make haystack
$haystack = "<a href=\"/user/DaveBlender\" onmousedown=\"yt.analytics.trackEvent('ChannelPage', 'subscriptions_text_link', '15Gigs - DaveBlender')\" title=\"DaveBlender\">";

// run regex
preg_match_all($needle, $haystack, $matches);

// output matches
print_r($matches);


Author

Commented:
Code:
$pattern = "<\s*a\s+[^>]*href\s*=\s*[\"']?([^\"' >]+)[\"' >](?= onmousedown=\".*\" title=\".*\")";                  
preg_match_all($pattern, $content, $matches);  

Output:
preg_match_all() [function.preg-match-all]: Unknown modifier ']' in /home/.../search.php on line 49

Line 49 is the second line in the code.

Commented:
Good time to point out: Perl-compatible regexes in PHP require a delimiter, such as "/", at the beginning and end of the pattern.  So, try this.
// make regex to capture URL when a title exists
$needle = "/<\s*a\s+[^>]*href\s*=\s*[\"']?([^\"' >]+)[\"' >](?=\s*.*\s*title=\".*\")/";

// make haystack
$haystack = "<a href=\"/user/DaveBlender\" onmousedown=\"yt.analytics.trackEvent('ChannelPage', 'subscriptions_text_link', '15Gigs - DaveBlender')\" title=\"DaveBlender\">";

// run regex
preg_match_all($needle, $haystack, $matches);

// output matches
print_r($matches);
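
As a side note on the "Unknown modifier ']'" message: without explicit delimiters, PHP takes the leading "<" of the pattern as the opening delimiter and the first ">" as the closing one, so the "]" that follows is read as a modifier. The delimiter does not have to be "/"; the sketch below uses "~" and a single-quoted string, which is only one possible style, and keeps the same expression and haystack as above.

```php
<?php
$haystack = '<a href="/user/DaveBlender" onmousedown="yt.analytics.trackEvent(\'ChannelPage\', \'subscriptions_text_link\', \'15Gigs - DaveBlender\')" title="DaveBlender">';

// Same expression as above, delimited with "~" instead of "/".
// In a single-quoted string only the single quotes need a backslash.
$needle = '~<\s*a\s+[^>]*href\s*=\s*["\']?([^"\' >]+)["\' >](?=\s*.*\s*title=".*")~';

preg_match_all($needle, $haystack, $matches);

// Group 1 holds the captured URLs.
print_r($matches[1]);
```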


Author

Commented:
works :)

Commented:
Thanks for the points.  I'm glad it's working, now.

Author

Commented:
EMB, I need help. YT just changed the page code; now the links look like this:

<a href="/user/jmeartists" onmousedown="yt.analytics.trackEvent('ChannelPage', 'subscriptions_image_link', '15Gigs - jmeartists')">

See source of:
view-source:http://www.youtube.com/profile?user=15Gigs&view=friends&start=0


I need new regex pattern, please help.

Author

Commented:
To help: I have the channel name (15Gigs) from '15Gigs - jmeartists' stored in the variable $target.

Terry Woods, IT Guru
Most Valuable Expert 2011

Commented:
I'd recommend you post a new question if you need a new solution

Commented:
I'd love to help.  If you haven't opened a new question already, let me know if this works.  Thanks.

Author

Commented:
I have opened one; see it there.

Commented:
I don't think it got posted...  I'll try again.

// make regex to capture URL when the ChannelPage onmousedown handler exists
$needle = "/<\s*a\s+[^>]*href\s*=\s*[\"']?([^\"' >]+)[\"' >](?=\s*onmousedown=\"yt\.analytics\.trackEvent\('ChannelPage')/";

// make haystack
$haystack = "<a href=\"/user/DaveBlender\" onmousedown=\"yt.analytics.trackEvent('ChannelPage', 'subscriptions_text_link', '15Gigs - DaveBlender')\" title=\"DaveBlender\">";

// run regex
preg_match_all($needle, $haystack, $matches);

// output matches
print_r($matches);
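
Since the channel name is already available in $target, another option might be to match on the '15Gigs - username' argument itself and capture the part after the dash. This is only a sketch, assuming the new link format quoted above; $content here is a made-up stand-in for the page source, and the second link is invented to show a non-match. preg_quote() is used so that any regex metacharacters in the channel name are treated literally.

```php
<?php
// Assumed inputs: $target is the channel name, $content the page source.
$target = '15Gigs';
$content = '<a href="/user/jmeartists" onmousedown="yt.analytics.trackEvent(\'ChannelPage\', \'subscriptions_image_link\', \'15Gigs - jmeartists\')">'
    . '<a href="/browse" onmousedown="yt.analytics.trackEvent(\'Masthead\', \'browse\', \'x\')">';

// Escape regex metacharacters in $target, including the "~" delimiter.
$quoted = preg_quote($target, '~');

// Capture NAME from the last argument, '<target> - NAME', of
// trackEvent('ChannelPage', ...).
$pattern = '~onmousedown="yt\.analytics\.trackEvent\(\'ChannelPage\',[^"]*\''
    . $quoted . ' - ([^\']+)\'\)"~';

preg_match_all($pattern, $content, $matches);

// Group 1 holds the user names.
print_r($matches[1]);
```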
