• Status: Solved

PHP Data Extracting

Hi, I'm having some trouble extracting data from websites; for now I mainly just need the titles of movies from Google.  I have tried the following code. I know I could maybe get the titles elsewhere, but my client insists this is what they want.  I'm also using cURL to fetch the page. Thanks!

$content = {grabbing from cURL};

//-----GET TEXT BETWEEN STRINGS------
function TextBetween($s1,$s2,$s){
  $s1 = strtolower($s1);
  $s2 = strtolower($s2);
  $L1 = strlen($s1);
  $scheck = strtolower($s);
  if($L1>0){$pos1 = strpos($scheck,$s1);} else {$pos1=0;}
  if($pos1 !== false){
    if($s2 == '') return substr($s,$pos1+$L1);
    $pos2 = strpos(substr($scheck,$pos1+$L1),$s2);
    if($pos2!==false) return substr($s,$pos1+$L1,$pos2);
  }
  return '';
}
//echo TextBetween('<td','</td>',$header_main);
echo TextBetween("<td colspan=6><a href=","</td>",$content);  

Here is what is inside the page that I need to grab the title from:

<td colspan=4><a href="/movies?hl=en&near=21225&sort=1&mid=b54727ef077c2117"><b>Michael Clayton</b></a>
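
A quick check against the sample line above suggests why the call prints nothing: the start marker asks for colspan=6 while the sample has colspan=4, and the sample line contains no closing </td>. The sketch below uses a hypothetical compact variant, textBetween(), written with stripos(); it is not the original function, only an illustration of the same idea.

```php
<?php
// Hypothetical compact variant of TextBetween() using stripos() (case-insensitive search).
function textBetween(string $start, string $end, string $haystack): string
{
    $from = $start === '' ? 0 : stripos($haystack, $start);
    if ($from === false) {
        return '';                      // start marker not found
    }
    $from += strlen($start);
    if ($end === '') {
        return substr($haystack, $from); // no end marker: return the rest
    }
    $to = stripos($haystack, $end, $from);
    return $to === false ? '' : substr($haystack, $from, $to - $from);
}

// The sample line quoted in the question.
$sample = '<td colspan=4><a href="/movies?hl=en&near=21225&sort=1&mid=b54727ef077c2117"><b>Michael Clayton</b></a>';

$miss  = textBetween('<td colspan=6><a href=', '</td>', $sample); // marker says 6, sample says 4
$title = textBetween('<b>', '</b>', $sample);                     // markers that do occur

var_dump($miss);   // string(0) ""
var_dump($title);  // string(15) "Michael Clayton"
```

This only returns the first occurrence, so a page with many movies would still need a loop or a different approach.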
Asked by: vfetty

b0lsc0ttCommented:
What is it you actually want from the string?  Just the title?  Is there other stuff?  Is the html always the same?

I suggest using a regular expression and something like preg_match() to do this.  Let me know details of what you need and other types of "text" you may get.  I can help you get the expression to extract the info you want.

bol
 
vfettyAuthor Commented:
hey b0lsc0tt, I think the goal is to extract the movie title, theater, and times.  I believe the HTML changes a bit (for example, the href value changes), but the rest stays the same.  Help is much appreciated after multiple hours of no success.
 
b0lsc0ttCommented:
Where are the theater and times in the content?  The movie title is obvious (I think) but I don't see the other 2 things.

To get the title, based on the text at the end of the question body, you use ...

preg_match_all('%<a[^>]+><b>([^<]+)</b></a>%', $content, $result, PREG_PATTERN_ORDER);
$title = $result[1];

That will get as many matches as possible in $content.  The next line puts the movie title part in a variable called $title (it is an array).

Let me know if you have a question.  You can modify the expression to get other info too and create other variables/arrays like on the last line.

bol
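
To make the layout of $result concrete, here is a minimal sketch run against made-up markup shaped like the line in the question (the two links and the second title are placeholders, not real page content). With PREG_PATTERN_ORDER, index 0 holds the full matches and index 1 holds whatever group 1 captured.

```php
<?php
// Made-up sample markup shaped like the line quoted in the question.
$content = '<td><a href="/movies?mid=aaa"><b>Michael Clayton</b></a></td>'
         . '<td><a href="/movies?mid=bbb"><b>Example Movie</b></a></td>';

// preg_match_all() returns the number of full matches it found.
$count = preg_match_all('%<a[^>]+><b>([^<]+)</b></a>%', $content, $result, PREG_PATTERN_ORDER);

// $result[0] holds every full match, $result[1] holds what group 1 captured.
$title = $result[1];
print_r($title);
```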
vfettyAuthor Commented:
Thanks. The only thing I get is "even more »". The link I am using is http://www.google.com/movies?sc=1&sort=1&near=21225&rl=1
 
b0lsc0ttCommented:
Thanks for the link.  "Even more" was from the top of the Google page.  The others were there too but I modified the expression to just get the movie titles.

preg_match_all('%<a[^>]+&mid=[^>]+><b>([^<]+)</b></a>%', $content, $result, PREG_PATTERN_ORDER);
$title = $result[1];

Let me know how that works or if you have a question.  If you want help with other stuff on the page then let me know.

bol
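
The effect of adding &mid= can be sketched on a small made-up snippet. The first link below stands in for the "even more" navigation link (its URL is a placeholder); the second is the movie link from the question. Only links whose tag contains &mid= survive the stricter expression.

```php
<?php
// Made-up markup: one navigation link and one movie link, both wrapped in <b>.
$content = '<a href="/intl/en/options/"><b>even more &raquo;</b></a>'
         . '<a href="/movies?hl=en&near=21225&sort=1&mid=b54727ef077c2117"><b>Michael Clayton</b></a>';

// Loose expression: any bold link matches.
preg_match_all('%<a[^>]+><b>([^<]+)</b></a>%', $content, $loose);

// Stricter expression: the tag must contain "&mid=" before its closing ">".
preg_match_all('%<a[^>]+&mid=[^>]+><b>([^<]+)</b></a>%', $content, $strict);

print_r($loose[1]);   // both link texts
print_r($strict[1]);  // only the movie title
```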
 
b0lsc0ttCommented:
Remember, the variable $title is an array.  To see all the items that matched the expression and were "captured" in the group you will need to loop through the items in the array.

bol
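
Looping over the captured titles might look like the sketch below. The markup is made up and the $lines array is just one illustrative way to collect output. The return value of preg_match_all() is the number of matches, or false on error, so it doubles as a guard.

```php
<?php
// Made-up markup with two movie links.
$content = '<a href="/movies?near=21225&mid=aaa"><b>Michael Clayton</b></a>'
         . '<a href="/movies?near=21225&mid=bbb"><b>Example Movie</b></a>';

$found = preg_match_all('%<a[^>]+&mid=[^>]+><b>([^<]+)</b></a>%', $content, $result, PREG_PATTERN_ORDER);

$lines = [];
if ($found) {   // false on error, 0 when nothing matched
    foreach ($result[1] as $index => $movieTitle) {
        $lines[] = ($index + 1) . '. ' . $movieTitle;
    }
}
echo implode("\n", $lines), "\n";
```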
 
vfettyAuthor Commented:
b0lsc0tt, sorry for the long delay in replying.  I tried out the code you gave and it works really well.  I have been trying to get the theater name and times, plus the movie's href, but the results are not even close; I'm starting to think I have some goofy expressions going on :)  How could I get those items? Here is the HTML for what I want.

<td colspan="6"><a href="/movies?near=21225&amp;tid=9c26a79162540f94"><b>IMAX at the Maryland Science Center</b></a><br><font size="-1">601&nbsp;Light&nbsp;Street,&nbsp;Baltimore,&nbsp;MD,&nbsp;USA - (410)&nbsp;685-5225 - <a href="http://maps.google.com/maps?q=601+Light+Street+Baltimore+MD+21230+US+%28IMAX+at+the+Maryland+Science+Center%29&oi=moviesp" class="fl">Map</a></font></td>

The page it came from: http://www.google.com/movies?near=21225

Anything you can do to show me examples is really appreciated. Just a couple more examples and I know I can get it; I have been spending lots of time on this :) Thanks!
 
b0lsc0ttCommented:
I can help with that part too.

I did come up with a single expression to match everything, but the problem is getting the info you want out of it.  A capturing group keeps only one value per match, and with things like times for a theater there will be several values inside one match.  I don't see that PHP's regex engine provides a way to collect them all with just 1 expression.
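
The capturing-group limitation can be seen in a small sketch. The string and both patterns below are made up for illustration; they are not the expressions used in this thread. When a group sits inside a repetition, each pass overwrites the previous capture, so only the last value is left.

```php
<?php
// Made-up string: one theater followed by three times.
$subject = 'Theater: 1:10 4:00 7:15';

// Group 1 is inside a repeated group, so each repetition overwrites the capture.
preg_match('%Theater:(?: (\d{1,2}:\d{2}))+%', $subject, $single);

// A separate pass over the same text collects every time.
preg_match_all('%\d{1,2}:\d{2}%', $subject, $all);

var_dump($single[1]);  // only the last time survives
print_r($all[0]);      // all three times
```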

To get everything you want and keep it organized I have come up with 3 steps.  Please change the variable names or modify the little PHP I provide below if you want.  I am only providing the expressions and that part of the steps for now.  I can help with the PHP script (e.g. loops, etc) if needed but you seem to have that down.

1.  The first expression below will get each movie.  The 1 capturing group will grab the movie title and that goes in the variable $title, which is an array.  The other variable, $arMovie, is used for the next step.

preg_match_all('%<a[^>]+&mid=[^>]+><b>([^<]+)</b></a>.*?(?=(?:<a[^>]+&mid=|\z))%', $subject, $result, PREG_PATTERN_ORDER);
$arMovie = $result[0];
$title = $result[1];

2.  You will need to loop over each item in $arMovie (shown below as $arMovie[$i], where $i is your loop index).  This expression uses 2 groups.  The first is the theater ($theaters) and the other is a string that contains the times for that theater ($arTimes).  Both are arrays of course, but the second ($arTimes) is used in the 3rd step.

preg_match_all('%<a[^>]+&tid=[^>]+><b>([^<]+)</b></a>.*?Map</a><br>(.*?)</font>(?:.*?)(?=(?:<a[^>]+&tid=|\z))%', $arMovie[$i], $resultMovie, PREG_PATTERN_ORDER);
$theaters = $resultMovie[1];
$arTimes = $resultMovie[2];

3.  You will need to loop through the $arTimes items (shown below as $arTimes[$j], where $j is your loop index) to get each time.  1 group is used and the results are an array named $times.

preg_match_all('%((?:1[0-2]|[1-9]):[0-5][0-9])(?:</a>)?(?:&nbsp;|\z)%', $arTimes[$j], $resultTimes, PREG_PATTERN_ORDER);
$times = $resultTimes[1];

Keep in mind that the "subject" (the second argument in preg_match_all) may need to be changed to work in the loop you choose to use.  Also you can modify how this works to get the results in a different way.  Let me know if you want the "big" expression but I don't know that it will be very useful to you.

Let me know if you have a question or need help using this.  If you do need help then let me know how you want the info (if that is what you need help with).  Let me know how this works.

bol
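
How the step 1 expression cuts the page into per-movie chunks can be sketched on made-up, single-line markup (the titles and trailing text are placeholders). The lazy .*? keeps extending until the lookahead sees the next link containing &mid= or the end of the string, so each full match in $arMovie covers one movie and everything listed under it.

```php
<?php
// Made-up, single-line markup: two movies, each followed by its own theater text.
$subject = '<a href="/movies?near=21225&mid=aaa"><b>Movie A</b></a> theaters for A '
         . '<a href="/movies?near=21225&mid=bbb"><b>Movie B</b></a> theaters for B';

preg_match_all('%<a[^>]+&mid=[^>]+><b>([^<]+)</b></a>.*?(?=(?:<a[^>]+&mid=|\z))%', $subject, $result, PREG_PATTERN_ORDER);
$arMovie = $result[0];  // one chunk per movie
$title   = $result[1];  // the captured titles

print_r($title);
print_r($arMovie);
```

One thing to watch, depending on the real page: by default . does not match a line break, so markup that spans several lines may need the s modifier after the closing delimiter.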
 
vfettyAuthor Commented:
Wow, cool!  I have been working away at what you came up with for me.  I get the movie titles but nothing else shows up, and I'm not sure if I am doing something wrong.  The movies get pulled in just fine but the other arrays come back empty.  I can see that $arMovie returns the chunks of movie info; I'm just not sure how to extract from them with the expressions.
 
b0lsc0ttCommented:
Thanks. :D

Did you put those lines and expressions in with PHP code to use them?  On their own that isn't complete as a script.  The expressions are good but they need to be part of a script to "process" the info.  If you are still having a problem then post the PHP script you are using.

bol
 
vfettyAuthor Commented:
It seems I am almost there; I just don't understand how to connect the movie theaters with the titles and times.

//1.  The first expression below will get each movie.  The 1 capturing group will grab the movie title and that goes in the variable $title, which is an array.  The other variable, $arMovie, is used for the next step.

preg_match_all('%<a[^>]+&mid=[^>]+><b>([^<]+)</b></a>.*?(?=(?:<a[^>]+&mid=|\z))%', $content, $result, PREG_PATTERN_ORDER);
$arMovie = $result[0];
$title = $result[1];
foreach($title as $key => $val) {
echo $val."<br>";
}
foreach($arMovie as $key => $val1) {
//echo $val."<br />";
 
//2.  You will need to loop each item in $arMovie.  This will use 2 groups.  The first will be the theater ($theaters) and the other is a string that contains the times for that theater ($arTimes).  Both are arrays of course but the second ($arTimes) is used in the 3rd step.
preg_match_all('%<a[^>]+&tid=[^>]+><b>([^<]+)</b></a>.*?Map</a><br>(.*?)</font>(?:.*?)(?=(?:<a[^>]+&tid=|\z))%', $val1, $resultMovie, PREG_PATTERN_ORDER);
$theaters = $resultMovie[1];
$arTimes = $resultMovie[2];
echo "<br />".$theaters[0]." | " . $arTimes[0] . "<br />";
}
//foreach($ as $key => $val) {
//echo $val."<br />";
//}
// 3.  You will need to loop through the $arTimes items to get each time.  1 group is used and the results are an array named $times.

foreach($arTimes as $key => $val2) {
preg_match_all('%((?:1[0-2]|[1-9]):[0-6][0-9])(?:</a>)?(?:&nbsp;|\z)%', $val2, $resultTimes, PREG_PATTERN_ORDER);
$times = $resultTimes[1];
echo $times[0];
}
 
b0lsc0ttCommented:
The steps need to be nested.  Step 2 has to run inside the loop for step 1, and step 3 has to run inside the loop for step 2.  I know that may not make sense, but the code will hopefully help.

// step 1
preg_match_all('%<a[^>]+&mid=[^>]+><b>([^<]+)</b></a>.*?(?=(?:<a[^>]+&mid=|\z))%', $content, $result, PREG_PATTERN_ORDER);
$arMovie = $result[0];
$title = $result[1];

for ($i=0; $i<count($title); $i++) {
      echo "<h5>$title[$i]</h5>";
      // step 2
      preg_match_all('%<a[^>]+&tid=[^>]+><b>([^<]+)</b></a>.*?Map</a><br>(.*?)</font>(?:.*?)(?=(?:<a[^>]+&tid=|\z))%', $arMovie[$i], $resultMovie, PREG_PATTERN_ORDER);
      $theaters = $resultMovie[1];
      $arTimes = $resultMovie[2];
      for ($j=0; $j<count($theaters); $j++) {
            echo "<div>";
            echo "$theaters[$j] - ";
            
            //step 3
            preg_match_all('%((?:1[0-2]|[1-9]):[0-5][0-9])(?:am|pm)?(?:</a>)?(?:&nbsp;|\z)%', $arTimes[$j], $resultTimes, PREG_PATTERN_ORDER);
            $times = $resultTimes[1];
            echo implode(",  ", $times);
            // end step 3
            echo "</div>";
            // end step 2
      }
      // end step 1
}

You can change the way the info is displayed.  I chose to use implode() to join the times from the array into a string in just one step.  However, you could use a for loop like the other steps if you need to.

Let me know if you have a question.

bol
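
If the goal is to keep titles, theaters, and times associated rather than echo them, the same three nested steps can fill an array. This is only a sketch: parseShowtimes() is a hypothetical helper, the sample markup is made up to fit the expressions, PREG_SET_ORDER is swapped in so each match arrives as one row, and the minutes class is written as [0-5][0-9].

```php
<?php
// Hypothetical helper: returns [movie title => [theater name => [times]]].
function parseShowtimes(string $content): array
{
    $showtimes = [];
    // Step 1: one chunk per movie. With PREG_SET_ORDER, $movie[0] is the chunk, $movie[1] the title.
    preg_match_all('%<a[^>]+&mid=[^>]+><b>([^<]+)</b></a>.*?(?=(?:<a[^>]+&mid=|\z))%', $content, $movies, PREG_SET_ORDER);
    foreach ($movies as $movie) {
        $showtimes[$movie[1]] = [];
        // Step 2: theaters inside this movie's chunk. $theater[1] is the name, $theater[2] the times string.
        preg_match_all('%<a[^>]+&tid=[^>]+><b>([^<]+)</b></a>.*?Map</a><br>(.*?)</font>(?:.*?)(?=(?:<a[^>]+&tid=|\z))%', $movie[0], $theaters, PREG_SET_ORDER);
        foreach ($theaters as $theater) {
            // Step 3: individual times inside this theater's times string.
            preg_match_all('%((?:1[0-2]|[1-9]):[0-5][0-9])(?:am|pm)?(?:</a>)?(?:&nbsp;|\z)%', $theater[2], $times);
            $showtimes[$movie[1]][$theater[1]] = $times[1];
        }
    }
    return $showtimes;
}

// Made-up markup: two movies with one theater each (addresses and map URLs are placeholders).
$content = '<a href="/movies?near=21225&mid=m1"><b>Movie A</b></a>'
    . '<a href="/movies?near=21225&tid=t1"><b>Theater One</b></a><br><font size="-1">1 Example St - '
    . '<a href="http://maps.example/?q=1" class="fl">Map</a><br>1:10&nbsp; 4:00&nbsp; 7:15pm</font>'
    . '<a href="/movies?near=21225&mid=m2"><b>Movie B</b></a>'
    . '<a href="/movies?near=21225&tid=t2"><b>Theater Two</b></a><br><font size="-1">2 Example St - '
    . '<a href="http://maps.example/?q=2" class="fl">Map</a><br>12:30&nbsp; 9:45pm</font>';

$showtimes = parseShowtimes($content);
print_r($showtimes);
```

Keying by title assumes titles are unique on the page; a plain list of rows would avoid that assumption.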
 
b0lsc0ttCommented:
By the way, in case you only copy parts, I had to modify the expression in step 3.  One time (the last) had "pm" after it, so the expression now allows an optional "am" or "pm".

bol
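
The difference the optional am/pm makes can be sketched with a made-up times string in the shape the step 3 expression expects. The variable names are illustrative and the minutes class is written here as [0-5][0-9]. Without the optional suffix, a time followed directly by "pm" is not followed by &nbsp; or the end of the string, so it is skipped.

```php
<?php
// Made-up times string: the last time carries a "pm" suffix.
$timesHtml = '11:20&nbsp; 1:45&nbsp; 4:10&nbsp; 9:55pm';

$withoutSuffix = '%((?:1[0-2]|[1-9]):[0-5][0-9])(?:</a>)?(?:&nbsp;|\z)%';
$withSuffix    = '%((?:1[0-2]|[1-9]):[0-5][0-9])(?:am|pm)?(?:</a>)?(?:&nbsp;|\z)%';

preg_match_all($withoutSuffix, $timesHtml, $before);
preg_match_all($withSuffix, $timesHtml, $after);

echo implode(',  ', $before[1]), "\n";  // misses the last time
echo implode(',  ', $after[1]), "\n";   // includes it
```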
 
vfettyAuthor Commented:
Hey bol, did I mention you made my life mega easier :)  I still have a ways to go on the project, but you really got me going in the right direction. Thanks for all your hard work.  Are you ever for hire on projects?
 
b0lsc0ttCommented:
:D  Thanks!  It has been a real fun question.  You sure did get a bargain for 500 points. ;)

Good luck with the rest of the project.  Feel free to post a comment here with a new question's URL if you post one and want to make sure I see it.  The moderators can contact me too if this question is locked and you are "stuck" in a new one.

I do some contract/consulting work.  I don't currently have contact info in my member profile but have thought about putting something there.  My schedule does limit the jobs I can do and take but I am always looking for a good challenge/project.  Feel free to contact me if I put contact info in my profile.  Let me know if you don't know what I mean by the profile.  Thanks a lot for the interest.

Thanks for the grade, the points and the fun question.  I'm glad I could help.

bol