Solved

PHP Crawler Sourceforge

Posted on 2010-09-09
Last Modified: 2012-06-21
Hello,

I am attempting to implement an open source PHP application available from SourceForge (http://sourceforge.net/projects/php-crawler).

Unfortunately there is not much documentation; however, I have managed to get the crawler working to a certain extent: I can target a web page, return all of its content, and output it to a new HTML file.

However, the MySQL database table used by the crawler is empty at the end of processing. I do not need the whole HTML file; I just want one <DIV> section. I thought I could do this using the database contents, but I am open to other suggestions.

I have tried using a string function, but this returns a blank file, so either it is incorrectly written or I cannot use the string function on the HTML file.

I have attached a copy of pro.php, which successfully returns the whole page, and pro1.php, which returns nothing. I have also attached a copy of index2.php, which calls pro.php and pro1.php with the crawl address.

The difference between the files (where I have attempted to strip out the div) is shown below:
$data = $usendid;
$string = between('<div id=section c1>', '</div>', $data);

function between($start, $end, $source) {
    $s = strpos($source, $start) + strlen($start);
    return substr($source, $s, strpos($source, $end, $s) - $s);
}

Can anyone advise me either how I can fix my string function to pull out the required section, or how I could use the database table to achieve the same thing? Does anyone have any more detailed documentation for PHP Crawler?

Thanks
index2.php
pro.php
pro1.php
Question by:javaftper

Accepted Solution

by:
mpickreign earned 250 total points
ID: 33637022
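I would use preg_match instead of the between function. This should do it:

preg_match('/<div id=section c1>(.+?)<\/div>/s', $data, $matched_content);
// $matched_content[1] will be just the content between the div tags.

The s modifier lets the dot match newlines, and the lazy .+? stops at the first closing </div>, so this assumes the target div contains no nested divs.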
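
As a minimal sketch of the preg_match approach, the helper below escapes the literal opening tag with preg_quote and returns null when nothing matches. The name extractDiv and the sample HTML are invented for illustration, and the sketch assumes the target div has no nested divs (the lazy match stops at the first </div>).

```php
<?php
// Hypothetical helper (not part of PHP Crawler): returns the inner content
// of the first element whose opening tag matches $openTag exactly, or null.
// Assumes the target div contains no nested <div> elements.
function extractDiv(string $openTag, string $html): ?string
{
    // preg_quote escapes regex metacharacters in the literal tag;
    // its second argument also escapes the '/' delimiter.
    // The 's' modifier lets '.' match newlines; '.*?' is lazy.
    $pattern = '/' . preg_quote($openTag, '/') . '(.*?)<\/div>/s';

    if (preg_match($pattern, $html, $matches) === 1) {
        return $matches[1];
    }
    return null;
}

// Invented sample input, for illustration only.
$data = "<html><body>\n<div id=\"sectionc1\">\n<p>Hello</p>\n</div>\n"
      . "<div>other</div></body></html>";

$content = extractDiv('<div id="sectionc1">', $data);
echo $content === null ? "not found\n" : trim($content) . "\n";
```

The opening tag passed in has to match the page source character for character, including quotes and spacing, which is a common reason a string search comes back empty.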

Author Comment

by:javaftper
ID: 33648005
Thanks. preg_match works better; however, I am having to take the whole file and re-create it, rather than creating the div in one pass, possibly using the MySQL DB.
Does anyone have any docs for PHP Crawler?

Assisted Solution

by:ahmad_alinat
ahmad_alinat earned 250 total points
ID: 33657465
Instead of using this open source project, you can use PHP's DOMDocument class together with the DOMXPath interface.

Suppose you want to get the DIV where id = sectionc1.

First, load the page HTML with DOMDocument:

$dom = new DOMDocument();
@$dom->loadHTMLFile($url);   // $url holds the page address; @ suppresses warnings from malformed HTML

Then use DOMXPath to extract the DIV:

$xpath = new DOMXPath($dom);
$results = $xpath->query("//div[@id='sectionc1']");

$results is a DOMNodeList; $results->item(0) is the matching DIV, if there is one.

Simple and easy!
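
A fuller sketch of the DOM approach, under some assumptions: the helper name extractDivById and the sample page are invented for illustration, the id is assumed to be sectionc1 as in the comment above, and the id is assumed to contain no single quote (it is concatenated into the XPath expression). Unlike the regex approach, this handles nested divs.

```php
<?php
// Hypothetical helper: returns the outer HTML of the first <div> with the
// given id, or null if no such element exists.
// Assumes $id contains no single quote.
function extractDivById(string $html, string $id): ?string
{
    $dom = new DOMDocument();

    // Collect parser warnings internally instead of printing them;
    // real-world pages are rarely valid HTML.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query("//div[@id='" . $id . "']");

    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    // saveHTML($node) serializes only that node and its children.
    return $dom->saveHTML($nodes->item(0));
}

// Invented sample page, for illustration only.
$page = '<html><body><div id="header">Top</div>'
      . '<div id="sectionc1"><div class="inner">Nested</div><p>Text</p></div>'
      . '</body></html>';

$section = extractDivById($page, 'sectionc1');
echo $section === null ? "not found\n" : $section . "\n";
```

To work from a live page rather than a string, the same function body could call loadHTMLFile with the page address, as in the comment above, and the returned string could then be written to the output file with file_put_contents.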

Author Closing Comment

by:javaftper
ID: 33685246
Both comments were very helpful.