Solved

Creating site crawler

Posted on 2004-03-21
Last Modified: 2008-01-16
Hey guys,

I'm trying to add a search feature to my site, and I've been doing some looking around. It seems to me that the best way to do this is to write a crawler which will update a MySQL table every now and then. There is an ASP version of it, but I have yet to find a PHP version, so if anybody could help me build or get one in PHP, that would be great.

http://www.webwizguide.com/asp/sample_scripts/site_search_script.asp
Question by:drakkarnoir
 

Author Comment

by:drakkarnoir
ID: 10646931
I don't think you understood. I want a crawler that will store my pages' contents in an independent table, which will then be searched. I know how to do it directly if the content is in MySQL, but what if it's the content of the generated pages that I want stored?
 

Author Comment

by:drakkarnoir
ID: 10646932
Like Google or other web crawlers.

 
LVL 12

Expert Comment

by:venkateshwarr
ID: 10646942

Sorry, disregard my earlier post
 
LVL 13

Expert Comment

by:lozloz
ID: 10648537
maybe have a look at phpdig: www.phpdig.net

cheers,

loz
 

Author Comment

by:drakkarnoir
ID: 10648852
I guess the best solution is to code this myself. Can anybody get me started on a PHP script that will open each of my pages with dynamic IDs, index it, and store the text in a variable?
 
LVL 10

Expert Comment

by:frugle
ID: 10648937
How are your pages created? Why don't you store the actual pages in the database and create full-text indexes on them?
Displaying the page from the database would be quicker, and there would be no un-indexed gap between publishing and spidering.

For what it's worth, spiders are probably better written in Perl (ducks and runs for cover)

Mike
 
LVL 13

Expert Comment

by:lozloz
ID: 10648953
Why don't you have a look at the source for phpdig if you want to get started? Have a look at the features list to see what you can learn from it:

http://www.phpdig.net/navigation.php?action=doc#toc3

cheers,

loz
 

Author Comment

by:drakkarnoir
ID: 10649649
You better run frugle!! Tehehe

My pages are being generated from a MySQL DB, but I want to do something like the following:

$array = array(/* my product IDs */);
foreach ($array as $key) {
    // fopen("http://www.products.com/index.php?product_id=$key");
    // fread (?) all the HTML
    // Get rid of the HTML tags
    // Store only the plain text in a table
}
// When the user searches... well, I can do this part.
 
LVL 10

Expert Comment

by:frugle
ID: 10649917
strip_tags() function will get rid of the HTML

http://uk.php.net/manual/en/function.strip-tags.php

Mike
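
One thing to watch: strip_tags() removes the tags themselves but keeps whatever sat between <script> and <style> tags, so JavaScript and CSS can end up in the indexed text. Below is a minimal sketch of a cleanup step, assuming the pages are ordinary HTML. The function name html_to_plain_text is made up for illustration and is not from the thread.

```php
<?php
// Hypothetical helper (not from the thread): reduce an HTML page to searchable plain text.
function html_to_plain_text($html)
{
    // Drop <script> and <style> blocks entirely; strip_tags() would keep their contents.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html);

    // Put a space in front of every tag so words in adjacent elements do not run together.
    $html = str_replace('<', ' <', $html);

    // Remove the remaining tags, then turn entities such as &amp; back into characters.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');

    // Collapse runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Prints: Blue Widget Nuts & bolts
echo html_to_plain_text('<h1>Blue Widget</h1><script>var x = 1;</script><p>Nuts &amp; bolts</p>');
```

The regular expression only handles well-formed script and style blocks; for messy markup a real HTML parser would probably be the safer choice.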
 
LVL 10

Accepted Solution

by:
frugle earned 2000 total points
ID: 10649959
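In fact, why use fread?

Have you tried...

$basic = array();

foreach ($array as $key) {
    // $array holds the product IDs; build the URL of each product page.
    $url = "http://www.products.com/index.php?product_id=" . $key;

    // file() returns the page as an array of lines (it needs allow_url_fopen
    // enabled for URLs); join the lines and strip the HTML tags.
    $basic[] = strip_tags(implode("", file($url)));
}

// $basic should now be an array of basic (text only) content, one entry per page.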

Mike
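
To cover the remaining steps from the question (storing the text in its own table and searching it), here is a rough sketch that assumes PDO is available. The table and column names (page_index, product_id, body), the function names, and the connection details in the comments are all placeholders, not anything given in the thread. $fetch stands in for whatever downloads a page, so the code can be tried without a live site.

```php
<?php
// Hypothetical sketch: table, column and function names are placeholders.

// Fetch each product page, strip the markup and store the text under its product ID.
function index_pages(PDO $pdo, array $productIds, callable $fetch)
{
    // REPLACE INTO overwrites the old row when a page is crawled again
    // (this relies on product_id being the primary key).
    $stmt = $pdo->prepare('REPLACE INTO page_index (product_id, body) VALUES (?, ?)');
    foreach ($productIds as $id) {
        $html = $fetch($id);
        if ($html === false) {
            continue; // page could not be fetched; keep whatever was indexed before
        }
        $stmt->execute(array($id, strip_tags($html)));
    }
}

// Return the IDs of the products whose stored text contains the search term.
function search_pages(PDO $pdo, $term)
{
    $stmt = $pdo->prepare('SELECT product_id FROM page_index WHERE body LIKE ? ORDER BY product_id');
    $stmt->execute(array('%' . $term . '%'));
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}

// Example wiring (assumed connection details and table definition):
// $pdo = new PDO('mysql:host=localhost;dbname=shop', 'user', 'password');
// $pdo->exec('CREATE TABLE IF NOT EXISTS page_index (product_id INT PRIMARY KEY, body TEXT)');
// index_pages($pdo, array(1, 2, 3), function ($id) {
//     return @file_get_contents('http://www.products.com/index.php?product_id=' . urlencode($id));
// });
// print_r(search_pages($pdo, 'widget'));
```

A LIKE search is the simplest thing that works. On a larger site a MySQL FULLTEXT index queried with MATCH ... AGAINST would likely scale better.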
 

Author Comment

by:drakkarnoir
ID: 10674682
Thanks, worked great.