Solved

Creating site crawler

Posted on 2004-03-21
Last Modified: 2008-01-16
Hey guys,

I'm trying to add a search feature to my site, and I've been looking around. It seems the best way to do this is a crawler that periodically updates a MySQL table used for searching. There is an ASP version of this, but I have yet to find a PHP one, so if anybody could help me build or find one in PHP, that would be great.

http://www.webwizguide.com/asp/sample_scripts/site_search_script.asp
Question by:drakkarnoir
13 Comments
 
LVL 12

Expert Comment

by:venkateshwarr
ID: 10646905
 

Author Comment

by:drakkarnoir
ID: 10646931
I don't think you understood. I want a crawler that will store my pages' contents in an independent table, which will then be searched. I know how to search directly if the content is in MySQL, but what if it's the content of the generated pages that I want stored?
 

Author Comment

by:drakkarnoir
ID: 10646932
Like Google or other web crawlers.
 
LVL 12

Expert Comment

by:venkateshwarr
ID: 10646936
 
LVL 12

Expert Comment

by:venkateshwarr
ID: 10646942

Sorry, disregard my earlier post
 
LVL 13

Expert Comment

by:lozloz
ID: 10648537
maybe have a look at phpdig: www.phpdig.net

cheers,

loz

 

Author Comment

by:drakkarnoir
ID: 10648852
I guess the best solution is to code this myself. Can anybody get me started on a PHP script that will open each of my pages with dynamic IDs, index it, and store the text in a variable?
 
LVL 10

Expert Comment

by:frugle
ID: 10648937
How are your pages created? Why don't you store the actual pages in the database and create free-text indexes on them?
Displaying a page from the database would be quicker, and there would be no unindexed gap between publishing and spidering.

For what it's worth, spiders are probably better written in Perl (ducks and runs for cover)

Mike
 
LVL 13

Expert Comment

by:lozloz
ID: 10648953
why don't you have a look at the source for phpdig if you want to get started - have a look at the features list to see what you can learn from it:

http://www.phpdig.net/navigation.php?action=doc#toc3

cheers,

loz
 

Author Comment

by:drakkarnoir
ID: 10649649
You better run frugle!! Tehehe

My pages are being generated from MySQL db, but I want to do something like the following:

$array = array("my product id's");
foreach ($array as $key)
fopen("http://www.products.com/index.php?product_id=$key");
fread (?) all the HTML
Get rid of HTML tags
Store only the plain text into a table.
When user searches, well I can do this part.
 
LVL 10

Expert Comment

by:frugle
ID: 10649917
The strip_tags() function will get rid of the HTML tags.

http://uk.php.net/manual/en/function.strip-tags.php
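
One thing to watch: strip_tags() removes the tags themselves but keeps the text inside <script> and <style> blocks, and it can fuse words from adjacent elements. A minimal sketch of a cleanup step, assuming UTF-8 pages (html_to_plain_text is a hypothetical helper name, not something from this thread):

```php
<?php
// Hypothetical helper: reduce an HTML page to searchable plain text.
function html_to_plain_text(string $html): string
{
    // strip_tags() keeps the text inside <script> and <style>, so drop those blocks first.
    // Fall back to the untouched input if the regex engine reports an error (null).
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html) ?? $html;

    // Put a space before each remaining tag so adjacent elements do not fuse into one word.
    $html = str_replace('<', ' <', $html);

    $text = strip_tags($html);

    // Decode entities such as &amp; and &quot; (UTF-8 is an assumption here).
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

    // Collapse runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/', ' ', $text) ?? $text);
}
```

The result is one line of text per page, which is easier to store in a table and search than raw strip_tags() output.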

Mike
 
LVL 10

Accepted Solution

by:
frugle earned 500 total points
ID: 10649959
in fact, why use fread?

have you tried...

$basic = array();
foreach ($array as $key) {
    $url = "http://www.products.com/index.php?product_id=" . $key;
    $basic[] = strip_tags(implode("", file($url)));
}

# should return an array of basic (text only) content.
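
If a page cannot be fetched, file() returns false and the implode() call fails, so a slightly more defensive variant may be worth having. This is only a sketch: crawl_products, $baseUrl and $fetch are hypothetical names, and fetching over HTTP this way assumes allow_url_fopen is enabled.

```php
<?php
// Hypothetical helper: fetch each product page and return plain text keyed by product ID.
// $fetch is any callable that takes a URL and returns the HTML string, or false on failure.
function crawl_products(array $ids, string $baseUrl, callable $fetch): array
{
    $texts = [];
    foreach ($ids as $id) {
        $url = $baseUrl . '?product_id=' . urlencode((string) $id);
        $html = $fetch($url);
        if ($html === false) {
            continue; // skip pages that could not be fetched
        }
        $texts[$id] = trim(strip_tags($html));
    }
    return $texts;
}

// Usage with the URL from this thread:
// $basic = crawl_products($array, 'http://www.products.com/index.php', 'file_get_contents');
```

Keying the result by product ID makes it straightforward to write each entry to its own row in the search table afterwards.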

Mike
 

Author Comment

by:drakkarnoir
ID: 10674682
Thanks, worked great.