Solved

Creating site crawler

Posted on 2004-03-21
13
601 Views
Last Modified: 2008-01-16
Hey guys,

I'm trying to add a search feature to my site, and I've been doing some looking around. Seems to me, that the best way to do this is to do a crawler which will update a MySQL table every now and then for this purpose. There is an ASP version of it, but I have yet to find a PHP version, so if anybody could help me build or get one in PHP, I would be great.

http://www.webwizguide.com/asp/sample_scripts/site_search_script.asp
0
Comment
Question by:drakkarnoir
  • 5
  • 3
  • 3
  • +1
13 Comments
 
LVL 12

Expert Comment

by:venkateshwarr
ID: 10646905
0
 

Author Comment

by:drakkarnoir
ID: 10646931
I don't think you understood, I want a crawler, that will store my pages contents into a independent table, which will then be searched from. I know how to do it direct if the content is in MySQL, but what if it's the content of the generated pages I want stored.
0
 

Author Comment

by:drakkarnoir
ID: 10646932
Like Google or other web crawlers.
0
 
LVL 12

Expert Comment

by:venkateshwarr
ID: 10646936
0
 
LVL 12

Expert Comment

by:venkateshwarr
ID: 10646942

Sorry, disregard my earlier post
0
 
LVL 13

Expert Comment

by:lozloz
ID: 10648537
maybe have a look at phpdig: www.phpdig.net

cheers,

loz
0
Is Your Active Directory as Secure as You Think?

More than 75% of all records are compromised because of the loss or theft of a privileged credential. Experts have been exploring Active Directory infrastructure to identify key threats and establish best practices for keeping data safe. Attend this month’s webinar to learn more.

 

Author Comment

by:drakkarnoir
ID: 10648852
I guess the best solution is that I code this myself, can anybody get me started on a PHP script that will open each one of my pages with dynamic ID's and have it index and store the text into a variable?
0
 
LVL 10

Expert Comment

by:frugle
ID: 10648937
How are your pages created? why don't you store the actual pages in the database and create freetext indexes on them?
displaying the page from the database would be quicker to produce and wouldn't have any non-indexed time between publishing and spidering.

For what it's worth, spiders are probably better written in Perl (ducks and runs for cover)

Mike
0
 
LVL 13

Expert Comment

by:lozloz
ID: 10648953
why don't you have a look at the source for phpdig if you want to get started - have a look at the features list to see what you can learn from it:

http://www.phpdig.net/navigation.php?action=doc#toc3

cheers,

loz
0
 

Author Comment

by:drakkarnoir
ID: 10649649
You better run frugle!! Tehehe

My pages are being generated from MySQL db, but I want to do something like the following:

$array = array("my product id's");
foreach ($array as $key)
fopen("http://www.products.com/index.php?product_id=$key");
fread (?) all the HTML
Get rid of HTML tags
Store only the plain text into a table.
When user searches, well I can do this part.
0
 
LVL 10

Expert Comment

by:frugle
ID: 10649917
strip_tags() function will get rid of the HTML

http://uk.php.net/manual/en/function.strip-tags.php

Mike
0
 
LVL 10

Accepted Solution

by:
frugle earned 500 total points
ID: 10649959
in fact, why use fread?

have you tried...

$basic = array();

foreach ($array as $key){

      $url = "http://www.products.com/index.php?product_id=".$key;

      $basic[] = strip_tags(implode("",file($url)));

}

# should return an array of basic (text only) content.

Mike
0
 

Author Comment

by:drakkarnoir
ID: 10674682
Thanks, worked great.
0

Featured Post

Is Your Active Directory as Secure as You Think?

More than 75% of all records are compromised because of the loss or theft of a privileged credential. Experts have been exploring Active Directory infrastructure to identify key threats and establish best practices for keeping data safe. Attend this month’s webinar to learn more.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Title # Comments Views Activity
Echo vs ?><?php  html code 4 46
simplest php form 3 63
<? versus <?php 5 37
mysql update statement effect only some rows 4 27
Build an array called $myWeek which will hold the array elements Today, Yesterday and then builds up the rest of the week by the name of the day going back 1 week.   (CODE) (CODE) Then you just need to pass your date to the function. If i…
These days socially coordinated efforts have turned into a critical requirement for enterprises.
Explain concepts important to validation of email addresses with regular expressions. Applies to most languages/tools that uses regular expressions. Consider email address RFCs: Look at HTML5 form input element (with type=email) regex pattern: T…
The viewer will learn how to create and use a small PHP class to apply a watermark to an image. This video shows the viewer the setup for the PHP watermark as well as important coding language. Continue to Part 2 to learn the core code used in creat…

896 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

13 Experts available now in Live!

Get 1:1 Help Now