
Solved

web-crawling

Posted on 1998-08-05
12
Medium Priority
310 Views
Last Modified: 2013-12-25
I currently have a CGI Perl program that lets a user search a flat file database. The script returns the results as URLs, linking the user to the pages where the matched words were found (i.e. a basic search engine). The problem is that the database must be created manually.
I require a web robot that can crawl over a given site, following any links found there (to a given depth, and/or matching a given string). The crawler should then create a flat file database of the parsed HTML and the URL where the words were found. I could then use my current search engine to let users query a site. Are there any robots available that could easily be modified to provide this functionality? I have contacted some contractors about writing one and was quoted in the region of $600; I would obviously prefer to find something cheaper.
- The robot and search engine will not be used commercially.
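Roughly the shape of the thing I have in mind, if it helps. This is only a rough sketch on my part; I'm guessing that LWP::Simple and HTML::LinkExtor are the right modules to use, and the start page, depth and file name are made up:

#!/usr/bin/perl -w
# Very rough sketch of the robot I have in mind (not code I can vouch for).
use strict;
use LWP::Simple qw(get);
use HTML::LinkExtor;
use URI;

sub crawl {
    my ($url, $depth) = @_;
    return if $depth < 0;

    my $html = get($url);
    return unless defined $html;

    # note the page's words next to its URL in the flat file
    (my $text = $html) =~ s/<[^>]*>/ /gs;
    $text =~ s/\s+/ /g;
    open my $db, '>>', 'site.db' or die "site.db: $!";
    print $db "$url|$text\n";
    close $db;

    # then follow every link found on the page, one level deeper
    my @links;
    my $p = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        push @links, $attr{href} if $tag eq 'a' && defined $attr{href};
    });
    $p->parse($html);
    crawl(URI->new_abs($_, $url)->as_string, $depth - 1) for @links;
}

crawl('http://www.example.com/', 2);   # start page and depth would be parameters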
0
Comment
Question by:andy10
12 Comments
 
LVL 10

Expert Comment

by:MasseyM
ID: 1831394
qwerty lksdfkjhnne !! :)
0
 

Author Comment

by:andy10
ID: 1831395
Edited text of question
0
 

Author Comment

by:andy10
ID: 1831396
Sorry, I couldn't get the original question to be accepted, so I tried typing in rubbish. Once this worked, I edited the question with the correct text.
Thanks
0

 
LVL 5

Expert Comment

by:julio011597
ID: 1831397
You are not likely to get an answer here other than some pointers to public domain software. I'd like to suggest you do a web search for the following packages (sorry, I haven't got the URLs):

Harvest
MG (Managing Gigabytes)

They are provided with source code (C and Perl), and they both should be under the GNU licence.

Good luck, julio
0
 
LVL 7

Expert Comment

by:jconde
ID: 1831398
I once had that problem!

I searched for freeware robots, but I didn't find any.

I decided to code my own robot, but I never finished it!

Coding your own is fairly simple.

What you need to do is recursively open the URLs found in the index.html page and save each page to a file.

Give me some time and I'll send you the unfinished source I have. I won't finish it myself, since it was a long time ago and my coding s...s!

The only problem it has is that, being called recursively, if a URL appears on a page that has already been checked, it will loop forever. It keeps no track of the visited URLs.
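The usual fix, which I never got round to adding, is to remember every URL already visited in a hash and check it before each recursive call. A toy sketch of just that idea (the link graph is made up so it runs without fetching anything):

#!/usr/bin/perl -w
use strict;

# Two pages that link back to each other (exactly the case that looped).
my %links = (
    'http://site/a.html' => ['http://site/b.html'],
    'http://site/b.html' => ['http://site/a.html'],
);
my %seen;

sub crawl {
    my ($url, $depth) = @_;
    return if $depth < 0;
    return if $seen{$url}++;               # already visited: skip, so no endless loop
    print "visiting $url\n";
    crawl($_, $depth - 1) for @{ $links{$url} || [] };
}

crawl('http://site/a.html', 10);           # prints each page once and stops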


0
 
LVL 1

Expert Comment

by:evilgreg
ID: 1831399
I've written a program that does something similar to what you want. I assume the program can be written in Perl and should write to local files to create the database? It's certainly possible. Let me know some more about exactly what you need, and I'll see what I can come up with. I assume the spider should not follow links that are "off-site", and should avoid any CGI programs (roughly the test sketched below).
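Something like this (a sketch only; the start host and the /cgi-bin/ pattern are assumptions to adjust):

#!/usr/bin/perl -w
use strict;
use URI;

my $start      = URI->new('http://www.example.com/index.html');   # placeholder start page
my $start_host = lc $start->host;

# Decide whether a link found on page $base is worth following.
sub worth_following {
    my ($link, $base) = @_;
    my $uri = URI->new_abs($link, $base);

    return 0 unless $uri->scheme eq 'http';            # skip mailto:, ftp:, etc.
    return 0 unless lc($uri->host) eq $start_host;     # off-site: ignore
    return 0 if $uri->path =~ m{/cgi-bin/}i;           # don't wander into CGI programs
    return 0 if defined $uri->query;                   # nor into anything with a query string
    return 1;
}

print worth_following('/about.html', $start)               ? "follow\n" : "skip\n";
print worth_following('http://elsewhere.com/', $start)     ? "follow\n" : "skip\n";
print worth_following('/cgi-bin/search.pl?q=perl', $start) ? "follow\n" : "skip\n";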
0
 

Author Comment

by:andy10
ID: 1831400
I was basically looking for something fairly 'simple' that could traverse a web site and index the words it finds there.
You're right that it should only follow links that are on-site, and not index CGI programs.
I have a program written in Perl that can search a flat file database. At present, however, you are required to create the database manually, by filling out a list of words which you think are representative of the site's content. There are four fields in the database, separated by the '|' character:
1. the URL of the page the keywords represent
2. the site name, which is used as the anchor text for the URL
3. the keywords input by the site administrator
4. a description of the site, printed next to the URL to tell the user what has been found regarding their search

Right now, I would be happy with a program that could be given a URL, make a note of all the words found there, and place them in a file along with the URL where they were found. Basically, an automated version of #3 above; something like the sketch below.
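To make the format concrete, here's a sketch of writing one record the robot might produce and reading it back the way my search script does (the file name and the sample data are invented):

#!/usr/bin/perl -w
use strict;

my $dbfile = 'sites.db';    # invented name for the flat file

# Writing one record in the four-field format: url|name|keywords|description.
# This is the part the robot would automate.
my %page = (
    url         => 'http://www.example.com/about.html',
    name        => 'About Example Ltd',
    keywords    => 'example widgets perl crawler',
    description => 'Company background and contact details.',
);
open my $out, '>>', $dbfile or die "can't append to $dbfile: $!";
print $out join('|', @page{qw(url name keywords description)}), "\n";
close $out;

# Reading records back, the way the search CGI presumably does.
open my $in, '<', $dbfile or die "can't read $dbfile: $!";
while (my $line = <$in>) {
    chomp $line;
    my ($url, $name, $keywords, $desc) = split /\|/, $line, 4;
    print qq{would match against "$keywords" and link to <a href="$url">$name</a>\n};
}
close $in;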

I'm still learning Perl, and finding it a slow process, but this seems like a reasonable thing to do in Perl. What do you think? I'd really like to get this working as quickly as possible and have thought about paying a contractor; the cheapest I have come across is $400. Worth the money?
Cheers
Andy
0
 
LVL 1

Expert Comment

by:evilgreg
ID: 1831401
$400?!? Hell, no. I'd write it for 1/4th of that. :)

Some questions about the above: assuming a program creates the database files, where does it get #4 from? I assume #2 can use whatever is in the <TITLE> tags. As far as keywords go, do you just want basically every word on the page to be stored?


0
 

Author Comment

by:andy10
ID: 1831402
Yeah, the keywords could just be every word found on the page, and #2 would use the <title>.

As far as #4 goes, I don't know how you could automate that; I guess maybe printing out the first few lines of the page and hoping they're representative. Still, that's not a major concern. Maybe something like the snippet below.
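Pure guesswork on my part, but for #4 I imagine something along these lines: strip the tags and keep the first 25 words as the description (the URL and the 25 are arbitrary):

#!/usr/bin/perl -w
use strict;
use LWP::Simple qw(get);

my $url  = shift || 'http://www.example.com/';
my $html = get($url) or die "couldn't fetch $url\n";

(my $text = $html) =~ s{<head.*?</head>}{}is;   # drop the <head> section
$text =~ s/<[^>]*>/ /gs;                        # strip the remaining tags
$text =~ s/\s+/ /g;                             # squash whitespace onto one line

my @words = split ' ', $text;
@words = @words[0 .. 24] if @words > 25;        # first 25 words, hopefully representative
print "description: @words\n";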

So you reckon you could do this? I've got a guy chomping at the bit to do this for $400. :(
Let me know what you  can do.
Cheers
Andy
0
 
LVL 1

Expert Comment

by:evilgreg
ID: 1831403
I'll give it a shot for $50. :) Email me at "greg@turnstep.com" so we don't bore everyone else.

0
 

Author Comment

by:andy10
ID: 1831404
Adjusted points to 250
0
 

Accepted Solution

by:ShadowSpawn (earned 500 total points)
ID: 1831405
Check out this article for a simple Perl bot:
http://www.hotwired.com/webmonkey/code/97/16/index2a.html?tw=perl

If you want to dig any deeper, check out the source code for HTDIG (http://htdig.sdsu.edu/).

Also check CPAN and the libwww-perl documentation; I think they have some bot examples. A bare-bones example along those lines is sketched below.
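For reference, a minimal spider built on libwww-perl might look something like this sketch (LWP::RobotUA so robots.txt is honoured, HTML::LinkExtor for the links); the start URL, depth, output file and contact address are all placeholders:

#!/usr/bin/perl -w
use strict;
use LWP::RobotUA;
use HTML::LinkExtor;
use URI;

my $start = URI->new(shift || 'http://www.example.com/');   # placeholder start page
my $depth = 2;                                              # how many levels of links to follow

my $ua = LWP::RobotUA->new('andy10-spider/0.1', 'webmaster@example.com');
$ua->delay(0.1);                       # minutes to wait between requests to the same host

my %seen;
open my $db, '>>', 'crawl.db' or die "crawl.db: $!";

sub crawl {
    my ($url, $left) = @_;
    return if $left < 0 or $seen{$url}++;

    my $resp = $ua->get($url);
    return unless $resp->is_success and ($resp->content_type || '') eq 'text/html';
    my $html = $resp->content;

    # one record per page: url|title|all the words on the page
    my ($title) = $html =~ m{<title>\s*(.*?)\s*</title>}is;
    (my $text = $html) =~ s/<[^>]*>/ /gs;
    $text =~ s/\s+/ /g;
    print $db join('|', $url, (defined $title ? $title : $url), $text), "\n";

    # collect the links and recurse into the on-site, non-CGI ones
    my @links;
    my $p = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        push @links, $attr{href} if $tag eq 'a' and defined $attr{href};
    });
    $p->parse($html);

    for my $link (@links) {
        my $abs = URI->new_abs($link, $url);
        next unless $abs->scheme eq 'http' and lc($abs->host) eq lc($start->host);
        next if $abs->path =~ m{/cgi-bin/}i or defined $abs->query;
        $abs->fragment(undef);          # drop #anchors so the same page isn't refetched
        crawl($abs->as_string, $left - 1);
    }
}

crawl($start->as_string, $depth);
close $db;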
0
