Solved

WEB BOT

Posted on 1997-08-02
11
392 Views
Last Modified: 2010-04-04
I asked this question a few months ago, but the answer I got was not very clear, so here I go again.  I am building a Web Bot that extracts images, links, email addresses, movies, etc., and I want it to do so several levels deep.  Everything works now except that it will not extract emails or links, and it will not go further than the first page.  Can someone point me in the right direction this time?  I know I need the program to parse the information, and I have created a parser.  However, I apparently have the wrong code for pulling the mailto's and links.  Please answer as soon as possible.  Also, if someone knows of a better way to do this, I am open to suggestions.  I currently use Delphi 2.0.  I know of only one book on this topic and I currently have it, but it is not very clear either.

Thank you,
Tony
0
Comment
Question by:aj85
11 Comments
 

Author Comment

by:aj85
ID: 1340376
Edited text of question
0
 

Expert Comment

by:kimfriis
ID: 1340377
I am not sure that this is what you are looking for, but do you mean the tags for the links and mailto's, like <A HREF ...>?
This should be easy if you know how to extract images and so on: if you find an <A HREF=mailto:...>, then it is a mailto.
Please clarify if this is not what you want.
0
 

Author Comment

by:aj85
ID: 1340378
Actually, I have figured out the problem of extracting links & mailto's since I posted this question.  However, I can't separate the two, i.e. the links and mailto's come in on the same page.  Also, I still need the answer on how to make the program go beyond the first page.  I will be waiting for an answer.
0

 
LVL 1

Accepted Solution

by:
kyriacos earned 250 total points
ID: 1340379
I am not sure what information you want either.
So you made a parser.
That's good... Parsers work in an intelligent way, so they are not confused by multiple forms of the same meaning.

So if you want to separate links, emails, and images in HTML content (the source HTML), you will have to read the whole tag with its parameters, and then YOU decide what the <A HREF ...> tag will do.

To make things clearer:

Read the HTML until you find the string "<A HREF"
This is necessary.
Then read any parameter until you find the ">".
This is also necessary.
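
A minimal sketch of the two steps above, written in Python rather than Delphi only to keep it short. It is not the code from any program mentioned in this thread; the function names find_anchor_tags and href_value are made up for the illustration, and a regular expression stands in for the manual "read until" loop.

```python
import re

# "<A HREF" up to the next ">", in any letter case.
# Assumption: HREF is the first attribute, as in the description above;
# a tag such as <A NAME=x HREF=y> would be missed by this pattern.
ANCHOR_RE = re.compile(r"<a\s+href[^>]*>", re.IGNORECASE)

# The HREF value may be double-quoted, single-quoted, or unquoted.
HREF_RE = re.compile(
    r"""href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))""", re.IGNORECASE
)


def find_anchor_tags(html):
    """Return the raw text of every <A HREF ...> tag in the page source."""
    return ANCHOR_RE.findall(html)


def href_value(tag):
    """Return the target inside one <A HREF ...> tag, or '' if none is found."""
    match = HREF_RE.search(tag)
    if match is None:
        return ""
    # Exactly one of the three groups matched; the others are None.
    return next(group for group in match.groups() if group is not None)


if __name__ == "__main__":
    page = '<A HREF="mailto:tony@example.com">Mail</A> <a href=page2.html>Next</a>'
    for tag in find_anchor_tags(page):
        print(tag, "->", href_value(tag))
```

Everything the bot needs to decide about a link sits in the string returned by href_value, which is what the next step inspects.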

So whatever the HTML tag does will be enclosed between the "<A HREF" and ">" pair mentioned above.
If you encounter:
 a MAILTO parameter read the address
 a .jpg read the image location
 a .gif read the image location as well
 a .wav also read the location
 an .html -> read the new page location AND ALSO SAVE THE URL in a linked list of URLs because you will need this to find other links to DEPTH 2.
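
The list above can be sketched as a small classifier. This is a hedged illustration in Python, not Delphi and not anyone's actual bot code; the names classify and sort_targets are hypothetical, and treating ".htm" the same as ".html" is an assumption added here.

```python
def classify(target):
    """Decide what kind of item an HREF target points at."""
    t = target.strip().lower()
    if t.startswith("mailto:"):
        return "email"
    # Ignore any "#fragment" or "?query" part when looking at the extension.
    path = t.split("#")[0].split("?")[0]
    if path.endswith((".jpg", ".gif")):
        return "image"
    if path.endswith(".wav"):
        return "sound"
    if path.endswith((".html", ".htm")):  # ".htm" is an assumption
        return "page"
    return "other"


def sort_targets(targets):
    """Put each target into its own list, so links and mailto's stay separate."""
    found = {"email": [], "image": [], "sound": [], "page": [], "other": []}
    for target in targets:
        kind = classify(target)
        value = target.strip()
        if kind == "email":
            value = value[len("mailto:"):]  # keep only the address
        found[kind].append(value)
    return found


if __name__ == "__main__":
    result = sort_targets(["mailto:tony@example.com", "logo.gif", "page2.html"])
    for kind, items in result.items():
        print(kind, len(items), items)
```

The "page" list plays the role of the list of URLs to visit at the next depth, and len() of each list gives a count per kind.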

After you finish parsing, call your main parse procedure with each link you found in page 1. Then do the same with the links from the depth 2 pages to find the depth 3 pages, and so on...
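
A rough sketch of that depth-by-depth loop, in Python rather than Delphi and under several assumptions: fetch is a hypothetical function you supply that returns the HTML source for a URL (or None on failure), crawl is a made-up name, and the "seen" set is an addition that stops the bot from revisiting pages that link back to each other.

```python
import re
from collections import deque
from urllib.parse import urljoin

# Captures the HREF target of <A HREF=...>, quoted or not.
HREF_RE = re.compile(r"""<a\s+href\s*=\s*["']?([^"'\s>]+)""", re.IGNORECASE)


def crawl(start_url, fetch, max_depth):
    """Visit pages level by level; return (pages visited, emails found)."""
    seen = {start_url}                # URLs already queued (assumption: avoids loops)
    queue = deque([(start_url, 1)])   # the start page is depth 1
    visited, emails = [], []
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue                  # page could not be read, skip it
        visited.append(url)
        for target in HREF_RE.findall(html):
            if target.lower().startswith("mailto:"):
                emails.append(target[len("mailto:"):])
                continue
            link = urljoin(url, target)   # resolve relative links
            is_page = link.lower().endswith((".html", ".htm"))
            if is_page and depth < max_depth and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited, emails


if __name__ == "__main__":
    # In-memory stand-in for the web, so the sketch runs without a network.
    site = {
        "http://example.com/index.html": '<a href="a.html">A</a>',
        "http://example.com/a.html": '<a href="mailto:tony@example.com">T</a>',
    }
    print(crawl("http://example.com/index.html", site.get, 2))
```

A queue is used here instead of a recursive call so that all depth 2 pages are processed before any depth 3 page; calling the parse procedure recursively with a depth counter would work as well.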

I can write you some code for this for free, on request, if I got the spirit of your question.
0
 

Author Comment

by:aj85
ID: 1340380
Yes, you have a good idea of what I am looking for. If you could write the code with an example, I will increase my points to 220.  Also, can you tell me how to get a count of the number of emails, images, etc. that have been collected?  I will give a bonus of 50 points if you can answer this.  Please answer as soon as possible.

Thanks
Tony
0
 

Author Comment

by:aj85
ID: 1340381
Follow up to comment added.  I want to get an automatic count of the emails, images, etc., as they come in.
0
 

Author Comment

by:aj85
ID: 1340382
Adjusted points to 250
0
 
LVL 1

Expert Comment

by:kyriacos
ID: 1340383
Hello,
  sorry if I'm late.
Have a look at
http://members.tripod.com/~kyriacos/htmlparcer.zip

It contains a sample program that adds the links, images and email addresses.

NOTE: There are 2 buttons in the program. Before you press the "Process" button, you must press the button "Save..."

How it works for your needs...


0
 
LVL 1

Expert Comment

by:kyriacos
ID: 1340384
NOTE: This program works by searching the source file for the keywords:
HREF - which indicates a link
IMG - which indicates an image and
MAILTO - which indicates an email address

Then it updates the variables used to count the instances of each one...
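
A hedged sketch of that kind of running count, in Python rather than Delphi. It is not the code from the zip file; count_items is a hypothetical name. One deliberate difference from a plain keyword search: a mailto sits inside an HREF, so this sketch counts it once as an email instead of once as a link and once as a mailto.

```python
import re
from collections import Counter

# Any <A ...> or <IMG ...> tag, in any letter case.
TAG_RE = re.compile(r"<\s*(a|img)\b[^>]*>", re.IGNORECASE)


def count_items(html, counts=None):
    """Update and return running counts of images, emails and links."""
    counts = Counter() if counts is None else counts
    for match in TAG_RE.finditer(html):
        tag = match.group(0).lower()
        name = match.group(1).lower()
        if name == "img":
            counts["images"] += 1
        elif "mailto:" in tag:
            counts["emails"] += 1     # counted as an email, not also as a link
        elif "href" in tag:
            counts["links"] += 1
    return counts


if __name__ == "__main__":
    totals = Counter()
    for source in ['<IMG SRC="a.gif">', '<a href="mailto:tony@example.com">T</a>']:
        count_items(source, totals)   # same Counter reused for every page
        print(dict(totals))           # the count grows as items come in
```

Passing the same Counter to every call keeps the totals up to date across pages, which is one way to get the automatic count as items are found.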
0
 

Author Comment

by:aj85
ID: 1340385

The sample code you wrote was fine, except that I already have a parser.  However, the count part of the program gives me some insight.  What I need to know is how to make the program go several levels deep and get an automatic count as it finds the images, etc.  I am not sure this can be accomplished in Delphi 2.0. Do you think there is another direction I should be headed in?  Perhaps another language?  Please answer at your earliest convenience.

Thanks
Tony
0
 
LVL 1

Expert Comment

by:kyriacos
ID: 1340386
Delphi is fine... just fine... it will do anything you want.
I will give you an example later tomorrow...
0
