Solved

WEB BOT

Posted on 1997-08-02
Last Modified: 2010-04-04
I asked this question a few months ago, but the answer I got was not very clear, so here I go again.  I am building a Web Bot that I want to extract images, links, emails, movies, etc., several levels deep.  Everything works now except that it will not extract emails or links, and it will not go further than the first page.  Can someone point me in the right direction this time?  I know I need the program to parse the information, and I have created a parser.  However, I apparently have the wrong code for it to pull the mailto's and links.  Please answer as soon as possible.  Also, if someone knows of a better way to do this, I am open to suggestions.  I currently use Delphi 2.0.  I know of only one book on this topic, and I have it, but it is not very clear either.

Thank you,
Tony
Question by:aj85
11 Comments
 

Author Comment

by:aj85
Edited text of question
 

Expert Comment

by:kimfriis
I am not sure this is what you are looking for, but is it the tags for the links and mailto's, like <A HREF ...>?
This should be easy if you know how to extract images and so on: if you find an <A HREF=mailto:...>, then it is a mailto.
Please clarify if this is not what you want.
 

Author Comment

by:aj85
Actually, I have figured out the problem of extracting links and mailto's since I posted this question.  However, I can't separate the two, i.e. the links and mailto's come in on the same page.  Also, I still need the answer on how to make the program go beyond the first page.  I will be waiting for an answer.
 
LVL 1

Accepted Solution

by:
kyriacos earned 250 total points
I am not sure either what information you want.
So you made a parser.
That's good... Parsers work in an intelligent way, so they are not confused by multiple versions of the same meaning.

So if you want to separate links, emails and images in HTML content (the source HTML), you will have to read the whole tag with its parameters, and then YOU decide what the <A HREF ...> tag will do.

To make things more clear.

Read the HTML until you find the string "<A HREF".
This is necessary.
Then read every parameter until you find the ">".
This is also necessary.

So whatever the HTML tag will do will be enclosed between the "<A HREF" and ">" pair.
If you encounter:
 a MAILTO parameter, read the address
 a .jpg, read the image location
 a .gif, read the image location as well
 a .wav, also read the location
 an .html, read the new page location AND ALSO SAVE THE URL in a linked list of URLs, because you will need it to find the links at DEPTH 2.
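The sorting rule in the list above can be sketched as follows. This is only an illustration in Python, not Delphi and not the code from this thread; the names `classify`, `TagSorter` and `sort_tags` are hypothetical, and the standard-library `html.parser` module stands in for the hand-written parser. It also treats every IMG SRC as an image, which is an assumption beyond the list.

```python
from html.parser import HTMLParser

IMAGE_EXT = (".jpg", ".jpeg", ".gif")
SOUND_EXT = (".wav",)


def classify(target):
    """Return the bucket name for one HREF value."""
    lower = target.lower()
    if lower.startswith("mailto:"):
        return "email"
    if lower.endswith(IMAGE_EXT):
        return "image"
    if lower.endswith(SOUND_EXT):
        return "sound"
    return "page"  # assumption: anything else is a page to follow later


class TagSorter(HTMLParser):
    """Collects link targets into separate lists, one per kind."""

    def __init__(self):
        super().__init__()
        self.found = {"email": [], "image": [], "sound": [], "page": []}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            target = attrs["href"]
            kind = classify(target)
            if kind == "email":
                target = target[len("mailto:"):]  # keep only the address
        elif tag == "img" and attrs.get("src"):
            target = attrs["src"]
            kind = "image"
        else:
            return
        self.found[kind].append(target)


def sort_tags(html):
    """Parse one page and return the dictionary of sorted targets."""
    sorter = TagSorter()
    sorter.feed(html)
    return sorter.found
```

Because mailto's and ordinary links end up in different lists, this also addresses the earlier problem of the two arriving mixed together. The "page" list is the one to keep for the next depth.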

After you finish parsing, call your main parse procedure with each link you found in page 1. Then do the same with the links from the pages at depth 2 to find the depth 3 pages, and so on...
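A minimal sketch of that depth-by-depth loop, again in Python for illustration only. The function name `crawl`, the `fetch` callback and the `max_depth` parameter are assumptions made for this example; `fetch(url)` is expected to return the page source as a string, or None if the page cannot be read. A set of visited URLs is added so that pages linking back to each other are not processed twice.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Collects the HREF value of every <A> tag on one page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def crawl(start_url, fetch, max_depth):
    """Visit pages level by level; return (url, depth) in visiting order."""
    visited = {start_url}
    queue = deque([(start_url, 1)])  # the first page is depth 1
    order = []
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # page could not be read, skip it
        order.append((url, depth))
        if depth >= max_depth:
            continue  # deepest level reached, do not follow its links
        collector = HrefCollector()
        collector.feed(html)
        for href in collector.hrefs:
            if href.lower().startswith("mailto:"):
                continue  # an address, not a page
            link = urljoin(url, href)  # resolve relative links
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order
```

For example, `crawl("http://example.com/index.html", fetch, 3)` would stop following links once it reaches the third level. A real bot would also skip image and sound links here and only queue page links.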

I can write you some code for this on request, for free, if I have got the spirit of your question.
 

Author Comment

by:aj85
Yes, you have a good idea of what I am looking for. If you could write the code with an example, I will increase my points to 220.  Also, can you tell me how to get a count of the number of emails, images, etc. that have been collected?  I will give a bonus of 50 points if you can answer this.  Please answer as soon as possible.

Thanks
Tony
 

Author Comment

by:aj85
Follow-up to the comment I added: I want to get an automatic count of the emails, images, etc., as they come in.
 

Author Comment

by:aj85
Adjusted points to 250
 
LVL 1

Expert Comment

by:kyriacos
Hello,
Sorry if I'm late. Have a look at
http://members.tripod.com/~kyriacos/htmlparcer.zip

It contains a sample program that adds the links, images and email addresses.

NOTE: There are 2 buttons in the program. Before you press the "Process" button, you must press the "Save..." button.

How it works for your needs...
 
LVL 1

Expert Comment

by:kyriacos
NOTE: This program works by searching the source file for the keywords:
HREF - which indicates a link
IMG - which indicates an image and
MAILTO - which indicates a MAILTO

Then it updates the variables used to count the instances of each one...
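The keyword counting described above can be sketched roughly like this. It is a Python illustration, not the code in the zip file; the name `count_keywords` and the optional `totals` argument are hypothetical. Passing the same `totals` object in again for each new page gives a running count as the pages come in.

```python
from collections import Counter

KEYWORDS = ("HREF", "IMG", "MAILTO")


def count_keywords(html, totals=None):
    """Add this page's keyword counts to totals and return totals."""
    if totals is None:
        totals = Counter()
    upper = html.upper()  # HTML keywords are not case sensitive
    for word in KEYWORDS:
        totals[word] += upper.count(word)
    return totals
```

Two caveats with a plain keyword search: a mailto link is counted under both HREF and MAILTO, so the number of ordinary links is HREF minus MAILTO, and a keyword that happens to appear in the visible text of the page is counted too.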
 

Author Comment

by:aj85
The sample code you wrote was fine, except that I already have a parser.  However, the count part of the program gives me some insight.  What I need to know is how to make the program go several levels deep and get an automatic count as it finds the images, etc.  I am not sure this can be accomplished in Delphi 2.0.  Do you think there is another direction I should be headed in?  Perhaps another language?  Please answer at your earliest convenience.

Thanks
Tony
 
LVL 1

Expert Comment

by:kyriacos
Delphi is fine... just fine... it will do anything you want.
I will give you an example later tomorrow...
