Solved

URL Spider

Posted on 2004-03-27
Last Modified: 2010-04-05
I need a method that takes a URL as input, finds all of the HTML files in that directory and all of its subdirectories (online), and makes a list of these files.
It should then load each of these HTML files in the default browser, using the same browser window for all of them.
Question by:aliahmedali

Accepted Solution

by:gmayo
gmayo earned 54 total points
ID: 10694002
Most web servers won't allow you to browse their directory contents.

What Google-type spiders do is start from a given page and find the links on that page. The spider then downloads each linked page and looks for the links on that page, and so on.
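A minimal sketch of that loop in Delphi, under stated assumptions. The names CrawlFrom, TGetLinksProc, aPrefix and aMaxPages are hypothetical and not from any library: the callback is assumed to download a page and return the absolute URLs it links to, and aPrefix restricts the crawl to one directory and its subdirectories.

```delphi
uses
  Classes, SysUtils;

type
  // Hypothetical callback: downloads aURL and adds the absolute URLs of
  // the links found on that page to aLinks.
  TGetLinksProc = procedure(const aURL: string; aLinks: TStrings);

// Breadth-first crawl starting at aStartURL. Only URLs that begin with
// aPrefix are followed, and at most aMaxPages pages are visited.
// The visited URLs are returned in aVisited.
procedure CrawlFrom(const aStartURL, aPrefix: string; aMaxPages: Integer;
  aGetLinks: TGetLinksProc; aVisited: TStrings);
var
  queue, links: TStringList;
  url: string;
  i: Integer;
begin
  queue := TStringList.Create;
  links := TStringList.Create;
  try
    queue.Add(aStartURL);
    while (queue.Count > 0) and (aVisited.Count < aMaxPages) do
    begin
      // Take the oldest queued URL (breadth-first order)
      url := queue[0];
      queue.Delete(0);

      // Skip pages that were already processed
      if aVisited.IndexOf(url) >= 0 then
        Continue;
      aVisited.Add(url);

      // Ask the callback for the links on this page and queue the
      // unvisited ones that stay under the prefix
      links.Clear;
      aGetLinks(url, links);
      for i := 0 to links.Count - 1 do
        if (Pos(aPrefix, links[i]) = 1) and (aVisited.IndexOf(links[i]) < 0) then
          queue.Add(links[i]);
    end;
  finally
    links.Free;
    queue.Free;
  end;
end;
```

The visited list is what keeps the spider from looping forever when pages link back to each other. Pages that nothing links to will not be found this way.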

Geoff M.

Assisted Solution

by:Jeff_2
Jeff_2 earned 53 total points
ID: 10694071
I don't know of an easy Delphi-based solution for this, but there are
some command-line tools that have somewhat similar functionality:
  http://www.gnu.org/software/wget/wget.html
  http://www.w3.org/Robot/

Author Comment

by:aliahmedali
ID: 10694102
Thanks for caring, but I want the code, not an external application.

The code may be in other languages, like C++, Java, or Delphi.

Thanks again.

Expert Comment

by:Jeff_2
ID: 10694122
Both of the links I posted are open-source software, so their source code is available to study.

Assisted Solution

by:shaneholmes
shaneholmes earned 53 total points
ID: 10694781
Here's a piece of code that takes a URL and downloads its contents into a TMemoryStream passed to the procedure. You can then save the stream to a file, parse it for links to HTML files, and repeat the process for each link.

Call it using, e.g.

  aMS := TMemoryStream.Create;
  try
    getURLOnStream('http://www.somewhere.com/datafile.xyz', aMS);
    aMs.Position := 0;
    aMS.SaveToFile('c:\myapp\datafile.xyz');
  finally
    aMS.Free;
  end;

Don't worry about the custom exception classes below - change them to
Exception if you like.  Also, there are a couple of global variables you
may have to define (like proxyServer - the proxyserver address, if you
need one).
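As a rough sketch of those supporting declarations: the identifier names below are the ones the procedure refers to, but every type, value and default is an assumption chosen for illustration, not something taken from the original project.

```delphi
uses
  Windows, SysUtils, Classes, Forms, WinInet;

type
  // Assumed exception hierarchy matching the class names raised in the
  // procedure; plain Exception would work as well.
  EInetError = class(Exception);
  EInetStreamError = class(EInetError);
  EInetCrackURLError = class(EInetError);
  EInetOpenError = class(EInetError);
  EInetConnectError = class(EInetError);
  EHTTPOpenReqError = class(EInetError);
  EHTTPAddReqError = class(EInetError);
  EHTTPSendReqError = class(EInetError);

const
  inetBufferSize = 8192;            // assumed read-chunk size, in bytes

var
  useProxyServer: Boolean = False;  // True to force the proxy named below
  proxyServer: string = '';         // placeholder, e.g. 'proxy.example.com:8080'
  timeOutMS: DWORD = 30000;         // assumed connect timeout, in milliseconds
```

The buffer is declared as PByteArray in the procedure, and TByteArray holds 32768 bytes, so a chunk size at or below that keeps range checking happy.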

Hope this helps.

Shane

procedure getURLOnStream(const aURL: string; aMS: TMemoryStream);
// Go get the URL aURL and write it to the stream aMS.
// General version, that can use either GET or POST
resourceString
  sMethod = 'GET';
  // sMethod = 'POST';
var
  aHi, aHConnect, aHFile: HInternet;
  bytesRead: DWORD;
  aBuf: PByteArray;
  s, t, u: string;
  gotIt: boolean;
  aURLc: TURLComponents;
begin
  // Initialization, fall-through
  aHi := nil;
  aHConnect := nil;
  aHFile := nil;

  // Bale out if no stream
  if not assigned(aMS) then
    raise EInetStreamError.create('No stream passed');

  // Crack the incoming URL
  setLength(s, INTERNET_MAX_PATH_LENGTH);
  setLength(t, INTERNET_MAX_PATH_LENGTH);
  setLength(u, INTERNET_MAX_PATH_LENGTH);

  //Clear the structure
  FillChar(aURLC, sizeOf(TURLComponents), 0);
  with aURLC do
  begin
    dwStructSize := sizeOf(TURLComponents);
    lpSzExtraInfo := PChar(s);
    dwExtraInfoLength := INTERNET_MAX_PATH_LENGTH;
    lpSzHostName := PChar(t);
    dwHostNameLength := INTERNET_MAX_PATH_LENGTH;
    lpszUrlPath := PChar(u);
    dwUrlPathLength := INTERNET_MAX_PATH_LENGTH;
  end;

  // Attempt to crack the URL
  if not InternetCrackUrl(PChar(aURL), 0, ICU_ESCAPE, aURLC) then
    raise EInetCrackURLError.createFmt('Error - %d = %s',
      [GetLastError, SysErrorMessage(GetLastError)]);

  // Get hold of a buffer that'll be used over and over for each read
  GetMem(aBuf, inetBufferSize);

  // Now go do it
  try
    // Open the internet
    if useProxyServer then // explicitly use the proxy server
      aHi := InternetOpen(PChar(Application.Name), INTERNET_OPEN_TYPE_PROXY,
        PChar(proxyServer), nil, 0)
    else  // do default.  May still use a proxy server if one is set up
      aHi := InternetOpen(PChar(Application.Name), INTERNET_OPEN_TYPE_PRECONFIG,
        nil, nil, 0);
    if (aHi = nil) then
      raise EInetOpenError.create('Could not open Internet');

    // Set options for the internet handle
    InternetSetOption(aHi, INTERNET_OPTION_CONNECT_TIMEOUT, @timeOutMS,
      sizeOf(timeOutMS));

    // Make a connection to that host, raising an exception if no connection
    aHConnect := InternetConnect(aHI, aURLc.lpSzHostName,
      INTERNET_INVALID_PORT_NUMBER, nil, nil,
      INTERNET_SERVICE_HTTP, 0, 0);
    if (aHConnect = nil) then
      raise EInetConnectError.createFmt('Could not connect to server %s',
        [aURLc.lpSzHostName]);

    // Open a request to get ready to GET data, raising an exception if not successful
    aHFile := HTTPOpenRequest(aHConnect, PChar(sMethod), aURLc.lpSzUrlPath,
      HTTP_VERSION, nil, nil, INTERNET_FLAG_DONT_CACHE, 0);
    if (aHFile = nil) then
      raise(EHTTPOpenReqError.create('Could not open HTTP request'));

    // Add any extra headers to the request, raising an exception if not successful
    //   if not HTTPAddRequestHeaders(aHFile, PChar(s), length(s), HTTP_ADDREQ_FLAG_ADD) then
    //     raise(EHTTPAddReqError.create('Could not add HTTP request header'));

    // Send the request, raising an exception if not successful
    if not HTTPSendRequest(aHFile, nil, 0, aURLc.lpSzExtraInfo,
      aURLc.dwExtraInfoLength) then
      raise(EHTTPSendReqError.create('Could not send HTTP request'));

    // Loop to read the content from the URL in chunks of size inetBufferSize.
    repeat
      // Let the program do other things
      Application.processMessages;

      // Get the next chunk
      gotIt := InternetReadFile(aHFile, aBuf, inetBufferSize, bytesRead);

      // Pass it along to the stream
      if (gotIt and (bytesRead <> 0)) then
        aMS.WriteBuffer(aBuf^, bytesRead);

      // Repeat until we get no more data
    until (gotIt and (bytesRead = 0)) or (not gotIt);

  finally
    // Clean up memory
    FreeMem(aBuf, inetBufferSize);

    //Clean up by closing the handles.
    // According to the docs, we only need to close aHI,
    // which should automatically close the other ones that descend from it
    InternetCloseHandle(aHFile);
    InternetCloseHandle(aHConnect);
    InternetCloseHandle(aHI);
  end;
end;
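
For the "parse for all html files" step mentioned above, one possible sketch is a plain text scan for href="..." attributes. ExtractHrefs is a hypothetical helper name, and the scan is an assumption-laden shortcut rather than a real HTML parser: it only handles double-quoted attribute values and does not resolve relative links.

```delphi
uses
  Classes, SysUtils, StrUtils;

// Collects the values of href="..." attributes found in aHTML into aLinks.
procedure ExtractHrefs(const aHTML: string; aLinks: TStrings);
const
  sMarker = 'href="';
var
  lowerHTML: string;
  startPos, endPos: Integer;
begin
  // Search a lower-cased copy so that HREF= and href= both match;
  // character positions are identical in both strings
  lowerHTML := LowerCase(aHTML);
  startPos := PosEx(sMarker, lowerHTML, 1);
  while startPos > 0 do
  begin
    // Move past the marker to the first character of the value
    Inc(startPos, Length(sMarker));
    endPos := PosEx('"', lowerHTML, startPos);
    if endPos = 0 then
      Break;  // unterminated attribute: stop scanning
    aLinks.Add(Copy(aHTML, startPos, endPos - startPos));
    startPos := PosEx(sMarker, lowerHTML, endPos + 1);
  end;
end;
```

To feed it the downloaded page, the stream contents can be copied into an AnsiString first, for example with SetString(raw, PAnsiChar(aMS.Memory), aMS.Size), assuming the page is in a single-byte encoding. Relative links would still need to be combined with the page's own URL before being fetched; WinInet's InternetCombineUrl is one option for that.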