Solved

URL Spider

Posted on 2004-03-27
Last Modified: 2010-04-05
I need a method to input a URL, search out all of the HTML files in that directory and all subdirectories (online), and make a list of these files.
Then load each of these HTML files in the default browser. It should use the same browser window for all of them.
Question by:aliahmedali
8 Comments
 
LVL 8

Accepted Solution

by:
gmayo earned 54 total points
ID: 10694002
Most web servers won't allow you to browse their directory contents.

What Google-type spiders do is start from a given page and then find the links on that page. From each of those links, it downloads that page and looks for the links on that page. And so on.
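
As a rough illustration of that loop, here is a minimal sketch of a breadth-first crawl in Delphi. The names Crawl, TFetchPage, TExtractLinks, FakeFetch and FakeExtract are all made up for this example; the two "fake" routines stand in for a real download and a real HTML link parser, so the sketch runs without any network access.

```delphi
program SpiderSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

type
  // Hypothetical callback types: the caller supplies the download step
  // and the link-extraction step.
  TFetchPage = function(const aURL: string): string;
  TExtractLinks = procedure(const aBaseURL, aHTML: string; aLinks: TStrings);

// Breadth-first crawl starting at aStartURL, fetching at most aMaxPages pages.
// aVisited receives the URLs in the order they were fetched.
// Note: TStrings.IndexOf compares case-insensitively by default.
procedure Crawl(const aStartURL: string; aMaxPages: Integer;
  aFetch: TFetchPage; aExtract: TExtractLinks; aVisited: TStrings);
var
  queue, links: TStringList;
  url: string;
  i: Integer;
begin
  queue := TStringList.Create;
  links := TStringList.Create;
  try
    queue.Add(aStartURL);
    while (queue.Count > 0) and (aVisited.Count < aMaxPages) do
    begin
      // Take the oldest pending URL
      url := queue[0];
      queue.Delete(0);
      if aVisited.IndexOf(url) >= 0 then
        Continue;
      aVisited.Add(url);

      // Download the page and queue every link not seen before
      links.Clear;
      aExtract(url, aFetch(url), links);
      for i := 0 to links.Count - 1 do
        if (aVisited.IndexOf(links[i]) < 0) and (queue.IndexOf(links[i]) < 0) then
          queue.Add(links[i]);
    end;
  finally
    links.Free;
    queue.Free;
  end;
end;

// Stand-in for a real download: a tiny in-memory "site" of three pages.
function FakeFetch(const aURL: string): string;
begin
  if aURL = 'a' then
    Result := 'b,c'
  else if aURL = 'b' then
    Result := 'a,c'
  else
    Result := '';
end;

// Stand-in for a real HTML parser: the fake pages are comma-separated links.
procedure FakeExtract(const aBaseURL, aHTML: string; aLinks: TStrings);
begin
  aLinks.CommaText := aHTML;
end;

var
  visited: TStringList;
begin
  visited := TStringList.Create;
  try
    Crawl('a', 10, FakeFetch, FakeExtract, visited);
    Writeln(visited.CommaText);  // prints a,b,c
  finally
    visited.Free;
  end;
end.
```

A real spider would also want to stay within the starting site, respect robots.txt, and skip non-HTML content; none of that is handled here.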

Geoff M.
 
LVL 5

Assisted Solution

by:Jeff_2
Jeff_2 earned 53 total points
ID: 10694071
I don't know of an easy Delphi-based solution for this, but there are
some command-line tools that have somewhat similar functionality:
  http://www.gnu.org/software/wget/wget.html
  http://www.w3.org/Robot/
 

Author Comment

by:aliahmedali
ID: 10694102
Thanks for caring, but I want the code, not an external application.

The code may be in other languages, like C++, Java, or Delphi.

Thanks again.
 
LVL 5

Expert Comment

by:Jeff_2
ID: 10694122
Both of the links I posted are open-source software.
 
LVL 11

Assisted Solution

by:shaneholmes
shaneholmes earned 53 total points
ID: 10694781
Here's a piece of code that takes a URL and downloads its contents into a TMemoryStream passed to the procedure. The calling example below then saves the stream to a file; you could instead parse the downloaded page for links to other HTML files and repeat the process for each one.

Call it using, e.g.

  aMS := TMemoryStream.Create;
  try
    getURLOnStream('http://www.somewhere.com/datafile.xyz', aMS);
    aMS.Position := 0;
    aMS.SaveToFile('c:\myapp\datafile.xyz');
  finally
    aMS.Free;
  end;

Don't worry about the custom exception classes below; change them to
Exception if you like. Also, there are a few global variables you
may have to define: proxyServer (the proxy server address, if you
need one), useProxyServer, timeOutMS and inetBufferSize. The WinInet
unit needs to be in your uses clause.

Hope this helps.

Shane

procedure getURLOnStream(const aURL: string; aMS: TMemoryStream);
// Go get the URL aURL and write it to the stream aMS.
// General version, that can use either GET or POST
resourceString
  sMethod = 'GET';
  // sMethod = 'POST';
var
  aHi, aHConnect, aHFile: HInternet;
  bytesRead: DWORD;
  aBuf: PByteArray;
  s, t, u: string;
  gotIt: boolean;
  aURLc: TURLComponents;
begin
  // Initialization, fall-through
  aHi := nil;
  aHConnect := nil;
  aHFile := nil;

  // Bail out if no stream
  if not assigned(aMS) then
    raise EInetStreamError.create('No stream passed');

  // Crack the incoming URL
  setLength(s, INTERNET_MAX_PATH_LENGTH);
  setLength(t, INTERNET_MAX_PATH_LENGTH);
  setLength(u, INTERNET_MAX_PATH_LENGTH);

  //Clear the structure
  FillChar(aURLC, sizeOf(TURLComponents), 0);
  with aURLC do
  begin
    dwStructSize := sizeOf(TURLComponents);
    lpSzExtraInfo := PChar(s);
    dwExtraInfoLength := INTERNET_MAX_PATH_LENGTH;
    lpSzHostName := PChar(t);
    dwHostNameLength := INTERNET_MAX_PATH_LENGTH;
    lpszUrlPath := PChar(u);
    dwUrlPathLength := INTERNET_MAX_PATH_LENGTH;
  end;

  // Attempt to crack the URL
  if not InternetCrackUrl(PChar(aURL), 0, ICU_ESCAPE, aURLC) then
    raise EInetCrackURLError.createFmt('Error - %d = %s',
      [GetLastError, SysErrorMessage(GetLastError)]);

  // Get hold of a buffer that'll be used over and over for each read
  GetMem(aBuf, inetBufferSize);

  // Now go do it
  try
    // Open the internet
    if useProxyServer then // explicitly use the proxy server
      aHi := InternetOpen(PChar(Application.Name), INTERNET_OPEN_TYPE_PROXY,
        PChar(proxyServer), nil, 0)
    else  // do default.  May still use a proxy server if one is set up
      aHi := InternetOpen(PChar(Application.Name), INTERNET_OPEN_TYPE_PRECONFIG,
        nil, nil, 0);
    if (aHi = nil) then
      raise EInetOpenError.create('Could not open Internet');

    // Set options for the internet handle
    InternetSetOption(aHi, INTERNET_OPTION_CONNECT_TIMEOUT, @timeOutMS,
      sizeOf(timeOutMS));

    // Make a connection to that host, raising an exception if no connection
    aHConnect := InternetConnect(aHi, aURLc.lpSzHostName,
      INTERNET_INVALID_PORT_NUMBER, nil, nil,
      INTERNET_SERVICE_HTTP, 0, 0);
    if (aHConnect = nil) then
      raise EInetConnectError.createFmt('Could not connect to server %s',
        [aURLc.lpSzHostName]);

    // Open a request to get ready to GET data, raising an exception if
    // not successful
    aHFile := HTTPOpenRequest(aHConnect, PChar(sMethod),
      aURLc.lpSzUrlPath, HTTP_VERSION, nil,
      nil, INTERNET_FLAG_DONT_CACHE, 0);
    if (aHFile = nil) then
      raise(EHTTPOpenReqError.create('Could not open HTTP request'));

    // Add any extra headers to the request, raising an exception if not
    // successful
    //   if not HTTPAddRequestHeaders(aHFile, PChar(s), length(s),
    //     HTTP_ADDREQ_FLAG_ADD) then
    //     raise(EHTTPAddReqError.create('Could not add HTTP request header'));

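    // Send the request, raising an exception if not successful.
    // Note: the URL's extra info (query string) is sent as the request body,
    // which suits POST; for GET it would need to be appended to the URL path.
    if not HTTPSendRequest(aHFile, nil, 0, aURLc.lpSzExtraInfo,
      aURLc.dwExtraInfoLength) then
      raise(EHTTPSendReqError.create('Could not send HTTP request'));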

    // Loop to read the content from the URL in chunks of size
    // inetBufferSize.
    repeat
      // Let the program do other things
      Application.processMessages;

      // Get the next chunk
      gotIt := InternetReadFile(aHFile, aBuf, inetBufferSize, bytesRead);

      // Pass it along to the stream
      if (gotIt and (bytesRead <> 0)) then
        aMS.WriteBuffer(aBuf^, bytesRead);

      // Repeat until we get no more data
    until (gotIt and (bytesRead = 0)) or (not gotIt);

  finally
    // Clean up memory
    FreeMem(aBuf, inetBufferSize);

    //Clean up by closing the handles.
    // According to the docs, we only need to close aHI,
    // which should automatically close the other ones that descend from it
    InternetCloseHandle(aHFile);
    InternetCloseHandle(aHConnect);
    InternetCloseHandle(aHI);
  end;
end;
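
For the "parse for all html files" step mentioned above, something along these lines might be a starting point. It is only a sketch: ExtractHrefs is a made-up name, it only recognises double-quoted href="..." attributes, and it returns links exactly as written in the page, so relative links would still have to be resolved against the page's URL before being downloaded. It assumes PosEx from the StrUtils unit (Delphi 7 and later). The contents of the memory stream would first have to be copied into a string.

```delphi
program HrefSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes, StrUtils;

// Collect the values of all double-quoted href attributes found in aHTML.
// Hypothetical helper for illustration; not a full HTML parser.
procedure ExtractHrefs(const aHTML: string; aLinks: TStrings);
const
  marker = 'href="';
var
  lower: string;
  startPos, endPos: Integer;
begin
  // Search a lower-cased copy so HREF and href both match;
  // LowerCase keeps the length, so positions line up with aHTML.
  lower := LowerCase(aHTML);
  startPos := PosEx(marker, lower, 1);
  while startPos > 0 do
  begin
    Inc(startPos, Length(marker));          // first character of the value
    endPos := PosEx('"', lower, startPos);  // closing quote
    if endPos = 0 then
      Break;                                // unterminated attribute: stop
    if endPos > startPos then               // skip empty href=""
      aLinks.Add(Copy(aHTML, startPos, endPos - startPos));
    startPos := PosEx(marker, lower, endPos + 1);
  end;
end;

var
  links: TStringList;
begin
  links := TStringList.Create;
  try
    ExtractHrefs('<a href="one.html">1</a> <A HREF="sub/two.html">2</A>', links);
    Writeln(links.CommaText);  // prints one.html,sub/two.html
  finally
    links.Free;
  end;
end.
```

Each extracted link could then be passed back to getURLOnStream, which gives the repeat-the-process loop described at the top of this comment.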