Solved

Simple find

Posted on 2004-03-31
Last Modified: 2008-03-17
A few years back there used to be an application called "Simple find". It was a nice Windows-based app that could query search engines by subject and then give you a list of results.

I have been trying to duplicate this feature in my own application and am having some difficulty.

In my application the user can request a file type by extension.

I started out using WinInet (wininet.dll) with a VC++ 6.0 example called Tear that could download an HTML page.

My thinking was to download the page, then parse it for either file locations or links to other HTML pages, then recursively parse those pages, and so on until the user's depth level had been reached or no more links were found.

What I found was that not all the links are so straightforward. Some of them are full paths, such as "http://www.xxx.yyy.zzzz.com/apple.jpg", while others are relative, such as "apple.jpg". Besides that, there seem to be many more page types than just htm or html.

So I went looking to see if someone had already conquered a parsing mechanism.

Then I was directed to WGet.

WGet is a great tool, but it still does not quite do what I want. I use WGet with ShellExecute and just pass parameters to Wget.exe from my program.

This only sort of works. It seems that WGet has just as much trouble with parsing as I expected to have myself.

On top of this, I wanted the ability to query search engines. As things stand now, users have to enter a starting web page address.

I have also looked at some of the pages where WGet missed files, and I see both absolute and relative file paths in them, so I don't know why they were missed.

Then one day I was chatting and someone suggested that I have a common server and a PHP script. The users would all go to the one site (not too sure how that would perform): each of their applications would query that site, and the PHP script would return the results it obtained from the search engines.

His thought was that you can query search engines, but you have to be careful because from time to time they change their format.
But it sounded as if he was only guessing and had never done the scripting himself.

I know from studying the address bar when I search on some of the less popular search engines that I could adjust the variables to change the search content and the starting page. That seemed hopeful.

But then I noticed that Google and perhaps Yahoo had some kind of restriction, because I would get "page forbidden" or "no access" (can't remember exactly) and was not allowed access. Somehow they could detect that it was not an original query but a machine-generated one.
I noticed that some of the less popular ones did not do this.

So I am here fishing for guidance.

RJ

Question by:RJSoft
15 Comments
 
LVL 30

Expert Comment

by:Axter
So what exactly is your question?
 
LVL 3

Author Comment

by:RJSoft
Hello Axter.

A few questions.

First, whether anyone knows of a good parsing function or example.

I guess one that could start from the first page of results of a search engine query and go deep enough to get to file locations (and build absolute paths from relative ones).

And perhaps a method or function to query search engines. Maybe avoiding PHP, because I don't know it. Or should I invest time to learn it so I can get this functionality?

RJ
 
LVL 30

Expert Comment

by:Axter
I created a program somewhat similar to this before.

The file extension should not matter.  Your program should just verify that the file has the HTML tags to validate that it is an html page.
You can add code to skip over common file types (gif, png, jpg, etc.).
The logic for creating a full path from a relative path is not that complicated either.
Just check if the path has "//" set.  If it doesn't have this set, then assume it's a relative path, and prefix the current path to the target path.

In my program, I use the search engines as a starting point.
I had an option page in which the user could change the format for the search engine, or add other search engines with associated format.

One approach you can use is to have a fixed site that stores the format, and have your program check this site every time it starts up.
That way if the format changes, all you have to do is update the one site, and that will update all the users.
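
The content check described above (look for HTML tags rather than trusting the file extension) could be sketched roughly as follows. This is only a minimal illustration: the function name `looksLikeHtml` is made up for this example, and it assumes the page content has already been downloaded into a string.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper (not from the thread): returns true if the downloaded
// content contains an opening <html tag, compared case-insensitively.
bool looksLikeHtml(const std::string& content)
{
    // Lower-case a copy so that <HTML>, <Html> and <html> all match.
    std::string lower(content.size(), '\0');
    std::transform(content.begin(), content.end(), lower.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return lower.find("<html") != std::string::npos;
}
```

A check like this will miss HTML fragments that omit the `<html>` tag, so it is probably best treated as a cheap first filter rather than a guarantee.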
 
LVL 12

Assisted Solution

by:stefan73
stefan73 earned 50 total points
Hi RJSoft,
Probably the easiest way to automatically query search engines is the WWW::Search classes of Perl:

http://search.cpan.org/~mthurn/WWW-Search-2.46/lib/WWW/Search.pm

This small sample prog shows the power of it:

    require WWW::Search;
    my $sQuery = 'Columbus Ohio sushi restaurant';
    my $oSearch = new WWW::Search('AltaVista');
    $oSearch->native_query(WWW::Search::escape_query($sQuery));
    $oSearch->login($sUser, $sPassword);
    while (my $oResult = $oSearch->next_result())
      {
      print $oResult->url, "\n";
      } # while
    $oSearch->logout;

If you don't know Perl yet, now's the perfect moment to learn it ;-)

Cheers,
Stefan
 
LVL 3

Author Comment

by:RJSoft
That's pretty much what I had gathered before. But I have become a bit spoiled by trying to get WGet to do all the work by simply calling ShellExecute with parameters.

I kind of dread going back to WinInet (wininet.dll) and writing my own parsing. I also don't know whether I could find any reliable search engines that would not simply change their format.

>>In my program, I use the search engines as a starting point.
I had an option page in which the user could change the format for the search engine, or add other search engines with associated format.

Sounds great! But is that a bit much for some users? Maybe I do not fully understand how much configuring the user is doing. As far as I could understand, the configuring would involve changing variable names in search engine queries.

Ex.

The address bar on a query shows (just an example, I don't remember exactly):

http://www.dogpile.com/search?subject=xxx;page=1;

So I could have a dialog with user input. Apples.

http://www.dogpile.com/search?subject=Apples;page=1;


Now if dogpile suddenly changed its format to

http://www.dogpile.com/search?page=1;find=Apples;


Then how would you have a user interface to suggest changing variable names and adjusting location?

I know I am probably way off base here. Maybe this does not matter.
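
One way to handle a format change like the one above without asking users to rename variables is to store each engine's URL as a template string and substitute placeholders into it. The sketch below assumes made-up `{subject}` and `{page}` placeholders and hypothetical function names; the dogpile URLs are the illustrative ones from this post, not the engine's real query format.

```cpp
#include <cstdio>
#include <string>

// Percent-encode everything except RFC 3986 "unreserved" characters,
// so that spaces and punctuation in the subject are safe inside a URL.
std::string percentEncode(const std::string& text)
{
    std::string out;
    for (unsigned char c : text) {
        const bool unreserved =
            (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
            (c >= '0' && c <= '9') || c == '-' || c == '_' || c == '.' || c == '~';
        if (unreserved) {
            out += static_cast<char>(c);
        } else {
            char buf[4];
            std::snprintf(buf, sizeof buf, "%%%02X", static_cast<unsigned>(c));
            out += buf;
        }
    }
    return out;
}

// Replace every occurrence of a (non-empty) placeholder with value.
std::string replaceAll(std::string text, const std::string& placeholder,
                       const std::string& value)
{
    for (std::string::size_type pos = 0;
         (pos = text.find(placeholder, pos)) != std::string::npos;
         pos += value.size())
        text.replace(pos, placeholder.size(), value);
    return text;
}

// Hypothetical entry point: fill an engine's URL template with the user's input.
std::string buildQueryUrl(const std::string& urlTemplate,
                          const std::string& subject, int page)
{
    const std::string url = replaceAll(urlTemplate, "{subject}", percentEncode(subject));
    return replaceAll(url, "{page}", std::to_string(page));
}
```

With this arrangement, a change from `subject=...;page=...` to `page=...;find=...` only means replacing the stored template string (for example, one fetched from a fixed site at startup); the dialog the user sees stays the same.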

BTW, the only way I could figure out to query the search engines was to manipulate the variables found in the address bar. I also found that I could reduce it to a working query using only the subject and page number. The page number was important to me because I wanted to pull in a good-sized listing of web addresses related to the subject.

I would then use WinInet (wininet.dll) to download the pages.


So what happened to the code? Do you still have users that use it?

Would you mind sending an example of it? Or is that asking a bit too much?


Thanks in advance
RJ
 
LVL 3

Author Comment

by:RJSoft
Stefan.

Yes, you're correct. I would love to learn Perl.

But I am a bit fuzzy on what I would be doing. Do I have to rewrite my whole application in Perl, or could I call a Perl script from my Windows application? If so, how?

Also, does it have to live on a server? Then I also have the problem of multiple users accessing the same site. Or is this really a non-issue?

Tell me: in your example I see that it also asks for a password.

$oSearch->login($sUser, $sPassword);
   
Is this because it lives on a server, or is it something else?

Is there a way around this, and why is it required in the example?

How about some beginner books?

Thanks

RJ

 
LVL 17

Accepted Solution

by:
rstaveley earned 50 total points
If you're not too comfortable in Perl, you can call C/C++ programs from your CGI script to do the parsing.

Use WWW::Search to get the URLs. Use wget/lynx -source to fetch the HTML (or LWP::UserAgent if you want to spread your wings in Perl). Then parse the HTML file with your C++ program for more URLs, and use wget/lynx -source to fetch their HTML... <etc.>

You'll probably need to pass a nesting indicator so you don't recurse ad infinitum. You'll need to pass the URL path, so that relative URLs can be resolved. You should look for "href=" (or "HREF="). I doubt if processing "action=" would be too fruitful. Processing URLs in JavaScript would be hard work.

Here's a quick'n'dirty stab at the C++ code. It simply lists the URLs found in HREF attributes in an HTML file. It would be easy to add SRC= to this, if you reckon that would be valuable. The list is written to standard output, which should be something Perl groks. This would probably be done more easily in Perl, but like you, my Perl is weak.
--------8<--------
#include <cstdlib>      // exit
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

using std::string;

int main(int argc,const char *argv[])
{
      // Usage
      if (argc != 3) {
            std::cerr << "Usage: " << *argv << " {filename} {url}\n";
            exit(1);
      }

      // Open the HTML file
      std::ifstream fin(*++argv);      // Input file
      if (!fin) {
            std::cerr << "Error: Unable to open " << *argv << '\n';
            exit(2);
      }

      // Read off the full URL
      string url(*++argv);      // Full URL

      // Find where the host starts, just past "://" (works for http and https)
      string::size_type hostStart = url.find("://");
      hostStart = (hostStart == string::npos) ? 0 : hostStart + 3;

      // Use path for relative URLs
      string path = url;      // Get the path that relative URLs are relative to
      string::size_type pos;
      if ((pos = path.find('?')) != string::npos)      // Lose the query string
            path.resize(pos);
      if ((pos = path.rfind('/')) != string::npos && pos >= hostStart)      // Keep the scheme and host, but lose the filename
            path.resize(pos);

      // Get the root for URLs, which start with '/'
      string root = path;
      if ((pos = root.find('/',hostStart)) != string::npos)      // Keep the scheme and host, but lose the path
            root.resize(pos);

      path += '/';            // Add a '/' separator to the path for relative URLs

      // Looking for these attributes
      typedef std::vector<string> SVector;
      SVector attributeList;
      attributeList.push_back("href=");
      attributeList.push_back("HREF=");

      // Process the file line by line
      string line;
      while (getline(fin,line)) {
            // Process each of the sought attributes
            for (SVector::size_type i = 0;i < attributeList.size();++i) {
                  const string& attr = attributeList[i];
                  for (string::size_type at = 0;(at = line.find(attr,at)) != string::npos;++at) {
                        const string remains = line.substr(at+attr.size());
                        if (remains.empty())
                              continue;
                        string link;
                        if (remains[0] == '"' || remains[0] == '\'') {
                              // Quoted value: take everything up to the matching closing quote
                              const string::size_type end = remains.find(remains[0],1);
                              link = remains.substr(1,end == string::npos ? end : end-1);
                        }
                        else      // Unquoted value ends at whitespace or the end of the tag
                              link = remains.substr(0,remains.find_first_of(" \t>"));
                        if (link.empty())
                              continue;
                        if (link.find("://") != string::npos)      // Already absolute
                              std::cout << link << '\n';
                        else if (link[0] != '/')      // Relative to the current path
                              std::cout << path << link << '\n';
                        else      // Relative to the site root
                              std::cout << root << link << '\n';
                  }
            }
      }
      return 0;
}
--------8<--------
LVL 3

Author Comment

by:RJSoft
rstaveley

Thanks for taking the time to write and post the parsing code above. I may end up using it, I don't know.

For now I am simply trying to decide how I should design this thing, before I end up spending serious time trying to reinvent the wheel.

If Perl is what I should learn, then Perl I will learn.

I am just unfamiliar with how it should all be arranged.

My understanding is...

SCENARIO #1 perl script or cgi script or php??
My application.
The user selects a file type and a subject matter.
Button is pressed.

My application makes a request to a Perl/CGI/PHP script that resides on a specific web server.

How, I am not exactly sure. How do I activate a server-side script from within my client-side application?

Next the script queries a search engine or group of search engines. Perhaps the results are returned in the form of an HTML page, which is later parsed so something like WGet can download files from the given addresses.
Or maybe some stuff from FTP. I might like to get away from WGet.exe, as I would rather use a DLL than shell out to an exe.

SCENARIO #2

My application.
The user selects a file type and a subject matter.
Button is pressed.

My application uses something like WinInet (wininet.dll) to get the search engine pages. The pages are parsed and two lists are built: one contains web page links, which will be recursively parsed; the other contains actual file locations.
Maybe use WGet to download the files.
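
The two lists in this scenario could be built roughly as sketched below, assuming the links have already been resolved to absolute URLs. The names `urlExtension`, `LinkLists` and `classifyLinks` are made up for this example, and the list of skipped extensions is only an illustrative guess at "common file types" that are neither pages nor the requested type.

```cpp
#include <algorithm>
#include <cctype>
#include <iterator>
#include <string>
#include <vector>

// Lower-cased extension of the URL's path (without the dot), or "" if there is none.
std::string urlExtension(const std::string& url)
{
    // Ignore the query string and fragment.
    std::string path = url.substr(0, url.find_first_of("?#"));
    // Skip past "scheme://host" so a dot in the host name is not taken for an extension.
    const std::string::size_type schemeEnd = path.find("://");
    if (schemeEnd != std::string::npos) {
        const std::string::size_type slash = path.find('/', schemeEnd + 3);
        if (slash == std::string::npos)
            return "";                      // host only, no path component
        path = path.substr(slash);
    }
    const std::string::size_type lastSlash = path.rfind('/');
    const std::string::size_type dot = path.rfind('.');
    if (dot == std::string::npos || (lastSlash != std::string::npos && dot < lastSlash))
        return "";
    std::string ext = path.substr(dot + 1);
    std::transform(ext.begin(), ext.end(), ext.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return ext;
}

struct LinkLists {
    std::vector<std::string> pages;   // links to parse further
    std::vector<std::string> files;   // links to files of the requested type
};

// wantedExtension is expected in lower case without the dot, e.g. "jpg".
LinkLists classifyLinks(const std::vector<std::string>& links,
                        const std::string& wantedExtension)
{
    // Assumed list of types that are neither pages nor worth parsing.
    static const char* const skipped[] = {"gif", "png", "jpg", "jpeg", "css", "js", "ico"};
    LinkLists result;
    for (const std::string& link : links) {
        const std::string ext = urlExtension(link);
        if (ext == wantedExtension)
            result.files.push_back(link);
        else if (std::find(std::begin(skipped), std::end(skipped), ext) == std::end(skipped))
            result.pages.push_back(link);   // anything not skipped is treated as a page
    }
    return result;
}
```

Everything that is not the requested type and not in the skip list is treated as a candidate page here, which matches the earlier observation that pages come in many more types than just htm or html.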

//////////////////////////

Both scenarios leave me confused.

On one hand I have a server-side script that produces an output file. I guess it would not matter if that file had the same name for every user of the script, as the result would just be overwritten (assuming the result file resides on the server). I take the downloading of the result to be a copy. I don't really perceive this as a problem, as my software is not that popular yet, but I do have a problem with continually adding bandwidth for more and more users. It could become a problem of server slowness.

Both scenarios have the same problem of a changing search engine format but the script could be changed in one place and all is corrected.
So that is a plus for the server script.

On the other hand, I really don't know how often the search engines change format, or whether it even matters. If they keep the variable names the same, what is the difference? Maybe the less popular search engines, which are not worried about scraper-type programs like the one I am trying to create eating up their processing time, won't change their format at all.

Scenario 2 has the advantage of not relying on a server script, which could prove more cost effective in the long run.

But I gotta tell ya. I like the perl script by stefan73.

Hey stefan73, am I making any sense?
Have you done this before?

RJ


 
LVL 30

Expert Comment

by:Axter
>>So what happened to the code? Do you still have users that use it?
>>Would you mind sending an example of it? Or is that asking a bit too much?

Sorry, but I lost the code and the program when my computer crashed a couple of years ago.
It was something I was playing around with, and I lost interest in it, so I didn't pull it out of my tape-backup when I recovered my computer.
 
LVL 3

Author Comment

by:RJSoft
Thanks anyway Axter.

You know, you always seem to be a few steps (years) ahead of me. (You've been there, done that.)

Out of curiosity, what ideas are you kicking around these days? (Maybe I will readjust my scope. I am tired of being too far behind the times. It seems that by the time I conceive an idea and finally get it to market, I am already way behind.)

I am not wanting to steal any ideas. I just love to program and am hoping to develop something more substantial / profitable.

RJ
 
LVL 17

Expert Comment

by:rstaveley
> SCENARIO #1 perl script or cgi script or php??

CGI scripts tend to be written in Perl. Stefan's WWW::Search is too good a fit for you not to use and it should be easy to adapt it into a CGI. You could perhaps have it return XML which you could parse on your Windows application using MSXML. [If you've not already done this sort of thing, you'll be pleasantly surprised by MSXML.]

So your Windows application issues a request via MSXML to your CGI script as follows:

   http://yourhost.yournetwork.net/cgi-bin/yourcgiscript.pl?search=XXXX

The CGI script works on your (say) Linux server with WWW::Search, wget and your C++ parser executable to return:

  <?xml version="1.0" ?>
  <results>
    <result url="http://somehost.somenetwork.net/somepath/somefile.html" />
    <result url="http://otherhost.othernetwork.net/otherpath/otherfile.php" />
  </results>

Your Windows application then uses MSXML's DOM parser to do pretty things with the URLs.

That's how I'd do it.
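
If the C++ side ends up producing the result list, the XML shown above could be generated along these lines. This is a sketch only: `escapeXmlAttribute` and `resultsToXml` are hypothetical names, and the element layout simply mirrors the example document in this comment. The escaping matters because result URLs routinely contain `&`, which would otherwise make the document ill-formed for MSXML or any other XML parser.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Escape the characters that are not allowed verbatim inside a
// double-quoted XML attribute value.
std::string escapeXmlAttribute(const std::string& text)
{
    std::string out;
    for (char c : text) {
        switch (c) {
            case '&': out += "&amp;";  break;
            case '<': out += "&lt;";   break;
            case '>': out += "&gt;";   break;
            case '"': out += "&quot;"; break;
            default:  out += c;
        }
    }
    return out;
}

// Build a <results> document with one <result url="..."/> element per URL.
std::string resultsToXml(const std::vector<std::string>& urls)
{
    std::ostringstream xml;
    xml << "<?xml version=\"1.0\" ?>\n<results>\n";
    for (const std::string& url : urls)
        xml << "  <result url=\"" << escapeXmlAttribute(url) << "\" />\n";
    xml << "</results>\n";
    return xml.str();
}
```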
 
LVL 17

Expert Comment

by:rstaveley
>  hoping to develop something more .... profitable.

I reckon I earn my living at the trailing edge of technology. It is interesting and profitable... and much better documented than the leading edge.
 
LVL 3

Author Comment

by:RJSoft
Thanks rstaveley.

I am currently shopping around for a good beginner's Perl book. I am glad to hear that I don't have to rewrite my whole application. I have read a little on MSXML and have some ideas.

Appreciated. I definitely have to save these posts on my PC.

RJ
 
LVL 3

Author Comment

by:RJSoft
rstaveley, now you've got me curious. What do you do, fix up legacy code for some shop? What kind of product is it? (What market?)

I used to work on prison inmate accounting software. It was good. But long story short, they sold out.

RJ
 
LVL 17

Expert Comment

by:rstaveley
I write applications for broadcast television. It is a mixed bag of technologies, but none of them could claim to be leading edge - unless you were a salesman ;-)