RJSoft
Simple find
A few years back there used to be an application called "Simple Find". It was a nice Windows-based app that could query search engines by subject and then give you a list of results.
I have been trying to duplicate this feature in my own application and am having some difficulty.
In my application the user can request a file type by extension.
I started out using Wininet.dll with a VC++ 6.0 sample called Tear that could download an HTML page.
My thinking was to download the page, then parse it for either file locations or links to other HTML pages, then recursively parse those pages and continue until the user's depth level had been reached or there were no more links found.
What I found was that not all the links are so straightforward. Some of them are full paths ("http://www.xxx.yyy.zzzz.com/apple.jpg") while others are relative ("apple.jpg"). Besides that, there seem to be many more page types than just .htm or .html.
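The parse-and-recurse step above can be sketched roughly like this. It is only an illustration: `extract_links` is a hypothetical helper that scans for double-quoted `href` attributes with plain string searching, where a real crawler would use an actual HTML parser.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Collect the values of href="..." attributes from raw HTML.
// Only double-quoted, lowercase href attributes are handled; a
// real crawler would use a proper HTML parser instead.
std::vector<std::string> extract_links(const std::string& html)
{
    std::vector<std::string> links;
    std::string::size_type pos = 0;
    while ((pos = html.find("href=\"", pos)) != std::string::npos) {
        pos += 6;                                   // skip past href="
        std::string::size_type end = html.find('"', pos);
        if (end == std::string::npos)
            break;                                  // unterminated attribute
        links.push_back(html.substr(pos, end - pos));
        pos = end + 1;
    }
    return links;
}
```

Each returned link would then be downloaded and fed back through the same function until the depth limit is reached.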
So I went looking to see whether someone had already solved the parsing problem.
Then I was directed to WGet.
WGet is a great tool, but it still does not quite do what I want it to do. I use WGet with ShellExecute and just pass parameters to wget.exe from my program.
This only sort of works; WGet seems to have just as much trouble with the parsing as I do.
Not only that, but I wanted the ability to query search engines. As things stand now, users have to enter a starting web page address.
I also notice that on some of the pages where WGet missed files, I can see both absolute and relative file paths.
Then one day I was chatting and someone suggested that I set up a common server with a PHP script. The users would all go to that one site (I'm not too sure how that would perform), each of their applications would query it, and the PHP script would return results it obtained from the search engines.
His thought was that you can query search engines, but you have to be careful because from time to time they change their format.
But it sounded as though he was only guessing and had never done the scripting himself.
I do know, from studying the address bar when searching with some of the less popular search engines, that I could adjust the variables to change the search content and starting page. That seemed hopeful.
But then I noticed that Google and perhaps Yahoo had some kind of restriction, because I would get "page forbidden" or "no access" (can't remember exactly), but I was not allowed access. Somehow they could detect that it was not an original query but a machine-generated one.
I noticed that some of the less popular engines did not do this.
So I am here fishing for guidance.
RJ
So what exactly is your question?
ASKER
Hello Axter.
A few questions.
First, does anyone know of a good parsing function or example?
I guess one that could start from the first results page of a search engine query and go deep enough to reach file locations (building relative paths as well).
And perhaps a method or function to query search engines? Maybe avoiding PHP, because I don't know it. Or should I invest the time to learn it so I can get this functionality?
RJ
I created a program somewhat similar to this before.
The file extension should not matter. Your program should just verify that the file has HTML tags to validate that it is an HTML page.
You can add code to skip over common file types (gif, png, jpg, etc.).
The logic for creating a full path from a relative path is not that complicated either.
Just check whether the path contains "//". If it doesn't, assume it's a relative path and prefix the current path to the target path.
In my program, I use the search engines as a starting point.
I had an option page in which the user could change the format for the search engine, or add other search engines with associated format.
One approach you can use is to have a fixed site that stores the format, and have your program check that site every time it starts up.
That way if the format changes, all you have to do is update the one site, and that will update all the users.
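The relative-path rule described above might look something like this in C++. It is a minimal sketch under the assumption stated here, that any link containing "//" is already absolute; `resolve_url` is a hypothetical name, and "../" segments and root-relative "/" paths are deliberately ignored.

```cpp
#include <cassert>
#include <string>

// Resolve a link found on a page against that page's URL.
// A link containing "//" is treated as already absolute; anything
// else replaces the last segment of the base URL. "../" segments
// and root-relative "/" paths are not handled in this sketch.
std::string resolve_url(const std::string& base, const std::string& link)
{
    if (link.find("//") != std::string::npos)
        return link;                                    // already a full URL
    std::string::size_type scheme = base.find("//");
    std::string::size_type start =
        (scheme == std::string::npos) ? 0 : scheme + 2; // first char after "//"
    std::string::size_type slash = base.rfind('/');
    if (slash == std::string::npos || slash < start)
        return base + "/" + link;                       // base has no path part
    return base.substr(0, slash + 1) + link;            // swap in the last segment
}
```

So "apple.jpg" found on "http://host/pics/index.html" becomes "http://host/pics/apple.jpg", while a full URL passes through untouched.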
ASKER
That's pretty much what I had gathered before. But I have become a bit spoiled by trying to get WGet to do all the work through a simple ShellExecute with parameters.
I kind of dread going back to Wininet.dll and creating my own parsing, not knowing whether I could find any reliable search engines that would not simply change their format.
>>In my program, I use the search engines as a starting point.
I had an option page in which the user could change the format for the search engine, or add other search engines with associated format.
Sounds great! But is that a bit much for some users? Maybe I do not fully understand how much configuring the user is doing. As far as I can tell, the configuring would involve changing variable names in the search engine queries.
For example, the address bar on a query shows (just an example, I don't remember exactly):
http://www.dogpile.com/search?subject=xxx;page=1;
So I could have a dialog with user input: Apples.
http://www.dogpile.com/search?subject=Apples;page=1;
Now, if Dogpile suddenly changed its format to
http://www.dogpile.com/search?page=1;find=Apples;
Then how would you design a user interface for changing variable names and adjusting their positions?
I know I am probably way off base here. Maybe this does not matter.
BTW, the only way I could figure out how to query the search engines was to manipulate the variables found in the address bar. I also found that I could reduce it to a working query using only the subject and the page number. The page number was important to me because I wanted to pull in a good-sized listing of web addresses related to the subject.
I would then use Wininet.dll to download the pages.
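Storing the query format as a template string would also answer the format-change worry above: if Dogpile swapped its variables around, only the template would need editing. A small sketch follows; the placeholder tokens `{subject}` and `{page}` and the function name `build_query` are my own invention, and the Dogpile URLs are just the illustrative formats from above.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Substitute the search subject and page number into a per-engine
// query template. "{subject}" and "{page}" are placeholder tokens
// chosen here for illustration; each configured search engine
// would supply its own template string.
std::string build_query(std::string tmpl, const std::string& subject, int page)
{
    std::ostringstream num;
    num << page;                                   // page number as text
    std::string::size_type pos = 0;
    while ((pos = tmpl.find("{subject}", pos)) != std::string::npos) {
        tmpl.replace(pos, 9, subject);             // 9 = strlen("{subject}")
        pos += subject.size();                     // continue after substitution
    }
    pos = 0;
    while ((pos = tmpl.find("{page}", pos)) != std::string::npos) {
        tmpl.replace(pos, 6, num.str());           // 6 = strlen("{page}")
        pos += num.str().size();
    }
    return tmpl;
}
```

If an engine changed its format, the user (or a central config site) would only edit the template, not the program.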
So what happened to the code? Do you still have users that use it?
Would you mind sending an example of it? Or is that asking a bit too much?
Thanks in advance
RJ
ASKER
Stefan,
Yes, you're correct. I would love to learn Perl.
But I am a bit fuzzy on what I would be doing. Do I have to rewrite my whole application in Perl, or could I call a Perl script from my Windows application? If so, how?
Also, doesn't it have to exist on a server? And then don't I have the problem of multiple users accessing the same site? Or is this really a non-issue?
Tell me, in your example I see that it also asks for a password.
$oSearch->login($sUser, $sPassword);
Is this because of its existence on a server, or is it something else?
Is there a way around this, and why is it required in the example?
How about some beginner books?
Thanks
RJ
ASKER
rstaveley
Thanks for taking the time to write/post the parsing code above. I may end up using it; I don't know.
For now I am simply trying to decide how I should design this thing, before I end up spending serious time trying to reinvent the wheel.
If Perl is what I should learn, then Perl I will learn.
I am just unfamiliar with how the arrangement should be.
My understanding is...
SCENARIO #1: a Perl, CGI, or PHP script?
My application.
The user selects a file type and a subject matter.
Button is pressed.
My application makes a request to a Perl/CGI/PHP script that resides on a specific web server.
How, I am not exactly sure. How do I invoke a server-side script from within my client-side application?
Next, the script queries a search engine or group of search engines, and perhaps the results are returned in the form of an HTML page, which is later parsed so that something like WGet can download files from the given addresses.
Or maybe some of it via FTP. I might like to get away from WGet.exe, as I don't really like shelling out to an exe as opposed to using a DLL.
SCENARIO #2
My application.
The user selects a file type and a subject matter.
Button is pressed.
My application uses something like Wininet.dll to get the search engine pages. The pages are parsed and two lists are built: one containing web page links, which will be recursively parsed, and the other containing actual file locations.
Maybe use WGet to download files.
//////////////////////////
Both scenarios leave me confused.
On one hand I have a server-side script that produces an output file. I guess it would not matter if that file had the same name for each user who used the script, as the result would be overwritten (assuming the result file resides on the server); I take the downloading of the result to be a copy. I don't really perceive this as a problem, as my software is not that popular yet, but I do have a problem with continually adding bandwidth for more and more users. It could become a problem of server slowness.
Both scenarios have the same problem of a changing search engine format, but with the script it could be fixed in one place and all users would be corrected.
So that is a plus for the server script.
On the other hand, I really don't know how often the search engines change format, or whether it even really matters: if they keep the variable names the same, what is the difference? Maybe the less popular search engines, which are not worried about having their processing time sucked up by scraper-type programs like the one I am trying to create, won't change their format.
Scenario 2 has the advantage of not relying on a server script, which could prove to be more cost-effective in the long run.
But I gotta tell ya, I like the Perl script by stefan73.
Hey stefan73, am I making any sense?
Have you done this before?
RJ
>>So what happened to the code? Do you still have users that use it?
>>Would you mind sending an example of it? Or is that asking a bit too much?
Sorry, but I lost the code and the program when my computer crashed a couple of years ago.
It was something I was playing around with, and I lost interest in it, so I didn't pull it out of my tape-backup when I recovered my computer.
ASKER
Thanks anyway Axter.
You know, you always seem to be a few steps (years) ahead of me (been there, done that).
Out of curiosity, what ideas are you kicking around these days? (Maybe I will readjust my scope. I am tired of being too far behind the times. It seems like by the time I conceive an idea and finally get it to market, I am already way behind.)
I am not wanting to steal any ideas. I just love to program and hope to develop something more substantial/profitable.
RJ
> SCENARIO #1 perl script or cgi script or php??
CGI scripts tend to be written in Perl. Stefan's WWW::Search is too good a fit for you not to use, and it should be easy to adapt it into a CGI. You could perhaps have it return XML, which you could parse in your Windows application using MSXML. [If you've not already done this sort of thing, you'll be pleasantly surprised by MSXML.]
So your Windows application issues a request via MSXML to your CGI script as follows:
http://yourhost.yournetwork.net/cgi-bin/yourcgiscript.pl?search=XXXX
The CGI script works on your (say) Linux server with WWW::Search, wget and your C++ parser executable to return:
<?xml version="1.0" ?>
<results>
<result url="http://somehost.somenetwork.net/somepath/somefile.html" />
<result url="http://otherhost.othernetwork.net/otherpath/otherfile.php" />
</results>
Your windows application then uses MSXML's DOM parser to do pretty things with the URLs.
That's how I'd do it.
> hoping to develop something more .... profitable.
I reckon I earn my living at the trailing edge of technology. It is interesting and profitable... and much better documented than the leading edge.
ASKER
Thanks rstaveley.
I am currently shopping around for a good beginner's Perl book. I am glad to hear that I don't have to rewrite my whole application. I have read a little about MSXML and have some ideas.
Appreciated. I'll definitely have to save these posts on my PC.
RJ
ASKER
rstaveley, now you've got me curious. What do you do, fix up legacy code for some shop? What kind of product is it? (What market?)
I used to work on prison inmate accounting software. It was good, but long story short, they sold out.
RJ
I write applications for broadcast television. It is a mixed bag of technologies, but none of them could claim to be leading edge - unless you were a salesman ;-)