Solved

Query a search engine with Wininet.dll or other?

Posted on 2003-12-08
10
393 Views
Last Modified: 2008-03-10
Hello all;

I am trying to add some functionality to my app. I would like to search the internet for various kinds of files, like an app that collects JPEGs, etc.

In my VC++ 6.0 I found the TEAR sample app. This is almost what I want. Below is part of their description of the TEAR app, which uses WININET.DLL.

The TEAR sample shows how to write an MFC console application that uses WININET.DLL to communicate with the Internet. The sample shows how to form an HTTP request using CHttpFile against CHttpConnection and CInternetSession objects.
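
From that description, the basic fetch seems to boil down to something like this (just a rough sketch of the idea, not the sample's actual code; the agent string is a placeholder and error handling is pared down):

// Minimal sketch of a TEAR-style page fetch with the MFC WinInet classes.
#include <afxinet.h>

CString FetchPage(LPCTSTR pszUrl)
{
    CString strPage;
    CInternetSession session(_T("MyApp/1.0"));
    try
    {
        // OpenURL wraps CHttpConnection/CHttpFile for a simple GET
        CStdioFile* pFile = session.OpenURL(pszUrl, 1,
            INTERNET_FLAG_TRANSFER_ASCII | INTERNET_FLAG_RELOAD);
        CString strLine;
        while (pFile->ReadString(strLine))
            strPage += strLine + _T("\n");
        pFile->Close();
        delete pFile;
    }
    catch (CInternetException* pEx)
    {
        pEx->Delete();   // fetch failed; strPage stays empty or partial
    }
    session.Close();
    return strPage;
}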

I know I could create from the sample some recursive function that keeps gathering links, checking those pages, gathering more links, and so on.

But what I am wondering is: is there a way to query a search engine and get all of the listings (URLs), and then use TEAR-like functionality to find images, etc.?

RJ
(*always grant A for working answer or link)
0
Comment
Question by:RJSoft
10 Comments
 
LVL 11

Expert Comment

by:KurtVon
ID: 9898753
How about the API for Google?  http://www.google.com/apis/api_faq.html

Hope this helps.
0
 
LVL 3

Author Comment

by:RJSoft
ID: 9917709
KurtVon.

I checked it out. Problem is they don't have anything for VC++ 6.0. There is also some problem with licensing it as a commercial product. But I could make a workaround where users of my app would need to register with Google to get their limited 1,000 searches per day. Also, I went to the newsgroups and wrote all my questions out.

The Google API works with .NET and C#, neither of which I know. So I also wrote asking if anyone knew of a patch that I could download, but I doubt that. Also there is some problem with search results coming back only 10 at a time. I don't know if I could do a workaround for that too. How do you start a search over on the same subject and get the next 10? (until 1,000 is reached)

I have seen other applications (SimpleFind, for one) where a regular Windows-based app will query a bunch of different search engines and fill a list box with a short description and a URL.

I am wondering if I could not simply use wininet.dll to query a search engine (preferably Google) by knowing the different variables that get produced in the browser's address bar and simulating that when opening a page. Simulating a search by calculating the search results' address.

Any clues appreciated.

RJ

0
 
LVL 3

Author Comment

by:RJSoft
ID: 9917763
Here's another consideration.

I was thinking about adjusting the address.

So I did a search in Google for "fun times". I got the first page, and then I hit Next and got this result:

http://www.google.com/search?q=fun+times&hl=en&lr=&ie=UTF-8&safe=off&start=100&sa=N

The reason I chose to manipulate the address from the Next page is that I noticed that if I put &start=1, the search goes to the first page of results.

If I put &start=100 it goes to the 11th page (which is proper).

Also I noticed that if I replace the text fun+times with something like mp3+software, it works fine.

Could things be this easy? I know I can use wininet.dll to obtain the web page, then I can extract the links. Later I can use the links and extract the files, or give the user the URL list to use themselves.

I am wondering how I could know the number of results. My guess is just to extract that from the first HTML page ("results are xxxxx").
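
If the address really can be calculated like that, I picture building it along these lines (just a sketch of the idea; the parameter list is only what I see in the address bar, and FetchPage stands for a TEAR-style fetch helper like the one sketched above):

// Sketch: build a Google results-page URL from a search phrase and a
// zero-based result offset, the same way it appears in the address bar.
CString BuildGoogleUrl(CString strQuery, int nStart)
{
    strQuery.Replace(_T(' '), _T('+'));   // "fun times" -> "fun+times"
    CString strUrl;
    strUrl.Format(_T("http://www.google.com/search?q=%s&hl=en&start=%d"),
                  (LPCTSTR)strQuery, nStart);
    return strUrl;
}

// Usage idea: page 1 is start=0, page 2 is start=10, and so on.
// CString strPage = FetchPage(BuildGoogleUrl(_T("mp3 software"), 10));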

Any comments, criticism, etc. appreciated.

RJ

0
 
LVL 11

Accepted Solution

by:
KurtVon earned 100 total points
ID: 9920708
Well, webcollage http://gd.tuwien.ac.at/linuxcommand.org/man_pages/webcollage1.html does something similar to this, but in Perl.  Basically you want to write what is called a "scraper" for the Google search engine.  Another Perl program for that: http://search.cpan.org/dist/Scraper/lib/WWW/Scraper/Google.pm (note that they have scrapers for a few other search engines there, too).

Hope this helps.

0
 
LVL 3

Author Comment

by:RJSoft
ID: 9923370
KurtVon;

Thanks.

I would like the webcollage thing. I think it would be kind of fun. I wonder if I could get a plain old executable for that?

Just knowing the term "search scraper" sets me in the right direction. Also I see Google has a few issues with scrapers. Don't know what to think about that. I guess I will have to be in the know about measures taken against scrapers. The author of the second link you posted, who created the Scraper class for Perl, seems to suggest that Google thinks of it as a non-issue. But don't I have the right to open any web page that I see fit to? It is public property, after all. Or is it? I would like to know if I am going to run into legal snags before I start coding.

RJ
0
 
LVL 11

Expert Comment

by:KurtVon
ID: 9923551
Well, Google does own the copyright for their page, and downloading the data and modifying it is creating a derivative work, which would then be copyright infringement.  The whole thing sounds like a legal gray area, though, and I am no lawyer by any long shot.  In the end, as usual, the one with the biggest lawyers probably wins, so unless you work for IBM it might be best to stay on Google's good side.

That said, I bet they get a few thousand hits a second, so unless your software is insanely popular it is unlikely they would even notice the hit.

0
 
LVL 3

Author Comment

by:RJSoft
ID: 9932650
Dang!

Ran into another snag. It seems that querying Google with WinInet instead of a browser causes Google to send back a Forbidden page.

I can manipulate the Google variables to query by editing the address bar, but when I try to do the same with the TEAR (VC++ 6.0) sample app it gives me the Forbidden page explaining Google's policy.

Now I could consider switching search engines. Anyone have any ideas on that?

I am still waiting on a response from a Google group for info on how to implement the Google API in a VC++ Windows-based app.

Running out of options I guess.

RJ

0
 
LVL 11

Expert Comment

by:KurtVon
ID: 9941934
Interesting, since CHttpConnection is how IE connects to the internet too.  I wonder how Google can tell?  It may be the parameters passed to OpenRequest.  Check that the referrer is "http://www.google.com/", since Google would know that must be where the search was entered.  I can't think of anything else offhand that they could use to detect it, unless MS was cooperating.
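
Something along these lines (untested against Google; it only shows where the referrer and agent strings go in the MFC calls, and the agent string is just an example of something browser-like):

// Sketch: an explicit GET through CHttpConnection/CHttpFile so the
// referrer and user-agent can be set.  Error handling is omitted.
#include <afxinet.h>

CString FetchWithReferrer(LPCTSTR pszObject)   // e.g. "/search?q=fun+times"
{
    CString strPage;
    CInternetSession session(_T("Mozilla/4.0 (compatible; MSIE 6.0)"));
    CHttpConnection* pConn = session.GetHttpConnection(_T("www.google.com"));
    CHttpFile* pFile = pConn->OpenRequest(CHttpConnection::HTTP_VERB_GET,
        pszObject,
        _T("http://www.google.com/"));      // referrer: the search form page
    pFile->AddRequestHeaders(_T("Accept: text/html\r\n"));
    pFile->SendRequest();

    DWORD dwStatus = 0;
    pFile->QueryInfoStatusCode(dwStatus);   // 403 would mean still forbidden

    CString strLine;
    while (pFile->ReadString(strLine))
        strPage += strLine + _T("\n");

    pFile->Close();  delete pFile;
    pConn->Close();  delete pConn;
    session.Close();
    return strPage;
}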

Hope this helps.
0
 
LVL 3

Author Comment

by:RJSoft
ID: 9954068
Well, in case you're curious, I found at least one major search engine that does not give back the Forbidden page.

But now, after thinking things through a little further, I am not so sure how I should gather the links.

My application is basically a multimedia viewer with some extra bells and whistles. One of the hot-topic types of files that users may be interested in is the mp3. Free mp3 files to download would be even nicer. Free and legal.

Basically the end result should be a list of URLs to user-requested files.

Problem is how to search for that. The page results in the search engine give back mostly commercial listings. So I most likely would not even get many results, as those files will be restricted. Maybe they could score on some demos.

As for most of the other media, I was hoping to use Google's image listing. That might have knocked out a majority of user requests. But that is wishful thinking.

In regard to obtaining links, my logic (currently untested) will be something like this...
(any comments appreciated here)

Get user desired file and a subject keyword (input)

Example: "nude jpeg"

Begin loop
{

Do query on search engine using a TEAR-like function that also takes StartingPage

Extract html page for links with...
.com
.net
.org
etc......

Basically looking for links. What are most links composed of these days?
What do I do about scripts?

Gather any target files
if(.jpg found ) then add to a FoundLink list. (check for redundancy)

load the .com .net etc... into First page link list (check for redundancy)

Use the .com .net etc. link list for another query (recursive)
Should I set a limit on how deep recursively I should go?

Advance starting page for next query
StartingPage+=10; // Different search engines might be better with diff value

if(StartingPage==MAX)break; //unsure about max. Might trim down for reasonable amount

}end loop

Work done. Discard the .com .net etc links

Write the found links to file for later use.

Clean up link memory


To extract the links, first I look for the .com, .net, etc. and the target files (.jpg, etc.), and I do this by reading the HTML file one char at a time, constantly loading a large storage string. When I find the desired extension, I create another result string by backing up through the storage string one char at a time until I find www and/or http:, cap the end of the result with a '\0', and then reverse the result string.

I wonder if someone has already a class for this type of string manipulation?

RJ

0
 
LVL 11

Expert Comment

by:KurtVon
ID: 9956919
Hmm, wouldn't it be easier to look for the pattern href="..." and take what is between the quotes?  It is possible this could appear in the text or description of a link, but I suppose you could also make sure it is inside an <a> block if you worry about that.

To find them you could use the CString or string class and search for the substring "href=\"".  The returned index is 6 less than the start of the link, and you can search for the next " to find the end of the link.  Both classes have a search function that allows you to specify a starting point.
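
A rough sketch of that loop (it does no real HTML parsing, so it will also pick up hrefs that appear inside scripts and the like):

// Sketch: collect everything between href=" and the next quote.
#include <afx.h>
#include <afxcoll.h>   // CStringArray

void ExtractLinks(const CString& strPage, CStringArray& arrLinks)
{
    const CString strTag = _T("href=\"");
    int nPos = strPage.Find(strTag);
    while (nPos != -1)
    {
        int nStart = nPos + strTag.GetLength();      // first character of the URL
        int nEnd   = strPage.Find(_T('"'), nStart);  // closing quote
        if (nEnd == -1)
            break;
        arrLinks.Add(strPage.Mid(nStart, nEnd - nStart));
        nPos = strPage.Find(strTag, nEnd + 1);       // keep searching after this link
    }
}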

As far as finding legal stuff goes, even some of the people posting it wouldn't always know what is legal and what isn't.  I don't think you'd have much control over that unless you limited the search to a few sites you know carry legal downloads (like the now-defunct mp3.com).

Hope this helps.
0
