Avatar of RJSoft
RJSoft

asked on

Query a search engine with WinInet.dll or other?

Hello all;

I am trying to add some functionality to my app. I would like to search the internet for various kinds of files, like an app that collects JPEGs, etc.

In my VC++ 6.0 I found the TEAR example app. This is almost what I want. Below is some of their description of the TEAR app, which uses WININET.DLL.

The TEAR sample shows how to write an MFC console application that uses WININET.DLL to communicate with the Internet. The sample shows how to form an HTTP request using CHttpFile against CHttpConnection and CInternetSession objects.
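
For reference, the core of that approach boils down to something like this minimal sketch (fetch a page as text through CInternetSession::OpenURL; the agent string and error handling here are placeholders, not the sample's actual code):

#include <afxinet.h>

// Fetch a URL and return the page body as one big string.
CString FetchPage(LPCTSTR pszUrl)
{
    CString strPage;
    CInternetSession session(_T("MyApp/1.0"));   // placeholder agent name
    try
    {
        CStdioFile* pFile = session.OpenURL(pszUrl, 1,
            INTERNET_FLAG_TRANSFER_ASCII | INTERNET_FLAG_RELOAD);
        CString strLine;
        while (pFile->ReadString(strLine))
            strPage += strLine + _T("\n");
        pFile->Close();
        delete pFile;
    }
    catch (CInternetException* pEx)
    {
        pEx->Delete();   // fetch failed; return whatever was read
    }
    session.Close();
    return strPage;
}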

I know I could create, from the sample, some recursive function that keeps gathering links, checking those pages, gathering more links, etc.

But what I am wondering is: is there a way to query a search engine and get all of the listings (URLs), and then use TEAR-like functionality to find images, etc.?

RJ
(*always grant A for working answer or link)
Avatar of KurtVon
KurtVon

How about the API for google?  http://www.google.com/apis/api_faq.html

Hope this helps.
Avatar of RJSoft

ASKER

KurtVon.

I checked it out. Problem is they don't have anything for VC++ 6.0. There is also some problem with licensing it as a commercial product. But I could make a workaround where users of my app would need to register with Google to get their limited 1,000 searches per day. Also, I went to the newsgroups and wrote all my questions out.

The Google API works with .NET and C#, neither of which I know. So I also wrote asking if anyone knew of a patch that I could download, but I doubt that. Also, there is some problem with search results coming back only 10 at a time. I don't know if I could do a workaround for that too. How do you restart a search on the same subject and get the next 10? (until 1,000 is reached)

I have seen other applications (SimpleFind, for one) where a regular Windows-based app would query a bunch of different search engines and fill a list box with a short description and a URL.

I am wondering if I could not simply use wininet.dll to query a search engine (preferably Google) by knowing the different variables that get produced in the browser's address bar and simulating that when opening a page. Simulating a search by calculating the search results' address.

Any clues appreciated.

RJ

Avatar of RJSoft

ASKER

Here's another consideration.

I was thinking about adjusting the address.

So I did a search on Google for "fun times". I got the first page, then I hit Next and got this address:

http://www.google.com/search?q=fun+times&hl=en&lr=&ie=UTF-8&safe=off&start=100&sa=N

The reason I chose to manipulate the address from the Next page is that I notice that if I put &start=1, the search goes to the first page of results.

If I put &start=100 it goes to the 11th page (which is proper).

I also notice that if I replace the text fun+times with something like mp3+software, it works fine.

Could things be this easy? I know I can use wininet.dll to obtain the web page, then extract the links. Later I can use the links and extract the files, or give the user the URL list to do it themselves.

I am wondering how I could know the number of results. My guess is just to extract that from the first HTML page ("results are xxxxx").
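
For illustration, here is a rough sketch of building that address in code. The q and start parameter names come straight from the URL above; the space-to-plus substitution is an assumption, and the other parameters (hl, ie, safe, etc.) are left off to keep it short, so this would need testing:

// Build a Google-style result-page URL for a given start offset.
CString BuildSearchUrl(const CString& strTerms, int nStart)
{
    CString strQuery(strTerms);
    strQuery.Replace(_T(' '), _T('+'));          // "fun times" -> "fun+times"

    CString strUrl;
    strUrl.Format(_T("http://www.google.com/search?q=%s&start=%d"),
                  (LPCTSTR)strQuery, nStart);
    return strUrl;
}

Then the loop would just call BuildSearchUrl(_T("mp3 software"), 0), then 10, 20, and so on, fetching each page with the wininet code and scanning it for the results count and the links.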

Any comments, criticism, etc. appreciated.

RJ





ASKER CERTIFIED SOLUTION
Avatar of KurtVon
KurtVon

Avatar of RJSoft

ASKER

KurtVon;

Thanks.

I would like the webcollage thing. Think it would be kind of fun. I wonder if I could get the plain old executable for that?

Just knowing the term "search scraper" sets me in the right direction. Also, I see Google has a few issues with scrapers. Don't know what to think about that. I guess I will have to be in the know about measures taken against scrapers. The author of the second link you posted, who created the Scraper class for Perl, seems to suggest that Google thinks of it as a non-issue. But don't I have the right to open any web page that I see fit to? It is public property, after all. Or is it? I would like to know if I am going to run into legal snags before I start coding.

RJ
Well, Google does own the copyright for their page, and downloading the data and modifying it is creating a derivative work, which would then be copyright infringement.  The whole thing sounds like a legal gray area though, and I am no lawyer by a long shot.  In the end, as usual, the one with the biggest lawyers probably wins, so unless you work for IBM it might be best to stay on Google's good side.

That said, I bet they get a few thousand hits a second, so unless your software is insanely popular it is unlikely they would even notice the hit.

Avatar of RJSoft

ASKER

Dang!

Ran into another snag. It seems that querying Google with wininet instead of a browser causes Google to send back a forbidden page.

I can manipulate the Google variables to query by editing the address bar, but when I try to do the same with the TEAR (VC++ 6.0) sample app, it gives me the forbidden page explaining Google's policy.

Now I could consider switching search engines. Anyone have any ideas on that?

I am still waiting on a response from a Google group for info on how to implement the Google API in a VC++ Windows-based app.

Running out of options I guess.

RJ

Interesting, since CHttpConnection is how IE connects to the internet too.  I wonder how Google can tell?  It may be the parameters passed to OpenRequest.  Check that the referrer is "http://www.google.com/", since Google would know that must be where the search was entered.  I can't think of anything else offhand that they could use to detect it, unless MS was cooperating.
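
Something like this is what I mean: give the session a browser-style agent string and pass the referrer through OpenRequest, then check the status code before reading. Whether that is really what Google keys on is just a guess on my part, so you would have to test it.

#include <afxinet.h>

// Fetch /search?... from www.google.com while looking more like a browser.
CString FetchWithReferer(LPCTSTR pszObject)   // e.g. "/search?q=fun+times&start=0"
{
    CString strPage;
    // Agent string borrowed from IE; any browser-like string would do.
    CInternetSession session(_T("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"));
    CHttpConnection* pConn = NULL;
    CHttpFile* pFile = NULL;
    try
    {
        pConn = session.GetHttpConnection(_T("www.google.com"));
        pFile = pConn->OpenRequest(CHttpConnection::HTTP_VERB_GET,
                                   pszObject,
                                   _T("http://www.google.com/"));   // referrer
        pFile->SendRequest();

        DWORD dwStatus = 0;
        pFile->QueryInfoStatusCode(dwStatus);
        if (dwStatus == HTTP_STATUS_OK)
        {
            CString strLine;
            while (pFile->ReadString(strLine))
                strPage += strLine + _T("\n");
        }
    }
    catch (CInternetException* pEx)
    {
        pEx->Delete();
    }
    if (pFile) { pFile->Close(); delete pFile; }
    if (pConn) { pConn->Close(); delete pConn; }
    session.Close();
    return strPage;
}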

Hope this helps.
Avatar of RJSoft

ASKER

Well, in case you're curious, I found at least one major search engine that does not give back the forbidden page.

But now, after thinking things through a little further, I am not so sure how I should gather the links.

My application is basically a multimedia viewer with some extra bells and whistles. One of the hot types of files the users may be interested in is the MP3. Free MP3 files to download would be even nicer. Free and legal.

Basically the end result should be a list of urls to user requested files.

Problem is how to search for that. The results pages from the search engine give back mostly commercial listings. So I most likely would not get many results, as those files will be restricted. Maybe they could score some demos.

As for most of the other media, I was hoping to use Google's image listings. That might have knocked out a majority of user requests. But that is wishful thinking.

In regard to obtaining links, my logic (currently untested) will be something like this (a rough C++ sketch follows the outline below)...
(any comments appreciated here)

Get user desired file and a subject keyword (input)

Example: "nude jpeg"

Begin loop
{

Do a query on the search engine using a TEAR-like function that also handles StartingPage

Extract html page for links with...
.com
.net
.org
etc......

Basically looking for links. What are most links composed of these days?
What do I do about scripts?

Gather any target files
if(.jpg found ) then add to a FoundLink list. (check for redundancy)

load the .com .net etc... into First page link list (check for redundancy)

Use .com .net etc. link list for another query (recursive)
Should I set a limit to how deep I should go recursively?

Advance starting page for next query
StartingPage += 10; // Different search engines might be better with a different value

if(StartingPage==MAX)break; //unsure about max. Might trim down for reasonable amount

}end loop

Work done. Discard the .com .net etc links

Write the found links to file for later use.

Clean up link memory
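
Here is a rough C++ skeleton of that plan, with a depth limit so the recursion can't run away. FetchPage and ExtractLinks are assumed helpers along the lines discussed in this thread (std::string versions of a page fetch and an href scan), and the limits are placeholders:

#include <set>
#include <string>
#include <vector>

std::string FetchPage(const std::string& url);                   // assumed helper
std::vector<std::string> ExtractLinks(const std::string& html);  // assumed helper

// True if s ends with the given suffix (e.g. ".jpg").
bool EndsWith(const std::string& s, const std::string& suffix)
{
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

void Crawl(const std::string& url, int depth,
           std::set<std::string>& visited,          // redundancy check
           std::vector<std::string>& foundFiles)    // target files found so far
{
    const int MAX_DEPTH = 2;                        // arbitrary recursion limit
    if (depth > MAX_DEPTH || visited.count(url))
        return;
    visited.insert(url);

    std::vector<std::string> links = ExtractLinks(FetchPage(url));
    for (size_t i = 0; i < links.size(); ++i)
    {
        if (EndsWith(links[i], ".jpg"))             // target file: keep it
            foundFiles.push_back(links[i]);
        else                                        // ordinary page: recurse
            Crawl(links[i], depth + 1, visited, foundFiles);
    }
}

The outer loop would call Crawl once per search-result page (start = 0, 10, 20, ...), then write foundFiles out at the end.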


To extract the link, first I look for the .com, .net, etc. and the target files (.jpg etc.), and I do this by reading the HTML file one char at a time, constantly loading a large storage string. When I find the desired extension, I create a result string by backing up in the storage string one char at a time until I find www and/or http:, cap the end of the result with a '\0', then reverse the result string.

I wonder if someone already has a class for this type of string manipulation?

RJ

Hmm, wouldn't it be easier to look for the pattern href="..." and take what is between the quotes?  It is possible this could appear in the text or description of a link, but I suppose you could also make sure it is inside an <a> block if you worry about that.

To find them you could use the CString or string class and search for the substring "href=\"".  The returned index is 6 less than the start of the link, and you can search for the next " to find the end of the link.  Both classes have a search function that allows you to specify a starting point.
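
Something along these lines with the string class, for example (the marker is lowercase here; a case-insensitive scan would also catch HREF=):

#include <string>
#include <vector>

// Collect everything that appears between href=" and the next quote.
std::vector<std::string> ExtractLinks(const std::string& html)
{
    std::vector<std::string> links;
    const std::string marker = "href=\"";

    std::string::size_type pos = html.find(marker);
    while (pos != std::string::npos)
    {
        std::string::size_type start = pos + marker.size();   // 6 past the match
        std::string::size_type end = html.find('"', start);
        if (end == std::string::npos)
            break;                                             // unterminated attribute
        links.push_back(html.substr(start, end - start));
        pos = html.find(marker, end + 1);
    }
    return links;
}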

As far as finding legal stuff, even some of the people posting it wouldn't know what is legal and what isn't some of the time.  I don't think you'd have much control over that unless you limited the search to a few sites you know carry legal downloads (like the now-defunct mp3.com).

Hope this helps.