WGet and thumbnails

I have incorporated WGet into my application simply by calling it with ShellExecute.

It basically gets files from a list of URLs.

But many of those URLs have thumbnail images. I would like to know how to keep WGet from downloading them.

Another point: I have used WGet so far, but I also have the option of using a Perl-style arrangement to FTP files from the URLs.

Although I do not know if Perl provides something similar to WGet, in that WGet recursively scrapes to a level that you can set: it parses the HTML files, then reparses the links it finds, continuing until the level has been met.

On one hand I would not like to have to write a WGet clone. On the other, I imagine Perl has a module like this. If WGet itself has the capability, that would be the simplest, but I have not found it in the WGet documentation, although that is a bit disorganized, so I may have missed it.

Any comments appreciated.
This is just a small update.
I didn't find anything about size options in wget, but I did read this in the TODO file for wget 1.8.2:

* Allow size limit to files (perhaps with an option to download oversize files
  up through the limit or not at all, to get more functionality than [u]limit.

So a size option is probably not yet available, and to me this comment reads as the opposite of what you need, as it seems the intention is to put an _upper_ limit on sizes, and you need a _lower_ limit.

Let me ask you again: why are those thumbnails a problem?
You can use LWP::Simple's getstore function combined with HTML::LinkExtor to do this.

RJSoftAuthor Commented:
I can see where the LinkExtor applies to get the pages and parse them.
But I do not understand the Simple part.

I already have a list of URLs that are provided for me from a search engine.
Does LWP::Simple control the actual downloading? I could not find example code.

Any chance you could write pseudo code or real example of how I could do this?


Hi RJSoft,

WGet does have filtering options that you can use for recursive downloads. The options you're looking for are -R (or --reject) and -G (--ignore-tags). They both accept a comma-separated list of extensions or tags.
Specifically, I think
    wget -r -l <your_level> -G IMG <your_url>
will go a long way towards your requirements.

Here's the complete list of options "wget --help" gives me that pertain to recursive retrieval:

Recursive retrieval:
  -r,  --recursive          recursive web-suck -- use with care!
  -l,  --level=NUMBER       maximum recursion depth (inf or 0 for infinite).
       --delete-after       delete files locally after downloading them.
  -k,  --convert-links      convert non-relative links to relative.
  -K,  --backup-converted   before converting file X, back up as X.orig.
  -m,  --mirror             shortcut option equivalent to -r -N -l inf -nr.
  -p,  --page-requisites    get all images, etc. needed to display HTML page.

Recursive accept/reject:
  -A,  --accept=LIST                comma-separated list of accepted extensions.
  -R,  --reject=LIST                comma-separated list of rejected extensions.
  -D,  --domains=LIST               comma-separated list of accepted domains.
       --exclude-domains=LIST       comma-separated list of rejected domains.
       --follow-ftp                 follow FTP links from HTML documents.
       --follow-tags=LIST           comma-separated list of followed HTML tags.
  -G,  --ignore-tags=LIST           comma-separated list of ignored HTML tags.
  -H,  --span-hosts                 go to foreign hosts when recursive.
  -L,  --relative                   follow relative links only.
  -I,  --include-directories=LIST   list of allowed directories.
  -X,  --exclude-directories=LIST   list of excluded directories.
  -np, --no-parent                  don't ascend to the parent directory.
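If the thumbnails follow a naming convention, the -R list can catch them by filename as well. A minimal sketch; the reject patterns and URL are made up, and the command is echoed rather than executed so it can be shown self-contained:

```shell
# Reject patterns that commonly mark thumbnails -- pure guesses; adjust
# them to the naming scheme the target site actually uses.
REJECT='thumb*,*_tn.jpg,*_small.jpg'
LEVEL=5
URL='http://example.com/gallery/'   # hypothetical starting page

# Build the wget invocation and echo it instead of executing, so the
# sketch runs without network access; drop the echo to really download.
CMD="wget -r -l $LEVEL -R \"$REJECT\" $URL"
echo "$CMD"
```

Whether those patterns actually match your thumbnails depends entirely on the sites you scrape.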

RJSoftAuthor Commented:
Hello Kandura;

Appreciate the list.

But I notice there is no reference to file size. You see, I am trying to avoid downloading thumbnails.

If -G, the ignore list of HTML tags, could detect a thumbnail image, then that would work.

Is there a specific tag an image has for a thumbnail? If so, how would I provide this as a parameter to WGet: -G "THUMB_NAIL"?

What's the difference between a thumbnail and a regular image? There isn't any from an html perspective, they're both referenced with an IMG tag.

Can you specify what would differentiate thumbnails from regular images?

It would be nice if your thumbnails were stored in a different directory, so you could use the -X option, or if they had different extensions, so you could use the -R option.

If you're not sure, could you show some html snippets of images that you don't want, and some of images you do want?
RJSoftAuthor Commented:
Well, I was hoping there was some size-detecting attribute, so any file under so many K could be rejected.

I get a list of URLs from a Perl script/API that I use to query the search engine, so there is no way to know ahead of time what files will be downloaded. WGet is simply used by my application to scrape those URLs for files.

When I provide a URL to WGet requesting a file type, say JPG, it grabs a lot of thumbnail images, which are not desirable.

My application integrates Perl functionality with a Windows C++ based app, so I could easily determine file size after downloading. I was hoping that I could detect the file size while using WGet and have it reject files under an estimated size, a range most likely to be thumbnails.
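The size check described here could be sketched like this. The 10000-byte cutoff is a guess, and the curl HEAD trick in the comment is a suggested alternative (curl can read Content-Length before downloading), not something WGet provides:

```shell
# Decide whether a file is probably a thumbnail by size alone.  The
# 10000-byte threshold is an assumption; tune it for your sources.
MIN_BYTES=10000

is_thumbnail() {
    # $1 = file size in bytes; true when it falls under the threshold
    [ "$1" -lt "$MIN_BYTES" ]
}

# Before downloading, the size can often be read from a HEAD request,
# e.g. (hypothetical URL):
#   curl -sI http://example.com/pic.jpg | grep -i '^content-length:'

is_thumbnail 4096   && echo "4096 bytes: skip"
is_thumbnail 250000 || echo "250000 bytes: keep"
```

Note that servers are not obliged to send Content-Length, so a post-download check is the only fully reliable option.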

Appreciate your comments though.
RJSoftAuthor Commented:
RJ2, do you have any example code? Or could you explain with pseudo code?

I'd like you to explain one thing before I let you go ahead with RJ2 to think up a nice Perl solution. I do think that Perl ultimately allows for a more flexible and precise solution, but it will take a lot of work before you reach the reliability and feature set of wget, so I'd like to see if we can somehow make wget meet your requirements. In order to do that, you need to be more precise about your exact intentions and requirements:

- what is it exactly that you are trying to download, and for what purpose?
- what is it that makes you want to skip those so-called "thumbnails"?
- you want wget to do recursive downloads, which implies you're mainly downloading html, is that correct?
- on the other hand, you mention you want to retrieve a jpg file, and that sort of goes against the recursiveness: only html files will have references to other documents
- so you seem to want to download certain images, but not other images. What do you consider the difference between the two groups?
- rejecting files based on size would carry the risk of missing out on smaller html files, or on style sheets, javascript files, etc.
- if you're going to download recursively anyway, how big of a burden are those relatively speaking?

I hope you agree that making wget do more or less what you want will be the quickest and most reliable solution. Please clarify the above points, and I'll see what I can do to help you reach that point. On the other hand, if you decide you're going to go with a perl solution from scratch, I'll leave you in the capable hands of RJ2. Not that I wouldn't want to assist in creating such a script, but I'm going to leave The Old World for Los Angeles next week, for a week, so I'm a little strapped for time ;^)

One hint on the perl path: searching CPAN for "mirror" will give you links to lwp-mirror and w3mir (among others). Especially w3mir seems a nice starting point.
Hmmm... On re-reading that, the "one thing" from the first sentence seems to be a bit of an understatement. Sorry about that ;^)
RJSoftAuthor Commented:
My application currently uses WGet to download specific files that I request by modifying the parameters to it. To obtain the file type I want, I use -A followed by the three-letter extensions I wish to download; in the above case that would be jpg.

You could say my application is just a GUI interface to WGet.

WGet temporarily downloads the HTML files to parse them, looking for matches from the -A (accept) list; it deletes the HTML files after parsing them to obtain other links. My application also controls the recursion level and the retry count.

I was hoping that maybe I had missed a flag somewhere that would allow me to specify a minimum size. Only hoping, that is.

If you run across the solution let me know.
Your opinion is much appreciated.

RJSoftAuthor Commented:
Thumbnails are too small to enjoy?

About the note from the WGet author: perhaps I could request that he make the [u]limit handling flexible enough to stop undersized files (thumbnails).

Thanks for your help.
Ok, so you're just bothered by them, is that it? I mean, it's not a matter of them taking too much time to download?
If that's the case, then why not simply delete them? If you're on linux or similar (and I hope so ;^), you could do a "find . -size -5k" to find them (the -5k means "files less than 5 KB"). To delete them as well, you'd do "find . -size -5k -exec rm {} \;"
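To make that concrete, here is a self-contained sketch; the files and their sizes are fabricated stand-ins for a downloaded tree, and 5 KB as the cutoff is the same guess as in the find command above:

```shell
# Simulate a download directory with one small "thumbnail" and one
# larger "real" image (sizes are made up for the demo).
DIR=downloads_demo
mkdir -p "$DIR"
head -c 2048  /dev/zero > "$DIR/thumb.jpg"   # 2 KB -> under the cutoff
head -c 20480 /dev/zero > "$DIR/photo.jpg"   # 20 KB -> survives

# Delete everything under 5 KB, as in the find example above.
find "$DIR" -type f -size -5k -exec rm {} \;
ls "$DIR"
```

Running this leaves only photo.jpg in the directory.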
RJSoftAuthor Commented:
Yes that's it.

Sorry about not being able to grant you the points, as there seems to be no viable answer using WGet. I wish I could do this with WGet.

But if you could answer this question, then perhaps I am still in business.

Is there a way to obtain a WGet library/DLL?

I am looking for a way to display in my dialogs the progress and messages that WGet normally displays, except inside my application. That way I could also monitor size.

My app is Windows based, so I could just use _findfirst with the complete filename and path; the _finddata_t structure will tell me the size.

Also is 5k the average size of a thumbnail?

Thanks in advance.
RJSoftAuthor Commented:
Yes. And that could possibly be the Perl links from rj2, although I get no follow-up replies.

But I would like to reward you with points anyway, because I know you put a sincere effort on my behalf.

Is it possible for you to send me a link to where you found the follow-up?

>>This is just a small update.
>>I didn't find anything about size options in wget, but I did read this in the TODO file for wget >>1.8.2:
>>* Allow size limit to files (perhaps with an option to download oversize files
>>  up through the limit or not at all, to get more functionality than [u]limit.

That way I can at least split points for you to have.

This was the url google found for me: http://www.hupo.org.cn/docs/linuxdoc/wget-1.8.2/TODO
In case you missed my comment on this earlier, I'll repeat it here.

>> One hint on the perl path: searching CPAN for "mirror" will give you links to lwp-mirror and w3mir (among others). Especially w3mir seems a nice starting point.

If you do decide to write your own wget implementation, these scripts may give you a head start.
RJSoftAuthor Commented:
RJ2. I am still hoping to hear from you. Or anyone else familiar with this issue.

About the LinkExtor example code: I am wondering if you could please respond to the comments below.

Example code from LinkExtor:

  use LWP::UserAgent;
  use HTML::LinkExtor;
  use URI::URL;

  $url = "http://www.perl.org/";  # for instance
  $ua = LWP::UserAgent->new;

  # Set up a callback that collect image links
  my @imgs = ();
  sub callback {
     my($tag, %attr) = @_;
     return if $tag ne 'img';  # we only look closer at <img ...>
     push(@imgs, values %attr);
  }

  # Make the parser.  Unfortunately, we don't know the base yet
  # (it might be different from $url)
  $p = HTML::LinkExtor->new(\&callback);

  # Request document and parse it as it arrives
  $res = $ua->request(HTTP::Request->new(GET => $url),
                      sub {$p->parse($_[0])});


I see the call to GET here. Could this be enough to replace the need for LWP::Simple's getstore?

But since I want to duplicate WGet functionality, I am wondering about doing a recursive search here for other links to web pages. Or should I be using getstore to gather URLs (by extracting links to create other URLs) and then pass all the extracted URLs, one at a time, to a function like this LinkExtor callback?

I have to be able to extract links to the user-requested depth; the default is 5. So my guess is that I should use getstore and LinkExtor to first build up the list of web pages to scrape, then use LinkExtor to extract the files.

Also, I notice that this code example uses the img tag. My application intends to download files of any extension type, so I am wondering if you could give me a few pointers on how to query for file type, perhaps based on the attributes. I also intend to see if the attributes can contain info about the file size.

Should I just simply use the SRC?
return if $tag ne 'src';  # use source tag. But does this really give file location??
push(@imgs, values %attr);

To top all that off, I need to do this with embedded Perl statements. I have seen library modules at one point that would allow me to compile Perl into my application. But as it stands now, all I have accomplished toward integrating Perl into my app is to take a script written in Perl, transform it into an exe using Perl2Exe, and then call that exe with ShellExecute from my Windows-based app.

Although I have also read an article about using pipes between applications, I doubt that this approach could be used to set members of a dialog. But I do not know.

I believe what I want is to embed Perl statements so I can create visual feedback in my application dialog. Otherwise I basically get the same quality as WGet: an executable that runs in the background, either hidden or annoyingly in front of my app. Currently part of my app serves as a GUI interface to pass parameters and execute WGet.

What I want is feedback on progress. I intend to use SetWindowText on a CStatic object owned by the dialog on the Windows side. If I have to, I may even implement a timer on the Windows side to run a Perl statement that produces some kind of status, then display it with SetWindowText.

But I am open to suggestion.

Also, I am a bit concerned about the libraries I remember seeing for embedding Perl statements. They did not strike me as Windows friendly. What I am trying to say is that I do not know whether the libraries were just for C-style applications and not Windows message-based systems, or if it even matters.

All comments are appreciated.



  # Expand all image URLs to absolute ones
  my $base = $res->base;
  @imgs = map { $_ = url($_, $base)->abs; } @imgs;

  # Print them out
  print join("\n", @imgs), "\n";

Thanks in advance
Using libcurl seems like the way to go.
RJSoftAuthor Commented:

Do you know of example code I could see to help guide me?
I see the example from libcurl, and it's hard for me to interpret.

How would I go about using this if I had a text file with URLs, one URL per line?

Take a look at WWW::Curl::Lite. It can't be any simpler; there's even a one-liner example.
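On the URL-list question specifically: wget accepts such a file directly with `-i urls.txt`, and a shell loop gives per-URL control (size checks, logging). A sketch; the URLs are placeholders, and the echo stands in for the real fetch so it runs offline:

```shell
# A made-up URL list, one per line.
cat > urls.txt <<'EOF'
http://example.com/a.jpg
http://example.com/b.jpg
EOF

# Read it line by line; replace the echo with e.g. `curl -O "$url"`,
# or skip the loop entirely and run: wget -i urls.txt
while IFS= read -r url; do
    [ -n "$url" ] || continue       # skip blank lines
    echo "would fetch: $url"
done < urls.txt
```

The loop form is the one that lets you bolt on a size check per file, which wget's -i cannot do.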
RJSoftAuthor Commented:

Thanks for the link.

Basically I am new to Perl. I have only used Perl in a few of my applications so far, but I have a book (Beginning Perl by Simon Cozens) which explains things in good detail. Still, I have problems.

For starters, I am unsure how to install the module/package, and which module/package to use. I believe what I would like to use is just the WWW::Curl::Lite module.

But is WWW::Curl::Lite part of WWW::Curl?

This would mean that I should download the whole WWW::Curl package, as it contains the Lite part. Also, maybe Lite depends upon components of the parent WWW::Curl package. I don't know...

Next, should I download the gz file that you have given me the link to and use WinZip to unzip it? But I don't know how to tell PPM to use the unzipped files. Should the files be unzipped into a specific folder in order to have PPM install them into the Perl system?

I know what to do might be clear to you, but I ask because Perl has so many options; I get confused over what I am supposed to do.

Should I download the gz, unzip it, and then use PPM (Perl Package Manager) to install the module so it is available as a library to my Perl code? I have ActivePerl (Windows 98).

Or should I try to use cpan? I see that cpan can do this for me, and it looks easy in the book, but it asks me way too many questions that I do not understand when it first initializes.

Also, it gave me a bunch of URLs to choose from. But evidently I chose the wrong URL, because when I type from the cpan prompt

cpan> install WWW::Curl

it fails to find the module.

From the d:\Perl\Cpan Config.pm I have the line...

'urllist' => [q[ftp://ftp.ou.edu/mirrors/CPAN/]],

Here is the whole Config.pm file.
Maybe I could edit a setting to get cpan to work, so it would automatically download and install whatever module I choose.

But I really don't care whether I get the cpan installer to work or not. All I really want is to get the module installed.

Thanks in advance

# This is CPAN.pm's systemwide configuration file. This file provides
# defaults for users, and the values can be changed in a per-user
# configuration file. The user-config file is being looked for as
# ~/.cpan/CPAN/MyConfig.pm.

$CPAN::Config = {
  'build_cache' => q[10],
  'build_dir' => q[\.cpan\build],
  'cache_metadata' => q[1],
  'cpan_home' => q[\.cpan],
  'dontload_hash' => {  },
  'ftp' => q[C:\WINDOWS\ftp.exe],
  'ftp_proxy' => q[],
  'getcwd' => q[cwd],
  'gpg' => q[],
  'gzip' => q[],
  'histfile' => q[\.cpan\histfile],
  'histsize' => q[100],
  'http_proxy' => q[],
  'inactivity_timeout' => q[0],
  'index_expire' => q[1],
  'inhibit_startup_message' => q[0],
  'keep_source_where' => q[\.cpan\sources],
  'lynx' => q[],
  'make' => q[],
  'make_arg' => q[],
  'make_install_arg' => q[],
  'makepl_arg' => q[],
  'ncftp' => q[],
  'ncftpget' => q[],
  'no_proxy' => q[],
  'pager' => q[C:\WINDOWS\COMMAND\more.com],
  'prerequisites_policy' => q[ask],
  'scan_cache' => q[atstart],
  'shell' => q[],
  'tar' => q[],
  'term_is_latin' => q[1],
  'unzip' => q[],
  'urllist' => [q[ftp://ftp.ou.edu/mirrors/CPAN/]],
  'wget' => q[],
};

You shouldn't use CPAN for this.
"Note that you need my modified version of WWWCurl (>= 3.0) to use WWWCurlLite!"
This isn't the same as the CPAN WWW::Curl

Here's an even simpler non-Perl solution:
Use HTTrack website downloader. It has a very easy to use interface.
If you need it to be completely automated, you can execute HTTrack from the command line. You can give it a list of URLs directly on the command line, and you can specify files you want to exclude.
RJSoftAuthor Commented:
OK, I will try it. But I still wonder how to do a regular install by downloading the gz file: unzip with WinZip and then take those files and install them?

For this to work you'll also need a proper build environment: a make utility, a C compiler, all the required headers, libcurl, etc.
This is out of the scope of this question, and probably not worth the effort. Also, some build tools might not work correctly on Windows 98 (isn't it about time you upgraded?).

To get your build environment, you'll need to follow all the instructions in:

And then see the "Building Extensions" section in the above link. In short:
* perl Makefile.PL
* make
* make test
* make install

RJSoftAuthor Commented:

Thanks for your replies. I appreciate your patience with me, as I am a newbie to Perl.

Yes, I have the XP upgrade, but I am looking to get another drive first before I install it. I see from the last link you posted that I will get errors due to an inferior shell, so I may do this right away.

But let me make sure we understand each other. Please remember I am new to Perl.

I take it that your version of Curl::Lite is strictly for the Unix environment, and that the libcurl link you first sent me has modules that may not work with ActivePerl. So I need to build an ActivePerl version of your Curl::Lite. Is this true?

I thought that I could simply download the gz file from the link, unzip it with WinZip, and install it somewhere where I could use PPM (Perl Package Manager) to install the modules into my Perl, so that I could then write a Perl app that uses the WWW::Curl::Lite library/modules.

I don't know if we are miscommunicating over the issue of my wanting to use your modules in my Win32-based application.

All I want for now is to build a Perl application that scrapes the URLs I give it for the file types I want, using your WWW::Curl::Lite modules (or is it a package?).

I may have the input to this Perl application be a file with a list of URLs for it to parse. I would like to add recursive parsing so that when it finds links, I could tell it to scrape those also. But I realize I first need to get my feet wet and become familiar with your modules.

Currently my Win32-based application uses WGet to do this, but I would like to replace it with my own Perl version so that I can add file size detection.

Later on down the road, I plan to take that code and integrate it into my Win32 app. If I cannot do that, I know that I can at least pipe the standard output and standard error messages to a window belonging to my application.

I want to control where and how the visual feedback is displayed, so having the code integrated, or the results piped into my application, will help. But there is currently no control over file size selection using WGet.

I am taking the time to write this to make sure we are both talking about the same thing here.

I guess the basic question is: do I need to build an ActivePerl version of your Perl code, or can I simply install it into my Perl directory so that I can create a Perl application that uses it?


Let's summarize:

Option 1: Use some command line downloader
HTTrack - http://www.httrack.com/
curl - http://curl.haxx.se/
WGet - http://www.gnu.org/software/wget/wget.html
Each of these programs is quite powerful and has an extensive command line. You can input all the URLs from a file directly, and you can use exclusions. Even if you download a thumbnail image, you can always delete it later. BTW: you can't depend on the file size to determine the image dimensions, because compression rates vary.

Option 2: Use some other Perl module on http://search.cpan.org/
Maybe HTTP::GetImages can do the trick. It can be easily installed using "ppm install HTTP::GetImages".

Option 3: Use WWW::Curl::easy.
You'll need to build it. Building it won't be straightforward and might require an OS upgrade.
RJSoftAuthor Commented:
My application scrapes any type of file the user requests. So even though HTTP::GetImages is tempting, what I have now is that my application serves as an interface to WGet, so I am capable of downloading more than one media type: JPG, MPEG, MP3, WAV, HTML, etc.

So is Curl::Lite limited in what types of files it will download?
I am having trouble finding documentation.

RJSoftAuthor Commented:
Thanks much for your replies Shaic and kandura
RJSoftAuthor Commented:
About the point spread: I tried to spread it as evenly as possible.
rj2's answer of combining LWP::Simple's getstore and HTML::LinkExtor is probably the definitive answer.
Not sure yet.

