Solved

Wget: downloading URLs matching a regular expression

Posted on 2008-10-10
Last Modified: 2013-12-20
Hi,
I want to download URLs recursively,
starting from http://code.google.com/apis/maps/,
but I want to download only those URLs that
match this pattern:
http://code.google.com/apis/maps/*

I tried wget -r -D http://code.google.com/apis/maps/ http://code.google.com/apis/maps/
but it downloads only index.html and stops.

I tried a few other options, but they didn't work as intended either.
Question by:dtivmk
12 Comments
 
LVL 10

Expert Comment

by:kukno
ID: 22690699
Hi,

there is an option "-I" or "--include-directories".

From the man page: http://linux.die.net/man/1/wget

-I list
--include-directories=list
    Specify a comma-separated list of directories you wish to follow when downloading. Elements of list may contain wildcards.

Sample: wget --include-directories *test*,*test2* -r http://www....
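
Applied to the URL in the question, the option might be used as sketched here. This is a minimal sketch, assuming GNU wget and a POSIX shell (for example under Cygwin); the function name `mirror_subtree` and the `DRY_RUN` switch are illustrative and not part of wget. Note that `-I` expects server paths such as `/apis/maps`, while `-D` expects a list of domain names rather than a full URL.

```shell
#!/bin/sh
# Sketch: limit a recursive wget crawl to one directory tree.
# mirror_subtree and DRY_RUN are hypothetical helpers, not wget features.
mirror_subtree() {
  url=$1   # start URL of the crawl
  dir=$2   # server path that followed links must stay under
  # --recursive            follow links found in downloaded pages
  # --no-parent            never ascend above the start directory
  # --include-directories  follow only links under the given path(s)
  ${DRY_RUN:+echo} wget --recursive --no-parent \
    --include-directories="$dir" "$url"
}

# Dry run: print the command for review instead of executing it.
DRY_RUN=1
mirror_subtree "http://code.google.com/apis/maps/" "/apis/maps"
```

Unset `DRY_RUN` to run the download for real. If the directory list contains wildcards, quote it so the shell does not expand them before wget sees them.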

Regards
Kurt
 
LVL 10

Expert Comment

by:TOPIO
ID: 22690716
If wget is not user-friendly enough, you can try HTTrack:
http://www.httrack.com/
It does the same job but with a more user-friendly interface.
 
LVL 1

Author Comment

by:dtivmk
ID: 22693593
Hi Topio,
I want to download all URLs matching this pattern: http://code.google.com/apis/maps/documentation/flash/
I used the following options:

URL -> http://code.google.com/apis/maps/documentation/
Set Options -> Scan Rules -> Include Links -> Criterion -> Folder names containing : String: flash
Limits -> Max mirroring depth : 5
Limits -> Max external depth : 3

I got the following error:


---------------------------
WinHTTrack Website Copier
---------------------------
* * MIRROR ERROR! * *

HTTrack has detected that the current mirror is empty. If it was an update, the previous mirror has been restored.

Reason: the first page(s) either could not be found, or a connection problem occured.

=> Ensure that the website still exists, and/or check your proxy settings! <=
---------------------------
OK  
---------------------------
 
LVL 1

Author Comment

by:dtivmk
ID: 22693597
Hi kukno,
I tried this:
wget --include-directories=flash -r http://code.google.com/apis/maps/documention/
in order to download all the URLs matching http://code.google.com/apis/maps/documention/flash/,
but only `code.google.com/apis/maps/documentation/index.html' was downloaded.
 
LVL 10

Expert Comment

by:kukno
ID: 22693768
Can you please post a real-world sample? The link on Google does not contain anything:
"The requested URL /apis/maps/documention/ was not found on this server."
 
LVL 1

Author Comment

by:dtivmk
ID: 22693780
0
 
LVL 10

Expert Comment

by:kukno
ID: 22694555
Hm... if you use a wildcard in the option, it will download a lot more:

wget --include-directories=*flash* -r http://code.google.com/apis/maps/documention/

However, the download is then no longer limited to the path /apis/maps/documention/. I think wget is not able to do what you need. If you are not limited to Windows as a platform, you could try pavuk.

   http://www.pavuk.org/man.html

pavuk supports regular expressions in the URL and also recursive download.
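
For readers on a newer wget: GNU wget 1.14 and later added an `--accept-regex` option that matches a regular expression against the complete URL, which was not available when this thread was written. The following is a minimal sketch under that assumption; the function name `mirror_matching` and the `DRY_RUN` switch are illustrative only.

```shell
#!/bin/sh
# Sketch: recursive download restricted by a URL regular expression.
# Assumes GNU wget 1.14 or newer; mirror_matching and DRY_RUN are hypothetical.
mirror_matching() {
  url=$1     # start URL of the crawl
  regex=$2   # followed links must match this regular expression
  ${DRY_RUN:+echo} wget --recursive --no-parent \
    --accept-regex "$regex" "$url"
}

# Dry run: print the command for review instead of executing it.
DRY_RUN=1
mirror_matching "http://code.google.com/apis/maps/documentation/" \
  "/apis/maps/documentation/flash/"
```

Because the expression is applied to the links wget follows, pages under flash/ are only reached if some downloaded page that itself passes the filter links to them.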

Regards
Kurt
 
LVL 1

Author Comment

by:dtivmk
ID: 22694915
Hi Kurt,
the wget command-line modification you suggested does the same thing as before.

And yes, I am limited to Windows :-( and pavuk doesn't seem to be available for Cygwin.
 
LVL 10

Accepted Solution

by:
kukno earned 250 total points
ID: 22697465
Hm... no Linux... O.K., here is another alternative: w3mir. It's Perl-based and not restricted to Linux. I actually tried it on Windows and it works as expected.

http://www.langfeldt.net/w3mir/

Download w3mir, unpack it, and read the file INSTALL.w32. Basically, these are the steps to "install" it on Windows:

get and install WinZip from http://www.winzip.com/
get and install ActivePerl (now Build 509) from http://www.activeperl.com/
get nmake.exe from ftp://ftp.microsoft.com/Softlib/MSLFILES/nmake15.exe

After installing the tools above, run this in the unpacked w3mir directory:
   perl makefile.pl
   nmake
   nmake install

After that, w3mir will be installed in the default path of your Perl installation.

   w3mir -h

Here is a sample configuration file for your problem, w3mir.cfg:

# Retrieve the flash pages of the Google Maps documentation:
Options: recurse
#
# Start URL for the retrieval
URL: http://code.google.com/apis/maps/documentation/
# Fetch only URLs matching this Perl regular expression
Fetch-RE: m/flash/
# Store the mirror in this directory
cd: d:\mirror

Then run w3mir like this:

   mkdir d:\mirror
   w3mir -cfgfile w3mir.cfg
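
An unanchored expression such as `flash` matches that string anywhere in a URL, which may pull in more than the intended directory. The sketch below uses `grep -E` on a few made-up URLs to compare it with an anchored expression. The URL list is hypothetical, and whether w3mir applies `Fetch-RE` to the full URL in exactly this way is an assumption; the sketch only shows how the two expressions differ.

```shell
#!/bin/sh
# Hypothetical URLs a crawler might encounter while mirroring.
urls='http://code.google.com/apis/maps/documentation/flash/
http://code.google.com/apis/maps/documentation/flash/intro.html
http://code.google.com/apis/maps/documentation/index.html
http://code.google.com/apis/flash/other.html'

# Unanchored: counts every URL that contains "flash" anywhere.
loose_count=$(printf '%s\n' "$urls" | grep -cE 'flash')

# Anchored: counts only URLs below .../documentation/flash/.
strict_count=$(printf '%s\n' "$urls" |
  grep -cE '^http://code\.google\.com/apis/maps/documentation/flash/')

echo "unanchored: $loose_count, anchored: $strict_count"
```

If the mirror contains unwanted pages, anchoring the expression to the full directory prefix in the same way is one thing to try.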

Regards
Kurt
 
LVL 1

Author Comment

by:dtivmk
ID: 22912306
w3mir doesn't work as expected.
It downloads a lot more than I asked for.
 
LVL 10

Expert Comment

by:kukno
ID: 22915710
Well, that might depend on the configuration. Can you post your config here and describe WHAT the "additional/unwanted" stuff was?
 
LVL 1

Author Comment

by:dtivmk
ID: 22965560
I have not looked at the solution yet, but I am in a hurry: too many of my questions
are open, and my account will be suspended if I don't take action.