Solved

Wget : downloading urls matching a regular expression

Posted on 2008-10-10
Last Modified: 2013-12-20
Hi,
I want to download URLs recursively,
starting from http://code.google.com/apis/maps/,
but I want to download only those URLs which
match this pattern:
http://code.google.com/apis/maps/*

I tried wget -r -D http://code.google.com/apis/maps/ http://code.google.com/apis/maps/
but it downloads only index.html and stops.

I tried a few other options, but they didn't work as intended either.
Question by:dtivmk
 
LVL 10

Expert Comment

by:kukno
ID: 22690699
Hi,

there is an option "-I" or "--include-directories".

From the man page: http://linux.die.net/man/1/wget

-I list
--include-directories=list
    Specify a comma-separated list of directories you wish to follow when downloading. Elements of list may contain wildcards.

Sample: wget --include-directories *test*,*test2* -r http://www....
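For the URL in the question, the options might combine roughly as in the sketch below. It assumes that `-I` wants server-side directory paths with a leading slash (as in the wget manual's own examples), and it adds `-np` (`--no-parent`), which is not mentioned above, so that wget never climbs above the start directory. Note also that `-D` takes a list of domain names such as code.google.com, not a URL. The command is only printed, so it can be inspected before it is run.

```shell
#!/bin/sh
# Dry run: build the wget command and print it instead of executing it.
start_url="http://code.google.com/apis/maps/"
include_dir="/apis/maps"   # assumption: path as seen on the server, with leading slash

# -r   recurse into linked pages
# -np  (--no-parent) never ascend above the start directory
# -I   only follow links into the listed directories
cmd="wget -r -np -I $include_dir $start_url"
echo "$cmd"
```

Copy the printed line into the shell once it looks right.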

Regards
Kurt
 
LVL 10

Expert Comment

by:TOPIO
ID: 22690716
If wget is not user-friendly enough, you can try HTTrack
http://www.httrack.com/
which does the same job with a more user-friendly interface.
 
LVL 1

Author Comment

by:dtivmk
ID: 22693593
Hi Topio,
I want to download all URLs matching this pattern: http://code.google.com/apis/maps/documentation/flash/
I used the following options:

URL -> http://code.google.com/apis/maps/documentation/
Set Options -> Scan Rules -> Include Links -> Criterion -> Folder names containing : String: flash
Limits -> Max mirroring depth : 5
Limits ->  Max external depth : 3

I got the following error :


---------------------------
WinHTTrack Website Copier
---------------------------
* * MIRROR ERROR! * *

HTTrack has detected that the current mirror is empty. If it was an update, the previous mirror has been restored.

Reason: the first page(s) either could not be found, or a connection problem occured.

=> Ensure that the website still exists, and/or check your proxy settings! <=
---------------------------
OK  
---------------------------
 
LVL 1

Author Comment

by:dtivmk
ID: 22693597
Hi kukno,
I tried this :
wget --include-directories=flash -r http://code.google.com/apis/maps/documention/
in order to download all the urls matching http://code.google.com/apis/maps/documention/flash/.
but, only `code.google.com/apis/maps/documentation/index.html' was downloaded.
 
LVL 10

Expert Comment

by:kukno
ID: 22693768
Can you please post a real-world sample? The link on Google does not contain anything...
"The requested URL /apis/maps/documention/ was not found on this server. "

 
LVL 1

Author Comment

by:dtivmk
ID: 22693780
 
LVL 10

Expert Comment

by:kukno
ID: 22694555
hm.. if you use a wildcard in the option, it will download a lot more:

wget --include-directories=*flash* -r http://code.google.com/apis/maps/documention/
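A note on typing this in a Unix-like shell such as Cygwin's bash: an unquoted `*flash*` can be expanded by the shell against local file names before wget ever sees it, so quoting the pattern is safer. The sketch below only prints the command. It spells the path `documentation`, and it adds `-np` (`--no-parent`), which is an addition not proposed in this thread, to keep wget from ascending above the start directory.

```shell
#!/bin/sh
# Dry run: print the wget command with the wildcard protected from the shell.
pattern='*flash*'   # single quotes stop the shell from globbing the pattern
start_url="http://code.google.com/apis/maps/documentation/"

# Keep the option and its value together as one quoted argument.
arg="--include-directories=$pattern"
echo wget -r -np "$arg" "$start_url"
```

When running the real command, keep the double quotes around `$arg` so the pattern reaches wget unchanged.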

However, then it's no longer limited to the path /apis/maps/documention/. I think wget is not able to do what you need. If you are not limited to Windows as platform, you could try pavuk.

   http://www.pavuk.org/man.html

pavuk supports regular expressions in the URL and also recursive download.
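For readers with a newer wget: releases from 1.14 onward (which postdate this thread) have `--accept-regex` and `--reject-regex`, which are matched against the complete URL. The sketch below is untested against the site itself. The regular expression and the second test URL are illustrative assumptions, and `grep -E` is used only as a local stand-in to try the pattern before handing it to wget.

```shell
#!/bin/sh
# Anchored pattern: only URLs that start with the Maps API prefix.
url_regex='^http://code\.google\.com/apis/maps/'
start_url="http://code.google.com/apis/maps/"

# Try the pattern locally; grep -E stands in for wget's regex matching.
is_wanted() {
    printf '%s\n' "$1" | grep -Eq "$url_regex"
}
is_wanted "http://code.google.com/apis/maps/documentation/" && echo "maps URL: accepted"
is_wanted "http://code.google.com/apis/other/" || echo "other URL: rejected"   # hypothetical URL

# Dry run: print the wget command instead of executing it.
echo "wget -r --accept-regex '$url_regex' $start_url"
```

Anchoring the pattern with `^` is what restricts the crawl to the prefix; without it, any URL that merely contains the string somewhere would match.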

Regards
Kurt
 
LVL 1

Author Comment

by:dtivmk
ID: 22694915
Hi Kurt,
the wget command-line modification you suggested does the same thing as before.

And yes, I am limited to Windows :-( pavuk doesn't seem to be available for Cygwin.
 
LVL 10

Accepted Solution

by:
kukno earned 250 total points
ID: 22697465
Hm... no Linux... O.K., here is another alternative: w3mir. It's Perl-based and not restricted to Linux. I actually tried it on Windows and it works as expected.

http://www.langfeldt.net/w3mir/

Download w3mir, unpack it, and read the file INSTALL.w32. Basically, these are the steps to "install" it on Windows:

get and install winzip from http://www.winzip.com/
get and install ActivePerl (now Build 509) from http://www.activeperl.com/
get nmake.exe from ftp://ftp.microsoft.com/Softlib/MSLFILES/nmake15.exe

After installing the tools above, do this in the unpacked w3mir directory:
   perl makefile.pl
   nmake
   nmake install

After that, w3mir will be installed in the default path of your Perl installation.

   w3mir -h

Here is a sample file for your problem: w3mir.cfg

# Retrieve all of janl's home pages:
Options: recurse
#
# This is the two argument form of URL:.  It fetches the first into the second
URL: http://code.google.com/apis/maps/documentation/
Fetch-RE: m/flash/
cd: d:\mirror

Then run w3mir like this:

   mkdir d:\mirror
   w3mir -cfgfile w3mir.cfg
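One thing worth checking before mirroring: the pattern `flash` in the config is unanchored, so as a regular expression it matches any URL that contains the string "flash" anywhere. A quick way to see what a pattern would match is to try it on a few URLs with grep. In the sketch below the three URLs are made-up examples, and grep's extended syntax stands in for Perl's (the two agree for simple patterns like these).

```shell
#!/bin/sh
# Made-up sample URLs, one per line.
urls='http://code.google.com/apis/maps/documentation/flash/intro.html
http://code.google.com/apis/maps/documentation/index.html
http://example.com/flash/other.html'

# Count the URLs matched by the unanchored and by an anchored pattern.
loose=$(printf '%s\n' "$urls" | grep -Ec 'flash')
strict=$(printf '%s\n' "$urls" | grep -Ec '^http://code\.google\.com/apis/maps/documentation/flash/')

echo "unanchored: $loose, anchored: $strict"
```

With these samples the unanchored pattern matches two URLs and the anchored one matches only the first, so an anchored expression may be the better starting point if too much is being fetched.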

Regards
Kurt
 
LVL 1

Author Comment

by:dtivmk
ID: 22912306
w3mir doesn't work as expected.
It downloads a lot more than I asked for.
 
LVL 10

Expert Comment

by:kukno
ID: 22915710
Well, that might depend on the configuration. Can you post your config here and describe WHAT the "additional/unwanted" stuff was?
 
LVL 1

Author Comment

by:dtivmk
ID: 22965560
I have not looked at the solution yet, but I am in a hurry: too many of my questions
are open, and my account would be suspended if I didn't take action.