Solved

Wget : downloading urls matching a regular expression

Posted on 2008-10-10
Last Modified: 2013-12-20
Hi,
I want to download URLs recursively,
starting from http://code.google.com/apis/maps/,
but only those URLs that
match this pattern:
http://code.google.com/apis/maps/*

I tried wget -r -D http://code.google.com/apis/maps/ http://code.google.com/apis/maps/
but it downloads only index.html and stops.

I tried a few other options, but they didn't work as intended either.
0
Question by:dtivmk
 
LVL 10

Expert Comment

by:kukno
ID: 22690699
Hi,

there is an option "-I" or "--include-directories".

From the man page: http://linux.die.net/man/1/wget

-I list
--include-directories=list
    Specify a comma-separated list of directories you wish to follow when downloading. Elements of list may contain wildcards.

Sample: wget --include-directories *test*,*test2* -r http://www....
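
For reference, here is a minimal sketch of how -I might be combined with -np (--no-parent) for the case in this thread. The wget manual's own examples give -I directories as paths from the server root, so the sketch assumes /apis/maps/documentation/flash rather than a bare directory name. That path and the start URL are taken from the thread and have not been tested against the live site. The script only prints the command so it can be reviewed before it is run.

```shell
#!/bin/sh
# Sketch: restrict a recursive wget run to one directory subtree.
# START_URL and INCLUDE_DIR are illustrative values taken from the thread.
START_URL='http://code.google.com/apis/maps/documentation/flash/'
INCLUDE_DIR='/apis/maps/documentation/flash'

# Print the command instead of running it, so it can be checked first.
#   -r   recurse into linked pages
#   -np  never ascend to the parent directory
#   -I   follow only the listed directories (given as paths from the server root)
print_wget_cmd() {
  printf 'wget -r -np -I %s %s\n' "$INCLUDE_DIR" "$START_URL"
}

print_wget_cmd
```

When a -I pattern contains wildcards (for example '*flash*'), quoting it keeps the shell from expanding it against local file names.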

Regards
Kurt
0
 
LVL 10

Expert Comment

by:TOPIO
ID: 22690716
If wget is not user-friendly enough, you can try HTTrack:
http://www.httrack.com/
It does the same job but with a more user-friendly interface.
0
 
LVL 1

Author Comment

by:dtivmk
ID: 22693593
Hi Topio,
I want to download all urls matching this pattern : http://code.google.com/apis/maps/documentation/flash/
I used the following options :

URL -> http://code.google.com/apis/maps/documentation/
Set Options -> Scan Rules -> Include Links -> Criterion -> Folder names containing : String: flash
Limits -> Max mirroring depth : 5
Limits ->  Max external depth : 3

I got the following error :


---------------------------
WinHTTrack Website Copier
---------------------------
* * MIRROR ERROR! * *

HTTrack has detected that the current mirror is empty. If it was an update, the previous mirror has been restored.

Reason: the first page(s) either could not be found, or a connection problem occured.

=> Ensure that the website still exists, and/or check your proxy settings! <=
---------------------------
OK  
---------------------------
0

 
LVL 1

Author Comment

by:dtivmk
ID: 22693597
Hi kukno,
I tried this :
wget --include-directories=flash -r http://code.google.com/apis/maps/documention/
in order to download all the urls matching http://code.google.com/apis/maps/documention/flash/.
but, only `code.google.com/apis/maps/documentation/index.html' was downloaded.
0
 
LVL 10

Expert Comment

by:kukno
ID: 22693768
Can you please post a real-world example? The Google link you gave does not return anything:
"The requested URL /apis/maps/documention/ was not found on this server. "

0
 
LVL 1

Author Comment

by:dtivmk
ID: 22693780
0
 
LVL 10

Expert Comment

by:kukno
ID: 22694555
hm.. if you use a wildcard in the option, it will download a lot more:

wget --include-directories=*flash* -r http://code.google.com/apis/maps/documention/

However, the download is then no longer limited to the path /apis/maps/documention/. I don't think wget can do what you need. If you are not limited to Windows as a platform, you could try pavuk.

   http://www.pavuk.org/man.html

pavuk supports regular expressions in URLs as well as recursive downloads.
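
To make "URLs matching a regular expression" concrete, here is a small offline sketch that checks which URLs a pattern would select. The sample URLs are hypothetical and exist only to exercise the pattern. Newer wget releases (1.14 and later, as far as I know, which postdate the start of this thread) document --accept-regex and --reject-regex options; the commented line at the end assumes such a version and is untested here.

```shell
#!/bin/sh
# Regex for "everything under /apis/maps/documentation/flash/".
URL_RE='^http://code\.google\.com/apis/maps/documentation/flash/'

# Offline check: print only the URLs read from stdin that match URL_RE.
matches_pattern() {
  grep -E "$URL_RE"
}

# Hypothetical sample URLs, used only to exercise the regex.
printf '%s\n' \
  'http://code.google.com/apis/maps/documentation/flash/intro.html' \
  'http://code.google.com/apis/maps/documentation/index.html' \
  'http://code.google.com/apis/maps/documentation/flash/' |
  matches_pattern

# With a wget build that supports it, the same regex could be passed as:
#   wget -r --accept-regex "$URL_RE" http://code.google.com/apis/maps/documentation/flash/
```

Only the two URLs under .../documentation/flash/ are printed; the index.html one level up is filtered out.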

Regards
Kurt
0
 
LVL 1

Author Comment

by:dtivmk
ID: 22694915
hi Kurt,
The wget command line modification you suggested does the same thing as before.

and yes, I am limited to Windows :-(, pavuk doesn't seem to be there for cygwin.
0
 
LVL 10

Accepted Solution

by:
kukno earned 250 total points
ID: 22697465
Hm... no Linux. OK, here is another alternative: w3mir. It is Perl-based and not restricted to Linux. I tried it on Windows and it works as expected.

http://www.langfeldt.net/w3mir/

Download w3mir, unpack it, and read the file INSTALL.w32. These are basically the steps to "install" it on Windows:

Get and install WinZip from http://www.winzip.com/
Get and install ActivePerl (now Build 509) from http://www.activeperl.com/
Get nmake.exe from ftp://ftp.microsoft.com/Softlib/MSLFILES/nmake15.exe

After installing the tools above, run this in the unpacked w3mir directory:
   perl makefile.pl
   nmake
   nmake install

After that, w3mir will be installed in the default path of your Perl installation.

   w3mir -h

Here is a sample file for your problem: w3mir.cfg

# Mirror the Google Maps API documentation, fetching only URLs that match Fetch-RE:
Options: recurse
#
# Start URL of the mirror
URL: http://code.google.com/apis/maps/documentation/
Fetch-RE: m/flash/
cd: d:\mirror

Then run w3mir like this:

   mkdir d:\mirror
   w3mir -cfgfile w3mir.cfg
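
A note on the Fetch-RE line above: m/flash/ is an unanchored Perl pattern, so it matches any string that contains "flash" anywhere, not only URLs under .../documentation/flash/. The sketch below illustrates the difference with grep on a few hypothetical URLs. It does not run w3mir, and it assumes that w3mir applies Fetch-RE to the full URL.

```shell
#!/bin/sh
# Hypothetical URLs, used only to compare the two patterns.
urls='http://code.google.com/apis/maps/documentation/flash/index.html
http://code.google.com/apis/maps/documentation/flash/reference.html
http://code.google.com/apis/flash-examples/demo.html'

# Unanchored, like m/flash/: matches "flash" anywhere in the URL.
loose_count=$(printf '%s\n' "$urls" | grep -c 'flash')

# Tied to the directory the question is actually about.
strict_count=$(printf '%s\n' "$urls" | grep -c '/apis/maps/documentation/flash/')

echo "unanchored: $loose_count, directory-specific: $strict_count"
```

Under that assumption, a narrower pattern such as m/\/documentation\/flash\// might be worth trying in w3mir.cfg; this is untested.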

Regards
Kurt
0
 
LVL 1

Author Comment

by:dtivmk
ID: 22912306
w3mir doesn't work as expected.
It downloads a lot more than I asked for.
0
 
LVL 10

Expert Comment

by:kukno
ID: 22915710
Well, that might depend on the configuration. Can you post your config here and describe what the additional, unwanted stuff was?
0
 
LVL 1

Author Comment

by:dtivmk
ID: 22965560
I have not looked at the solution yet, but I am in a hurry: too many of my questions
are open, and the account would be suspended if I don't take action.
0
