  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 8633

Wget: downloading URLs matching a regular expression

Hi,
I want to download URLs recursively, starting from http://code.google.com/apis/maps/, but I want to download only those URLs which match this pattern:
http://code.google.com/apis/maps/*

I tried wget -r -D http://code.google.com/apis/maps/ http://code.google.com/apis/maps/
but it downloads only index.html and stops.

I tried a few other options, but they didn't work as intended either.
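
For reference, the man page says -D/--domains expects a comma-separated list of domain names, not a URL, and it is meant to restrict a host-spanning (-H) crawl to those domains, e.g.:

   wget -r -H -D code.google.com http://code.google.com/apis/maps/

so my attempt above was probably not doing any filtering at all.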
 
kukno Commented:
Hi,

there is an option "-I" or "--include-directories".

From the man page: http://linux.die.net/man/1/wget

-I list
--include-directories=list
    Specify a comma-separated list of directories you wish to follow when downloading. Elements of list may contain wildcards.

Sample: wget --include-directories='*test*,*test2*' -r http://www....
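
Applied to the URL in question, a sketch (the -I list is matched against the directory part of the path; --no-parent additionally stops wget from climbing above the start directory):

   wget -r --no-parent --include-directories='/apis/maps' http://code.google.com/apis/maps/

Note the quotes around wildcard arguments: without them the shell may expand *test* against local file names before wget ever sees the option.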

Regards
Kurt
 
TOPIO Commented:
If wget is not user-friendly enough, you can try HTTrack:
http://www.httrack.com/
It does the same but with a more user-friendly interface.
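
On the command line, the same filtering can be sketched with HTTrack's +/- scan rules (the output path and filter glob here are just examples and may need tuning):

   httrack http://code.google.com/apis/maps/ -O d:\mirror "+http://code.google.com/apis/maps/*"

Here -O sets the output directory and the "+..." rule tells HTTrack to follow only links matching the glob.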
 
dtivmk (Author) Commented:
Hi Topio,
I want to download all URLs matching this pattern: http://code.google.com/apis/maps/documentation/flash/
I used the following options:

URL -> http://code.google.com/apis/maps/documentation/
Set Options -> Scan Rules -> Include Links -> Criterion -> Folder names containing: String: flash
Limits -> Max mirroring depth: 5
Limits -> Max external depth: 3

I got the following error:

---------------------------
WinHTTrack Website Copier
---------------------------
* * MIRROR ERROR! * *

HTTrack has detected that the current mirror is empty. If it was an update, the previous mirror has been restored.

Reason: the first page(s) either could not be found, or a connection problem occured.

=> Ensure that the website still exists, and/or check your proxy settings! <=
---------------------------
OK  
---------------------------
 
dtivmk (Author) Commented:
Hi kukno,
I tried this:
wget --include-directories=flash -r http://code.google.com/apis/maps/documention/
in order to download all the URLs matching http://code.google.com/apis/maps/documention/flash/,
but only `code.google.com/apis/maps/documentation/index.html' was downloaded.
 
kukno Commented:
Can you please post a real-world sample? The link on Google does not contain anything:
"The requested URL /apis/maps/documention/ was not found on this server."

kukno Commented:
Hm... if you use a wildcard in the option, it will download a lot more:

wget --include-directories='*flash*' -r http://code.google.com/apis/maps/documention/

However, then it's no longer limited to the path /apis/maps/documention/. I think wget is not able to do what you need. If you are not limited to Windows as a platform, you could try pavuk.

   http://www.pavuk.org/man.html

pavuk supports regular expressions in the URL and also recursive download.

Regards
Kurt
 
dtivmk (Author) Commented:
Hi Kurt,
the wget command-line modification you suggested does the same thing as before.

And yes, I am limited to Windows :-( pavuk doesn't seem to be available for Cygwin.
 
kukno Commented:
Hm... no Linux... O.K., here is another alternative: w3mir. It's Perl-based and not restricted to Linux. I actually tried it on Windows and it works as expected.

http://www.langfeldt.net/w3mir/

Download w3mir, unpack it, and read the file INSTALL.w32. Basically, these are the steps to "install" it on Windows:

Get and install WinZip from http://www.winzip.com/
Get and install ActivePerl (now Build 509) from http://www.activeperl.com/
Get nmake.exe from ftp://ftp.microsoft.com/Softlib/MSLFILES/nmake15.exe

After installing the tools above, do this in the unpacked w3mir directory:
   perl Makefile.PL
   nmake
   nmake install

After that, w3mir will be installed in the default path of your Perl installation.

   w3mir -h

Here is a sample file for your problem: w3mir.cfg

# Retrieve the flash documentation pages of the Maps API:
Options: recurse
#
# This is the one-argument form of URL:. It names the starting point of the mirror
URL: http://code.google.com/apis/maps/documentation/
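# Fetch-RE: fetch only URLs that match this regular expression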
Fetch-RE: m/flash/
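# cd: the local directory the mirror is written into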
cd: d:\mirror

Then run w3mir like this:

   mkdir d:\mirror
   w3mir -cfgfile w3mir.cfg

Regards
Kurt
 
dtivmk (Author) Commented:
w3mir doesn't work as expected; it downloads a lot more stuff than I asked for.
 
kukno Commented:
Well, that might depend on the configuration. Can you post your config here and describe what the "additional/unwanted" stuff was?
 
dtivmk (Author) Commented:
I have not looked at the solution yet, but I am in a hurry since too many of my questions are open and my account would be suspended if I don't take action.