Solved

Wget: downloading URLs matching a regular expression

Posted on 2008-10-10
Last Modified: 2013-12-20
Hi,
I want to download URLs recursively,
starting from http://code.google.com/apis/maps/,
but I want to download only those URLs that
match this pattern:
http://code.google.com/apis/maps/*

I tried wget -r -D http://code.google.com/apis/maps/ http://code.google.com/apis/maps/
but it downloads only index.html and stops.

I tried a few other options, but they didn't work as intended either.
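
For reference, the pattern above amounts to a prefix test on the URL. Here is a minimal Python sketch of that test, assuming "match" means "same host and a path under /apis/maps/". The function name `in_scope` is made up for illustration, and the sample URLs other than the start URL are hypothetical. As a side note, the wget manual describes `-D` as taking a comma-separated list of domain names, not a URL, so the `-D` argument in the command above probably does not do what was intended.

```python
from urllib.parse import urlsplit

# Start URL taken from the question.
START = "http://code.google.com/apis/maps/"


def in_scope(url: str, start: str = START) -> bool:
    """Return True if url is on the same scheme/host as start and below its path."""
    u, s = urlsplit(url), urlsplit(start)
    same_site = (u.scheme, u.netloc) == (s.scheme, s.netloc)
    return same_site and u.path.startswith(s.path)


if __name__ == "__main__":
    # Hypothetical candidate links, as a crawler might find them on a page.
    candidates = [
        "http://code.google.com/apis/maps/documentation/flash/",
        "http://code.google.com/apis/ajax/",
        "http://www.google.com/apis/maps/",
    ]
    for candidate in candidates:
        print(candidate, in_scope(candidate))
```

A recursive downloader that applied this test to every link it found would stay inside the /apis/maps/ hierarchy.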
Question by:dtivmk
12 Comments
 
LVL 10

Expert Comment

by:kukno
ID: 22690699
Hi,

there is an option "-I" or "--include-directories".

From the man page: http://linux.die.net/man/1/wget

-I list
--include-directories=list
    Specify a comma-separated list of directories you wish to follow when downloading. Elements of list may contain wildcards.

Sample: wget --include-directories *test*,*test2* -r http://www....
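
To make the sample concrete for the URL in the question, here is a sketch in Python that only assembles a wget command line and prints it; it does not run wget. `--recursive`, `--no-parent` and `--include-directories` are documented wget options. The helper name `build_wget_command` and the choice to pass the directory as an absolute path are assumptions, and whether the result downloads exactly what is wanted depends on how the site links its pages.

```python
import shlex
from urllib.parse import urlsplit


def build_wget_command(start_url, include_dirs=None):
    """Assemble (but do not run) a recursive wget command as an argument list."""
    # --no-parent keeps the recursion below the directory of the start URL.
    cmd = ["wget", "--recursive", "--no-parent"]
    if include_dirs:
        # wget expects a comma-separated list of directories here.
        cmd.append("--include-directories=" + ",".join(include_dirs))
    cmd.append(start_url)
    return cmd


if __name__ == "__main__":
    start = "http://code.google.com/apis/maps/documentation/"
    # Assumed target directory: the "flash" folder below the start URL.
    flash_dir = urlsplit(start).path + "flash"
    print(shlex.join(build_wget_command(start, [flash_dir])))
```

Building the arguments as a list avoids shell quoting problems with wildcards such as `*`, which an unquoted command line may expand before wget sees them.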

Regards
Kurt
 
LVL 10

Expert Comment

by:TOPIO
ID: 22690716
If wget is not user-friendly enough, you can try HTTrack
http://www.httrack.com/
which does the same job with a more user-friendly interface.
 
LVL 1

Author Comment

by:dtivmk
ID: 22693593
Hi Topio,
I want to download all URLs matching this pattern: http://code.google.com/apis/maps/documentation/flash/
I used the following options :

URL -> http://code.google.com/apis/maps/documentation/
Set Options -> Scan Rules -> Include Links -> Criterion -> Folder names containing : String: flash
Limits -> Max mirroring depth : 5
Limits ->  Max external depth : 3

I got the following error :


---------------------------
WinHTTrack Website Copier
---------------------------
* * MIRROR ERROR! * *

HTTrack has detected that the current mirror is empty. If it was an update, the previous mirror has been restored.

Reason: the first page(s) either could not be found, or a connection problem occured.

=> Ensure that the website still exists, and/or check your proxy settings! <=
---------------------------
OK  
---------------------------

 
LVL 1

Author Comment

by:dtivmk
ID: 22693597
Hi kukno,
I tried this :
wget --include-directories=flash -r http://code.google.com/apis/maps/documention/
in order to download all the URLs matching http://code.google.com/apis/maps/documention/flash/,
but only `code.google.com/apis/maps/documentation/index.html' was downloaded.
 
LVL 10

Expert Comment

by:kukno
ID: 22693768
Can you please post a real-world sample? The link on Google does not contain anything...
"The requested URL /apis/maps/documention/ was not found on this server. "

 
LVL 1

Author Comment

by:dtivmk
ID: 22693780
0
 
LVL 10

Expert Comment

by:kukno
ID: 22694555
hm.. if you use a wildcard in the option, it will download a lot more:

wget --include-directories=*flash* -r http://code.google.com/apis/maps/documention/

However, then it's no longer limited to the path /apis/maps/documention/. I think wget is not able to do what you need. If you are not limited to Windows as platform, you could try pavuk.

   http://www.pavuk.org/man.html

pavuk supports regular expressions in URLs and also recursive downloads.
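
Why the wildcard widens the match can be illustrated with Python's `fnmatch` module. This is only an analogy: wget's own matching rules may differ in detail, and the directory paths listed below are invented for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical directory paths a crawler might encounter on the site.
DIRS = [
    "/apis/maps/documentation/flash",
    "/apis/maps/documentation/flash/reference",
    "/apis/flash/other",
    "/apis/maps/documentation/javascript",
]


def matching_dirs(pattern, candidates):
    """Return the candidates that match the shell-style wildcard pattern."""
    return [d for d in candidates if fnmatchcase(d, pattern)]


if __name__ == "__main__":
    # Unanchored wildcard: accepts any path containing "flash".
    print(matching_dirs("*flash*", DIRS))
    # Pattern anchored to the intended parent directory.
    print(matching_dirs("/apis/maps/documentation/flash*", DIRS))
```

In this illustration the unanchored pattern also accepts the directory outside /apis/maps/, while the pattern that spells out the parent path does not.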

Regards
Kurt
 
LVL 1

Author Comment

by:dtivmk
ID: 22694915
Hi Kurt,
the wget command-line modification you suggested does the same thing as before.

And yes, I am limited to Windows :-(, and pavuk doesn't seem to be available for Cygwin.
 
LVL 10

Accepted Solution

by:
kukno earned 250 total points
ID: 22697465
Hm... no Linux... OK, here is another alternative: w3mir. It's Perl-based and not restricted to Linux. I actually tried it on Windows and it works as expected.

http://www.langfeldt.net/w3mir/

Download w3mir, unpack it, and read the file INSTALL.w32. Basically, these are the steps to "install" it on Windows:

get and install winzip from http://www.winzip.com/
get and install ActivePerl (now Build 509) from http://www.activeperl.com/
get nmake.exe from ftp://ftp.microsoft.com/Softlib/MSLFILES/nmake15.exe

After installing the tools above, do this in the unpacked w3mir directory:
   perl makefile.pl
   nmake
   nmake install

After that, w3mir will be installed in the default path of your Perl installation.

   w3mir -h

Here is a sample file for your problem: w3mir.cfg

# Retrieve the Google Maps API documentation pages matching "flash":
Options: recurse
#
# Start URL for the mirror
URL: http://code.google.com/apis/maps/documentation/
# Only fetch URLs matching this Perl regular expression
Fetch-RE: m/flash/
# Change to this directory before mirroring
cd: d:\mirror

Then run w3mir like this:

   mkdir d:\mirror
   w3mir -cfgfile w3mir.cfg
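
One thing to watch with a pattern like m/flash/ is that it is unanchored. The Python sketch below mimics that with `re.search`, assuming the expression is tested against the full URL; that is an assumption about w3mir, not something checked against its documentation. The sample URLs other than the documentation URL are hypothetical.

```python
import re

# Unanchored pattern, analogous to m/flash/ in the config.
LOOSE = re.compile(r"flash")
# Anchored alternative that spells out the intended prefix (illustrative only).
STRICT = re.compile(r"^http://code\.google\.com/apis/maps/documentation/flash/")

# Hypothetical URLs a mirroring tool might discover.
URLS = [
    "http://code.google.com/apis/maps/documentation/flash/",
    "http://code.google.com/apis/maps/documentation/flash/intro.html",
    "http://code.google.com/apis/maps/documentation/flashless.html",
    "http://code.google.com/apis/maps/documentation/index.html",
]


def select(pattern, urls):
    """Return the URLs in which the compiled pattern matches anywhere."""
    return [u for u in urls if pattern.search(u)]


if __name__ == "__main__":
    print(select(LOOSE, URLS))
    print(select(STRICT, URLS))
```

If the mirror pulls in more than expected, anchoring the expression to the full intended prefix is one thing to try.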

Regards
Kurt
 
LVL 1

Author Comment

by:dtivmk
ID: 22912306
w3mir doesn't work as expected.
It downloads a lot more than I asked for.
 
LVL 10

Expert Comment

by:kukno
ID: 22915710
Well, that might depend on the configuration. Can you post your config here and describe what the "additional/unwanted" stuff was?
 
LVL 1

Author Comment

by:dtivmk
ID: 22965560
I have not looked at the solution yet, but I am in a hurry because too many of my questions
are open and the account will be suspended if I don't take action.


Question has a verified solution.
