Solved

wget to find broken links on a website

Posted on 2010-08-25
Last Modified: 2013-11-15
So, as the title says, I'm trying to get a list of broken links.  More importantly, though, I'm looking for which pages they're on.

This is what I tried:
wget --spider -r -o /log.txt will get me the list of broken links.

I then used wget -r to download the entire website.

I then had to use "cat log.txt | sed 's/http:\/\/www\.website\.com\//\//' > brokenhtml.txt" to strip the "http://www.website.com/" from the broken links, so that they matched the HREFs in the HTML of the downloaded pages.

I then used grep -rlf brokenhtml.txt www.website.

Everything up to the last step produces the expected results.  When I cat log.txt and paste a random URL from it into a browser window, it correctly gives me a 404.  When I do the last step to find the pages that the HREFs are in, I get pages with no broken links on them.
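The sed step described above can be sketched as follows. This is only an illustration under assumptions: the two sample URLs are made up, and real input would be the broken-link list taken from the wget log. Using | as the sed delimiter avoids escaping every slash, and escaping the dots keeps them from matching arbitrary characters.

```shell
#!/bin/sh
# Hypothetical sample of broken URLs; real input would come from the wget log.
urls=$(mktemp)
printf '%s\n' \
  'http://www.website.com/old/page.html' \
  'http://www.website.com/img/logo.png' > "$urls"

# Use | as the sed delimiter so the slashes need no escaping,
# anchor the match at the start of the line, and escape the dots
# so they match literal dots only.
paths=$(sed 's|^http://www\.website\.com/|/|' "$urls")

echo "$paths"
rm -f "$urls"
```

Note that this only helps if the downloaded pages use site-relative HREFs such as /old/page.html; pages that link with full URLs or paths relative to the current directory would need different patterns.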
Question by:lunanat
LVL 9

Expert Comment

by:jeremycrussell
ID: 33522227
Your grep should probably iterate through the contents of brokenhtml.txt, i.e.:

while read line
 do
   grep -rlf $line www.website.
done <brokenhtml.txt

or...

for f in `cat brokenhtml.txt`
 do
   grep -rlf $f www.website.
done

or something to that effect.
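A per-link loop of this kind would need each link passed as the pattern itself (with -e) rather than as a pattern file (with -f). A minimal sketch, assuming GNU or BSD grep with -r support; the demo directory, file names, and links are invented for illustration. It has the side benefit of reporting which broken link sits on which page.

```shell
#!/bin/sh
# Hypothetical demo data standing in for brokenhtml.txt and the mirrored site.
demo=$(mktemp -d)
mkdir "$demo/site"
printf '<a href="/gone.html">a</a>\n' > "$demo/site/index.html"
printf '<a href="/ok.html">b</a>\n' > "$demo/site/about.html"
printf '/gone.html\n/also-gone.html\n' > "$demo/broken.txt"

# For each broken link, list the files that mention it.
report=$(
  cd "$demo" &&
  while IFS= read -r link; do
    [ -n "$link" ] || continue   # skip blank lines
    # -F treats the link as a fixed string; -e passes it as the pattern.
    grep -r -l -F -e "$link" site | while IFS= read -r page; do
      printf '%s -> %s\n' "$link" "$page"
    done
  done < broken.txt
)

echo "$report"
rm -rf "$demo"
```

This runs grep once per link, so it will be slower than a single grep -f pass on a large site, but the output pairs every link with the page that contains it.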
LVL 1

Accepted Solution

by:
lunanat earned 0 total points
ID: 33522857
Apparently grep works better when you separate the parameters with spaces.

Current command that produces the desired results (thus far, it's still running):

grep -r -i -l -f b-links.txt www.website.com

The -f option allows you to use a file for matching patterns, one line per pattern.  I did also strip out a leading blank line from the pattern file... perhaps that was also breaking it.
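The blank line is a plausible culprit: with GNU grep, an empty line in a -f pattern file is an empty pattern, which matches every line, so -l lists every file. The sketch below demonstrates this on made-up data (the directory and file names are hypothetical) and shows one way to drop empty lines before matching. Adding -F, which is not in the command above, treats the patterns as fixed strings so dots in the paths are not read as regex wildcards.

```shell
#!/bin/sh
# Hypothetical demo data; the file and directory names are made up.
demo=$(mktemp -d)
mkdir "$demo/site"
printf '<a href="/missing.html">x</a>\n' > "$demo/site/bad.html"
printf '<a href="/fine.html">y</a>\n' > "$demo/site/good.html"

# Pattern file with a leading blank line, as described above.
printf '\n/missing.html\n' > "$demo/patterns.txt"

# The empty pattern matches every line, so both files are listed.
with_blank=$(cd "$demo" && grep -r -l -f patterns.txt site | sort)

# Drop empty lines, then match the rest as fixed strings (-F).
grep -v '^$' "$demo/patterns.txt" > "$demo/clean.txt"
without_blank=$(cd "$demo" && grep -r -l -F -f clean.txt site | sort)

echo "with blank line:"
echo "$with_blank"
echo "without blank line:"
echo "$without_blank"
rm -rf "$demo"
```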
LVL 9

Expert Comment

by:jeremycrussell
ID: 33522868
Ah... I didn't even notice the f option... even when I retyped it.. my bad.
LVL 2

Expert Comment

by:Mohan Shivaiah
ID: 33537017
sed -n '/www.website.com/,$ p' brokenhtml.txt > <out_put_file>