Solved

wget to find broken links on a website

Posted on 2010-08-25
4
766 Views
Last Modified: 2013-11-15
So, as the title says, I'm trying to get a list of broken links.  More importantly, though, I'm looking for what sites they're on.

This is what I tried:
wget --spider -r -o /log.txt will get me the list of broken links.

I then used wget -r to download the entire website.

I then had to use "cat log.txt | SED 's/http:\/\/www.website.com\//\//' > brokenhtml.txt " to strip the "http://www.website.com/" from the broken links, so that it matched the HREFs in the html of the downloaded pages

I then used grep -rlf brokenhtml.txt www.website.

Everything up to the last steps produces the expected results.  When I cat the log.txt and randomly paste the URL into a browser window, it correctly gives me a 404.  When I do the last step to find the pages that the HREFs are in, I get pages with no broken links on them.
0
Comment
Question by:lunanat
  • 2
4 Comments
 
LVL 9

Expert Comment

by:jeremycrussell
ID: 33522227
You grep should probably be an iteration through the contents of brokenhtml.txt....  I.E.

while read line
 do
   grep -rlf $line www.website.
done <brokenhtml.txt

or...

for f in `cat brokenhtml.txt`
 do
   grep -rlf $f www.website.
don

or something to that effect.
0
 
LVL 1

Accepted Solution

by:
lunanat earned 0 total points
ID: 33522857
apparently grep works better when you separate the parameters with spaces.

Current command that produces the desired results (thus far, it's still running):

grep -r -i -l -f b-links.txt www.website.com

the -f command allows you to use a file for matching patterns, one line per pattern.  I did also strip out a leading blank line from the pattern file... perhaps that was also breaking it.
0
 
LVL 9

Expert Comment

by:jeremycrussell
ID: 33522868
Ah... I didn't even notice the f option... even when I retyped it.. my bad.
0
 
LVL 2

Expert Comment

by:Mohan Shivaiah
ID: 33537017
sed -n '/www.website.com/,$ p' brokenhtml.txt > <out_put_file>
0

Featured Post

Comprehensive Backup Solutions for Microsoft

Acronis protects the complete Microsoft technology stack: Windows Server, Windows PC, laptop and Surface data; Microsoft business applications; Microsoft Hyper-V; Azure VMs; Microsoft Windows Server 2016; Microsoft Exchange 2016 and SQL Server 2016.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

It’s 2016. Password authentication should be dead — or at least close to dying. But, unfortunately, it has not traversed Quagga stage yet. Using password authentication is like laundering hotel guest linens with a washboard — it’s Passé.
Introduction This article is intended for those who are new to PHP error handling (https://www.experts-exchange.com/articles/11769/And-by-the-way-I-am-New-to-PHP.html).  It addresses one of the most common problems that plague beginning PHP develop…
Learn several ways to interact with files and get file information from the bash shell. ls lists the contents of a directory: Using the -a flag displays hidden files: Using the -l flag formats the output in a long list: The file command gives us mor…
Learn how to navigate the file tree with the shell. Use pwd to print the current working directory: Use ls to list a directory's contents: Use cd to change to a new directory: Use wildcards instead of typing out long directory names: Use ../ to move…

831 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question