Solved

wget to find broken links on a website

Posted on 2010-08-25
Last Modified: 2013-11-15
So, as the title says, I'm trying to get a list of broken links.  More importantly, though, I want to know which pages those links appear on.

This is what I tried:
wget --spider -r -o /log.txt will get me the list of broken links.

I then used wget -r to download the entire website.

I then had to use "cat log.txt | sed 's/http:\/\/www.website.com\//\//' > brokenhtml.txt" to strip the "http://www.website.com/" from the broken links, so that they matched the HREFs in the HTML of the downloaded pages.

I then used grep -rlf brokenhtml.txt www.website.

Everything up to the last steps produces the expected results.  When I cat the log.txt and randomly paste the URL into a browser window, it correctly gives me a 404.  When I do the last step to find the pages that the HREFs are in, I get pages with no broken links on them.
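For reference, the strip-and-search steps above could be sketched roughly as follows. This is only an illustration under assumptions: make_patterns is a hypothetical helper name, the input is assumed to hold one broken URL per line, and grep's -F (fixed string) option is swapped in so that dots in the links are not treated as regular-expression wildcards.

```shell
#!/bin/sh
# make_patterns is a hypothetical helper, not part of wget or grep.
# $1: file with one broken URL per line; $2: site prefix to strip.
make_patterns() {
    prefix=$2
    while IFS= read -r url; do
        case $url in
            "$prefix"*)
                rest=${url#"$prefix"}
                # Skip the bare prefix: a lone "/" would match almost any page.
                if [ -n "$rest" ]; then
                    printf '/%s\n' "$rest"
                fi
                ;;
        esac
    done < "$1"
    return 0
}

# Example use (file and directory names are assumptions):
#   make_patterns broken-urls.txt http://www.website.com/ > brokenhtml.txt
#   grep -r -l -F -f brokenhtml.txt www.website.com
```

Lines that do not start with the prefix, including blank lines, are dropped, so the resulting pattern file contains only site-relative links.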
Question by:lunanat
4 Comments
 
LVL 9

Expert Comment

by:jeremycrussell
ID: 33522227
Your grep should probably iterate through the contents of brokenhtml.txt, i.e.:

while read line
 do
   grep -rlf $line www.website.
done <brokenhtml.txt

or...

for f in `cat brokenhtml.txt`
 do
   grep -rlf $f www.website.
done

or something to that effect.
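A per-link loop also makes it possible to report which page references which link. Since grep's -f option expects a file name, a loop would pass each link with -e instead. The sketch below is one possible variant, not a tested recipe for the site in question: report_pages is a hypothetical helper name, and -F is an added assumption so that links are matched as literal strings.

```shell
#!/bin/sh
# report_pages is a hypothetical helper name.
# $1: pattern file (one link per line); $2: directory of downloaded pages.
report_pages() {
    while IFS= read -r link; do
        # An empty pattern matches every line, so skip blank lines.
        [ -n "$link" ] || continue
        # -F: treat the link as a literal string; -e: pass it as the pattern.
        grep -r -l -F -e "$link" "$2" | while IFS= read -r page; do
            # One output line per (link, page) pair, separated by a tab.
            printf '%s\t%s\n' "$link" "$page"
        done
    done < "$1"
}

# Example use (names are assumptions):
#   report_pages brokenhtml.txt www.website.com
```

This is slower than a single grep -f run because the tree is scanned once per link, but the output pairs each broken link with the pages containing it.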
LVL 1

Accepted Solution

by:lunanat
ID: 33522857
Apparently grep works better when you separate the options with spaces.

Current command that produces the desired results (thus far, it's still running):

grep -r -i -l -f b-links.txt www.website.com

The -f option allows you to use a file for matching patterns, one line per pattern.  I also stripped out a leading blank line from the pattern file... perhaps that was also breaking it.
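The blank line is the more likely culprit: under standard option parsing, "-rlf file" and "-r -l -f file" should behave the same, whereas an empty line in a pattern file is an empty pattern, and with GNU grep an empty pattern matches every line. The following demonstration uses a throwaway directory; all file names in it are made up for illustration.

```shell
#!/bin/sh
# Demonstration in a throwaway directory; all file names are made up.
demo=$(mktemp -d)
mkdir "$demo/site"
printf '<a href="/gone.html">broken</a>\n' > "$demo/site/bad.html"
printf '<p>nothing wrong here</p>\n' > "$demo/site/good.html"

# Pattern file with a leading blank line, as described above.
printf '\n/gone.html\n' > "$demo/with-blank.txt"
# Same patterns with empty lines removed.
grep -v '^$' "$demo/with-blank.txt" > "$demo/cleaned.txt"

# List the files matched by each pattern file.
with_blank=$(cd "$demo" && grep -r -i -l -f with-blank.txt site | sort)
cleaned=$(cd "$demo" && grep -r -i -l -f cleaned.txt site | sort)

printf 'with blank line:\n%s\n' "$with_blank"
printf 'cleaned:\n%s\n' "$cleaned"
```

With the blank line present, both files are listed; after cleaning, only site/bad.html is. That matches the symptom in the question, where pages without broken links showed up in the results.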
LVL 9

Expert Comment

by:jeremycrussell
ID: 33522868
Ah... I didn't even notice the f option... even when I retyped it.. my bad.
LVL 2

Expert Comment

by:Mohan Shivaiah
ID: 33537017
sed -n '/www.website.com/,$ p' brokenhtml.txt > <out_put_file>
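The command above comes without explanation; it appears to be aimed at dropping everything before the first line that mentions the site, which would include a leading blank line. A small sketch of what that address range does, using made-up sample contents:

```shell
#!/bin/sh
# Sample file contents are made up for illustration.
sample=$(mktemp)
printf '\nsome header line\nhttp://www.website.com/a.html\n\nhttp://www.website.com/b.html\n' > "$sample"

# Print from the first line matching the pattern through the end of the file.
trimmed=$(sed -n '/www.website.com/,$ p' "$sample")
printf '%s\n' "$trimmed"
rm -f "$sample"
```

Note that blank lines after the first match are kept, so this alone would not protect a pattern file from empty patterns further down.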