wget to find broken links on a website

So, as the title says, I'm trying to get a list of broken links.  More importantly, though, I'm looking for what sites they're on.

This is what I tried:
wget --spider -r -o /log.txt will get me the list of broken links.

I then used wget -r to download the entire website.

I then had to use "cat log.txt | SED 's/http:\/\/www.website.com\//\//' > brokenhtml.txt " to strip the "http://www.website.com/" from the broken links, so that it matched the HREFs in the html of the downloaded pages

I then used grep -rlf brokenhtml.txt www.website.

Everything up to the last steps produces the expected results.  When I cat the log.txt and randomly paste the URL into a browser window, it correctly gives me a 404.  When I do the last step to find the pages that the HREFs are in, I get pages with no broken links on them.
LVL 1
lunanatAsked:
Who is Participating?
 
lunanatAuthor Commented:
apparently grep works better when you separate the parameters with spaces.

Current command that produces the desired results (thus far, it's still running):

grep -r -i -l -f b-links.txt www.website.com

the -f command allows you to use a file for matching patterns, one line per pattern.  I did also strip out a leading blank line from the pattern file... perhaps that was also breaking it.
0
 
jeremycrussellCommented:
You grep should probably be an iteration through the contents of brokenhtml.txt....  I.E.

while read line
 do
   grep -rlf $line www.website.
done <brokenhtml.txt

or...

for f in `cat brokenhtml.txt`
 do
   grep -rlf $f www.website.
don

or something to that effect.
0
 
jeremycrussellCommented:
Ah... I didn't even notice the f option... even when I retyped it.. my bad.
0
 
Mohan ShivaiahCommented:
sed -n '/www.website.com/,$ p' brokenhtml.txt > <out_put_file>
0
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.