Solved

wget to find broken links on a website

Posted on 2010-08-25
Last Modified: 2013-11-15
So, as the title says, I'm trying to get a list of broken links. More importantly, though, I'm looking for which pages they're on.

This is what I tried:
wget --spider -r -o /log.txt will get me the list of broken links.

I then used wget -r to download the entire website.

I then had to use "sed 's/http:\/\/www\.website\.com\//\//' log.txt > brokenhtml.txt" to strip the "http://www.website.com/" from the broken links, so that they matched the HREFs in the HTML of the downloaded pages.
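A small sketch of that stripping step, run on a made-up two-line sample rather than a real wget log. The sample URLs and the temporary file are assumptions for illustration. Using `|` as the sed delimiter avoids escaping the slashes, and the dots are escaped so they match literally.

```shell
# Hypothetical sample standing in for the URLs pulled from log.txt.
tmp=$(mktemp)
printf '%s\n' \
  'http://www.website.com/about/team.html' \
  'http://www.website.com/old/page.html' > "$tmp"

# Replace the scheme and host with a single leading slash.
# '|' is the delimiter, so the slashes in the URL need no escaping.
stripped=$(sed 's|^http://www\.website\.com/|/|' "$tmp")

printf '%s\n' "$stripped"
rm -f "$tmp"
```

The result is one site-relative path per line, which is the form the HREFs take in the downloaded pages if the site uses root-relative links.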

I then used grep -rlf brokenhtml.txt www.website.

Everything up to the last step produces the expected results. When I cat log.txt and paste a random URL into a browser window, it correctly gives me a 404. But when I run the last step to find the pages that contain those HREFs, I get pages with no broken links on them.
Question by:lunanat
LVL 9

Expert Comment

by:jeremycrussell
ID: 33522227
Your grep should probably be an iteration through the contents of brokenhtml.txt, i.e.:

while read line
 do
   grep -rlf $line www.website.
done <brokenhtml.txt

or...

for f in `cat brokenhtml.txt`
 do
   grep -rlf $f www.website.
done

or something to that effect.
LVL 1

Accepted Solution

by:
lunanat earned 0 total points
ID: 33522857
Apparently grep works better when you separate the options with spaces.

Current command that produces the desired results (so far; it's still running):

grep -r -i -l -f b-links.txt www.website.com

The -f option lets you use a file of patterns to match, one pattern per line. I also stripped a leading blank line out of the pattern file; perhaps that was also breaking it.
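One plausible explanation for the earlier false matches, consistent with the blank line mentioned above: with -f, an empty line in the pattern file is an empty pattern, and an empty pattern matches every line, so every file gets listed. A minimal sketch with made-up files (the directory, page names, and pattern are all hypothetical):

```shell
# Hypothetical mini-site: one page links to the broken URL, one does not.
site=$(mktemp -d)
printf '<a href="/old/page.html">old</a>\n' > "$site/has-broken.html"
printf '<a href="/index.html">home</a>\n' > "$site/clean.html"

# Pattern file with a leading blank line, as in the original attempt.
patterns=$(mktemp)
printf '\n/old/page.html\n' > "$patterns"

# The empty pattern matches every line, so both files are reported.
with_blank=$(grep -r -i -l -f "$patterns" "$site" | wc -l | tr -d ' ')

# Drop empty lines from the pattern file and try again.
cleaned=$(mktemp)
grep -v '^$' "$patterns" > "$cleaned"
without_blank=$(grep -r -i -l -f "$cleaned" "$site")

echo "files matched with blank line: $with_blank"
echo "files matched after cleanup:   $without_blank"
rm -rf "$site" "$patterns" "$cleaned"
```

In this sketch only the page that actually contains the broken HREF is listed once the blank line is gone.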
LVL 9

Expert Comment

by:jeremycrussell
ID: 33522868
Ah... I didn't even notice the -f option, even when I retyped it. My bad.
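For reference, a loop along the lines suggested earlier, with the -f slip removed, might look like the sketch below. Each line of the list is passed as the pattern itself (via -e) rather than as the name of a pattern file. The sample site, file names, and list contents are invented for the example, and -F (fixed strings) is a choice made here, not something from the thread.

```shell
# Hypothetical downloaded site and list of broken paths.
site=$(mktemp -d)
printf '<a href="/old/page.html">old</a>\n' > "$site/a.html"
printf '<a href="/gone.html">gone</a>\n' > "$site/b.html"
printf '<a href="/index.html">home</a>\n' > "$site/c.html"
list=$(mktemp)
printf '%s\n' '/old/page.html' '/gone.html' > "$list"

report=""
# IFS= and -r keep each line intact (no trimming, no backslash handling).
while IFS= read -r link; do
  [ -n "$link" ] || continue   # skip blank lines
  # -F: fixed string, -e: the pattern follows, -l: list file names only.
  # The unquoted $(...) assumes the paths contain no whitespace.
  for page in $(grep -r -l -F -e "$link" "$site"); do
    report="${report}${link} ${page##*/}
"
  done
done < "$list"

printf '%s' "$report"
rm -rf "$site" "$list"
```

A per-pattern loop is slower than a single grep -f over the whole list, but it shows which broken link each page contains.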
LVL 2

Expert Comment

by:Mohan Shivaiah
ID: 33537017
sed -n '/www.website.com/,$ p' brokenhtml.txt > <out_put_file>
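The comment gives no explanation. As written, a /pattern/,$ address range combined with -n and p prints everything from the first line matching the pattern through the end of the file. A tiny sketch on invented input (the sample lines and variable name are hypothetical):

```shell
# Invented input: two lines of preamble, then URL lines.
tmp=$(mktemp)
printf '%s\n' \
  'header line' \
  'another header line' \
  'http://www.website.com/old/page.html' \
  'http://www.website.com/gone.html' > "$tmp"

# -n suppresses default output; the range runs from the first line
# matching the pattern to the last line ($), and p prints that range.
tail_part=$(sed -n '/www.website.com/,$ p' "$tmp")

printf '%s\n' "$tail_part"
rm -f "$tmp"
```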