So, as the title says, I'm trying to get a list of broken links. More importantly, though, I want to know which pages they're on.
This is what I tried:
wget --spider -r -o log.txt gets me the list of broken links.
I then used wget -r to download the entire website.
I then had to use "cat log.txt | sed 's/http:\/\/www.website.com\//\//' > brokenhtml.txt" to replace "http://www.website.com/" with "/" in the broken links, so that they matched the HREFs in the HTML of the downloaded pages.
I then used grep -rlf brokenhtml.txt www.website
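Put together, the steps above look roughly like the script below. This is only a sketch of what I ran, not a fix: www.website.com is a placeholder for the real site, the function names and the SITE and SITE_DIR variables are made up for illustration, and I'm assuming wget -r saved the pages into a directory named after the host.

```shell
#!/bin/sh
# Sketch of the steps above. www.website.com is a placeholder; the
# function and variable names are hypothetical, chosen for illustration.
SITE="http://www.website.com"
SITE_DIR="www.website.com"   # assumed: directory created by wget -r

# Step 1: crawl without saving pages; wget's messages go to log.txt.
crawl_for_broken_links() {
    wget --spider -r -o log.txt "$SITE/"
}

# Step 2: download the whole site.
mirror_site() {
    wget -r "$SITE/"
}

# Step 3: turn "http://www.website.com/path" into "/path" so the
# entries look like the root-relative HREFs in the downloaded HTML.
strip_host() {
    sed 's/http:\/\/www.website.com\//\//'
}

# Step 4: list the downloaded files that match any line of
# brokenhtml.txt (every line of that file is used as a pattern).
find_pages() {
    grep -rlf brokenhtml.txt "$SITE_DIR"
}
```

Running crawl_for_broken_links, mirror_site, "strip_host < log.txt > brokenhtml.txt" and find_pages in that order reproduces what I did. Note that log.txt goes through strip_host whole, exactly as in my original command.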
Everything up to the last step produces the expected results. When I cat log.txt and paste a random URL from it into a browser window, it correctly gives me a 404. But when I run the last step to find the pages that contain those HREFs, I get pages with no broken links on them.