• Status: Solved

Grep for all Page Titles

I need to get a text file that lists all of the titles (text within the <title></title> tags) for every webpage on my site. I imagine I need to use the grep command. I would like to write it to a file in my user directory:

> $HOME/me

Please assist!

spreeez
0
 
SixpaxCommented:
Put yourself in the parent directory where the web page files start.  Then run this:

find . -type f -name '*.htm*' -exec grep -i title {} \;

This won't capture the whole title, though, if it spans more than 2 lines of text, because grep only prints the lines that contain the word "title".  If that's the case, let me know and I'll see if I can write you up a quick and dirty script to do it.
0
 
SixpaxCommented:
BTW, that will send output to the terminal, not your file.

If you are happy with the results, run the same command with this appended to the end:

  > $HOME/me

0
 
spreeezAuthor Commented:
Sixpax,

Thanks for the reply. Yes, this worked fairly well, although the results text file is a bit discombobulated because of cases where there was a form in which a field name had a name of title or other similar cases. Also, there are cases in which a title has a carriage return in it.

Is there any way to have the find include the actual brackets of the <title></title> tags? Maybe a script would perform better?

spreeez
0
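One way to narrow the match to the tag itself, rather than to any line containing the word "title", is to put the opening bracket in the pattern. A minimal sketch, assuming the titles open with a literal <title in any case; list_title_lines is a hypothetical helper name, not something from the thread:

```shell
# Print "file:line" for every line that contains an opening <title tag.
# -i ignores case, so <TITLE> matches too. /dev/null is passed as a second
# file so that grep always prefixes the file name to each match.
list_title_lines() {
    find "${1:-.}" -type f -name '*.htm*' -exec grep -i '<title' {} /dev/null \;
}
```

Called as "list_title_lines . > $HOME/me", this should skip form fields that merely have name="title". It still prints only the line holding the opening tag, so a title that wraps onto further lines is cut short.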
 
SixpaxCommented:
I went ahead and wrote you a little script to do this.  I assume this is Linux; if not, let me know which OS so I can modify it.  Be sure to run this from the parent directory of your HTML files:

#!/bin/bash
for FILE in $(find . -type f -name '*.htm*'); do
   FIRST=$(grep -in "<title>" $FILE | cut -d: -f1)
   LAST=$(grep -in "</title>" $FILE | cut -d: -f1)
   (( DIFF = FIRST - LAST + 1 ))
   cat $FILE | tail -n +${FIRST} | head -n ${DIFF}
done

That should give you what you want.  Make sure after you create the script that you run "chmod 755 scriptname" to make it executable.  When you run it the output will go to the screen, so once again, send the results to your file:

./scriptname > $HOME/me

0
 
SixpaxCommented:
I can also modify it to include the file name in the output, some sort of separator, and strip the html tags.  That would make your text file more readable... something like this:

index.html===================================
Welcome to my web page
---------------------------------------------
search.html==================================
Here is a list of search engines
---------------------------------------------
suggestions.html=============================
Give any and all suggestions
you might have for me here.
---------------------------------------------

Would something like that be beneficial?
0
 
spreeezAuthor Commented:
Yes, that separated page would be more readable and beneficial. The operating system is SunOS 5.8.

Also, I am familiar with changing permissions in UNIX, but what file extension do I give to the script name?

Thanks for the help so far!
spreeez
0
 
SixpaxCommented:
Since you are not using Linux, you'll have to make some changes.  I redid the script so it will run for you (and included the new format):

#!/bin/ksh
for FILE in $(find . -type f -name '*.htm*'); do
  FIRST=$(grep -in "<title>" $FILE | cut -d: -f1)
  LAST=$(grep -in "</title>" $FILE | cut -d: -f1)
  (( DIFF = FIRST - LAST + 1 ))
  printf "%-30.30s\n" "$(basename $FILE)=============================="
  cat $FILE | tail -n +${FIRST} | head -n ${DIFF} | sed 's:</*title>::g' | sed 's:</*TITLE>::g'
  echo "------------------------------"
done

Let me know how it works for ya.
0
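A hedged alternative sketch of the same report format. It swaps in an awk range pattern for the line-number arithmetic, so it is not the script from the thread. It assumes an awk that provides tolower (nawk, gawk, or a POSIX awk) and file names without newlines; print_titles is a hypothetical name:

```shell
# Print each page's title between a "name====" header and a "----" footer.
print_titles() {
    find "${1:-.}" -type f -name '*.htm*' -print | while IFS= read -r file; do
        printf '%-30.30s\n' "$(basename "$file")=============================="
        # Keep the lines from the opening tag through the closing tag
        # (any case), strip the tags and whatever surrounds them on those
        # lines, then drop lines that end up empty.
        awk 'tolower($0) ~ /<title>/, tolower($0) ~ /<\/title>/' "$file" |
            sed -e 's:.*<[Tt][Ii][Tt][Ll][Ee]>::' \
                -e 's:</[Tt][Ii][Tt][Ll][Ee]>.*::' |
            sed '/^[[:space:]]*$/d'
        echo "------------------------------"
    done
}
```

Run from the web root as "print_titles . > $HOME/me". The two sed expressions match only the tag and the text outside it, so a title that opens and closes on one line keeps its text.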
 
SixpaxCommented:
BTW, some people use the convention of .scr or .ksh for their scripts, but in UNIX, it's not necessary to give it any extension.
0
 
spreeezAuthor Commented:
I tried running the script and I get a "not found" error, specifically:

ksh: ./titles:  not found

I also tried the two file extensions and set the right permissions -- but the same error occurred. I ensured that the file was there(!) by doing a cat titles -- so it definitely is there. Maybe the file has a problem with the first line:

#!/bin/ksh

What does this mean?

spreeez
0
 
spreeezAuthor Commented:
Maybe that first line is referring to a library of functions, but the library is in a different spot on my system?
0
 
SixpaxCommented:
First of all, I noticed a typo on my part.  The DIFF calculation should read like this:

(( DIFF = LAST - FIRST + 1 ))

But that's not the cause of the error.  If you are giving your script an extension, you have to type that as part of the file name when you run the script.  If you named your script titles.scr, you execute it like this:

./titles.scr

Also, to be on the safe side, add this line after the "#!/bin/ksh":

export PATH=$PATH:/bin:/usr/bin:/sbin:/usr/sbin

By the way, that first line just tells it what shell to use... /bin/ksh is the Korn shell.
0
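To see why the operand order matters, here is a small worked sketch of the line-slicing idea on a hypothetical sample page whose title spans lines 2 to 4. The sample file and its contents are assumptions for illustration:

```shell
# Build a hypothetical sample page; the title occupies lines 2-4.
tmp=$(mktemp)
printf '<html>\n<title>\nGive any and all\n</title>\n<body></body>\n' > "$tmp"

# Line numbers of the first opening and first closing tag.
FIRST=$(grep -in '<title>' "$tmp" | head -n 1 | cut -d: -f1)     # 2
LAST=$(grep -in '</title>' "$tmp" | head -n 1 | cut -d: -f1)     # 4

# Lines to keep, counting both tag lines: 4 - 2 + 1 = 3.
# With the operands reversed the result would be 2 - 4 + 1 = -1.
DIFF=$(( LAST - FIRST + 1 ))

# Skip ahead to line FIRST, then keep DIFF lines.
title_block=$(tail -n +"$FIRST" "$tmp" | head -n "$DIFF")
printf '%s\n' "$title_block"
rm -f "$tmp"
```

The slice comes out as the three lines from <title> through </title>. A negative count, as the reversed subtraction gives, is not a meaningful number of lines for head to keep.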
 
spreeezAuthor Commented:
Yes, I tried with and without the extensions. I just renamed the file to all_titles1.scr, which contains:

#!/bin/ksh
export PATH=$PATH:/bin:/usr/bin:/sbin:/usr/sbin
for FILE in $(find . -type f -name '*.htm*'); do
 FIRST=$(grep -in "<title>" $FILE | cut -d: -f1)
 LAST=$(grep -in "</title>" $FILE | cut -d: -f1)
 (( DIFF = LAST - FIRST + 1 ))
 printf "%-30.30s\n" "$(basename $FILE)=============================="
 cat $FILE | tail -n +${FIRST} | head -n ${DIFF} | sed 's:</*title>::g' | sed 's:</*TITLE>::g'
 echo "------------------------------"
done

$ ./all_titles1.scr
ksh: all_titles1.scr: cannot execute

Then I changed the permissions:
$ chmod 775 all_titles1.scr

Then tried running it again
& ./all_titles1.scr > $HOME/all_titles_sixpax.txt
ksh: ./all_titles1.scr:  not found

-- and got the above error once again. I wonder what I am doing wrong! Thanks for your help so far.

spreeez
0
 
spreeezAuthor Commented:
(Oops, that was obviously supposed to be a dollar sign:)

$ ./all_titles1.scr > $HOME/all_titles_sixpax.txt
_
0
 
SixpaxCommented:
Your ksh might be in another directory besides /bin.  See if you can locate it.  Maybe it's in /usr/bin.

You can just run this if necessary:

find / -name ksh

Then fix the first line accordingly.
0
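A "not found" error for a script that clearly exists usually points at the interpreter named on the first line rather than at the script itself. The sketch below assumes ksh or another POSIX-style shell and uses the hypothetical helper name check_interpreter. It reads the interpreter path from the #! line and reports whether that path is executable. It also flags a carriage return at the end of that line, which can appear when a script is saved or transferred with Windows line endings. That is only one possible cause, and nothing in the thread confirms it was the problem here.

```shell
# Inspect the "#!" line of a script and report on the interpreter it names.
check_interpreter() {
    first_line=$(head -n 1 "$1")
    cr=$(printf '\r')
    # A trailing carriage return becomes part of the interpreter name.
    case $first_line in
        *"$cr") echo "first line ends with a carriage return"; return 1 ;;
    esac
    case $first_line in
        '#!'*) ;;
        *) echo "no #! line"; return 1 ;;
    esac
    # Strip "#!", any leading blanks, and any interpreter arguments.
    interp=${first_line#'#!'}
    interp=${interp#"${interp%%[! ]*}"}
    interp=${interp%% *}
    if [ -x "$interp" ]; then
        echo "interpreter ok: $interp"
    else
        echo "interpreter missing: $interp"
        return 1
    fi
}
```

For example, "check_interpreter ./all_titles1.scr" would print one of the three messages for the script discussed here.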
 
spreeezAuthor Commented:
I found the ksh and it was in /usr/bin:
#!/usr/bin/ksh

Still, I get the error:
ksh: ./all_titles1.scr:  not found

Could it be that your script uses functions in the ksh and my ksh is out of date or is an earlier version?

spreeez

0
 
spreeezAuthor Commented:
BTW, I actually have three ksh in the /usr/bin sub-directory: ksh, pfksh and rksh. Will any of these be helpful?
0
 
spreeezAuthor Commented:
I just rejected your previous answer so the question can be re-opened to other experts. Thanks for your help so far sixpax!
0
 
SixpaxCommented:
No problem on the reject.  I should have waited before submitting it as an answer.

I probably misled you with my instructions.  I think the problem is your script is in a different directory than you are trying to run it from.  All you have to do is specify the path with the script name, such as:

/home/spreeez/scripts/all_titles1.scr > $HOME/me

or wherever you have it saved.

Hope that works!
0
 
spreeezAuthor Commented:
Yes, I tried placing the script in the webpage root, and running it from there, and I've tried placing the script in my home directory, and then cd to the webpage root, and run it from there, e.g.:

$ pwd
$ /users/spreeez
$ ls -l *.scr
-rwxrwxr-x 1 spreeez  web 457 Mar 14 12:18 all_titles1.scr
$ cd /nsprod/ns-home/myhome
$ ./users/spreeez/all_titles1.scr
ksh: ./users/ricse01/all_titles1.scr:  not found

and

$ pwd
$ /nsprod/ns-home/myhome
$ ls -l *.scr
-rwxrwxr-x 1 spreeez  web 457 Mar 14 12:18 all_titles1.scr
$ ./all_titles1.scr
ksh: ./users/ricse01/all_titles1.scr:  not found

Any ideas?

spreeez
0
 
SixpaxCommented:
Yes, do it exactly like you did the first way except take the "." out:

$ /users/spreeez/all_titles1.scr

The "." refers to the current directory.
0
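The difference between the two spellings can be checked directly. This sketch builds a throwaway directory tree (all paths are hypothetical). It shows that a path starting with "./" is resolved against the current directory, while a path starting with "/" is resolved from the root no matter where the command is run:

```shell
# Scratch tree: $base/home/hello.sh plus an unrelated $base/web directory.
# mktemp -d returns an absolute path, so $base starts with "/".
base=$(mktemp -d)
mkdir -p "$base/home" "$base/web"
printf '#!/bin/sh\necho hello\n' > "$base/home/hello.sh"

cd "$base/web" || exit 1

# Relative: this looks for $base/web/home/hello.sh, which does not exist.
if [ -f "./home/hello.sh" ]; then relative=found; else relative=missing; fi

# Absolute: the full path resolves the same way from any directory.
if [ -f "$base/home/hello.sh" ]; then absolute=found; else absolute=missing; fi

echo "relative: $relative, absolute: $absolute"
cd / && rm -rf "$base"
```

The relative lookup reports "missing" and the absolute one "found", which mirrors running ./users/spreeez/all_titles1.scr from /nsprod/ns-home/myhome.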
 
ecwCommented:
On solaris ...
  cd to root of web pages,
  find . -print |
    xargs nawk '/<title>/,/<\/title>/{print FILENAME":"$0}'
0
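The range pattern in that one-liner prints every line from a line matching <title> through the next line matching </title>, each prefixed with the file name. Here is a small sketch of the same idea on a hypothetical sample file. It uses plain awk in place of nawk (an assumption, so that it runs on systems without nawk) and limits find to regular files:

```shell
# Hypothetical sample page with a two-line title.
dir=$(mktemp -d)
printf '<html>\n<title>Give any and all\nsuggestions</title>\n<body>\n' > "$dir/s.html"

# pattern1,pattern2 selects each run of lines from a match of pattern1
# through the next match of pattern2, inclusive.
titles=$(find "$dir" -type f -print |
    xargs awk '/<title>/,/<\/title>/ { print FILENAME ":" $0 }')
printf '%s\n' "$titles"
rm -rf "$dir"
```

The patterns are case-sensitive, so a page that uses <TITLE> would need a second pattern or a case-folding match.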
 
spreeezAuthor Commented:
ecw,

That worked well, although it doesn't seem to be compiling a list of all the files in the webpage root. Is there any way to limit this to *.htm* and *.txt files? I ask because it is searching jpeg's and it is issuing "too long" errors.

spreeez
0
 
spreeezAuthor Commented:
Also, after running it, I cannot type in the terminal window. I have the blinking cursor next to the dollar sign $ but it doesn't accept keystrokes -- strange. I wonder if it is hung up on something.

spreeez
0
 
SixpaxCommented:
Use ctrl-c to try to kill the process.

To just search for the HTML files, change his find to this:

find . -type f -name '*.htm*'

I think you might have the same problem with his suggestion as you did with my initial one, although I don't know awk well enough to tell.

Did you ever try running my script again without the "." ??
0
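To cover both *.htm* and *.txt files, as asked above, the two -name tests can be grouped with -o. A minimal sketch; list_pages is a hypothetical helper name:

```shell
# List only regular files whose names match *.htm* or *.txt.
# The escaped parentheses group the two -name tests so that -type f
# applies to both of them.
list_pages() {
    find "${1:-.}" -type f \( -name '*.htm*' -o -name '*.txt' \) -print
}
```

The output can then be piped into the xargs command from the earlier comment in place of the bare "find . -print", so that images and other binary files are never handed to awk.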
 
ecwCommented:
As sixpax says, use -name '*.htm*', and install gawk if you can.  Both nawk and awk have a limited line length of around 1024 chars, so if your web pages aren't formatted nicely, they can barf.  GNU's awk, gawk, is much more sensible in this respect; I think it mallocs up the line length as it goes, so it can cope with badly formed input much better.  Usually I'm rather loath to resort to gnuisms to get a job done; often, for this kind of thing, where awk barfs, one can use sed, for example,
  sed -n '/<title>/,/<\/title>/p'
This is fine when <title> and </title> are known to be on separate lines, but can misreport if both are on the same line.

Horses for courses...

0
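The same-line caveat can be reproduced with a two-file sketch (both sample files are hypothetical). When the opening and closing tags share a line, sed only starts looking for the closing pattern on the following line, so the range runs on past the title:

```shell
dir=$(mktemp -d)
# Tags on separate lines: the range ends at the </title> line.
printf '<title>\nHome\n</title>\n<body>\n' > "$dir/split.html"
# Tags on one line: the end of the range is searched for from the NEXT
# line, so every remaining line is printed as well.
printf '<title>Home</title>\n<body>\n</body>\n' > "$dir/oneline.html"

split_out=$(sed -n '/<title>/,/<\/title>/p' "$dir/split.html")
oneline_out=$(sed -n '/<title>/,/<\/title>/p' "$dir/oneline.html")

printf 'split:\n%s\noneline:\n%s\n' "$split_out" "$oneline_out"
rm -rf "$dir"
```

For split.html only the three title lines come back. For oneline.html the whole file does, because no later line contains </title> to close the range.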
 
spreeezAuthor Commented:
Sixpax,

Yes, I tried  without the "." and I still get the not found error, yet the script is there, and I am calling it from the right location. The "not found" I believe is referring to something within the script.

spreeez
0
 
Steve JenningsIT ManagerCommented:
spreeez,

Type in sixpax's script at the command line from the directory where the files are.

Steve
0
 
spreeezAuthor Commented:
Steve, yes tried that.

I guess I will close this issue. I wasn't able to do exactly what I wanted but sixpax was the most helpful.
0
Question has a verified solution.
