Solved

Grep for all Page Titles

Posted on 2002-03-13
Last Modified: 2012-05-04
I need to get a text file that lists all of the titles (text within the <title></title> tags) for every webpage on my site. I imagine I need to use the grep command. I would like to write it to a file in my user directory:

> $HOME/me

Please assist!

spreeez
Question by:spreeez
29 Comments
 
LVL 1

Expert Comment

by:Sixpax
ID: 6861894
Put yourself in the parent directory where the web page files start.  Then run this:

find . -type f -name '*.htm*' -exec grep -i title {} \;

This won't work though if your titles are more than 2 lines of text.  If that's the case, let me know and I'll see if I can write you up a quick and dirty script to do it.
0
 
LVL 1

Expert Comment

by:Sixpax
ID: 6861903
BTW, that will send output to the terminal, not your file.

If you are happy with the results, run the same command with this appended to the end:

  > $HOME/me

0
 

Author Comment

by:spreeez
ID: 6862080
Sixpax,

Thanks for the reply. Yes, this worked fairly well, although the results text file is a bit discombobulated because of cases where there was a form in which a field name had a name of title or other similar cases. Also, there are cases in which a title has a carriage return in it.

Is there any way to have the find include the actual brackets of the <title></title> tags? Maybe a script would perform better?

spreeez
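
A minimal sketch of the bracket-aware search asked about above, assuming the pages are named *.htm or *.html and that the opening tag is written without a space after the "<". The function name list_title_lines is illustrative and does not come from the thread.

```shell
# list_title_lines DIR
# Print every line that contains an opening <title tag (any letter case)
# from the .htm/.html files under DIR, prefixed with the file name.
list_title_lines() {
    # /dev/null is passed as a second file so grep always prints file names.
    find "$1" -type f -name '*.htm*' -exec grep -i '<title' {} /dev/null \;
}
```

Matching on "<title" instead of "title" skips form fields such as name="title". It still prints only the line holding the opening tag, so a title that continues onto the next line is cut short.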
0
 
LVL 1

Expert Comment

by:Sixpax
ID: 6862834
I went ahead and wrote you a little script to do this.  I assume this is Linux, if not, let me know what OS so I can modify it.  Be sure to run this from the parent directory of your HTML files:

#!/bin/bash
for FILE in $(find . -type f -name '*.htm*'); do
   FIRST=$(grep -in "<title>" $FILE | cut -d: -f1)
   LAST=$(grep -in "</title>" $FILE | cut -d: -f1)
   (( DIFF = FIRST - LAST + 1 ))
   cat $FILE | tail -n +${FIRST} | head -n ${DIFF}
done

That should give you what you want.  Make sure after you create the script that you run "chmod 755 scriptname" to make it executable.  When you run it the output will go to the screen, so once again, send the results to your file:

./scriptname > $HOME/me

0
 
LVL 1

Expert Comment

by:Sixpax
ID: 6862851
I can also modify it to include the file name in the output, some sort of separator, and strip the HTML tags.  That would make your text file more readable... something like this:

index.html===================================
Welcome to my web page
---------------------------------------------
search.html==================================
Here is a list of search engines
---------------------------------------------
suggestions.html=============================
Give any and all suggestions
you might have for me here.
---------------------------------------------

would something like that be beneficial?
0
 

Author Comment

by:spreeez
ID: 6862889
Yes, that separated page would be more readable and beneficial. The operating system is SunOS 5.8.

Also, I am familiar with changing permissions in UNIX, but what file extension do I give to the script name?

Thanks for the help so far!
spreeez
0
 
LVL 1

Expert Comment

by:Sixpax
ID: 6862948
Since you are not using Linux, you'll have to make some changes.  I redid the script so it will run for you (and included the new format):

#!/bin/ksh
for FILE in $(find . -type f -name '*.htm*'); do
  FIRST=$(grep -in "<title>" $FILE | cut -d: -f1)
  LAST=$(grep -in "</title>" $FILE | cut -d: -f1)
  (( DIFF = FIRST - LAST + 1 ))
  printf "%-30.30s\n" "$(basename $FILE)=============================="
  cat $FILE | tail -n +${FIRST} | head -n ${DIFF} | sed 's:<.*title>::' | sed 's:<.*TITLE>::'
  echo "------------------------------"
done

Let me know how it works for ya.
0
 
LVL 1

Expert Comment

by:Sixpax
ID: 6862952
BTW, some people use the convention of .scr or .ksh for their scripts, but in UNIX, it's not necessary to give it any extension.
0
 

Author Comment

by:spreeez
ID: 6864822
I tried running the script and I get a "not found" error, specifically:

ksh: ./titles:  not found

I also tried the two file extensions and set the right permissions -- but the same error occurred. I made sure the file was there by doing a cat titles -- so it definitely is there. Maybe the file has a problem with the first line:

#!/bin/ksh

What does this mean?

spreeez
0
 

Author Comment

by:spreeez
ID: 6864870
Maybe that first line is referring to a library of functions, but the library is in a different spot on my system?
0
 
LVL 1

Accepted Solution

by:
Sixpax earned 50 total points
ID: 6864902
First of all, I noticed a typo on my part.  The DIFF calculation should read like this:

(( DIFF = LAST - FIRST + 1 ))

But that's not the cause of the error.  If you are giving your script an extension, you have to type that as part of the file name when you run the script.  If you named your script titles.scr, you execute it like this:

./titles.scr

Also, to be on the safe side, add this line after the "#!/bin/ksh":

export PATH=$PATH:/bin:/usr/bin:/sbin:/usr/sbin

By the way, that first line just tells it what shell to use... /bin/ksh is the Korn shell.
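
A worked example of the corrected DIFF arithmetic, using made-up line numbers: if <title> is on line 2 and </title> is on line 3, then DIFF = 3 - 2 + 1 = 2, so two lines are kept. The function name slice_lines and the sample file are assumptions for illustration only.

```shell
# slice_lines FIRST LAST FILE
# Print lines FIRST through LAST (inclusive) of FILE.
slice_lines() {
    first=$1
    last=$2
    diff=$(( last - first + 1 ))   # number of lines to keep
    tail -n +"$first" "$3" | head -n "$diff"
}

# Usage with a four-line sample file whose title spans lines 2 and 3.
sample=$(mktemp)
printf '<head>\n<title>Give any\nsuggestions</title>\n</head>\n' > "$sample"
slice_lines 2 3 "$sample"
rm -f "$sample"
```

With the original FIRST - LAST + 1 the count for this sample would be 2 - 3 + 1 = 0. When both tags sit on the same line, FIRST equals LAST and either formula gives 1, which is why the typo only affects multi-line titles.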
0
 

Author Comment

by:spreeez
ID: 6864950
Yes, I tried with and without the extensions. I just renamed the file to all_titles1.scr, which contains:

#!/bin/ksh
export PATH=$PATH:/bin:/usr/bin:/sbin:/usr/sbin
for FILE in $(find . -type f -name '*.htm*'); do
 FIRST=$(grep -in "<title>" $FILE | cut -d: -f1)
 LAST=$(grep -in "</title>" $FILE | cut -d: -f1)
 (( DIFF = LAST - FIRST + 1 ))
 printf "%-30.30s\n" "$(basename $FILE)=============================="
 cat $FILE | tail -n +${FIRST} | head -n ${DIFF} | sed 's:<.*title>::' | sed 's:<.*TITLE>::'
 echo "------------------------------"
done

$ ./all_titles1.scr
ksh: all_titles1.scr: cannot execute

Then I changed the permissions:
$ chmod 775 all_titles1.scr

Then tried running it again
& ./all_titles1.scr > $HOME/all_titles_sixpax.txt
ksh: ./all_titles1.scr:  not found

-- and got the above error once again. I wonder what I am doing wrong! Thanks for your help so far.

spreeez
0
 

Author Comment

by:spreeez
ID: 6865006
(Oops, that was obviously supposed to be a dollar sign:)

$ ./all_titles1.scr > $HOME/all_titles_sixpax.txt
_
0
 
LVL 1

Expert Comment

by:Sixpax
ID: 6865074
Your ksh might be in another directory besides /bin.  See if you can locate it.  Maybe it's in /usr/bin.

You can just run this if necessary:

find / -name ksh

Then fix the first line accordingly.
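
A hedged diagnostic sketch, not something tried in the thread: a "not found" error for a script that clearly exists can come from the interpreter named on the first line rather than from the script file. Two things worth checking are whether that interpreter path is executable and whether the first line ends in a carriage return, which can happen when a file is created on Windows and uploaded. The function name check_shebang is hypothetical, and it needs a shell with POSIX parameter expansion (ksh or bash, for example).

```shell
# check_shebang SCRIPT
# Report the interpreter named on the first line of SCRIPT and flag two
# possible problems: a trailing carriage return, or a non-executable path.
check_shebang() {
    first_line=$(head -n 1 "$1")
    cr=$(printf '\r')
    case $first_line in
        '#!'*) ;;
        *) echo "no #! line"; return 1 ;;
    esac
    case $first_line in
        *"$cr") echo "carriage return after interpreter path"; return 1 ;;
    esac
    # Strip the leading "#!" and one optional space, then keep the first word.
    interp=${first_line#'#!'}
    interp=${interp# }
    interp=${interp%% *}
    if [ -x "$interp" ]; then
        echo "ok: $interp"
    else
        echo "interpreter not executable: $interp"
        return 1
    fi
}
```

If a carriage return is reported, one way to strip them is tr -d '\r' < all_titles1.scr > all_titles1.fixed, followed by chmod on the new file.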
0

 

Author Comment

by:spreeez
ID: 6865203
I found the ksh and it was in /usr/bin:
#!/usr/bin/ksh

Still, I get the error:
ksh: ./all_titles1.scr:  not found

Could it be that your script uses functions in the ksh and my ksh is out of date or is an earlier version?

spreeez

0
 

Author Comment

by:spreeez
ID: 6865582
BTW, I actually have three ksh in the /usr/bin sub-directory: ksh, pfksh and rksh. Will any of these be helpful?
0
 

Author Comment

by:spreeez
ID: 6865867
I just rejected your previous answer so the question can be re-opened to other experts. Thanks for your help so far sixpax!
0
 
LVL 1

Expert Comment

by:Sixpax
ID: 6866048
No problem on the reject.  I should have waited before submitting it as an answer.

I probably misled you with my instructions.  I think the problem is that your script is in a different directory than the one you are trying to run it from.  All you have to do is specify the path with the script name, such as:

/home/spreeez/scripts/all_titles1.scr > $HOME/me

or wherever you have it saved.

Hope that works!
0
 

Author Comment

by:spreeez
ID: 6866240
Yes, I tried placing the script in the webpage root, and running it from there, and I've tried placing the script in my home directory, and then cd to the webpage root, and run it from there, e.g.:

$ pwd
$ /users/spreeez
$ ls -l *.scr
-rwxrwxr-x 1 spreeez  web 457 Mar 14 12:18 all_titles1.scr
$ cd /nsprod/ns-home/myhome
$ ./users/spreeez/all_titles1.scr
ksh: ./users/ricse01/all_titles1.scr:  not found

and

$ pwd
$ /nsprod/ns-home/myhome
$ ls -l *.scr
-rwxrwxr-x 1 spreeez  web 457 Mar 14 12:18 all_titles1.scr
$ ./all_titles1.scr
ksh: ./users/ricse01/all_titles1.scr:  not found

Any ideas?

spreeez
0
 
LVL 1

Expert Comment

by:Sixpax
ID: 6866265
Yes, do it exactly like you did the first way except take the "." out:

$ /users/spreeez/all_titles1.scr

The "." refers to the current directory.
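
To illustrate the point about the leading ".", here is a small sketch that checks how a path resolves from a given directory. The function name exists_from is made up, and the two temporary directories merely stand in for the home directory and the web root.

```shell
# exists_from DIR PATH
# Report whether PATH names a regular file when resolved from inside DIR.
# The cd happens in a subshell, so the caller's directory is unchanged.
exists_from() {
    if ( cd "$1" && [ -f "$2" ] ); then
        echo "found"
    else
        echo "not found"
    fi
}

# Usage: home_dir plays the role of /users/spreeez, web_dir the web root.
home_dir=$(mktemp -d)
web_dir=$(mktemp -d)
: > "$home_dir/all_titles1.scr"
exists_from "$web_dir" "$home_dir/all_titles1.scr"     # absolute path
exists_from "$web_dir" "./$home_dir/all_titles1.scr"   # "." glued onto it
rm -rf "$home_dir" "$web_dir"
```

The absolute path resolves from any directory. With "./" in front, the same path is looked up below the current directory, where it normally does not exist.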
0
 
LVL 5

Expert Comment

by:ecw
ID: 6868202
On solaris ...
  cd to root of web pages,
  find . -print |
    xargs nawk '/<title>/,/<\/title>/{print FILENAME":"$0}'
0
 

Author Comment

by:spreeez
ID: 6869478
ecw,

That worked well, although it doesn't seem to be compiling a list of all the files in the webpage root. Is there any way to limit this to *.htm* and *.txt files? I ask because it is searching JPEGs and issuing "too long" errors.

spreeez
0
 

Author Comment

by:spreeez
ID: 6869486
Also, after running it, I cannot type in the terminal window. I have the blinking cursor next to the dollar sign $ but it doesn't accept keystrokes -- strange. I wonder if it is hung up on something.

spreeez
0
 
LVL 1

Expert Comment

by:Sixpax
ID: 6869562
Use ctrl-c to try to kill the process.

To just search for the HTML files, change his find to this:

find . -type f -name '*.htm*'

I think you might have the same problem with his suggestion as you did with my initial one, although I don't know awk well enough to tell.

Did you ever try running my script again without the "." ??
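
The find above covers only the HTML files. A sketch that also picks up the *.txt files asked about earlier could group the two name tests with escaped parentheses; the function name list_pages is an illustrative assumption.

```shell
# list_pages DIR
# Print regular files under DIR whose names match *.htm* or end in .txt.
list_pages() {
    find "$1" -type f \( -name '*.htm*' -o -name '*.txt' \) -print
}
```

The parentheses are needed so that -type f applies to both name patterns, and the output can be piped into xargs in place of the plain "find . -print".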
0
 
LVL 5

Expert Comment

by:ecw
ID: 6876454
As Sixpax says, use -name '*.htm*', and install gawk if you can.  Both nawk and awk have a line-length limit of around 1024 chars, so they can barf if your web pages are not formatted nicely.  GNU's awk, gawk, is much more sensible in this respect; I think it mallocs up the line length as it goes, so it can cope with badly formed input much better.  Usually I'm rather loath to resort to GNUisms to get a job done.  Often for this kind of thing, where awk barfs, one can use sed, for example,
  sed -n '/<title>/,/<\/title>/p'
This is fine when <title> and </title> are known to be on separate lines, but it can misreport if both are on the same line.

Horses for courses...
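
A sketch of one way to handle both cases mentioned above (tags on the same line, or a title spread over several lines) with plain awk string functions. It assumes the opening tag is written exactly as <title> with no attributes, and it does nothing about the line-length limit of older awks. The function name print_titles and the separator style (borrowed from the output format proposed earlier in the thread) are illustrative.

```shell
# print_titles FILE...
# For each file, print a header line with its name, then the text between
# <title> and </title> (any letter case), then a separator line.
print_titles() {
    awk '
    FNR == 1 { inside = 0 }              # reset state at each new file
    {
        line = $0
        lower = tolower(line)
        if (!inside) {
            start = index(lower, "<title>")
            if (start == 0) next
            # Drop everything up to and including the opening tag.
            line = substr(line, start + 7)
            lower = substr(lower, start + 7)
            inside = 1
            print FILENAME "=============================="
        }
        stop = index(lower, "</title>")
        if (stop > 0) {
            # Keep only the text before the closing tag.
            line = substr(line, 1, stop - 1)
            inside = 0
        }
        if (line != "") print line
        if (!inside) print "------------------------------"
    }' "$@"
}
```

On Solaris the awk call would likely need to be nawk or gawk, as discussed above.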

0
 

Author Comment

by:spreeez
ID: 6879941
Sixpax,

Yes, I tried  without the "." and I still get the not found error, yet the script is there, and I am calling it from the right location. The "not found" I believe is referring to something within the script.

spreeez
0
 
LVL 16

Expert Comment

by:SteveJ
ID: 6889233
spreeez,

Type in sixpax's script at the command line from the directory where the files are.

Steve
0
 

Author Comment

by:spreeez
ID: 6889257
Steve, yes tried that.

I guess I will close this issue. I wasn't able to do exactly what I wanted but sixpax was the most helpful.
0
