kvjajoo

Need a shell script to copy HTML files from a remote server, read those files, extract particular data, and put the data in a new file

Hi,

I have been given a task to collect the usage stats from the multiple virtual hosts we have on a Linux web server and put a summary of the data into a new text file or Excel spreadsheet. Below is the explanation:

We have a Linux web server which hosts multiple virtual hosts. Each hosted site generates an HTML file at the end of each month which contains the following information:

Total Hits       418741
Total Files       377639
Total Pages       143324
Total Visits       29678
Total KBytes       15715617
Total Unique Sites       11912
Total Unique URLs       2722
Total Unique Referrers       2314
Total Unique Usernames       2
Total Unique User Agents       1251

This HTML file is generated by Webalizer. It is stored in /var/www/vhosts/<vhostname>/statistics/webstat/usage_<year><month>.html
For example: /var/www/vhosts/xyz.com/statistics/webstat/usage_200806.html
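
As a rough sketch of how that path could be built for a given host and month: the function name `stats_path` is made up for illustration, and the `-d` option assumes GNU date, which is the usual one on Linux.

```shell
#!/bin/bash
# Hypothetical helper: build the Webalizer report path for one vhost.
# $1 = vhost name, $2 = year and month as YYYYMM
stats_path() {
  printf '/var/www/vhosts/%s/statistics/webstat/usage_%s.html\n' "$1" "$2"
}

# Report for a fixed month
stats_path xyz.com 200806

# Report for the month before today (GNU date only). Counting back from the
# 15th avoids surprises on the 29th-31st, when "-1 month" can land in the
# wrong month.
lastmonth=$(date -d "$(date +%Y-%m-15) -1 month" +%Y%m)
stats_path xyz.com "$lastmonth"
```

Whether the current month or the previous one is the "latest" report depends on when the script runs relative to the Webalizer job, so pick whichever matches your schedule.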

The following HTML from the file displays the above data:

<TR><TD WIDTH=380><FONT SIZE="-1">Total Hits</FONT></TD>
<TD ALIGN=right COLSPAN=2><FONT SIZE="-1"><B>279242</B></FONT></TD></TR>
<TR><TD WIDTH=380><FONT SIZE="-1">Total Files</FONT></TD>
<TD ALIGN=right COLSPAN=2><FONT SIZE="-1"><B>209861</B></FONT></TD></TR>
<TR><TD WIDTH=380><FONT SIZE="-1">Total Pages</FONT></TD>
<TD ALIGN=right COLSPAN=2><FONT SIZE="-1"><B>44464</B></FONT></TD></TR>
<TR><TD WIDTH=380><FONT SIZE="-1">Total Visits</FONT></TD>
<TD ALIGN=right COLSPAN=2><FONT SIZE="-1"><B>9029</B></FONT></TD></TR>
<TR><TD WIDTH=380><FONT SIZE="-1">Total KBytes</FONT></TD>
<TD ALIGN=right COLSPAN=2><FONT SIZE="-1"><B>15578780</B></FONT></TD></TR>
<TR><TH HEIGHT=4></TH></TR>
<TR><TD WIDTH=380><FONT SIZE="-1">Total Unique Sites</FONT></TD>
<TD ALIGN=right COLSPAN=2><FONT SIZE="-1"><B>5606</B></FONT></TD></TR>
<TR><TD WIDTH=380><FONT SIZE="-1">Total Unique URLs</FONT></TD>
<TD ALIGN=right COLSPAN=2><FONT SIZE="-1"><B>204</B></FONT></TD></TR>
<TR><TD WIDTH=380><FONT SIZE="-1">Total Unique Referrers</FONT></TD>
<TD ALIGN=right COLSPAN=2><FONT SIZE="-1"><B>1025</B></FONT></TD></TR>
<TR><TD WIDTH=380><FONT SIZE="-1">Total Unique User Agents</FONT></TD>
<TD ALIGN=right COLSPAN=2><FONT SIZE="-1"><B>605</B></FONT></TD></TR>


Now I have one more Linux box which can access this web server via ssh or sftp with root credentials. I want a shell script which will be executed on the Linux box (not the web server) and should do the following:

1. Copy the latest usage_<year><month>.html from each virtual host's webstat directory on the web server. (Please note that the HTML file names are the same for each host, so the vhost name should be prepended to the file name; the new file will be <vhostname>usage_<year><month>.html.)

2. Grab the data I want (I have listed the required data at the beginning) from the HTML files and put it into one single file, with one vhost's data per line, so I can paste it into Excel. So the format should be something like this: <vhostname>  <Total Hits> <Total Files> <Total Pages> <Total Visits> and so on ....

I am an absolute beginner in shell scripting.
ASKER CERTIFIED SOLUTION
Gabriel Orozco
SOLUTION
Addendum:
   decode_file $Vhost
should be
   decode_file $VHOST

(case matters)
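
A small illustration of why the case matters, offered as a sketch only (the variable names come from the addendum above; the value `xyz.com` is just sample data). Bash treats `Vhost` and `VHOST` as two different variables, and an unset one silently expands to nothing:

```shell
#!/bin/bash
VHOST=xyz.com

lower="[$Vhost]"   # different name: unset, so it expands to nothing
upper="[$VHOST]"
echo "$lower $upper"   # prints: [] [xyz.com]

# With `set -u`, bash treats the unset name as an error instead of
# quietly passing an empty argument along.
if ! ( set -u; : "$Vhost" ) 2>/dev/null; then
  echo "Vhost is unset"
fi
```

Running a script under `set -u` while testing is one way to catch this kind of typo early.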
kvjajoo

ASKER

Hi Redimido, thanks for the reply. The format of the file does change, and it is not the same for every host. As far as I can see, you have written the script to work by line number, which can't help me. The rest of your script is just fantastic. Can we find a way to read the HTML file that grabs the data in a "given string + 1 line" fashion? That is, find the supplied string (strings like Total Hits, Total Files, Total Pages and Total Visits are common to each file) and grab its data from the next line, so that, from the above example, it becomes total hits = 279242, total files = 209861 and so on. Also, at the beginning of the file there is a line which says '<TITLE>Usage Statistics for xyz.com - June 2008</TITLE>'. I need to grab xyz.com from it and put it before the information so I will know which data is for which host.

Hi Mysidia, thanks very much for your reply as well. Can you give me a rough example of what your script does so I can ask you for the changes I need? Also, please keep in mind that there is a whole lot of other data in the file which I don't need. (I have pasted only the required bit in the question.) Also, at the beginning of the file there is a line which says '<TITLE>Usage Statistics for xyz.com - June 2008</TITLE>'. I need to grab xyz.com from it and put it before the information so I will know which data is for which host.
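
The "string + 1 line" lookup described above can be sketched directly against the raw HTML with awk, without relying on line numbers. This is only an illustration: the function names `get_title` and `get_value` are made up, the sample file is a trimmed copy of the markup quoted in the question, and it assumes the label and its value always sit on two consecutive lines as they do there.

```shell
#!/bin/bash
# Sample input: a trimmed copy of the markup quoted in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<HTML><HEAD>
<TITLE>Usage Statistics for xyz.com - June 2008</TITLE>
</HEAD>
<TR><TD WIDTH=380><FONT SIZE="-1">Total Hits</FONT></TD>
<TD ALIGN=right COLSPAN=2><FONT SIZE="-1"><B>279242</B></FONT></TD></TR>
<TR><TD WIDTH=380><FONT SIZE="-1">Total Unique User Agents</FONT></TD>
<TD ALIGN=right COLSPAN=2><FONT SIZE="-1"><B>605</B></FONT></TD></TR>
EOF

# Host name: fourth word of the <TITLE> line.
get_title() {   # $1 = file
  awk '/<TITLE>/ { print $4; exit }' "$1"
}

# Value for one label: find the line holding ">LABEL<", then take the
# number between <B> and </B> on the line that follows it.
get_value() {   # $1 = file, $2 = label such as "Total Hits"
  awk -v label="$2" '
    found {
      if (match($0, /<B>[0-9]+<\/B>/))
        print substr($0, RSTART + 3, RLENGTH - 7)
      exit
    }
    index($0, ">" label "<") { found = 1 }
  ' "$1"
}

# One tab-separated row, ready to paste into a spreadsheet.
row=$(printf '%s\t%s\t%s' \
  "$(get_title "$sample")" \
  "$(get_value "$sample" "Total Hits")" \
  "$(get_value "$sample" "Total Unique User Agents")")
echo "$row"   # prints: xyz.com, 279242 and 605 separated by tabs
rm -f "$sample"
```

Matching on `>LABEL<` rather than on the bare label keeps "Total Hits" from matching some longer label that merely starts with the same words.
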
Then a mix based on the excellent script Mysidia posted is in order.
It was not trivial, as some changes and some testing were needed.

Maybe you need to start digging into shell scripting books, as experts-exchange is all about advising and not about coding :-)

Anyway, here it is:

 
--------------8<-----------------
#!/bin/bash
 
# current year-month (change this if the report you want is last month's)
yearmonth=$(date +%Y%m)
 
# replace with your own
webserverip=1.2.3.4
vhdir=/var/www/vhosts
vhfile=usage_$yearmonth.html
backupdir=/path/to/the/dir/you/store/your/files
globaldatafile=/path/to/your/vhost/file
 
# getdata LABEL: print the value of "Total LABEL" found in $output,
# or the host name from the <TITLE> line of $localfile when LABEL is "Title".
function getdata {
  if [ "$1" = "Title" ]; then
    grep -E "^.TITLE" "$localfile" | awk '{print $4}'
  else
    echo "$output" | while read -r line
    do
      # strip the leading "Total" and the trailing number to get the label;
      # the NF guard skips empty lines, where NF-- would be an error in gawk
      var=$(echo "$line" | sed 's/^\s*Total\s*//' | awk 'NF { NF--; print }')
      if [ "$1" = "$var" ]; then
         echo "$line" | awk '{ print $NF }'
      fi
    done
  fi
}
 
# obtain the list of subdirectories on the other machine.
# adequate ssh key handling lets us avoid being asked for the password.
vhostlist=$(ssh "$webserverip" "cd $vhdir && ls")
 
for vhost in $vhostlist; do
  # prefix the local copy with the vhost name; the remote file name is the
  # same for every vhost, so without this only the first one gets processed
  localfile=$backupdir/$vhost$vhfile
 
  # avoid processing an already downloaded file twice
  if [ ! -f "$localfile" ]; then
    # skip vhosts that have no report for this month
    scp "$webserverip:$vhdir/$vhost/statistics/webstat/$vhfile" "$localfile" || continue
    
    output=$(/usr/bin/lynx -dump "$localfile")
    
    vhosttitle=$(getdata "Title")
    hits=$(getdata "Hits")
    files=$(getdata "Files")
    pages=$(getdata "Pages")
    visits=$(getdata "Visits")
    kbytes=$(getdata "KBytes")
    usites=$(getdata "Unique Sites")
    uurls=$(getdata "Unique URLs")
    urefer=$(getdata "Unique Referrers")
    uuagents=$(getdata "Unique User Agents")
 
    echo "$vhosttitle" "$hits" "$files" "$pages" "$visits" "$kbytes" "$usites" "$uurls" "$urefer" "$uuagents" >> "$globaldatafile"
  fi
done
--------------8<-----------------

This way you can add as many text fields as you want and get their values. Only "Title" is processed differently, as lynx cannot print it using -dump ;)
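
To illustrate the point about adding fields, here is one possible variation that drives the lookup from a list of labels and joins the values with tabs so the row pastes cleanly into Excel. It is a sketch under assumptions: the `output` text stands in for what `lynx -dump` is assumed to produce, and the names `fields` and `lookup` are invented for this example.

```shell
#!/bin/bash
# Stand-in for the text that `lynx -dump` would produce (assumed layout).
output='   Total Hits 279242
   Total Files 209861

   Total Unique Sites 5606
   Total Unique User Agents 605'

# Labels to report, in column order. Add or remove entries here.
fields=("Hits" "Files" "Unique Sites" "Unique User Agents")

# Print the last word of the line that reads "Total <label> <number>".
lookup() {   # $1 = label without the leading "Total"
  echo "$output" | awk -v label="$1" '
    $1 == "Total" {
      # rebuild the label from the words between "Total" and the number
      name = $2
      for (i = 3; i < NF; i++) name = name " " $i
      if (name == label) { print $NF; exit }
    }'
}

# Build one tab-separated row: host name first, then each value.
row="xyz.com"
for f in "${fields[@]}"; do
  row+=$'\t'$(lookup "$f")
done
echo "$row"
```

Adding a column then means adding one entry to `fields` rather than another variable and another `getdata` call.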