Solved

Parse HTML tags using awk and sed

Posted on 2009-04-13
Last Modified: 2013-12-26
Can someone please help with ideas on how to parse HTML tags using sed and awk on Linux?

For example,

test.html:
<TS>SOMETHING</TS><TD>EXAMPLE</TD>

My script should output just what's inside a given tag:

myscript TD
EXAMPLE  
myscript TS
SOMETHING

Any help is very much appreciated.

Many thanks,
Krish
Question by:Jkrish
LVL 48

Accepted Solution

by:
Tintin earned 250 total points
ID: 24132305
sed and awk aren't suitable tools for parsing HTML.  

*If* your HTML is consistently formatted as in the example above, then you can use:
#!/bin/sh

# -n with the p flag prints only the lines on which the tag was found
sed -n "s/.*<$1>\(.*\)<\/$1>.*/\1/p" test.html
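
To illustrate why the "consistently formatted" caveat matters: `.*` is greedy, so if the same tag appears twice on one line, the sed one-liner above keeps only the last occurrence. A minimal sketch of one way around that is shown here. The helper name `all_tags` is hypothetical, and `grep -o` is an assumption (GNU, BSD and BusyBox grep provide it, older systems may not). It also assumes the tag has no attributes and contains no nested tags.

```shell
#!/bin/sh
# all_tags TAG [FILE...]: hypothetical helper, prints one line per <TAG>...</TAG> pair.
# Assumption: grep supports -o (print only the matching part of each line).
# [^<]* stops at the next "<", so a match cannot run across a second tag.
all_tags() {
    tag=$1
    shift
    # grep isolates each <TAG>text</TAG> pair, sed then strips the tags themselves.
    grep -o "<$tag>[^<]*</$tag>" "$@" | sed "s/<[^>]*>//g"
}

# Example with a repeated tag on one line; prints A and then B on separate lines.
printf '<TD>A</TD><TD>B</TD>\n' | all_tags TD
```

With no file argument the function reads standard input; with more than one file, grep prefixes each match with the file name, so this sketch is best used on a single file.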

LVL 7

Assisted Solution

by:Murugesan Nagarajan
Murugesan Nagarajan earned 250 total points
ID: 24792939

Sample shell script using awk and sed commands (attached as test.txt).
 
LVL 7

Expert Comment

by:Murugesan Nagarajan
ID: 24986388
#!/bin/sh
# Same as the file attached previously (test.txt)
echo "OUTPUT FROM awk:"
# Export the tag name so that awk can read it through ENVIRON
PARAM="$1"
export PARAM
awk 'BEGIN {
    otag = "<" ENVIRON["PARAM"] ">"     # opening tag, e.g. <TD>
    ctag = "</" ENVIRON["PARAM"] ">"    # closing tag, e.g. </TD>
}
{
    start = index($0, otag)
    if (start == 0) next                # no opening tag on this line
    rest = substr($0, start + length(otag))
    end = index(rest, ctag)
    if (end == 0) next                  # no closing tag after it
    print substr(rest, 1, end - 1)      # text between the two tags
}' test.html
# The tag name is passed to awk in the environment variable PARAM.
# awk locates <PARAM> and </PARAM> with index() and
# displays the string that appears between them.
echo "


"

echo "OUTPUT FROM sed:"
sed -n "s/.*<$1>\(.*\)<\/$1>.*/\1/p" test.html
#      In sed, replace
#            .*<$1>\(.*\)<\/$1>.*
#            (any characters, then <$1>, then the captured text, then </$1>, then any characters;
#            the backslash in <\/ only escapes the slash, which is also the s/// delimiter)
#      with
#            \1
#            (the captured text, i.e. the string that appears between <$1> and </$1>).
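
The scripts above match the tag exactly as typed, so they miss `<td>` when asked for `TD` and miss any opening tag that carries attributes. A rough sketch of how the awk approach could be loosened is given below. The helper name `tag_text` is hypothetical, and the sketch still assumes that the opening and closing tags sit on the same line and that the tag is not nested inside itself.

```shell
#!/bin/sh
# tag_text TAG [FILE...]: hypothetical helper, prints the text inside each
# <TAG ...>...</TAG> pair, ignoring case and allowing attributes.
tag_text() {
    tag=$1
    shift
    awk -v tag="$tag" '
    BEGIN {
        tag = tolower(tag)
        open_re = "<" tag "([ \t][^>]*)?>"   # opening tag with optional attributes
        close_s = "</" tag ">"               # closing tag, lower case
    }
    {
        line = $0
        lower = tolower(line)                # lower-cased copy used only for searching
        while (match(lower, open_re)) {
            start = RSTART + RLENGTH         # first character after the opening tag
            line = substr(line, start)
            lower = substr(lower, start)
            end = index(lower, close_s)
            if (end == 0) break              # no closing tag on this line
            print substr(line, 1, end - 1)   # original-case text between the tags
            line = substr(line, end + length(close_s))
            lower = substr(lower, end + length(close_s))
        }
    }' "$@"
}

# Example: mixed case and an attribute; prints EXAMPLE
printf '<TS>SOMETHING</TS><td class="x">EXAMPLE</td>\n' | tag_text TD
```

Searching in a lower-cased copy while printing from the original line keeps the output in its original case. For anything beyond simple, single-line markup, a real HTML parser remains the safer choice, as noted in the accepted solution.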