Jkrish
asked on
Parse html tags using awk sed
Please can someone help with ideas on how to parse HTML tags using sed and awk in Linux?
for example,
test.html:
<TS>SOMETHING</TS><TD>EXAMPLE</TD>
My script should be able to just output what's inside a tag
myscript TD
EXAMPLE
myscript TS
SOMETHING
Any help is very much appreciated.
Many thanks,
Krish
SOLUTION
#Same from the attached file previously (test.txt)
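echo "OUTPUT FROM awk:"
export PARAM="$1"
awk 'BEGIN {
    otag = "<" ENVIRON["PARAM"] ">"     # opening tag, e.g. <TD>
    ctag = "</" ENVIRON["PARAM"] ">"    # closing tag, e.g. </TD>
}
{
    start = index($0, otag)
    if (start == 0) next                # no opening tag on this line
    rest = substr($0, start + length(otag))
    end = index(rest, ctag)
    if (end > 0) print substr(rest, 1, end - 1)
}' test.html
# In awk, read the exported environment variable $PARAM through ENVIRON
# Build the opening and closing tags from PARAM and locate them with index()
# Display the string that appears between <$PARAM>...</$PARAM>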
echo "
"
echo "OUTPUT FROM sed:"
sed "s/.*<$1>\(.*\)<\/$1>.*/\1/" test.html
# In sed replace
# .*<$1>\(.*\)<\/$1>.*
# any set of characters, then <$1>, then the captured text, then </$1> (the \/ is an escaped slash, not a backslash), then any set of characters.
# With
# \1
# Display only the captured string that appears between <$1> and </$1>.
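
The two approaches above can also be wrapped as small shell functions that read from standard input, which is closer to the "myscript TD" usage in the question. This is only a sketch: the function names extract_tag_sed and extract_tag_awk and the TAG variable are made up for illustration, and it assumes plain tags with no attributes, no nesting, and the opening and closing tag on the same line. The sed version prints one result per line (and nothing for lines that do not match, thanks to -n and the trailing p); the awk version loops so that it prints every occurrence on a line.

```shell
#!/bin/sh
# Hypothetical helpers (names are assumptions, not from the thread).

# sed variant: prints the text inside <TAG>...</TAG>, one result per input line.
# -n plus the trailing p suppresses lines where the substitution did not match.
extract_tag_sed() {
    sed -n "s/.*<$1>\(.*\)<\/$1>.*/\1/p"
}

# awk variant: prints the text of every <TAG>...</TAG> pair found on a line.
extract_tag_awk() {
    TAG=$1 awk '
    BEGIN { otag = "<" ENVIRON["TAG"] ">"; ctag = "</" ENVIRON["TAG"] ">" }
    {
        rest = $0
        # Walk the line left to right, one tag pair at a time.
        while ((s = index(rest, otag)) > 0) {
            rest = substr(rest, s + length(otag))
            e = index(rest, ctag)
            if (e == 0) break              # opening tag without a closing tag
            print substr(rest, 1, e - 1)
            rest = substr(rest, e + length(ctag))
        }
    }'
}

# Usage with the sample line from the question:
printf '%s\n' '<TS>SOMETHING</TS><TD>EXAMPLE</TD>' | extract_tag_sed TD
printf '%s\n' '<TS>SOMETHING</TS><TD>EXAMPLE</TD>' | extract_tag_awk TS
```

Because the sed pattern is greedy, a line with the same tag twice would not give two separate results with extract_tag_sed; the awk loop is the one to reach for in that case. For real-world HTML with attributes or nesting, a proper HTML parser is likely a safer choice than either.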