Solved

Parse HTML tags using awk and sed

Posted on 2009-04-13
Last Modified: 2013-12-26
Can someone please help with ideas on how to parse HTML tags using sed and awk on Linux?

For example:

test.html:
<TS>SOMETHING</TS><TD>EXAMPLE</TD>

My script should be able to just output what's inside a tag

myscript TD
EXAMPLE  
myscript TS
SOMETHING

Any help is very much appreciated.

Many thanks,
Krish
Question by:Jkrish

Accepted Solution

by:
Tintin earned 250 total points
ID: 24132305
sed and awk aren't suitable tools for parsing HTML.

*If* your HTML is consistently formatted as per above, then you can do
#!/bin/sh
# -n plus the p flag prints only lines where the tag matched
sed -n "s/.*<$1>\(.*\)<\/$1>.*/\1/p" test.html
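As a rough illustration of how the sed approach above behaves on the sample from the question, here is a minimal sketch. The function name `extract_tag` is hypothetical (it is not from the thread), and the sketch assumes one occurrence of the tag per line, no attributes, and no nesting.

```shell
#!/bin/sh
# Hypothetical helper: print the text inside <tag>...</tag>
# for each line of the file that contains the tag.
extract_tag() {
    # $1 = tag name, $2 = input file
    # -n suppresses default output; the p flag prints only substituted lines.
    sed -n "s/.*<$1>\(.*\)<\/$1>.*/\1/p" "$2"
}

# Recreate the sample file from the question.
printf '%s\n' '<TS>SOMETHING</TS><TD>EXAMPLE</TD>' > test.html

extract_tag TD test.html   # prints EXAMPLE
extract_tag TS test.html   # prints SOMETHING
```

Wrapping the command in a function that takes the file name as a second argument keeps it usable on files other than test.html.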


Assisted Solution

by:Murugesan Nagarajan
Murugesan Nagarajan earned 250 total points
ID: 24792939

Sample shell script using the awk and sed commands.

test.txt

Expert Comment

by:Murugesan Nagarajan
ID: 24986388
#!/bin/sh
# Same as the previously attached file (test.txt)
echo "OUTPUT FROM awk:"
export PARAM="$1"
awk 'BEGIN {
    # Build the opening and closing tags from the environment variable PARAM
    otag = "<" ENVIRON["PARAM"] ">"
    ctag = "</" ENVIRON["PARAM"] ">"
}
{
    start = index($0, otag)
    stop = index($0, ctag)
    # Print the text between the tags only when both are present
    if (start > 0 && stop > start)
        print substr($0, start + length(otag), stop - start - length(otag))
}' test.html
# The tag name is passed to awk through the environment variable PARAM.
# awk locates <$PARAM> and </$PARAM> with index()
# and displays the string that appears between them.
echo "


"

echo "OUTPUT FROM sed:"
sed -n "s/.*<$1>\(.*\)<\/$1>.*/\1/p" test.html
#      In sed, replace
#            .*<$1>\(.*\)<\/$1>.*
#            any characters, then <$1>, then the captured text, then </$1> (the slash is escaped as \/), then any characters
#      with
#            \1
#            the captured text, i.e. the string between <$1> and </$1>.
#      The -n option and the p flag print only the lines where a substitution was made.
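
The sed substitution above yields at most one result per line, so a line that contains the same tag several times loses all but one match. A possible workaround, sketched here under the same assumptions (no attributes, no nesting, opening and closing tag on the same line), is to loop over the line in awk. The function name `extract_all` and the file `multi.html` are hypothetical and used only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print every <tag>...</tag> body, one per line.
extract_all() {
    # $1 = tag name, $2 = input file
    TAG=$1 awk '
    BEGIN { otag = "<" ENVIRON["TAG"] ">"; ctag = "</" ENVIRON["TAG"] ">" }
    {
        rest = $0
        # Repeatedly find the next opening tag, then its closing tag.
        while ((s = index(rest, otag)) > 0) {
            rest = substr(rest, s + length(otag))
            e = index(rest, ctag)
            if (e == 0) break            # no closing tag left on this line
            print substr(rest, 1, e - 1)
            rest = substr(rest, e + length(ctag))
        }
    }' "$2"
}

# Illustrative input with the same tag twice on one line.
printf '%s\n' '<TD>ONE</TD><TS>SKIP</TS><TD>TWO</TD>' > multi.html

extract_all TD multi.html   # prints ONE, then TWO
```

Using index() and substr() instead of a regular expression avoids greedy matching and treats the tag name as a literal string.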

Question has a verified solution.