Solved

Parse html tags using awk sed

Posted on 2009-04-13
5
2,669 Views
Last Modified: 2013-12-26
Please can someone help with ideas on how to parse html tags using sed and awk in linux

for example,

test.html:
<TS>SOMETHING</TS><TD>EXAMPLE</TD>

My script should be able to just output what's inside a tag

myscript TD
EXAMPLE  
myscript TS
SOMETHING

Any help is very much appreciated.

Many thanks,
Krish
0
Comment
Question by:Jkrish
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 2
5 Comments
 
LVL 48

Accepted Solution

by:
Tintin earned 250 total points
ID: 24132305
sed and awk aren't suitable tools for parsing HTML.  

*if* you HTML is consistently formatted as per above, then you can do
#!/bin/sh
sed "s/.*<$1>\(.*\)<\/$1>.*/\1/g" test.html

Open in new window

0
 
LVL 9

Assisted Solution

by:Murugesan Nagarajan
Murugesan Nagarajan earned 250 total points
ID: 24792939

Sample shell scripting for awk, sed commands.

Open in new window

test.txt
0
 
LVL 9

Expert Comment

by:Murugesan Nagarajan
ID: 24986388
#!/bin/sh
#Same from the attached file previously (test.txt)
echo "OUTPUT FROM awk:"
export PARAM=$1
awk -F"$PARAM" 'BEGIN{a=ENVIRON["PARAM"]}
{
{
if(substr($2, 1,4)=="<ENVIRON["PARAM"]>")
{
printf "%s      %s\n", substr($1, 6),substr($1, 1,1);
}
printf substr($0, index($0,"<"ENVIRON["PARAM"]">")+4, -4+index($0,"</"ENVIRON["PARAM"]">")-index($0,"<"ENVIRON["PARAM"]">"))"\n";
}
}' test.html
# In awk set the environment variable $PARAM
# Take PARAM as delimiter
# Dispaly the string that appears between <$PARAM>...</PARAM>
echo "


"

echo "OUTPUT FROM sed:"
sed "s/.*<$1>\(.*\)<\/$1>.*/\1/g" test.html
#      In sed replace
#            .*<$1>\(.*\)<\/$1>.*
#            any set of characters followed by <$PARAM>any set of characters excluding backslash.
#      With
#            \1
#            Display the string that appears between any set of characters followed AND backslash.
0

Featured Post

Independent Software Vendors: We Want Your Opinion

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Building a website can seem like a daunting task to the uninitiated but it really only requires knowledge of two basic languages: HTML and CSS.
Today, the web development industry is booming, and many people consider it to be their vocation. The question you may be asking yourself is – how do I become a web developer?
In this tutorial viewers will learn how to style transparent/translucent elements using alpha transparency in CSS Start with a normal styled element, such as a div.: Define its "background-color" property as "rgba (255, 255, 255, .5): The numbers in…
In this tutorial viewers will learn how to embed an audio file in a webpage using HTML5. Ensure your DOCTYPE declaration is set to HTML5: : The declaration should display (CODE) HTML5 is supported by the most recent versions of all major browsers…

718 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question