
Parse html tags using awk sed

Posted on 2009-04-13
Last Modified: 2013-12-26
Can someone please help with ideas on how to parse HTML tags using sed and awk on Linux?

For example,

test.html:
<TS>SOMETHING</TS><TD>EXAMPLE</TD>

My script should output just the text inside a given tag:

myscript TD
EXAMPLE  
myscript TS
SOMETHING

Any help is very much appreciated.

Many thanks,
Krish
Question by:Jkrish

Accepted Solution

by:
Tintin earned 1000 total points
ID: 24132305
sed and awk aren't suitable tools for parsing HTML.  

*If* your HTML is consistently formatted as in the example above, then you can do:
#!/bin/sh
sed "s/.*<$1>\(.*\)<\/$1>.*/\1/g" test.html
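As written, the one-liner above also echoes any line that does not contain the tag, unchanged. A minimal sketch of a variant that prints only matching lines is shown here. The function name `extract_tag` and the temporary sample file are assumptions made for illustration, not part of the original answer, and the sketch still assumes one tag pair per line and a tag name with no regex metacharacters.

```shell
#!/bin/sh
# extract_tag TAG FILE  (hypothetical helper name, not from the thread)
# Prints the text between <TAG> and </TAG> for each line that contains the pair.
# Assumes at most one such pair per line and a tag name without regex metacharacters.
extract_tag() {
    # -n suppresses sed's default output; the p flag prints only lines
    # on which the substitution actually matched
    sed -n "s/.*<$1>\(.*\)<\/$1>.*/\1/p" "$2"
}

# Example run against the sample line from the question
sample=$(mktemp)
printf '%s\n' '<TS>SOMETHING</TS><TD>EXAMPLE</TD>' > "$sample"
extract_tag TD "$sample"   # prints EXAMPLE
extract_tag TS "$sample"   # prints SOMETHING
rm -f "$sample"
```

Because `.*` is greedy, a line with two `<TD>` pairs would yield only the last one, so this remains suitable only for consistently formatted input.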



Assisted Solution

by:Murugesan Nagarajan
Murugesan Nagarajan earned 1000 total points
ID: 24792939

Sample shell script using awk and sed commands.


test.txt

Expert Comment

by:Murugesan Nagarajan
ID: 24986388
#!/bin/sh
# Same as the previously attached file (test.txt)
echo "OUTPUT FROM awk:"
export PARAM="$1"
awk 'BEGIN {
    otag = "<"  ENVIRON["PARAM"] ">"    # opening tag, e.g. <TD>
    ctag = "</" ENVIRON["PARAM"] ">"    # closing tag, e.g. </TD>
}
{
    start = index($0, otag)
    stop  = index($0, ctag)
    # Print the text between the tags when both appear, in that order
    if (start > 0 && stop > start)
        print substr($0, start + length(otag), stop - start - length(otag))
}' test.html
# In awk, read the tag name from the environment variable PARAM
# Build the opening and closing tags from it
# Display the string that appears between <$PARAM>...</$PARAM>
echo "


"

echo "OUTPUT FROM sed:"
sed "s/.*<$1>\(.*\)<\/$1>.*/\1/g" test.html
#      In sed, replace
#            .*<$1>\(.*\)<\/$1>.*
#            i.e. the whole line: any characters, <$1>, a captured group \(.*\), </$1>, any characters
#            (the backslash in <\/ only escapes the slash, which is the s command delimiter)
#      with
#            \1
#            i.e. the captured text between <$1> and </$1>.
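
Both commands above assume a single tag pair per line. As a rough sketch only, the awk approach can be extended to print every occurrence on a line by scanning with `index` in a loop. The function name `extract_all` and the variable name `TAG` are hypothetical, chosen for this example; tags that span several lines or carry attributes are still not handled.

```shell
#!/bin/sh
# extract_all TAG FILE  (hypothetical helper name, not from the thread)
# Prints the text of every <TAG>...</TAG> pair found on each line of FILE.
extract_all() {
    # Pass the tag name through the environment so awk sees it literally
    TAG="$1" awk '
    BEGIN {
        otag = "<"  ENVIRON["TAG"] ">"
        ctag = "</" ENVIRON["TAG"] ">"
    }
    {
        rest = $0
        # Find the next opening tag, then the first closing tag after it
        while ((s = index(rest, otag)) > 0) {
            rest = substr(rest, s + length(otag))
            e = index(rest, ctag)
            if (e == 0) break                 # no closing tag left on this line
            print substr(rest, 1, e - 1)      # text between the pair
            rest = substr(rest, e + length(ctag))
        }
    }' "$2"
}
```

`index` compares literal strings, so the tag name is not interpreted as a regular expression here, unlike in the sed version.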