jrram asked:

Help with a 'grep' statement

I need help with a grep statement.  Suppose I have the below XML code snippet stored in a variable called 'indicator'.  I am using the grep statement below in a loop to extract the conditions one at a time.

export condition=`echo $indicator | grep -o "<condition cid=\"\$COND_NUM\">*.*<\/condition>"`

On the first pass of the loop, COND_NUM will equal 2, so I'm expecting to get only this condition, but everything gets returned.  I think the problem is that I am using *.*<\/condition> in the grep statement, and it's matching the second <\/condition> at the end of the file instead of the first one it comes to.

How can I modify my grep statement to only get the first condition?
<condition cid="1">
        <description>TRN.MERCHANT_NAME1 = substr(VEN.SCRUB_NAME1,1,length(TRN.MERCHANT_NAME1)))</description>
 
        <change_sql>UPDATE AP_VENDOR SET NAME1='NAME1_6A', SCRUB_NAME1='VWXYZabcde' WHERE VENDOR_ID='VENID-6';</change_sql>                                
        <change_sql>UPDATE PCD_TRANSACTION SET MERCHANT_NAME1='VWXYZ', TRANSACTION_DATE=(SELECT INVOICE_DATE FROM AP_VOUCHER WHERE VOUCHER_ID='OSTBU-6') WHERE MERCHANT_ID='6';</change_sql>
        <change_verify_sql>SELECT COUNT(*) FROM PCD_TRANSACTION WHERE MERCHANT_NAME1='VWXYZ';</change_verify_sql>
        <change_verify_count>1</change_verify_count>
</condition>
<condition cid="2">
        <description>(VEN.SCRUB_NAME1 = substr(TRN.MERCHANT_NAME1,1,length(VEN.SCRUB_NAME1))</description>
 
        <change_sql>UPDATE AP_VENDOR SET NAME1='NAME1_5A', SCRUB_NAME1='ABCDE' WHERE VENDOR_ID='VENID-5';</change_sql>                                
        <change_sql>UPDATE PCD_TRANSACTION SET MERCHANT_NAME1='ABCDEjihgf', TRANSACTION_DATE=(SELECT INVOICE_DATE FROM AP_VOUCHER WHERE VOUCHER_ID='OSTBU-5') WHERE MERCHANT_ID='5';</change_sql>
        <change_verify_sql>SELECT COUNT(*) FROM PCD_TRANSACTION WHERE MERCHANT_NAME1='ABCDEjihgf';</change_verify_sql>
        <change_verify_count>1</change_verify_count>
</condition>


Hugh Fraser

Grep doesn't support multi-line patterns. Try this awk script as a starting point.

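# Quote the variable so any newlines it contains reach awk intact
echo "$indicator" | awk 'BEGIN {x=0}
{
if ($0~"<condition cid=\"1\">") {x=1}   # opening tag: start printing
if (x==1) {print $0}
if ($0~"</condition>") {x=0}            # closing tag: stop printing
}'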

I'm not an expert awk programmer, so you may have to play with the substitution for cid="n", or just write a couple of scripts with different values.
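
One way to handle the cid="n" substitution is to pass the shell value into awk with -v instead of editing the script for each number. This is only a sketch: the sample data and the awk variable names `cid` and `inside` are made up for illustration, and it assumes the variable really does hold one tag per line.

```shell
# Hypothetical sample data: two conditions, one tag per line.
indicator='<condition cid="1">
<description>first</description>
</condition>
<condition cid="2">
<description>second</description>
</condition>'

COND_NUM=2

# -v copies the shell value into the awk variable "cid".
condition=$(printf '%s\n' "$indicator" | awk -v cid="$COND_NUM" '
  index($0, "<condition cid=\"" cid "\">") { inside = 1 }  # opening tag found
  inside { print }                                          # print lines of the block
  /<\/condition>/ { inside = 0 }                            # closing tag ends the block
')

printf '%s\n' "$condition"
```

Using index() rather than a regex match keeps the cid value from being interpreted as a pattern.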

jrram (ASKER)

The XML is stored in a variable, so I don't think (?) it's multi-line input.  My thinking is that if it were multi-line input, then it wouldn't work when trying to extract the 2nd condition either.

I think the question is how do I tell it to stop when it finds the first </condition>.

I see. So if this appears as a single line, the following works.

echo $y
This is a test <condition cid="1">more stuff</condition>More junk

echo $x
1

echo $y | grep -o "<condition cid=\"$x\">*.*<\/condition>"
<condition cid="1">more stuff</condition>

More importantly, this looks suspiciously like your example except for the fact that your XML string prints out as multiple lines. Can you do an

echo $indicator

to see what it looks like?

jrram (ASKER)

When I do an "echo $indicator | wc -l" it returns 1 so this confirms the input is only 1 line.
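
A caveat on that check: an unquoted $indicator is split on whitespace by the shell before echo sees it, so echo prints everything on one line even when the variable itself contains newlines. The sketch below uses made-up two-line data to show the difference between the quoted and unquoted forms; the variable names are only for illustration.

```shell
# Hypothetical two-line value.
indicator='line one
line two'

# Unquoted: the shell splits the value into words and echo joins them with spaces.
# tr strips the padding that some wc implementations add around the count.
unquoted_lines=$(echo $indicator | wc -l | tr -d ' ')

# Quoted: the embedded newline is preserved.
quoted_lines=$(echo "$indicator" | wc -l | tr -d ' ')

echo "unquoted: $unquoted_lines, quoted: $quoted_lines"
```

So a count of 1 from the unquoted form does not by itself show how the data is stored, although it does describe what grep receives in the pipeline above.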

And yes, in the example test condition that you gave, the grep expression does work b/c you only have one </condition> in variable $y.  If you put a second one in there (see example below), then run the grep statement, it returns too much.

Data Setup:

x="1"
y='<condition cid="1">test data 1</condition><condition cid="2">test data 2</condition>'

Problem Statement:
echo $y | grep -o "<condition cid=\"$x\">*.*<\/condition>"

=====

Expected Result:

<condition cid="1">test data 1</condition>

Actual Result:

<condition cid="1">test data 1</condition><condition cid="2">test data 2</condition>

Notes:

As it is, the grep statement correctly finds the <condition cid="1">, but I think because of the '*.*', it greedily ignores the first </condition> (expected stopping point) and includes everything up until the last </condition> value.

Does this make sense?  Know of any parameters or changes that can be made to the grep statement?

It does make sense. The *.* should be .*? to make it non-greedy, but that doesn't seem to work either. Are you bound to a grep solution, or are you willing to use an alternative?
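
The likely reason .*? has no effect here is that it is a Perl-style lazy quantifier, which grep's default basic syntax and the -E extended syntax do not define. GNU grep accepts it when run with -P, so the following sketch assumes a GNU grep built with PCRE support; other greps may reject the option.

```shell
x="1"
y='<condition cid="1">test data 1</condition><condition cid="2">test data 2</condition>'

# -P: Perl-compatible regex, so .*? stops at the first </condition> it can.
# -o: print only the matched part of the line.
condition=$(printf '%s\n' "$y" | grep -oP "<condition cid=\"$x\">.*?</condition>")

printf '%s\n' "$condition"
```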

jrram (ASKER)

I'm open to using an alternative solution.  I chose grep b/c it seemed like a simple thing to do, but it doesn't appear that way anymore.  I also looked at sed, but that didn't work for me either (as a standalone solution), and I'm not that familiar with awk, but it seems like it could work.

I'm still interested in whatever alternate solution you can provide, but as a workaround I added a sed statement after the grep statement to chop off the un-needed data, and this works for me.

condition=`echo $indicator | grep -o "<condition cid=\"$COND_NUM\">*.*<\/condition>" | sed "s/<\/condition>.*//g"`
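
Note that the sed expression above deletes everything from the first </condition> onward, so the closing tag itself is dropped from the result. If the closing tag is wanted, a small variation puts it back. This is a sketch using the single-line test data from earlier in the thread, not the exact production input.

```shell
COND_NUM="1"
indicator='<condition cid="1">test data 1</condition><condition cid="2">test data 2</condition>'

# grep still matches greedily up to the last </condition>;
# sed then cuts after the first </condition> and re-emits that tag.
condition=$(echo "$indicator" |
  grep -o "<condition cid=\"$COND_NUM\">.*</condition>" |
  sed 's|</condition>.*|</condition>|')

printf '%s\n' "$condition"
```

Using | as the sed delimiter avoids having to escape the slash in the closing tag.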

macker-

Have you tried using -m to match just the first occurrence?

You could combine this in a bash script, with a for loop, to increment $i and loop thru the matches, assigning each to a corresponding numbered variable.
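
A caution on -m: in GNU grep it stops after a given number of matching lines, not matches, so on single-line input it would not by itself isolate one condition. The loop idea can still be sketched with a different tool. The version below swaps in awk's index() and substr() to cut at the first closing tag; the helper name `extract_condition` and the sample data are hypothetical, and it assumes the input arrives as one line.

```shell
indicator='<condition cid="1">test data 1</condition><condition cid="2">test data 2</condition>'

# Hypothetical helper: print the condition block with the given cid,
# or return non-zero if that cid is not present.
extract_condition() {
  printf '%s\n' "$indicator" | awk -v cid="$1" '{
    open_tag = "<condition cid=\"" cid "\">"
    close_tag = "</condition>"
    start = index($0, open_tag)
    if (start == 0) exit 1
    rest = substr($0, start)            # text from the opening tag onward
    stop = index(rest, close_tag)       # first closing tag after it
    if (stop == 0) exit 1
    print substr(rest, 1, stop + length(close_tag) - 1)
  }'
}

# Walk cid=1, 2, ... until a number is not found.
i=1
while block=$(extract_condition "$i"); do
  echo "condition $i: $block"
  i=$((i + 1))
done
```

Because the search is a plain substring match, nothing in the tags needs regex escaping, and the first closing tag always wins.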
ASKER CERTIFIED SOLUTION
jrram

Sorry for the delay, jrram. The solution you posted is classic Unix shell stuff, and I can't find a way to do better in shell code.