We help IT Professionals succeed at work.

Script to get count of parent Orders

radix655
radix655 used Ask the Experts™
on
The message will be like this:

20110803 22:25:03.887491 [ORDER INS            ] [Major=O][Minor=I][src=something][Id=1][ref=674][SourceSystemTimeStamp=2011-08-04 01:10:03.979][SourceSystemCode=1903][SourceSystemCode<25000>=1903][Source<25102>=gtstore1][ParentID<25101>=-1][Side<54>=2][OrderID<25100>=1][RefId<25105>=100000][Tif<59>=1][OrderType<40>=2][Capacity<528>=A][Destination<25104>=router][ClOrdId<11>=00299867010002][MaxFloor<111>=100][SourceSystem<5177>=RUBY]......[some tags deleted]

I need to find out for each SourceSystem
1) how many orders have ParentID != -1
2) how many orders have ParentID == -1
3) how many orders do not have DeskId tag
3) For each SourceSystem, there are multiple Source(s)
For each Source in SourceSystem:
1) how many orders have ParentID != -1
2) how many orders have ParentID == -1
3) how many orders do not have DeskId tag

This is a huge file. Performance need not be optimal but should not run for more than an hour.

Even a perl script would be fine as well.

Comment
Watch Question

Do more with

Expert Office
EXPERT OFFICE® is a registered trademark of EXPERTS EXCHANGE®
Top Expert 2011

Commented:
Please post in "Perl" zone since it is easier to use perl to parse your message.

Besides, please provide longer message (at least includes 3~5 SourceSystem) in Code or File.

Author

Commented:
@Admin
Please edit this question to post it in Perl script. I can't seem to select zones when I edit.
Top Expert 2011

Commented:
Under
Programming -> Prog languages -> Perl

Please provide more data and sample output for experts to understand your questions as my first post.

Author

Commented:
20110803 22:25:03.887491 [ORDER INS            ] [Major=O][Minor=I][src=something][Id=1][ref=674][SourceSystemTimeStamp=2011-08-04 01:10:03.979][SourceSystemCode=1903][SourceSystemCode<25000>=1903][Source<25102>=gtstore1][ParentID<25101>=255555][Side<54>=2][OrderID<25100>=1][RefId<25105>=100000][Tif<59>=1][OrderType<40>=2][Capacity<528>=A][Destination<25104>=router][ClOrdId<11>=00299867010002][MaxFloor<111>=100][SourceSystem<5177>=RUBY]

20110803 22:25:03.887492 [ORDER INS            ] [Major=O][Minor=I][src=something][Id=1][ref=674][SourceSystemTimeStamp=2011-08-04 01:10:03.979][SourceSystemCode=1903][SourceSystemCode<25000>=1903][Source<25102>=gtstore2][ParentID<25101>=-1][Side<54>=2][OrderID<25100>=1][RefId<25105>=100000][Tif<59>=1][OrderType<40>=2][Capacity<528>=A][Destination<25104>=router][ClOrdId<11>=00299867010002][MaxFloor<111>=100][SourceSystem<5177>=RUBY2][DeskID=101]

20110803 22:25:03.887493 [ORDER INS            ] [Major=O][Minor=I][src=something][Id=1][ref=674][SourceSystemTimeStamp=2011-08-04 01:10:03.979][SourceSystemCode=1903][SourceSystemCode<25000>=1903][Source<25102>=gtstore3][ParentID<25101>=-1][Side<54>=2][OrderID<25100>=1][RefId<25105>=100000][Tif<59>=1][OrderType<40>=2][Capacity<528>=A][Destination<25104>=router][ClOrdId<11>=00299867010002][MaxFloor<111>=100][SourceSystem<5177>=RUBY]


1) how many orders have ParentID != -1
SourceSystem RUBY has 1 Child
SourceSystem RUBY2 has 0 Parent

2) how many orders have ParentID == -1
SourceSystem RUBY has 1 Parent
SourceSystem RUBY2 has 1 Parent

3) how many orders do not have DeskId tag
SourceSystem RUBY has 2 Orders with no DeskID
SourceSystem RUBY2 has 0 Orders with no DeskID

3) For each SourceSystem, there are multiple Source(s)
For each Source in SourceSystem:
1) how many orders have ParentID != -1
Source gtstore1 has 1 Child
Source gtstore2 has 0 Child
Source gtstore3 has 0 Child

2) how many orders have ParentID == -1
Source gtstore1 has 0 Parent
Source gtstore2 has 1 Parent
Source gtstore3 has 1 Parent

3) how many orders do not have DeskId tag
Source gtstore1 has 1 Orders with no DeskID
Source gtstore2 has 0 Orders with no DeskID
Source gtstore3 has 1 Orders with no DeskID

Author

Commented:
I wrote the Shell script. If there are any performance improvements that can be done. Please let me know.
#!/bin/gawk -f
BEGIN \
{ \
        PARENT_SUM=0; \
        CHILD_SUM=0; \
        PARENTS_WITH_DESK=0; \
        PARENTS_WITH_DESK_TYPES=0; \
        PARENTS_WITH_NONE=0; \
        CHILD_WITH_DESK=0; \
        CHILD_WITH_DESK_TYPES=0; \
        CHILD_WITH_NONE=0; \
        DESK_TYPE_LENGTH=length("<25417>=");
} \
{ \
        ENDED=1; \
        var=$0; \
        source_system_idx=0;
        while(ENDED==1) \
        { \
                source_system_idx=index(var,"<5177>="); \
                match(var, /ParentID<25101>=([^\]]*)\]/, parent_arr); \
                parent=parent_arr[1]; \
                match(var, /Source<25102>=([^\]]*)\]/, source_arr);  \
                source=source_arr[1]; \
                if(source_system_idx > 0) \
                { \
                        match(var, /SourceSystem<5177>=([^\]]*)\]/, source_system_arr);  \
                        source_system=source_system_arr[1];
                } \
                if(parent == -1) \
                { \
                        PARENT_SUM=PARENT_SUM+1; \
                        if(source_system_idx > 0) {
                                source_system_parent_sum[source_system] += 1;
                        }
                        source_parent_sum[source] += 1;
                        DESK_IDX=index(var, "<25418>=");
                        DESK_TYPES_IDX=index(var, "<25417>=");
                        if(DESK_IDX > 0)
                        {
                                PARENTS_WITH_DESK=PARENTS_WITH_DESK+1;
                                if(source_system_idx > 0) {
                                        source_system_parent_desk[source_system] += 1;
                                }
                                source_parent_desk[source] += 1;
                        }
                        if(DESK_TYPES_IDX > 0)
                        {
                                PARENTS_WITH_DESK_TYPES=PARENTS_WITH_DESK_TYPES+1;
                                desk_type=substr(var, DESK_TYPES_IDX + DESK_TYPE_LENGTH, 1);
                                if(source_system_idx > 0) {
                                        source_system_parent_desk_types[source_system] += 1;
                                }
                                source_parent_desk_types[source] += 1;
                        }
                        if(DESK_IDX <= 0 && DESK_TYPES_IDX <= 0)
                        {
                                PARENTS_WITH_NONE=PARENTS_WITH_NONE+1;
                                if(source_system_idx > 0) {
                                        source_system_parent_desk_none[source_system] += 1;
                                }
                                source_parent_desk_none[source] += 1;
                        }
                } \
                else \
                { \
                        CHILD_SUM=CHILD_SUM+1; \
                        if(source_system_idx > 0) {
                                source_system_child_sum[source_system] += 1;
                        }
                        source_child_sum[source] += 1;
                        DESK_IDX=index(var, "<25418>=");
                        DESK_TYPES_IDX=index(var, "<25417>=");
                        if(DESK_IDX > 0)
                        {
                                CHILD_WITH_DESK=CHILD_WITH_DESK+1;
                                if(source_system_idx > 0) {
                                        source_system_child_desk[source_system] += 1;
                                }
                                source_child_desk[source] += 1;
                        }
                        if(DESK_TYPES_IDX > 0)
                        {
                                CHILD_WITH_DESK_TYPES=CHILD_WITH_DESK_TYPES+1;
                                desk_type=substr(var, DESK_TYPES_IDX + DESK_TYPE_LENGTH, 1);
                                if(source_system_idx > 0) {
                                        source_system_child_desk_types[source_system] += 1;
                                }
                                source_child_desk_types[source] += 1;
                        }
                        if(DESK_IDX <= 0 && DESK_TYPES_IDX <= 0)
                        {
                                CHILD_WITH_NONE=CHILD_WITH_NONE+1;
                                if(source_system_idx > 0) {
                                        source_system_child_desk_none[source_system] += 1;
                                }
                                source_child_desk_none[source] += 1;
                        }

                } \
                ENDED=getline var; \
        }  \
} \
END \
{ \

	print "----------Consolidated Sum for Parent and Children-------------"; \
	
        print "Number of Parents:" PARENT_SUM; \
        print "Number of Children:" CHILD_SUM; \

        print "Number of Parents with No Desk ID's:" PARENTS_WITH_NONE; \
        print "Number of Parents with Desk ID's:" PARENTS_WITH_DESK; \
        print "Number of Parents with No Desk Types:" PARENTS_WITH_DESK_TYPES; \

        print "Number of Children with No Desk ID's:" CHILD_WITH_NONE; \
        print "Number of Children with Desk ID's:" CHILD_WITH_DESK; \
        print "Number of Children with No Desk Types:" CHILD_WITH_DESK_TYPES; \
        
        print "----------Consolidated Sum for each SourceSystem on Parent------"; \

        for (var in source_system_parent_sum) \
                print var ",  Source System Parent sum " source_system_parent_sum[var]; \
        for (var in source_system_parent_desk) \
                print var ",  Source System Parent With Desk " source_system_parent_desk[var]; \
        for (var in source_system_parent_desk_types) \
                print var ",  Source System Parent With DeskType " source_system_parent_desk_types[var]; \
        for (var in source_system_parent_desk_none) \
                print var ",  Source System Parent With no Desk or DeskType " source_system_parent_desk_none[var]; \
                
        print "----------Consolidated Sum for each SourceSystem on Children-----"; \

        for (var in source_system_child_sum) \
                print var ",  Source System Child sum " source_system_child_sum[var]; \
        for (var in source_system_child_desk) \
                print var ",  Source System Child With Desk " source_system_child_desk[var]; \
        for (var in source_system_child_desk_types) \
                print var ",  Source System Child With DeskType " source_system_child_desk_types[var]; \
        for (var in source_system_child_desk_none) \
                print var ",  Source System Child With no Desk or DeskType " source_system_child_desk_none[var]; \
                
        print "----------Consolidated Sum for each Source on Parent--------------"; \

        for (var in source_parent_sum) \
                print var ",  Source Parent sum " source_parent_sum[var]; \
        for (var in source_parent_desk) \
                print var ",  Source Parent With Desk " source_parent_desk[var]; \
        for (var in source_parent_desk_types) \
                print var ",  Source Parent With DeskType " source_parent_desk_types[var]; \
        for (var in source_parent_desk_none) \
                print var ",  Source Parent With no Desk or DeskType " source_parent_desk_none[var]; \
                
        print "----------Consolidated Sum for each Source on Children-------------"; \
               

        for (var in source_child_sum) \
                print var ",  Source Child sum " source_child_sum[var]; \
        for (var in source_child_desk) \
                print var ",  Source Child With Desk " source_child_desk[var]; \
        for (var in source_child_desk_types) \
                print var ", Source Child With DeskType " source_child_desk_types[var]; \
        for (var in source_child_desk_none) \
                print var ", Source Child With no Desk or DeskType " source_child_desk_none[var]; \
} \

Open in new window

Top Expert 2011

Commented:
line 25 ~ 29 can be combined into
line 33 ~ 35 and line 68 ~ 70
to reduce one if condition check.

Author

Commented:
@wesly_chen:
Can you please elaborate?

Thanks.
Top Expert 2011

Commented:
line 25-29
--------
if(source_system_idx > 0) \
{ \
        match(var, /SourceSystem<5177>=([^\]]*)\]/, source_system_arr);  \
        source_system=source_system_arr[1];
} \
------
And the following if--else statemnet
if(parent == -1)
    # line 33-35 ----
    if(source_system_idx > 0) {
        source_system_parent_sum[source_system] += 1;
    }
...
else
{
    # line 68-70 ----
    if(source_system_idx > 0) {
         source_system_child_sum[source_system] += 1;
    }
...
======================
They both check if condition, if(source_system_idx > 0)

So merge line 25-29 into
line 33 ~ 35 and line 68 ~ 70



if(parent == -1)
    # line 33-35 ----
    if(source_system_idx > 0) {
        match(var, /SourceSystem<5177>=([^\]]*)\]/, source_system_arr);  \
        source_system=source_system_arr[1];
        source_system_parent_sum[source_system] += 1;
    }
...
else
{
    # line 68-70 ----
    if(source_system_idx > 0) {
         match(var, /SourceSystem<5177>=([^\]]*)\]/, source_system_arr);  \
         source_system=source_system_arr[1];
         source_system_child_sum[source_system] += 1;
    }
...

Open in new window

Author

Commented:
I thought that match (to obtain the substring) might have some impact on performance. Hence, I tried to obtain the substring using match once at the start after getting each line.
Top Expert 2011

Commented:
in if--else, it is run once only, too. But you reduce one extra "if" check.

Author

Commented:
Okay.

I have one more question:

Is this the best way to get a substring in awk?

match(var, /SourceSystem<5177>=([^\]]*)\]/, source_system_arr);  \
source_system=source_system_arr[1];

And also, I am getting three substrings.
one for parentID, one for Source, and one for SourceSystem.

For that it processes the line three times, right?

Is there a way to get all the three substring processing the line only once?

I am running the script on a 107GB file. Lot of orders yesterday. It is taking like 2.5 hours to run the script. Hence I asked.
Top Expert 2011
Commented:
For those two matches,
match(var, /ParentID<25101>=([^\]]*)\]/, parent_arr); \
match(var, /Source<25102>=([^\]]*)\]/, source_arr);  \

If you know for sure Source<...> is before ParentID<..>, then
you can merge then into one
----------------------
match(var, /Source<25102>=([^\]]*)\].*ParentID<25101>=([^\]]*)\]/, arr);
source=arr[1];
parent=arr[2];

But you need to verify the regular expression.

Author

Commented:
Thanks for suggestions.