Link to home
Start Free TrialLog in
Avatar of radix655
radix655

asked on

Script to get count of parent Orders

The message will be like this:

20110803 22:25:03.887491 [ORDER INS            ] [Major=O][Minor=I][src=something][Id=1][ref=674][SourceSystemTimeStamp=2011-08-04 01:10:03.979][SourceSystemCode=1903][SourceSystemCode<25000>=1903][Source<25102>=gtstore1][ParentID<25101>=-1][Side<54>=2][OrderID<25100>=1][RefId<25105>=100000][Tif<59>=1][OrderType<40>=2][Capacity<528>=A][Destination<25104>=router][ClOrdId<11>=00299867010002][MaxFloor<111>=100][SourceSystem<5177>=RUBY]......[some tags deleted]

I need to find out for each SourceSystem
1) how many orders have ParentID != -1
2) how many orders have ParentID == -1
3) how many orders do not have DeskId tag
3) For each SourceSystem, there are multiple Source(s)
For each Source in SourceSystem:
1) how many orders have ParentID != -1
2) how many orders have ParentID == -1
3) how many orders do not have DeskId tag

This is a huge file. Performance need not be optimal but should not run for more than an hour.

Even a perl script would be fine as well.

Avatar of wesly_chen
wesly_chen
Flag of United States of America image

Please post in "Perl" zone since it is easier to use perl to parse your message.

Besides, please provide longer message (at least includes 3~5 SourceSystem) in Code or File.
Avatar of radix655
radix655

ASKER

@Admin
Please edit this question to post it in Perl script. I can't seem to select zones when I edit.
Under
Programming -> Prog languages -> Perl

Please provide more data and sample output for experts to understand your questions as my first post.
20110803 22:25:03.887491 [ORDER INS            ] [Major=O][Minor=I][src=something][Id=1][ref=674][SourceSystemTimeStamp=2011-08-04 01:10:03.979][SourceSystemCode=1903][SourceSystemCode<25000>=1903][Source<25102>=gtstore1][ParentID<25101>=255555][Side<54>=2][OrderID<25100>=1][RefId<25105>=100000][Tif<59>=1][OrderType<40>=2][Capacity<528>=A][Destination<25104>=router][ClOrdId<11>=00299867010002][MaxFloor<111>=100][SourceSystem<5177>=RUBY]

20110803 22:25:03.887492 [ORDER INS            ] [Major=O][Minor=I][src=something][Id=1][ref=674][SourceSystemTimeStamp=2011-08-04 01:10:03.979][SourceSystemCode=1903][SourceSystemCode<25000>=1903][Source<25102>=gtstore2][ParentID<25101>=-1][Side<54>=2][OrderID<25100>=1][RefId<25105>=100000][Tif<59>=1][OrderType<40>=2][Capacity<528>=A][Destination<25104>=router][ClOrdId<11>=00299867010002][MaxFloor<111>=100][SourceSystem<5177>=RUBY2][DeskID=101]

20110803 22:25:03.887493 [ORDER INS            ] [Major=O][Minor=I][src=something][Id=1][ref=674][SourceSystemTimeStamp=2011-08-04 01:10:03.979][SourceSystemCode=1903][SourceSystemCode<25000>=1903][Source<25102>=gtstore3][ParentID<25101>=-1][Side<54>=2][OrderID<25100>=1][RefId<25105>=100000][Tif<59>=1][OrderType<40>=2][Capacity<528>=A][Destination<25104>=router][ClOrdId<11>=00299867010002][MaxFloor<111>=100][SourceSystem<5177>=RUBY]


1) how many orders have ParentID != -1
SourceSystem RUBY has 1 Child
SourceSystem RUBY2 has 0 Parent

2) how many orders have ParentID == -1
SourceSystem RUBY has 1 Parent
SourceSystem RUBY2 has 1 Parent

3) how many orders do not have DeskId tag
SourceSystem RUBY has 2 Orders with no DeskID
SourceSystem RUBY2 has 0 Orders with no DeskID

3) For each SourceSystem, there are multiple Source(s)
For each Source in SourceSystem:
1) how many orders have ParentID != -1
Source gtstore1 has 1 Child
Source gtstore2 has 0 Child
Source gtstore3 has 0 Child

2) how many orders have ParentID == -1
Source gtstore1 has 0 Parent
Source gtstore2 has 1 Parent
Source gtstore3 has 1 Parent

3) how many orders do not have DeskId tag
Source gtstore1 has 1 Orders with no DeskID
Source gtstore2 has 0 Orders with no DeskID
Source gtstore3 has 1 Orders with no DeskID

I wrote the Shell script. If there are any performance improvements that can be done. Please let me know.
#!/bin/gawk -f
BEGIN \
{ \
        PARENT_SUM=0; \
        CHILD_SUM=0; \
        PARENTS_WITH_DESK=0; \
        PARENTS_WITH_DESK_TYPES=0; \
        PARENTS_WITH_NONE=0; \
        CHILD_WITH_DESK=0; \
        CHILD_WITH_DESK_TYPES=0; \
        CHILD_WITH_NONE=0; \
        DESK_TYPE_LENGTH=length("<25417>=");
} \
{ \
        ENDED=1; \
        var=$0; \
        source_system_idx=0;
        while(ENDED==1) \
        { \
                source_system_idx=index(var,"<5177>="); \
                match(var, /ParentID<25101>=([^\]]*)\]/, parent_arr); \
                parent=parent_arr[1]; \
                match(var, /Source<25102>=([^\]]*)\]/, source_arr);  \
                source=source_arr[1]; \
                if(source_system_idx > 0) \
                { \
                        match(var, /SourceSystem<5177>=([^\]]*)\]/, source_system_arr);  \
                        source_system=source_system_arr[1];
                } \
                if(parent == -1) \
                { \
                        PARENT_SUM=PARENT_SUM+1; \
                        if(source_system_idx > 0) {
                                source_system_parent_sum[source_system] += 1;
                        }
                        source_parent_sum[source] += 1;
                        DESK_IDX=index(var, "<25418>=");
                        DESK_TYPES_IDX=index(var, "<25417>=");
                        if(DESK_IDX > 0)
                        {
                                PARENTS_WITH_DESK=PARENTS_WITH_DESK+1;
                                if(source_system_idx > 0) {
                                        source_system_parent_desk[source_system] += 1;
                                }
                                source_parent_desk[source] += 1;
                        }
                        if(DESK_TYPES_IDX > 0)
                        {
                                PARENTS_WITH_DESK_TYPES=PARENTS_WITH_DESK_TYPES+1;
                                desk_type=substr(var, DESK_TYPES_IDX + DESK_TYPE_LENGTH, 1);
                                if(source_system_idx > 0) {
                                        source_system_parent_desk_types[source_system] += 1;
                                }
                                source_parent_desk_types[source] += 1;
                        }
                        if(DESK_IDX <= 0 && DESK_TYPES_IDX <= 0)
                        {
                                PARENTS_WITH_NONE=PARENTS_WITH_NONE+1;
                                if(source_system_idx > 0) {
                                        source_system_parent_desk_none[source_system] += 1;
                                }
                                source_parent_desk_none[source] += 1;
                        }
                } \
                else \
                { \
                        CHILD_SUM=CHILD_SUM+1; \
                        if(source_system_idx > 0) {
                                source_system_child_sum[source_system] += 1;
                        }
                        source_child_sum[source] += 1;
                        DESK_IDX=index(var, "<25418>=");
                        DESK_TYPES_IDX=index(var, "<25417>=");
                        if(DESK_IDX > 0)
                        {
                                CHILD_WITH_DESK=CHILD_WITH_DESK+1;
                                if(source_system_idx > 0) {
                                        source_system_child_desk[source_system] += 1;
                                }
                                source_child_desk[source] += 1;
                        }
                        if(DESK_TYPES_IDX > 0)
                        {
                                CHILD_WITH_DESK_TYPES=CHILD_WITH_DESK_TYPES+1;
                                desk_type=substr(var, DESK_TYPES_IDX + DESK_TYPE_LENGTH, 1);
                                if(source_system_idx > 0) {
                                        source_system_child_desk_types[source_system] += 1;
                                }
                                source_child_desk_types[source] += 1;
                        }
                        if(DESK_IDX <= 0 && DESK_TYPES_IDX <= 0)
                        {
                                CHILD_WITH_NONE=CHILD_WITH_NONE+1;
                                if(source_system_idx > 0) {
                                        source_system_child_desk_none[source_system] += 1;
                                }
                                source_child_desk_none[source] += 1;
                        }

                } \
                ENDED=getline var; \
        }  \
} \
END \
{ \

	print "----------Consolidated Sum for Parent and Children-------------"; \
	
        print "Number of Parents:" PARENT_SUM; \
        print "Number of Children:" CHILD_SUM; \

        print "Number of Parents with No Desk ID's:" PARENTS_WITH_NONE; \
        print "Number of Parents with Desk ID's:" PARENTS_WITH_DESK; \
        print "Number of Parents with No Desk Types:" PARENTS_WITH_DESK_TYPES; \

        print "Number of Children with No Desk ID's:" CHILD_WITH_NONE; \
        print "Number of Children with Desk ID's:" CHILD_WITH_DESK; \
        print "Number of Children with No Desk Types:" CHILD_WITH_DESK_TYPES; \
        
        print "----------Consolidated Sum for each SourceSystem on Parent------"; \

        for (var in source_system_parent_sum) \
                print var ",  Source System Parent sum " source_system_parent_sum[var]; \
        for (var in source_system_parent_desk) \
                print var ",  Source System Parent With Desk " source_system_parent_desk[var]; \
        for (var in source_system_parent_desk_types) \
                print var ",  Source System Parent With DeskType " source_system_parent_desk_types[var]; \
        for (var in source_system_parent_desk_none) \
                print var ",  Source System Parent With no Desk or DeskType " source_system_parent_desk_none[var]; \
                
        print "----------Consolidated Sum for each SourceSystem on Children-----"; \

        for (var in source_system_child_sum) \
                print var ",  Source System Child sum " source_system_child_sum[var]; \
        for (var in source_system_child_desk) \
                print var ",  Source System Child With Desk " source_system_child_desk[var]; \
        for (var in source_system_child_desk_types) \
                print var ",  Source System Child With DeskType " source_system_child_desk_types[var]; \
        for (var in source_system_child_desk_none) \
                print var ",  Source System Child With no Desk or DeskType " source_system_child_desk_none[var]; \
                
        print "----------Consolidated Sum for each Source on Parent--------------"; \

        for (var in source_parent_sum) \
                print var ",  Source Parent sum " source_parent_sum[var]; \
        for (var in source_parent_desk) \
                print var ",  Source Parent With Desk " source_parent_desk[var]; \
        for (var in source_parent_desk_types) \
                print var ",  Source Parent With DeskType " source_parent_desk_types[var]; \
        for (var in source_parent_desk_none) \
                print var ",  Source Parent With no Desk or DeskType " source_parent_desk_none[var]; \
                
        print "----------Consolidated Sum for each Source on Children-------------"; \
               

        for (var in source_child_sum) \
                print var ",  Source Child sum " source_child_sum[var]; \
        for (var in source_child_desk) \
                print var ",  Source Child With Desk " source_child_desk[var]; \
        for (var in source_child_desk_types) \
                print var ", Source Child With DeskType " source_child_desk_types[var]; \
        for (var in source_child_desk_none) \
                print var ", Source Child With no Desk or DeskType " source_child_desk_none[var]; \
} \

Open in new window

line 25 ~ 29 can be combined into
line 33 ~ 35 and line 68 ~ 70
to reduce one if condition check.
@wesly_chen:
Can you please elaborate?

Thanks.
line 25-29
--------
if(source_system_idx > 0) \
{ \
        match(var, /SourceSystem<5177>=([^\]]*)\]/, source_system_arr);  \
        source_system=source_system_arr[1];
} \
------
And the following if--else statemnet
if(parent == -1)
    # line 33-35 ----
    if(source_system_idx > 0) {
        source_system_parent_sum[source_system] += 1;
    }
...
else
{
    # line 68-70 ----
    if(source_system_idx > 0) {
         source_system_child_sum[source_system] += 1;
    }
...
======================
They both check if condition, if(source_system_idx > 0)

So merge line 25-29 into
line 33 ~ 35 and line 68 ~ 70



if(parent == -1)
    # line 33-35 ----
    if(source_system_idx > 0) {
        match(var, /SourceSystem<5177>=([^\]]*)\]/, source_system_arr);  \
        source_system=source_system_arr[1];
        source_system_parent_sum[source_system] += 1;
    }
...
else
{
    # line 68-70 ----
    if(source_system_idx > 0) {
         match(var, /SourceSystem<5177>=([^\]]*)\]/, source_system_arr);  \
         source_system=source_system_arr[1];
         source_system_child_sum[source_system] += 1;
    }
...

Open in new window

I thought that match (to obtain the substring) might have some impact on performance. Hence, I tried to obtain the substring using match once at the start after getting each line.
in if--else, it is run once only, too. But you reduce one extra "if" check.
Okay.

I have one more question:

Is this the best way to get a substring in awk?

match(var, /SourceSystem<5177>=([^\]]*)\]/, source_system_arr);  \
source_system=source_system_arr[1];

And also, I am getting three substrings.
one for parentID, one for Source, and one for SourceSystem.

For that it processes the line three times, right?

Is there a way to get all the three substring processing the line only once?

I am running the script on a 107GB file. Lot of orders yesterday. It is taking like 2.5 hours to run the script. Hence I asked.
ASKER CERTIFIED SOLUTION
Avatar of wesly_chen
wesly_chen
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Thanks for suggestions.