radix655
asked on
Script to get count of parent Orders
The message will be like this:
20110803 22:25:03.887491 [ORDER INS ] [Major=O][Minor=I][src=som ething][Id =1][ref=67 4][SourceS ystemTimeS tamp=2011- 08-04 01:10:03.979][SourceSystem Code=1903] [SourceSys temCode<25 000>=1903] [Source<25 102>=gtsto re1][Paren tID<25101> =-1][Side< 54>=2][Ord erID<25100 >=1][RefId <25105>=10 0000][Tif< 59>=1][Ord erType<40> =2][Capaci ty<528>=A] [Destinati on<25104>= router][Cl OrdId<11>= 0029986701 0002][MaxF loor<111>= 100][Sourc eSystem<51 77>=RUBY]. .....[some tags deleted]
I need to find out for each SourceSystem
1) how many orders have ParentID != -1
2) how many orders have ParentID == -1
3) how many orders do not have DeskId tag
3) For each SourceSystem, there are multiple Source(s)
For each Source in SourceSystem:
1) how many orders have ParentID != -1
2) how many orders have ParentID == -1
3) how many orders do not have DeskId tag
This is a huge file. Performance need not be optimal but should not run for more than an hour.
Even a perl script would be fine as well.
20110803 22:25:03.887491 [ORDER INS ] [Major=O][Minor=I][src=som
I need to find out for each SourceSystem
1) how many orders have ParentID != -1
2) how many orders have ParentID == -1
3) how many orders do not have DeskId tag
3) For each SourceSystem, there are multiple Source(s)
For each Source in SourceSystem:
1) how many orders have ParentID != -1
2) how many orders have ParentID == -1
3) how many orders do not have DeskId tag
This is a huge file. Performance need not be optimal but should not run for more than an hour.
Even a perl script would be fine as well.
ASKER
@Admin
Please edit this question to post it in Perl script. I can't seem to select zones when I edit.
Please edit this question to post it in Perl script. I can't seem to select zones when I edit.
Under
Programming -> Prog languages -> Perl
Please provide more data and sample output for experts to understand your questions as my first post.
Programming -> Prog languages -> Perl
Please provide more data and sample output for experts to understand your questions as my first post.
ASKER
20110803 22:25:03.887491 [ORDER INS ] [Major=O][Minor=I][src=som ething][Id =1][ref=67 4][SourceS ystemTimeS tamp=2011- 08-04 01:10:03.979][SourceSystem Code=1903] [SourceSys temCode<25 000>=1903] [Source<25 102>=gtsto re1][Paren tID<25101> =255555][S ide<54>=2] [OrderID<2 5100>=1][R efId<25105 >=100000][ Tif<59>=1] [OrderType <40>=2][Ca pacity<528 >=A][Desti nation<251 04>=router ][ClOrdId< 11>=002998 67010002][ MaxFloor<1 11>=100][S ourceSyste m<5177>=RU BY]
20110803 22:25:03.887492 [ORDER INS ] [Major=O][Minor=I][src=som ething][Id =1][ref=67 4][SourceS ystemTimeS tamp=2011- 08-04 01:10:03.979][SourceSystem Code=1903] [SourceSys temCode<25 000>=1903] [Source<25 102>=gtsto re2][Paren tID<25101> =-1][Side< 54>=2][Ord erID<25100 >=1][RefId <25105>=10 0000][Tif< 59>=1][Ord erType<40> =2][Capaci ty<528>=A] [Destinati on<25104>= router][Cl OrdId<11>= 0029986701 0002][MaxF loor<111>= 100][Sourc eSystem<51 77>=RUBY2] [DeskID=10 1]
20110803 22:25:03.887493 [ORDER INS ] [Major=O][Minor=I][src=som ething][Id =1][ref=67 4][SourceS ystemTimeS tamp=2011- 08-04 01:10:03.979][SourceSystem Code=1903] [SourceSys temCode<25 000>=1903] [Source<25 102>=gtsto re3][Paren tID<25101> =-1][Side< 54>=2][Ord erID<25100 >=1][RefId <25105>=10 0000][Tif< 59>=1][Ord erType<40> =2][Capaci ty<528>=A] [Destinati on<25104>= router][Cl OrdId<11>= 0029986701 0002][MaxF loor<111>= 100][Sourc eSystem<51 77>=RUBY]
1) how many orders have ParentID != -1
SourceSystem RUBY has 1 Child
SourceSystem RUBY2 has 0 Parent
2) how many orders have ParentID == -1
SourceSystem RUBY has 1 Parent
SourceSystem RUBY2 has 1 Parent
3) how many orders do not have DeskId tag
SourceSystem RUBY has 2 Orders with no DeskID
SourceSystem RUBY2 has 0 Orders with no DeskID
3) For each SourceSystem, there are multiple Source(s)
For each Source in SourceSystem:
1) how many orders have ParentID != -1
Source gtstore1 has 1 Child
Source gtstore2 has 0 Child
Source gtstore3 has 0 Child
2) how many orders have ParentID == -1
Source gtstore1 has 0 Parent
Source gtstore2 has 1 Parent
Source gtstore3 has 1 Parent
3) how many orders do not have DeskId tag
Source gtstore1 has 1 Orders with no DeskID
Source gtstore2 has 0 Orders with no DeskID
Source gtstore3 has 1 Orders with no DeskID
20110803 22:25:03.887492 [ORDER INS ] [Major=O][Minor=I][src=som
20110803 22:25:03.887493 [ORDER INS ] [Major=O][Minor=I][src=som
1) how many orders have ParentID != -1
SourceSystem RUBY has 1 Child
SourceSystem RUBY2 has 0 Parent
2) how many orders have ParentID == -1
SourceSystem RUBY has 1 Parent
SourceSystem RUBY2 has 1 Parent
3) how many orders do not have DeskId tag
SourceSystem RUBY has 2 Orders with no DeskID
SourceSystem RUBY2 has 0 Orders with no DeskID
3) For each SourceSystem, there are multiple Source(s)
For each Source in SourceSystem:
1) how many orders have ParentID != -1
Source gtstore1 has 1 Child
Source gtstore2 has 0 Child
Source gtstore3 has 0 Child
2) how many orders have ParentID == -1
Source gtstore1 has 0 Parent
Source gtstore2 has 1 Parent
Source gtstore3 has 1 Parent
3) how many orders do not have DeskId tag
Source gtstore1 has 1 Orders with no DeskID
Source gtstore2 has 0 Orders with no DeskID
Source gtstore3 has 1 Orders with no DeskID
ASKER
I wrote the Shell script. If there are any performance improvements that can be done. Please let me know.
#!/bin/gawk -f
BEGIN \
{ \
PARENT_SUM=0; \
CHILD_SUM=0; \
PARENTS_WITH_DESK=0; \
PARENTS_WITH_DESK_TYPES=0; \
PARENTS_WITH_NONE=0; \
CHILD_WITH_DESK=0; \
CHILD_WITH_DESK_TYPES=0; \
CHILD_WITH_NONE=0; \
DESK_TYPE_LENGTH=length("<25417>=");
} \
{ \
ENDED=1; \
var=$0; \
source_system_idx=0;
while(ENDED==1) \
{ \
source_system_idx=index(var,"<5177>="); \
match(var, /ParentID<25101>=([^\]]*)\]/, parent_arr); \
parent=parent_arr[1]; \
match(var, /Source<25102>=([^\]]*)\]/, source_arr); \
source=source_arr[1]; \
if(source_system_idx > 0) \
{ \
match(var, /SourceSystem<5177>=([^\]]*)\]/, source_system_arr); \
source_system=source_system_arr[1];
} \
if(parent == -1) \
{ \
PARENT_SUM=PARENT_SUM+1; \
if(source_system_idx > 0) {
source_system_parent_sum[source_system] += 1;
}
source_parent_sum[source] += 1;
DESK_IDX=index(var, "<25418>=");
DESK_TYPES_IDX=index(var, "<25417>=");
if(DESK_IDX > 0)
{
PARENTS_WITH_DESK=PARENTS_WITH_DESK+1;
if(source_system_idx > 0) {
source_system_parent_desk[source_system] += 1;
}
source_parent_desk[source] += 1;
}
if(DESK_TYPES_IDX > 0)
{
PARENTS_WITH_DESK_TYPES=PARENTS_WITH_DESK_TYPES+1;
desk_type=substr(var, DESK_TYPES_IDX + DESK_TYPE_LENGTH, 1);
if(source_system_idx > 0) {
source_system_parent_desk_types[source_system] += 1;
}
source_parent_desk_types[source] += 1;
}
if(DESK_IDX <= 0 && DESK_TYPES_IDX <= 0)
{
PARENTS_WITH_NONE=PARENTS_WITH_NONE+1;
if(source_system_idx > 0) {
source_system_parent_desk_none[source_system] += 1;
}
source_parent_desk_none[source] += 1;
}
} \
else \
{ \
CHILD_SUM=CHILD_SUM+1; \
if(source_system_idx > 0) {
source_system_child_sum[source_system] += 1;
}
source_child_sum[source] += 1;
DESK_IDX=index(var, "<25418>=");
DESK_TYPES_IDX=index(var, "<25417>=");
if(DESK_IDX > 0)
{
CHILD_WITH_DESK=CHILD_WITH_DESK+1;
if(source_system_idx > 0) {
source_system_child_desk[source_system] += 1;
}
source_child_desk[source] += 1;
}
if(DESK_TYPES_IDX > 0)
{
CHILD_WITH_DESK_TYPES=CHILD_WITH_DESK_TYPES+1;
desk_type=substr(var, DESK_TYPES_IDX + DESK_TYPE_LENGTH, 1);
if(source_system_idx > 0) {
source_system_child_desk_types[source_system] += 1;
}
source_child_desk_types[source] += 1;
}
if(DESK_IDX <= 0 && DESK_TYPES_IDX <= 0)
{
CHILD_WITH_NONE=CHILD_WITH_NONE+1;
if(source_system_idx > 0) {
source_system_child_desk_none[source_system] += 1;
}
source_child_desk_none[source] += 1;
}
} \
ENDED=getline var; \
} \
} \
END \
{ \
print "----------Consolidated Sum for Parent and Children-------------"; \
print "Number of Parents:" PARENT_SUM; \
print "Number of Children:" CHILD_SUM; \
print "Number of Parents with No Desk ID's:" PARENTS_WITH_NONE; \
print "Number of Parents with Desk ID's:" PARENTS_WITH_DESK; \
print "Number of Parents with No Desk Types:" PARENTS_WITH_DESK_TYPES; \
print "Number of Children with No Desk ID's:" CHILD_WITH_NONE; \
print "Number of Children with Desk ID's:" CHILD_WITH_DESK; \
print "Number of Children with No Desk Types:" CHILD_WITH_DESK_TYPES; \
print "----------Consolidated Sum for each SourceSystem on Parent------"; \
for (var in source_system_parent_sum) \
print var ", Source System Parent sum " source_system_parent_sum[var]; \
for (var in source_system_parent_desk) \
print var ", Source System Parent With Desk " source_system_parent_desk[var]; \
for (var in source_system_parent_desk_types) \
print var ", Source System Parent With DeskType " source_system_parent_desk_types[var]; \
for (var in source_system_parent_desk_none) \
print var ", Source System Parent With no Desk or DeskType " source_system_parent_desk_none[var]; \
print "----------Consolidated Sum for each SourceSystem on Children-----"; \
for (var in source_system_child_sum) \
print var ", Source System Child sum " source_system_child_sum[var]; \
for (var in source_system_child_desk) \
print var ", Source System Child With Desk " source_system_child_desk[var]; \
for (var in source_system_child_desk_types) \
print var ", Source System Child With DeskType " source_system_child_desk_types[var]; \
for (var in source_system_child_desk_none) \
print var ", Source System Child With no Desk or DeskType " source_system_child_desk_none[var]; \
print "----------Consolidated Sum for each Source on Parent--------------"; \
for (var in source_parent_sum) \
print var ", Source Parent sum " source_parent_sum[var]; \
for (var in source_parent_desk) \
print var ", Source Parent With Desk " source_parent_desk[var]; \
for (var in source_parent_desk_types) \
print var ", Source Parent With DeskType " source_parent_desk_types[var]; \
for (var in source_parent_desk_none) \
print var ", Source Parent With no Desk or DeskType " source_parent_desk_none[var]; \
print "----------Consolidated Sum for each Source on Children-------------"; \
for (var in source_child_sum) \
print var ", Source Child sum " source_child_sum[var]; \
for (var in source_child_desk) \
print var ", Source Child With Desk " source_child_desk[var]; \
for (var in source_child_desk_types) \
print var ", Source Child With DeskType " source_child_desk_types[var]; \
for (var in source_child_desk_none) \
print var ", Source Child With no Desk or DeskType " source_child_desk_none[var]; \
} \
line 25 ~ 29 can be combined into
line 33 ~ 35 and line 68 ~ 70
to reduce one if condition check.
line 33 ~ 35 and line 68 ~ 70
to reduce one if condition check.
ASKER
@wesly_chen:
Can you please elaborate?
Thanks.
Can you please elaborate?
Thanks.
line 25-29
--------
if(source_system_idx > 0) \
{ \
match(var, /SourceSystem<5177>=([^\]] *)\]/, source_system_arr); \
source_system=source_syste m_arr[1];
} \
------
And the following if--else statemnet
if(parent == -1)
# line 33-35 ----
if(source_system_idx > 0) {
source_system_parent_sum[s ource_syst em] += 1;
}
...
else
{
# line 68-70 ----
if(source_system_idx > 0) {
source_system_child_sum[so urce_syste m] += 1;
}
...
======================
They both check if condition, if(source_system_idx > 0)
So merge line 25-29 into
line 33 ~ 35 and line 68 ~ 70
--------
if(source_system_idx > 0) \
{ \
match(var, /SourceSystem<5177>=([^\]]
source_system=source_syste
} \
------
And the following if--else statemnet
if(parent == -1)
# line 33-35 ----
if(source_system_idx > 0) {
source_system_parent_sum[s
}
...
else
{
# line 68-70 ----
if(source_system_idx > 0) {
source_system_child_sum[so
}
...
======================
They both check if condition, if(source_system_idx > 0)
So merge line 25-29 into
line 33 ~ 35 and line 68 ~ 70
if(parent == -1)
# line 33-35 ----
if(source_system_idx > 0) {
match(var, /SourceSystem<5177>=([^\]]*)\]/, source_system_arr); \
source_system=source_system_arr[1];
source_system_parent_sum[source_system] += 1;
}
...
else
{
# line 68-70 ----
if(source_system_idx > 0) {
match(var, /SourceSystem<5177>=([^\]]*)\]/, source_system_arr); \
source_system=source_system_arr[1];
source_system_child_sum[source_system] += 1;
}
...
ASKER
I thought that match (to obtain the substring) might have some impact on performance. Hence, I tried to obtain the substring using match once at the start after getting each line.
in if--else, it is run once only, too. But you reduce one extra "if" check.
ASKER
Okay.
I have one more question:
Is this the best way to get a substring in awk?
match(var, /SourceSystem<5177>=([^\]] *)\]/, source_system_arr); \
source_system=source_syste m_arr[1];
And also, I am getting three substrings.
one for parentID, one for Source, and one for SourceSystem.
For that it processes the line three times, right?
Is there a way to get all the three substring processing the line only once?
I am running the script on a 107GB file. Lot of orders yesterday. It is taking like 2.5 hours to run the script. Hence I asked.
I have one more question:
Is this the best way to get a substring in awk?
match(var, /SourceSystem<5177>=([^\]]
source_system=source_syste
And also, I am getting three substrings.
one for parentID, one for Source, and one for SourceSystem.
For that it processes the line three times, right?
Is there a way to get all the three substring processing the line only once?
I am running the script on a 107GB file. Lot of orders yesterday. It is taking like 2.5 hours to run the script. Hence I asked.
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
Thanks for suggestions.
Besides, please provide longer message (at least includes 3~5 SourceSystem) in Code or File.