Solved

Change in Input feed based on condition file

Posted on 2010-08-21
233 Views
Last Modified: 2012-05-10
My requirement is as follows:

In the input file below:

(The input file contains 10 records, each starting with TSCM. In the UNIX file each record is on a single line only, with no newline characters within it.)

The transaction code is at positions 31:40.
The business code is at positions 318:321.

TSCM00000005837               [b]CM0002N  -[/b]0000000001906.072010-12-10XML MM 201002081000000   YORK
 00[b]3007[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000008112               [b]CM0002N  -[/b]0000000001906.072010-12-10XML MM 201002081000000   YORK
 00[b]7777[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM80000005282               [b]CM0019NM +[/b]0000000002254.982010-12-10XML MM 201002081000001   YORK
 00[b]0440[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000000215               [b]CM0019NM +[/b]0000400002254.982010-12-10XML MM 201002081000001   YORK
 00[b]0500[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000000215               [b]CN0001N  -[/b]0000400002254.982010-12-10XML MM 201002081000001   YORK
 00[b]0292[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM80000005282               [b]CN0001N  -[/b]0000400002254.982010-12-10XML MM 201002081000001   YORK
 00[b]7843[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000008107               [b]CN0001N  -[/b]0000400002254.982010-12-10XML MM 201002081000001   YORK
 00[b]0012[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000008093               [b]CN0001P  -[/b]0000400002254.982010-12-10XML MM 201002081000001   YORK
 00[b]0379[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000002646               [b]CN0001P  -[/b]0000400002254.982010-12-10XML MM 201002081000001   YORK
 00[b]7847[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000002646               [b]CN0001P  -[/b]0000400002254.982010-12-10XML MM 201002081000001   YORK
 00[b]3400[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045

For records 1-10, the combination of transaction code and business code is as follows:

CM0002N  -    3007
CM0002N  -    7777
CM0019NM +    0440
CM0019NM +    0500
CN0001N  -    0292
CN0001N  -    7843
CN0001N  -    0012
CN0001P  -    0379
CN0001P  -    7847
CN0001P  -    3400

We also have a condition file containing three values per record (a comma-separated file).

The first value is the business code, the second is the transaction code, and the third is the derived transaction code. The file is given below:


3007,CM0002N  -,CMCAMTN  -
3037,CM0002N  -,CMCAMTN  -
3059,CM0002N  -,CMCAMTN  -
3007,CM0002N  +,CMCAMTN  +
3037,CM0002N  +,CMCAMTN  +
3059,CM0002N  +,CMCAMTN  +
3007,CM0002P  -,CMCAMTP  -
3037,CM0002P  -,CMCAMTP  -
3059,CM0002P  -,CMCAMTP  -
3007,CM0002P  +,CMCAMTP  +
3037,CM0002P  +,CMCAMTP  +
3059,CM0002P  +,CMCAMTP  +
0440,CM0019N  -,CMDDPTN  -
0440,CM0019N  +,CMDDPTN  +
0440,CM0019NM -,CMDDPTNM -
0440,CM0019NM +,CMDDPTNM +
0292,CN0001N  -,CNQCSHN  -
0379,CN0001N  -,CNQCSHN  -
1038,CN0001N  -,CNQCSHN  -
7810,CN0001N  -,CNQCSHN  -
7811,CN0001N  -,CNQCSHN  -
7812,CN0001N  -,CNQCSHN  -
7842,CN0001N  -,CNQCSHN  -
7843,CN0001N  -,CNQCSHN  -
0292,CN0001N  +,CNQCSHN  +
0379,CN0001N  +,CNQCSHN  +
1038,CN0001N  +,CNQCSHN  +
7810,CN0001N  +,CNQCSHN  +
7811,CN0001N  +,CNQCSHN  +
7812,CN0001N  +,CNQCSHN  +
7842,CN0001N  +,CNQCSHN  +
7843,CN0001N  +,CNQCSHN  +
0292,CN0001P  -,CNQCSHP  -
0379,CN0001P  -,CNQCSHP  -
1038,CN0001P  -,CNQCSHP  -
7810,CN0001P  -,CNQCSHP  -
7811,CN0001P  -,CNQCSHP  -
7812,CN0001P  -,CNQCSHP  -
7842,CN0001P  -,CNQCSHP  -
7843,CN0001P  -,CNQCSHP  -
7719,CN0001P  -,CNCSHBP  -
7800,CN0001P  -,CNCSHBP  -
7801,CN0001P  -,CNCSHBP  -
7802,CN0001P  -,CNCSHBP  -
7830,CN0001P  -,CNCSHBP  -
7831,CN0001P  -,CNCSHBP  -
7846,CN0001P  -,CNCSHBP  -
7847,CN0001P  -,CNCSHBP  -

The requirement is:

Check the business code and transaction code in the input file.

Compare the values against the conditions in the condition file. If the business code (positions 318:321) and the transaction code (positions 31:40) match the first and second values of any record in the condition file, replace the transaction code with the third value from the condition file.

After running the command/script the output file should look like this (the transaction code at positions 31:40 changes wherever a condition is met):


TSCM00000005837               [b]CMCAMTN  -[/b]0000000001906.072010-12-10XML MM 201002081000000   YORK
 00[b]3007[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000008112               [b]CM0002N  -[/b]0000000001906.072010-12-10XML MM 201002081000000   YORK
 00[b]7777[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM80000005282               [b]CMDDPTNM +[/b]0000000002254.982010-12-10XML MM 201002081000001   YORK
 00[b]0440[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000000215               [b]CM0019NM +[/b]0000400002254.982010-12-10XML MM 201002081000001   YORK
 00[b]0500[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000000215               [b]CNQCSHN  -[/b]0000400002254.982010-12-10XML MM 201002081000001   YORK
 00[b]0292[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM80000005282               [b]CNQCSHN  -[/b]0000400002254.982010-12-10XML MM 201002081000001   YORK
 00[b]7843[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000008107               [b]CN0001N  -[/b]0000400002254.982010-12-10XML MM 201002081000001   YORK
 00[b]0012[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000008093               [b]CNQCSHP  -[/b]0000400002254.982010-12-10XML MM 201002081000001   YORK
 00[b]0379[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000002646               [b]CNCSHBP  -[/b]0000400002254.982010-12-10XML MM 201002081000001   YORK
 00[b]7847[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000002646               [b]CN0001P  -[/b]0000400002254.982010-12-10XML MM 201002081000001   YORK
 00[b]3400[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045

My input file has around 1 million records, and each needs to be checked against all the conditions in the condition file.

In the listings above the transaction code and business code are wrapped in [b]…[/b] tags for emphasis only; the tags are not part of the actual file.
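As a sanity check before any rewriting, the two fields can be pulled out of each record with a single awk pass. This is a minimal sketch using the 1-based positions stated above; the function name `extract_codes` is illustrative only:

```shell
# Print "transaction_code business_code" for every record of the file
# given as $1, using the fixed 1-based column positions from the
# description: transaction code at 31-40, business code at 318-321.
extract_codes() {
    awk '{ print substr($0, 31, 10), substr($0, 318, 4) }' "$1"
}
```

Run as `extract_codes inputfile`; each output line should match one row of the combination listing above.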
Comment
Question by:n78298
19 Comments
 
LVL 7

Expert Comment

by:mcuk_storm
ID: 33492062
I believe the attached script will do what you require. It takes two parameters: the first is the input file, the second is the comparison file (as given in code listing #3).

i.e.
./script.sh inputFile lookupFile > outputFile

There were a couple of records in your example data that don't have matches in listing #3; I presume this is because they are only example sets. In the case where it can't find a match in the comparison/lookup file, it outputs the line as it was passed in; if you want to change this behaviour, look at lines 10 & 11 in the script provided.

I have based this script on the assumption that the input file contains fixed-width fields, as that appears to be the case in your example data; the comparison/lookup file is treated as CSV.

The requirements for the attached script are:
bash, cut, and grep. These have been tested on a Linux install; unfortunately I don't have access to a UNIX terminal, so the cut/grep syntax could differ slightly (it shouldn't, but it might).

script.sh
 

Author Comment

by:n78298
ID: 33492634
Thanks for understanding the problem and giving the solution.

Firstly, the assumptions which are absolutely correct:

It takes two parameters: the first is the input file, the second is the comparison file.
In case it can't find a match in the comparison/lookup file, it outputs the line as it was passed in.
The input file contains fixed-width fields.
The comparison/lookup file is treated as CSV (comma-separated).

However, with all of the above, the solution is not working for me. I don't have bash; my UNIX shell is ksh.

When I run it with #!/bin/bash as the first line, nothing happens: no value is assigned to COMPLINE in any of the 10 examples.

When I run it with #!/bin/ksh, this error occurs:

"${line:108:4},${line:33:10}": bad substitution

 
LVL 7

Expert Comment

by:mcuk_storm
ID: 33492732
Hi, hopefully the attached file will work. Apparently the version of ksh that ships with some UNIX distributions is quite old (recent versions do support this syntax), so I have substituted those expansions with awk commands, which adds awk as a requirement but should work fine under ksh.
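For reference, the change swaps bash/ksh93 substring expansion for a portable awk call; a minimal illustration of the equivalence (the sample string is made up):

```shell
line='TSCM00000005837EXAMPLE'
# bash/ksh93 only (0-based offset): ${line:0:4} -> "TSCM"
# Portable equivalent via awk substr (1-based position):
prefix=$(echo "$line" | awk '{print substr($0, 1, 4)}')
```

Note that `substr` counts from 1, while `${var:offset:length}` counts from 0, so the offsets differ by one when converting.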


script.sh
 
LVL 7

Expert Comment

by:mcuk_storm
ID: 33492773
One more comment: I just noticed that the [b] tags in the code listings are probably where you were trying to emphasise the key columns. I had counted these in the positions, so the position numbers in lines 3-6 need to be tweaked slightly to:

  SUB1=$(echo "$line" | awk '{print substr($0,99,4)}')
  SUB2=$(echo "$line" | awk '{print substr($0,31,10)}')
  LINESTART=$(echo "$line" | awk '{print substr($0,0,30)}')
  LINEEND=$(echo "$line" | awk '{print substr($0,41)}')
 

Author Comment

by:n78298
ID: 33492983
Thanks a lot for your constant help. I have implemented this and the code is working, but with so many assignments the performance is extremely poor: 200 records took 10 seconds, which means 100,000 records (my input file size in production) may take 80-90 minutes, which is unacceptable within the job time windows. Can't we make extensive use of sed or awk to speed this task up?

The final code which worked on my system is:



while read line; do
  COMPLINE=""
  SUB1=$(echo "$line" | awk '{print substr($0,318,4)}')
  SUB2=$(echo "$line" | awk '{print substr($0,31,10)}')

  #COMPLINE=$(grep "${line:318:4},${line:31:10}" $2)
  COMPLINE=`cat "$2" | grep "$SUB1,$SUB2"`
  if [ ${#COMPLINE} -gt 0 ]; then
    REPLACEMENT=$(echo "${COMPLINE}" | cut -d "," -f 3)
    LINESTART=$(echo "$line" | awk '{print substr($0,0,30)}')
    LINEEND=$(echo "$line" | awk '{print substr($0,41)}')
    echo "${LINESTART}${REPLACEMENT}${LINEEND}"
  else
    echo "${line}"
    #echo "NOMATCH! ${line:108:4},${line:33:10}"
  fi
done < $1

 
LVL 7

Expert Comment

by:mcuk_storm
ID: 33494065
At the moment I would imagine it is the number of processes it has to spawn, and having to scan the comparison table for every line, that is slowing it down.

What would be an acceptable memory footprint for the process? I'm thinking that holding the comparison table in memory as a hash map should speed things up nicely, but if you have a 1-million-row comparison table this would consume around 30 MB of memory.
 

Author Comment

by:n78298
ID: 33494187
I'm not very sure of this, but yes, if you can help, we can try these things.

I still feel that if we use awk for the whole problem (the declarations and the if clause can all happen in awk), it will definitely speed things up.

Can you look into this aspect as well?
 
LVL 7

Expert Comment

by:mcuk_storm
ID: 33494213
Attached is an awk-only script which reads the comparison table into memory first; combined with its single-process model it is significantly faster (the previous script ran at 48 sec per 10,000 records, this one at 0.02 sec per 10,000 records).

On my test dataset, with 10,048 rows in the comparison file and 10,000 rows in the main input file, the memory footprint of awk was 11,936 KB, but with a larger comparison file this will grow.

Usage: ./Q26419674 comparison_file input_file > output_file
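The attachment itself is not reproduced in the thread, but the approach described — load the comparison CSV into an awk array keyed on business code plus transaction code, then rewrite columns 31-40 of each matching record in a single pass — can be sketched as a small shell function. The name `replace_codes` and the exact layout assumptions are reconstructed from the examples above, not taken from the attachment:

```shell
# replace_codes comparison_file input_file
# Pass 1 (NR == FNR): read the comma-separated comparison file and map
# "business_code,transaction_code" -> derived transaction code.
# Pass 2: for each fixed-width input record, look up the pair taken from
# columns 318-321 (business) and 31-40 (transaction); on a match, splice
# in the derived code, otherwise emit the record unchanged.
replace_codes() {
    awk -F',' '
        NR == FNR { map[$1 "," $2] = $3; next }
        {
            key = substr($0, 318, 4) "," substr($0, 31, 10)
            if (key in map)
                print substr($0, 1, 30) map[key] substr($0, 41)
            else
                print
        }
    ' "$1" "$2"
}
```

Usage mirrors the accepted script's argument order: `replace_codes condition.csv inputfile > outputfile`.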
 
LVL 7

Expert Comment

by:mcuk_storm
ID: 33494214
Sorry, forgot to attach the file.
Q26419674

 

Author Comment

by:n78298
ID: 33494300
Sorry, but this is giving some funny syntax errors. I am not able to figure out what's wrong.

[ief /home/ief]$ new1 condition.csv df > output1
awk: syntax error near line 15
awk: illegal statement near line 15
awk: syntax error near line 21
awk: bailing out near line 21

condition.csv is the comparison file
df is the input file
 
LVL 7

Expert Comment

by:mcuk_storm
ID: 33494304
Could you let me know the output of:
awk --version

Also, did you need to make any changes to the file to get to this point? I only ask because the line numbers don't seem to tie up.
 

Author Comment

by:n78298
ID: 33494305
I removed all the blank lines and comments to get the exact lines where the error occurs.

It now shows:

[ief /home/ief]$ new1 condition.csv df
awk: syntax error near line 13
awk: illegal statement near line 13
awk: syntax error near line 19
awk: bailing out near line 19
 
LVL 7

Expert Comment

by:mcuk_storm
ID: 33494326
if /usr/bin/nawk exists on your system could you try changing the first line from:
#!/bin/awk -f

to:
#!/usr/bin/nawk -f
 

Author Comment

by:n78298
ID: 33494570
Thanks for the solution and constant help.

The nawk change is working for me and performance is quite good: with 100,000 rows the script completed successfully in 50 seconds. But I have realised that my files can be as big as 2.5 million rows (2,500,000 rows).

However, I would like to clarify that my comparison file is static; it will only ever have around 100 rows.

Given the clarifications below, could you please explain what memory would be used?

Input file: 2.5 million rows
Comparison file: 100 rows
 
LVL 7

Expert Comment

by:mcuk_storm
ID: 33494931
Awk processes the main input file one record/line at a time, so only one line is ever held in memory. The output is flushed after each record and redirected to disk, so memory usage should be minimal and O(1).

[b]100 records in comp, 10,000 records in input[/b]
/usr/bin/time -v ./Q26419674 comp in-10000 > /dev/null
        Command being timed: "./Q26419674 comp in-10000"
        User time (seconds): 0.02
        System time (seconds): 0.00
        Percent of CPU this job got: 92%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.02
        [b]Maximum resident set size (kbytes): 3984[/b]


[b]100 records in comp, 100,000 records in input[/b]
/usr/bin/time -v ./Q26419674 comp in-100000 > /dev/null
        Command being timed: "./Q26419674 comp in-100000"
        User time (seconds): 0.21
        System time (seconds): 0.02
        Percent of CPU this job got: 98%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.24
        [b]Maximum resident set size (kbytes): 3984[/b]
 

Author Comment

by:n78298
ID: 33496395
Thanks a lot for your detailed explanation.

So this means there should not be any memory-related issues, as only one line is ever held in memory.

I will get access to environments tomorrow where I will be able to test this solution with 2.5 million rows.

I will keep you updated.
 

Author Comment

by:n78298
ID: 33500018
Amazing results: 2.5 million records processed in less than 3 minutes.

If you can confirm that there will be no memory-related issues, I can close this question and integrate the solution into my application.
 
LVL 7

Accepted Solution

by:
mcuk_storm earned 500 total points
ID: 33503338
There shouldn't be any issues with memory. As you can see from the two outputs I posted earlier, the memory consumption when processing 10,000 and 100,000 records was the same; consumption is primarily a fixed overhead plus the comparison table loaded into memory.
 

Author Closing Comment

by:n78298
ID: 33556056
Thanks a lot.