n78298

asked on

Change in Input feed based on condition file

My requirement is as follows:

In the input file below:

(The input file contains 10 records, each starting with TSCM. In the UNIX file each record is on a single line, with no embedded newline characters.)

The transaction code is at positions 31:40.
The business code is at positions 318:321.

TSCM00000005837               [b]CM0002N  -[/b]0000000001906.072010-12-10XML MM 201002081000000   YORK
 00[b]3007[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000008112               [b]CM0002N  -[/b]0000000001906.072010-12-10XML MM 201002081000000   YORK
 00[b]7777[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM80000005282               [b]CM0019NM +[/b]0000000002254.982010-12-10XML MM 201002081000001   YORK
 00[b]0440[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000000215               [b]CM0019NM +[/b]0000400002254.982010-12-10XML MM 201002081000001   YORK
 00[b]0500[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000000215               [b]CN0001N  -[/b]0000400002254.982010-12-10XML MM 201002081000001   YORK
 00[b]0292[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM80000005282               [b]CN0001N  -[/b]0000400002254.982010-12-10XML MM 201002081000001   YORK
 00[b]7843[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000008107               [b]CN0001N  -[/b]0000400002254.982010-12-10XML MM 201002081000001   YORK
 00[b]0012[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000008093               [b]CN0001P  -[/b]0000400002254.982010-12-10XML MM 201002081000001   YORK
 00[b]0379[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000002646               [b]CN0001P  -[/b]0000400002254.982010-12-10XML MM 201002081000001   YORK
 00[b]7847[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000002646               [b]CN0001P  -[/b]0000400002254.982010-12-10XML MM 201002081000001   YORK
 00[b]3400[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045

For records 1-10, the combination of transaction code and business code is as follows:

CM0002N  -    3007
CM0002N  -    7777
CM0019NM +    0440
CM0019NM +    0500
CN0001N  -    0292
CN0001N  -    7843
CN0001N  -    0012
CN0001P  -    0379
CN0001P  -    7847
CN0001P  -    3400


Further, we have a condition file which holds 3 comma-separated values per record.

The first value is the business code, the second is the transaction code, and the third is the derived transaction code. The file is given below:


3007,CM0002N  -,CMCAMTN  -
3037,CM0002N  -,CMCAMTN  -
3059,CM0002N  -,CMCAMTN  -
3007,CM0002N  +,CMCAMTN  +
3037,CM0002N  +,CMCAMTN  +
3059,CM0002N  +,CMCAMTN  +
3007,CM0002P  -,CMCAMTP  -
3037,CM0002P  -,CMCAMTP  -
3059,CM0002P  -,CMCAMTP  -
3007,CM0002P  +,CMCAMTP  +
3037,CM0002P  +,CMCAMTP  +
3059,CM0002P  +,CMCAMTP  +
0440,CM0019N  -,CMDDPTN  -
0440,CM0019N  +,CMDDPTN  +
0440,CM0019NM -,CMDDPTNM -
0440,CM0019NM +,CMDDPTNM +
0292,CN0001N  -,CNQCSHN  -
0379,CN0001N  -,CNQCSHN  -
1038,CN0001N  -,CNQCSHN  -
7810,CN0001N  -,CNQCSHN  -
7811,CN0001N  -,CNQCSHN  -
7812,CN0001N  -,CNQCSHN  -
7842,CN0001N  -,CNQCSHN  -
7843,CN0001N  -,CNQCSHN  -
0292,CN0001N  +,CNQCSHN  +
0379,CN0001N  +,CNQCSHN  +
1038,CN0001N  +,CNQCSHN  +
7810,CN0001N  +,CNQCSHN  +
7811,CN0001N  +,CNQCSHN  +
7812,CN0001N  +,CNQCSHN  +
7842,CN0001N  +,CNQCSHN  +
7843,CN0001N  +,CNQCSHN  +
0292,CN0001P  -,CNQCSHP  -
0379,CN0001P  -,CNQCSHP  -
1038,CN0001P  -,CNQCSHP  -
7810,CN0001P  -,CNQCSHP  -
7811,CN0001P  -,CNQCSHP  -
7812,CN0001P  -,CNQCSHP  -
7842,CN0001P  -,CNQCSHP  -
7843,CN0001P  -,CNQCSHP  -
7719,CN0001P  -,CNCSHBP  -
7800,CN0001P  -,CNCSHBP  -
7801,CN0001P  -,CNCSHBP  -
7802,CN0001P  -,CNCSHBP  -
7830,CN0001P  -,CNCSHBP  -
7831,CN0001P  -,CNCSHBP  -
7846,CN0001P  -,CNCSHBP  -
7847,CN0001P  -,CNCSHBP  -




The requirement is:

Check the business code and transaction code from the input file.

Compare these values with the conditions given in the condition file. If the business code (positions 318:321) and transaction code (positions 31:40) match the first and second values of any record in the condition file, replace the transaction code with the third value of that record.

After running the command/script, the output file should look like this (the transaction code at positions 31:40 changes wherever the condition is met):


TSCM00000005837               [b]CMCAMTN  -[/b]0000000001906.072010-12-10XML MM 201002081000000   YORK
 00[b]3007[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000008112               [b]CM0002N  -[/b]0000000001906.072010-12-10XML MM 201002081000000   YORK
 00[b]7777[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM80000005282               [b]CMDDPTNM +[/b]0000000002254.982010-12-10XML MM 201002081000001   YORK
 00[b]0440[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000000215               [b]CM0019NM +[/b]0000400002254.982010-12-10XML MM 201002081000001   YORK
 00[b]0500[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000000215               [b]CNQCSHN  -[/b]0000400002254.982010-12-10XML MM 201002081000001   YORK
 00[b]0292[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM80000005282               [b]CNQCSHN  -[/b]0000400002254.982010-12-10XML MM 201002081000001   YORK
 00[b]7843[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000008107               [b]CN0001N  -[/b]0000400002254.982010-12-10XML MM 201002081000001   YORK
 00[b]0012[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000008093               [b]CNQCSHP  -[/b]0000400002254.982010-12-10XML MM 201002081000001   YORK
 00[b]0379[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000002646               [b]CNCSHBP  -[/b]0000400002254.982010-12-10XML MM 201002081000001   YORK
 00[b]7847[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000002646               [b]CN0001P  -[/b]0000400002254.982010-12-10XML MM 201002081000001   YORK
 00[b]3400[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045




My input file has around 1 million records, and each needs to be checked against all the conditions in the condition file.

In the input file above, the transaction code and business code are marked with [b]...[/b] tags for emphasis.
mcuk_storm

I believe the attached script will do what you require. It takes two parameters: the first is the input file, the second is the comparison file (as given in code listing #3).

i.e.
./script.sh inputFile lookupFile > outputFile

There were a couple of records in your example data that don't have matches in listing #3; I presume this is because they are only example sets. In the case where it can't find a match in the comparison/lookup file, it will output the line as it was passed in; if you want to change this behaviour, look at lines 10 & 11 in the script provided.

I have based this script on the assumption that the input file contains fixed-width fields, as that appears to be the case in your example data; the comparison/lookup file is treated as CSV.

The requirements for the attached script are bash, cut and grep. These have been tested on a Linux install; unfortunately I don't have access to a UNIX terminal, so the syntax for cut/grep could be slightly different (it shouldn't be, but it might).

script.sh
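The attached script itself isn't reproduced on this page, but a self-contained sketch of the approach described above (bash substring extraction plus a grep lookup against the CSV file) might look like the following. The file names, sample data, and field offsets here are all illustrative stand-ins, not the originals:

```shell
#!/bin/bash
# Sketch of the described approach: for each fixed-width input line, look up
# "businesscode,transactioncode" in the CSV lookup file and splice in the
# derived transaction code when a match is found. Offsets are illustrative.

# Tiny sample files so the sketch is self-contained (not the real data).
cat > /tmp/ee_input.txt <<'EOF'
AAAAAAAAAACM0002N  -RESTOFLINE3007TAIL
AAAAAAAAAACX9999X  -RESTOFLINE9999TAIL
EOF
cat > /tmp/ee_lookup.csv <<'EOF'
3007,CM0002N  -,CMCAMTN  -
EOF

while IFS= read -r line; do
  sub1=${line:30:4}     # business code (chars 31-34 in this sample layout)
  sub2=${line:10:10}    # transaction code (chars 11-20 in this sample layout)
  compline=$(grep -F "$sub1,$sub2" /tmp/ee_lookup.csv | head -n 1)
  if [ -n "$compline" ]; then
    replacement=$(printf '%s\n' "$compline" | cut -d',' -f3)
    printf '%s%s%s\n' "${line:0:10}" "$replacement" "${line:20}"
  else
    printf '%s\n' "$line"   # no match: pass the line through unchanged
  fi
done < /tmp/ee_input.txt > /tmp/ee_output.txt

cat /tmp/ee_output.txt
```

Note that the `${line:offset:length}` substring expansion is a bash feature, which is exactly what causes trouble on older ksh later in this thread.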
n78298 (ASKER)

Thanks for understanding the problem and giving the solution.

Firstly, the assumptions which are absolutely correct:

It takes two parameters: the first is the input file, the second is the comparison file.
In case it can't find a match in the comparison/lookup file, it outputs the line as it was passed in.
The input file contains fixed-width fields.
The comparison/lookup file is treated as CSV (comma-separated file).

However, with all of the above, the solution is not working for me. I don't have bash; my UNIX shell is ksh.

When I run it with #!/bin/bash as the first line, nothing happens. There is no value assigned to COMPLINE in any of the 10 examples.

When I run it with #!/bin/ksh, this error occurs:

"${line:108:4},${line:33:10}": bad substitution

Hi, hopefully the attached file will work. Apparently the version of ksh that comes with some UNIX distributions is quite old; recent versions do support this syntax. Anyway, I have substituted these expansions with awk commands, which adds awk as a requirement, but it should work OK with ksh.


script.sh
One more comment: I just noticed that the [b] tags in the code are probably where you were trying to emphasise the key columns. I had included these in the positioning, so the position numbers in lines 3-6 of the script will need to be tweaked slightly to:

  SUB1=$(echo "$line" | awk '{print substr($0,99,4)}')
  SUB2=$(echo "$line" | awk '{print substr($0,31,10)}')
  LINESTART=$(echo "$line" | awk '{print substr($0,1,30)}')
  LINEEND=$(echo "$line" | awk '{print substr($0,41)}')
n78298 (ASKER)

Thanks a lot for your constant help. I have implemented this and the code is working, but with so many assignments the performance is extremely poor. For 200 records it took 10 seconds, which means that for 100,000 records (my input file size in production) it may take 80-90 minutes, which is unacceptable per the job time windows. Can't we make extensive use of sed or awk to make this task fast?

The final code which worked on my system is:



while read line; do
  COMPLINE=""
  # Extract the business code (positions 318-321) and transaction code (31-40)
  SUB1=$(echo "$line" | awk '{print substr($0,318,4)}')
  SUB2=$(echo "$line" | awk '{print substr($0,31,10)}')

  # Look the pair up in the comparison file
  COMPLINE=$(grep "$SUB1,$SUB2" "$2")
  if [ ${#COMPLINE} -gt 0 ]; then
    # Match found: splice the derived code into positions 31-40
    REPLACEMENT=$(echo "${COMPLINE}" | cut -d "," -f 3)
    LINESTART=$(echo "$line" | awk '{print substr($0,1,30)}')
    LINEEND=$(echo "$line" | awk '{print substr($0,41)}')
    echo "${LINESTART}${REPLACEMENT}${LINEEND}"
  else
    # No match: output the line unchanged
    echo "${line}"
  fi
done < "$1"


At the moment I would imagine it is the number of processes it has to spawn, and having to re-read the comparison table each time, that is slowing it down.

What would be an acceptable memory footprint for the process? I'm thinking that holding the comparison table in memory as a hash map should speed things up nicely, but if you have a 1 million row comparison table this would consume around 30 MB of memory.
n78298 (ASKER)

I am not very sure of this, but yes, if you can help we can try these things.

I still feel that if we use awk for the problem (the declarations and if clauses can all happen in awk), it will definitely speed things up.

Can you look into this aspect as well?
Attached is an awk-only script which reads the comparison table into memory first; combined with its single-process model this is significantly faster (the previous script was 48 sec/10,000 records, this one is 0.02 sec/10,000 records).

On my test dataset, with 10,048 rows in the comparison file and 10,000 rows in the main input file, the memory footprint of the awk process was 11,936 KB, but with a larger comparison file this will grow.

Usage: ./Q26419674 comparison_file input_file > output_file
Sorry, forgot to attach the file.
Q26419674
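The Q26419674 attachment isn't shown on this page; a minimal, self-contained sketch of the single-process technique described above (comparison table loaded into an awk associative array, then one streaming pass over the input) might look like this. The sample files and field offsets are made up for the demo, not the original script:

```shell
# Illustrative stand-in for the unattached script: load the CSV comparison
# table into an associative array, then rewrite matching input lines in a
# single awk pass. Sample data and offsets are invented for the demo.
cat > /tmp/ee_lookup2.csv <<'EOF'
3007,CM0002N  -,CMCAMTN  -
EOF
cat > /tmp/ee_input2.txt <<'EOF'
AAAAAAAAAACM0002N  -RESTOFLINE3007TAIL
AAAAAAAAAACX9999X  -RESTOFLINE9999TAIL
EOF

awk '
  # First file (FNR==NR): build map["businesscode,transactioncode"] = derived
  FNR == NR { split($0, f, ","); map[f[1] "," f[2]] = f[3]; next }
  # Second file: pull both keys from their fixed positions and substitute
  {
    key = substr($0, 31, 4) "," substr($0, 11, 10)
    if (key in map)
      print substr($0, 1, 10) map[key] substr($0, 21)
    else
      print
  }
' /tmp/ee_lookup2.csv /tmp/ee_input2.txt > /tmp/ee_output2.txt

cat /tmp/ee_output2.txt
```

Because the lookup is a hash-table access rather than a grep over the whole comparison file, each input line costs O(1), and no subprocesses are spawned per record.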
n78298 (ASKER)

Sorry, but this is giving some funny syntax errors. I am not able to figure out what's wrong.

[ief /home/ief]$ new1 condition.csv df > output1
awk: syntax error near line 15
awk: illegal statement near line 15
awk: syntax error near line 21
awk: bailing out near line 21

condition.csv is the comparison file
df is the input file
Could you let me know the output of:
awk --version

Also, did you need to make any changes to the file to get to this point? I only ask as the line numbers don't seem to tie up.
n78298 (ASKER)

Removed all the blank lines and comments to get the exact lines where the error is coming from.

It now shows:

[ief /home/ief]$ new1 condition.csv df
awk: syntax error near line 13
awk: illegal statement near line 13
awk: syntax error near line 19
awk: bailing out near line 19
If /usr/bin/nawk exists on your system, could you try changing the first line from:
#!/bin/awk -f

to:
#!/usr/bin/nawk -f
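As an aside, before committing to a shebang path, one way to hedge against this kind of portability problem is to check which awk variants the system actually provides. The paths below are the usual candidates, not guaranteed locations:

```shell
# List the awk variants present on this system (paths are common defaults;
# on old Solaris boxes /usr/bin/awk is the pre-POSIX awk and nawk or
# /usr/xpg4/bin/awk is the one that accepts modern syntax).
for a in awk nawk gawk /usr/xpg4/bin/awk; do
  if command -v "$a" >/dev/null 2>&1; then
    echo "found: $a"
  fi
done > /tmp/ee_awks.txt
cat /tmp/ee_awks.txt
```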
n78298 (ASKER)

Thanks for the solution and constant help.

The nawk thing is working for me and performance is quite good. With 100,000 rows the script completed successfully in 50 seconds. But I have realised that my files can be as big as 2.5 million rows (2,500,000 rows).

However, I would like to clarify that my comparison file is static. It will only have around 100 rows.

With the below clarifications, could you please explain what memory would be used:

Input file having 2.5 million rows
Comparison file having 100 rows
Awk will process the main input file a record/line at a time, so only one line will ever be held in memory; the output is flushed after each record and redirected to disk, so the memory usage should be minimal and O(1) in the input size.

[b]100 records in comp, 10,000 records in input[/b]
/usr/bin/time -v ./Q26419674 comp in-10000 > /dev/null
        Command being timed: "./Q26419674 comp in-10000"
        User time (seconds): 0.02
        System time (seconds): 0.00
        Percent of CPU this job got: 92%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.02
        [b]Maximum resident set size (kbytes): 3984[/b]


[b]100 records in comp, 100,000 records in input[/b]
/usr/bin/time -v ./Q26419674 comp in-100000 > /dev/null
        Command being timed: "./Q26419674 comp in-100000"
        User time (seconds): 0.21
        System time (seconds): 0.02
        Percent of CPU this job got: 98%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.24
        [b]Maximum resident set size (kbytes): 3984[/b]
n78298 (ASKER)

Thanks a lot for your detailed explanation.

So this means there should not be any memory-related issues, as only one line will ever be held in memory.

I will get access to environments tomorrow where I will be able to test this solution with 2.5 million rows.

I will keep you updated.
n78298 (ASKER)

Amazing results. 2.5 million records are processed in less than 3 minutes.

If you can confirm that there will be no memory-related issues, I can close on this solution and integrate it into my application.
ASKER CERTIFIED SOLUTION
mcuk_storm
n78298 (ASKER)

Thanks a lot.