n78298 asked:
Change in Input feed based on condition file
My requirement is as follows-
In the below input file –
(In the input file there are 10 records, each starting with TSCM. In the UNIX file, one record occupies exactly one line, with no embedded newline characters.)
The transaction code is at positions 31:40.
The business code is at positions 318:321.
TSCM00000005837 [b]CM0002N -[/b]0000000001906.072010-12-10XML MM 201002081000000 YORK
00[b]3007[/b]XML MM 2010000000000* 00000000*
2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000008112 [b]CM0002N -[/b]0000000001906.072010-12-10XML MM 201002081000000 YORK
00[b]7777[/b]XML MM 2010000000000* 00000000*
2010-12-10-00.00.00123456789123450 GB 5045
TSCM80000005282 [b]CM0019NM +[/b]0000000002254.982010-12-10XML MM 201002081000001 YORK
00[b]0440[/b]XML MM 2010000000000* 00000000*
2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000000215 [b]CM0019NM +[/b]0000400002254.982010-12-10XML MM 201002081000001 YORK
00[b]0500[/b]XML MM 2010000000000* 00000000*
2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000000215 [b]CN0001N -[/b]0000400002254.982010-12-10XML MM 201002081000001 YORK
00[b]0292[/b]XML MM 2010000000000* 00000000*
2010-12-10-00.00.00123456789123450 GB 5045
TSCM80000005282 [b]CN0001N -[/b]0000400002254.982010-12-10XML MM 201002081000001 YORK
00[b]7843[/b]XML MM 2010000000000* 00000000*
2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000008107 [b]CN0001N -[/b]0000400002254.982010-12-10XML MM 201002081000001 YORK
00[b]0012[/b]XML MM 2010000000000* 00000000*
2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000008093 [b]CN0001P -[/b]0000400002254.982010-12-10XML MM 201002081000001 YORK
00[b]0379[/b]XML MM 2010000000000* 00000000*
2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000002646 [b]CN0001P -[/b]0000400002254.982010-12-10XML MM 201002081000001 YORK
00[b]7847[/b]XML MM 2010000000000* 00000000*
2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000002646 [b]CN0001P -[/b]0000400002254.982010-12-10XML MM 201002081000001 YORK
00[b]3400[/b]XML MM 2010000000000* 00000000*
2010-12-10-00.00.00123456789123450 GB 5045
For record 1-10 the combination of transaction code and business code is as following-
CM0002N - 3007
CM0002N - 7777
CM0019NM + 0440
CM0019NM + 0500
CN0001N - 0292
CN0001N - 7843
CN0001N - 0012
CN0001P - 0379
CN0001P - 7847
CN0001P - 3400
Further we have got a condition file which is having 3 values (comma separated file)
The first value is business code, second is transaction code and third is derived transaction code. The file is given below
3007,CM0002N -,CMCAMTN -
3037,CM0002N -,CMCAMTN -
3059,CM0002N -,CMCAMTN -
3007,CM0002N +,CMCAMTN +
3037,CM0002N +,CMCAMTN +
3059,CM0002N +,CMCAMTN +
3007,CM0002P -,CMCAMTP -
3037,CM0002P -,CMCAMTP -
3059,CM0002P -,CMCAMTP -
3007,CM0002P +,CMCAMTP +
3037,CM0002P +,CMCAMTP +
3059,CM0002P +,CMCAMTP +
0440,CM0019N -,CMDDPTN -
0440,CM0019N +,CMDDPTN +
0440,CM0019NM -,CMDDPTNM -
0440,CM0019NM +,CMDDPTNM +
0292,CN0001N -,CNQCSHN -
0379,CN0001N -,CNQCSHN -
1038,CN0001N -,CNQCSHN -
7810,CN0001N -,CNQCSHN -
7811,CN0001N -,CNQCSHN -
7812,CN0001N -,CNQCSHN -
7842,CN0001N -,CNQCSHN -
7843,CN0001N -,CNQCSHN -
0292,CN0001N +,CNQCSHN +
0379,CN0001N +,CNQCSHN +
1038,CN0001N +,CNQCSHN +
7810,CN0001N +,CNQCSHN +
7811,CN0001N +,CNQCSHN +
7812,CN0001N +,CNQCSHN +
7842,CN0001N +,CNQCSHN +
7843,CN0001N +,CNQCSHN +
0292,CN0001P -,CNQCSHP -
0379,CN0001P -,CNQCSHP -
1038,CN0001P -,CNQCSHP -
7810,CN0001P -,CNQCSHP -
7811,CN0001P -,CNQCSHP -
7812,CN0001P -,CNQCSHP -
7842,CN0001P -,CNQCSHP -
7843,CN0001P -,CNQCSHP -
7719,CN0001P -,CNCSHBP -
7800,CN0001P -,CNCSHBP -
7801,CN0001P -,CNCSHBP -
7802,CN0001P -,CNCSHBP -
7830,CN0001P -,CNCSHBP -
7831,CN0001P -,CNCSHBP -
7846,CN0001P -,CNCSHBP -
7847,CN0001P -,CNCSHBP -
The requirement is-
Check the business code and transaction code from the input file.
Compare the values with the conditions given in the condition file. If the business code (positions 318:321) and transaction code (positions 31:40) match the first and second values of any record in the condition file, replace the transaction code with the third value from the condition file.
After running the command/script, the output file should look like this (the transaction code at positions 31:40 changes wherever a condition is met):
TSCM00000005837 [b]CMCAMTN -[/b]0000000001906.072010-12-10XML MM 201002081000000 YORK
00[b]3007[/b]XML MM 2010000000000* 00000000*
2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000008112 [b]CM0002N -[/b]0000000001906.072010-12-10XML MM 201002081000000 YORK
00[b]7777[/b]XML MM 2010000000000* 00000000*
2010-12-10-00.00.00123456789123450 GB 5045
TSCM80000005282 [b]CMDDPTNM +[/b]0000000002254.982010-12-10XML MM 201002081000001 YORK
00[b]0440[/b]XML MM 2010000000000* 00000000*
2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000000215 [b]CM0019NM +[/b]0000400002254.982010-12-10XML MM 201002081000001 YORK
00[b]0500[/b]XML MM 2010000000000* 00000000*
2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000000215 [b]CNQCSHN -[/b]0000400002254.982010-12-10XML MM 201002081000001 YORK
00[b]0292[/b]XML MM 2010000000000* 00000000*
2010-12-10-00.00.00123456789123450 GB 5045
TSCM80000005282 [b]CNQCSHN -[/b]0000400002254.982010-12-10XML MM 201002081000001 YORK
00[b]7843[/b]XML MM 2010000000000* 00000000*
2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000008107 [b]CN0001N -[/b]0000400002254.982010-12-10XML MM 201002081000001 YORK
00[b]0012[/b]XML MM 2010000000000* 00000000*
2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000008093 [b]CNQCSHP -[/b]0000400002254.982010-12-10XML MM 201002081000001 YORK
00[b]0379[/b]XML MM 2010000000000* 00000000*
2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000002646 [b]CNCSHBP -[/b]0000400002254.982010-12-10XML MM 201002081000001 YORK
00[b]7847[/b]XML MM 2010000000000* 00000000*
2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000002646 [b]CN0001P -[/b]0000400002254.982010-12-10XML MM 201002081000001 YORK
00[b]3400[/b]XML MM 2010000000000* 00000000*
2010-12-10-00.00.00123456789123450 GB 5045
My input file has around 1 million records, and each record needs to be checked against all the conditions in the condition file.
In the input file, the transaction code and business code are marked with [b]...[/b] tags.
ASKER
Thanks for understanding the problem and giving the solution.
Firstly, the assumptions which are absolutely correct:
It takes two parameters: the first is the input file, the second is the comparison file.
If it can't find a match in the comparison/lookup file, it outputs the line as it was passed in.
Input file contains fixed-width fields.
The comparison/lookup file is treated as CSV (comma-separated file).
However, with all of the above, the solution is not working for me. I don't have bash; my UNIX shell is ksh.
When I run it with the first line as #!/bin/bash, nothing happens: no value is assigned to COMPLINE in any of the 10 examples.
When I run it with #!/bin/ksh, this error occurs:
"${line:108:4},${line:33:10}": bad substitution
Hi, hopefully the attached file will work. Apparently the version of ksh that ships with some Unix distributions is quite old (recent versions do support this syntax), so I have substituted those expansions with awk commands. This adds awk as a requirement, but it should work fine under ksh.
script.sh
One more comment: I just noticed that the [b] tags in the code are probably where you were trying to emphasise the key columns. I had included these in the positioning, so the position numbers in lines 3-6 will need to be tweaked slightly to:
SUB1=$(echo "$line" | awk '{print substr($0,99,4)}')
SUB2=$(echo "$line" | awk '{print substr($0,31,10)}')
LINESTART=$(echo "$line" | awk '{print substr($0,1,30)}')
LINEEND=$(echo "$line" | awk '{print substr($0,41)}')
ASKER
Thanks a lot for your constant help. I have implemented this and the code is working, but with so many assignments the performance is extremely poor. For 200 records it took 10 seconds, which means for 100,000 records (my input file size in production) it may take 80-90 minutes, which is unacceptable per the job time windows. Can't we make extensive use of sed or awk to make this task fast?
The final code which worked on my system is
while read line; do
COMPLINE=""
SUB1=$(echo "$line" | awk '{print substr($0,318,4)}')
SUB2=$(echo "$line" | awk '{print substr($0,31,10)}')
#COMPLINE=$(grep "${line:318:4},${line:31:10}" $2)
COMPLINE=`cat "$2" | grep "$SUB1,$SUB2"`
if [ ${#COMPLINE} -gt 0 ]; then
REPLACEMENT=$(echo "${COMPLINE}" | cut -d "," -f 3)
LINESTART=$(echo "$line" | awk '{print substr($0,1,30)}')
LINEEND=$(echo "$line" | awk '{print substr($0,41)}')
echo "${LINESTART}${REPLACEMENT}${LINEEND}"
else
echo "${line}"
#echo "NOMATCH! ${line:108:4},${line:33:10}"
fi
done < $1
At the moment I would imagine it is the number of processes it has to spawn, plus having to scan the whole comparison table for every line, that is slowing it down.
What would be an acceptable memory footprint for the process? I'm thinking that holding the comparison table in memory as a hash map should speed things up nicely, but if you have a 1 million row comparison table this would consume around 30 MB of memory.
ASKER
I'm not very sure of this, but yes, if you can help, we can try these things.
I still feel that if we use awk for the whole problem (the declarations and the if clause can all happen inside awk), it will definitely speed things up.
Could you look into this aspect as well?
Attached is an awk-only script which reads the comparison table into memory first; that, combined with its single-process model, makes it significantly faster (previous script: 48 sec per 10,000 records; this one: 0.02 sec per 10,000 records).
On my test dataset, with 10,048 rows in the comparison file and 10,000 rows in the main input file, the memory footprint of the awk process was 11,936 KB, but with a larger comparison file this will grow.
Usage: ./Q26419674 comparison_file input_file > output_file
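The attachment itself isn't reproduced here, but based on the description (load the comparison table into an associative array, then stream the input in a single awk process) the script presumably has roughly the shape sketched below. The fixed positions 31:40 and 318:321 come from the question; the file names, the sample record, and the %-10s padding of the derived code are illustrative assumptions:

```shell
#!/bin/sh
# Sketch of a single-pass awk lookup-and-replace (file names are made up).
# First build a one-row comparison file and one 321-column sample record.
printf '0440,CM0019NM +,CMDDPTNM +\n' > comp.csv
awk 'BEGIN {
    head = sprintf("%-30s", "TSCM80000005282")   # columns 1-30
    txn  = "CM0019NM +"                          # columns 31-40
    mid  = sprintf("%-277s", "")                 # filler out to column 317
    printf "%s%s%s%s\n", head, txn, mid, "0440"  # business code at 318-321
}' > input.txt

# Pass 1 (NR == FNR): load the CSV table into a hash map keyed on
# "businesscode,transactioncode". Pass 2: stream the fixed-width input,
# splicing the derived code into columns 31-40 on a match.
awk -F',' '
    NR == FNR {
        map[$1 "," $2] = $3
        next
    }
    {
        key = substr($0, 318, 4) "," substr($0, 31, 10)
        if (key in map)
            # pad the derived code out to the 10-character field width
            $0 = substr($0, 1, 30) sprintf("%-10s", map[key]) substr($0, 41)
        print
    }
' comp.csv input.txt > output.txt

cut -c31-40 output.txt    # shows the replaced transaction code
```

The `NR == FNR` idiom is the standard awk way to treat the first file as a lookup table; because awk spawns no per-line subprocesses, the cost per record is one hash lookup plus two substr() calls. The real attachment may differ in detail; this only illustrates the load-table-then-stream structure described above.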
Sorry forgot to attach the file
Q26419674
ASKER
Sorry, but this is giving some odd syntax errors, and I am not able to figure out what's wrong.
[ief /home/ief]$ new1 condition.csv df > output1
awk: syntax error near line 15
awk: illegal statement near line 15
awk: syntax error near line 21
awk: bailing out near line 21
condition.csv is the comparison file
df is the input file
Could you let me know the output of:
awk --version
Also, did you need to make any changes to the file to get to this point? I only ask because the line numbers don't seem to tie up.
ASKER
I removed all the blank lines and comments to get the exact lines where the error occurs.
It now shows:
[ief /home/ief]$ new1 condition.csv df
awk: syntax error near line 13
awk: illegal statement near line 13
awk: syntax error near line 19
awk: bailing out near line 19
If /usr/bin/nawk exists on your system, could you try changing the first line from:
#!/bin/awk -f
to:
#!/usr/bin/nawk -f
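On systems where /bin/awk is the pre-1985 "old awk", nawk, gawk, or /usr/xpg4/bin/awk usually provide the modern language. A small wrapper along these lines (the candidate names are assumptions; check what your box actually ships) can pick whichever is installed:

```shell
# Pick the first modern awk available; fall back to plain awk.
for candidate in nawk gawk mawk awk; do
    if command -v "$candidate" >/dev/null 2>&1; then
        AWK=$candidate
        break
    fi
done
echo "using: $AWK"
"$AWK" 'BEGIN { print 2 + 2 }'
```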
ASKER
Thanks for the solution and constant help.
The nawk change is working for me and performance is quite good: with 100,000 rows the script completed successfully in 50 seconds. But I have realised that my files can be as big as 2.5 million rows (2,500,000 rows).
However, I would like to clarify that my comparison file is static; it will only ever have around 100 rows.
Given the clarifications below, could you please explain what memory would be used?
Input file: 2.5 million rows
Comparison file: 100 rows
Awk processes the main input file one record/line at a time, so only one line is ever held in memory; the output is flushed after each record and redirected to disk, so memory usage should be minimal and O(1).
[b]100 records in comp, 10,000 records in input[/b]
/usr/bin/time -v ./Q26419674 comp in-10000 > /dev/null
Command being timed: "./Q26419674 comp in-10000"
User time (seconds): 0.02
System time (seconds): 0.00
Percent of CPU this job got: 92%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.02
[b]Maximum resident set size (kbytes): 3984[/b]
[b]100 records in comp, 100,000 records in input[/b]
/usr/bin/time -v ./Q26419674 comp in-100000 > /dev/null
Command being timed: "./Q26419674 comp in-100000"
User time (seconds): 0.21
System time (seconds): 0.02
Percent of CPU this job got: 98%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.24
[b]Maximum resident set size (kbytes): 3984[/b]
ASKER
Thanks a lot for your detailed explanation.
So this means there should not be any memory related issues as only one line will ever be held in memory.
I will get access to environments tomorrow where I will be able to test this solution with 2.5 million rows.
I will keep you updated.
ASKER
Amazing results. 2.5 million records getting processed in less than 3 minutes.
If you can confirm that there will be no memory-related issues, I can close this question and integrate the solution into my application.
ASKER CERTIFIED SOLUTION
ASKER
Thanks a lot.
i.e.
./script.sh inputFile lookupFile > outputFile
There were a couple of records in your example data that don't have matches in listing #3; I presume this is because they are only example sets. If it can't find a match in the comparison/lookup file it will output the line as it was passed in; if you want to change this behaviour, look at lines 10 & 11 in the script provided.
I have based this script on the assumption that the input file contains fixed-width fields, as that appears to be the case in your example data; the comparison/lookup file is treated as CSV.
The requirements for the attached script are:
bash, cut and grep. These have been tested on a Linux install; unfortunately I don't have access to a Unix terminal, so it is possible the cut/grep syntax differs slightly (it shouldn't, but it might).
script.sh