Solved

Change in Input feed based on condition file

Posted on 2010-08-21
241 Views
Last Modified: 2012-05-10
My requirement is as follows-

In the below input file –

(The input file contains 10 records, each starting with TSCM. In the UNIX file each record is on a single line, with no embedded newline characters.)

The transaction code is at position 31:40.
The business code is at position 318:321.

TSCM00000005837               [b]CM0002N  -[/b]0000000001906.072010-12-10XML MM 201002081000000   YORK
 00[b]3007[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000008112               [b]CM0002N  -[/b]0000000001906.072010-12-10XML MM 201002081000000   YORK
 00[b]7777[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM80000005282               [b]CM0019NM +[/b]0000000002254.982010-12-10XML MM 201002081000001   YORK
 00[b]0440[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000000215               [b]CM0019NM +[/b]0000400002254.982010-12-10XML MM 201002081000001   YORK
 00[b]0500[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000000215               [b]CN0001N  -[/b]0000400002254.982010-12-10XML MM 201002081000001   YORK
 00[b]0292[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM80000005282               [b]CN0001N  -[/b]0000400002254.982010-12-10XML MM 201002081000001   YORK
 00[b]7843[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000008107               [b]CN0001N  -[/b]0000400002254.982010-12-10XML MM 201002081000001   YORK
 00[b]0012[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000008093               [b]CN0001P  -[/b]0000400002254.982010-12-10XML MM 201002081000001   YORK
 00[b]0379[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000002646               [b]CN0001P  -[/b]0000400002254.982010-12-10XML MM 201002081000001   YORK
 00[b]7847[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000002646               [b]CN0001P  -[/b]0000400002254.982010-12-10XML MM 201002081000001   YORK
 00[b]3400[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045

For records 1-10 the combination of transaction code and business code is as follows-

CM0002N  -    3007
CM0002N  -    7777
CM0019NM +    0440
CM0019NM +    0500
CN0001N  -    0292
CN0001N  -    7843
CN0001N  -    0012
CN0001P  -    0379
CN0001P  -    7847
CN0001P  -    3400

Further, we have a condition file containing 3 values per record (comma-separated).

The first value is the business code, the second is the transaction code and the third is the derived transaction code. The file is given below:


3007,CM0002N  -,CMCAMTN  -
3037,CM0002N  -,CMCAMTN  -
3059,CM0002N  -,CMCAMTN  -
3007,CM0002N  +,CMCAMTN  +
3037,CM0002N  +,CMCAMTN  +
3059,CM0002N  +,CMCAMTN  +
3007,CM0002P  -,CMCAMTP  -
3037,CM0002P  -,CMCAMTP  -
3059,CM0002P  -,CMCAMTP  -
3007,CM0002P  +,CMCAMTP  +
3037,CM0002P  +,CMCAMTP  +
3059,CM0002P  +,CMCAMTP  +
0440,CM0019N  -,CMDDPTN  -
0440,CM0019N  +,CMDDPTN  +
0440,CM0019NM -,CMDDPTNM -
0440,CM0019NM +,CMDDPTNM +
0292,CN0001N  -,CNQCSHN  -
0379,CN0001N  -,CNQCSHN  -
1038,CN0001N  -,CNQCSHN  -
7810,CN0001N  -,CNQCSHN  -
7811,CN0001N  -,CNQCSHN  -
7812,CN0001N  -,CNQCSHN  -
7842,CN0001N  -,CNQCSHN  -
7843,CN0001N  -,CNQCSHN  -
0292,CN0001N  +,CNQCSHN  +
0379,CN0001N  +,CNQCSHN  +
1038,CN0001N  +,CNQCSHN  +
7810,CN0001N  +,CNQCSHN  +
7811,CN0001N  +,CNQCSHN  +
7812,CN0001N  +,CNQCSHN  +
7842,CN0001N  +,CNQCSHN  +
7843,CN0001N  +,CNQCSHN  +
0292,CN0001P  -,CNQCSHP  -
0379,CN0001P  -,CNQCSHP  -
1038,CN0001P  -,CNQCSHP  -
7810,CN0001P  -,CNQCSHP  -
7811,CN0001P  -,CNQCSHP  -
7812,CN0001P  -,CNQCSHP  -
7842,CN0001P  -,CNQCSHP  -
7843,CN0001P  -,CNQCSHP  -
7719,CN0001P  -,CNCSHBP  -
7800,CN0001P  -,CNCSHBP  -
7801,CN0001P  -,CNCSHBP  -
7802,CN0001P  -,CNCSHBP  -
7830,CN0001P  -,CNCSHBP  -
7831,CN0001P  -,CNCSHBP  -
7846,CN0001P  -,CNCSHBP  -
7847,CN0001P  -,CNCSHBP  -


The requirement is-

Check the business code and transaction code from the input file.

Compare the values with the conditions given in the condition file. If the business code (position 318:321) and transaction code (position 31:40) match the first and second values of any record in the condition file, replace the transaction code with the third value of that record.


After running the command/script, the output file should look like this (the transaction code at position 31:40 changed wherever the condition is met):


TSCM00000005837               [b]CMCAMTN  -[/b]0000000001906.072010-12-10XML MM 201002081000000   YORK
 00[b]3007[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000008112               [b]CM0002N  -[/b]0000000001906.072010-12-10XML MM 201002081000000   YORK
 00[b]7777[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM80000005282               [b]CMDDPTNM +[/b]0000000002254.982010-12-10XML MM 201002081000001   YORK
 00[b]0440[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000000215               [b]CM0019NM +[/b]0000400002254.982010-12-10XML MM 201002081000001   YORK
 00[b]0500[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000000215               [b]CNQCSHN  -[/b]0000400002254.982010-12-10XML MM 201002081000001   YORK
 00[b]0292[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM80000005282               [b]CNQCSHN  -[/b]0000400002254.982010-12-10XML MM 201002081000001   YORK
 00[b]7843[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000008107               [b]CN0001N  -[/b]0000400002254.982010-12-10XML MM 201002081000001   YORK
 00[b]0012[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000008093               [b]CNQCSHP  -[/b]0000400002254.982010-12-10XML MM 201002081000001   YORK
 00[b]0379[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000002646               [b]CNCSHBP  -[/b]0000400002254.982010-12-10XML MM 201002081000001   YORK
 00[b]7847[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045
TSCM00000002646               [b]CN0001P  -[/b]0000400002254.982010-12-10XML MM 201002081000001   YORK
 00[b]3400[/b]XML MM 2010000000000*                   00000000*
                                                                                  2010-12-10-00.00.00123456789123450 GB 5045


My input file has around 1 million records, and each needs to be checked against all the conditions in the condition file.

In the listings above, the transaction code and business code are wrapped in [b]…[/b] tags for emphasis.
Question by:n78298
19 Comments

Expert Comment

by:mcuk_storm
ID: 33492062
I believe the attached script will do what you require. It takes two parameters: the first is the input file, the second is the comparison file (as given in code listing #3).

i.e.
./script.sh inputFile lookupFile > outputFile

There were a couple of records in your example data that don't have matches in listing #3; I presume this is because they are only example sets. In the case where it can't find a match in the comparison/lookup file, it will output the line as it was passed in; if you want to change this behaviour, look at lines 10 & 11 of the script provided.

I have based this script on the assumption that the input file contains fixed-width fields, as that appears to be the case in your example data; the comparison/lookup file is treated as CSV.

The requirements for the attached script are:
bash, cut and grep. These have been tested on a Linux install; unfortunately I don't have access to a Unix terminal, so the syntax for cut/grep could be slightly different (it shouldn't be, but it might).

script.sh
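[Editor's note: the script.sh attachment is not reproduced in this transcript. A hypothetical sketch of such a bash script, with its structure inferred from the version the asker quotes later in the thread, demonstrated inline on a generated one-record sample:]

```shell
#!/bin/bash
# Hypothetical sketch of the attached script.sh (the attachment is not
# reproduced here); structure inferred from the version quoted later in
# the thread. Real usage would be: ./script.sh inputFile lookupFile > outputFile
# Demonstrated here on a generated one-record sample.

# sample lookup row: businesscode,transactioncode,derivedcode
printf '3007,CM0002N  -,CMCAMTN  -\n' > lookupFile
# sample fixed-width record: cols 31-40 = txn code, cols 318-321 = biz code
printf '%-30s%-10s%277s%s\n' 'TSCM00000005837' 'CM0002N  -' '' '3007' > inputFile

while IFS= read -r line; do
  biz="${line:317:4}"          # business code, cols 318-321 (0-based offset)
  txn="${line:30:10}"          # transaction code, cols 31-40
  match=$(grep "^$biz,$txn," lookupFile)
  if [ -n "$match" ]; then
    derived=$(echo "$match" | cut -d ',' -f 3)
    printf '%s%s%s\n' "${line:0:30}" "$derived" "${line:40}"
  else
    printf '%s\n' "$line"      # no match: emit the record unchanged
  fi
done < inputFile > outputFile

cut -c31-40 outputFile         # prints "CMCAMTN  -"
```

Note this per-line grep over the lookup file is simple but slow; the thread moves to a faster awk-only approach below.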

Author Comment

by:n78298
ID: 33492634
Thanks for understanding the problem and giving the solution.

Firstly, the assumptions which are absolutely correct-

It takes two parameters: the first is the input file, the second is the comparison file.
In case it can't find a match in the comparison/lookup file, it will output the line as it was passed in.
The input file contains fixed-width fields.
The comparison/lookup file is treated as CSV (comma-separated file).

However, with all of the above, the solution is not working for me. I don't have bash; my UNIX shell is ksh.

When I run it with #!/bin/bash as the first line, nothing happens. There is no value assigned to COMPLINE in any of the 10 examples.

When I run with #!/bin/ksh, this error occurs:

"${line:108:4},${line:33:10}": bad substitution


Expert Comment

by:mcuk_storm
ID: 33492732
Hi, hopefully the attached file will work. Apparently the version of ksh that comes with some Unix distributions is quite old (recent versions do support this syntax), so I have substituted those expansions with awk commands. This adds awk as a requirement but should work fine with ksh.


script.sh

Expert Comment

by:mcuk_storm
ID: 33492773
One more comment: I just noticed that the [b] tags in the code are probably where you were trying to emphasise the key columns. I had included these in the positioning, so the position numbers in lines 3-6 will need to be tweaked slightly to:

  SUB1=$(echo "$line" | awk '{print substr($0,99,4)}')
  SUB2=$(echo "$line" | awk '{print substr($0,31,10)}')
  LINESTART=$(echo "$line" | awk '{print substr($0,0,30)}')
  LINEEND=$(echo "$line" | awk '{print substr($0,41)}')

Author Comment

by:n78298
ID: 33492983
Thanks a lot for your constant help. I have implemented this and the code is working, but with so many assignments performance is extremely poor. For 200 records it took 10 seconds, which means for 100,000 records (my input file size in production) it may take 80-90 minutes, which is unacceptable per the job time windows. Can't we make extensive use of sed or awk to make this task faster?

The final code which worked on my system is



while read line; do
  COMPLINE=""
  SUB1=$(echo "$line" | awk '{print substr($0,318,4)}')
  SUB2=$(echo "$line" | awk '{print substr($0,31,10)}')

  #COMPLINE=$(grep "${line:318:4},${line:31:10}" $2)
  COMPLINE=$(cat "$2" | grep "$SUB1,$SUB2")
  if [ ${#COMPLINE} -gt 0 ]; then
    REPLACEMENT=$(echo "${COMPLINE}" | cut -d "," -f 3)
    LINESTART=$(echo "$line" | awk '{print substr($0,0,30)}')
    LINEEND=$(echo "$line" | awk '{print substr($0,41)}')
    echo "${LINESTART}${REPLACEMENT}${LINEEND}"
  else
    echo "${line}"
    #echo "NOMATCH! ${line:108:4},${line:33:10}"
  fi
done < "$1"


Expert Comment

by:mcuk_storm
ID: 33494065
At the moment I would imagine it is the number of processes it has to spawn, and having to re-read the comparison table for every record, that is slowing it down.

What would be an acceptable memory footprint for the process? I'm thinking that holding the comparison table in memory as a hash map should speed things up nicely, but if you have a 1 million row comparison table this would consume around 30 MB of memory.

Author Comment

by:n78298
ID: 33494187
I'm not very sure of this, but yes, if you can help we can try these things.

I still feel that if we use awk for the problem (the declarations and the if clause can all happen in awk), it will definitely speed things up.

Can you look into this aspect as well?

Expert Comment

by:mcuk_storm
ID: 33494213
Attached is an awk-only script which reads the comparison table into memory first; this, combined with its single-process model, is significantly faster (the previous script was 48 sec/10,000 records; this one is 0.02 sec/10,000 records).

On my test dataset, with 10,048 rows in the comparison file and 10,000 rows in the main input file, the memory footprint of awk was 11,936 KB, but with a larger comparison file this will grow.

Usage: ./Q26419674 comparison_file input_file > output_file
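[Editor's note: the Q26419674 attachment is not reproduced in this transcript. A hypothetical reconstruction of an awk script matching this description (load the comparison CSV into an associative array first, then rewrite columns 31-40 of each matching record in a single pass), demonstrated on a generated one-record sample:]

```shell
# Hypothetical reconstruction of the attached Q26419674 script (the
# attachment itself is not shown). It loads the comparison CSV into an
# associative array keyed on "businesscode,transactioncode", then rewrites
# columns 31-40 of every matching fixed-width record in one pass.
cat > replace_txn.awk <<'EOF'
#!/usr/bin/nawk -f
NR == FNR {                      # first file: comparison table (CSV)
    split($0, f, ",")
    map[f[1] "," f[2]] = f[3]    # key -> derived transaction code
    next
}
{                                # second file: fixed-width input records
    key = substr($0, 318, 4) "," substr($0, 31, 10)
    if (key in map)
        print substr($0, 1, 30) map[key] substr($0, 41)
    else
        print                    # no match: pass the record through as-is
}
EOF

# demonstrate on a generated one-record sample
printf '3007,CM0002N  -,CMCAMTN  -\n' > comparison_file
printf '%-30s%-10s%277s%s\n' 'TSCM00000005837' 'CM0002N  -' '' '3007' > input_file
awk -f replace_txn.awk comparison_file input_file > output_file
cut -c31-40 output_file          # prints "CMCAMTN  -"
```

The NR == FNR idiom routes the first file into the lookup array, so the large input file is streamed one record at a time, which is what keeps the memory footprint flat.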

Expert Comment

by:mcuk_storm
ID: 33494214
Sorry, I forgot to attach the file.
Q26419674

Author Comment

by:n78298
ID: 33494300
Sorry, but this is giving some strange syntax errors. I am not able to figure out what's wrong.

[ief /home/ief]$ new1 condition.csv df > output1
awk: syntax error near line 15
awk: illegal statement near line 15
awk: syntax error near line 21
awk: bailing out near line 21

condition.csv is the comparison file
df is the input file

Expert Comment

by:mcuk_storm
ID: 33494304
Could you let me know the output of:
awk --version

Also, did you need to make any changes to the file to get to this point? I only ask because the line numbers don't seem to tie up.

Author Comment

by:n78298
ID: 33494305
I removed all the blank lines and comments to get the exact lines where the error is coming from.

It now shows-

[ief /home/ief]$ new1 condition.csv df
awk: syntax error near line 13
awk: illegal statement near line 13
awk: syntax error near line 19
awk: bailing out near line 19

Expert Comment

by:mcuk_storm
ID: 33494326
If /usr/bin/nawk exists on your system, could you try changing the first line from:
#!/bin/awk -f

to:
#!/usr/bin/nawk -f

Author Comment

by:n78298
ID: 33494570
Thanks for the solution and the constant help.

The nawk change is working for me and performance is quite good: with 100,000 rows the script completed successfully in 50 seconds. But I have realised that my files can be as big as 2.5 million rows (2,500,000 rows).

However, I would like to clarify that my comparison file is static. It will only have around 100 rows.

With the below clarifications, could you please explain what memory would be used?

Input file having 2.5 million rows
Comparision file having 100 rows

Expert Comment

by:mcuk_storm
ID: 33494931
Awk will process the main input file one record/line at a time, so only one input line will ever be held in memory; the output is flushed after each record and redirected to disk, so the memory usage should be minimal and O(1).

[b]100 records in comp, 10,000 records in input[/b]
/usr/bin/time -v ./Q26419674 comp in-10000 > /dev/null
        Command being timed: "./Q26419674 comp in-10000"
        User time (seconds): 0.02
        System time (seconds): 0.00
        Percent of CPU this job got: 92%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.02
        [b]Maximum resident set size (kbytes): 3984[/b]


[b]100 records in comp, 100,000 records in input[/b]
/usr/bin/time -v ./Q26419674 comp in-100000 > /dev/null
        Command being timed: "./Q26419674 comp in-100000"
        User time (seconds): 0.21
        System time (seconds): 0.02
        Percent of CPU this job got: 98%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.24
        [b]Maximum resident set size (kbytes): 3984[/b]

Author Comment

by:n78298
ID: 33496395
Thanks a lot for your detailed explanation.

So this means there should not be any memory related issues as only one line will ever be held in memory.

I will get access to environments tomorrow where I will be able to test this solution with 2.5 million rows.

I will keep you updated.

Author Comment

by:n78298
ID: 33500018
Amazing results. 2.5 million records getting processed in less than 3 minutes.  

If you can confirm that there will be no memory related issues , I can close on this solution and integrate this solution into my application.

Accepted Solution

by:
mcuk_storm earned 500 total points
ID: 33503338
There shouldn't be any issues with memory. As you can see from the two outputs I posted earlier, the memory consumption of processing 10,000 and 100,000 records was the same; the memory consumption consists primarily of some overhead plus the comparison table loaded into memory.

Author Closing Comment

by:n78298
ID: 33556056
Thanks a lot.