Link to home
Start Free TrialLog in
Avatar of ajay chaudhary
ajay chaudhary

asked on

Do replace operation and awk to sum multiple columns if another column has duplicate values

Need your help to resolve the below query.  
I want to sum up the values for column4,column6,column7, column8,column10,column11 if value in column1 is duplicate.  
Also need to make duplicate rows as single row in output file i.e (column3 can have any of value from duplicates values in output file of column3 ex : can contain either abcd123,qwert or 345rty).   Also need to append all duplicate values of column3 of input file in column 12 of output file expect the one which we have already stored in column3 of output file (ex. column12 of output file can contain only in form qwert^345rty if value abcd123 is used in column 3 of outputfile

input file

   
    a|b|c|d|e|f|g|h|i|j|k
IN27201800024099|a|abcd123|2.01|ad|5|56|6|rr|1|2
IN27201800023963|b|zxcv123|3|rt|67|6|61|ty|6|3
IN27201800024099|a||qwert|4|ad|5|6|1|rr|7.45|4
IN27201800024099|a|345rty|5|ad|5|6|1|rr|8|4
IN27201800023963|b|1234ret|7|rt|5|5|1|ty|56|5
IN27201800024098|f|qazxswe|80|ty|6|6|1|4rght|765|5

Open in new window


output file

   
 a|b|c|d|e|f|g|h|i|j|k|l
IN27201800024099|a|abcd123|11.01|ad|15|68|8|rr|16.45|10|qwert^345rty
IN27201800023963|b|1234ret|10|rt|72|11|62|ty|62|8|zxcv123
IN27201800024098|f|qazxswe|80|ty|6|6|1|4rght|765|5|qazxswe

Open in new window


Tried below code, but it is not working and also no clue how to complete the code to get correct output

    awk 'BEGIN {FS=OFS="|"} FNR==1 {a[$1]+= (f4[key]+=$4;f6[key]+=$6;f7[key]+=$7;f8[key]+=$8;f10[key]+=$10;f11[key]+=$11;)} input.txt > output.txt

Open in new window

Avatar of noci
noci

awk might be able to do it, but i chose perl...
FYI, your input example has a few problems, line the number of columns isn;t consistent
if the is true for the real input set then you are in trouble....
I assumed line 1 = OK. otherwise adjust the script accordingly.
(Per arrays start from 0 by default, i didn;t change that. fn references are 1 based)

#!/usr/bin/env perl 

use strict;
use warnings;
use utf8;

my ($line, $header);
my %store;
my @fields;

$header = <>;           # read line 1, header...
chomp($header);
while ($line = <>) {
        chomp($line);
        @fields = split(/\|/,$line);
        if ( defined ($store{$fields[0]}{'f2'})) {
                $store{$fields[0]}{'f4'} += $fields[3];
                $store{$fields[0]}{'f6'} += $fields[5];
                $store{$fields[0]}{'f7'} += $fields[6];
                $store{$fields[0]}{'f8'} += $fields[7];
                $store{$fields[0]}{'f10'} += $fields[9];
                $store{$fields[0]}{'f11'} += $fields[10];
                if ($store{$fields[0]}{'f12'} eq "") {
                        $store{$fields[0]}{'f12'} = $fields[2];
                } else {
                        $store{$fields[0]}{'f12'} .= "^".$fields[2];
                }
        } else {
                $store{$fields[0]}{'f2'} = $fields[1];
                $store{$fields[0]}{'f3'} = $fields[2];
                $store{$fields[0]}{'f4'} = $fields[3];
                $store{$fields[0]}{'f5'} = $fields[4];
                $store{$fields[0]}{'f6'} = $fields[5];
                $store{$fields[0]}{'f7'} = $fields[6];
                $store{$fields[0]}{'f8'} = $fields[7];
                $store{$fields[0]}{'f9'} = $fields[8];
                $store{$fields[0]}{'f10'} = $fields[9];
                $store{$fields[0]}{'f11'} = $fields[10];
                $store{$fields[0]}{'f12'} = "";
        }
}

print $header."|l"."\n";
foreach my $k (sort (keys %store)) {
        $line = $k;
        foreach my $f ( 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12' ) {
                $line .= "|" . $store{$k}{$f};
        }
        print $line."\n";
}

Open in new window

Avatar of ajay chaudhary

ASKER

Hi Noci,
I have done the correction in input and output file, Please check
Now column for input are aligned with output file
Hi noci,
could you please explain the code once or if you can put comments to understand
would be a great help
Hi noci,

what is the purpose of this
if ( defined ($store{$fields[0]}{'f2'}))
defined checks if a name  exists in a 'hash' (like %store is)
fields[0] is the primary key (Column 0) and 'f2' is the column 2 value... if that exists a record has been seen before so the addition & concatenation must happen.

otherwise it is a new record and that is handled in the else part.
Here with some comments:
#!/usr/bin/env perl 

use strict;   # enforce checking of variables 
use warnings;
use utf8;

my ($line, $header);   # declare globals, line buffers
my %store;     # internal database
my @fields;    # split line

$header = <>;           # read line 1, header...
chomp($header);    # remove end of line (\r\n or \n)
while ($line = <>) {
        chomp($line);    # remove end of line \r\n or \n
        @fields = split(/\|/,$line);    # split line in fields
        if ( defined ($store{$fields[0]}{'f2'})) {    # check if we have seen this one before
                $store{$fields[0]}{'f4'} += $fields[3];    # Yep, so add some columns
                $store{$fields[0]}{'f6'} += $fields[5];
                $store{$fields[0]}{'f7'} += $fields[6];
                $store{$fields[0]}{'f8'} += $fields[7];
                $store{$fields[0]}{'f10'} += $fields[9];
                $store{$fields[0]}{'f11'} += $fields[10];
                if ($store{$fields[0]}{'f12'} eq "") {      # the tag needs to be appended handle first. vs. next cases
                        $store{$fields[0]}{'f12'} = $fields[2];
                } else {
                        $store{$fields[0]}{'f12'} .= "^".$fields[2];
                }
        } else {        # new record... fill all fields...
                $store{$fields[0]}{'f2'} = $fields[1];
                $store{$fields[0]}{'f3'} = $fields[2];
                $store{$fields[0]}{'f4'} = $fields[3];
                $store{$fields[0]}{'f5'} = $fields[4];
                $store{$fields[0]}{'f6'} = $fields[5];
                $store{$fields[0]}{'f7'} = $fields[6];
                $store{$fields[0]}{'f8'} = $fields[7];
                $store{$fields[0]}{'f9'} = $fields[8];
                $store{$fields[0]}{'f10'} = $fields[9];
                $store{$fields[0]}{'f11'} = $fields[10];
                $store{$fields[0]}{'f12'} = "";
        }
}
# all has been read. So print header and
print $header."|l"."\n";
foreach my $k (sort (keys %store)) {     # Walk the database on Primary key
        $line = $k;
        foreach my $f ( 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12' ) {     # Join all fields, by appending them to the key
                $line .= "|" . $store{$k}{$f};
        }
        print $line."\n";          # print the line
}

Open in new window

/bin/awk here taken 616630400 nano seconds (approximately).
perl taken                   549079900 nano seconds (approximately).
>> test result varies.
Code using /bin/awk (112 lines of code)
/bin/awk -F"|" '{
	if ( 1 == NR)
	{
		printf( "%s|l\n", $0);
	}
	else
	{
		DUP_ROW[$1]++;
		F2[$1]=$2;
		if ( "" == F5[$1] )
		{
			F5[$1]=$5;
		}
		if ( "" == F3[$1] )
		{
			F3[$1]=$3;
		}
		else
		{
			if ( "" == F3_END[$1] )
			{
				F3_END[$1]= $3;
			}
			else
			{
				F3_END[$1]= F3_END[$1]"^"$3;
			}
		}
		F4[$1]+=sprintf("%f",$4);
		F6[$1]+=sprintf("%f",$6);
		F7[$1]+=sprintf("%f",$7);
		F8[$1]+=sprintf("%f",$8);
		F9[$1]=$9;
		F10[$1]+=sprintf("%f",$10);
		F11[$1]+=sprintf("%f",$11);
	}
}
END {
	for( Val in DUP_ROW)
	{
		PRINTF_FORMAT="%s|%s|%s|"; # Val F2[Val] F3[Val]
		if ( 0 == index( F4[Val], ".") ) # F4[Val]
		{
			PRINTF_FORMAT=PRINTF_FORMAT"%.0f|";
		}
		else
		{
			PRINTF_FORMAT=PRINTF_FORMAT"%.2f|";
		}
		PRINTF_FORMAT=PRINTF_FORMAT"%s|"; # F5[Val]
		if ( 0 == index( F6[Val], ".") ) # F6[Val]
		{
			PRINTF_FORMAT=PRINTF_FORMAT"%.0f|"; # F6[Val]
		}
		else
		{
			PRINTF_FORMAT=PRINTF_FORMAT"%.2f|"; # F6[Val]
		}
		if ( 0 == index( F7[Val], ".") ) # F7[Val]
		{
			PRINTF_FORMAT=PRINTF_FORMAT"%.0f|"; # F7[Val]
		}
		else
		{
			PRINTF_FORMAT=PRINTF_FORMAT"%.2f|"; # F7[Val]
		}
		if ( 0 == index( F8[Val], ".") ) # F8[Val]
		{
			PRINTF_FORMAT=PRINTF_FORMAT"%.0f|"; # F8[Val]
		}
		else
		{
			PRINTF_FORMAT=PRINTF_FORMAT"%.2f|"; # F8[Val]
		}
		PRINTF_FORMAT=PRINTF_FORMAT"%s|";
		if ( 0 == index( F10[Val], ".") ) # F10[Val]
		{
			PRINTF_FORMAT=PRINTF_FORMAT"%.0f|"; # F10[Val]
		}
		else
		{
			PRINTF_FORMAT=PRINTF_FORMAT"%.2f|"; # F10[Val]
		}
		if ( 0 == index( F11[Val], ".") ) # F11[Val]
		{
			PRINTF_FORMAT=PRINTF_FORMAT"%.0f|"; # F11[Val]
		}
		else
		{
			PRINTF_FORMAT=PRINTF_FORMAT"%.2f|"; # F11[Val]
		}
		PRINTF_FORMAT=PRINTF_FORMAT"%s\n";
		# Use the following lines if you prefer qazxswe at the end of line for IN27201800024098
		# if ( "" == F3_END[Val])
		# {
		# 	F3_END[Val]=F3[Val];
		# }
		printf( PRINTF_FORMAT, \
			Val,              \
			F2[Val], \
			F3[Val],          \
			F4[Val],          \
			F5[Val],          \
			F6[Val],          \
			F7[Val],          \
			F8[Val],          \
			F9[Val],          \
			F10[Val],         \
			F11[Val],         \
			F3_END[Val]);
	}
}' ./input.txt

Open in new window

Current output (Similar output obtained using noci.pl):
$ ./29097990.sh
    a|b|c|d|e|f|g|h|i|j|k|l
IN27201800023963|b|zxcv123|10|rt|72|11|62|ty|62|8|1234ret
IN27201800024098|f|qazxswe|80|ty|6|6|1|4rght|765|5|
IN27201800024099|a|abcd123|11.01|ad|15|68|8|rr|16.45|10|qwert^345rty

Open in new window



a)
// output file
..
// IN27201800023963|b|1234ret|10|rt|72|11|62|ty|62|8|zxcv123
This line needs to be
IN27201800023963|b|zxcv123|10|rt|72|11|62|ty|62|8|1234ret

b)
// output file
..
// IN27201800024098|f|qazxswe|80|ty|6|6|1|4rght|765|5|qazxswe
Provided script displaying
IN27201800024098|f|qazxswe|80|ty|6|6|1|4rght|765|5|
If you need => IN27201800024098|f|qazxswe|80|ty|6|6|1|4rght|765|5|qazxswe
Update the following in given code:
Replace:
                  # Use the following lines if you prefer qazxswe at the end of line for IN27201800024098
                  # if ( "" == F3_END[Val])
                  # {
                  #       F3_END[Val]=F3[Val];
                  # }
With:
                  # Use the following lines if you prefer qazxswe at the end of line for IN27201800024098
                  if ( "" == F3_END[Val])
                  {
                        F3_END[Val]=F3[Val];
                  }
Here goes output after removing those comments for these lines:
$ ./29097990.sh
    a|b|c|d|e|f|g|h|i|j|k|l
IN27201800023963|b|zxcv123|10|rt|72|11|62|ty|62|8|1234ret
IN27201800024098|f|qazxswe|80|ty|6|6|1|4rght|765|5|qazxswe
IN27201800024099|a|abcd123|11.01|ad|15|68|8|rr|16.45|10|qwert^345rty
$

Open in new window

Comments on the :
DUP_ROW[$1]++; # Create array DUP_ROW["IN27201800024099"], DUP_ROW["IN27201800023963"] and DUP_ROW["IN27201800024098"]
F2[$1]=$2;     # Save second column in F2["IN27201800024099"] or F2["IN27201800023963"] or F2["IN27201800024098"]
# Like the same way save each columns F3, F4... F11
F4[$1]+=sprintf("%f",$4); # Add $4 using float.
if ( 0 == index( F4[Val], ".") ) # F4[Val] # if F4 having . operator, use %.0f else use %.2f to print float values using printf.
PRINTF_FORMAT="%s|%s|%s|";#printf first argument format
PRINTF_FORMAT=PRINTF_FORMAT"%.0f|";# append the string "%.0f|"

Provided data:
input.txt
    a|b|c|d|e|f|g|h|i|j|k
Expected output:
          a|b|c|d|e|f|g|h|i|j|k|l

Hence following line was written:
            printf( "%s|l\n", $0);
This question needs an answer!
Become an EE member today
7 DAY FREE TRIAL
Members can start a 7-Day Free trial then enjoy unlimited access to the platform.
View membership options
or
Learn why we charge membership fees
We get it - no one likes a content blocker. Take one extra minute and find out why we block content.