Solved

Perl expert complex data manupulation

Posted on 2014-01-10
13
299 Views
Last Modified: 2014-01-16
Hi ,

I have a data which need to be manipulated as below.

filename1.csv
Pat_name,Pat_number,pat_id
raj,123,3
dinesh,,4

filename2.csv
Pat_name,Pat_number
raj,123


compare both files on only same column names and get the desired output .
If the value in rows match = success, else = failure
In this case only 2 columns present in both files (pat_id can be ignored)

output.csv

Pat_name_file1Name,Pat_name_file2Name,Match,Pat_number_Filename1,Pat_number_Filename2,Match
raj,raj,Success,123,123,Success
dinesh,,Failure,,,Success


PS. My original files has around 10 columns .

Thanks
Magento
0
Comment
Question by:magento
  • 7
  • 6
13 Comments
 
LVL 84

Expert Comment

by:ozo
ID: 39772887
Do you mean something like this?
perl -F, -lane '$m{$F[0]}[1-@ARGV]=$F[1];END{print join",",exists $m{$_}[0]&&$_,exists $m{$_}[1]&&$_,($#{$m{$_}}?"Success":"Failure"),@{$m{$_}}[0,1],$m{$_}[0]eq$m{$_}[1]?"Success":"Failure" for keys %m}' filename1.csv filename2.csv
0
 
LVL 5

Author Comment

by:magento
ID: 39772900
Hi Ozo,

Getting the below error..

C:\Expert-Exchange\perl>perl -F, -lane '$m{$F[0]}[1-@ARGV]=$F[1];END{print join
,",exists $m{$_}[0]&&$_,exists $m{$_}[1]&&$_,($#{$m{$_}}?"Success":"Failure"),@
$m{$_}}[0,1],$m{$_}[0]eq$m{$_}[1]?"Success":"Failure" for keys %m}' 1.txt 2.txt

Can't find string terminator "'" anywhere before EOF at -e line 1.

Open in new window

0
 
LVL 84

Expert Comment

by:ozo
ID: 39772905
It looks like you are using the DOS shell, which uses different quoting rules.
Try
perl -F, -lane "$m{$F[0]}[1-@ARGV]=$F[1];END{print join',',exists $m{$_}[0]&&$_,exists $m{$_}[1]&&$_,($#{$m{$_}}?'Success':'Failure'),@{$m{$_}}[0,1],$m{$_}[0]eq$m{$_}[1]?'Success':'Failure' for keys %m}" filename1.csv filename2.csv
0
 
LVL 5

Author Comment

by:magento
ID: 39772913
Hi Ozo ,

It works only for 2 columns , my file has ~ 10 columns.

Also the header is not working for my original files.

Please advice.
1.csv
2.csv
0
 
LVL 84

Expert Comment

by:ozo
ID: 39773222
What should the results be when given 1.csv and 2.csv as input?
0
 
LVL 5

Author Comment

by:magento
ID: 39773244
Hi Ozo,

The result should be in 3.csv

The idea is to combine columns present in both files and check for matches.
so columnName of file1 is matched with columName of file2 and a match result should say success Iif matched) or false (if not matched)

If you need the sample from original files, please tell me i will attach it.

The header should be modified with filenames.

1.csv
col1,col2,col3,col4
a       b       c      d

2.csv
col1,col2,col4
a         b     e

3.csv
col1_1 col1_2 match col2_1 col2_2 match col4_1 cpl4_2 match
a              a        Success  b       b     Success    d       e      false


Thanks,
0
3 Use Cases for Connected Systems

Our Dev teams are like yours. They’re continually cranking out code for new features/bugs fixes, testing, deploying, testing some more, responding to production monitoring events and more. It’s complex. So, we thought you’d like to see what’s working for us.

 
LVL 84

Accepted Solution

by:
ozo earned 500 total points
ID: 39773331
Do you mean like this?
#!/usr/bin/perl
use strict;
use warnings;
my(@c,@m);
while( <> ){
    chomp;
    my @F=split/,/; 
    @c=@F,next unless @c;
    @{$m[1-@ARGV]}{@c}=@F;
    close ARGV,undef @c if 2..2;
}
$\=$/;
my @k=sort grep $m[1]{$_},keys %{$m[0]};
print join"\t",map{$_."_1",$_."_2","match"}@k;
print join"\t",map{$m[0]{$_},$m[1]{$_},$m[0]{$_}eq$m[1]{$_}?"Success":"false"}@k;

Open in new window

0
 
LVL 5

Author Comment

by:magento
ID: 39773477
Hi Ozo ,

I run ur code and it produces the output in command prompt which was looks like  the below.

But it only prints for 1st record only also not comma separated.

C:\Expert-Exchange\perl>perl test.pl 1.csv 2.csv
Activity Status_1       Activity Status_2       match   Available Quantity_1
Available Quantity_2    match   Collection Status_1     Collection Status_2
match   Container name_1        Container name_2        match   Created On_1
Created On_2    match   Initial Quantity_1      Initial Quantity_2      match
Is Available_1  Is Available_2  match   Label_1 Label_2 match   Lineage_1
Lineage_2       match   Pathological Status_1   Pathological Status_2   match
Position one_1  Position one_2  match   Position two_1  Position two_2  match
SCG name_1      SCG name_2      match   Specimen Class_1        Specimen Class_2
        match   Tissue Side_1   Tissue Side_2   match   Tissue Site_1   Tissue S
ite_2   match   Type_1  Type_2  match
Active  Active  Success 13.8    13.8    Success Collected       Collected
Success D6 S4 R16 B5    D6 S4 R16 B5    Success 9/2/2008        9/2/2008 0:00
false   13.8    13.8    Success TRUE    1       false   B0006697        B0006697
        Success New     New     Success Not Specified   Not Specified   Success
6       9       false   9       6       false   A RANDOMIZED  PHASE II TRIAL _30
834_180328      A RANDOMIZED  PHASE II TRIAL _30834_180328      Success Molecula
r       Molecular       Success Not Specified   Not Specified   Success Blood
Blood   Success DNA     DNA     Success

Open in new window

0
 
LVL 84

Expert Comment

by:ozo
ID: 39773489
3.csv in http:#a39773244 prints for 1st record only also not comma separated

If there are more records, how would you want to determine which _1 record is to be matched with which _2 record?
0
 
LVL 5

Author Comment

by:magento
ID: 39773517
Hi Ozo ,

I am sorry i have posted that only as a sample.

It should be in csv format.

>>>If there are more records, how would you want to determine which _1 record is to be matched with which _2 record?

The matching will be based on column name.

File1.csv has a column named Activity Status then it should be matched with the same column name (Activity Status) in file 2.csv.

Thanks
0
 
LVL 84

Expert Comment

by:ozo
ID: 39773536
It looks like lots of records have the same Activity Status
Which Active in 1.csv goes with which Active in 2.csv?
0
 
LVL 5

Author Comment

by:magento
ID: 39773562
Ok , here is the column in 1.csv

SCG name,Label,Barcode,External Id Name#1,External Id Value#1,External Id Name#2,External Id Value#2,Created On,Activity Status,Is Available,Collection Status,Specimen Class,Type,Tissue Site,Tissue Side,Initial Quantity,Available Quantity,Lineage,Concentration,Pathological Status,Container name,Position one,Position two,Comments

Here is the column in 2.csv

SCG name,Label,Created On,Activity Status,Is Available,Collection Status,Specimen Class,Type,Tissue Site,Tissue Side,Initial Quantity,Available Quantity,Lineage,Pathological Status,Container name,Position one,Position two,Comments

SCG name in 1.csv need to match with the same name in 2.csv . In 2.csv the column name SCG name is present and hence it will match with them.
If a column in 1.csv not present in 2.csv (in this case Barcode,External Id Name#1,External Id Value#1,External Id Name#2,External Id Value#2,Concentration) those columns should be removed or ignored.

The same way row1 of column SCG name match with row1 of SCG name in 2.csv
Hope i was clear now.

Thanks ozo for ur help on this.
0
 
LVL 84

Expert Comment

by:ozo
ID: 39774601
It looks like several records have the same SCG name
which A RANDOMIZED  PHASE II TRIAL _30834_180328 in 1.csv goes with which A RANDOMIZED  PHASE II TRIAL _30834_180328 in 2.csv?
0

Featured Post

3 Use Cases for Connected Systems

Our Dev teams are like yours. They’re continually cranking out code for new features/bugs fixes, testing, deploying, testing some more, responding to production monitoring events and more. It’s complex. So, we thought you’d like to see what’s working for us.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

In the distant past (last year) I hacked together a little toy that would allow a couple of Manager types to query, preview, and extract data from a number of MongoDB instances, to their tool of choice: Excel (http://dilbert.com/strips/comic/2007-08…
Active Directory replication delay is the cause to many problems.  Here is a super easy script to force Active Directory replication to all sites with by using an elevated PowerShell command prompt, and a tool to verify your changes.
Learn the basics of lists in Python. Lists, as their name suggests, are a means for ordering and storing values. : Lists are declared using brackets; for example: t = [1, 2, 3]: Lists may contain a mix of data types; for example: t = ['string', 1, T…
The viewer will learn how to dynamically set the form action using jQuery.

911 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

24 Experts available now in Live!

Get 1:1 Help Now