Solved

Perl CSV Manipulation - Split Single Record into Multiple Records

Posted on 2011-02-26
3
575 Views
Last Modified: 2012-05-11
This is a followup to a previous question I asked http://www.experts-exchange.com/Programming/Languages/Scripting/Perl/Q_26845234.html

I have other data files that are part of the data migration that do not need the data to be cleansed but need to split the Notes column based on a timestamp RegEx.

For example, my first data file is like this

"Company Name", "Notes"
"Company A","September 17, 2010 3:28 PM This is the first note.<BR>September 17, 2010 4:28 PM This is the second note."
"Company B","May 1, 2010 4:18 PM This is the first note.<BR>May 15, 2010 1:12 PM This is the second note.<BR>May 31, 2010 1:00 PM This is the third note."

I need this to be in the format

"Company A","September 17, 2010 3:28 PM This is the first note."
"Company A","September 17, 2010 4:28 PM This is the second note."
"Company B","May 1, 2010 4:18 PM This is the first note."
"Company B","May 15, 2010 1:12 PM This is the second note."
"Company B","May 31, 2010 1:00 PM This is the third note."

I also have another data file that is similar but has more columns.

"Firstname", "Lastname", "Company Name", "Notes"
"First","Last","Company A","September 17, 2010 3:28 PM This is the first note.<BR>September 17, 2010 4:28 PM This is the second note."
"First","Last","Company B","May 1, 2010 4:18 PM This is the first note.<BR>May 15, 2010 1:12 PM This is the second note.<BR>May 31, 2010 1:00 PM This is the third note."

I need this output to be like

"First","Last","Company A","September 17, 2010 3:28 PM This is the first note."
"First","Last","Company A","September 17, 2010 4:28 PM This is the second note."
"First","Last","Company B","May 1, 2010 4:18 PM This is the first note."
"First","Last","Company B","May 15, 2010 1:12 PM This is the second note."
"First","Last","Company B","May 31, 2010 1:00 PM This is the third note."

Attached are the data samples in .csv format.

It would be great if this could be done in a single script that does both formats.
sample1.csv
sample2.csv
0
Comment
Question by:mikedgibson
  • 2
3 Comments
 
LVL 16

Expert Comment

by:sjklein42
ID: 34989329
Hi.  I'll look at it.
0
 
LVL 16

Accepted Solution

by:
sjklein42 earned 500 total points
ID: 34989357
I think this will also handle the data from your other (first) question.

<>;     # ignore header line

while ( <> )
{
    s/[\r\n]+$//;       # trim

    if ( $_ ne '' )     # ignore blank lines
    {
        while ( 1 )
        {
            s/\"\"/\'/g;            # double-double quotes ("") go to single quote
            if ( /\"$/ ) { last; }  # if the last character is a double quote, we have a complete record

            $next = <>;             # grab another line
            if ( $next eq '' ) { last; }    # end-of-file

            # insert "special" indicator before subrecords starting with a date
            if ( $next =~ /^[A-Z][a-z]+ [0-9]+\, [0-9]+ [0-9]+\:[0-9]+/ ) { $next = '"||"' . $next; }

            $_ .= ' ' . $next;

            s/[\r\n]+$//;       # trim again
        }

        s/\"\,\"/\:\:\:/g;  # temporarily hide commas between fields
        s/\,//g;            # get rid of remaining commas (inside fields)
        s/\:\:\:/\,/g;      # restore commas between fields
        s/\"//g;

        @z =  split(/\,/);
        $b = pop(@z);
        $c = '"' . join('","',@z) . '"';
        $b =~ s/\<br\>/\|\|/ig;
        @x = split(/\|\|/,$b);
        foreach $x (@x)
        {
            $x =~ s/\s+$//;
            print "$c\,\"$x\"\n";
        }
    }
}

Open in new window



c:\temp>perl foo.pl sample1.csv
"Company A","September 17 2010 3:28 PM This is the first note."
"Company A","September 17 2010 4:28 PM This is the second note."
"Company B","May 1 2010 4:18 PM This is the first note."
"Company B","May 15 2010 1:12 PM This is the second note."
"Company B","May 31 2010 1:00 PM This is the third note."

c:\temp>perl foo.pl sample2.csv
"First","Last","Company A","September 17 2010 3:28 PM This is the first note."
"First","Last","Company A","September 17 2010 4:28 PM This is the second note."
"First","Last","Company B","May 1 2010 4:18 PM This is the first note."
"First","Last","Company B","May 15 2010 1:12 PM This is the second note."
"First","Last","Company B","May 31 2010 1:00 PM This is the third note."

Open in new window

0
 
LVL 28

Expert Comment

by:FishMonger
ID: 34989719

#!/usr/bin/perl

use strict;
use warnings;

die "usage: $0 <csvfile>" if ! @ARGV;

my $csvfile = shift;

open my $csv_fh, '<', $csvfile or die "can't open '$csvfile' $!";

my @header = split(/,\s?/, <$csv_fh>);
chomp $header[-1];
s/"//g for @header;

while ( my $line = <$csv_fh> ) {
    chomp $line;
    next if $line =~ /^\s*$/;
    
    my %csv_fields;
    @csv_fields{@header} = split(/,\s?/, $line, scalar @header);
    s/"//g for @csv_fields{@header};
    
    $csv_fields{'Notes'} = [ split(/<br>/i, $csv_fields{'Notes'}) ];
    
    for my $i ( 0..$#{ $csv_fields{'Notes'} } ) {
        print qq("$csv_fields{$_}",) for @header[0..$#header -1];
        print qq("$csv_fields{'Notes'}->[$i]"\n);
    }
}

Open in new window

0

Featured Post

Live: Real-Time Solutions, Start Here

Receive instant 1:1 support from technology experts, using our real-time conversation and whiteboard interface. Your first 5 minutes are always free.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Title # Comments Views Activity
compress files in RAR using perl 13 80
Port 80 requests 16 95
Bash one liner, start and end time calculation 13 110
Perl regex to replace any capital letters not preceded by ">" 6 118
Many time we need to work with multiple files all together. If its windows system then we can use some GUI based editor to accomplish our task. But what if you are on putty or have only CLI(Command Line Interface) as an option to  edit your files. I…
In the distant past (last year) I hacked together a little toy that would allow a couple of Manager types to query, preview, and extract data from a number of MongoDB instances, to their tool of choice: Excel (http://dilbert.com/strips/comic/2007-08…
Explain concepts important to validation of email addresses with regular expressions. Applies to most languages/tools that uses regular expressions. Consider email address RFCs: Look at HTML5 form input element (with type=email) regex pattern: T…
With Secure Portal Encryption, the recipient is sent a link to their email address directing them to the email laundry delivery page. From there, the recipient will be required to enter a user name and password to enter the page. Once the recipient …

786 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question