Solved

Perl CSV Manipulation - Split Single Record into Multiple Records

Posted on 2011-02-26
3
590 Views
Last Modified: 2012-05-11
This is a followup to a previous question I asked http://www.experts-exchange.com/Programming/Languages/Scripting/Perl/Q_26845234.html

I have other data files that are part of the data migration that do not need the data to be cleansed but need to split the Notes column based on a timestamp RegEx.

For example, my first data file is like this

"Company Name", "Notes"
"Company A","September 17, 2010 3:28 PM This is the first note.<BR>September 17, 2010 4:28 PM This is the second note."
"Company B","May 1, 2010 4:18 PM This is the first note.<BR>May 15, 2010 1:12 PM This is the second note.<BR>May 31, 2010 1:00 PM This is the third note."

I need this to be in the format

"Company A","September 17, 2010 3:28 PM This is the first note."
"Company A","September 17, 2010 4:28 PM This is the second note."
"Company B","May 1, 2010 4:18 PM This is the first note."
"Company B","May 15, 2010 1:12 PM This is the second note."
"Company B","May 31, 2010 1:00 PM This is the third note."

I also have another data file that is similar but has more columns.

"Firstname", "Lastname", "Company Name", "Notes"
"First","Last","Company A","September 17, 2010 3:28 PM This is the first note.<BR>September 17, 2010 4:28 PM This is the second note."
"First","Last","Company B","May 1, 2010 4:18 PM This is the first note.<BR>May 15, 2010 1:12 PM This is the second note.<BR>May 31, 2010 1:00 PM This is the third note."

I need this output to be like

"First","Last","Company A","September 17, 2010 3:28 PM This is the first note."
"First","Last","Company A","September 17, 2010 4:28 PM This is the second note."
"First","Last","Company B","May 1, 2010 4:18 PM This is the first note."
"First","Last","Company B","May 15, 2010 1:12 PM This is the second note."
"First","Last","Company B","May 31, 2010 1:00 PM This is the third note."

Attached are the data samples in .csv format.

It would be great if this could be done in a single script that does both formats.
sample1.csv
sample2.csv
0
Comment
Question by:mikedgibson
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 2
3 Comments
 
LVL 16

Expert Comment

by:sjklein42
ID: 34989329
Hi.  I'll look at it.
0
 
LVL 16

Accepted Solution

by:
sjklein42 earned 500 total points
ID: 34989357
I think this will also handle the data from your other (first) question.

<>;     # ignore header line

while ( <> )
{
    s/[\r\n]+$//;       # trim

    if ( $_ ne '' )     # ignore blank lines
    {
        while ( 1 )
        {
            s/\"\"/\'/g;            # double-double quotes ("") go to single quote
            if ( /\"$/ ) { last; }  # if the last character is a double quote, we have a complete record

            $next = <>;             # grab another line
            if ( $next eq '' ) { last; }    # end-of-file

            # insert "special" indicator before subrecords starting with a date
            if ( $next =~ /^[A-Z][a-z]+ [0-9]+\, [0-9]+ [0-9]+\:[0-9]+/ ) { $next = '"||"' . $next; }

            $_ .= ' ' . $next;

            s/[\r\n]+$//;       # trim again
        }

        s/\"\,\"/\:\:\:/g;  # temporarily hide commas between fields
        s/\,//g;            # get rid of remaining commas (inside fields)
        s/\:\:\:/\,/g;      # restore commas between fields
        s/\"//g;

        @z =  split(/\,/);
        $b = pop(@z);
        $c = '"' . join('","',@z) . '"';
        $b =~ s/\<br\>/\|\|/ig;
        @x = split(/\|\|/,$b);
        foreach $x (@x)
        {
            $x =~ s/\s+$//;
            print "$c\,\"$x\"\n";
        }
    }
}

Open in new window



c:\temp>perl foo.pl sample1.csv
"Company A","September 17 2010 3:28 PM This is the first note."
"Company A","September 17 2010 4:28 PM This is the second note."
"Company B","May 1 2010 4:18 PM This is the first note."
"Company B","May 15 2010 1:12 PM This is the second note."
"Company B","May 31 2010 1:00 PM This is the third note."

c:\temp>perl foo.pl sample2.csv
"First","Last","Company A","September 17 2010 3:28 PM This is the first note."
"First","Last","Company A","September 17 2010 4:28 PM This is the second note."
"First","Last","Company B","May 1 2010 4:18 PM This is the first note."
"First","Last","Company B","May 15 2010 1:12 PM This is the second note."
"First","Last","Company B","May 31 2010 1:00 PM This is the third note."

Open in new window

0
 
LVL 28

Expert Comment

by:FishMonger
ID: 34989719

#!/usr/bin/perl

use strict;
use warnings;

die "usage: $0 <csvfile>" if ! @ARGV;

my $csvfile = shift;

open my $csv_fh, '<', $csvfile or die "can't open '$csvfile' $!";

my @header = split(/,\s?/, <$csv_fh>);
chomp $header[-1];
s/"//g for @header;

while ( my $line = <$csv_fh> ) {
    chomp $line;
    next if $line =~ /^\s*$/;
    
    my %csv_fields;
    @csv_fields{@header} = split(/,\s?/, $line, scalar @header);
    s/"//g for @csv_fields{@header};
    
    $csv_fields{'Notes'} = [ split(/<br>/i, $csv_fields{'Notes'}) ];
    
    for my $i ( 0..$#{ $csv_fields{'Notes'} } ) {
        print qq("$csv_fields{$_}",) for @header[0..$#header -1];
        print qq("$csv_fields{'Notes'}->[$i]"\n);
    }
}

Open in new window

0

Featured Post

Independent Software Vendors: We Want Your Opinion

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

On Microsoft Windows, if  when you click or type the name of a .pl file, you get an error "is not recognized as an internal or external command, operable program or batch file", then this means you do not have the .pl file extension associated with …
I have been pestered over the years to produce and distribute regular data extracts, and often the request have explicitly requested the data be emailed as an Excel attachement; specifically Excel, as it appears: CSV files confuse (no Red or Green h…
Explain concepts important to validation of email addresses with regular expressions. Applies to most languages/tools that uses regular expressions. Consider email address RFCs: Look at HTML5 form input element (with type=email) regex pattern: T…

730 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question