Solved

Perl CSV Manipulation - Split Single Record into Multiple Records

Posted on 2011-02-26
Last Modified: 2012-05-11
This is a follow-up to a previous question I asked: http://www.experts-exchange.com/Programming/Languages/Scripting/Perl/Q_26845234.html

I have other data files in the same data migration that do not need to be cleansed, but whose Notes column needs to be split into separate records based on a timestamp regex.

For example, my first data file looks like this:

"Company Name", "Notes"
"Company A","September 17, 2010 3:28 PM This is the first note.<BR>September 17, 2010 4:28 PM This is the second note."
"Company B","May 1, 2010 4:18 PM This is the first note.<BR>May 15, 2010 1:12 PM This is the second note.<BR>May 31, 2010 1:00 PM This is the third note."

I need this to be in the following format:

"Company A","September 17, 2010 3:28 PM This is the first note."
"Company A","September 17, 2010 4:28 PM This is the second note."
"Company B","May 1, 2010 4:18 PM This is the first note."
"Company B","May 15, 2010 1:12 PM This is the second note."
"Company B","May 31, 2010 1:00 PM This is the third note."

I also have another data file that is similar but has more columns.

"Firstname", "Lastname", "Company Name", "Notes"
"First","Last","Company A","September 17, 2010 3:28 PM This is the first note.<BR>September 17, 2010 4:28 PM This is the second note."
"First","Last","Company B","May 1, 2010 4:18 PM This is the first note.<BR>May 15, 2010 1:12 PM This is the second note.<BR>May 31, 2010 1:00 PM This is the third note."

I need this output to look like this:

"First","Last","Company A","September 17, 2010 3:28 PM This is the first note."
"First","Last","Company A","September 17, 2010 4:28 PM This is the second note."
"First","Last","Company B","May 1, 2010 4:18 PM This is the first note."
"First","Last","Company B","May 15, 2010 1:12 PM This is the second note."
"First","Last","Company B","May 31, 2010 1:00 PM This is the third note."

Attached are the data samples in .csv format.

It would be great if this could be done in a single script that does both formats.
sample1.csv
sample2.csv
Question by:mikedgibson

Expert Comment

by:sjklein42
ID: 34989329
Hi.  I'll look at it.

Accepted Solution

by:sjklein42
ID: 34989357
I think this will also handle the data from your other (first) question.

<>;     # ignore header line

while ( <> )
{
    s/[\r\n]+$//;       # trim

    if ( $_ ne '' )     # ignore blank lines
    {
        while ( 1 )
        {
            s/\"\"/\'/g;            # double-double quotes ("") go to single quote
            if ( /\"$/ ) { last; }  # if the last character is a double quote, we have a complete record

            $next = <>;             # grab another line
            if ( $next eq '' ) { last; }    # end-of-file

            # insert "special" indicator before subrecords starting with a date
            if ( $next =~ /^[A-Z][a-z]+ [0-9]+\, [0-9]+ [0-9]+\:[0-9]+/ ) { $next = '"||"' . $next; }

            $_ .= ' ' . $next;

            s/[\r\n]+$//;       # trim again
        }

        s/\"\,\"/\:\:\:/g;  # temporarily hide commas between fields
        s/\,//g;            # get rid of remaining commas (inside fields)
        s/\:\:\:/\,/g;      # restore commas between fields
        s/\"//g;

        @z =  split(/\,/);
        $b = pop(@z);
        $c = '"' . join('","',@z) . '"';
        $b =~ s/\<br\>/\|\|/ig;
        @x = split(/\|\|/,$b);
        foreach $x (@x)
        {
            $x =~ s/\s+$//;
            print "$c\,\"$x\"\n";
        }
    }
}


c:\temp>perl foo.pl sample1.csv
"Company A","September 17 2010 3:28 PM This is the first note."
"Company A","September 17 2010 4:28 PM This is the second note."
"Company B","May 1 2010 4:18 PM This is the first note."
"Company B","May 15 2010 1:12 PM This is the second note."
"Company B","May 31 2010 1:00 PM This is the third note."

c:\temp>perl foo.pl sample2.csv
"First","Last","Company A","September 17 2010 3:28 PM This is the first note."
"First","Last","Company A","September 17 2010 4:28 PM This is the second note."
"First","Last","Company B","May 1 2010 4:18 PM This is the first note."
"First","Last","Company B","May 15 2010 1:12 PM This is the second note."
"First","Last","Company B","May 31 2010 1:00 PM This is the third note."

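Compared with the requested format, the output above has lost the comma in each date ("September 17 2010" rather than "September 17, 2010"), because the script removes every comma that is not a field separator. The following is only a sketch of one way to keep those commas, not a drop-in replacement for the script above. It assumes that every field is double-quoted, that each record sits on a single line, and that Notes is the last column. The helper names quoted_fields and split_notes are made up for illustration.

```perl
use strict;
use warnings;

# Return the contents of every double-quoted field on one CSV line.
# Assumption: all fields are quoted. A doubled quote ("") inside a
# field is unescaped to a single quote character.
sub quoted_fields {
    my ($line) = @_;
    my @fields = $line =~ /"((?:[^"]|"")*)"/g;
    s/""/"/g for @fields;
    return @fields;
}

# Expand one record into one output line per <BR>-separated note.
# Assumption: the last field is Notes; the other fields are repeated as-is.
sub split_notes {
    my ($line) = @_;
    my @fields = quoted_fields($line);
    return () unless @fields;
    my $notes = pop @fields;
    my @out;
    for my $note ( split /\s*<br>\s*/i, $notes ) {
        next if $note eq '';
        # Re-quote each field, doubling any embedded quote characters.
        my @row = map { my $f = $_; $f =~ s/"/""/g; qq("$f") } @fields, $note;
        push @out, join ',', @row;
    }
    return @out;
}

# Demo on a record from sample1.csv as quoted in the question.
my $record = '"Company A","September 17, 2010 3:28 PM This is the first note.<BR>September 17, 2010 4:28 PM This is the second note."';
print "$_\n" for split_notes($record);
```

For a whole file, the same function could be applied to each line after the header line has been skipped. Records that span several physical lines, which the script above handles, are outside what this sketch covers.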


Expert Comment

by:FishMonger
ID: 34989719

#!/usr/bin/perl

use strict;
use warnings;

die "usage: $0 <csvfile>" if ! @ARGV;

my $csvfile = shift;

open my $csv_fh, '<', $csvfile or die "can't open '$csvfile' $!";

my @header = split(/,\s?/, <$csv_fh>);
chomp $header[-1];
s/"//g for @header;

while ( my $line = <$csv_fh> ) {
    chomp $line;
    next if $line =~ /^\s*$/;
    
    my %csv_fields;
    @csv_fields{@header} = split(/,\s?/, $line, scalar @header);
    s/"//g for @csv_fields{@header};
    
    $csv_fields{'Notes'} = [ split(/<br>/i, $csv_fields{'Notes'}) ];
    
    for my $i ( 0..$#{ $csv_fields{'Notes'} } ) {
        print qq("$csv_fields{$_}",) for @header[0..$#header -1];
        print qq("$csv_fields{'Notes'}->[$i]"\n);
    }
}
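The question asks for a split driven by a timestamp regex, whereas both scripts in this thread key on the <BR> tag. If some notes turn out not to be separated by <BR>, splitting where a timestamp begins might be an alternative. The sketch below assumes the timestamp shape seen in the sample data ("September 17, 2010 3:28 PM"). The names $timestamp and split_on_timestamp are hypothetical. A note whose own text contains a date in that shape would be split there as well.

```perl
use strict;
use warnings;

# Timestamp shape taken from the sample data, e.g. "September 17, 2010 3:28 PM".
my $timestamp = qr/[A-Z][a-z]+ \d{1,2}, \d{4} \d{1,2}:\d{2} [AP]M/;

# Split a Notes value wherever a new timestamp begins. Any <BR> tags and
# whitespace directly in front of the timestamp are consumed by the split.
sub split_on_timestamp {
    my ($notes) = @_;
    my @parts = split /(?:\s|(?i:<br>))*(?=$timestamp)/, $notes;
    s/(?:\s|<br>)+$//i for @parts;    # drop a trailing <BR> or whitespace
    return grep { length } @parts;    # discard empty pieces
}

# Demo on the Notes value of "Company B" from the question.
my $notes = 'May 1, 2010 4:18 PM This is the first note.<BR>May 15, 2010 1:12 PM This is the second note.<BR>May 31, 2010 1:00 PM This is the third note.';
print "$_\n" for split_on_timestamp($notes);
```

The lookahead keeps each timestamp attached to the note that follows it, so nothing needs to be stitched back together after the split.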
