Solved

Perl CSV Manipulation - Split Single Record into Multiple Records

Posted on 2011-02-26
3
582 Views
Last Modified: 2012-05-11
This is a followup to a previous question I asked http://www.experts-exchange.com/Programming/Languages/Scripting/Perl/Q_26845234.html

I have other data files that are part of the data migration that do not need the data to be cleansed but need to split the Notes column based on a timestamp RegEx.

For example, my first data file is like this

"Company Name", "Notes"
"Company A","September 17, 2010 3:28 PM This is the first note.<BR>September 17, 2010 4:28 PM This is the second note."
"Company B","May 1, 2010 4:18 PM This is the first note.<BR>May 15, 2010 1:12 PM This is the second note.<BR>May 31, 2010 1:00 PM This is the third note."

I need this to be in the format

"Company A","September 17, 2010 3:28 PM This is the first note."
"Company A","September 17, 2010 4:28 PM This is the second note."
"Company B","May 1, 2010 4:18 PM This is the first note."
"Company B","May 15, 2010 1:12 PM This is the second note."
"Company B","May 31, 2010 1:00 PM This is the third note."

I also have another data file that is similar but has more columns.

"Firstname", "Lastname", "Company Name", "Notes"
"First","Last","Company A","September 17, 2010 3:28 PM This is the first note.<BR>September 17, 2010 4:28 PM This is the second note."
"First","Last","Company B","May 1, 2010 4:18 PM This is the first note.<BR>May 15, 2010 1:12 PM This is the second note.<BR>May 31, 2010 1:00 PM This is the third note."

I need this output to be like

"First","Last","Company A","September 17, 2010 3:28 PM This is the first note."
"First","Last","Company A","September 17, 2010 4:28 PM This is the second note."
"First","Last","Company B","May 1, 2010 4:18 PM This is the first note."
"First","Last","Company B","May 15, 2010 1:12 PM This is the second note."
"First","Last","Company B","May 31, 2010 1:00 PM This is the third note."

Attached are the data samples in .csv format.

It would be great if this could be done in a single script that does both formats.
sample1.csv
sample2.csv
0
Comment
Question by:mikedgibson
  • 2
3 Comments
 
LVL 16

Expert Comment

by:sjklein42
ID: 34989329
Hi.  I'll look at it.
0
 
LVL 16

Accepted Solution

by:
sjklein42 earned 500 total points
ID: 34989357
I think this will also handle the data from your other (first) question.

<>;     # ignore header line

while ( <> )
{
    s/[\r\n]+$//;       # trim

    if ( $_ ne '' )     # ignore blank lines
    {
        while ( 1 )
        {
            s/\"\"/\'/g;            # double-double quotes ("") go to single quote
            if ( /\"$/ ) { last; }  # if the last character is a double quote, we have a complete record

            $next = <>;             # grab another line
            if ( $next eq '' ) { last; }    # end-of-file

            # insert "special" indicator before subrecords starting with a date
            if ( $next =~ /^[A-Z][a-z]+ [0-9]+\, [0-9]+ [0-9]+\:[0-9]+/ ) { $next = '"||"' . $next; }

            $_ .= ' ' . $next;

            s/[\r\n]+$//;       # trim again
        }

        s/\"\,\"/\:\:\:/g;  # temporarily hide commas between fields
        s/\,//g;            # get rid of remaining commas (inside fields)
        s/\:\:\:/\,/g;      # restore commas between fields
        s/\"//g;

        @z =  split(/\,/);
        $b = pop(@z);
        $c = '"' . join('","',@z) . '"';
        $b =~ s/\<br\>/\|\|/ig;
        @x = split(/\|\|/,$b);
        foreach $x (@x)
        {
            $x =~ s/\s+$//;
            print "$c\,\"$x\"\n";
        }
    }
}

Open in new window



c:\temp>perl foo.pl sample1.csv
"Company A","September 17 2010 3:28 PM This is the first note."
"Company A","September 17 2010 4:28 PM This is the second note."
"Company B","May 1 2010 4:18 PM This is the first note."
"Company B","May 15 2010 1:12 PM This is the second note."
"Company B","May 31 2010 1:00 PM This is the third note."

c:\temp>perl foo.pl sample2.csv
"First","Last","Company A","September 17 2010 3:28 PM This is the first note."
"First","Last","Company A","September 17 2010 4:28 PM This is the second note."
"First","Last","Company B","May 1 2010 4:18 PM This is the first note."
"First","Last","Company B","May 15 2010 1:12 PM This is the second note."
"First","Last","Company B","May 31 2010 1:00 PM This is the third note."

Open in new window

0
 
LVL 28

Expert Comment

by:FishMonger
ID: 34989719

#!/usr/bin/perl

use strict;
use warnings;

die "usage: $0 <csvfile>" if ! @ARGV;

my $csvfile = shift;

open my $csv_fh, '<', $csvfile or die "can't open '$csvfile' $!";

my @header = split(/,\s?/, <$csv_fh>);
chomp $header[-1];
s/"//g for @header;

while ( my $line = <$csv_fh> ) {
    chomp $line;
    next if $line =~ /^\s*$/;
    
    my %csv_fields;
    @csv_fields{@header} = split(/,\s?/, $line, scalar @header);
    s/"//g for @csv_fields{@header};
    
    $csv_fields{'Notes'} = [ split(/<br>/i, $csv_fields{'Notes'}) ];
    
    for my $i ( 0..$#{ $csv_fields{'Notes'} } ) {
        print qq("$csv_fields{$_}",) for @header[0..$#header -1];
        print qq("$csv_fields{'Notes'}->[$i]"\n);
    }
}

Open in new window

0

Featured Post

Free Tool: Site Down Detector

Helpful to verify reports of your own downtime, or to double check a downed website you are trying to access.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Title # Comments Views Activity
How to delete rows in a file based on the file name. 5 61
PERL variable conundrum 9 94
what are these modules in perl script 4 103
Log File Creation with Header and Footer 17 128
I have been pestered over the years to produce and distribute regular data extracts, and often the request have explicitly requested the data be emailed as an Excel attachement; specifically Excel, as it appears: CSV files confuse (no Red or Green h…
There are many situations when we need to display the data in sorted order. For example: Student details by name or by rank or by total marks etc. If you are working on data driven based projects then you will use sorting techniques very frequently.…
Explain concepts important to validation of email addresses with regular expressions. Applies to most languages/tools that uses regular expressions. Consider email address RFCs: Look at HTML5 form input element (with type=email) regex pattern: T…

829 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question