Solved

Perl CSV Manipulation - Split Single Record into Multiple Records

Posted on 2011-02-26
3
556 Views
Last Modified: 2012-05-11
This is a followup to a previous question I asked http://www.experts-exchange.com/Programming/Languages/Scripting/Perl/Q_26845234.html

I have other data files that are part of the data migration that do not need the data to be cleansed but need to split the Notes column based on a timestamp RegEx.

For example, my first data file is like this

"Company Name", "Notes"
"Company A","September 17, 2010 3:28 PM This is the first note.<BR>September 17, 2010 4:28 PM This is the second note."
"Company B","May 1, 2010 4:18 PM This is the first note.<BR>May 15, 2010 1:12 PM This is the second note.<BR>May 31, 2010 1:00 PM This is the third note."

I need this to be in the format

"Company A","September 17, 2010 3:28 PM This is the first note."
"Company A","September 17, 2010 4:28 PM This is the second note."
"Company B","May 1, 2010 4:18 PM This is the first note."
"Company B","May 15, 2010 1:12 PM This is the second note."
"Company B","May 31, 2010 1:00 PM This is the third note."

I also have another data file that is similar but has more columns.

"Firstname", "Lastname", "Company Name", "Notes"
"First","Last","Company A","September 17, 2010 3:28 PM This is the first note.<BR>September 17, 2010 4:28 PM This is the second note."
"First","Last","Company B","May 1, 2010 4:18 PM This is the first note.<BR>May 15, 2010 1:12 PM This is the second note.<BR>May 31, 2010 1:00 PM This is the third note."

I need this output to be like

"First","Last","Company A","September 17, 2010 3:28 PM This is the first note."
"First","Last","Company A","September 17, 2010 4:28 PM This is the second note."
"First","Last","Company B","May 1, 2010 4:18 PM This is the first note."
"First","Last","Company B","May 15, 2010 1:12 PM This is the second note."
"First","Last","Company B","May 31, 2010 1:00 PM This is the third note."

Attached are the data samples in .csv format.

It would be great if this could be done in a single script that does both formats.
sample1.csv
sample2.csv
0
Comment
Question by:mikedgibson
  • 2
3 Comments
 
LVL 16

Expert Comment

by:sjklein42
Comment Utility
Hi.  I'll look at it.
0
 
LVL 16

Accepted Solution

by:
sjklein42 earned 500 total points
Comment Utility
I think this will also handle the data from your other (first) question.

<>;     # ignore header line

while ( <> )
{
    s/[\r\n]+$//;       # trim

    if ( $_ ne '' )     # ignore blank lines
    {
        while ( 1 )
        {
            s/\"\"/\'/g;            # double-double quotes ("") go to single quote
            if ( /\"$/ ) { last; }  # if the last character is a double quote, we have a complete record

            $next = <>;             # grab another line
            if ( $next eq '' ) { last; }    # end-of-file

            # insert "special" indicator before subrecords starting with a date
            if ( $next =~ /^[A-Z][a-z]+ [0-9]+\, [0-9]+ [0-9]+\:[0-9]+/ ) { $next = '"||"' . $next; }

            $_ .= ' ' . $next;

            s/[\r\n]+$//;       # trim again
        }

        s/\"\,\"/\:\:\:/g;  # temporarily hide commas between fields
        s/\,//g;            # get rid of remaining commas (inside fields)
        s/\:\:\:/\,/g;      # restore commas between fields
        s/\"//g;

        @z =  split(/\,/);
        $b = pop(@z);
        $c = '"' . join('","',@z) . '"';
        $b =~ s/\<br\>/\|\|/ig;
        @x = split(/\|\|/,$b);
        foreach $x (@x)
        {
            $x =~ s/\s+$//;
            print "$c\,\"$x\"\n";
        }
    }
}

Open in new window



c:\temp>perl foo.pl sample1.csv
"Company A","September 17 2010 3:28 PM This is the first note."
"Company A","September 17 2010 4:28 PM This is the second note."
"Company B","May 1 2010 4:18 PM This is the first note."
"Company B","May 15 2010 1:12 PM This is the second note."
"Company B","May 31 2010 1:00 PM This is the third note."

c:\temp>perl foo.pl sample2.csv
"First","Last","Company A","September 17 2010 3:28 PM This is the first note."
"First","Last","Company A","September 17 2010 4:28 PM This is the second note."
"First","Last","Company B","May 1 2010 4:18 PM This is the first note."
"First","Last","Company B","May 15 2010 1:12 PM This is the second note."
"First","Last","Company B","May 31 2010 1:00 PM This is the third note."

Open in new window

0
 
LVL 28

Expert Comment

by:FishMonger
Comment Utility

#!/usr/bin/perl

use strict;
use warnings;

die "usage: $0 <csvfile>" if ! @ARGV;

my $csvfile = shift;

open my $csv_fh, '<', $csvfile or die "can't open '$csvfile' $!";

my @header = split(/,\s?/, <$csv_fh>);
chomp $header[-1];
s/"//g for @header;

while ( my $line = <$csv_fh> ) {
    chomp $line;
    next if $line =~ /^\s*$/;
    
    my %csv_fields;
    @csv_fields{@header} = split(/,\s?/, $line, scalar @header);
    s/"//g for @csv_fields{@header};
    
    $csv_fields{'Notes'} = [ split(/<br>/i, $csv_fields{'Notes'}) ];
    
    for my $i ( 0..$#{ $csv_fields{'Notes'} } ) {
        print qq("$csv_fields{$_}",) for @header[0..$#header -1];
        print qq("$csv_fields{'Notes'}->[$i]"\n);
    }
}

Open in new window

0

Featured Post

Why You Should Analyze Threat Actor TTPs

After years of analyzing threat actor behavior, it’s become clear that at any given time there are specific tactics, techniques, and procedures (TTPs) that are particularly prevalent. By analyzing and understanding these TTPs, you can dramatically enhance your security program.

Join & Write a Comment

Email validation in proper way is  very important validation required in any web pages. This code is self explainable except that Regular Expression which I used for pattern matching. I originally published as a thread on my website : http://www…
A year or so back I was asked to have a play with MongoDB; within half an hour I had downloaded (http://www.mongodb.org/downloads),  installed and started the daemon, and had a console window open. After an hour or two of playing at the command …
Explain concepts important to validation of email addresses with regular expressions. Applies to most languages/tools that uses regular expressions. Consider email address RFCs: Look at HTML5 form input element (with type=email) regex pattern: T…
Internet Business Fax to Email Made Easy - With eFax Corporate (http://www.enterprise.efax.com), you'll receive a dedicated online fax number, which is used the same way as a typical analog fax number. You'll receive secure faxes in your email, fr…

743 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

15 Experts available now in Live!

Get 1:1 Help Now