CSV manipulation in Perl

I am trying to perform a data migration and my source information is in a .csv file similar to the following:

"Company","Notes (All)"
"Company A","September 17, 2010 3:28 PM This is the contents of the note.\r\nThey can be multiline and contain ""double quotes"".March 23, 2009 8:57 AM There are also multiple notes within the same field."
"Company B","March 17, 2004 10:55 AM This is the second company's notes.\r\nAlso, these notes sometimes contain commas within the text field."

My main priority is sanitizing the "Notes  (All)" field to strip out any characters that will give me issues during the import such as:

 \r
\n
"
,

However, I obviously need to keep the \r\n at the end of every row, the double quotes around the text fields and the commas between the fields and I am looking to do this in perl if possible.

Ideally, a perl script that does the above would be great but if possible I would like it if it could go one step further and split up the notes field based on the timestamp entries. If I take the example above, this is how I would like to sanitized output to look:

"Company","Notes (All)"
"Company A","September 17, 2010 3:28 PM This is the contents of the note. They can be multiline and contain 'double quotes'."
"Company A","March 23, 2009 8:57 AM There are also multiple notes within the same field."
"Company B","March 17, 2004 10:55 AM This is the second company's notes. Also these notes sometimes contain commas within the text field."

I would like "" changed to ', \r\n within the field changed to space, remove all commas and when there are multiple notes in the field, output an additional entry.

Mike
LVL 2
mikedgibsonAsked:
Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

sjklein42Commented:
Are the "\r\n" actually backslashes and "n"s and "r"s, or are they really newlines?  It might be best if you attached the raw input file.

The Perl script is not hard, but please supply the raw input so we can test.
0
sjklein42Commented:
This may do what you want.  It does not yet split multiple notes into separate lines.  Please try it.

while ( <> )
{
	s/[\r\n]//g;		# trim

	s/\"\"/\'/g;		# double-double quotes ("") go to single quote
	s/\\[rn]/ /g;		# "\r" and "\n" go to spaces
	s/\s+/ /g;			# collapse multiple spaces

	s/\"\,\"/\:\:\:/g;	# temporarily hide commas between fields
	s/\,//g;			# get rid of remaining commas (inside fields)
	s/\:\:\:/\"\,\"/g;	# restore commas between fields

	print "$_\n";
}

Open in new window

0
sjklein42Commented:
This version should split your multi-note lines

while ( <> )
{
	s/[\r\n]//g;		# trim

	s/\"\"/\'/g;		# double-double quotes ("") go to single quote
	s/\\[rn]/ /g;		# "\r" and "\n" go to spaces
	s/\s+/ /g;			# collapse multiple spaces

	s/\"\,\"/\:\:\:/g;	# temporarily hide commas between fields
	s/\,//g;			# get rid of remaining commas (inside fields)
	s/\:\:\:/\"\,\"/g;	# restore commas between fields


	s/([A-Z][a-z]+ [0-9]+ [0-9]+ [0-9]+:[0-9]+)/\n$1/g;
	@x = split(/\n/);
	$c = shift(@x);

	foreach $x (@x) { print "$c$x\n"; }
}

Open in new window

0
Get your problem seen by more experts

Be seen. Boost your question’s priority for more expert views and faster solutions

mikedgibsonAuthor Commented:
Sorry, they are indeed newlines and not literally \r\n. I should have been more clear. I will test the script you provided.
0
sjklein42Commented:
The script as written treated them literally as \r\n.  I will have to change that.

Can you please post as an attachment a real test file.
0
sjklein42Commented:
This version should handle input records that are split across multiple lines (embedded newlines):

while ( <> )
{
	s/[\r\n]//g;		# trim

	if ( $_ ne '' )		# ignore blank lines
	{
		while ( ! ( /\"$/ ) )	# if the last character is not a double quote...
		{
			$_ .= ' ' . <>;		# grab another line
			s/[\r\n]//g;		# trim again
		}

		s/\"\"/\'/g;		# double-double quotes ("") go to single quote
		s/\\[rn]/ /g;		# "\r" and "\n" go to spaces
		s/\s+/ /g;			# collapse multiple spaces

		s/\"\,\"/\:\:\:/g;	# temporarily hide commas between fields
		s/\,//g;			# get rid of remaining commas (inside fields)
		s/\:\:\:/\"\,\"/g;	# restore commas between fields


		s/([A-Z][a-z]+ [0-9]+ [0-9]+ [0-9]+:[0-9]+)/\n$1/g;
		@x = split(/\n/);
		$c = shift(@x);

		foreach $x (@x) { print "$c$x\n"; }
	}
}

Open in new window

0
sjklein42Commented:
I guess it doesn't need to check for literal \r and \n anymore.  Took out line 14.   ;)

while ( <> )
{
	s/[\r\n]//g;		# trim

	if ( $_ ne '' )		# ignore blank lines
	{
		while ( ! ( /\"$/ ) )	# if the last character is not a double quote...
		{
			$_ .= ' ' . <>;		# grab another line
			s/[\r\n]//g;		# trim again
		}

		s/\"\"/\'/g;		# double-double quotes ("") go to single quote
		s/\s+/ /g;			# collapse multiple spaces

		s/\"\,\"/\:\:\:/g;	# temporarily hide commas between fields
		s/\,//g;			# get rid of remaining commas (inside fields)
		s/\:\:\:/\"\,\"/g;	# restore commas between fields


		s/([A-Z][a-z]+ [0-9]+ [0-9]+ [0-9]+:[0-9]+)/\n$1/g;
		@x = split(/\n/);
		$c = shift(@x);

		foreach $x (@x) { print "$c$x\n"; }
	}
}

Open in new window

0
mikedgibsonAuthor Commented:
OK. In the first script the following seem to work

1) Replacement of double double quotes with single quotes
2) Collapsing multiple spaces
3) Getting rid of commas

What doesn't seem to be working still is the conversion of \r \n to spaces. Is this because it isn't literally \r\n? Could we do something more generic there and simply convert all non printable chars (00-1F) to a # symbol? I tried this but doesn't seem to work

s/\\x00-\x1F]/#/g;

Mike
0
mikedgibsonAuthor Commented:
You are updating faster than I am. I can't post a real file because there is confidential data included.
0
sjklein42Commented:
This should handle all flavors of \r and \n, literal or not.

Understand about the data, but if this doesn't work, how about taking your example from the first posting,  making that into a real data file and posting that.

while ( <> )
{
	s/[\r\n]//g;		# trim

	if ( $_ ne '' )		# ignore blank lines
	{
		while ( ! ( /\"$/ ) )	# if the last character is not a double quote...
		{
			$_ .= ' ' . <>;		# grab another line
			s/[\r\n]//g;		# trim again
		}

		s/\"\"/\'/g;		# double-double quotes ("") go to single quote
		s/\\[rn]/ /g;		# "\r" and "\n" go to spaces
		s/[\r\n]/ /g;		# \r and \n go to spaces
		s/\s+/ /g;			# collapse multiple spaces

		s/\"\,\"/\:\:\:/g;	# temporarily hide commas between fields
		s/\,//g;			# get rid of remaining commas (inside fields)
		s/\:\:\:/\"\,\"/g;	# restore commas between fields


		s/([A-Z][a-z]+ [0-9]+ [0-9]+ [0-9]+:[0-9]+)/\n$1/g;
		@x = split(/\n/);
		$c = shift(@x);

		foreach $x (@x) { print "$c$x\n"; }
	}
}

Open in new window

0
mikedgibsonAuthor Commented:
Also, it appears as though the contents of the notes themselves have dates that match the regex you were using to create a new entry so that most likely won't work. That was a nice to have anyway.
0
mikedgibsonAuthor Commented:
Here is a sample data file.
data.csv
0
mikedgibsonAuthor Commented:
Unless you can think of some innovative way to determine if a data is the start of a new note or just a date within the text I think we'll need to drop that part.
0
sjklein42Commented:
Ok.  This version does not try to split on dates.

It also ignores the header line.

I might work on it some more later.  I think we can use the fact that there is always a line break before the subrecords (yes?).  That plus the subrecord starts with a date should be enough to go on.

<>;		# ignore header line

while ( <> )
{
	s/[\r\n]+$//;		# trim

	if ( $_ ne '' )		# ignore blank lines
	{
		while ( ! ( /\"$/ ) )	# if the last character is not a double quote...
		{
			$_ .= ' ' . <>;		# grab another line
			s/[\r\n]+$//;		# trim again
		}

		s/\"\"/\'/g;		# double-double quotes ("") go to single quote
		s/\s+/ /g;			# collapse multiple spaces

		s/\"\,\"/\:\:\:/g;	# temporarily hide commas between fields
		s/\,//g;			# get rid of remaining commas (inside fields)
		s/\:\:\:/\"\,\"/g;	# restore commas between fields

		print "$_\n";

		## s/([A-Z][a-z]+ [0-9]+ [0-9]+ [0-9]+:[0-9]+)/\n$1/g;
		## @x = split(/\n/);
		## $c = shift(@x);
		## foreach $x (@x) { print "$c$x\n"; }
	}
}

Open in new window

0
mikedgibsonAuthor Commented:
Hmm that's a good point .. The sub records always have the date at the beginning of a new line so we could use that to identify a sub record, as long as we haven't already removed the \r\n before looking for the subrecords.
0
mikedgibsonAuthor Commented:
I tried the latest script you sent. It just hangs and doesn't return anything.
0
sjklein42Commented:
There may still be a problem with blank lines.

I'll add a fix for it.

I'll get back to you kn a few hours - dinner time.
0
mikedgibsonAuthor Commented:
No rush. Just need this in the new day or so.
0
sjklein42Commented:
<>;		# ignore header line

while ( <> )
{
	s/[\r\n]+$//;		# trim

	if ( $_ ne '' )		# ignore blank lines
	{
		while ( ! ( /\"$/ ) )	# if the last character is not a double quote...
		{
			$next = <>;		# grab another line
			if ( $next eq '' ) { last; }	# end-of-file

			# insert "special" indicator before subrecords starting with a date
			if ( $next =~ /[A-Z][a-z]+ [0-9]+\, [0-9]+ [0-9]+\:[0-9]+/ ) { $next = '||"' . $next; }

			$_ .= ' ' . $next;

			s/[\r\n]+$//;		# trim again
		}

		s/\"\"/\'/g;		# double-double quotes ("") go to single quote

		s/\"\,\"/\:\:\:/g;	# temporarily hide commas between fields
		s/\,//g;			# get rid of remaining commas (inside fields)
		s/\:\:\:/\"\,\"/g;	# restore commas between fields

		##print "$_\n";

		($c,$b) =  split(/\,/);
		@x = split(/\|\|/,$b);
		foreach $x (@x) { print "$c\,$x\n"; }
	}
}

Open in new window


Using your input produced this output:

c:\temp>perl foo.pl foo.csv
"Company A","September 17 2010 3:28 PM This is the contents of the note. They can be multiline and contain 'double quotes'.
"Company A","March 23 2009 8:57 AM There are also multiple notes within the same field.
"Company A","July 2 2010 6:56 AM they can also contain dates inside of them such as Sent: Friday September 17 2010 3:24 PM"
"Company B","March 17 2004 10:55 AM This is the second company's notes. Also these notes sometimes contain commas within the text field."

Open in new window

0
mikedgibsonAuthor Commented:
This is great, I just ran it against my real data and it was very close. Just one additional request if possible. There are lines in the file like the following that don't indicate a new sub record and are simply a part of the previous record but they are matching the sub record regex. Can the sub record regex be improved to not match when Sent: is at the start of the line?

Sent: Wednesday August 11 2010 11:45 AM

The regex
0
sjklein42Commented:
Glad we're getting there, but I'm confused about your comment.

The test data we're using already has a "Sent:  date" in the company A record.  Am I handling it the wrong way? How so?  Or are you describing a different case?  Please give me a complete input test record and desired output so I can see what you're talking about.
0
mikedgibsonAuthor Commented:
Sorry, this is getting confusing to explain. It is a new case that I didn't realize was in my data so I didn't include it in the sample. I am attaching a new data file with this sample. Basically there is an email within the Notes field that shouldn't be treated as a sub record but your RegEx that finds sub records is matching on the Sent: field in the email.
data.csv
0
sjklein42Commented:
That was easy.  It was really a bug in my code.  Missing ^ on line 15.  Try this

<>;		# ignore header line

while ( <> )
{
	s/[\r\n]+$//;		# trim

	if ( $_ ne '' )		# ignore blank lines
	{
		while ( ! ( /\"$/ ) )	# if the last character is not a double quote...
		{
			$next = <>;		# grab another line
			if ( $next eq '' ) { last; }	# end-of-file

			# insert "special" indicator before subrecords starting with a date
			if ( $next =~ /^[A-Z][a-z]+ [0-9]+\, [0-9]+ [0-9]+\:[0-9]+/ ) { $next = '||"' . $next; }

			$_ .= ' ' . $next;

			s/[\r\n]+$//;		# trim again
		}

		s/\"\"/\'/g;		# double-double quotes ("") go to single quote

		s/\"\,\"/\:\:\:/g;	# temporarily hide commas between fields
		s/\,//g;			# get rid of remaining commas (inside fields)
		s/\:\:\:/\"\,\"/g;	# restore commas between fields

		##print "$_\n";

		($c,$b) =  split(/\,/);
		@x = split(/\|\|/,$b);
		foreach $x (@x) { print "$c\,$x\n"; }
	}
}

Open in new window

0
sjklein42Commented:
And one more change to fix a problem with missing double quotes at the end of some output lines.

<>;		# ignore header line

while ( <> )
{
	s/[\r\n]+$//;		# trim

	if ( $_ ne '' )		# ignore blank lines
	{
		while ( ! ( /\"$/ ) )	# if the last character is not a double quote...
		{
			$next = <>;		# grab another line
			if ( $next eq '' ) { last; }	# end-of-file

			# insert "special" indicator before subrecords starting with a date
			if ( $next =~ /^[A-Z][a-z]+ [0-9]+\, [0-9]+ [0-9]+\:[0-9]+/ ) { $next = '"||"' . $next; }

			$_ .= ' ' . $next;

			s/[\r\n]+$//;		# trim again
		}

		s/\"\"/\'/g;		# double-double quotes ("") go to single quote

		s/\"\,\"/\:\:\:/g;	# temporarily hide commas between fields
		s/\,//g;			# get rid of remaining commas (inside fields)
		s/\:\:\:/\"\,\"/g;	# restore commas between fields


		($c,$b) =  split(/\,/);
		@x = split(/\|\|/,$b);
		foreach $x (@x)
		{
			$x =~ s/\s+\"$/\"/;
			print "$c\,$x\n";
		}
	}
}

Open in new window

0

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
mikedgibsonAuthor Commented:
Nailed it. This is perfect! Thank you very much for your assistance.
0
sjklein42Commented:
A pleasure.  Some people do crosswords, I like PERL scripts.  :)
0
mikedgibsonAuthor Commented:
sjklein42, sorry to add a comment after this has been closed but I found one small issue with the script. For notes records that end in "" the script converts it to ' but no double quote is added to the end. I added a new sample data file with this scenario. Do you think you could take a quick look?
data.csv
0
sjklein42Commented:
This should work better.

<>;     # ignore header line

while ( <> )
{
    s/[\r\n]+$//;       # trim

    if ( $_ ne '' )     # ignore blank lines
    {
        while ( 1 )
        {
            s/\"\"/\'/g;            # double-double quotes ("") go to single quote
            if ( /\"$/ ) { last; }  # if the last character is a double quote, we have a complete record

            $next = <>;             # grab another line
            if ( $next eq '' ) { last; }    # end-of-file

            # insert "special" indicator before subrecords starting with a date
            if ( $next =~ /^[A-Z][a-z]+ [0-9]+\, [0-9]+ [0-9]+\:[0-9]+/ ) { $next = '"||"' . $next; }

            $_ .= ' ' . $next;

            s/[\r\n]+$//;       # trim again
        }

        s/\"\,\"/\:\:\:/g;  # temporarily hide commas between fields
        s/\,//g;            # get rid of remaining commas (inside fields)
        s/\:\:\:/\"\,\"/g;  # restore commas between fields

        ($c,$b) =  split(/\,/);
        @x = split(/\|\|/,$b);
        foreach $x (@x)
        {
            $x =~ s/\s+\"$/\"/;
            print "$c\,$x\n";
        }
    }
}

Open in new window

0
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
Perl

From novice to tech pro — start learning today.