Solved

CSV manipulation in Perl

Posted on 2011-02-24
28
617 Views
Last Modified: 2012-05-11
I am trying to perform a data migration and my source information is in a .csv file similar to the following:

"Company","Notes (All)"
"Company A","September 17, 2010 3:28 PM This is the contents of the note.\r\nThey can be multiline and contain ""double quotes"".March 23, 2009 8:57 AM There are also multiple notes within the same field."
"Company B","March 17, 2004 10:55 AM This is the second company's notes.\r\nAlso, these notes sometimes contain commas within the text field."

My main priority is sanitizing the "Notes  (All)" field to strip out any characters that will give me issues during the import such as:

 \r
\n
"
,

However, I obviously need to keep the \r\n at the end of every row, the double quotes around the text fields and the commas between the fields and I am looking to do this in perl if possible.

Ideally, a perl script that does the above would be great but if possible I would like it if it could go one step further and split up the notes field based on the timestamp entries. If I take the example above, this is how I would like to sanitized output to look:

"Company","Notes (All)"
"Company A","September 17, 2010 3:28 PM This is the contents of the note. They can be multiline and contain 'double quotes'."
"Company A","March 23, 2009 8:57 AM There are also multiple notes within the same field."
"Company B","March 17, 2004 10:55 AM This is the second company's notes. Also these notes sometimes contain commas within the text field."

I would like "" changed to ', \r\n within the field changed to space, remove all commas and when there are multiple notes in the field, output an additional entry.

Mike
0
Comment
Question by:mikedgibson
  • 15
  • 13
28 Comments
 
LVL 16

Expert Comment

by:sjklein42
Comment Utility
Are the "\r\n" actually backslashes and "n"s and "r"s, or are they really newlines?  It might be best if you attached the raw input file.

The Perl script is not hard, but please supply the raw input so we can test.
0
 
LVL 16

Expert Comment

by:sjklein42
Comment Utility
This may do what you want.  It does not yet split multiple notes into separate lines.  Please try it.

while ( <> )
{
	s/[\r\n]//g;		# trim

	s/\"\"/\'/g;		# double-double quotes ("") go to single quote
	s/\\[rn]/ /g;		# "\r" and "\n" go to spaces
	s/\s+/ /g;			# collapse multiple spaces

	s/\"\,\"/\:\:\:/g;	# temporarily hide commas between fields
	s/\,//g;			# get rid of remaining commas (inside fields)
	s/\:\:\:/\"\,\"/g;	# restore commas between fields

	print "$_\n";
}

Open in new window

0
 
LVL 16

Expert Comment

by:sjklein42
Comment Utility
This version should split your multi-note lines

while ( <> )
{
	s/[\r\n]//g;		# trim

	s/\"\"/\'/g;		# double-double quotes ("") go to single quote
	s/\\[rn]/ /g;		# "\r" and "\n" go to spaces
	s/\s+/ /g;			# collapse multiple spaces

	s/\"\,\"/\:\:\:/g;	# temporarily hide commas between fields
	s/\,//g;			# get rid of remaining commas (inside fields)
	s/\:\:\:/\"\,\"/g;	# restore commas between fields


	s/([A-Z][a-z]+ [0-9]+ [0-9]+ [0-9]+:[0-9]+)/\n$1/g;
	@x = split(/\n/);
	$c = shift(@x);

	foreach $x (@x) { print "$c$x\n"; }
}

Open in new window

0
 
LVL 2

Author Comment

by:mikedgibson
Comment Utility
Sorry, they are indeed newlines and not literally \r\n. I should have been more clear. I will test the script you provided.
0
 
LVL 16

Expert Comment

by:sjklein42
Comment Utility
The script as written treated them literally as \r\n.  I will have to change that.

Can you please post as an attachment a real test file.
0
 
LVL 16

Expert Comment

by:sjklein42
Comment Utility
This version should handle input records that are split across multiple lines (embedded newlines):

while ( <> )
{
	s/[\r\n]//g;		# trim

	if ( $_ ne '' )		# ignore blank lines
	{
		while ( ! ( /\"$/ ) )	# if the last character is not a double quote...
		{
			$_ .= ' ' . <>;		# grab another line
			s/[\r\n]//g;		# trim again
		}

		s/\"\"/\'/g;		# double-double quotes ("") go to single quote
		s/\\[rn]/ /g;		# "\r" and "\n" go to spaces
		s/\s+/ /g;			# collapse multiple spaces

		s/\"\,\"/\:\:\:/g;	# temporarily hide commas between fields
		s/\,//g;			# get rid of remaining commas (inside fields)
		s/\:\:\:/\"\,\"/g;	# restore commas between fields


		s/([A-Z][a-z]+ [0-9]+ [0-9]+ [0-9]+:[0-9]+)/\n$1/g;
		@x = split(/\n/);
		$c = shift(@x);

		foreach $x (@x) { print "$c$x\n"; }
	}
}

Open in new window

0
 
LVL 16

Expert Comment

by:sjklein42
Comment Utility
I guess it doesn't need to check for literal \r and \n anymore.  Took out line 14.   ;)

while ( <> )
{
	s/[\r\n]//g;		# trim

	if ( $_ ne '' )		# ignore blank lines
	{
		while ( ! ( /\"$/ ) )	# if the last character is not a double quote...
		{
			$_ .= ' ' . <>;		# grab another line
			s/[\r\n]//g;		# trim again
		}

		s/\"\"/\'/g;		# double-double quotes ("") go to single quote
		s/\s+/ /g;			# collapse multiple spaces

		s/\"\,\"/\:\:\:/g;	# temporarily hide commas between fields
		s/\,//g;			# get rid of remaining commas (inside fields)
		s/\:\:\:/\"\,\"/g;	# restore commas between fields


		s/([A-Z][a-z]+ [0-9]+ [0-9]+ [0-9]+:[0-9]+)/\n$1/g;
		@x = split(/\n/);
		$c = shift(@x);

		foreach $x (@x) { print "$c$x\n"; }
	}
}

Open in new window

0
 
LVL 2

Author Comment

by:mikedgibson
Comment Utility
OK. In the first script the following seem to work

1) Replacement of double double quotes with single quotes
2) Collapsing multiple spaces
3) Getting rid of commas

What doesn't seem to be working still is the conversion of \r \n to spaces. Is this because it isn't literally \r\n? Could we do something more generic there and simply convert all non printable chars (00-1F) to a # symbol? I tried this but doesn't seem to work

s/\\x00-\x1F]/#/g;

Mike
0
 
LVL 2

Author Comment

by:mikedgibson
Comment Utility
You are updating faster than I am. I can't post a real file because there is confidential data included.
0
 
LVL 16

Expert Comment

by:sjklein42
Comment Utility
This should handle all flavors of \r and \n, literal or not.

Understand about the data, but if this doesn't work, how about taking your example from the first posting,  making that into a real data file and posting that.

while ( <> )
{
	s/[\r\n]//g;		# trim

	if ( $_ ne '' )		# ignore blank lines
	{
		while ( ! ( /\"$/ ) )	# if the last character is not a double quote...
		{
			$_ .= ' ' . <>;		# grab another line
			s/[\r\n]//g;		# trim again
		}

		s/\"\"/\'/g;		# double-double quotes ("") go to single quote
		s/\\[rn]/ /g;		# "\r" and "\n" go to spaces
		s/[\r\n]/ /g;		# \r and \n go to spaces
		s/\s+/ /g;			# collapse multiple spaces

		s/\"\,\"/\:\:\:/g;	# temporarily hide commas between fields
		s/\,//g;			# get rid of remaining commas (inside fields)
		s/\:\:\:/\"\,\"/g;	# restore commas between fields


		s/([A-Z][a-z]+ [0-9]+ [0-9]+ [0-9]+:[0-9]+)/\n$1/g;
		@x = split(/\n/);
		$c = shift(@x);

		foreach $x (@x) { print "$c$x\n"; }
	}
}

Open in new window

0
 
LVL 2

Author Comment

by:mikedgibson
Comment Utility
Also, it appears as though the contents of the notes themselves have dates that match the regex you were using to create a new entry so that most likely won't work. That was a nice to have anyway.
0
 
LVL 2

Author Comment

by:mikedgibson
Comment Utility
Here is a sample data file.
data.csv
0
 
LVL 2

Author Comment

by:mikedgibson
Comment Utility
Unless you can think of some innovative way to determine if a data is the start of a new note or just a date within the text I think we'll need to drop that part.
0
 
LVL 16

Expert Comment

by:sjklein42
Comment Utility
Ok.  This version does not try to split on dates.

It also ignores the header line.

I might work on it some more later.  I think we can use the fact that there is always a line break before the subrecords (yes?).  That plus the subrecord starts with a date should be enough to go on.

<>;		# ignore header line

while ( <> )
{
	s/[\r\n]+$//;		# trim

	if ( $_ ne '' )		# ignore blank lines
	{
		while ( ! ( /\"$/ ) )	# if the last character is not a double quote...
		{
			$_ .= ' ' . <>;		# grab another line
			s/[\r\n]+$//;		# trim again
		}

		s/\"\"/\'/g;		# double-double quotes ("") go to single quote
		s/\s+/ /g;			# collapse multiple spaces

		s/\"\,\"/\:\:\:/g;	# temporarily hide commas between fields
		s/\,//g;			# get rid of remaining commas (inside fields)
		s/\:\:\:/\"\,\"/g;	# restore commas between fields

		print "$_\n";

		## s/([A-Z][a-z]+ [0-9]+ [0-9]+ [0-9]+:[0-9]+)/\n$1/g;
		## @x = split(/\n/);
		## $c = shift(@x);
		## foreach $x (@x) { print "$c$x\n"; }
	}
}

Open in new window

0
How to run any project with ease

Manage projects of all sizes how you want. Great for personal to-do lists, project milestones, team priorities and launch plans.
- Combine task lists, docs, spreadsheets, and chat in one
- View and edit from mobile/offline
- Cut down on emails

 
LVL 2

Author Comment

by:mikedgibson
Comment Utility
Hmm that's a good point .. The sub records always have the date at the beginning of a new line so we could use that to identify a sub record, as long as we haven't already removed the \r\n before looking for the subrecords.
0
 
LVL 2

Author Comment

by:mikedgibson
Comment Utility
I tried the latest script you sent. It just hangs and doesn't return anything.
0
 
LVL 16

Expert Comment

by:sjklein42
Comment Utility
There may still be a problem with blank lines.

I'll add a fix for it.

I'll get back to you kn a few hours - dinner time.
0
 
LVL 2

Author Comment

by:mikedgibson
Comment Utility
No rush. Just need this in the new day or so.
0
 
LVL 16

Expert Comment

by:sjklein42
Comment Utility
<>;		# ignore header line

while ( <> )
{
	s/[\r\n]+$//;		# trim

	if ( $_ ne '' )		# ignore blank lines
	{
		while ( ! ( /\"$/ ) )	# if the last character is not a double quote...
		{
			$next = <>;		# grab another line
			if ( $next eq '' ) { last; }	# end-of-file

			# insert "special" indicator before subrecords starting with a date
			if ( $next =~ /[A-Z][a-z]+ [0-9]+\, [0-9]+ [0-9]+\:[0-9]+/ ) { $next = '||"' . $next; }

			$_ .= ' ' . $next;

			s/[\r\n]+$//;		# trim again
		}

		s/\"\"/\'/g;		# double-double quotes ("") go to single quote

		s/\"\,\"/\:\:\:/g;	# temporarily hide commas between fields
		s/\,//g;			# get rid of remaining commas (inside fields)
		s/\:\:\:/\"\,\"/g;	# restore commas between fields

		##print "$_\n";

		($c,$b) =  split(/\,/);
		@x = split(/\|\|/,$b);
		foreach $x (@x) { print "$c\,$x\n"; }
	}
}

Open in new window


Using your input produced this output:

c:\temp>perl foo.pl foo.csv
"Company A","September 17 2010 3:28 PM This is the contents of the note. They can be multiline and contain 'double quotes'.
"Company A","March 23 2009 8:57 AM There are also multiple notes within the same field.
"Company A","July 2 2010 6:56 AM they can also contain dates inside of them such as Sent: Friday September 17 2010 3:24 PM"
"Company B","March 17 2004 10:55 AM This is the second company's notes. Also these notes sometimes contain commas within the text field."

Open in new window

0
 
LVL 2

Author Comment

by:mikedgibson
Comment Utility
This is great, I just ran it against my real data and it was very close. Just one additional request if possible. There are lines in the file like the following that don't indicate a new sub record and are simply a part of the previous record but they are matching the sub record regex. Can the sub record regex be improved to not match when Sent: is at the start of the line?

Sent: Wednesday August 11 2010 11:45 AM

The regex
0
 
LVL 16

Expert Comment

by:sjklein42
Comment Utility
Glad we're getting there, but I'm confused about your comment.

The test data we're using already has a "Sent:  date" in the company A record.  Am I handling it the wrong way? How so?  Or are you describing a different case?  Please give me a complete input test record and desired output so I can see what you're talking about.
0
 
LVL 2

Author Comment

by:mikedgibson
Comment Utility
Sorry, this is getting confusing to explain. It is a new case that I didn't realize was in my data so I didn't include it in the sample. I am attaching a new data file with this sample. Basically there is an email within the Notes field that shouldn't be treated as a sub record but your RegEx that finds sub records is matching on the Sent: field in the email.
data.csv
0
 
LVL 16

Expert Comment

by:sjklein42
Comment Utility
That was easy.  It was really a bug in my code.  Missing ^ on line 15.  Try this

<>;		# ignore header line

while ( <> )
{
	s/[\r\n]+$//;		# trim

	if ( $_ ne '' )		# ignore blank lines
	{
		while ( ! ( /\"$/ ) )	# if the last character is not a double quote...
		{
			$next = <>;		# grab another line
			if ( $next eq '' ) { last; }	# end-of-file

			# insert "special" indicator before subrecords starting with a date
			if ( $next =~ /^[A-Z][a-z]+ [0-9]+\, [0-9]+ [0-9]+\:[0-9]+/ ) { $next = '||"' . $next; }

			$_ .= ' ' . $next;

			s/[\r\n]+$//;		# trim again
		}

		s/\"\"/\'/g;		# double-double quotes ("") go to single quote

		s/\"\,\"/\:\:\:/g;	# temporarily hide commas between fields
		s/\,//g;			# get rid of remaining commas (inside fields)
		s/\:\:\:/\"\,\"/g;	# restore commas between fields

		##print "$_\n";

		($c,$b) =  split(/\,/);
		@x = split(/\|\|/,$b);
		foreach $x (@x) { print "$c\,$x\n"; }
	}
}

Open in new window

0
 
LVL 16

Accepted Solution

by:
sjklein42 earned 500 total points
Comment Utility
And one more change to fix a problem with missing double quotes at the end of some output lines.

<>;		# ignore header line

while ( <> )
{
	s/[\r\n]+$//;		# trim

	if ( $_ ne '' )		# ignore blank lines
	{
		while ( ! ( /\"$/ ) )	# if the last character is not a double quote...
		{
			$next = <>;		# grab another line
			if ( $next eq '' ) { last; }	# end-of-file

			# insert "special" indicator before subrecords starting with a date
			if ( $next =~ /^[A-Z][a-z]+ [0-9]+\, [0-9]+ [0-9]+\:[0-9]+/ ) { $next = '"||"' . $next; }

			$_ .= ' ' . $next;

			s/[\r\n]+$//;		# trim again
		}

		s/\"\"/\'/g;		# double-double quotes ("") go to single quote

		s/\"\,\"/\:\:\:/g;	# temporarily hide commas between fields
		s/\,//g;			# get rid of remaining commas (inside fields)
		s/\:\:\:/\"\,\"/g;	# restore commas between fields


		($c,$b) =  split(/\,/);
		@x = split(/\|\|/,$b);
		foreach $x (@x)
		{
			$x =~ s/\s+\"$/\"/;
			print "$c\,$x\n";
		}
	}
}

Open in new window

0
 
LVL 2

Author Closing Comment

by:mikedgibson
Comment Utility
Nailed it. This is perfect! Thank you very much for your assistance.
0
 
LVL 16

Expert Comment

by:sjklein42
Comment Utility
A pleasure.  Some people do crosswords, I like PERL scripts.  :)
0
 
LVL 2

Author Comment

by:mikedgibson
Comment Utility
sjklein42, sorry to add a comment after this has been closed but I found one small issue with the script. For notes records that end in "" the script converts it to ' but no double quote is added to the end. I added a new sample data file with this scenario. Do you think you could take a quick look?
data.csv
0
 
LVL 16

Expert Comment

by:sjklein42
Comment Utility
This should work better.

<>;     # ignore header line

while ( <> )
{
    s/[\r\n]+$//;       # trim

    if ( $_ ne '' )     # ignore blank lines
    {
        while ( 1 )
        {
            s/\"\"/\'/g;            # double-double quotes ("") go to single quote
            if ( /\"$/ ) { last; }  # if the last character is a double quote, we have a complete record

            $next = <>;             # grab another line
            if ( $next eq '' ) { last; }    # end-of-file

            # insert "special" indicator before subrecords starting with a date
            if ( $next =~ /^[A-Z][a-z]+ [0-9]+\, [0-9]+ [0-9]+\:[0-9]+/ ) { $next = '"||"' . $next; }

            $_ .= ' ' . $next;

            s/[\r\n]+$//;       # trim again
        }

        s/\"\,\"/\:\:\:/g;  # temporarily hide commas between fields
        s/\,//g;            # get rid of remaining commas (inside fields)
        s/\:\:\:/\"\,\"/g;  # restore commas between fields

        ($c,$b) =  split(/\,/);
        @x = split(/\|\|/,$b);
        foreach $x (@x)
        {
            $x =~ s/\s+\"$/\"/;
            print "$c\,$x\n";
        }
    }
}

Open in new window

0

Featured Post

Highfive Gives IT Their Time Back

Highfive is so simple that setting up every meeting room takes just minutes and every employee will be able to start or join a call from any room with ease. Never be called into a meeting just to get it started again. This is how video conferencing should work!

Join & Write a Comment

In the distant past (last year) I hacked together a little toy that would allow a couple of Manager types to query, preview, and extract data from a number of MongoDB instances, to their tool of choice: Excel (http://dilbert.com/strips/comic/2007-08…
Checking the Alert Log in AWS RDS Oracle can be a pain through their user interface.  I made a script to download the Alert Log, look for errors, and email me the trace files.  In this article I'll describe what I did and share my script.
Explain concepts important to validation of email addresses with regular expressions. Applies to most languages/tools that uses regular expressions. Consider email address RFCs: Look at HTML5 form input element (with type=email) regex pattern: T…
This tutorial demonstrates a quick way of adding group price to multiple Magento products.

772 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

11 Experts available now in Live!

Get 1:1 Help Now