Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people, just like you, are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
Solved

CSV manipulation in Perl

Posted on 2011-02-24
28
622 Views
Last Modified: 2012-05-11
I am trying to perform a data migration and my source information is in a .csv file similar to the following:

"Company","Notes (All)"
"Company A","September 17, 2010 3:28 PM This is the contents of the note.\r\nThey can be multiline and contain ""double quotes"".March 23, 2009 8:57 AM There are also multiple notes within the same field."
"Company B","March 17, 2004 10:55 AM This is the second company's notes.\r\nAlso, these notes sometimes contain commas within the text field."

My main priority is sanitizing the "Notes  (All)" field to strip out any characters that will give me issues during the import such as:

 \r
\n
"
,

However, I obviously need to keep the \r\n at the end of every row, the double quotes around the text fields and the commas between the fields and I am looking to do this in perl if possible.

Ideally, a perl script that does the above would be great but if possible I would like it if it could go one step further and split up the notes field based on the timestamp entries. If I take the example above, this is how I would like to sanitized output to look:

"Company","Notes (All)"
"Company A","September 17, 2010 3:28 PM This is the contents of the note. They can be multiline and contain 'double quotes'."
"Company A","March 23, 2009 8:57 AM There are also multiple notes within the same field."
"Company B","March 17, 2004 10:55 AM This is the second company's notes. Also these notes sometimes contain commas within the text field."

I would like "" changed to ', \r\n within the field changed to space, remove all commas and when there are multiple notes in the field, output an additional entry.

Mike
0
Comment
Question by:mikedgibson
  • 15
  • 13
28 Comments
 
LVL 16

Expert Comment

by:sjklein42
ID: 34972141
Are the "\r\n" actually backslashes and "n"s and "r"s, or are they really newlines?  It might be best if you attached the raw input file.

The Perl script is not hard, but please supply the raw input so we can test.
0
 
LVL 16

Expert Comment

by:sjklein42
ID: 34972515
This may do what you want.  It does not yet split multiple notes into separate lines.  Please try it.

while ( <> )
{
	s/[\r\n]//g;		# trim

	s/\"\"/\'/g;		# double-double quotes ("") go to single quote
	s/\\[rn]/ /g;		# "\r" and "\n" go to spaces
	s/\s+/ /g;			# collapse multiple spaces

	s/\"\,\"/\:\:\:/g;	# temporarily hide commas between fields
	s/\,//g;			# get rid of remaining commas (inside fields)
	s/\:\:\:/\"\,\"/g;	# restore commas between fields

	print "$_\n";
}

Open in new window

0
 
LVL 16

Expert Comment

by:sjklein42
ID: 34972723
This version should split your multi-note lines

while ( <> )
{
	s/[\r\n]//g;		# trim

	s/\"\"/\'/g;		# double-double quotes ("") go to single quote
	s/\\[rn]/ /g;		# "\r" and "\n" go to spaces
	s/\s+/ /g;			# collapse multiple spaces

	s/\"\,\"/\:\:\:/g;	# temporarily hide commas between fields
	s/\,//g;			# get rid of remaining commas (inside fields)
	s/\:\:\:/\"\,\"/g;	# restore commas between fields


	s/([A-Z][a-z]+ [0-9]+ [0-9]+ [0-9]+:[0-9]+)/\n$1/g;
	@x = split(/\n/);
	$c = shift(@x);

	foreach $x (@x) { print "$c$x\n"; }
}

Open in new window

0
Announcing the Most Valuable Experts of 2016

MVEs are more concerned with the satisfaction of those they help than with the considerable points they can earn. They are the types of people you feel privileged to call colleagues. Join us in honoring this amazing group of Experts.

 
LVL 2

Author Comment

by:mikedgibson
ID: 34974261
Sorry, they are indeed newlines and not literally \r\n. I should have been more clear. I will test the script you provided.
0
 
LVL 16

Expert Comment

by:sjklein42
ID: 34974344
The script as written treated them literally as \r\n.  I will have to change that.

Can you please post as an attachment a real test file.
0
 
LVL 16

Expert Comment

by:sjklein42
ID: 34974426
This version should handle input records that are split across multiple lines (embedded newlines):

while ( <> )
{
	s/[\r\n]//g;		# trim

	if ( $_ ne '' )		# ignore blank lines
	{
		while ( ! ( /\"$/ ) )	# if the last character is not a double quote...
		{
			$_ .= ' ' . <>;		# grab another line
			s/[\r\n]//g;		# trim again
		}

		s/\"\"/\'/g;		# double-double quotes ("") go to single quote
		s/\\[rn]/ /g;		# "\r" and "\n" go to spaces
		s/\s+/ /g;			# collapse multiple spaces

		s/\"\,\"/\:\:\:/g;	# temporarily hide commas between fields
		s/\,//g;			# get rid of remaining commas (inside fields)
		s/\:\:\:/\"\,\"/g;	# restore commas between fields


		s/([A-Z][a-z]+ [0-9]+ [0-9]+ [0-9]+:[0-9]+)/\n$1/g;
		@x = split(/\n/);
		$c = shift(@x);

		foreach $x (@x) { print "$c$x\n"; }
	}
}

Open in new window

0
 
LVL 16

Expert Comment

by:sjklein42
ID: 34974441
I guess it doesn't need to check for literal \r and \n anymore.  Took out line 14.   ;)

while ( <> )
{
	s/[\r\n]//g;		# trim

	if ( $_ ne '' )		# ignore blank lines
	{
		while ( ! ( /\"$/ ) )	# if the last character is not a double quote...
		{
			$_ .= ' ' . <>;		# grab another line
			s/[\r\n]//g;		# trim again
		}

		s/\"\"/\'/g;		# double-double quotes ("") go to single quote
		s/\s+/ /g;			# collapse multiple spaces

		s/\"\,\"/\:\:\:/g;	# temporarily hide commas between fields
		s/\,//g;			# get rid of remaining commas (inside fields)
		s/\:\:\:/\"\,\"/g;	# restore commas between fields


		s/([A-Z][a-z]+ [0-9]+ [0-9]+ [0-9]+:[0-9]+)/\n$1/g;
		@x = split(/\n/);
		$c = shift(@x);

		foreach $x (@x) { print "$c$x\n"; }
	}
}

Open in new window

0
 
LVL 2

Author Comment

by:mikedgibson
ID: 34974731
OK. In the first script the following seem to work

1) Replacement of double double quotes with single quotes
2) Collapsing multiple spaces
3) Getting rid of commas

What doesn't seem to be working still is the conversion of \r \n to spaces. Is this because it isn't literally \r\n? Could we do something more generic there and simply convert all non printable chars (00-1F) to a # symbol? I tried this but doesn't seem to work

s/\\x00-\x1F]/#/g;

Mike
0
 
LVL 2

Author Comment

by:mikedgibson
ID: 34974736
You are updating faster than I am. I can't post a real file because there is confidential data included.
0
 
LVL 16

Expert Comment

by:sjklein42
ID: 34974801
This should handle all flavors of \r and \n, literal or not.

Understand about the data, but if this doesn't work, how about taking your example from the first posting,  making that into a real data file and posting that.

while ( <> )
{
	s/[\r\n]//g;		# trim

	if ( $_ ne '' )		# ignore blank lines
	{
		while ( ! ( /\"$/ ) )	# if the last character is not a double quote...
		{
			$_ .= ' ' . <>;		# grab another line
			s/[\r\n]//g;		# trim again
		}

		s/\"\"/\'/g;		# double-double quotes ("") go to single quote
		s/\\[rn]/ /g;		# "\r" and "\n" go to spaces
		s/[\r\n]/ /g;		# \r and \n go to spaces
		s/\s+/ /g;			# collapse multiple spaces

		s/\"\,\"/\:\:\:/g;	# temporarily hide commas between fields
		s/\,//g;			# get rid of remaining commas (inside fields)
		s/\:\:\:/\"\,\"/g;	# restore commas between fields


		s/([A-Z][a-z]+ [0-9]+ [0-9]+ [0-9]+:[0-9]+)/\n$1/g;
		@x = split(/\n/);
		$c = shift(@x);

		foreach $x (@x) { print "$c$x\n"; }
	}
}

Open in new window

0
 
LVL 2

Author Comment

by:mikedgibson
ID: 34974821
Also, it appears as though the contents of the notes themselves have dates that match the regex you were using to create a new entry so that most likely won't work. That was a nice to have anyway.
0
 
LVL 2

Author Comment

by:mikedgibson
ID: 34974851
Here is a sample data file.
data.csv
0
 
LVL 2

Author Comment

by:mikedgibson
ID: 34974863
Unless you can think of some innovative way to determine if a data is the start of a new note or just a date within the text I think we'll need to drop that part.
0
 
LVL 16

Expert Comment

by:sjklein42
ID: 34975012
Ok.  This version does not try to split on dates.

It also ignores the header line.

I might work on it some more later.  I think we can use the fact that there is always a line break before the subrecords (yes?).  That plus the subrecord starts with a date should be enough to go on.

<>;		# ignore header line

while ( <> )
{
	s/[\r\n]+$//;		# trim

	if ( $_ ne '' )		# ignore blank lines
	{
		while ( ! ( /\"$/ ) )	# if the last character is not a double quote...
		{
			$_ .= ' ' . <>;		# grab another line
			s/[\r\n]+$//;		# trim again
		}

		s/\"\"/\'/g;		# double-double quotes ("") go to single quote
		s/\s+/ /g;			# collapse multiple spaces

		s/\"\,\"/\:\:\:/g;	# temporarily hide commas between fields
		s/\,//g;			# get rid of remaining commas (inside fields)
		s/\:\:\:/\"\,\"/g;	# restore commas between fields

		print "$_\n";

		## s/([A-Z][a-z]+ [0-9]+ [0-9]+ [0-9]+:[0-9]+)/\n$1/g;
		## @x = split(/\n/);
		## $c = shift(@x);
		## foreach $x (@x) { print "$c$x\n"; }
	}
}

Open in new window

0
 
LVL 2

Author Comment

by:mikedgibson
ID: 34975061
Hmm that's a good point .. The sub records always have the date at the beginning of a new line so we could use that to identify a sub record, as long as we haven't already removed the \r\n before looking for the subrecords.
0
 
LVL 2

Author Comment

by:mikedgibson
ID: 34975082
I tried the latest script you sent. It just hangs and doesn't return anything.
0
 
LVL 16

Expert Comment

by:sjklein42
ID: 34975129
There may still be a problem with blank lines.

I'll add a fix for it.

I'll get back to you kn a few hours - dinner time.
0
 
LVL 2

Author Comment

by:mikedgibson
ID: 34975138
No rush. Just need this in the new day or so.
0
 
LVL 16

Expert Comment

by:sjklein42
ID: 34976035
<>;		# ignore header line

while ( <> )
{
	s/[\r\n]+$//;		# trim

	if ( $_ ne '' )		# ignore blank lines
	{
		while ( ! ( /\"$/ ) )	# if the last character is not a double quote...
		{
			$next = <>;		# grab another line
			if ( $next eq '' ) { last; }	# end-of-file

			# insert "special" indicator before subrecords starting with a date
			if ( $next =~ /[A-Z][a-z]+ [0-9]+\, [0-9]+ [0-9]+\:[0-9]+/ ) { $next = '||"' . $next; }

			$_ .= ' ' . $next;

			s/[\r\n]+$//;		# trim again
		}

		s/\"\"/\'/g;		# double-double quotes ("") go to single quote

		s/\"\,\"/\:\:\:/g;	# temporarily hide commas between fields
		s/\,//g;			# get rid of remaining commas (inside fields)
		s/\:\:\:/\"\,\"/g;	# restore commas between fields

		##print "$_\n";

		($c,$b) =  split(/\,/);
		@x = split(/\|\|/,$b);
		foreach $x (@x) { print "$c\,$x\n"; }
	}
}

Open in new window


Using your input produced this output:

c:\temp>perl foo.pl foo.csv
"Company A","September 17 2010 3:28 PM This is the contents of the note. They can be multiline and contain 'double quotes'.
"Company A","March 23 2009 8:57 AM There are also multiple notes within the same field.
"Company A","July 2 2010 6:56 AM they can also contain dates inside of them such as Sent: Friday September 17 2010 3:24 PM"
"Company B","March 17 2004 10:55 AM This is the second company's notes. Also these notes sometimes contain commas within the text field."

Open in new window

0
 
LVL 2

Author Comment

by:mikedgibson
ID: 34976357
This is great, I just ran it against my real data and it was very close. Just one additional request if possible. There are lines in the file like the following that don't indicate a new sub record and are simply a part of the previous record but they are matching the sub record regex. Can the sub record regex be improved to not match when Sent: is at the start of the line?

Sent: Wednesday August 11 2010 11:45 AM

The regex
0
 
LVL 16

Expert Comment

by:sjklein42
ID: 34976410
Glad we're getting there, but I'm confused about your comment.

The test data we're using already has a "Sent:  date" in the company A record.  Am I handling it the wrong way? How so?  Or are you describing a different case?  Please give me a complete input test record and desired output so I can see what you're talking about.
0
 
LVL 2

Author Comment

by:mikedgibson
ID: 34976497
Sorry, this is getting confusing to explain. It is a new case that I didn't realize was in my data so I didn't include it in the sample. I am attaching a new data file with this sample. Basically there is an email within the Notes field that shouldn't be treated as a sub record but your RegEx that finds sub records is matching on the Sent: field in the email.
data.csv
0
 
LVL 16

Expert Comment

by:sjklein42
ID: 34976539
That was easy.  It was really a bug in my code.  Missing ^ on line 15.  Try this

<>;		# ignore header line

while ( <> )
{
	s/[\r\n]+$//;		# trim

	if ( $_ ne '' )		# ignore blank lines
	{
		while ( ! ( /\"$/ ) )	# if the last character is not a double quote...
		{
			$next = <>;		# grab another line
			if ( $next eq '' ) { last; }	# end-of-file

			# insert "special" indicator before subrecords starting with a date
			if ( $next =~ /^[A-Z][a-z]+ [0-9]+\, [0-9]+ [0-9]+\:[0-9]+/ ) { $next = '||"' . $next; }

			$_ .= ' ' . $next;

			s/[\r\n]+$//;		# trim again
		}

		s/\"\"/\'/g;		# double-double quotes ("") go to single quote

		s/\"\,\"/\:\:\:/g;	# temporarily hide commas between fields
		s/\,//g;			# get rid of remaining commas (inside fields)
		s/\:\:\:/\"\,\"/g;	# restore commas between fields

		##print "$_\n";

		($c,$b) =  split(/\,/);
		@x = split(/\|\|/,$b);
		foreach $x (@x) { print "$c\,$x\n"; }
	}
}

Open in new window

0
 
LVL 16

Accepted Solution

by:
sjklein42 earned 500 total points
ID: 34976587
And one more change to fix a problem with missing double quotes at the end of some output lines.

<>;		# ignore header line

while ( <> )
{
	s/[\r\n]+$//;		# trim

	if ( $_ ne '' )		# ignore blank lines
	{
		while ( ! ( /\"$/ ) )	# if the last character is not a double quote...
		{
			$next = <>;		# grab another line
			if ( $next eq '' ) { last; }	# end-of-file

			# insert "special" indicator before subrecords starting with a date
			if ( $next =~ /^[A-Z][a-z]+ [0-9]+\, [0-9]+ [0-9]+\:[0-9]+/ ) { $next = '"||"' . $next; }

			$_ .= ' ' . $next;

			s/[\r\n]+$//;		# trim again
		}

		s/\"\"/\'/g;		# double-double quotes ("") go to single quote

		s/\"\,\"/\:\:\:/g;	# temporarily hide commas between fields
		s/\,//g;			# get rid of remaining commas (inside fields)
		s/\:\:\:/\"\,\"/g;	# restore commas between fields


		($c,$b) =  split(/\,/);
		@x = split(/\|\|/,$b);
		foreach $x (@x)
		{
			$x =~ s/\s+\"$/\"/;
			print "$c\,$x\n";
		}
	}
}

Open in new window

0
 
LVL 2

Author Closing Comment

by:mikedgibson
ID: 34976746
Nailed it. This is perfect! Thank you very much for your assistance.
0
 
LVL 16

Expert Comment

by:sjklein42
ID: 34976769
A pleasure.  Some people do crosswords, I like PERL scripts.  :)
0
 
LVL 2

Author Comment

by:mikedgibson
ID: 34988849
sjklein42, sorry to add a comment after this has been closed but I found one small issue with the script. For notes records that end in "" the script converts it to ' but no double quote is added to the end. I added a new sample data file with this scenario. Do you think you could take a quick look?
data.csv
0
 
LVL 16

Expert Comment

by:sjklein42
ID: 34989058
This should work better.

<>;     # ignore header line

while ( <> )
{
    s/[\r\n]+$//;       # trim

    if ( $_ ne '' )     # ignore blank lines
    {
        while ( 1 )
        {
            s/\"\"/\'/g;            # double-double quotes ("") go to single quote
            if ( /\"$/ ) { last; }  # if the last character is a double quote, we have a complete record

            $next = <>;             # grab another line
            if ( $next eq '' ) { last; }    # end-of-file

            # insert "special" indicator before subrecords starting with a date
            if ( $next =~ /^[A-Z][a-z]+ [0-9]+\, [0-9]+ [0-9]+\:[0-9]+/ ) { $next = '"||"' . $next; }

            $_ .= ' ' . $next;

            s/[\r\n]+$//;       # trim again
        }

        s/\"\,\"/\:\:\:/g;  # temporarily hide commas between fields
        s/\,//g;            # get rid of remaining commas (inside fields)
        s/\:\:\:/\"\,\"/g;  # restore commas between fields

        ($c,$b) =  split(/\,/);
        @x = split(/\|\|/,$b);
        foreach $x (@x)
        {
            $x =~ s/\s+\"$/\"/;
            print "$c\,$x\n";
        }
    }
}

Open in new window

0

Featured Post

Free Tool: Site Down Detector

Helpful to verify reports of your own downtime, or to double check a downed website you are trying to access.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Title # Comments Views Activity
perl script help 12 105
Reading fields from the text file. 4 101
unable to remove invisible characters with perl 2 91
what are these modules in perl script 4 103
On Microsoft Windows, if  when you click or type the name of a .pl file, you get an error "is not recognized as an internal or external command, operable program or batch file", then this means you do not have the .pl file extension associated with …
Checking the Alert Log in AWS RDS Oracle can be a pain through their user interface.  I made a script to download the Alert Log, look for errors, and email me the trace files.  In this article I'll describe what I did and share my script.
Explain concepts important to validation of email addresses with regular expressions. Applies to most languages/tools that uses regular expressions. Consider email address RFCs: Look at HTML5 form input element (with type=email) regex pattern: T…

829 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question