• Status: Solved

remove line break to create a string

hi, I have a text file that has user submitted data in the following format:

line11;line12;line13

line21;line22;line23

liine31;line32;line33

basically all the lines have some data that are separated by semicolon so that I can import them into Excel. However, it turns out that some of the lines are of the following format:

linex1;linex2
;linex3;

linex4;linex5

i.e. there can be multiple blank lines in between the 'components' of line x.

I want a perl script that reads the file and combines the 'components' of lines that resemble line x to what lines 1, 2 and 3 look like. i.e.
linex1;linex2;linex3;linex4

so at the end, my file should look like following:

line11;line12;line13

line21;line22;line23

liine31;line32;line33
.
.
.
linex1;linex2;linex3
Asked by: IUAATech
 
tel2Commented:
Hi IUAATech,

Not the best or most concise solution, but try this:

perl -pe "s/^$/<EOL>/;s/\n//;s/<EOL>/\n/" infile.txt >outfile.txt

If running Perl under UNIX, change the " to '.

Assuming infile.txt contains:
line11;line12;line13

line21;line22;line23

liine31;line32;line33

linex1;linex2
;linex3;

linex4;linex5

Then outfile.txt should receive this:
line11;line12;line13
line21;line22;line23
liine31;line32;line33
linex1;linex2;linex3;
linex4;linex5

That's what I got when I ran it.  Is that what you need?
NOTE: The text "<EOL>" can be anything you expect never to be found in the input file.  If it is found, my method will fail (part of the reason I say this solution is not the best).
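One way to avoid the placeholder text altogether is Perl's paragraph mode, where setting $/ to the empty string makes each read return a whole blank-line-delimited block (the same idea as the -00 switch). The following is only a sketch under that assumption: the helper name join_blocks is made up for illustration, and it relies on the blank lines being truly empty (no stray spaces).

```perl
use strict;
use warnings;

# Hypothetical helper: join the lines of each blank-line-delimited block.
sub join_blocks {
    my ($text) = @_;
    my @out;
    # Read from the string as if it were a file.
    open my $fh, '<', \$text or die "cannot open in-memory file: $!";
    local $/ = "";                 # paragraph mode: one read per block
    while (my $block = <$fh>) {
        $block =~ s/\n//g;         # drop every line break inside the block
        push @out, $block;
    }
    close $fh;
    return join("\n", @out) . "\n";
}

print join_blocks("line11;line12;line13\n\nlinex1;linex2\n;linex3;\n\nlinex4;linex5\n");
# prints:
# line11;line12;line13
# linex1;linex2;linex3;
# linex4;linex5
```

Because no marker string is inserted, nothing in the input can collide with it.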
 
tel2Commented:
PS: If you want to remove the redundant? ";"s at the end of the lines, you can do this:
perl -pe "s/^$/<EOL>/;s/\n//;s/<EOL>/\n/;s/;+$//" infile.txt >outfile.txt

If you need an explanation of any of the above, let me know.
 
IUAATechAuthor Commented:
I am using mac OS X.

I get "Illegal division by zero at -e line 1, <> line 1"

actually, I want the contents of a particular line to be on the same line. i.e. line x's content should look like:
linex1;linex2;linex3;linex4;linex5

and I want the extra space in between the lines.
 
tel2Commented:
1. Re the error message, perhaps Mac is like UNIX.  Try ' instead of " (or vice versa if that's what you were doing).  Let me know what happens with this.

2. Are you now saying that instead of records being delimited by blank lines, they are delimited by changes in the text (ie: "line", "linex", etc) before the number in each field?
 
ozoCommented:
linex1;linex2
;linex3;

linex4;linex5

i.e. there can be multiple blank lines in between the 'components' of line x.

I want a perl script that reads the file and combines the 'components' of lines that resemble line x to what lines 1, 2 and 3 look like. i.e.
linex1;linex2;linex3;linex4

How did you decide that it should be
linex1;linex2;linex3;linex4
rather than
linex1;linex2;linex3;linex4;linex5
or
linex1;linex2;linex3;
linex4;linex5
or
linex1;linex2;
linex3;linex4;linex5
?
 
ArdemusCommented:
How about:

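my @lines = split /(?<=[^;])\n/, $file;  # split at line breaks that do not follow a semicolon

foreach (@lines) {
  s/\n//g;  # remove the line breaks left inside each piece
}

$file = join "\n", @lines;
print $file;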

I haven't tested it yet, but maybe it'll give you an idea.  I'll give it a shot and post changes if needed.

Ardemus
 
ArdemusCommented:
Sokay, I've played around with it a bit.  I was overlooking a detail about your data.  Since it's semicolon delimited data I just remove the whitespace around the semicolons and then cleanup at the end by clearing out any extra white space.

#! Perl
use strict;
use warnings;
my $file;

while (<>) {$file .= $_}
$file =~ s/;\s*/;/g;  # Remove whitespace after a semicolon
$file =~ s/\s*;/;/g;  # Remove whitespace before a semicolon
$file =~ s/\s+/\n/g;  # Reduce any other whitespace to a single carriage return

print $file;

It works on this data:
------------------------------------
linex1;linex2
;linex3;

linex4;linex5
linex1;
linex2;
linex3;


linex4;linex5

linex1;
linex2
;linex3;

linex4;linex5
 
tel2Commented:
Using the basis of the brilliant Ardemus technique and input data...
  perl -0 -pe "s/\s*;\s*/;/g;s/\s+/\n/g;s/\n+/\n\n/g" infile.txt >outfile.txt
gives this output:
  linex1;linex2;linex3;linex4;linex5

  linex1;linex2;linex3;linex4;linex5

  linex1;linex2;linex3;linex4;linex5

BTW: How's that division by zero error message?  If you still get it with the above, try ' instead of ", and report back.
 
IUAATechAuthor Commented:
tel2, actually when I try with double quotes, I get "Illegal variable name" and when I try with a single quote, I get
syntax error at -e line 1, near "EOL>"
Backslash found where operator expected at -e line 1, near "/;s/\"
        (Missing operator before \?)
Execution of -e aborted due to compilation errors.

hmm.....
 
IUAATechAuthor Commented:
sorry guys, I guess I should have explained better.
Basically, the data has fields separated by a semicolon and each field can have more than one word. So you can have something like

line11-word1 line11-word2 line11-word3; line12; line13-word1 line13-word2;

So the code above doesn't quite work since I guess you were assuming that I have just one word in each of my field. Hope this makes sense........

 
ArdemusCommented:
Try using the file I posted above (things can be pretty particular on the command line, files tend to be fairly standard).  You might have to put your perl path in the bang at the top instead of just perl (I'm on windows).

 
ArdemusCommented:
Oh, then change the last cleanup line to this:

$file =~ s/(\s)+/$1/g;  # Collapse each run of whitespace to a single whitespace character

It will preserve your spaces and tabs, just reduce them all to a single rather than multiples.  Or you can remove the line entirely.
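A small illustration of what that substitution does, as a sketch on a made-up sample string: a quantified capture group keeps only its last repetition, so each run of whitespace collapses to whatever its final character was. A run that ends in a line break therefore stays a line break, which means embedded line breaks are not removed by this step.

```perl
use strict;
use warnings;

my $s = "one  two \n\nthree";       # sample text, invented for this example
(my $t = $s) =~ s/(\s)+/$1/g;       # $1 holds the LAST whitespace character of each run

print "$t\n";
# prints:
# one two
# three
```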
 
IUAATechAuthor Commented:
lets say the test file has something like the following:
---------------------------------------------------------
myusername;myclass;There are many times when a professors work goes un-noticed, and I feel that this is the case with Mrs. X. She teaches an  math course (among others) where Jr's and Sr's learn how to create plans for a retail season. I do not think that some students recognize the importance of what they are learning and why she pushes us so hard to do well.
Mrs. Paul deserves this award because of her hard work as a professor and dedication outside of the classrom with the XXX Organization (AMO). This award is a symbol of gratitude for all she does and has done, I could never express how grateful I am for her positive influences, but I feel that I can do this besides the simple "Thank You's".

She deserves to be recognized for what she does, by the people she influences the most!!! ; yes; Mon Feb 21 16:32:58 2005
----------------------------------------------------------

your code doesn't quite merge all this text into a single line.

Here is the code I used:

#! Perl
use strict;
use warnings;
my $file;

while (<>) {$file .= $_}
$file =~ s/;\s*/;/g;  # Remove whitespace after a semicolon
$file =~ s/\s*;/;/g;  # Remove whitespace before a semicolon
$file =~ s/(\s)+/$1/g;  # Collapse each run of whitespace to a single whitespace character

print $file;

--------------------------

tried it using:
perl script.pl infile.txt > outfile.txt
 
ArdemusCommented:
Ah, you need to provide representative data from the start or we're just spinning our wheels trying to help you.  Ok, so you want to preserve internal spaces but eliminate internal carriage returns?  How does one know the difference between an embedded carriage return and one that separates records?  I see no way to distinguish between:

Line11;Line12
line12b;line13

And

line 11;line12
line 21;line22

You have to figure out what signals a new record.  The only thing I could see in your original data was a carriage return without a related semi-colon.  Now carriage returns without semi-colons are allowed within the data...  Does a record always end with a date?  Does it always start with a valid user name?  Start with that to break up the records, then you can think about correcting the data within a record.
 
IUAATechAuthor Commented:
Exactly! I want to eliminate the internal carriage returns. I see what you mean. There is in fact no way to differentiate between an embedded carriage return and the one that separates records. I guess we will have to make use of the date field.

Each record starts with a username which is just one word and ends with the date field. Can you use this information to modify your code?
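As a rough sketch of the "ends with the date field" idea: keep appending lines to the current record until a line ends with something shaped like the date in the sample record. The date pattern and the helper name join_until_date are assumptions for illustration, and a line of free text that happens to end in such a date would close the record too early.

```perl
use strict;
use warnings;

# Assumed date shape, taken from the sample record: "Mon Feb 21 16:32:58 2005"
my $date_re = qr/\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4}\s*$/;

# Hypothetical helper: glue lines together until one ends with a date.
sub join_until_date {
    my (@lines) = @_;
    my (@records, $current);
    for my $line (@lines) {
        chomp $line;
        next if $line =~ /^\s*$/;                     # skip blank lines
        $current = defined $current ? "$current $line" : $line;
        if ($line =~ $date_re) {                      # the record is complete
            $current =~ s/\s+/ /g;                    # squeeze runs of whitespace
            push @records, $current;
            undef $current;
        }
    }
    push @records, $current if defined $current;      # keep any unfinished tail
    return @records;
}

my @demo = join_until_date(
    "user1;class1;some text\n",
    "more text\n",
    "\n",
    "the end ; yes; Mon Feb 21 16:32:58 2005\n",
);
print "$_\n" for @demo;
# prints:
# user1;class1;some text more text the end ; yes; Mon Feb 21 16:32:58 2005
```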
 
ArdemusCommented:
It looks like there are a fixed number of fields, you could do something like this:

my $outfile;
while ($file =~ m/([^;]*;)([^;]*;)([^;]*;)([^;]*;)([^\n]*)\n){
  my $outline = join ("", ($1, $2, $3, $4, $5));
  $outline =~ s/\s*/ /g;
  $outfile .= "$outline\n";
}

print $outfile;
 
IUAATechAuthor Commented:
I have nine fields for each line. So I guess my code should be:

#!/usr/bin/perl

my $outfile;
while ($file =~ m/([^;]*;)([^;]*;)([^;]*;)([^;]*;)([^;]*;)([^;]*;)([^;]*;)([^;]*;)([^\n]*)\n){
    my $outline = join ("", ($1, $2, $3, $4, $5, $6, $7, $8, $9));
    $outline =~ s/\s*/ /g;
    $outfile .= "$outline\n";
}

print $outfile;

However, I get an error message when I try to run the code:
Backslash found where operator expected at removeBreaks.pl line 6, near "$outline =~ s/\"
  (Might be a runaway multi-line // string starting on line 4)
        (Missing operator before \?)
syntax error at removeBreaks.pl line 6, near "$outline =~ s/\"
Substitution pattern not terminated at removeBreaks.pl line 6.

sorry, I am trying to learn Perl :)
 
ArdemusCommented:
That's because I forgot to close the first regex.  Here's a quick rundown of what this code is trying to do:

-Always use strict and warnings.  They prevent and catch many errors that you might miss, particularly while you're learning.
-while (<>) {} is a special construct in Perl.  It reads the files named on the command line (or standard input if none are given) and feeds you their contents one line at a time.
-After that, $file contains the entire file as one string.
-The regex is pretty simple, it matches anything that's not a semicolon, then a semicolon.  That repeats for the number of fields in question.  Actually, this is probably better:

m/((?:[^;]*;){8,8}[^\n]*)\n/

The (?:) is a non-capturing parenthesis, so it doesn't fill up the match variables.  {8,8} says to match that at least 8 times, but not more than 8 times.  Then we finish with the final chunk before the closing line break.
Now $1 contains the entire line, so you can put it into outline and replace all sets of whitespace with single spaces.
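To see the quantified non-capturing group at work on something small, here is a sketch with three fields per record instead of nine, so the count is {2} rather than {8,8}. The sample text is invented for the example, and the /g flag in list context is used only to collect every match at once.

```perl
use strict;
use warnings;

# Two records of three fields; the first has a line break inside a field.
my $text = "a;b\nb2;c\nd;e;f\n";

# Two "field;" chunks (which may span lines), then the last field up to the line break.
my @records = $text =~ m/((?:[^;]*;){2}[^\n]*)\n/g;
s/\s+/ /g for @records;            # turn embedded line breaks into single spaces

print "$_\n" for @records;
# prints:
# a;b b2;c
# d;e;f
```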

This is the new code, with that error corrected:

#!/usr/bin/perl

my $outfile;
while ($file =~ m/((?:[^;]*;){8,8}[^\n]*)\n/){
    my $outline = $1;
    $outline =~ s/\s*/ /g;
    $outfile .= "$outline\n";
}

Now, if you get semicolons in your data too, well, you're on your own. ;)

Also, I would recommend "Learning Perl", from O'Reilly, that's how I started.  I also read "Mastering Regular Expressions", and I find the ActiveState perl documentation very useful at www.activestate.com

Ardemus
 
ArdemusCommented:
I should mention that I haven't used matching like this in a while, and you might need to tweak the code, perhaps with a flag on that first regex (see perlre for details on the flags).
 
IUAATechAuthor Commented:
thanks for the advice!
the regex makes perfect sense. Do I need to 'chomp'  $file when I am getting it from the file before I append it to existing $file? I think $outline =~ s/\s*/ /g should be $outline =~ s/\s+/ /g

Here is my final code:
#!/usr/bin/perl

use strict;
use warnings;

my $file;
my $outfile;

while (<>) {$file .= $_}

while ($file =~ m/((?:[^;]*;){8,8}[^\n]*)\n/){
    my $outline = $1;
    $outline =~ s/\s+/ /g;
    $outfile .= "$outline\n";
}
    print $outfile;

when I do perl script.pl infile.txt > outfile.txt it goes into an infinite loop :(
 
ArdemusCommented:
I don't have time to continue helping you debug today, but I'll try to put you on the right track.

To answer your question about Chomp: take a moment to think about what would happen if you remove the newline from the end of each line in the original file.

Regarding the infinite loop, the m// is performing the same match over and over.  You'll have to look into perlre or do a web search to figure out how to use a match within a while statement.  There's something minor that I'm forgetting, but you should be able to figure it out with some effort.  Perhaps it needs to be m//g.  In any case, it's the while statement that's causing the loop because it's never turning false.
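For what it's worth, a minimal sketch of how m//g behaves in a while condition, on an invented two-line string: in scalar context each successful match resumes where the previous one stopped (the offset is available through pos), so the condition eventually fails and the loop ends. Without the g, every test starts again from the beginning of the string.

```perl
use strict;
use warnings;

my $file = "a;b\nc;d\n";
my @seen;

# With /g in scalar context, each match resumes at pos($file), so the loop terminates.
while ($file =~ m/([^\n]*)\n/g) {
    push @seen, "$1 (next search starts at offset " . pos($file) . ")";
}

print "$_\n" for @seen;
# prints:
# a;b (next search starts at offset 4)
# c;d (next search starts at offset 8)
```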

I'm sure that you can get this going with a little effort and some poking around, or maybe someone else can help out.

Good luck,

Nick
 
ozoCommented:
while( $file =~ m/((?:[^;]*;){8,8}[^\n]*)\n/g ){
    my $outline = $1;
    $outline =~ s/\s+/ /g;
    print "$outline\n";
}
 
IUAATechAuthor Commented:
perfect! thanks.