Solved

remove line break to create a string

Posted on 2005-03-17
Last Modified: 2008-01-09
hi, I have a text file that has user submitted data in the following format:

line11;line12;line13

line21;line22;line23

liine31;line32;line33

Basically, every line holds fields separated by semicolons so that I can import them into Excel. However, it turns out that some of the lines are broken up like this:

linex1;linex2
;linex3;

linex4;linex5

i.e. there can be multiple blank lines in between the 'components' of line x.

I want a perl script that reads the file and combines the 'components' of lines that resemble line x to what lines 1, 2 and 3 look like. i.e.
linex1;linex2;linex3;linex4

so at the end, my file should look like following:

line11;line12;line13

line21;line22;line23

liine31;line32;line33
.
.
.
linex1;linex2;linex3
Question by:IUAATech
 
LVL 12

Expert Comment

by:tel2
ID: 13569715
Hi IUAATech,

Not the best or most concise solution, but try this:

perl -pe "s/^$/<EOL>/;s/\n//;s/<EOL>/\n/" infile.txt >outfile.txt

If running Perl under UNIX, change the " to '.

Assuming infile.txt contains:
line11;line12;line13

line21;line22;line23

liine31;line32;line33

linex1;linex2
;linex3;

linex4;linex5

Then outfile.txt should receive this:
line11;line12;line13
line21;line22;line23
liine31;line32;line33
linex1;linex2;linex3;
linex4;linex5

That's what I got when I ran it.  Is that what you need?
NOTE: The text "<EOL>" can be anything you expect never to be found in the input file.  If it is found, my method will fail (part of the reason I say this solution is not the best).
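The placeholder can be avoided by splitting on the blank lines directly. The following is only a minimal sketch of that idea, not the one-liner above; `join_blocks` is a hypothetical helper name, and it makes the same assumption as the one-liner, namely that blank lines are what separate records.

```perl
use strict;
use warnings;

# Hypothetical helper: treat every blank-line-delimited block as one record
# and remove the line breaks inside it.
sub join_blocks {
    my ($text) = @_;
    my @records;
    for my $block (split /\n{2,}/, $text) {
        $block =~ s/\n//g;                  # drop line breaks inside the block
        push @records, $block if length $block;
    }
    return join("\n", @records) . "\n";
}

# Sample taken from the question
print join_blocks("line11;line12;line13\n\nlinex1;linex2\n;linex3;\n\nlinex4;linex5\n");
```

Because no marker text is written into the data, nothing in the input can collide with it.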
 
LVL 12

Expert Comment

by:tel2
ID: 13569733
PS: If you want to remove the redundant? ";"s at the end of the lines, you can do this:
perl -pe "s/^$/<EOL>/;s/\n//;s/<EOL>/\n/;s/;+$//" infile.txt >outfile.txt

If you need an explanation of any of the above, let me know.
 

Author Comment

by:IUAATech
ID: 13569788
I am using mac OS X.

I get "Illegal division by zero at -e line 1, <> line 1"

actually, I want the contents of a particular line to be on the same line. i.e. line x's content should look like:
linex1;linex2;linex3;linex4;linex5

and I want the extra space in between the lines.
 
LVL 12

Expert Comment

by:tel2
ID: 13570155
1. Re the error message, perhaps Mac is like UNIX.  Try ' instead of " (or vice versa if that's what you were doing).  Let me know what happens with this.

2. Are you now saying that instead of records being delimited by blank lines, they are delimited by changes in the text (ie: "line", "linex", etc) before the number in each field?
 
LVL 84

Expert Comment

by:ozo
ID: 13570473
linex1;linex2
;linex3;

linex4;linex5

i.e. there can be multiple blank lines in between the 'components' of line x.

I want a perl script that reads the file and combines the 'components' of lines that resemble line x to what lines 1, 2 and 3 look like. i.e.
linex1;linex2;linex3;linex4

How did you decide that it should be
linex1;linex2;linex3;linex4
rather than
linex1;linex2;linex3;linex4;linex5
or
linex1;linex2;linex3;
linex4;linex5
or
linex1;linex2;
linex3;linex4;linex5
?
 

Expert Comment

by:Ardemus
ID: 13572005
How about:

my @lines = split /(?<!;)\n/, $file;  # split at newlines not preceded by a semicolon

for (@lines) {
  s/\n//g;                            # drop any newlines left inside a record
}

$file = join "\n", @lines;
print $file;

I haven't tested it yet, but maybe it'll give you an idea.  I'll give it a shot and post changes if needed.

Ardemus
 

Expert Comment

by:Ardemus
ID: 13572072
Okay, I've played around with it a bit.  I was overlooking a detail about your data.  Since it's semicolon-delimited data, I just remove the whitespace around the semicolons and then clean up at the end by clearing out any extra whitespace.

#! Perl
use strict;
use warnings;
my $file;

while (<>) {$file .= $_}
$file =~ s/;\s*/;/g;  # Remove whitespace after a semicolon
$file =~ s/\s*;/;/g;  # Remove whitespace before a semicolon
$file =~ s/\s+/\n/g;  # Reduce any other whitespace to a single newline

print $file;

It works on this data:
------------------------------------
linex1;linex2
;linex3;

linex4;linex5
linex1;
linex2;
linex3;


linex4;linex5

linex1;
linex2
;linex3;

linex4;linex5
 
LVL 12

Expert Comment

by:tel2
ID: 13572384
Using the basis of the brilliant Ardemus technique and input data...
  perl -0 -pe "s/\s*;\s*/;/g;s/\s+/\n/g;s/\n+/\n\n/g" infile.txt >outfile.txt
gives this output:
  linex1;linex2;linex3;linex4;linex5

  linex1;linex2;linex3;linex4;linex5

  linex1;linex2;linex3;linex4;linex5

BTW: How's that division by zero error message?  If you still get it with the above, try ' instead of ", and report back.
 

Author Comment

by:IUAATech
ID: 13575229
tel2, actually when I try with double quotes, I get "Illegal variable name" and when I try with a single quote, I get
syntax error at -e line 1, near "EOL>"
Backslash found where operator expected at -e line 1, near "/;s/\"
        (Missing operator before \?)
Execution of -e aborted due to compilation errors.

hmm.....
 

Author Comment

by:IUAATech
ID: 13575284
sorry guys, I guess I should have explained better.
Basically, the data has fields separated by a semicolon and each field can have more than one word. So you can have something like

line11-word1 line11-word2 line11-word3; line12; line13-word1 line13-word2;

So the code above doesn't quite work, since I guess you were assuming that I have just one word in each of my fields. Hope this makes sense.

 

Expert Comment

by:Ardemus
ID: 13575316
Try using the file I posted above (things can be pretty particular on the command line, files tend to be fairly standard).  You might have to put your perl path in the bang at the top instead of just perl (I'm on windows).

 

Expert Comment

by:Ardemus
ID: 13575363
Oh, then change the last cleanup line to this:

$file =~ s/(\s)+/$1/g;  # Collapse each run of whitespace to one character (the last in the run)

It will preserve your spaces and tabs, reducing each run of whitespace to a single character instead of several.  Or you can remove the line entirely.
 

Author Comment

by:IUAATech
ID: 13575505
Let's say the test file has something like the following:
---------------------------------------------------------
myusername;myclass;There are many times when a professors work goes un-noticed, and I feel that this is the case with Mrs. X. She teaches an  math course (among others) where Jr's and Sr's learn how to create plans for a retail season. I do not think that some students recognize the importance of what they are learning and why she pushes us so hard to do well.
Mrs. Paul deserves this award because of her hard work as a professor and dedication outside of the classrom with the XXX Organization (AMO). This award is a symbol of gratitude for all she does and has done, I could never express how grateful I am for her positive influences, but I feel that I can do this besides the simple "Thank You's".

She deserves to be recognized for what she does, by the people she influences the most!!! ; yes; Mon Feb 21 16:32:58 2005
----------------------------------------------------------

your code doesn't quite merge all this text into a single line.

Here is the code I used:

#! Perl
use strict;
use warnings;
my $file;

while (<>) {$file .= $_}
$file =~ s/;\s*/;/g;  # Remove whitespace after a semicolon
$file =~ s/\s*;/;/g;  # Remove whitespace before a semicolon
$file =~ s/(\s)+/$1/g;  # Collapse each run of whitespace to one character (the last in the run)

print $file;

--------------------------

tried it using:
perl script.pl infile.txt > outfile.txt
 

Expert Comment

by:Ardemus
ID: 13575896
Ah, you need to provide representative data from the start or we're just spinning our wheels trying to help you.  Ok, so you want to preserve internal spaces but eliminate internal carriage returns?  How does one know the difference between an embedded carriage return and one that separates records?  I see no way to distinguish between:

Line11;Line12
line12b;line13

And

line 11;line12
line 21;line22

You have to figure out what signals a new record.  The only thing I could see in your original data was a carriage return without a related semi-colon.  Now carriage returns without semi-colons are allowed within the data...  Does a record always end with a date?  Does it always start with a valid user name?  Start with that to break up the records, then you can think about correcting the data within a record.
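If a record does always end with a date, that can serve as the end-of-record signal. The following is a rough sketch under that assumption only: `merge_records` and `$date_re` are hypothetical names, and the timestamp layout is guessed from the one sample record posted above ("Mon Feb 21 16:32:58 2005").

```perl
use strict;
use warnings;

# Assumption: every record ends with a timestamp such as
# "Mon Feb 21 16:32:58 2005" (layout guessed from the posted sample).
my $date_re = qr/\b[A-Z][a-z]{2} [A-Z][a-z]{2} [ \d]\d \d\d:\d\d:\d\d \d{4}\s*$/;

# Hypothetical helper: glue lines together until one ends with a timestamp.
sub merge_records {
    my @lines = @_;
    my @records;
    my $buffer = '';
    for my $line (@lines) {
        chomp $line;
        next if $line =~ /^\s*$/;           # skip blank lines
        $buffer .= ' ' if length $buffer;   # a space replaces the line break
        $buffer .= $line;
        if ($buffer =~ $date_re) {          # timestamp seen: record is complete
            $buffer =~ s/\s+/ /g;
            $buffer =~ s/ $//;
            push @records, $buffer;
            $buffer = '';
        }
    }
    push @records, $buffer if length $buffer;   # keep any unterminated tail
    return @records;
}

print "$_\n" for merge_records(<STDIN>);
```

This would split a record early if a line inside the free-text field happened to end with a timestamp in the same layout, so it is only as reliable as the assumption about the data.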
 

Author Comment

by:IUAATech
ID: 13575949
Exactly! I want to eliminate the internal carriage returns. I see what you mean. There is in fact no way to differentiate between an embedded carriage return and one that separates records. I guess we will have to make use of the date field.

Each record starts with a username which is just one word and ends with the date field. Can you use this information to modify your code?
 

Expert Comment

by:Ardemus
ID: 13576044
It looks like there are a fixed number of fields, you could do something like this:

my $outfile;
while ($file =~ m/([^;]*;)([^;]*;)([^;]*;)([^;]*;)([^\n]*)\n){
  my $outline = join ("", ($1, $2, $3, $4, $5));
  $outline =~ s/\s*/ /g;
  $outfile .= "$outline\n";
}

print $outfile;
 

Author Comment

by:IUAATech
ID: 13576144
I have nine fields for each line. So I guess my code should be:

#!/usr/bin/perl

my $outfile;
while ($file =~ m/([^;]*;)([^;]*;)([^;]*;)([^;]*;)([^;]*;)([^;]*;)([^;]*;)([^;]*;)([^\n]*)\n){
    my $outline = join ("", ($1, $2, $3, $4, $5, $6, $7, $8, $9));
    $outline =~ s/\s*/ /g;
    $outfile .= "$outline\n";
}

print $outfile;

However, I get an error message when I try to run the code:
Backslash found where operator expected at removeBreaks.pl line 6, near "$outline =~ s/\"
  (Might be a runaway multi-line // string starting on line 4)
        (Missing operator before \?)
syntax error at removeBreaks.pl line 6, near "$outline =~ s/\"
Substitution pattern not terminated at removeBreaks.pl line 6.

sorry, I am trying to learn Perl :)
 

Expert Comment

by:Ardemus
ID: 13576620
That's because I forgot to close the first regex.  Here's a quick rundown of what this code is trying to do:

-Always use strict and warnings.  They prevent and catch many errors that you might miss, particularly while you're learning.
-while (<>) {} is a special construct in Perl.  It reads the files named on the command line (or standard input if none are given) and hands you one line at a time in $_.
-After that loop, $file contains the entire file as one string.
-The regex is pretty simple: it matches anything that's not a semicolon, then a semicolon.  That repeats for the number of fields in question.  Actually, this is probably better:

m/((?:[^;]*;){8,8}[^\n]*)\n/

The (?:) is a non-capturing group, so it doesn't fill up the match variables.  {8,8} says to match that exactly 8 times (at least 8, but not more than 8; {8} means the same thing).  Because [^;] also matches a line break, each of those 8 fields can span several lines.  Then we finish with the final chunk before the closing line break.
Now $1 contains the entire record, so you can put it into $outline and replace all sets of whitespace with single spaces.
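As a small illustration of what that pattern captures, here is a sketch run against a single made-up nine-field record whose third field contains line breaks. `first_record` is a hypothetical helper, the sample data is invented, and the cleanup uses `\s+` so that only actual runs of whitespace are replaced.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first nine-field record in $file as one line.
sub first_record {
    my ($file) = @_;
    # Eight "field;" chunks (each may span line breaks), then the last
    # field up to the next newline.
    return unless $file =~ m/((?:[^;]*;){8}[^\n]*)\n/;
    my $outline = $1;
    $outline =~ s/\s+/ /g;   # collapse each whitespace run, newlines included
    return $outline;
}

# Invented sample: nine fields, with line breaks inside the third one
my $sample = "user;class;first part\nsecond part\n\nthird part;yes;a;b;c;d;Mon Feb 21 16:32:58 2005\n";
print first_record($sample), "\n";
```

The whole record comes back in one capture, which is why no per-field match variables are needed.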

This is the new code, with that error corrected:

#!/usr/bin/perl

my $outfile;
while ($file =~ m/((?:[^;]*;){8,8}[^\n]*)\n/){
    my $outline = $1;
    $outline =~ s/\s*/ /g;
    $outfile .= "$outline\n";
}

Now, if you get semicolons in your data too, well, you're on your own. ;)

Also, I would recommend "Learning Perl", from O'Reilly; that's how I started.  I also read "Mastering Regular Expressions", and I find the ActiveState perl documentation very useful at www.activestate.com

Ardemus
 

Expert Comment

by:Ardemus
ID: 13576738
I should mention that I haven't used matching like this in a while, and you might need to tweak the code, perhaps with a flag on that first regex (see perlre for details on the flags).
 

Author Comment

by:IUAATech
ID: 13577554
thanks for the advice!
the regex makes perfect sense. Do I need to 'chomp'  $file when I am getting it from the file before I append it to existing $file? I think $outline =~ s/\s*/ /g should be $outline =~ s/\s+/ /g

Here is my final code:
#!/usr/bin/perl

use strict;
use warnings;

my $file;
my $outfile;

while (<>) {$file .= $_}

while ($file =~ m/((?:[^;]*;){8,8}[^\n]*)\n/){
    my $outline = $1;
    $outline =~ s/\s+/ /g;
    $outfile .= "$outline\n";
}
    print $outfile;

When I do perl script.pl infile.txt > outfile.txt it goes into an infinite loop :(
 

Assisted Solution

by:Ardemus
Ardemus earned 400 total points
ID: 13577996
I don't have time to continue helping you debug today, but I'll try to put you on the right track.

To answer your question about chomp: take a moment to think about what would happen if you removed the newline from the end of each line in the original file.

Regarding the infinite loop, the m// is performing the same match over and over.  You'll have to look into perlre or do a web search to figure out how to use a match within a while statement.  There's something minor that I'm forgetting, but you should be able to figure it out with some effort.  Perhaps it needs to be m//g.  In any case, it's the while statement that's causing the loop because it's never turning false.

I'm sure that you can get this going with a little effort and some poking around, or maybe someone else can help out.

Good luck,

Nick
 
LVL 84

Accepted Solution

by:ozo
ozo earned 200 total points
ID: 13578003
while( $file =~ m/((?:[^;]*;){8,8}[^\n]*)\n/g ){
    my $outline = $1;
    $outline =~ s/\s+/ /g;
    print "$outline\n";
}
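
The /g flag is what ends the loop: in scalar context, m//g resumes from where the previous match stopped (the position reported by pos) and returns false once nothing is left, whereas without /g every test starts again at the beginning of $file and matches the first record forever. A self-contained sketch of the same loop follows; `rejoin_records` and its `$fields` parameter are hypothetical names added for illustration, and stripping the leading space is an extra step not in the loop above.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the loop above. $fields is the number of
# semicolon-separated fields per record (9 in this thread).
sub rejoin_records {
    my ($text, $fields) = @_;
    my $seps = $fields - 1;
    my @out;
    # With /g in scalar context, each test continues from pos($text).
    while ($text =~ m/((?:[^;]*;){$seps}[^\n]*)\n/g) {
        my $outline = $1;
        $outline =~ s/\s+/ /g;   # line breaks and other whitespace runs become one space
        $outline =~ s/^ //;      # a blank line before the record leaves a leading space
        push @out, $outline;
    }
    return @out;
}

# Slurp the input as the thread's script does, then print one record per line
my $file = do { local $/; <STDIN> };
$file = '' unless defined $file;
print "$_\n" for rejoin_records($file, 9);
```

It can be run the same way as the earlier scripts, for example as perl script.pl < infile.txt > outfile.txt.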
 

Author Comment

by:IUAATech
ID: 13578050
perfect! thanks.