• Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 1824

URGENT: sed or awk help

I need a simple thing done, and while I'm sure sed and/or awk would be the proper tool, I really don't know how to use them yet.

I have a DOS text file that has extraneous CR/LF sequences that I want removed. This is an application log file, and these extra CR/LF sequences break up log entries into two lines.

The current file looks like this:

00:11:22 app node activity description starts here and then<CR><LF>
              there's a static number of spaces and then the rest of the entry followed by a "normal" <CR><LF>
00:11:23 app node next log entry starts here, and I want to keep the *second* <CR><LF>
              from the previous entry <CR><LF>

...and so forth

This is complicated by lines that are NOT broken in this fashion.

So, I need a sed or awk (or if there's a better tool, I'm listening) script that will look for the <CR><LF> followed by a fixed number of spaces, and replace that sequence of bytes with just a single space. A <CR><LF> sequence followed by a number would be ignored (left intact).

I'm looking for something I can run from the command line; using redirection symbols is fine.
Asked by: PsiCop
1 Solution
 
PsiCop (Author) Commented:
The "fixed number" of spaces is 18.
 
bytta Commented:
sed and awk are both text-based, so this is probably a pain...
But vim can read files as binary...

try (reads from logfile, writes to file2):
vi -b -e -c '%s/\r\n                  / /g' -c "w! file2" -c "q" logfile
or (reads/writes logfile, can destroy it if the expression is wrong):
vi -b -e -c '%s/\r\n                  / /g' -c "wq!" logfile

-b = binary mode
-e = ex mode (so the following -c "commands" run non-interactively)

Keep track of the backslashes: -c '%s/\r\n                  / /g' is the same as -c "%s/\\r\\n                  / /g"
 
neteducation Commented:
awk '/^[0-9]/{printf($0);}
 /^       /' yourlogfile


Lines starting with numbers go through printf, and lines starting with spaces go through the (default) print action: printf does not print a newline by default, while print does.
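A quick toy illustration of that difference (not part of the solution, just a demo):

printf 'a\nb\n' | awk '{ printf("%s", $0) }'   # prints "ab", no trailing newline
printf 'a\nb\n' | awk '{ print $0 }'           # prints "a" and "b" on separate lines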
If you want to get rid of the extra spaces, possibly the easiest is:

awk '/^[0-9]/{printf($0);}
 /^       /' yourlogfile | sed 's/                  //'



 
Hanno P.S. (IT Consultant and Infrastructure Architect) Commented:
Use awk, not sed (sed is really line-based, and it's not easy to work on a different line from the one
currently being processed): you have to find the line(s) with blanks at the beginning, but it's the
line _before_ each of them that has to be changed :-(
Another way is to remove the <CR><LF> from each line that starts not with blanks but with a number.
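For illustration only, a rough awk sketch of the first approach (join each blank-led continuation line onto the line before it) might look like this -- assuming an awk that understands {18} interval expressions (e.g. gawk or nawk), and with "logfile" and "fixedfile" as placeholder names:

awk '
  NR == 1  { prev = $0; next }
  /^ {18}/ { sub(/\r$/, "", prev)      # drop the CR that ended the previous half-entry
             sub(/^ +/, " ")           # collapse the 18 leading blanks to a single space
             prev = prev $0; next }
           { print prev; prev = $0 }   # a normal entry: emit whatever we were holding
  END      { print prev }
' logfile > fixedfile

Since the input is a DOS file, each line awk sees still ends in <CR>; that is why the sketch strips it from the held line before joining.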
 
veedar Commented:
Have you tried dos2unix for this?

man dos2unix
 
Hanno P.S. (IT Consultant and Infrastructure Architect) Commented:
veedar,
dos2unix will change all occurrences of <CR><LF> to <LF>, but he wants consecutive lines to be joined.
 
PsiCop (Author) Commented:
Yeah, DOS2UNIX is not the tool.

I tried the vi command line from bytta, and it works very well (i.e. does what I want), except that it leaves 2 spaces between the portions of the log entry that were separated (and now appear on the same line after using that command). Removing the space between the / and /g reduces that to one space between the portions. Is there any way to get rid of that? It is always in the same place (column 80).
 
fl_kiwi Commented:
cat <filename> | sed 's/ <CR><LF>//' | sed 's/<CR><LF>//' > <newfilename>

Replace <filename> and <newfilename> with your own filenames.

 
neteducation Commented:
PsiCop: Have you tried the awk/sed construct I posted? That should get rid of the extra spaces too.
 
bytta Commented:
The vi instruction replaces <CR><LF><18 spaces> with <1 space>.
If there is more than 1 space left, the extra spaces were already in the log file (probably before the <CR><LF>).

So try adding an extra space before \r\n in the command.
But that won't catch lines where there is no space, like your first one:
00:11:22 app node activity description starts here and then<CR><LF>

You probably have to try some regexps:
vi -b -e -c '%s/ \?\r\n                  / /g' -c "w! file2" -c "q" logfile  # zero or one space before <CR><LF> (vim wants \? for "optional")
vi -b -e -c '%s/ *\r\n                  / /g' -c "w! file2" -c "q" logfile  # zero or all spaces before <CR><LF>
vi -b -e -c '%s/ *\r\n           */ /g' -c "w! file2" -c "q" logfile       # if more than 10 spaces at start of line, remove zero or all before <CR><LF> and all after it
 
PsiCop (Author) Commented:
Wow. OK, neteducation, I haven't tried the awk/sed construction - yet.

The crisis is past, so it may be next week before I get a chance to figure out what the best solution is, but these are all great ideas, and I want to try them all.
 
bytta Commented:
Note that ALL the awk versions fail miserably on log lines that are not split up in two...

example:
00:11:22 app node activity description starts here and then<CR><LF>
              there's a static number of spaces and then the rest of the entry followed by a "normal" <CR><LF>
00:11:23 short log line <CR><LF>
00:11:23 another short log line <CR><LF>
00:11:23 app node next log entry starts here, and I want to keep the *second* <CR><LF>
              from the previous entry <CR><LF>

Using awk/sed or any other text-based replace, you will get:
00:11:22 app node activity description starts here and then there's a static number of spaces and then the rest of the entry followed by a "normal" <CR><LF>
00:11:23 short log line 00:11:23 another short log line 00:11:23 app node next log entry starts here, and I want to keep the *second* from the previous entry <CR><LF>

 
neteducation Commented:
How about this one:

# cat testfile
00:11:22 app node activity description starts here and then
              there's a static number of spaces and then the rest of the entry followed by a "normal"
00:11:23 app node next log entry starts here, and I want to keep the *second*
              from the previous entry
00:11:24 short line
00:11:24 another short line
00:11:24 now a line that
              spans multiple lines
              and if I say multiple
              I mean multiple, not only two

awk '/^[0-9]/ && NR!=1 { print ""}; {printf($0)}; END {print}' testfile | tr -s " "

It will handle both multi-line and single-line entries, and squeezes any double spaces.
 
ahoffmann Commented:
perl -le 'undef $/;$x=<>;$x=~s/\r\n {18}/ /mg;print $x' DOS-textfile
 
neteducation Commented:
I'd suggest a point split among bytta, neteducation and ahoffmann, all of whom have presented different working solutions.
 
jmcg (Owner) Commented:
A word of explanation for my recommendation:

I agree with neteducation that only the solutions offered by bytta, neteducation and ahoffmann appear to be responsive to the question as asked.

The awk solution initially offered by neteducation suffered from the flaw that bytta pointed out and from a misuse of 'printf' - unless we assume that the data will never contain a % sign. Using raw input data as a formatting string for 'printf' is a practice to be avoided. You can achieve the desired effect by using a format string of "%s".
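For example (my illustration, not the exact command that was posted), that first awk fragment would become:

awk '/^[0-9]/ { printf("%s", $0) }
     /^       /' yourlogfile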

I can believe that bytta's vi-based solution can be made to work, but it seems like a bad direction to go given the uncertainty about how to represent multiline patterns in ex substitution commands. It wasn't working for me.

The ahoffmann 'perl' solution can clearly work, but I'd have liked to see a version that did not depend on reading the entire file into memory first (slurp mode).

The solution that was missing was the one based on 'sed'. Sed has provisions for handling patterns across multiple lines, but it's a seldom-used feature (often forgotten, it seems!) and I have to go back to the manpages every time I think I might want to use it.

The following sedscript can be placed in a file or, perhaps, embedded in a shell script

: top
N
s/\r*\n       */ /
${
 p;q
}
t top
P
D

If we named the sedscript file as, say, 'unbreak.sed', it can be run from the command line as:

sed -n -f unbreak.sed $logfile >$newlogfile

======

So what does that sedscript do?

:top -- is a label
N -- appends the next input line to the "pattern space"
s/\r*\n       */ / -- join the two lines together if the pattern's right
${ -- if we're on the last line
 p;q -- just print what we have and quit (we're done)
}
t top -- conditional branch back to label top if above substitution succeeded
P -- print the first line in the "pattern space"
D -- delete the first line in the "pattern space", implicit branch to beginning of script
 
ahoffmann Commented:
> ..  but I'd have liked to see a version that did not depend on reading the entire file ..
Well, my suggestion can be rewritten to read from STDIN; then only the text up to a match resides in memory, which finally results in the most (memory-)economical solution :)
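Something along these lines, for example -- a sketch only, not ahoffmann's actual rewrite, with "logfile" and "newlogfile" as placeholders; it holds just one pending entry in memory:

perl -e '
  $prev = <>;                       # hold one pending log entry
  while ($line = <>) {
    if ($line =~ s/^ {18}/ /) {     # continuation line: 18 leading blanks become one space
      $prev =~ s/\r?\n$//;          # drop the CR/LF that ended the previous half-entry
      $prev .= $line;
    } else {
      print $prev;                  # complete entry: emit it and start holding the next
      $prev = $line;
    }
  }
  print $prev;
' logfile > newlogfile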

> .. missing was the one based on 'sed'.
keep in mind that sed is a pain for big/huge files. It's dreadfully slow, and some versions (for example Solaris) may crash.
Even though I love to use sed and awk for simple things, I've learned to switch to perl when patterns get complicated or you have to deal with huge data.
Nevertheless, the question is talking about a DOS file, and sed is always sufficient for 640kb ;-)
 
ahoffmann Commented:
jmcg, regarding your nice sed solution:
  keep in mind that only GNU sed is able to deal with \n and \r in this way; traditional sed needs them verbatim :-(
That's another reason why I posted a perl solution.
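For what it's worth, one possible workaround (a sketch only, untested against a genuinely traditional sed) is to let the shell splice a literal carriage-return character into the script instead of relying on \r:

cr=$(printf '\r')        # one literal carriage-return character
sed -n ": top
N
s/${cr}*\n       */ /
\${
 p;q
}
t top
P
D" logfile > newlogfile

This only works around the \r escape; a truly old sed may still want the embedded newline handled differently as well.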
 
jmcg (Owner) Commented:
ahoffmann -

Good points. I don't have access, currently, to a 'sed' that is quite that "traditional". According to my circa 1984 ULTRIX documentation, the \n should have worked but I see nothing to give me the idea that \r would have been interpreted correctly. I've never encountered the problem of 'sed' having trouble with large files. I wonder why that would be.
 
ahoffmann Commented:
> ..  of 'sed' having trouble with large files. I wonder why that would be.
Encountered that starting somewhere around Solaris 2.1, and AFAIK they never fixed it (at least my Solaris 2.5 suffers with files roughly >>100MB)-:
 
neteducation Commented:
Well, you have to admit that Solaris 2.5 is kind of outdated by now...

So far I've never had problems with sed... but the biggest files I handled with it were around 150MB, because I agree it's not the fastest thing on earth.