Solved

remove msdos newline in html

Posted on 2004-10-19
Last Modified: 2012-05-05
How do I remove ^M in html source?
Question by:minsanco
21 Comments
 
LVL 36

Expert Comment

by:Zyloch
ID: 12354091
Hi minsanco,

What do you mean by ^M in HTML source? Do you mean something like this?

$source=~s/\^M/g;

Regards,
Zyloch
0
 
LVL 48

Expert Comment

by:Tintin
ID: 12354156
Very vague question.

Assuming you are uploading HTML files from a PC to a Unix webserver, make sure you FTP in ASCII mode.
0
 
LVL 84

Expert Comment

by:ozo
ID: 12354167
I think minsanco means s/\cM//g
0
 
LVL 36

Expert Comment

by:Zyloch
ID: 12354198
O.o, I see. (I just noticed that I forgot a / in my post above, lol; it should be s/\^M//g, but that shouldn't matter too much now.)
0
 
LVL 2

Expert Comment

by:ChrisDrake
ID: 12355724
Nah - none of that works - to remove, you've got to do this:-

$html=~s/[\015\012\r\n]+//isgm;

Or if you want to just replace those MSDOS things with unix ones, do this instead:-

$html=~s/[\015\012\r\n]+/\r/isgm;

Note that perl regexps are optimised to work on only one "line" so the "isgm" bit on the end is essential.

MSDOS and UNIX versions of perl differ, so plain old \r\n doesn't always work, hence including the octal representation is also essential.

If you're on unix, you can also use the "vi" editor - load your file, and type the following:-

9999JZZ

and it'll "join up" all the lines.
0
 
LVL 13

Accepted Solution

by:
gripe earned 20 total points
ID: 12357771
A UNIX newline will be appropriately represented by "\n" if you run the perl command on the UNIX system. The first regex above will remove all line breaks from your input and the second will replace any line breaks with ASCII character 15 (The very character he's trying to get rid of) Additionally, the /ism regex options are not needed. You're not matching any characters with '.' (which is what /s affects) and you're not using '^' or '$' (which is what /m affects). The string is already completely contained within the variable $html and the character class will match anything regardless of new lines.

The regex to remove the extraneous characters from an MS-DOS file is simply:

$html =~ s/[\r]//g; # tested

This will remove every ASCII character 13 (\r or \015 or 'carriage return') from the file. Since a newline in MS-DOS is \r\n (or \015\012), this will leave \n (or \012), which is the correct newline for UNIX.
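As a quick check of the substitution above, here is a minimal sketch. The sample string and the variable names $before, $after and $clean are made up for illustration, and it assumes an ASCII platform where \r is \015.

```perl
use strict;
use warnings;

# Hypothetical sample: two lines with MS-DOS (CRLF) line endings.
my $html = "<html>\015\012<body>hello</body>\015\012";

# Count the carriage returns before removing them.
my $before = () = $html =~ /\015/g;

# Remove every carriage return (ASCII 13); the line feeds stay.
(my $clean = $html) =~ s/\r//g;

# Count again on the cleaned copy.
my $after = () = $clean =~ /\015/g;

print "CRs before: $before, after: $after\n";
```

The copy into $clean is only there so the original and the result can be compared; in real use you would run the substitution directly on $html.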

The real question here is 'why?' Usually these characters are the result of FTP transferring a file in binary mode rather than ASCII mode. If this is the case, then specifying ASCII mode for your uploads will do this automatically for you (the UNIX FTP daemon will strip the unneeded characters).

Additionally, it's possible to do this 'in place' with perl or with other utilities (like some versions of tr):

perl -pi.bak -e's/[\r]//g' file.you.want.to.change
tr -d
0
 
LVL 13

Expert Comment

by:gripe
ID: 12357788
Oops, accidentally hit the submit button while I was trying to find the syntax for in place edit for tr. I think this was a brain fart as I can't find any in place versions. You can do it with IO redirection though:

tr -d '\015' < old.file > new.file
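If tr is not available, roughly the same old-file-to-new-file copy can be sketched in Perl with its own tr operator. The sub name strip_cr and the script name in the usage comment are hypothetical, not anything from this thread.

```perl
use strict;
use warnings;

# Copy $in to $out, deleting every carriage return (octal 015).
# Returns the number of carriage returns removed.
sub strip_cr {
    my ($in, $out) = @_;
    open my $src, '<:raw', $in  or die "Cannot read $in: $!";
    open my $dst, '>:raw', $out or die "Cannot write $out: $!";
    my $removed = 0;
    while (my $line = <$src>) {
        # tr with /d deletes the listed characters and returns how many it found.
        $removed += ($line =~ tr/\015//d);
        print {$dst} $line;
    }
    close $src;
    close $dst or die "Cannot close $out: $!";
    return $removed;
}

# Usage (hypothetical script name): perl strip_cr.pl old.file new.file
if (@ARGV == 2) {
    my $n = strip_cr(@ARGV);
    print "Removed $n carriage return(s)\n";
}
```

The :raw layer is used on both handles so that Perl does no newline translation of its own while copying.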

Sorry for the half post.
0
 
LVL 13

Expert Comment

by:gripe
ID: 12357811
Correction to original post:

The first regex above will remove all line breaks from your input and the second will replace any line breaks with ASCII character 15

Should read:

The first regex above will remove all line breaks from your input and the second will replace any line breaks with ASCII character 13

All this octal/decimal conversion's got me wonky.

0
 
LVL 13

Expert Comment

by:gripe
ID: 12358453
Ugh, I'm having my Monday late this week. The regex is:

$html =~ s/\r//g; # HTML is in variable $html

or

perl -pi.bak -e's/\r//' file.you.want.to.change # in place

No need for the character class in either example and no need for the /g option in the second.

Apologies for the message barf.
0
 
LVL 8

Expert Comment

by:davorg
ID: 12358650
ChrisDrake,

Your solution seems to be rather full of cargo-cult code. All you need is:

$html =~ s/(\015\012?)|\012/\n/g;

This will turn all newlines and carriage returns in $html to whatever is the correct newline character on your current system.

There is no need to include the \n and \r equivalents if you are using the octal character equivalents.

You also have far too many options on your substitution operator.

The /i option turns off case sensitivity and as there are no letters in your regex, it is useless.
The /s option alters the meaning of '.' in a regex and as you don't have any, it is useless.
The /m option alters the meaning of '^' and '$' in a regex and as you don't have any, it is useless.

Please read the section on newline characters in "perldoc perlport" for a more detailed explanation.
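To illustrate the normalising substitution above, here is a small sketch run against a string with mixed line endings. The sample data and the names $normalized and @lines are assumptions for the demo; the pattern is the same one without the capturing parentheses, which are not needed since nothing uses the capture.

```perl
use strict;
use warnings;

# Hypothetical input mixing DOS (CRLF), bare CR, and Unix (LF) line endings.
my $html = "one\015\012two\015three\012";

# A CR optionally followed by an LF, or a bare LF, becomes the local "\n".
# No /i, /s or /m is needed: the pattern has no letters, no '.', no '^' or '$'.
(my $normalized = $html) =~ s/\015\012?|\012/\n/g;

# Every line now ends the same way, so a plain split on "\n" is enough.
my @lines = split /\n/, $normalized;
print scalar(@lines), " lines: @lines\n";
```

Note that this converts every line break, so unlike a plain s/\r//g it also turns a lone carriage return into a newline instead of deleting it.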

hth,

Dave...
0
 
LVL 2

Expert Comment

by:ITG-SSNA
ID: 12361476
It's amazing how long a thread the Microsoft newline convention can cause.

Try editing in an ASCII-compliant editor; Notepad, for example. If you already are, then don't FTP in binary mode.
0
 
LVL 2

Expert Comment

by:ChrisDrake
ID: 12367411
Scoff all you like at the "cargo-cult" code - but if you write cross-systems perl code every day, you quickly learn that not every perl interpreter acts the same, and you can guarantee that once he has this newline code in use, he'll add alphabetics and / or "." into it one day - so while other people's suggestions might work in some cases, my solution always works in every case - catering for the future as well...

It takes a decade of programming to understand why all the above is important - if you can't figure it out now, whack a reminder into your personal organiser for the year 2014 and come back & re-read this - it will all make sense :-)
0
 
LVL 8

Expert Comment

by:davorg
ID: 12367506
I already have well over a decade of experience working with problems like this on multiple platforms and, of course, by 2014 we'll all be using Perl 6 rules instead of regular expressions, so it will all need to be rewritten anyway (unless you use the 'perl5' modifier).

And the differences in the ways that Perl compilers work on different systems are all explained clearly in the "perldoc perlport" manual page that I pointed out earlier. If you understand what that document says, then you won't need to add random punctuation to your regex in the hope that it starts working.
0
 
LVL 13

Expert Comment

by:gripe
ID: 12372793
> It takes a decade of programming to understand why all the above is important - if you can't figure it out now, whack a
> reminder into your personal organiser for the year 2014 and come back & re-read this - it will all make sense :-)

For someone with a decade of programming experience, I'm very surprised that you have yet to learn how to test your code. None of your examples actually does what the OP wanted, which is remove (only) carriage return (^M / \015 / \r) characters from an HTML file.

With regards to the (possibly) different representations of \n (no, not \r) on different platforms, this is clearly documented and explained in detail in perlport under the section 'Newlines'.

Rather than alluding to some mysterious (and non-existent) mystical reason behind adding characters and constructs to your code that have no significant purpose (and, in the case of your second regex, DO THE COMPLETE OPPOSITE OF WHAT THE OP ASKED FOR), why don't you enlighten us by referencing documentation and/or authoritative information that illustrates your point.

In the meantime, you should take some of the ample time between now and 2014 and read the perl documentation, specifically perldoc perlport (as was already mentioned) and perldoc perlre.

0
 
LVL 2

Expert Comment

by:ITG-SSNA
ID: 12376969
Jcmg I agree,

I've gotten this quite a bit on other threads too. You know we don't have the magic bullet answer in every case. But, to be shot at for even trying is pretty sad. What's worse, I pay every month to the site for it!
0
 
LVL 5

Expert Comment

by:ITcrow
ID: 12420192
In Perl:
$data =~ s/\015$//;  # Replace Control-M with nothing.
$data =~ s/\032//;    # Replace Control-Z with nothing.

In vi / vim :
- In command mode: ( Esc then Shift : )
:1,$ s/^M//g

What you should remember is that ^M in 'vi' is not typed as it is.
It is produced by Ctrl-V Ctrl-M
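The two Perl substitutions above are meant to be applied one line at a time. A minimal sketch of that, assuming a hypothetical helper called clean_line and made-up sample lines:

```perl
use strict;
use warnings;

# Strip a trailing Control-M and any Control-Z from one line of text.
sub clean_line {
    my ($data) = @_;
    $data =~ s/\015$//;   # '$' also matches just before a final "\n"
    $data =~ s/\032//g;   # /g is added here (an assumption) to catch every Control-Z
    return $data;
}

# Hypothetical sample lines as they might be read from a DOS file,
# including a DOS end-of-file marker (Control-Z) on its own.
my @dos  = ("first\015\012", "last\015\012", "\032");
my @unix = map { clean_line($_) } @dos;

print join('', @unix);
```

Because the first pattern is anchored with '$', it only removes a carriage return at the end of a line; a stray one in the middle of a line would need the unanchored s/\015//g form.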
0
 
LVL 13

Expert Comment

by:gripe
ID: 12672099
While I'm not overly interested in a debate, it's my belief that my answer was the first correct answer to this question as the OP posted it. I would be somewhat miffed if the points were awarded to someone else, happy if the points were awarded to me, and not overly disappointed if the points were refunded. The OP offered no clarification to contradict the validity of my answer, and the answers aside from what I offered did not test correctly.
0
