RobbieSnr

Regex Problem Using Perl One-liner

I am trying to help a friend sort out a problem with some web sites that he administers. These are Joomla sites, and he has found that the files for each of the sites have been hacked and iFrames have been injected. He wants to remove these dubious iFrames from all the sites and asked if I could provide some code to do this which he could run as a command over SSH.

A search of the web showed that one way of removing these was to use grep and sed, but my knowledge of bash is limited and I'm much more knowledgeable with Perl, so I looked for a Perl solution. The offending code is of the form

<!-- . --><iframe width="1px" height="1px" src= "http://abcd/fghi/appropriate/promise-ourselves.php" style="display:block;" ></iframe>
<!-- . -->
I thought that there might be some other legitimate iFrames on the site, so I set about producing some code that would remove the <!-- . --> tags and the iFrames within them, and produced the following one-line Perl code:

perl -pi.bak -e '$pattern="<!-- . -->"; s/$pattern.+?$pattern//gs' `find . -name "*" -type f`
The first problem I came across was that this didn't find the offending code, although I had included the 's' option to treat everything as a single line. The regex worked perfectly well in an ordinary Perl script, but with the one-liner the problem seemed to be that the file I had been given to test had CRLF line endings. I used another one-liner to change all line endings to LF and tried my code again.

The code now seemed to work all right in removing the offending code. However when I tried it on a file that had two lots of that bad code it didn’t remove both bits despite the ‘g’ option in the regex, and in fact it didn’t remove the first bit properly.

Can someone explain why this code didn't work with the CRLF line endings, and also why the global version didn't remove all the bad code? Can I make any change to the one-liner so that it works properly?
wilcoxon

perl -p explicitly loops over each line in the input file.  You need to slurp the file in and manipulate it as a file.  Something like:
perl -i.bak -e '$/ = undef; $f = <>; $pattern = "<!-- . -->"; $f =~ s{$pattern.+?$pattern}{}gs; print $f' `find . -name "*" -type f`


RobbieSnr

ASKER

Very quick and simple solution to my problem.
Many thanks for this quick reply. I used -MO=Deparse to find out what was happening, and I should have realised that the <> was just bringing in one line at a time and checking only that.

My friend will be very happy when I give him the revised code - he has to remove the offending code from 29 different websites.
ASKER CERTIFIED SOLUTION
ozo

This solution is only available to members.
I had seen this -0777 parameter used but didn't register just what it did. I see from a Google search that it slurps the whole file into the $_ variable so this is another way to produce the result I wanted, thanks.
Oh dear, I've been too hasty, neither solution is working as I want it to.

I produced two files, index.html and index2.html. The first one contained that piece of offending iFrame and then I duplicated it, to check that both iFrames would be removed, and this had CRLF endings. The second was identical but with LF endings. I'm attaching both files. I also duplicated the files in subfolders, just to check that the one-liner was working recursively correctly.

I first of all tried to run the code in the main folder by replacing the backticked 'find' with * so as not to change the subfolders. The version from ozo did remove the iFrames from both files, with the expected warning about not being able to do anything with the folders in that main folder. However, all the line endings were removed, which I didn't want to happen. I then tried the version with the 'find' and it did work recursively all right, but with the same problem with the line endings.

I next tried wilcoxon's version, again replacing the backticked 'find' with a *. This removed the iFrames from the index.html file but left the other unchanged, and there was no warning about the folders. As with ozo's version, all the line endings had been removed. I then tried the version with the 'find' but it did nothing at all.

Help!
index.html
index2.html
The only line endings that are removed are the ones between the <!-- . -->  <!-- . -->
along with everything else removed between the  <!-- . -->  <!-- . -->
If, instead of removing everything, you want to replace it with \n or \r\n,
you can do
s/$pattern.+?$pattern/\r\n/gs
Without the -p there is no loop around the code, so it only slurps one file
Good point.  I forgot about that part.  Hopefully, this version will fix that.
perl -i.bak -e '$/ = undef; while (@ARGV) { $f = <>; $pattern = "<!-- . -->"; $f =~ s{$pattern.+?$pattern}{}gs; print $f; shift @ARGV; }' `find . -name "*" -type f`



Otherwise, it gets longer...
perl -i.bak -e '$/ = undef; foreach my $n (@ARGV) { open(IN,$n) or die $!; $f = <IN>; close IN; $pattern = "<!-- . -->"; $f =~ s{$pattern.+?$pattern}{}gs; print $f' `find . -name "*" -type f`


If the -0777 works for you, it will produce shorter code than my solution (but both should work).  The only problem I can think of with it is if any of the files are any form of unicode (0777 is a valid unicode character so could malfunction).
Hi ozo, I don't know what I was thinking about when I said the line endings were all removed; of course, as you say, the only ones in the examples were within the code I wanted to remove. Your code works as it should, sorry about doubting that!

You mentioned that there was no looping in wilcoxon's original code. I did try just adding the 'p' but that didn't work.

Wilcoxon, thanks for the other two suggestions. I've tried both - the first one for some reason or other just adjusts the second file, index2.html, in all the folders but leaves the first ones untouched. Your second one failed to compile because of a missing bracket, and I couldn't work out where it should be placed.
The bracket in the second version should go at the end of the perl code section:
perl -i.bak -e '$/ = undef; foreach my $n (@ARGV) { open(IN,$n) or die $!; $f = <IN>; close IN; $pattern = "<!-- . -->"; $f =~ s{$pattern.+?$pattern}{}gs; print $f }' `find . -name "*" -type f`


-p adds
  while( <> ){
     ...  
  }continue{ print or die "-p destination: $!\n"; }
around your program
so if you use it,  the $f = <>; and print $f are uncalled-for
You should either include the loop in your own code, or use the <> and print from the -p loop
SOLUTION
This solution is only available to members.
Actually, if you want to use a 777 as a Unicode line separator you'd have to specify it as -0x1FF
Is that a recent change?  I know you used to be able to specify octal 777 (and hex 0x1FF will have the same potential issue - it's a valid unicode character (though likely uncommon depending on which unicode encoding the file uses)).
It looks like it's been there since 2003
http://www.nntp.perl.org/group/perl.perl5.changes/2003/04/msg7155.html
If you want to slurp the whole file, you can use -0777, same as pre-Unicode
If you want to separate lines with ǿ (latin small letter o with stroke and acute), you can use -0x1FF
You should also use
perl -Mopen=:utf8
for Unicode files
Many thanks to both of you for your help in finding a solution to my problem, and for the interesting information about how these one-liners work.

I was too hasty in taking wilcoxon's initial solution without checking that it provided the solution I required, and then ozo's one worked (despite my thinking initially that there was a problem with missing line endings!). Ozo's one was more concise and worked perfectly with the websites that had to be amended, so I have split the points, giving ozo 100 more than wilcoxon for this reason.