Bidan Zhu asked:

Complicated pattern-matching in Perl

Hello,

I am currently upgrading a huge ColdFusion5 project (over 1000 ColdFusion pages) to ColdFusion8, and have the following problem:
In ColdFusion5, you can specify a query like:

<CFQUERY Name="query1" DataSource="ds">
    select *
      from test
     where name="something"
       and type="test"
</CFQUERY>

Double-quoted SQL strings are allowed in CF5.

But in ColdFusion8, only single quoted SQL strings are accepted, so the above must be changed to:

<CFQUERY Name="query1" DataSource="ds">
    select *
      from test
     where name='something'
       and type='test'
</CFQUERY>

I want to use a perl script to run through all the files to find out where double-quoted strings are used in <cfquery> tag, but this is trickier than I thought because there might be other tags nested in <cfquery>...</cfquery> such as <cfif>, which might legitimately have double-quoted strings as attributes, such as:

<CFQUERY Name="query1" DataSource="ds">
    select *
      from test
     where name='something'
     <cfif typeIsSet = "true">
       and type='test'
     </cfif>
</CFQUERY>

Cases like this should NOT strike an alarm.

So, in English, I want to find out if there are double-quoted strings used within <cfquery> tags, which are NOT attributes of another tag.

Could some Perl Regexp guru give me a hand here?
rjmedina

Would you say that it is true that when an item has double quotes it's within <...>?  If so you could do a layered pattern match.  First search for lines where the quotes are not contained within <...> and then change them.  
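A minimal sketch of the layered idea, assuming everything between "<" and ">" really is a tag: first blank out the tags, then look for double quotes in what is left. The helper name quotes_after_stripping_tags and the sample query are hypothetical, and a bare comparison such as "a<b ... c>d" in the SQL would be mistaken for a tag.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the double-quoted strings that remain
# once every <...> span has been removed from the query text.
sub quotes_after_stripping_tags {
    my ($query) = @_;
    (my $stripped = $query) =~ s/<[^>]*>//g;     # layer 1: drop <...> spans
    my @found = $stripped =~ /("[^"]*")/g;       # layer 2: collect quoted strings
    return @found;
}

my $layered_sample = <<'END';
<CFQUERY Name="query1" DataSource="ds">
    select *
      from test
     where name='something'
     <cfif typeIsSet = "true">
       and type="test"
     </cfif>
</CFQUERY>
END

print "$_\n" for quotes_after_stripping_tags($layered_sample);
```

Only the quoted string in the SQL text is reported here; the attribute values disappear together with their tags.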
Bidan Zhu (ASKER)

Thanks for the reply, but I'm not sure I understand what you mean...

I'm experimenting right now with look-ahead patterns, something along the lines of:

        if ($queryContent =~ / ("[^"]*?")  (?= [^>]<) /gsx) {
            print "\n###########\nDouble quoted string: $1\n";
            print "CFQuery:\n$queryContent\n############\n";
        }
where $queryContent contains <cfquery...>...</cfquery>

Basically I'm saying: Find a double quoted string within the <cfquery>...</cfquery>tag, where afterwards a "<" must come before a ">". This way, the double quoted string cannot be within <...>.

This works with my simple test file, but not yet with the real data, trying to figure out why...

And this approach doesn't deal with the rare cases like:

<CFQUERY Name="query1" DataSource="ds">
    select *
      from test
     where name=something
         and a<b
         and type="test"
         and c>d
</CFQUERY>


Oops, just discovered that my pattern is totally off, what I meant was something like:
/ ("[^"]*?")  (?= [^>]*<) /gsx
But now it matches BETWEEN two strings, i.e. for

<CFQUERY Name="query1" DataSource="ds">
    select *
      from test
     where name=something
     <cfif
       typeIsSet = "true">
       and type="test"
     </cfif>
</CFQUERY>

it matches "> and type="

:-/
Adam314

Are you only interested in a regex solution?  

I'm not familiar with ColdFusion, but it looks like it is an XML file.  If it is, you could use an XML parser to check for this.
ozo
$/='</CFQUERY>';
while( <> ){
    s#(<CFQUERY\b[^>]*>)(.*".*)#$1.join"'",split/"/,$1#es;
    print;
}
#sorry, that should have been
$/='</CFQUERY>';
while( <> ){
    s#(<CFQUERY\b[^>]*>)(.*".*)#$1.join"'",split/"/,$2#es;
    print;
}
So you've already split your code up into CFQUERY snippets which you're assigning to $queryContent and then you're processing each one, which is what I was referring to as a "layered" pattern match.  So you're already doing that, moving on...
I've been distracted (actually working) so I'm still thinking about how to deal with your issue.  The best I've come up with is to exclude the special situation with an if statement that checks for the "rare" situation and, if it exists, uses a different pattern match to deal with it.
Below is what I have so far, I'll add more in a bit.

$cnt = () = $queryContent =~ /</g; # count every < in the snippet (grep on a scalar only returns 0 or 1)
if ($cnt > 2) { # two < to account for the open and close of the tag set
	print "there is more than one tag set: $queryContent\n";
	# need different pattern match to use here
} else {
	# do your normal pattern match
}


Thank you all for the replies! I'm at home now and have no way to test out the patterns, so I'll have to wait till Monday to verify the different ideas.

Basically, I believe it should be possible to do what I want with some lookahead patterns, I've got as far as:
m/ ("[^"]*?")  (?! [^>]*<) /gsx

The problem I have with that is, multiple double-quoted strings would mess things up, because for

<CFQUERY Name="query1" DataSource="ds">
    select *
      from test
     where name=something
     <cfif  typeIsSet = "true">
       and type="test"
     </cfif>
</CFQUERY>

it matches "> and type="

Now it would work if I knew how to tell Perl to only match "paired" double-quotes and not mix them up! Guess I'm gonna have a look at Regexp::Common and see if anything there can help...
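One way to keep the quotes paired without look-ahead, sketched here under the assumption that no tag contains a ">" inside an attribute value: let a single pattern consume either a whole tag or a whole quoted string, scanning left to right. Because each tag is swallowed in one piece, its attribute quotes can never pair up with quotes in the SQL. The helper name quoted_outside_tags is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: walks the text token by token. The first
# alternative swallows a complete tag; the second captures a complete
# double-quoted string that was not inside a tag.
sub quoted_outside_tags {
    my ($query) = @_;
    my @hits;
    while ($query =~ /<[^>]*>|("[^"]*")/g) {
        push @hits, $1 if defined $1;   # $1 is undef when a tag matched
    }
    return @hits;
}

my $paired_sample = <<'END';
<CFQUERY Name="query1" DataSource="ds">
    select *
      from test
     where name=something
     <cfif  typeIsSet = "true">
       and type="test"
     </cfif>
</CFQUERY>
END

print "$_\n" for quoted_outside_tags($paired_sample);
```

A tag split over two lines is handled the same way, since [^>] also matches newlines.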
$/='</CFQUERY>';
while( <> ){
    s#(<CFQUERY\b[^>]*>)(.*".*)#
    (my $q=$2)=~s/(<[^?>]*>)|"/$1||"'"/eg;
    $1.$q
    #es;
    print;
}

But that may not handle cases like
        <IMG SRC = "foo.gif" ALT = "A > B">

           <IMG SRC = "foo.gif"
                ALT = "A > B">

           <!-- <A comment> -->

           <script>if (a<b && a>c)</script>

           <# Just data #>

           <![INCLUDE CDATA [ >>>>>>>>>>>> ]]>

       If HTML comments include other tags, those solutions would also break
       on text like this:

           <!-- This section commented out.
               <B>You can't see me!</B>
           -->

Hello ozo,

your code seems to work for the most common cases, but strangely, it sometimes replaces the opening <cfquery> with a closing </cfquery>.

I am not savvy enough in regexp to understand completely what you are doing here, could you please explain a little bit?

I see you are splitting on the closing </cfquery>, and using nested pattern to capture the content in the <cfquery>, but the part:
s/(<[^?>]*>)|"/$1||"'"/eg;

I don't really understand.

The ColdFusion code I'm upgrading is VERY old, so there might be a lot of special cases, and I don't think I'd "dare" to just run a replace script over it.
Much more likely, I would print out all the relevant CFQUERYs with double-quote occurrences that need to be replaced, and then go through them manually.
Now I'm at the stage where I have slurped in the file and got the CFQUERYs in an array, could you help me continue from there?

Many thanks!
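For the report-only route, a sketch along these lines might be a starting point. It assumes each array element holds one complete <cfquery>...</cfquery> block; report_block and @queries are hypothetical names, not anything from the scripts in this thread. It lists every double-quoted string found outside a tag, with its line number inside the block.

```perl
use strict;
use warnings;

# Hypothetical helper: for one CFQUERY block, return "line N: "string""
# entries for double-quoted strings that are not inside a <...> tag.
sub report_block {
    my ($block) = @_;
    my @report;
    while ($block =~ /<[^>]*>|("[^"]*")/g) {
        next unless defined $1;                  # a tag matched, skip it
        my $prefix = substr($block, 0, $-[1]);   # text before the string
        my $line   = 1 + ($prefix =~ tr/\n//);   # count newlines before it
        push @report, "line $line: $1";
    }
    return @report;
}

# Stand-in for the array of slurped CFQUERY blocks.
my @queries = (<<'END');
<CFQUERY Name="query1" DataSource="ds">
    select *
      from test
     where name="something"
     <cfif typeIsSet = "true">
       and type="ghost"
     </cfif>
</CFQUERY>
END

for my $block (@queries) {
    my @report = report_block($block) or next;   # skip clean blocks
    print "###########\n$block", map({ "  $_\n" } @report), "###########\n";
}
```

Nothing is modified; blocks without a hit are skipped, so the printout only contains the queries that need a manual look.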
Can you show some examples of when it replaces the opening <cfquery> with a closing </cfquery>?
For example, for the following code:

<CFQUERY Name="query1" DataSource="ds">
    select *
      from test
     where name="something"
     <cfif typeIsSet = "true">
       and type="ghost"
     </cfif>
</CFQUERY>

<CFQUERY Name="query2" DataSource="ds">
    select *
      from test
     where name=something
     <cfif
       typeIsSet = "true">
       and type="test"
     </cfif>
</CFQUERY>

The opening <CFQUERY>s are changed into </CFQUERY>.
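A plausible explanation, offered as an assumption rather than a confirmed diagnosis: the inner s///eg has its own capture group, so by the time the replacement evaluates $1.$q, $1 holds the last tag the inner substitution matched, which is the closing </CFQUERY> at the end of the record. The sketch below copies both captures into lexical variables before the inner substitution runs. The wrapper requote and the variables $open and $body are hypothetical names; the two patterns are kept as posted.

```perl
use strict;
use warnings;

# Hypothetical wrapper so the substitution can be tried on a plain string.
# $1 and $2 are copied first, because the inner s/// overwrites $1.
sub requote {
    my ($record) = @_;
    $record =~ s#(<CFQUERY\b[^>]*>)(.*".*)#
        my ($open, $body) = ($1, $2);
        $body =~ s/(<[^?>]*>)|"/$1||"'"/eg;
        $open . $body
    #es;
    return $record;
}

my $requote_sample = <<'END';
<CFQUERY Name="query1" DataSource="ds">
    select *
      from test
     where name="something"
     <cfif typeIsSet = "true">
       and type="ghost"
     </cfif>
</CFQUERY>
END

print requote($requote_sample);
```

In the inner pattern, the first alternative matches a whole tag and puts it back unchanged; the second matches a lone double quote, for which $1 is undefined, so the single quote is used instead.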

In the meantime, I've come up with something half-way workable but pretty ugly -
I do it in multiple passes:

First I substitute the double-quotes with {{ and }}. This breaks in several cases where there are standalone double quotes (not pairs), but for 95% of the pages it seems to work alright.
 ($queryContent = $origQuery) =~ s/"([^"]*?)"/{{$1}}/gsx;

Then I filter out the ones that look like they are attributes of a tag, convert them back to double quotes.
$queryContent =~ s/{{ ([^{}<>]*?) }} (?=[^<]*?>) /"$1"/gsx;

For the remaining {{ }},  I print them out as possible candidates for double-quote replacing.

Then I work through the print out and identify the cases which it didn't seem to work, and make an exclusion list.
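Put together, the passes described above might look like the following sketch. The helper name candidate_strings is hypothetical, and the braces are backslash-escaped in the patterns because newer Perl versions are stricter about unescaped "{" in a regex. The attribute test is the same heuristic as above (a ">" follows before any "<"), so a string followed by a bare comparison such as "c>d" would be wrongly treated as an attribute.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the contents of the double-quoted strings
# that do not look like tag attributes.
sub candidate_strings {
    my ($orig) = @_;

    # Pass 1: mark every paired double-quoted string with {{ }}.
    (my $marked = $orig) =~ s/"([^"]*?)"/{{$1}}/gs;

    # Pass 2: restore the marks that are followed by ">" before any "<",
    # i.e. the ones that look like attributes of a tag.
    $marked =~ s/\{\{([^{}<>]*?)\}\}(?=[^<]*?>)/"$1"/gs;

    # Pass 3: whatever is still marked is a candidate for replacement.
    my @candidates = $marked =~ /\{\{(.*?)\}\}/gs;
    return @candidates;
}

my $multipass_sample = <<'END';
<CFQUERY Name="query1" DataSource="ds">
    select *
      from test
     where name="something"
     <cfif typeIsSet = "true">
       and type="ghost"
     </cfif>
</CFQUERY>
END

print "candidate: $_\n" for candidate_strings($multipass_sample);
```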

I still need to write the script to actually change the pages that are "safe" to change, but right now I'm distracted by other work.

Any suggestions are still welcome and appreciated!
ASKER CERTIFIED SOLUTION
ozo